# Assignments: Cleaning Data

## 1. Data in Python Request

* Read in data from the Excel spreadsheet (Alarm Survey Data.xlsx) and put into a Pandas DataFrame
* Check the data type of each column
* Convert object columns into numeric or datetime columns, as needed

In [1]:
import pandas as pd 
df=pd.read_excel(r"C:\Users\lzhen\Downloads\Data+Science+in+Python+-+Data+Prep+&+EDA\Data\Alarm Survey Data.xlsx")

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   survey_id           6433 non-null   int64  
 1   age                 6433 non-null   int64  
 2   number_of_children  6433 non-null   float64
 3   activity_level      6433 non-null   object 
 4   sleep_quality       6082 non-null   float64
 5   number_of_snoozes   6433 non-null   int64  
 6   alarm_rating        6433 non-null   object 
dtypes: float64(2), int64(3), object(2)
memory usage: 351.9+ KB


In [None]:
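# Cast number_of_children from float to int
df['number_of_children'] = df['number_of_children'].astype('int')

# Strip the "stars" suffix from alarm_rating, then convert the remaining text to numbers
df['alarm_rating'] = pd.to_numeric(df['alarm_rating'].str.replace('stars', ''))
df['alarm_rating']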

0       5
1       3
2       1
3       4
4       3
       ..
6428    5
6429    4
6430    3
6431    3
6432    1
Name: alarm_rating, Length: 6433, dtype: int64
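
The `str.replace('stars', '')` call above relies on every rating ending in the exact word "stars". As a more defensive sketch, the digits can be pulled out with a regular expression so that variants such as "1 star" also convert. The sample values below are made up for illustration and are not taken from the survey file.

```python
import pandas as pd

# Hypothetical sample values; the real survey column may look different
ratings = pd.Series(["5 stars", "3 stars", "1 star", "4"])

# Extract the first run of digits, then convert the text to numbers
rating_num = pd.to_numeric(ratings.str.extract(r"(\d+)", expand=False))

print(rating_num.tolist())
```

With `expand=False`, `str.extract` returns a Series rather than a one-column DataFrame, which keeps the result easy to assign back to a column.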

## 2. Missing Data Check

* Find any missing data
* Deal with the missing data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   survey_id           6433 non-null   int64  
 1   age                 6433 non-null   int64  
 2   number_of_children  6433 non-null   int64  
 3   activity_level      6433 non-null   object 
 4   sleep_quality       6082 non-null   float64
 5   number_of_snoozes   6433 non-null   int64  
 6   alarm_rating        6433 non-null   int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 351.9+ KB


In [None]:
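# Select the rows that contain at least one NaN (not displayed, since a cell only shows its last expression)
df[df.isna().any(axis=1)]
# Count each sleep_quality value, including NaN
df.sleep_quality.value_counts(dropna=False)
# df_clean = df[~df.isna().any(axis=1)]  # alternative: drop every row that contains a NaN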

sleep_quality
5.0    2721
4.0    2261
3.0     997
NaN     351
1.0     103
Name: count, dtype: int64

In [6]:
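# Fill the missing sleep_quality values with 2, the one rating that does not appear in the counts above
df['sleep_quality'] = df['sleep_quality'].fillna(2)
# df.sleep_quality.fillna(2, inplace=True)  # in-place alternative; best avoided, since it may not update df under copy-on-write
df.sleep_quality.value_counts(dropna=False)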

sleep_quality
5.0    2721
4.0    2261
3.0     997
2.0     351
1.0     103
Name: count, dtype: int64

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   survey_id           6433 non-null   int64  
 1   age                 6433 non-null   int64  
 2   number_of_children  6433 non-null   int64  
 3   activity_level      6433 non-null   object 
 4   sleep_quality       6433 non-null   float64
 5   number_of_snoozes   6433 non-null   int64  
 6   alarm_rating        6433 non-null   int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 351.9+ KB
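
Filling with a fixed value is one of several ways to deal with missing data. The sketch below uses a small made-up frame, not the survey data, to show a per-column count of missing values and then compare filling with dropping rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    "survey_id": [1, 2, 3, 4],
    "sleep_quality": [5.0, np.nan, 3.0, np.nan],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Option 1: fill with a fixed value (the notebook uses 2)
filled = toy.assign(sleep_quality=toy["sleep_quality"].fillna(2))

# Option 2: drop every row that contains a missing value
dropped = toy.dropna()

print(missing_per_column)
print(len(filled), len(dropped))
```

Filling keeps all rows, while dropping shrinks the frame. Which one fits depends on how much the missing rows matter to the later analysis.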


## 3. Inconsistent Text & Typos Check

* Find any inconsistent text and typos
* Deal with the inconsistent text and typos

In [8]:
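import numpy as np
# Map the variant spellings 'light_activity' and 'light' to 'lightly active'; leave other labels unchanged
df['activity_level'] = np.where(df['activity_level'].isin(['light_activity', 'light']), 'lightly active', df['activity_level'])
df['activity_level']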

0               active
1       lightly active
2       lightly active
3               active
4       lightly active
             ...      
6428            active
6429    lightly active
6430    lightly active
6431            active
6432    lightly active
Name: activity_level, Length: 6433, dtype: object
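
The cell above fixes the labels, but the step that finds them is not shown. One common approach is to list the distinct values with `value_counts()` and then map the variants onto a single label. The labels and the `fixes` mapping below are made up for illustration.

```python
import pandas as pd

# Made-up labels that mimic the inconsistencies handled above
activity = pd.Series(["active", "light", "light_activity", "lightly active", "very active"])

# Step 1: list every distinct label and how often it appears
print(activity.value_counts())

# Step 2: map the variant spellings onto one canonical label
fixes = {"light": "lightly active", "light_activity": "lightly active"}
activity_clean = activity.replace(fixes)

print(set(activity_clean))
```

A dictionary passed to `Series.replace` matches whole values only, so "lightly active" is left untouched even though it contains the text "light".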

## 4. Duplicate Data Check

* Find any duplicate data
* Deal with the duplicate data

In [9]:
# Drop fully duplicated rows and rebuild the index
df = df.drop_duplicates().reset_index(drop=True)
df

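| | survey_id | age | number_of_children | activity_level | sleep_quality | number_of_snoozes | alarm_rating |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 34 | 3 | active | 3.0 | 1 | 5 |
| 1 | 2 | 31 | 3 | lightly active | 3.0 | 3 | 3 |
| 2 | 3 | 18 | 0 | lightly active | 4.0 | 1 | 1 |
| 3 | 4 | 42 | 4 | active | 4.0 | 1 | 4 |
| 4 | 5 | 30 | 1 | lightly active | 1.0 | 4 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6361 | 6362 | 27 | 2 | active | 5.0 | 0 | 5 |
| 6362 | 6363 | 31 | 1 | lightly active | 4.0 | 0 | 4 |
| 6363 | 6364 | 26 | 0 | lightly active | 5.0 | 0 | 3 |
| 6364 | 6365 | 27 | 1 | active | 5.0 | 0 | 3 |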
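
The cell above drops duplicates without first counting them. Comparing the 6433 entries reported by `df.info()` with the count of 6366 in the later `describe()` output suggests that 67 rows were removed. As a sketch on a made-up frame, the duplicates can be counted before they are dropped:

```python
import pandas as pd

# Toy frame with one fully duplicated row (values are made up)
toy = pd.DataFrame({
    "survey_id": [1, 2, 2, 3],
    "age": [34, 31, 31, 18],
})

# duplicated() marks each repeat of an earlier row as True
n_duplicates = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```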


## 5. Outliers Check

* Find any outliers
* Deal with the outliers

In [20]:
# Summary statistics for number_of_snoozes; compare the max with the 75th percentile to spot extreme values
df.number_of_snoozes.describe()

count    6366.000000
mean        1.157870
std         1.603528
min         0.000000
25%         0.000000
50%         0.000000
75%         2.000000
max        19.000000
Name: number_of_snoozes, dtype: float64

In [26]:
import numpy as np
# Cap the extreme value: replace snooze counts of 19 or more with 10
df['number_of_snoozes'] = np.where(df['number_of_snoozes'] >= 19, 10, df['number_of_snoozes'])
df.number_of_snoozes.describe()

count    6366.000000
mean        1.156456
std         1.591719
min         0.000000
25%         0.000000
50%         0.000000
75%         2.000000
max        10.000000
Name: number_of_snoozes, dtype: float64
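
The replacement above targets the single value 19 by hand. A more general way to flag outliers is the interquartile range (IQR) rule, sketched below on made-up counts. The 1.5 × IQR threshold is a common convention and not something this notebook uses; the ceiling of 10 simply mirrors the choice made above.

```python
import pandas as pd

# Made-up snooze counts with one extreme value
snoozes = pd.Series([0, 0, 1, 2, 2, 3, 19])

# IQR rule: flag values more than 1.5 * IQR above the third quartile
q1, q3 = snoozes.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = snoozes[snoozes > upper_fence]

# Cap everything above a chosen ceiling (10 mirrors the notebook's choice)
capped = snoozes.clip(upper=10)

print(outliers.tolist(), capped.max())
```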

## 6. Data Issues Check

* Quickly explore the updated DataFrame. How do things look now after handling the data issues compared to the original DataFrame?

In [28]:
# Check for remaining missing data
df[df.isna().any(axis=1)]

| | survey_id | age | number_of_children | activity_level | sleep_quality | number_of_snoozes | alarm_rating |
|---|---|---|---|---|---|---|---|


In [29]:
# Check for inconsistent text
df.activity_level.value_counts()

activity_level
lightly active    3288
active            2422
very active        656
Name: count, dtype: int64

In [31]:
# Check for duplicate rows
df[df.duplicated()]

| | survey_id | age | number_of_children | activity_level | sleep_quality | number_of_snoozes | alarm_rating |
|---|---|---|---|---|---|---|---|


In [32]:
# Check for outliers
df.describe()

| | survey_id | age | number_of_children | sleep_quality | number_of_snoozes | alarm_rating |
|---|---|---|---|---|---|---|
| count | 6366.0 | 6366.0 | 6366.0 | 6366.0 | 6366.0 | 6366.0 |
| mean | 3183.5 | 29.075243 | 1.38093 | 4.10776 | 1.156456 | 2.955231 |
| std | 1837.850239 | 7.476856 | 1.389855 | 0.963601 | 1.591719 | 1.100328 |
| min | 1.0 | 13.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 25% | 1592.25 | 23.0 | 0.0 | 4.0 | 0.0 | 2.0 |
| 50% | 3183.5 | 28.0 | 1.0 | 4.0 | 0.0 | 3.0 |
| 75% | 4774.75 | 34.0 | 2.0 | 5.0 | 2.0 | 4.0 |
| max | 6366.0 | 47.0 | 5.0 | 5.0 | 10.0 | 5.0 |


## 7. Create Columns From Numeric Data

* Read data into Python
* Check the data type of each column
* Create a numeric column using arithmetic
* Create a numeric column using conditional logic

In [None]:
# Create a “Total Spend” column that includes both the pen cost and shipping cost for each sale
# Create a “Free Shipping” column that says yes if the sale included free shipping, and no otherwise

In [2]:
import pandas as pd 
pens=pd.read_excel(r"C:\Users\lzhen\Downloads\Data+Science+in+Python+-+Data+Prep+&+EDA\Data\Pen Sales Data.xlsx")
pens.head()

| | Customer | Item | Pen Cost | Shipping Cost | Purchase Date | Delivery Date | Review |
|---|---|---|---|---|---|---|---|
| 0 | 5201 | Ballpoint Pens | 5.99 | 2.99 | 2023-05-01 | 2023-05-03 | DoodleWithMe\|I love the way this pen writes, b... |
| 1 | 5202 | Sharpies | 12.99 | 0.0 | 2023-05-01 | 2023-05-04 | ScribbleMaster\|The classic Sharpie marker has ... |
| 2 | 5203 | Ballpoint Pens (Bold) | 6.95 | 4.99 | 2023-05-01 | 2023-05-02 | PenPalForever\|The retractable ballpoint pen ha... |
| 3 | 5204 | Gel Pens | 5.99 | 2.99 | 2023-05-01 | 2023-05-04 | TheWriteWay\|This gel pen has a comfortable gri... |
| 4 | 5205 | Rollerball Pens | 12.99 | 1.99 | 2023-05-01 | 2023-05-03 | PenAndPaperPerson\|The rollerball pen has a smo... |


In [7]:
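import numpy as np
# Total Spend is the pen cost plus the shipping cost
pens['Total Spend'] = pens['Pen Cost'] + pens['Shipping Cost']
# Free Shipping is 'Yes' when the shipping cost is zero, otherwise 'No'
pens['Free Shipping'] = np.where(pens['Shipping Cost'] == 0, 'Yes', 'No')
pens.head()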

| | Customer | Item | Pen Cost | Shipping Cost | Purchase Date | Delivery Date | Review | Total Spend | Free Shipping |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5201 | Ballpoint Pens | 5.99 | 2.99 | 2023-05-01 | 2023-05-03 | DoodleWithMe\|I love the way this pen writes, b... | 8.98 | No |
| 1 | 5202 | Sharpies | 12.99 | 0.0 | 2023-05-01 | 2023-05-04 | ScribbleMaster\|The classic Sharpie marker has ... | 12.99 | Yes |
| 2 | 5203 | Ballpoint Pens (Bold) | 6.95 | 4.99 | 2023-05-01 | 2023-05-02 | PenPalForever\|The retractable ballpoint pen ha... | 11.94 | No |
| 3 | 5204 | Gel Pens | 5.99 | 2.99 | 2023-05-01 | 2023-05-04 | TheWriteWay\|This gel pen has a comfortable gri... | 8.98 | No |
| 4 | 5205 | Rollerball Pens | 12.99 | 1.99 | 2023-05-01 | 2023-05-03 | PenAndPaperPerson\|The rollerball pen has a smo... | 14.98 | No |


## 8. Create Columns From DateTime Data

* Calculate the difference between two datetime columns and save it as a new column
* Take the average of a column

In [None]:
# Calculate the number of days between the purchase and delivery date for each sale
# Save it as a new column called “Delivery Time”
# What were the average days from purchase to delivery?

In [15]:

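# Delivery Time is the gap between purchase and delivery, stored as a timedelta
pens['Delivery Time'] = pens['Delivery Date'] - pens['Purchase Date']
# Average time from purchase to delivery
pens['Delivery Time'].mean()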

Timedelta('3 days 05:45:36')
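
The result above is a `Timedelta` of 3 days 05:45:36, which works out to 3.24 days. If a plain number of days per sale is preferred, the `.dt.days` accessor converts each timedelta to whole days. The sketch below uses made-up dates, and the "Delivery Days" column name is only illustrative.

```python
import pandas as pd

# Made-up purchase and delivery dates
orders = pd.DataFrame({
    "Purchase Date": pd.to_datetime(["2023-05-01", "2023-05-01", "2023-05-02"]),
    "Delivery Date": pd.to_datetime(["2023-05-03", "2023-05-04", "2023-05-06"]),
})

# Subtracting datetimes gives timedeltas; .dt.days keeps the whole-day count
orders["Delivery Days"] = (orders["Delivery Date"] - orders["Purchase Date"]).dt.days

print(orders["Delivery Days"].tolist())
print(orders["Delivery Days"].mean())
```

An integer column like this is easier to summarize and plot than a timedelta column, at the cost of discarding any time-of-day component.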

## 9. Create Columns From Text Data

* Split one column into multiple columns
* Create a Boolean column (True / False) to show whether a text field contains particular words

In [None]:
# Split the reviews on the “|” character to create two new columns: “User Name” and “Review Text”
# Create a “Leak or Spill” column that flags the reviews that mention either “leak” or “spill”
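
No solution cell follows this task in the notebook. As a minimal sketch on made-up reviews (not the pen sales file), the split and the flag could be written with the pandas string methods `str.split` and `str.contains`:

```python
import pandas as pd

# Made-up reviews in the "user|text" layout described above
reviews = pd.DataFrame({
    "Review": [
        "DoodleWithMe|I love this pen, but it leaks a little.",
        "ScribbleMaster|Great marker, no complaints.",
        "PenPalForever|Ink spilled all over my bag.",
    ]
})

# Split on the first "|" only, so a "|" inside the review text is kept
reviews[["User Name", "Review Text"]] = reviews["Review"].str.split("|", n=1, expand=True)

# Flag reviews mentioning "leak" or "spill" (case-insensitive; also matches "leaks", "spilled")
reviews["Leak or Spill"] = reviews["Review Text"].str.contains("leak|spill", case=False, regex=True)

print(reviews[["User Name", "Leak or Spill"]])
```

Because the pattern matches substrings, it also flags longer words that contain "leak" or "spill", so the flagged rows may be worth a quick manual check.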