## Preparing the Speed Dating Dataset
As an entrepreneur, you are planning to launch a new dating app. Its key differentiator will be a high-performing user-matching algorithm. Before building this model, you partnered with a speed dating company to collect data from real events. You have just received the dataset from your partner company, but it is not as clean as you expected: it contains missing and incorrect values. Your task is to fix the main data quality issues in this dataset.

In [1]:
import pandas as pd

In [2]:
file_url = ('https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop'
            '/master/Chapter11/dataset/Speed_Dating_Data.csv')

In [3]:
df = pd.read_csv(file_url)
print(df.shape)
df.head()

(8378, 195)


Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


In [4]:
# count rows that are exact duplicates across all columns
df.duplicated().sum()

0

In [5]:
# rows that repeat the same identifier columns (iid, id, partner, pid)
id_cols = ['iid', 'id', 'partner', 'pid']
df.loc[df.duplicated(subset=id_cols), id_cols]

Unnamed: 0,iid,id,partner,pid
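
Called with no arguments, `duplicated()` compares entire rows, so two rows that share identifiers but differ in any other column are not flagged. The sketch below illustrates the difference on a small toy frame; the column values and the `score` column are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 2 share the same identifiers
# but differ in the 'score' column.
toy = pd.DataFrame({
    'iid': [1, 1, 1],
    'pid': [11, 12, 11],
    'score': [7, 5, 8],
})

# No row is an exact copy of another, so nothing is flagged here.
full_dupes = toy.duplicated().sum()

# Restricting the comparison to the identifier columns flags row 2.
id_dupes = toy.duplicated(subset=['iid', 'pid']).sum()

print(full_dupes, id_dupes)
```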


In [6]:
df['imprace'].unique()

array([ 2.,  8.,  1.,  4.,  7.,  3.,  9., 10., nan,  5.,  6.,  0.])

In [7]:
# The data key provided by the author says imprace values should range from 1 to 10.
# Show the observations where imprace is 0
df.loc[df['imprace'] == 0]

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
6452,419,4.0,0,7,1,16,8,4,4.0,4,...,,,,,,,,,,
6453,419,4.0,0,7,1,16,8,4,4.0,5,...,,,,,,,,,,
6454,419,4.0,0,7,1,16,8,4,4.0,8,...,,,,,,,,,,
6455,419,4.0,0,7,1,16,8,4,4.0,3,...,,,,,,,,,,
6456,419,4.0,0,7,1,16,8,4,4.0,6,...,,,,,,,,,,
6457,419,4.0,0,7,1,16,8,4,4.0,1,...,,,,,,,,,,
6458,419,4.0,0,7,1,16,8,4,4.0,7,...,,,,,,,,,,
6459,419,4.0,0,7,1,16,8,4,4.0,2,...,,,,,,,,,,


In [8]:
# 8 observations have 0 as the value for imprace. Replace these values with 1,
# the closest valid number on the scale
df['imprace'] = df['imprace'].replace(0, 1)
df['imprace'].unique()

array([ 2.,  8.,  1.,  4.,  7.,  3.,  9., 10., nan,  5.,  6.])

In [9]:
# check for incorrect values for imprelig variable
df['imprelig'].unique()

array([ 4.,  5.,  1.,  3.,  2.,  8., 10.,  6., nan,  7.,  9.])

In [10]:
# check for incorrect values for sports variable
df['sports'].unique()

array([ 9.,  3.,  1.,  7., 10.,  5.,  2.,  4.,  8., nan,  6.])

In [11]:
# check for incorrect values for tvsports variable
df['tvsports'].unique()


array([ 2.,  8.,  1.,  4.,  3.,  9.,  7.,  6.,  5., 10., nan])

In [12]:
# check for incorrect values for exercise variable
df['exercise'].unique()


array([ 8.,  7.,  6.,  9.,  4.,  1.,  2.,  5., 10.,  3., nan])

In [13]:
# check for incorrect values for dining variable
df['dining'].unique()

array([ 9., 10.,  8.,  7.,  6.,  5., nan,  3.,  4.,  1.,  2.])

In [14]:
# check for incorrect values for museums variable
df['museums'].unique()

array([ 1.,  8.,  5.,  6., 10.,  9.,  7.,  3.,  4., nan,  2.,  0.])

In [27]:
# fix out-of-range values for museums variable; valid values are 1-10
df['museums'] = df['museums'].replace(0, 1)
# check to see if the incorrect values have been replaced.
df['museums'].unique()

array([ 1.,  8.,  5.,  6., 10.,  9.,  7.,  3.,  4., nan,  2.])

In [15]:
# check for incorrect values for art variable
df['art'].unique()


array([ 1.,  6.,  5.,  7.,  8., 10.,  9.,  3.,  4., nan,  2.,  0.])

In [28]:
# fix out-of-range values for art variable; valid values are 1-10
df['art'] = df['art'].replace(0, 1)
# check to see if the incorrect values have been replaced.
df['art'].unique()

array([ 1.,  6.,  5.,  7.,  8., 10.,  9.,  3.,  4., nan,  2.])

In [16]:
# check for incorrect values for hiking variable
df['hiking'].unique()

array([ 5.,  3.,  8.,  7.,  6.,  9.,  2.,  4., 10.,  1., nan,  0.])

In [29]:
# fix out-of-range values for hiking variable; valid values are 1-10
df['hiking'] = df['hiking'].replace(0, 1)
# check to see if the incorrect values have been replaced.
df['hiking'].unique()

array([ 5.,  3.,  8.,  7.,  6.,  9.,  2.,  4., 10.,  1., nan])

In [17]:
# check for incorrect values for gaming variable
df['gaming'].unique()

array([ 1.,  5.,  4.,  6.,  2.,  3.,  7.,  8., 10., nan,  9., 14.,  0.])

In [31]:
# fix out-of-range values for gaming variable; valid values are 1-10
# map each invalid value to the closest valid one in a single call
df['gaming'] = df['gaming'].replace({0: 1, 14: 10})
# check to see if the incorrect values have been replaced.
df['gaming'].unique()

array([ 1.,  5.,  4.,  6.,  2.,  3.,  7.,  8., 10., nan,  9.])
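
Replacing each invalid value with the closest number on the scale is the same as clamping the column to the valid range. As a possible alternative to listing every bad value, the sketch below uses `Series.clip()` on a made-up series; the values are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on a 1-10 scale with out-of-range and missing entries.
ratings = pd.Series([0, 5, 14, np.nan, 10])

# clip() raises values below 1 up to 1 and lowers values above 10 down to 10;
# missing values are left untouched.
clipped = ratings.clip(lower=1, upper=10)
print(clipped.tolist())
```

Unlike `replace()`, this clamps any out-of-range value, including ones you have not spotted yet, so it is worth inspecting the unique values first to be sure clamping is what you want.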

In [18]:
# check for incorrect values for clubbing variable
df['clubbing'].unique()

array([ 5.,  8.,  7.,  6., 10.,  4.,  9.,  2.,  3., nan,  1.,  0.])

In [32]:
# fix out-of-range values for clubbing variable; valid values are 1-10
df['clubbing'] = df['clubbing'].replace(0, 1)
# check to see if the incorrect values have been replaced.
df['clubbing'].unique()

array([ 5.,  8.,  7.,  6., 10.,  4.,  9.,  2.,  3., nan,  1.])

In [19]:
# check for incorrect values for reading variable
df['reading'].unique()

array([ 6., 10.,  7.,  9.,  8.,  4.,  5., nan,  2.,  3.,  1., 13.])

In [33]:
# fix out-of-range values for reading variable; valid values are 1-10
df['reading'] = df['reading'].replace(13, 10)
# check to see if the incorrect values have been replaced.
df['reading'].unique()

array([ 6., 10.,  7.,  9.,  8.,  4.,  5., nan,  2.,  3.,  1.])

In [20]:
# check for incorrect values for tv variable
df['tv'].unique()

array([ 9.,  1.,  8.,  7.,  2., 10.,  5.,  6.,  3., nan,  4.])

In [21]:
# check for incorrect values for theater variable
df['theater'].unique()

array([ 1.,  9.,  7.,  6.,  5., 10.,  4.,  3.,  8., nan,  2.,  0.])

In [34]:
# fix out-of-range values for theater variable; valid values are 1-10
df['theater'] = df['theater'].replace(0, 1)
# check to see if the incorrect values have been replaced.
df['theater'].unique()

array([ 1.,  9.,  7.,  6.,  5., 10.,  4.,  3.,  8., nan,  2.])

In [22]:
# check for incorrect values for movies variable
df['movies'].unique()

array([10.,  8.,  7.,  6.,  9.,  5., nan,  4.,  3.,  2.,  0.])

In [35]:
# fix out-of-range values for movies variable; valid values are 1-10
df['movies'] = df['movies'].replace(0, 1)
# check to see if the incorrect values have been replaced.
df['movies'].unique()

array([10.,  8.,  7.,  6.,  9.,  5., nan,  4.,  3.,  2.,  1.])

In [23]:
# check for incorrect values for concerts variable
df['concerts'].unique()

array([10.,  7.,  8.,  3.,  6.,  9.,  4.,  5., nan,  2.,  1.,  0.])

In [36]:
# fix out-of-range values for concerts variable; valid values are 1-10
df['concerts'] = df['concerts'].replace(0, 1)
# check to see if the incorrect values have been replaced.
df['concerts'].unique()

array([10.,  7.,  8.,  3.,  6.,  9.,  4.,  5., nan,  2.,  1.])

In [24]:
# check for incorrect values for music variable
df['music'].unique()

array([ 9.,  8.,  5.,  7.,  4., 10.,  6., nan,  1.,  2.,  3.])

In [25]:
# check for incorrect values for shopping variable
df['shopping'].unique()

array([ 8.,  3.,  1., 10.,  7.,  5.,  6.,  2., nan,  9.,  4.])

In [26]:
# check for incorrect values for yoga variable
df['yoga'].unique()

array([ 1.,  7.,  8.,  3., 10.,  6.,  5.,  2.,  4., nan,  9.,  0.])

In [37]:
# fix out-of-range values for yoga variable; valid values are 1-10
df['yoga'] = df['yoga'].replace(0, 1)
# check to see if the incorrect values have been replaced.
df['yoga'].unique()

array([ 1.,  7.,  8.,  3., 10.,  6.,  5.,  2.,  4., nan,  9.])
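
The column-by-column checks above all follow the same pattern, so they could also be run in a loop. The following is a minimal sketch, assuming a 1-10 scale for every column checked; `out_of_range` is a hypothetical helper, and the toy frame stands in for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'gaming': [1, 14, 0, np.nan],
    'reading': [6, 13, 2, 9],
    'tv': [9, 1, np.nan, 4],
})

def out_of_range(frame, columns, low=1, high=10):
    """Return {column: sorted out-of-range values}, ignoring missing values."""
    found = {}
    for col in columns:
        values = frame[col].dropna()
        bad = values[(values < low) | (values > high)].unique()
        if len(bad) > 0:
            found[col] = sorted(bad.tolist())
    return found

report = out_of_range(toy, ['gaming', 'reading', 'tv'])
print(report)
```

On the real data you would pass `df` and the list of rating columns instead of the toy frame.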

### Check the data type of the different columns using .dtypes.

In [24]:
# data types of columns
df.dtypes

iid           int64
id          float64
gender        int64
idg           int64
condtn        int64
             ...   
attr5_3     float64
sinc5_3     float64
intel5_3    float64
fun5_3      float64
amb5_3      float64
Length: 195, dtype: object

### Change the data types to categorical for the columns that don't contain numerical values using .astype()


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Columns: 195 entries, iid to amb5_3
dtypes: float64(178), int64(13), object(4)
memory usage: 12.5+ MB


In [45]:
df.select_dtypes(include='object').columns

Index(['field', 'undergra', 'from', 'career'], dtype='object')

In [None]:
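# convert the four object (text) columns to the categorical data type
cat_cols = ['field', 'undergra', 'from', 'career']
df = df.astype({col: 'category' for col in cat_cols})
df[cat_cols].dtypes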
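
To illustrate what the conversion to a categorical type does, here is a small sketch on a made-up frame. The column names and values are hypothetical; only `astype()` and the `.cat` accessor are standard pandas.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column (hypothetical values).
toy = pd.DataFrame({
    'career': ['lawyer', 'teacher', 'lawyer', np.nan],
    'age': [25, 31, 28, 24],
})

# Convert only the text column; the numeric column keeps its dtype.
toy = toy.astype({'career': 'category'})

print(toy.dtypes)
# The distinct labels are stored once; missing values are not a category.
print(toy['career'].cat.categories.tolist())
```

Missing entries stay missing after the conversion, so they still need to be handled separately.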

### Check for any missing values using .isna() and .sum()
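
A minimal sketch of this step on a toy frame is shown here; the columns and values are invented for illustration. With 195 columns in the real dataset, filtering and sorting the counts makes the result easier to read.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (hypothetical values).
toy = pd.DataFrame({
    'age': [25, np.nan, 28, 24],
    'income': [np.nan, np.nan, 52000, 61000],
    'gender': [0, 1, 1, 0],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing = toy.isna().sum()

# Keep only the columns that actually have gaps, largest count first.
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```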

### Replace the missing values for each numerical variable with their corresponding mean or median values using .fillna(), .mean(), and .median().
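
One way this step could look is sketched below on a toy frame with invented values. Using the mean for a roughly symmetric column and the median for a skewed one is an assumption made for this example, not a rule stated by the exercise.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    'age': [22.0, 24.0, np.nan, 26.0],      # roughly symmetric
    'income': [30.0, 40.0, np.nan, 500.0],  # skewed by one large value
})

# mean() and median() skip missing values by default.
toy['age'] = toy['age'].fillna(toy['age'].mean())
toy['income'] = toy['income'].fillna(toy['income'].median())

print(toy)
```

The median is less sensitive to extreme values, which is why it is used for the skewed column here.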