Step 1: Identify and handle missing values

Missing ages are replaced with the average age of the users.

Missing heights and weights are replaced with the averages of those columns.

Missing values in purchases are replaced with a placeholder number chosen later (0 is used below).


pandas treats NaN as a null (missing) value.
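
Before imputing anything, it can help to count the missing entries per column. The sketch below uses a small hand-built frame; the name `sample` and its values are illustrative assumptions, not the real data file.

```python
import numpy as np
import pandas as pd

# Illustrative frame, not the real data file
sample = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "height_cm": [175.0, 160.0, None],  # None becomes NaN in a float column
})

# isna() marks NaN/None cells; sum() counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```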

In [19]:
import pandas as pd
import json as js
import numpy as np  # For numerical operations

In [20]:
df=pd.read_json('Data/data.json')
df.head()

Unnamed: 0,user,age,height_cm,weight_kg,purchases,income
0,User1,25.0,175.0,70.0,"[1, 0, 2, None]",40000
1,User2,,160.0,,"[0, 1, None, 3]",25000
2,User3,35.0,,85.0,"[None, 2, 1, 0]",70000
3,User4,50.0,180.0,95.0,"[3, None, 2, 1]",50000
4,User5,150.0,170.0,200.0,"[2, 1, 0, 3]",120000


Filling missing data with average values


In [21]:
avg_age = df['age'].mean()  # mean() skips NaN by default; nothing is replaced yet
avg_height = df['height_cm'].mean()
avg_weight = df['weight_kg'].mean()

In [22]:
print(avg_age)
print(avg_height)
print(avg_weight)

65.0
171.25
112.5
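
The average age of 65 is pulled upward by the value 150, which is capped as an outlier only later. As a rough illustration of that sensitivity, the sketch below compares the mean with the median on the ages shown in the table above. Using the median is an assumption offered for comparison, not what this notebook does.

```python
import pandas as pd

# Ages from the table above; User2's age is missing
ages = pd.Series([25.0, None, 35.0, 50.0, 150.0])

mean_age = ages.mean()      # NaN is skipped by default
median_age = ages.median()  # less sensitive to the single value of 150
print(mean_age, median_age)
```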


In [23]:
df['age'] = df['age'].fillna(round(avg_age)) # fillna also accepts a dict or Series; a single value is enough here
df['height_cm'] = df['height_cm'].fillna(round(avg_height))
df['weight_kg'] = df['weight_kg'].fillna(round(avg_weight))

In [24]:
df.head()

Unnamed: 0,user,age,height_cm,weight_kg,purchases,income
0,User1,25.0,175.0,70.0,"[1, 0, 2, None]",40000
1,User2,65.0,160.0,112.0,"[0, 1, None, 3]",25000
2,User3,35.0,171.0,85.0,"[None, 2, 1, 0]",70000
3,User4,50.0,180.0,95.0,"[3, None, 2, 1]",50000
4,User5,150.0,170.0,200.0,"[2, 1, 0, 3]",120000
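
The three column-by-column fills can also be written as one call, since `fillna` accepts a dict that maps column names to fill values. This is a minimal sketch on an illustrative frame; `sample` and `fill_values` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Illustrative frame, not the real data file
sample = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "weight_kg": [70.0, np.nan, 85.0],
})

# Rounded column means (NaN is skipped), as a {column: value} dict
fill_values = sample[["age", "weight_kg"]].mean().round().to_dict()

# Each column is filled with its own value in a single call
filled = sample.fillna(value=fill_values)
print(filled)
```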


Replacing None in each purchases list with 0

In [25]:
def clean_purchases(purchases):
    return [0 if p is None else p for p in purchases]

df['purchases'] = df['purchases'].apply(clean_purchases)
df.head()

Unnamed: 0,user,age,height_cm,weight_kg,purchases,income
0,User1,25.0,175.0,70.0,"[1, 0, 2, 0]",40000
1,User2,65.0,160.0,112.0,"[0, 1, 0, 3]",25000
2,User3,35.0,171.0,85.0,"[0, 2, 1, 0]",70000
3,User4,50.0,180.0,95.0,"[3, 0, 2, 1]",50000
4,User5,150.0,170.0,200.0,"[2, 1, 0, 3]",120000


Step 2: Managing unrealistic values (outliers), such as the age and weight of User5

In [26]:
max_age=100
max_weight=150

In [27]:
df.head()

Unnamed: 0,user,age,height_cm,weight_kg,purchases,income
0,User1,25.0,175.0,70.0,"[1, 0, 2, 0]",40000
1,User2,65.0,160.0,112.0,"[0, 1, 0, 3]",25000
2,User3,35.0,171.0,85.0,"[0, 2, 1, 0]",70000
3,User4,50.0,180.0,95.0,"[3, 0, 2, 1]",50000
4,User5,150.0,170.0,200.0,"[2, 1, 0, 3]",120000


In [28]:
df['age'] = df['age'].apply(lambda x : min(x, max_age))
df['weight_kg'] = df['weight_kg'].apply(lambda x :min(x, max_weight))

df.head()

Unnamed: 0,user,age,height_cm,weight_kg,purchases,income
0,User1,25.0,175.0,70.0,"[1, 0, 2, 0]",40000
1,User2,65.0,160.0,112.0,"[0, 1, 0, 3]",25000
2,User3,35.0,171.0,85.0,"[0, 2, 1, 0]",70000
3,User4,50.0,180.0,95.0,"[3, 0, 2, 1]",50000
4,User5,100.0,170.0,150.0,"[2, 1, 0, 3]",120000
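
The same capping can be expressed with `Series.clip`, which avoids a Python-level lambda per row. A minimal sketch on an illustrative two-row frame (`sample` is a hypothetical name), with the same limits as above:

```python
import pandas as pd

max_age = 100
max_weight = 150

# Illustrative rows modelled on User4 and User5
sample = pd.DataFrame({"age": [50.0, 150.0], "weight_kg": [95.0, 200.0]})

# clip(upper=...) replaces values above the limit with the limit itself
sample["age"] = sample["age"].clip(upper=max_age)
sample["weight_kg"] = sample["weight_kg"].clip(upper=max_weight)
print(sample)
```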


Step 3: Normalizing data

In [29]:
# First compute the max values
max_height = df['height_cm'].max()
max_weight = df['weight_kg'].max()  # optional: after capping, this equals
# the max_weight = 150 already set in Step 2
max_income = df['income'].max()

In [30]:
# Guard against division by zero: fall back to 1 if a max is 0
max_height = max_height if max_height !=0 else 1
max_weight = max_weight if max_weight !=0 else 1
max_income = max_income if max_income !=0 else 1

In [31]:
df['height_cm'] = df['height_cm']/max_height
df['weight_kg'] = df['weight_kg']/max_weight
df['income'] = df['income']/max_income

In [32]:
df[['user','height_cm','weight_kg','income']].head()

Unnamed: 0,user,height_cm,weight_kg,income
0,User1,0.972222,0.466667,0.333333
1,User2,0.888889,0.746667,0.208333
2,User3,0.95,0.566667,0.583333
3,User4,1.0,0.633333,0.416667
4,User5,0.944444,1.0,1.0
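
The compute-max, guard, and divide steps can be folded into one helper. The function name `scale_by_max` is an assumption made for this sketch; the heights are the imputed values from the table in Step 1.

```python
import pandas as pd

def scale_by_max(column: pd.Series) -> pd.Series:
    """Divide a column by its maximum; use 1 as the divisor when the maximum is 0."""
    max_value = column.max()
    divisor = max_value if max_value != 0 else 1
    return column / divisor

# Heights after imputation (User3's missing height was filled with 171)
heights = pd.Series([175.0, 160.0, 171.0, 180.0, 170.0])
scaled_heights = scale_by_max(heights)
print(scaled_heights.round(6).tolist())
```

Dividing by the maximum maps the largest value to 1 but does not map the smallest to 0, which is the behaviour of the cells above.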


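Encoding a categorical column

Note: the line below is a binary (0/1) encoding rather than a true one-hot encoding, and the data shown above has no gender column, so it is included for illustration only.

df['gender']=df['gender'].apply(lambda g: 1 if g=='M' else 0)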
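
For comparison, a true one-hot encoding produces one indicator column per category, which `pd.get_dummies` does. The sketch assumes a hypothetical `gender` column with values 'M' and 'F'; no such column exists in the data loaded above.

```python
import pandas as pd

# Hypothetical frame with a gender column (not present in the real data)
sample = pd.DataFrame({
    "user": ["User1", "User2", "User3"],
    "gender": ["M", "F", "M"],
})

# One indicator column per category: gender_F and gender_M
encoded = pd.get_dummies(sample, columns=["gender"], dtype=int)
print(encoded)
```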

In [33]:
output_path = './data/data_cleaned.json'  # a file path, not a DataFrame
df.to_json(output_path, orient='records', indent=4)  # saves cleaned data to a new JSON file

print(f'the cleaned data has been saved to {output_path}')

the cleaned data has been saved to ./data/data_cleaned.json
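
To confirm that the records survive the save, the JSON can be read back and compared. The sketch below round-trips through an in-memory string rather than a file so that it runs anywhere; `sample` and `restored` are illustrative names.

```python
from io import StringIO

import pandas as pd

# Illustrative frame with a list-valued column, like purchases
sample = pd.DataFrame({
    "user": ["User1", "User2"],
    "purchases": [[1, 0, 2, 0], [0, 1, 0, 3]],
})

# With no path argument, to_json returns the JSON text
json_text = sample.to_json(orient="records", indent=4)

# Read the same text back into a DataFrame
restored = pd.read_json(StringIO(json_text), orient="records")
print(restored)
```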


In [34]:
df.head()

Unnamed: 0,user,age,height_cm,weight_kg,purchases,income
0,User1,25.0,0.972222,0.466667,"[1, 0, 2, 0]",0.333333
1,User2,65.0,0.888889,0.746667,"[0, 1, 0, 3]",0.208333
2,User3,35.0,0.95,0.566667,"[0, 2, 1, 0]",0.583333
3,User4,50.0,1.0,0.633333,"[3, 0, 2, 1]",0.416667
4,User5,100.0,0.944444,1.0,"[2, 1, 0, 3]",1.0


In [37]:
df['purchases'].head()

0    [1, 0, 2, 0]
1    [0, 1, 0, 3]
2    [0, 2, 1, 0]
3    [3, 0, 2, 1]
4    [2, 1, 0, 3]
Name: purchases, dtype: object