# Practical - Data Pre-processing Part 2
This practical session demonstrates common data pre-processing steps: scaling and normalizing numeric columns, converting categorical values into indicator columns, and binning. We assume everyone has an adequate understanding of the Python programming language. If you would like to refresh your Python skills, we recommend our <b>"Programming for Data Science Series"</b>, which covers almost all aspects of Python programming in the data science domain.
The full playlist, almost 10 hours of video lessons in the Burmese language, is at the URL below.
URL : https://www.youtube.com/watch?v=jOZNjVVZIVs&list=PLD_eiqVVLZDi9GZZJDC8Zx4-3Np8LHs52

In [39]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/myanmards/resource_files/master/sample-clean.csv')
data.head()

Unnamed: 0,emp_id,birth_date,first_name,last_name,gender,hire_date,salary,height,weight,birth_place
0,1,02-09-53,Georgi,Facello,M,26-06-86,500000,165,64.0,NY
1,2,02-06-64,Bezalel,Simmel,F,21-11-85,120000,155,59.0,Chicago
2,3,03-12-59,Parto,Bamford,M,28-08-86,350000,158,75.0,Los Angeles
3,4,01-05-54,Chirstian,Koblick,M,01-12-86,400000,149,77.0,Los Angeles
4,5,21-01-55,Kyoichi,Maliniak,M,12-09-89,200000,169,74.0,LA


#### Simple Feature Scaling
The normalized value is the old value divided by the maximum value.

In [40]:
data["height"] = data["height"]/data["height"].max()
data.head()

Unnamed: 0,emp_id,birth_date,first_name,last_name,gender,hire_date,salary,height,weight,birth_place
0,1,02-09-53,Georgi,Facello,M,26-06-86,500000,0.846154,64.0,NY
1,2,02-06-64,Bezalel,Simmel,F,21-11-85,120000,0.794872,59.0,Chicago
2,3,03-12-59,Parto,Bamford,M,28-08-86,350000,0.810256,75.0,Los Angeles
3,4,01-05-54,Chirstian,Koblick,M,01-12-86,400000,0.764103,77.0,Los Angeles
4,5,21-01-55,Kyoichi,Maliniak,M,12-09-89,200000,0.866667,74.0,LA
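
As a quick check on the formula above, here is a minimal sketch on a small hypothetical Series (the `heights` values are made up for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical heights in centimetres (illustration only, not from the dataset).
heights = pd.Series([150.0, 165.0, 180.0, 200.0])

# Simple feature scaling: divide every value by the column maximum.
scaled = heights / heights.max()

print(scaled.tolist())
```

For positive data, the largest value maps to 1 and every other value falls between 0 and 1. The smallest value is not mapped to 0, which is one difference from min-max normalization.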


#### Min-Max Normalization
The normalized value is the old value minus the minimum value, divided by the difference between the maximum and minimum values.

In [41]:
data["weight"] = (data["weight"] - data["weight"].min()) / (data["weight"].max() - data["weight"].min())
data.head()

Unnamed: 0,emp_id,birth_date,first_name,last_name,gender,hire_date,salary,height,weight,birth_place
0,1,02-09-53,Georgi,Facello,M,26-06-86,500000,0.846154,0.46875,NY
1,2,02-06-64,Bezalel,Simmel,F,21-11-85,120000,0.794872,0.3125,Chicago
2,3,03-12-59,Parto,Bamford,M,28-08-86,350000,0.810256,0.8125,Los Angeles
3,4,01-05-54,Chirstian,Koblick,M,01-12-86,400000,0.764103,0.875,Los Angeles
4,5,21-01-55,Kyoichi,Maliniak,M,12-09-89,200000,0.866667,0.78125,LA
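
The `weight` column contains missing values, so it may help to see how they behave under this formula. The sketch below uses a small hypothetical Series (the values are assumptions for illustration): pandas skips `NaN` when computing `min()` and `max()`, and the missing entry stays missing after normalization.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value (illustration only).
weights = pd.Series([50.0, 60.0, np.nan, 90.0])

# min() and max() skip NaN by default.
w_min, w_max = weights.min(), weights.max()

# Min-max normalization: the minimum maps to 0 and the maximum maps to 1.
normalized = (weights - w_min) / (w_max - w_min)

print(normalized.tolist())
```

Missing values therefore have to be handled separately, either before or after normalization.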


#### Z-score Normalization
The normalized value is the old value minus the mean, divided by the standard deviation.

In [42]:
data.describe()

Unnamed: 0,emp_id,salary,height,weight
count,100.0,100.0,100.0,78.0
mean,50.5,325200.0,0.858718,0.491987
std,29.011492,193294.039894,0.04609,0.247473
min,1.0,120000.0,0.764103,0.0
25%,25.75,200000.0,0.820513,0.3125
50%,50.5,300000.0,0.861538,0.5
75%,75.25,400000.0,0.888462,0.679688
max,100.0,750000.0,1.0,1.0


In [43]:
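data.mean(numeric_only=True)  # restrict to numeric columns; recent pandas versions raise an error on text columns otherwise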

emp_id        50.500000
salary    325200.000000
height         0.858718
weight         0.491987
dtype: float64

In [44]:
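data.std(numeric_only=True)  # restrict to numeric columns; recent pandas versions raise an error on text columns otherwise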

emp_id        29.011492
salary    193294.039894
height         0.046090
weight         0.247473
dtype: float64

In [45]:
data["salary"] = (data["salary"] - data["salary"].mean()) / data["salary"].std()
data.head()

Unnamed: 0,emp_id,birth_date,first_name,last_name,gender,hire_date,salary,height,weight,birth_place
0,1,02-09-53,Georgi,Facello,M,26-06-86,0.904322,0.846154,0.46875,NY
1,2,02-06-64,Bezalel,Simmel,F,21-11-85,-1.061595,0.794872,0.3125,Chicago
2,3,03-12-59,Parto,Bamford,M,28-08-86,0.128302,0.810256,0.8125,Los Angeles
3,4,01-05-54,Chirstian,Koblick,M,01-12-86,0.386975,0.764103,0.875,Los Angeles
4,5,21-01-55,Kyoichi,Maliniak,M,12-09-89,-0.647718,0.866667,0.78125,LA
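
One detail worth knowing about the z-score cell above: pandas `std()` computes the sample standard deviation (`ddof=1`) by default. The sketch below, on a hypothetical Series made up for illustration, compares it with the population standard deviation (`ddof=0`).

```python
import pandas as pd

# Hypothetical salaries (illustration only, not from the dataset).
salaries = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])

# Z-score with the pandas default: sample standard deviation (ddof=1).
z_sample = (salaries - salaries.mean()) / salaries.std()

# Z-score with the population standard deviation (ddof=0).
z_population = (salaries - salaries.mean()) / salaries.std(ddof=0)

print(z_sample.round(4).tolist())
print(z_population.round(4).tolist())
```

Either way, the result has a mean of 0. The two versions differ only by a constant factor, and the difference shrinks as the number of rows grows.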


### Converting Categorical Values to Indicator Variables

In [46]:
pd.get_dummies(data, columns=["gender"], dtype=int)  # dtype=int keeps the 0/1 output shown below on recent pandas versions

Unnamed: 0,emp_id,birth_date,first_name,last_name,hire_date,salary,height,weight,birth_place,gender_F,gender_M
0,1,02-09-53,Georgi,Facello,26-06-86,0.904322,0.846154,0.46875,NY,0,1
1,2,02-06-64,Bezalel,Simmel,21-11-85,-1.061595,0.794872,0.31250,Chicago,1,0
2,3,03-12-59,Parto,Bamford,28-08-86,0.128302,0.810256,0.81250,Los Angeles,0,1
3,4,01-05-54,Chirstian,Koblick,01-12-86,0.386975,0.764103,0.87500,Los Angeles,0,1
4,5,21-01-55,Kyoichi,Maliniak,12-09-89,-0.647718,0.866667,0.78125,LA,0,1
5,6,20-04-53,Anneke,Preusig,02-06-89,-0.130371,0.830769,,Houston,1,0
6,7,23-05-57,Tzvetan,Zielinski,10-02-89,-0.906391,0.820513,0.40625,Chicago,1,0
7,8,19-02-58,Saniya,Kalloufi,15-09-94,2.197688,0.815385,0.75000,WDC,0,1
8,9,19-04-52,Sumant,Peac,18-02-85,2.197688,0.887179,0.50000,WDC,1,0
9,10,01-06-63,Duangkaew,Piveteau,24-08-89,-0.647718,0.825641,,Washington D.C.,1,0


In [47]:
data.head()

Unnamed: 0,emp_id,birth_date,first_name,last_name,gender,hire_date,salary,height,weight,birth_place
0,1,02-09-53,Georgi,Facello,M,26-06-86,0.904322,0.846154,0.46875,NY
1,2,02-06-64,Bezalel,Simmel,F,21-11-85,-1.061595,0.794872,0.3125,Chicago
2,3,03-12-59,Parto,Bamford,M,28-08-86,0.128302,0.810256,0.8125,Los Angeles
3,4,01-05-54,Chirstian,Koblick,M,01-12-86,0.386975,0.764103,0.875,Los Angeles
4,5,21-01-55,Kyoichi,Maliniak,M,12-09-89,-0.647718,0.866667,0.78125,LA


In [37]:
data = pd.get_dummies(data, columns=['gender'], dtype=int)  # dtype=int keeps the 0/1 output shown below on recent pandas versions
data.head()

Unnamed: 0,emp_id,birth_date,first_name,last_name,hire_date,salary,height,weight,birth_place,gender_F,gender_M
0,1,02-09-53,Georgi,Facello,26-06-86,0.904322,0.846154,0.46875,NY,0,1
1,2,02-06-64,Bezalel,Simmel,21-11-85,-1.061595,0.794872,0.3125,Chicago,1,0
2,3,03-12-59,Parto,Bamford,28-08-86,0.128302,0.810256,0.8125,Los Angeles,0,1
3,4,01-05-54,Chirstian,Koblick,01-12-86,0.386975,0.764103,0.875,Los Angeles,0,1
4,5,21-01-55,Kyoichi,Maliniak,12-09-89,-0.647718,0.866667,0.78125,LA,0,1
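
The cells above show that `pd.get_dummies` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch on a hypothetical DataFrame (the `people` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data (illustration only).
people = pd.DataFrame({"name": ["A", "B", "C"], "gender": ["M", "F", "M"]})

# One indicator column per category; the original "gender" column is dropped
# from the returned frame. dtype=int requests 0/1 instead of booleans.
encoded = pd.get_dummies(people, columns=["gender"], dtype=int)

print(encoded)
print(people.columns.tolist())  # the original frame still has "gender"
```

The new columns are named from the original column name and the category value, here `gender_F` and `gender_M`.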


### Discrete Binning

In [49]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/myanmards/resource_files/master/sample-clean.csv')
data.head()

Unnamed: 0,emp_id,birth_date,first_name,last_name,gender,hire_date,salary,height,weight,birth_place
0,1,02-09-53,Georgi,Facello,M,26-06-86,500000,165,64.0,NY
1,2,02-06-64,Bezalel,Simmel,F,21-11-85,120000,155,59.0,Chicago
2,3,03-12-59,Parto,Bamford,M,28-08-86,350000,158,75.0,Los Angeles
3,4,01-05-54,Chirstian,Koblick,M,01-12-86,400000,149,77.0,Los Angeles
4,5,21-01-55,Kyoichi,Maliniak,M,12-09-89,200000,169,74.0,LA


In [50]:
import numpy as np
# 6 equally spaced edges define 5 equal-width bins
bins = np.linspace(data["salary"].min(), data["salary"].max(), 6)
bins

array([120000., 246000., 372000., 498000., 624000., 750000.])
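
To make the relationship between edges and bins explicit, here is a small sketch with hypothetical numbers (`low`, `high`, and `n_bins` are illustrative names, not part of the original notebook): `n_bins` equal-width bins need `n_bins + 1` edges.

```python
import numpy as np

# Hypothetical range and bin count (illustration only).
low, high, n_bins = 0.0, 100.0, 4

# n_bins bins require n_bins + 1 edges, including both endpoints.
edges = np.linspace(low, high, n_bins + 1)

# The width of each bin is the difference between consecutive edges.
widths = np.diff(edges)

print(edges)
print(widths)
```

In the salary example above, the same reasoning gives a bin width of (750000 - 120000) / 5 = 126000.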

In [51]:
bin_names = ["Very Low", "Low", "Medium", "High", "Very High"]
data["salary-group"] = pd.cut(data["salary"], bins, labels=bin_names, include_lowest=True)

In [52]:
data.head(10)

Unnamed: 0,emp_id,birth_date,first_name,last_name,gender,hire_date,salary,height,weight,birth_place,salary-group
0,1,02-09-53,Georgi,Facello,M,26-06-86,500000,165,64.0,NY,High
1,2,02-06-64,Bezalel,Simmel,F,21-11-85,120000,155,59.0,Chicago,Very Low
2,3,03-12-59,Parto,Bamford,M,28-08-86,350000,158,75.0,Los Angeles,Low
3,4,01-05-54,Chirstian,Koblick,M,01-12-86,400000,149,77.0,Los Angeles,Medium
4,5,21-01-55,Kyoichi,Maliniak,M,12-09-89,200000,169,74.0,LA,Very Low
5,6,20-04-53,Anneke,Preusig,F,02-06-89,300000,162,,Houston,Low
6,7,23-05-57,Tzvetan,Zielinski,F,10-02-89,150000,160,62.0,Chicago,Very Low
7,8,19-02-58,Saniya,Kalloufi,M,15-09-94,750000,159,73.0,WDC,Very High
8,9,19-04-52,Sumant,Peac,F,18-02-85,750000,173,65.0,WDC,Very High
9,10,01-06-63,Duangkaew,Piveteau,F,24-08-89,200000,161,,Washington D.C.,Very Low
