# Data Preprocessing

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Importing the dataset
dataset = pd.read_csv('Data.csv')
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [3]:
dataset['Salary'].mean(skipna = True)

63777.77777777778

## 1. Taking care of missing data

With `axis=0`, `DataFrame.mean` returns the mean of each column; with `axis=1`, it returns the mean of each row.

Ref links:  
- [Pandas mean calculation for missing values](https://www.geeksforgeeks.org/python-pandas-dataframe-mean/)
- [Pandas Doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)
- [Using apply fn](https://stackoverflow.com/questions/18689823/pandas-dataframe-replace-nan-values-with-average-of-columns)
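As a small illustration of the `axis` argument, the sketch below builds a toy frame and computes both means. The frame `toy` and its values are made up for this example and are not taken from `Data.csv`. Missing values are skipped by default (`skipna=True`), which is why the last row mean uses only the one value that is present.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not from Data.csv)
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [10.0, 20.0, 30.0]})

col_means = toy.mean(axis=0)  # one value per column; NaN is skipped by default
row_means = toy.mean(axis=1)  # one value per row

print(col_means)
print(row_means)
```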

In [4]:
dataset['Salary'].isnull()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: Salary, dtype: bool

In [5]:
dataset['Salary'] = dataset['Salary'].fillna(dataset['Salary'].mean())
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes


## 2. Encoding categorical data


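Q) Do we need to apply OneHotEncoder to encode an independent variable that gives the size S, M or L of a t-shirt?

Ans: No. Size is an ordinal variable (S < M < L), so an encoding that preserves the order is a better fit. OneHotEncoder is better suited to nominal categorical variables such as Country, where the categories have no natural order.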
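One possible way to encode an ordered variable like t-shirt size is an explicit mapping to integers. The sketch below is only an illustration: the `shirts` frame, the `size_order` dictionary and the chosen codes 0, 1, 2 are assumptions made for this example and do not appear in the dataset.

```python
import pandas as pd

# Hypothetical t-shirt data, not part of Data.csv
shirts = pd.DataFrame({"size": ["M", "S", "L", "M"]})

# Explicit mapping that keeps the natural order S < M < L
size_order = {"S": 0, "M": 1, "L": 2}
shirts["size_encoded"] = shirts["size"].map(size_order)

print(shirts)
```

The integer codes keep the ordering information that one-hot columns would discard.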

In [6]:
dataset = pd.get_dummies(dataset, columns = ['Country', 'Purchased'], drop_first = True)
dataset

Unnamed: 0,Age,Salary,Country_Germany,Country_Spain,Purchased_Yes
0,44.0,72000.0,0,0,0
1,27.0,48000.0,0,1,1
2,30.0,54000.0,1,0,0
3,38.0,61000.0,0,1,0
4,40.0,63777.777778,1,0,1
5,35.0,58000.0,0,0,1
6,38.777778,52000.0,0,1,0
7,48.0,79000.0,0,0,1
8,50.0,83000.0,1,0,0
9,37.0,67000.0,0,0,1


## 3. Splitting the dataset into the Training set and Test set

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X = dataset.iloc[:,:-1]
X

Unnamed: 0,Age,Salary,Country_Germany,Country_Spain
0,44.0,72000.0,0,0
1,27.0,48000.0,0,1
2,30.0,54000.0,1,0
3,38.0,61000.0,0,1
4,40.0,63777.777778,1,0
5,35.0,58000.0,0,0
6,38.777778,52000.0,0,1
7,48.0,79000.0,0,0
8,50.0,83000.0,1,0
9,37.0,67000.0,0,0


In [9]:
y = dataset.iloc[:,-1]
y

0    0
1    1
2    0
3    0
4    1
5    1
6    0
7    1
8    0
9    1
Name: Purchased_Yes, dtype: uint8

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [11]:
X_train

Unnamed: 0,Age,Salary,Country_Germany,Country_Spain
4,40.0,63777.777778,1,0
9,37.0,67000.0,0,0
1,27.0,48000.0,0,1
6,38.777778,52000.0,0,1
7,48.0,79000.0,0,0
3,38.0,61000.0,0,1
0,44.0,72000.0,0,0
5,35.0,58000.0,0,0


In [12]:
X_test

Unnamed: 0,Age,Salary,Country_Germany,Country_Spain
2,30.0,54000.0,1,0
8,50.0,83000.0,1,0


In [13]:
y_train

4    1
9    1
1    1
6    0
7    1
3    0
0    0
5    1
Name: Purchased_Yes, dtype: uint8

In [14]:
y_test

2    0
8    0
Name: Purchased_Yes, dtype: uint8

## 4. Feature scaling

Observe that the __Age__ and __Salary__ columns are on very different scales. This can cause problems in a machine learning model, because many ML models (for example k-nearest neighbours and k-means) rely on the Euclidean distance between data points.

![](Euclidean-distance.PNG)

Assume Age -> x-axis  
and    Salary -> y-axis

Because Salary has a much wider range, the Euclidean distance will be dominated by Salary. Take two (Age, Salary) points: 1. (48, 79000) and 2. (27, 48000).

- Salary: (79000 - 48000)^2 = 961000000
- Age: (48 - 27)^2 = 441

So the Age difference is dominated by the Salary difference. In the ML equations it would be as if Age did not even exist. It is therefore VERY IMPORTANT to put the variables on the same scale.
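The arithmetic for the two (Age, Salary) points can be checked with a few lines of NumPy. This is a minimal sketch; the variable names are chosen only for this example.

```python
import numpy as np

# The two (Age, Salary) points from the example
p1 = np.array([48.0, 79000.0])
p2 = np.array([27.0, 48000.0])

squared_diffs = (p1 - p2) ** 2           # squared difference per feature
distance = np.sqrt(squared_diffs.sum())  # Euclidean distance

# Fraction of the squared distance that comes from Salary alone
salary_share = squared_diffs[1] / squared_diffs.sum()

print(squared_diffs)
print(distance)
print(salary_share)
```

The Salary term accounts for practically all of the squared distance, so Age contributes almost nothing until both features are scaled.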
    
Ways of scaling the data:
![](feature_scaling.PNG)


The scaling algorithm puts your data on one common scale. This is helpful when the values are spread over very different orders of magnitude. For example, the values of X may look like this:

X = [1, 4, 400, 10000, 100000]

The issue with such a wide spread is that the few large values dominate any distance-based calculation. Scaling brings all your values into one range, so that no feature dominates purely because of its units. Note that a linear rescaling changes the range of the values but not the shape of their distribution.

Here I will be using the [MinMax Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), which uses normalization.
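Min-max normalization rescales a feature as `(x - min) / (max - min)`, so the smallest value maps to 0 and the largest to 1. The sketch below applies that formula by hand to the example values of X using plain NumPy. It is meant to show the arithmetic, assuming the scaler's default `feature_range=(0, 1)`, and is not the library's actual implementation.

```python
import numpy as np

# The example values from the text, treated as a single feature
X = np.array([1, 4, 400, 10000, 100000], dtype=float)

# Manual min-max normalization: (x - min) / (max - min)
X_scaled = (X - X.min()) / (X.max() - X.min())

print(X_scaled)
```

The values now lie between 0 and 1, but their relative spacing is unchanged: most of them still sit very close to 0.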

In [15]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [16]:
X_train = scaler.fit_transform(X_train)
X_train

array([[0.61904762, 0.50896057, 1.        , 0.        ],
       [0.47619048, 0.61290323, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        ],
       [0.56084656, 0.12903226, 0.        , 1.        ],
       [1.        , 1.        , 0.        , 0.        ],
       [0.52380952, 0.41935484, 0.        , 1.        ],
       [0.80952381, 0.77419355, 0.        , 0.        ],
       [0.38095238, 0.32258065, 0.        , 0.        ]])

In [17]:
X_test = scaler.transform(X_test)
X_test

array([[0.14285714, 0.19354839, 1.        , 0.        ],
       [1.0952381 , 1.12903226, 1.        , 0.        ]])
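The second test row scales to values above 1 because the minimum and maximum were learned from the training rows only, and that row (Age 50, Salary 83000) lies outside the training range. The sketch below redoes the arithmetic with plain NumPy for the Age and Salary columns, reusing the numbers printed for `X_train` and `X_test`. It assumes the default `feature_range=(0, 1)`, and the variable names are chosen only for this example.

```python
import numpy as np

# Age and Salary of the training and test rows printed earlier
train = np.array([[40.0, 63777.777778], [37.0, 67000.0], [27.0, 48000.0],
                  [38.777778, 52000.0], [48.0, 79000.0], [38.0, 61000.0],
                  [44.0, 72000.0], [35.0, 58000.0]])
test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

# Minimum and range come from the training set only
train_min = train.min(axis=0)
train_range = train.max(axis=0) - train_min

# The test rows are scaled with the training statistics
test_scaled = (test - train_min) / train_range
print(test_scaled)
```

Reusing the training statistics on the test set keeps information about the test data out of the fitted scaler.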

In [18]:
y_train

4    1
9    1
1    1
6    0
7    1
3    0
0    0
5    1
Name: Purchased_Yes, dtype: uint8

In [19]:
y_test

2    0
8    0
Name: Purchased_Yes, dtype: uint8