![](images/ml_banner.jpg)

## Part 1: Data Preprocessing

* Get the dataset.
* Importing the libraries.
* Importing the dataset.
* Missing Data.
* Categorical Data.
* Splitting the Dataset into the Training and Testing set.
* Feature Scaling.
* Data Processing Template.

### Get the dataset.
Get the course datasets from [here](https://github.com/johnmaged/Learning/tree/master/ML_python/datasets).

If you open the dataset data.csv, it will look like the figure below.
![alt text](images/data.csv.jpg "Data.csv")

Notes about the above data:

1. It contains 4 columns.
2. It has 10 observations.
3. *Country*, *Age*, and *Salary* are the **Independent Variables**, while *Purchased* is the **Dependent Variable**.



### Importing the libraries

In [1]:
# Importing the libraries
import numpy as np                  ## Mathematical operations
import matplotlib.pyplot as plt     ## visualizations
import pandas as pd                 ## importing and managing datasets

### Importing the dataset

In [2]:
# Remember to set the current working directory so that this relative path resolves
dataset = pd.read_csv('datasets/data.csv')

In [3]:
dataset

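|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |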


Now we will create the following:

* Matrix of Features (X)
* Dependent Variable Vector (y)

In [4]:
X = dataset.iloc[:,:-1].values  # all rows; :-1 selects every column except the last one.

In [5]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [6]:
y = dataset.iloc[:, 3].values  # column index 3 (Purchased) is the dependent variable.

In [7]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)

### Missing Data

As you can see below, some values are missing:

In [8]:
dataset

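|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |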


You could delete the rows that have missing values, but those rows may carry information that matters to our model. A common alternative is to replace each missing value with the mean of its column. Let us take a look at how to do so.
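To make the cost of deleting rows concrete, here is a minimal sketch that rebuilds the ten rows shown above as an inline DataFrame, counts the missing values per column, and checks how many rows would survive a deletion. The name `demo` and the inline data are only for illustration so that the snippet runs without data.csv; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Inline copy of the data shown above (illustrative, not read from data.csv).
demo = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
})

missing_per_column = demo.isna().sum()   # number of NaN values in each column
rows_after_drop = len(demo.dropna())     # rows left if incomplete rows are deleted

print(missing_per_column)
print(rows_after_drop, 'of', len(demo), 'rows remain after dropping')
```

With only ten observations, deleting the incomplete rows would discard a fifth of the data, which is why filling in a value is attractive here.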

In [9]:
from sklearn.preprocessing import Imputer  # Imputer takes care of the missing data
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis=0) # missing values appear as NaN in our dataset above
                                                             # strategy can be 'mean', 'median', or 'most_frequent'
                                                             # axis = 0 imputes along columns, axis = 1 along rows
imputer.fit(X[:,1:3]) # Select the columns that have missing values (Age and Salary).
                      # Note that the upper bound of the slice is excluded while the lower bound is included.
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [10]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
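
The `Imputer` class used above was removed from `sklearn.preprocessing` in scikit-learn 0.22. On newer versions, a roughly equivalent sketch uses `SimpleImputer` from `sklearn.impute`, which always works column-wise and therefore takes no `axis` argument. The array `X_demo` below is an illustrative copy of the Age and Salary columns, not a variable from the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative copy of the Age and Salary columns, including the two gaps.
X_demo = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# Replace every NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imputer.fit_transform(X_demo)

print(imputer.statistics_)       # the per-column means that were used as fill values
print(X_filled[4], X_filled[6])  # the two rows that originally had gaps
```

The fill values are the means of the nine known entries in each column, 349 / 9 for Age and 574000 / 9 for Salary, which matches the output shown above.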

**Congratulations! The missing data are handled :)**

### Categorical Data

We have two categorical variables, **Country** and **Purchased**. Machine learning algorithms work on numbers rather than category labels, so we have to find a way to **encode** these categorical variables.

In [11]:
from sklearn.preprocessing import LabelEncoder 
labelencoder_X = LabelEncoder()  # Handling the first categorical variable - Country
X[:,0] = labelencoder_X.fit_transform(X[:,0])

In [12]:
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

If you look at the new X above, you will notice that:

* **France** encoded as **0**
* **Germany** encoded as **1**
* **Spain** encoded as **2**

But this may cause a big problem! An algorithm may read these numbers as an order or a ranking (for example, Spain > Germany > France), which does not exist between countries and can heavily distort the resulting model.

To solve this problem, we will use **Dummy Variables**, as follows:
The **Country** column will be split into **3 columns** (one per category in **Country**: France, Germany, and Spain). Together, these 3 columns tell us whether the country is France, Germany, or Spain.

Example: 

* **France** could be encoded like **1 0 0**
* **Germany** could be encoded like **0 1 0**
* **Spain** could be encoded like **0 0 1**

Let us do that:

In [13]:
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features=[0]) # 0 here is for the first column.
X = onehotencoder.fit_transform(X).toarray()

In [14]:
X

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.40000000e+01,   7.20000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          2.70000000e+01,   4.80000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.80000000e+01,   6.10000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          4.00000000e+01,   6.37777778e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.50000000e+01,   5.80000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.87777778e+01,   5.20000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.80000000e+01,   7.90000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.70000000e+01,   6.70000000e+04]])
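
The `categorical_features` argument of `OneHotEncoder` was removed in scikit-learn 0.22. On newer versions, one possible sketch of the same step wraps the encoder in a `ColumnTransformer`, which also accepts the country strings directly, so the `LabelEncoder` pass on X is not needed. The three-row `X_demo` array and the transformer name `'country'` are illustrative choices, not part of the notebook.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative three-row sample with the same layout as X: Country, Age, Salary.
X_demo = np.array([
    ['France', 44.0, 72000.0],
    ['Spain', 27.0, 48000.0],
    ['Germany', 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder='passthrough',                           # keep Age and Salary unchanged
    sparse_threshold=0,                                # always return a dense array
)
X_encoded = ct.fit_transform(X_demo).astype(float)

print(X_encoded)
```

The categories are sorted alphabetically, so the three dummy columns stand for France, Germany, and Spain in that order, followed by Age and Salary.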

Now we will handle the **Purchased** column using only the **LabelEncoder** class. It is the dependent variable and has just two categories, so dummy variables are not needed.

In [15]:
labelencoder_y = LabelEncoder()  # Handling categorical variable - Purchased
y = labelencoder_y.fit_transform(y)

In [16]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)
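
To check which label received which number, the fitted encoder can be inspected. The sketch below is self-contained and uses an illustrative copy of the labels (`y_demo`) rather than the notebook's own `y`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative copy of the Purchased column.
y_demo = np.array(['No', 'Yes', 'No', 'No', 'Yes',
                   'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_demo)

# classes_ lists the labels in sorted order; the position is the integer code.
print(encoder.classes_)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(y_encoded)
print(decoded)
```

Here **No** maps to **0** and **Yes** maps to **1**, because `LabelEncoder` assigns codes in sorted label order.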

### Splitting the Dataset into the Training set and Test set

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
                                                # test_size = 0.2 holds out 20% of the observations for the test set
                                                # random_state = 0 fixes the seed of the random shuffling so the split is reproducible
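
As a small illustration of these two arguments, the sketch below splits a made-up array with the same shape as our X (10 observations, 5 features) and repeats the split with the same seed. The names `X_demo` and `y_demo` and their contents are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: row i of X_demo starts at 5 * i, and y_demo[i] == i.
X_demo = np.arange(50).reshape(10, 5)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Repeating the call with the same random_state yields the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)        # 8 training rows and 2 test rows
print(np.array_equal(X_te, X_te2))   # same rows end up in the test set
```

Each row of X stays paired with its entry in y after shuffling, which is why both arrays are passed to the same call.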

### Feature Scaling

![](images/EuclideanDistanceGraphic.jpg "Euclidean Distance")

If we apply this equation to the points in our dataset, we will see that **Salary** (tens of thousands) dwarfs **Age** (tens), so the distance is dominated almost entirely by the salary difference. We therefore have to bring these features onto a comparable scale. This is called **Feature Scaling**.
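
As a worked example, assuming the figure shows the usual two-dimensional Euclidean distance, take two observations from the dataset: (Age 27, Salary 48000) and (Age 48, Salary 79000). The variable names below are illustrative.

```python
import math

# Two observations from the dataset: (Age, Salary).
age_a, salary_a = 27.0, 48000.0
age_b, salary_b = 48.0, 79000.0

age_term = (age_b - age_a) ** 2              # 21^2 = 441
salary_term = (salary_b - salary_a) ** 2     # 31000^2 = 961,000,000

distance = math.sqrt(age_term + salary_term)  # distance using both features
salary_only = math.sqrt(salary_term)          # distance if Age is ignored entirely

print(distance, salary_only)
```

The two results differ by less than 0.01, so an age gap of 21 years has practically no influence on the distance until the features are scaled.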

![](images/feature_scaling_two_ways.jpg "Two ways of Feature Scaling")

We will go with the first option, **Standardisation**.

Note: Do we need to scale the dummy variables? It depends on the context.

There is no need to apply feature scaling to our dependent variable here: it is a categorical variable that only takes the values 0 and 1.

In [18]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
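X_train = sc_X.fit_transform(X_train)  # fit the scaler on the training set, then transform it
X_test = sc_X.transform(X_test)        # reuse the training-set mean and standard deviation on the test set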

In [19]:
X_train

array([[-1.        ,  2.64575131, -0.77459667,  0.26306757,  0.12381479],
       [ 1.        , -0.37796447, -0.77459667, -0.25350148,  0.46175632],
       [-1.        , -0.37796447,  1.29099445, -1.97539832, -1.53093341],
       [-1.        , -0.37796447,  1.29099445,  0.05261351, -1.11141978],
       [ 1.        , -0.37796447, -0.77459667,  1.64058505,  1.7202972 ],
       [-1.        , -0.37796447,  1.29099445, -0.0813118 , -0.16751412],
       [ 1.        , -0.37796447, -0.77459667,  0.95182631,  0.98614835],
       [ 1.        , -0.37796447, -0.77459667, -0.59788085, -0.48214934]])

In [20]:
X_test

array([[-1.        ,  2.64575131, -0.77459667, -1.45882927, -0.90166297],
       [-1.        ,  2.64575131, -0.77459667,  1.98496442,  2.13981082]])
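
To see what `StandardScaler` computes, the sketch below reproduces it by hand using the usual definition of standardisation, (x - mean) / standard deviation, with both statistics taken from the training data only. The small `train` and `test` arrays reuse a few Age and Salary values for illustration; they are not the split produced above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative Age and Salary values (not the notebook's actual split).
train = np.array([[44.0, 72000.0], [27.0, 48000.0],
                  [30.0, 54000.0], [38.0, 61000.0]])
test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # applies the same mean and std

# Manual standardisation with the training statistics.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

After scaling, each training column has mean 0 and standard deviation 1. The test set generally does not, because it is transformed with the training statistics rather than its own.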