# **Data Preprocessing Tools**
<hr>

- Techniques used to improve the quality of our data.
- They remove or minimize problems such as noise, outliers, incorrect values, inconsistencies, duplicates, and missing values.
- They make the data better suited to a given algorithm.


## **Importing the libraries**
<hr>

The **numpy** library allows us to work with fast arrays and matrices.

In [1]:
import numpy as np 

The **matplotlib** library allows us to plot very nice charts.

In [2]:
import matplotlib.pyplot as plt 

The **pandas** library allows us to import the dataset and create the matrix of features and the dependent variable vector.

In [3]:
import pandas as pd

## **Importing the dataset**
<hr>

In [4]:
dataset_path = "Data.csv"
df = pd.read_csv(dataset_path)
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [5]:
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [6]:
df.tail()

Unnamed: 0,Country,Age,Salary,Purchased
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [7]:
df.dtypes

Country       object
Age          float64
Salary       float64
Purchased     object
dtype: object

In [8]:
df.shape

(10, 4)

Any dataset used to train a machine learning model has the same two entities.
- **Matrix of Features:** The features are the columns (independent variables) that you use to predict the dependent variable vector.
    - **Note**: A good way to select the features used to train our model is to remove the variables that we are **sure** won't influence the model, such as a person's name, and then compute the correlations between all remaining attributes. If two attributes are highly correlated (a correlation above a chosen threshold such as 0.66 or 0.8), we can remove one of them, since that indicates redundant data. It can also happen that a single independent attribute is highly correlated with the dependent variable, so that this attribute alone predicts it well. Pay attention, though: correlation does not imply causation.


- **Dependent Variable Vector:** In this dataset, this variable is the last column, because the company would like to predict whether future clients will purchase or not, based on their information.
    - **Note**: In supervised learning we have to pay attention to the balance of the classes to predict. It is not good for the model to be trained on a dependent variable with 95% of one class and 5% of another; in that case we have to balance our data using dedicated techniques. This is why descriptive statistics are very important.
    
**Selecting the right data** to train our model is very important, and we have to know a lot about the data we are working with. Only when we give the machine the best experience, that is, the best data, can we expect solutions with the **minimum human bias**. If the data passed to model training is biased, our model will be biased too.
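The two notes above can be checked with a few lines of pandas. The sketch below rebuilds the ten rows shown earlier as an inline DataFrame (named `demo_df` here so it runs without `Data.csv`), then computes the correlation between the numeric columns and the share of each class. The 0.8 threshold is only one of the values mentioned in the note, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Inline copy of the ten rows displayed above (assumption: same values as Data.csv).
demo_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Pearson correlation between the numeric features (rows with NaN are skipped pairwise).
corr = demo_df[["Age", "Salary"]].corr()
age_salary_corr = corr.loc["Age", "Salary"]

# Fraction of observations in each class of the dependent variable.
class_share = demo_df["Purchased"].value_counts(normalize=True)

print(corr)
print(class_share)
if abs(age_salary_corr) > 0.8:
    print("Age and Salary are highly correlated; one of them may be redundant.")
```

With only ten rows these numbers are illustrative; on a real dataset the same two checks are a quick first look at redundancy and class balance.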

Creating our **Matrix of Features**:

In [9]:
x = df.iloc[:, :-1] #Slicing the dataframe by the indexes -> df.iloc[rows, columns]
x.head()

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,


In [10]:
x = x.values
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

Creating our **Dependent Variable Vector**:

In [11]:
y = df.iloc[:, -1:]
y.head()

Unnamed: 0,Purchased
0,No
1,Yes
2,No
3,No
4,Yes


In [12]:
y = y.values
y

array([['No'],
       ['Yes'],
       ['No'],
       ['No'],
       ['Yes'],
       ['Yes'],
       ['No'],
       ['Yes'],
       ['No'],
       ['Yes']], dtype=object)

<br>

## **Taking care of missing data**
<hr>

If we look at our df, we notice that one salary value and one age value are missing. Generally we don't want any missing data in our dataset, because it can cause errors when we train our ML model, so we must handle it. There are several ways to do so.
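Instead of spotting the gaps by eye, we can ask pandas to count them. This is a small sketch on an inline copy of the two numeric columns (the name `demo_df` is only for illustration):

```python
import numpy as np
import pandas as pd

# Inline copy of the Age and Salary columns shown above.
demo_df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Number of missing values in each column.
missing_per_column = demo_df.isna().sum()

# The rows that contain at least one missing value.
rows_with_missing = demo_df[demo_df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```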

**Ways to handle missing data**:

1. Just ignore the observation by deleting it. This is a good option if we have a large dataset and only about 1% of missing data, since deleting 1% of a large dataset won't change the learning quality of our model much.

2. Replace the missing value with the average of all values in the column where the data is missing. This is a classic way of handling missing data.
    - **Note:** It is a good strategy but not always the best one. For example, we shouldn't fill in a baby's missing weight with the mean weight of a dataset that consists mostly of adults. So it **depends on your business problem, on the way your data is distributed and on the number of missing values**. If, for example, you have a lot of missing values, mean substitution is not the best choice. Other strategies include "median" imputation, "most frequent" imputation, and prediction imputation. Prediction imputation is another recommended strategy that we haven't covered yet; once we have, the k-NN algorithm is a good choice for it, although it could also add inaccuracy to our model.
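The alternatives listed above can be sketched side by side. The example below uses an inline copy of the numeric columns and shows deletion with `dropna`, median imputation with `SimpleImputer`, and, as one possible form of prediction imputation, scikit-learn's `KNNImputer`. The choice of `n_neighbors=2` is an arbitrary assumption for this tiny dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Inline copy of the Age and Salary columns shown above.
numeric = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Option 1: delete every observation that has a missing value.
dropped = numeric.dropna()

# Option 2 (variant): fill each gap with the column median instead of the mean.
median_filled = SimpleImputer(strategy="median").fit_transform(numeric)

# Prediction-style imputation: fill each gap from the 2 most similar rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(numeric)

print(dropped.shape)
print(median_filled[[4, 6]])
print(knn_filled[[4, 6]])
```

On ten rows, deleting two observations removes 20% of the data, which is why the notebook imputes instead.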

The scikit-learn library has a tool called SimpleImputer that easily replaces missing values with the mean, the median, or the most frequent value of the column. In this case, we are going to replace them with the mean:

In [13]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")  # strategy can be "mean", "median" or "most_frequent"

The **fit** method finds the missing values and computes the mean of each column. With the "mean" strategy, we must pass only the numerical columns as the argument.

In [14]:
imputer.fit(x[:, 1:3])

SimpleImputer()

The **transform** method applies the transformation to our data and returns the result. We must pass the same columns that we passed to the fit method.

In [15]:
x[:, 1:3] = imputer.transform(x[:, 1:3])

In [16]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


So, the **fit** part extracts some information from the data it is applied to (here, SimpleImputer spots the missing values and computes the mean of each column). Then, the **transform** part applies a transformation (here, SimpleImputer replaces each missing value with that mean).
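To make the fit/transform split concrete, the sketch below fits a `SimpleImputer` on an inline copy of the Age and Salary columns and inspects its `statistics_` attribute, which holds what `fit` learned. The array name `numeric` is only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Inline copy of the Age and Salary columns, with the two gaps as NaN.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit only learns the per-column means; the data is not modified yet.
imputer.fit(numeric)
learned_means = imputer.statistics_

# transform uses the learned means to fill the gaps and returns a new array.
filled = imputer.transform(numeric)

print(learned_means)    # column means, matching the imputed values printed above
print(filled[[4, 6]])   # the two rows that had missing values
```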

## **Encoding categorical data**
<hr>

Generally, our dataset can contain categorical columns. In the dataset we are working on, the column *Country* has categories such as "France", "Spain", and "Germany". This type of data makes it difficult for an ML model to compute correlations between the features and the dependent variable.

To solve this, we shouldn't simply encode the categories as numbers, like "France" = 0, "Spain" = 1, and "Germany" = 2, because numbers have an order. The model could interpret that order as meaningful, which is wrong in this case. We can do much better than encoding these three values as zero, one, and two.

To do better we will use *one-hot encoding* (OneHotEncoder in scikit-learn), which turns the country column into three new columns, one per distinct category. OneHotEncoder sorts the categories, so in the encoded output further down the first column equals one when the country is "France", the second when it is "Germany", and the third when it is "Spain".

Now, let's talk about the dependent variable. Here the dependent variable is binary: it can only be "Yes" or "No", so we can replace it with zero and one. Doing this to a binary dependent variable won't compromise the future accuracy of the model.

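**Note:** In other words, if our categorical variable is ordered or binary, we can encode it as integers (LabelEncoder for the dependent variable; scikit-learn also provides OrdinalEncoder for ordered features). If not, we use one-hot encoding.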
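As an illustration of the ordered case, here is a minimal sketch with a hypothetical `sizes` feature that is not part of this dataset. It uses scikit-learn's `OrdinalEncoder` with an explicit category order, so that the integers follow the real ordering rather than alphabetical order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature (not in Data.csv): small < medium < large.
sizes = np.array([["medium"], ["small"], ["large"], ["small"]])

# Passing the categories explicitly fixes the order used for the codes.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes).ravel()

print(size_codes)  # small -> 0.0, medium -> 1.0, large -> 2.0
```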

### **Encoding the Independent Variable**

In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# transformers=[(name, transformer, columns)], remainder="drop" or "passthrough"
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough")

The **fit_transform** method receives the data to be encoded and returns it encoded. We then convert the result to a NumPy array. An encoded variable is also called a dummy variable.

In [18]:
x = np.array(ct.fit_transform(x))
x

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
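To check which output column corresponds to which country, we can look at the fitted encoder's `categories_` attribute. The sketch below uses a standalone `OneHotEncoder` on a few country values; the variable names are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A few values of the Country column, as a 2D array with one column.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it a dense array.
encoded = encoder.fit_transform(countries).toarray()

# The categories are sorted, and each one becomes an output column in this order.
column_order = list(encoder.categories_[0])

print(column_order)
print(encoded)
```

Here "Spain" ends up in the third column, which matches the `[0.0, 0.0, 1.0, ...]` rows in the output above.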

### **Encoding the Dependent Variable**

In [19]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  # transforms the category strings into integers
# ravel converts the array from shape (n, 1) to (n,)
y = le.fit_transform(y.ravel())  # passing the dependent variable as a NumPy array to our future models is not mandatory
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
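To see which integer stands for which label, and to map predictions back to the original strings later, the fitted encoder offers `classes_` and `inverse_transform`. A short sketch on a few labels (variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# The position of a label in classes_ is its integer code.
class_order = list(encoder.classes_)

# inverse_transform maps the integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)

print(class_order)  # "No" is encoded as 0 and "Yes" as 1
print(codes)
print(decoded)
```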

## **Splitting the dataset into the Training set and Test set**
<hr>

**Splitting the dataset into the Training set and Test set** consists of making two separate sets: a training set, on which we train our ML model using existing observations, and a test set, on which we evaluate the performance of our model on new observations. Here, the new observations play the role of the future observations that we will get once we deploy our ML model.

**Feature Scaling** simply consists of scaling all our features so that they all take values on the same scale. We do this to prevent one feature from dominating another, which would then be neglected by the ML model.

For this reason, we have to **split the dataset before applying feature scaling**: we are preparing our data for the model, and *new observations must be completely new to the model*. If we scale the features before the split, the scaling statistics (mean and standard deviation) are computed using the test observations too, so the model is influenced by them. This is called **information leakage** or **data leakage**, and when we evaluate our model, the new data won't be as new as we thought.

**Data leakage** in machine learning happens when the data used to train a machine learning algorithm contains information about what the model is trying to predict that would not be available at prediction time. It leads to overly optimistic evaluation results and to unreliable predictions after model deployment.
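The difference between the leaky and the clean order of operations can be sketched on a single column. The example below uses the Age values (with the missing one already replaced by the column mean) and compares a scaler fitted on all rows with a scaler fitted on the training rows only. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Age column after mean imputation, as a 2D array with one feature.
ages = np.array([44, 27, 30, 38, 40, 35, 349 / 9, 48, 50, 37]).reshape(-1, 1)

ages_train, ages_test = train_test_split(ages, test_size=0.2, random_state=1)

# Leaky: the scaler sees every row, including the future test rows.
leaky_scaler = StandardScaler().fit(ages)
leaky_mean = leaky_scaler.mean_[0]

# Clean: the scaler is fitted on the training rows only.
clean_scaler = StandardScaler().fit(ages_train)
clean_mean = clean_scaler.mean_[0]

# The test rows are only transformed, never used for fitting.
ages_test_scaled = clean_scaler.transform(ages_test)

print(leaky_mean, clean_mean)  # the two means generally differ
```

Whenever the two means differ, the leaky scaler has absorbed information from the test rows.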

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=1/5, random_state=1)

It is recommended to set aside between 1/5 and 1/3 of our data to test the ML model, depending on the size of the dataset. The data is split randomly, and fixing the random state guarantees that we get the same split every time. **Pay attention**: it can happen that a training sample does not represent the problem that we want to model.
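Both points can be sketched quickly: the same `random_state` reproduces the same split, and the `stratify` argument of `train_test_split` is one way to keep the class proportions similar in both sets. The `features` array below is a placeholder, not the real matrix of features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features (10 rows, 2 columns) and the encoded target from above.
features = np.arange(20).reshape(10, 2)
target = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Two calls with the same random_state give identical splits.
split_a = train_test_split(features, target, test_size=0.2, random_state=1)
split_b = train_test_split(features, target, test_size=0.2, random_state=1)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# stratify=target keeps the class proportions of the target in both sets.
_, _, _, y_test_stratified = train_test_split(
    features, target, test_size=0.2, random_state=1, stratify=target
)

print(same_split)
print(y_test_stratified)  # one observation of each class
```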

In [21]:
X_train

array([[0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [22]:
y_train

array([0, 1, 0, 0, 1, 1, 0, 1])

In [23]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [24]:
y_test

array([0, 1])

## **Feature Scaling**
<hr>

Feature Scaling allows us to put all our features on the same scale. We need to do this to prevent one feature from dominating another, which would then be neglected by the ML model. We also need to be aware that feature scaling is not needed for all ML models, only for some of them. Generally, we apply this technique when our features take very different ranges of values.

How do we do feature scaling?

The main two feature scaling techniques are **Standardisation** and **Normalisation**.

- **Standardisation**
    - After it is applied, each feature has mean 0 and standard deviation 1, so most values fall roughly between -3 and +3;
    - It works well in most situations, so it is a safe default;

$$
x_{stand} = \frac{x - \text{mean}(x)}{\text{standard deviation}(x)}
$$

<br>

- **Normalisation**
    - The numerator and the denominator are both non-negative, and the denominator is always greater than or equal to the numerator, so the features take values between 0 and 1;
    - It is recommended when most of our features follow a normal distribution;

$$
x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}
$$
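The two formulas can be applied by hand with NumPy. This is a small sketch on a few salary values; note that `np.std` uses the population standard deviation by default, which is also what `StandardScaler` uses.

```python
import numpy as np

salaries = np.array([48000.0, 52000.0, 61000.0, 79000.0, 83000.0])

# Standardisation: subtract the mean, divide by the standard deviation.
standardised = (salaries - salaries.mean()) / salaries.std()

# Normalisation: subtract the minimum, divide by the range.
normalised = (salaries - salaries.min()) / (salaries.max() - salaries.min())

print(standardised)  # mean 0 and standard deviation 1
print(normalised)    # smallest value becomes 0, largest becomes 1
```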

Let's code it!

In [25]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

**Do we have to apply feature scaling to dummy variables?**

No. Feature scaling aims to put all features in the same range. Standardisation brings most values to roughly between -3 and 3, and the dummy variables are 0 and 1, so they are already in that range. In fact, standardisation would only make things worse: it would turn the zeros and ones into other values, and we would lose the interpretation of the variables. In this dataset, if we applied feature scaling to the dummy variables, we could no longer read off which country an observation corresponds to. We do have to apply feature scaling to the age and salary columns, since they span a wide range of values. Even so, despite the loss of interpretability, in some cases applying feature scaling to dummy variables can improve the accuracy of our model.
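To see what "losing the interpretation" means, the sketch below standardises a single hypothetical dummy column in which one observation out of four is "France":

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical "France" dummy column: 1 for the first row, 0 for the others.
france_dummy = np.array([[1.0], [0.0], [0.0], [0.0]])

scaled_dummy = StandardScaler().fit_transform(france_dummy).ravel()

# The 1 becomes about 1.732 and each 0 becomes about -0.577.
print(scaled_dummy)
```

The values still separate the two groups, but they can no longer be read directly as "is France" and "is not France".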

In [26]:
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_train

array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],
       [0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
       [1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
       [0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
       [0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
       [1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
       [0.0, 1.0, 0.0, 1.4379472069688968, 1.5749910381638885],
       [1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
      dtype=object)

For the test data, we have to scale the features with the same standardisation parameters (mean and standard deviation) that were fitted on the training data. To do this, we just call the **transform** method of the object that was fitted on our training data. When we later feed new data to our model, possibly a single observation with a different range of values, we have to put it on the same scale as our training data, but without fitting again, so that the model is not disturbed.

In [27]:
X_test[:, 3:] = sc.transform(X_test[:, 3:])
X_test

array([[0.0, 1.0, 0.0, -1.4661817944830124, -0.9069571034860727],
       [1.0, 0.0, 0.0, -0.44973664397484414, 0.2056403393225306]],
      dtype=object)