# Data Preprocessing Tools

## Importing the libraries

In [17]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [18]:
dataset = pd.read_csv('Data.csv')
print(type(dataset))
print(dataset)
X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column (Purchased)

<class 'pandas.core.frame.DataFrame'>
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


### 🔴 NOTE
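- `dataset` is a pandas `DataFrame`, as the `type(dataset)` output above confirms; `X` and `y` are NumPy arrays obtained from it with `.values`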
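
As a minimal sketch of how the positional slicing separates features from the target, the snippet below builds a tiny stand-in frame. The three rows are copied from the printed output above; the names `demo`, `X_demo` and `y_demo` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for Data.csv, copied from the printed output.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, last column only

print(X_demo.shape)  # (rows, feature columns)
print(y_demo)
```

Because `X_demo` mixes strings and floats, NumPy typically stores it as an `object` array, which is why the printed `X` shows quoted country names next to plain numbers.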

In [19]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [20]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [21]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
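imputer.fit(X[:, 1:3])  # learn the column means of Age and Salary (NaN cells are ignored)
X[:, 1:3] = imputer.transform(X[:, 1:3])  # replace each NaN with its column mean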

### 🔴 NOTE
- If cells are missing across several features, deleting the affected records is acceptable only when those records make up a small percentage of the whole dataset
- If too many records would have to be deleted, a classic way to handle missing cells in a `numerical feature` is to fill them with the column's `mean` or `median`
- There are many other ways to handle missing values; for a `categorical feature`, for example, the empty cells can be replaced with the `most frequent value`
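
The three options in this note could be sketched as follows. The `toy` frame and all variable names are hypothetical and chosen only for illustration; the notebook itself uses the `mean` strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one gap per column.
toy = pd.DataFrame({
    "Country": ["France", "Spain", np.nan, "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
})

# Option 1: drop every row that has at least one missing cell.
dropped = toy.dropna()

# Option 2: fill numeric gaps with the column median.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
age_filled = median_imputer.fit_transform(toy[["Age"]])

# Option 3: fill categorical gaps with the most frequent value.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
country_filled = mode_imputer.fit_transform(toy[["Country"]].astype(object))

print(len(dropped))             # rows that survive dropping
print(age_filled.ravel())       # Age with the gap filled
print(country_filled.ravel())   # Country with the gap filled
```

Dropping halves this toy dataset, which illustrates why deletion only makes sense when few records are affected.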

In [22]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

### Encoding the Independent Variable

In [23]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))  # Country becomes 0/1 columns placed first; Age and Salary pass through

### 🔴 NOTE
- `Country` is an `independent categorical feature`, while `Age` & `Salary` are `independent numerical features`. Because models work on numbers, the obvious first idea is to encode the categories as integers, such as `0 = France`, `1 = Spain` & `2 = Germany`
- But this lets the model infer an order or distance between the encoded `Country` values (for example, that `Germany` is "greater than" `Spain`) that does not exist in the data
- Integer encoding is fine when the categories have a natural order, like

| Category | Encoded Value |
|----------|----------------|
| Low      | 0              |
| Medium   | 1              |
| High     | 2              |

- But when the categories have no natural order, this simple encoding can mislead the model and reduce its performance, as in

| Category | Encoded Value |
|----------|----------------|
| red      | 0              |
| green    | 1              |
| blue     | 2              |

- So instead we apply `One Hot Encoding` to this categorical feature: it creates one new column per category and marks each row with `1` in the column of its own category and `0` in all the others
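
The difference between the two encodings can be sketched on small hypothetical arrays. `OrdinalEncoder` is used here only to illustrate the ordered case and is not part of the notebook; the variable names are illustrative as well.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered categories: integer codes are meaningful when the order is given explicitly.
sizes = np.array([["Low"], ["High"], ["Medium"]])
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
size_codes = ordinal.fit_transform(sizes)

# Unordered categories: one 0/1 column per category instead of a single integer.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])
encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense for printing.
one_hot = encoder.fit_transform(countries).toarray()

print(size_codes.ravel())
print(encoder.categories_[0])  # categories are stored in sorted order
print(one_hot)
```

`OneHotEncoder` sorts the categories, so the columns come out as France, Germany, Spain, which matches the column order in the printed `X` of this notebook.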

### 🔴 NOTE
- For the `dependent column` (the target), simple label encoding is fine
- Classical `ML` models do not need `One Hot Encoding` for the `dependent column`; `DL` models may need it, for example when a multi-class network expects one output column per class

In [25]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [26]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [27]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
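
To check which label maps to which integer, the fitted encoder can be inspected. This is a minimal sketch on the first five values of `y`; `le_demo`, `labels` and `encoded` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])  # first five values of y above

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(le_demo.classes_)                    # sorted unique labels; position = encoded value
print(encoded)
print(le_demo.inverse_transform(encoded))  # map the integers back to the original labels
```

`classes_` lists the labels in sorted order, so `No` becomes `0` and `Yes` becomes `1`, consistent with the output above.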


## Splitting the dataset into the Training set and Test set

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [29]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [30]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [31]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [32]:
print(y_test)

[0 1]


## Feature Scaling

### 🔴 NOTE
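- `Feature scaling` must always be applied after splitting the data into train and test sets, so that the scaling parameters are learned from the training data only and no information leaks from the test set
- Not all models require feature scaling. A `Multiple Linear Regression` model, for example, does not need it even when one feature dominates the others, because the learned coefficients compensate for the different scales
- `Normalization` rescales each feature to `[0, 1]` using its minimum and maximum; these notes recommend it when most features follow a roughly normal distribution
- `Standardization` rescales each feature to mean `0` and standard deviation `1`; it works well in almost all cases, so it is the method used here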

In [33]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

### 🔴 NOTE
- `fit_transform` on `X_train`:
    - `fit()` calculates the parameters needed for scaling (like mean and standard deviation for StandardScaler).
    - `transform()` applies the scaling using those parameters.
    - `fit_transform()` does both in one call: it learns the scaling parameters from the training data and then applies them.

- `transform` on `X_test`:
    - The test set should be transformed using the same parameters learned from the training set.
    - You do not fit again on the test set because it must simulate new, unseen data and should not influence your scaling parameters.
    - So, you use only `transform()` to apply the existing scaling parameters from training.
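
The relationship between the two calls can be verified directly. The sketch below uses hypothetical `train` and `test` arrays and shows that `transform()` simply reuses the statistics stored during fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[27.0], [38.0], [44.0], [48.0]])  # hypothetical training ages
test = np.array([[30.0], [37.0]])                    # hypothetical unseen ages

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test)        # reuses those parameters, no refit

# transform() is equivalent to applying the stored training statistics by hand.
manual = (test - scaler.mean_) / scaler.scale_

print(scaler.mean_, scaler.scale_)
print(np.allclose(test_scaled, manual))
```

Calling `fit_transform` on `test` instead would replace `mean_` and `scale_` with test-set statistics, which is exactly the leakage this note warns against.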

### 🔴 NOTE
- `Feature scaling` does not need to be applied to the first three columns: they hold only `0` and `1`, so they already lie within `[-3, +3]`, the range in which most standardized values fall
- There is a further reason not to scale these columns: scaling would replace the `0` and `1` values with arbitrary numbers, and the columns would no longer read as one-hot indicators for each country
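
What scaling would do to an encoded column can be seen on a hypothetical indicator column (`france` is an illustrative name, not a variable from the notebook).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-hot column for "France": 1.0 where the row is France, else 0.0.
france = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])

scaled = StandardScaler().fit_transform(france)

# Still only two distinct values, but neither of them is 0 or 1 any more.
print(np.unique(np.round(scaled, 6)))
```

The column still separates France from the other countries, but the values can no longer be read as a plain yes/no flag.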

In [34]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [36]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
