# <font color="gold"><b>Feature Transformation and Scaling</b></font>


# <font color="teal"><b>Table of Contents</b></font>
- [Why](#why_feature_trans) do we need Feature Transformation and Scaling?
- [MinMax Scaler](#MinMax_Scaler)
- [Standard Scaler](#Standard_Scaler)
- [MaxAbsScaler](#MaxAbsScaler)
- [Robust Scaler](#Robust_Scaler)
- [Quantile Transformer Scaler](#Quantile_Transformer_Scaler)
- [Log Transformation](#Log_Transformation)
- [Power Transformer Scaler](#Power_Transformer_Scaler)
- [Unit Vector Scaler/Normalizer](#Unit_Vector_Scaler_Normalizer)

Source: [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/)

<a id="why_feature_trans"></a>
# <font color="teal"><b>Why do we need Feature Transformation and Scaling?</b></font>

1) Too few features and your model won’t have much to learn from <br>
2) Too many features and we might be feeding unnecessary information to the model <br>
3) The values in each of the features need to be considered as well <br>

`Different columns in a dataset often have different units`
- for example, one column can be in kilograms while another is in centimeters <br>

`Value ranges can differ widely too: an income column can range from 20,000 to 100,000 or more, while an age column ranges from 0 to about 100. Income values are thus roughly 1,000 times larger than age values.`
> When we feed these features to the model as is, there is every chance that the income will influence the result more due to its larger value <br>
> But this doesn’t necessarily mean it is more important as a predictor <br>
> So, <font color="green">to give importance to both Age and Income, feature scaling needs to be applied</font><br>
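To make this concrete, here is a minimal sketch, assuming a distance-based model and two made-up people (the numbers are illustrative and not taken from the dataset below). It measures how much of the Euclidean distance between them comes from income alone when the features are left unscaled.

```python
import math

# Two hypothetical people as (income, age); values are made up for illustration
person_a = (20000, 25)
person_b = (21000, 65)

income_gap = person_a[0] - person_b[0]  # differs by 1,000
age_gap = person_a[1] - person_b[1]     # differs by 40 years

# Euclidean distance on the raw, unscaled features
distance = math.sqrt(income_gap ** 2 + age_gap ** 2)

# Share of the squared distance that comes from income alone
income_share = income_gap ** 2 / (income_gap ** 2 + age_gap ** 2)

print(f"distance = {distance:.2f}")
print(f"income share of squared distance = {income_share:.4f}")
```

A 40-year age gap barely registers next to a modest income gap, which is the imbalance that scaling is meant to remove.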


In [6]:
# Working DF
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


df = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 42, 51],
    'Department': ['HR','Legal','Marketing','Management']
})
df

Unnamed: 0,Income,Age,Department
0,15000,25,HR
1,1800,18,Legal
2,120000,42,Marketing
3,10000,51,Management


> Since non-numeric values cannot be scaled, <br>
create a copy of the dataframe, store the names of the numerical features in a list, and select their values

In [2]:
df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]
features

Unnamed: 0,Income,Age
0,15000,25
1,1800,18
2,120000,42
3,10000,51


<a id="MinMax_Scaler"></a>
# <font color="gold"><b>MinMax Scaler</b></font>

The MinMax scaler is one of the simplest scalers to understand. By default, it rescales each feature to the range [0, 1] <br>
The formula for calculating the scaled value is: <br>
> $x_{scaled} = (x - x_{min})/(x_{max} - x_{min})$

- We will first need to import it
```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
```

- Apply it on only the values of the features:

```python
df_scaled[col_names] = scaler.fit_transform(features.values)
```

In [7]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Unnamed: 0,Income,Age,Department
0,0.111675,0.212121,HR
1,0.0,0.0,Legal
2,1.0,0.727273,Marketing
3,0.069374,1.0,Management
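The MinMax formula can be checked by hand against the output above. This is a minimal sketch in plain Python; `min_max_scale` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
# Same Income and Age values as the working dataframe
income = [15000, 1800, 120000, 10000]
age = [25, 18, 42, 51]


def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


income_scaled = min_max_scale(income)
age_scaled = min_max_scale(age)

print([round(v, 6) for v in income_scaled])
print([round(v, 6) for v in age_scaled])
```

The smallest value in each column maps to 0 and the largest to 1, matching the scaled dataframe.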


In [8]:
# Let us take the range to be (5, 10)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(5, 10))

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Unnamed: 0,Income,Age,Department
0,5.558376,6.060606,HR
1,5.0,5.0,Legal
2,10.0,8.636364,Marketing
3,5.34687,10.0,Management
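A custom `feature_range=(a, b)` stretches the [0, 1] result to [a, b], i.e. $x_{scaled} = a + (b - a)(x - x_{min})/(x_{max} - x_{min})$. The sketch below reproduces the Age column above by hand; `min_max_scale_to_range` is a hypothetical helper name used only for this illustration.

```python
def min_max_scale_to_range(values, new_min, new_max):
    """Rescale values to [new_min, new_max] via the [0, 1] MinMax result."""
    lo, hi = min(values), max(values)
    return [new_min + (x - lo) / (hi - lo) * (new_max - new_min) for x in values]


# Same Age values as the working dataframe, rescaled to the range (5, 10)
age = [25, 18, 42, 51]
age_5_10 = min_max_scale_to_range(age, 5, 10)

print([round(v, 6) for v in age_5_10])
```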


<a id="Standard_Scaler"></a>
# <font color="gold"><b>Standard Scaler</b></font>
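`For each feature, the Standard Scaler rescales the values so that the mean is 0 and the standard deviation (and therefore the variance) is 1` <br>
$x_{scaled} = (x - \mu)/\sigma$, where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation
> However, standard scaling only shifts and rescales a variable; it does not change the shape of its distribution, so it works best when the variable is roughly normally distributed <br>

Thus, if the variables are not normally distributed: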
- either choose a different scaler, e.g. <font color="gold"><b>Quantile Transformer Scaler</b></font>
- or first, convert the variables to a normal distribution and then apply this scaler

Tests for data normality:
- The Shapiro-Wilk test
- The Anderson-Darling test, and
- The Kolmogorov-Smirnov test
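
As a rough sketch of how these three tests can be run, the snippet below uses `scipy.stats` (an assumption; the notebook itself does not import SciPy) on a made-up, deterministic, strongly right-skewed sample. Note that running the Kolmogorov-Smirnov test with the mean and standard deviation estimated from the same sample makes its p-value only approximate.

```python
import numpy as np
from scipy import stats

# Deterministic, strongly right-skewed sample (made up for illustration)
skewed = np.exp(np.linspace(0, 5, 200))

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
shapiro_stat, shapiro_p = stats.shapiro(skewed)

# Anderson-Darling: a larger statistic means a stronger departure from normality
anderson_result = stats.anderson(skewed, dist="norm")

# Kolmogorov-Smirnov against a normal with the sample's own mean and std
ks_stat, ks_p = stats.kstest(
    skewed, "norm", args=(skewed.mean(), skewed.std(ddof=1))
)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3g}")
print(f"Anderson-Darling statistic: {anderson_result.statistic:.3f}")
print(f"Kolmogorov-Smirnov p-value: {ks_p:.3g}")
```

A small p-value (commonly below 0.05) is evidence against normality, in which case one of the two options listed above applies.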

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled
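
The standardization formula can also be applied by hand. This sketch uses only the standard library and the population standard deviation (dividing by n), which is what `StandardScaler` uses; `standardize` is a hypothetical helper written for this illustration.

```python
import statistics

# Same Income values as the working dataframe
income = [15000, 1800, 120000, 10000]


def standardize(values):
    """Return (x - mean) / std for each value, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: divides by n, not n - 1
    return [(x - mean) / std for x in values]


income_z = standardize(income)

print([round(v, 6) for v in income_z])
# After standardizing, the mean is (numerically) 0 and the std is 1
print(statistics.fmean(income_z), statistics.pstdev(income_z))
```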

<a id="MaxAbsScaler"></a>
# <font color="gold"><b>MaxAbsScaler</b></font>
`The MaxAbs scaler finds the maximum absolute value of each column and divides every value in the column by it`
> This operation scales the data to the range [-1, 1]; data with no negative values, as in this example, ends up in [0, 1]

In [16]:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Unnamed: 0,Income,Age,Department
0,0.125,0.490196,HR
1,0.015,0.352941,Legal
2,1.0,0.823529,Marketing
3,0.083333,1.0,Management
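The same result can be obtained by hand. The sketch below uses plain Python; `max_abs_scale` is a hypothetical helper, and the `temperature_change` column is made up to show how negative values land in [-1, 0).

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the list."""
    max_abs = max(abs(x) for x in values)
    return [x / max_abs for x in values]


# Same Income values as the working dataframe
income = [15000, 1800, 120000, 10000]
income_scaled = max_abs_scale(income)

# Made-up column containing a negative value, to show the [-1, 1] range
temperature_change = [-8, 2, 4]
temperature_scaled = max_abs_scale(temperature_change)

print([round(v, 6) for v in income_scaled])
print(temperature_scaled)
```

Unlike MinMax scaling, the data is not shifted, so zeros stay at zero and signs are preserved.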


<a id="Robust_Scaler"></a>
# <font color="gold"><b>Robust Scaler</b></font>
`Each of the scalers used so far relies on statistics such as the mean, maximum, or minimum of a column` <br>
`All of these statistics are sensitive to outliers. If the data contains many outliers, they will shift the mean, the maximum, or the minimum` <br>
`Thus, even if we scale the data using the above methods, we cannot guarantee balanced data with a normal distribution` <br>

The __Robust Scaler__, _as the name suggests_, is not sensitive to outliers. This scaler-
- removes the median from the data
- scales the data by the interquartile range (IQR)

> $x_{scaled} = (x - median)/(Quartile3 - Quartile1)$

In [17]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Unnamed: 0,Income,Age,Department
0,0.075075,-0.404762,HR
1,-0.321321,-0.738095,Legal
2,3.228228,0.404762,Marketing
3,-0.075075,0.833333,Management
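Subtracting the median and dividing by the IQR reproduces the output above. This is a sketch using only the standard library; `robust_scale` is a hypothetical helper, and the `"inclusive"` quantile method is chosen here because it yields the same quartiles as the values shown above for this data.

```python
import statistics


def robust_scale(values):
    """Return (x - median) / (Q3 - Q1) for each value."""
    median = statistics.median(values)
    # quantiles(..., n=4) returns the three quartile cut points: Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in values]


# Same Income and Age values as the working dataframe
income = [15000, 1800, 120000, 10000]
age = [25, 18, 42, 51]

income_scaled = robust_scale(income)
age_scaled = robust_scale(age)

print([round(v, 6) for v in income_scaled])
print([round(v, 6) for v in age_scaled])
```

The outlying income of 120,000 still scales to a large value, but it no longer compresses the other three values the way it does under MinMax scaling.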


<a id="Quantile_Transformer_Scaler"></a>
# <font color="gold"><b>Quantile Transformer Scaler</b></font>
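`The Quantile Transformer Scaler converts the variable distribution to a uniform or a normal distribution and scales it accordingly` <br>
`Since it reshapes the distribution based on quantiles rather than raw magnitudes, it also deals with the outliers` <br>
In scikit-learn the default output distribution is uniform, which is what the example below produces; pass `output_distribution='normal'` to obtain a normal distribution <br>

1. It estimates the cumulative distribution function (CDF) of the variable
2. It uses this CDF to map the values to a uniform distribution
3. It maps the obtained values to the desired output distribution using the associated quantile function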
<br>

> Since this scaler changes the very distribution of the variables, linear relationships among variables may be destroyed by using this scaler<br>
Thus, it is best to use this for non-linear data

In [18]:
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Unnamed: 0,Income,Age,Department
0,0.6666667,0.3333333,HR
1,1e-07,1e-07,Legal
2,0.9999999,0.6666667,Marketing
3,0.3333333,0.9999999,Management
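The idea behind the transformer can be sketched by hand: rank each value to get an empirical CDF in [0, 1], then optionally push it through the normal quantile function. This is a deliberately simplified illustration (it assumes no tied values and is not how scikit-learn implements it); `quantile_transform_sketch` and its `eps` clipping value are hypothetical choices for this example.

```python
from statistics import NormalDist


def quantile_transform_sketch(values, to_normal=False, eps=1e-7):
    """Map values to a uniform (or normal) distribution via their ranks."""
    n = len(values)
    ordered = sorted(values)
    # Empirical CDF from ranks: smallest value -> 0, largest -> 1 (no ties assumed)
    uniform = [ordered.index(x) / (n - 1) for x in values]
    if not to_normal:
        return uniform
    # Normal quantile function (inverse CDF); clip away from 0 and 1,
    # where the inverse CDF is infinite
    std_normal = NormalDist()
    return [std_normal.inv_cdf(min(max(u, eps), 1 - eps)) for u in uniform]


# Same Income values as the working dataframe
income = [15000, 1800, 120000, 10000]

income_uniform = quantile_transform_sketch(income)
income_normal = quantile_transform_sketch(income, to_normal=True)

print([round(v, 6) for v in income_uniform])
print([round(v, 3) for v in income_normal])
```

Only the ordering of the values matters here: replacing 120,000 with 1,200,000 would give exactly the same result, which is why outliers lose their influence.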


<a id="Log_Transformation"></a>
# <font color="gold"><b>Log Transformation</b></font>
`Log Transformation is primarily used to convert a skewed distribution to a normal distribution/less-skewed distribution` <br>
`In this transform, we take the log of the values in a column and use these values as the column instead` <br>
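> $\log_{10}(10) = 1$ <br>
$\log_{10}(100) = 2$, and <br>
$\log_{10}(10000) = 4$ <br>

The code below uses the natural log (`np.log`) instead of base 10; the two differ only by a constant factor, so the compression effect is the same <br>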

### The log operation has a dual role:
- Reducing the impact of too-low values
- Reducing the impact of too-high values

In [19]:
# Create a new column to store the log values (np.log is the natural log)
df['log_income'] = np.log(df['Income'])
df

Unnamed: 0,Income,Age,Department,log_income
0,15000,25,HR,9.615805
1,1800,18,Legal,7.495542
2,120000,42,Marketing,11.695247
3,10000,51,Management,9.21034


<a id="Power_Transformer_Scaler"></a>
# <font color="gold"><b>Power Transformer Scaler</b></font>




<a id="Unit_Vector_Scaler_Normalizer"></a>
# <font color="gold"><b>Unit Vector Scaler/Normalizer</b></font>
