# ScalingTransformer
This notebook demonstrates the `ScalingTransformer` class, which applies one of several types of scaling to the data `X`. <br>
The transformer gives access to 3 scalers from the `sklearn.preprocessing` module:
- min max scaling from the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
- max absolute scaling from the [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler)
- standardisation from the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tubular
from tubular.numeric import ScalingTransformer

In [3]:
tubular.__version__

'0.2.8'

## Load Boston house price dataset from sklearn
Note that the loading function used below, `prepare_boston_df`, modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

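| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | ZN_cat | CHAS_cat | RAD_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | | 4.09 | | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 | 18.0 | 0.0 | |
| 1 | 0.02731 | | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 | | 0.0 | 2.0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | | 17.8 | 392.83 | 4.03 | 34.7 | 0.0 | 0.0 | 2.0 |
| 3 | | | 2.18 | 0.0 | 0.458 | | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | | | 33.4 | | 0.0 | 3.0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | | | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 | 0.0 | 0.0 | 3.0 |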


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage

### Initialising ScalingTransformer

When initialising the `ScalingTransformer` the user must specify:
- `columns`: the columns to apply scaling to
- `scaler`: the type of scaling to use, which must be one of 'min_max', 'max_abs' or 'standard'

The user can also specify `scaler_kwargs`, a dictionary of keyword arguments that is passed to the scaler when it is initialised. See the docs for each scaler ([MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler), [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler), [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)) for the arguments that each accepts.

In [7]:
scaling_1 = ScalingTransformer(columns = ['NOX', 'CRIM'], scaler = 'standard')
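As a rough illustration of what `scaler_kwargs` is for, the sketch below passes a dictionary straight to an sklearn scaler. It uses `MinMaxScaler` directly rather than `ScalingTransformer`, and it assumes the transformer unpacks the dictionary into the scaler's constructor in a similar way. The `feature_range` value and the toy data are chosen only for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative keyword arguments; `feature_range` is a documented
# MinMaxScaler parameter, but the value (-1, 1) is chosen only for this example.
scaler_kwargs = {"feature_range": (-1, 1)}

# Assumption: the transformer unpacks the dictionary into the scaler's
# constructor, roughly like this.
scaler = MinMaxScaler(**scaler_kwargs)

# Toy single-column data, not taken from the Boston dataset.
values = np.array([[0.0], [5.0], [10.0]])
scaled = scaler.fit_transform(values)

print(scaled.ravel())
```

With `ScalingTransformer` the equivalent call would presumably be `ScalingTransformer(columns = ['NOX', 'CRIM'], scaler = 'min_max', scaler_kwargs = {'feature_range': (-1, 1)})`.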

### ScalingTransformer fit
The `ScalingTransformer` must be `fit` on data before `transform` can be run to apply the scaling. The `fit` method learns the scaling values from the input data.

In [8]:
scaling_1.fit(boston_df)



ScalingTransformer(columns=['NOX', 'CRIM'],
                   scaler=StandardScaler(copy=True, with_mean=True,
                                         with_std=True),
                   scaler_kwargs=None)
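To make the "scaling values" concrete, here is a minimal sketch of what the underlying `StandardScaler` learns when fitted. It calls sklearn directly on toy data, on the assumption that `ScalingTransformer` delegates fitting to the scaler shown in the output above. The toy column includes a missing value because the Boston columns contain nulls; sklearn's scalers disregard NaNs in `fit` and keep them in `transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value, not taken from the Boston dataset.
values = np.array([[1.0], [2.0], [np.nan], [3.0]])

scaler = StandardScaler().fit(values)

# Learned statistics: NaN is ignored, so they are computed from 1, 2 and 3.
learned_mean = scaler.mean_[0]    # mean of the non-missing values
learned_scale = scaler.scale_[0]  # population standard deviation (ddof=0)

# transform applies (x - mean) / scale and leaves the NaN in place.
scaled = scaler.transform(values).ravel()

print(learned_mean, learned_scale, scaled)
```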

### ScalingTransformer transform
The `transform` method applies scaling to the input data `X`. The scaled values overwrite the specified columns in the returned DataFrame; no new columns are added.

In [9]:
boston_df['NOX'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.38500
0.2    0.44218
0.4    0.50400
0.6    0.57300
0.8    0.66620
1.0    0.87100
Name: NOX, dtype: float64

In [10]:
boston_df['CRIM'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0     0.00632
0.2     0.06617
0.4     0.14932
0.6     0.53700
0.8     5.66637
1.0    88.97620
Name: CRIM, dtype: float64

In [11]:
boston_df_2 = scaling_1.transform(boston_df)

In [12]:
boston_df_2['NOX'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0   -1.451334
0.2   -0.959818
0.4   -0.428415
0.6    0.164706
0.8    0.965849
1.0    2.726301
Name: NOX, dtype: float64

In [13]:
boston_df_2['CRIM'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0   -0.416016
0.2   -0.409358
0.4   -0.400108
0.6   -0.356981
0.8    0.213636
1.0    9.481427
Name: CRIM, dtype: float64
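The quantiles above can be checked against the standardisation formula `z = (x - mean) / std`. The notebook does not print the fitted mean and standard deviation, so the sketch below infers them from the minimum and maximum of `NOX` and then recomputes the remaining quantiles. This works because the transformation is linear and increasing, so each quantile maps to the corresponding quantile. The inferred values are derived from the rounded outputs and are only approximate.

```python
# NOX quantiles before and after scaling, copied from the outputs above.
before = [0.38500, 0.44218, 0.50400, 0.57300, 0.66620, 0.87100]
after = [-1.451334, -0.959818, -0.428415, 0.164706, 0.965849, 2.726301]

# Standardisation is z = (x - mean) / std. The fitted mean and std are not
# printed in the notebook, so recover them from the minimum and maximum.
std = (before[-1] - before[0]) / (after[-1] - after[0])
mean = before[0] - after[0] * std

# Apply the same formula to every quantile and compare with the output.
recomputed = [(x - mean) / std for x in before]
max_error = max(abs(r - a) for r, a in zip(recomputed, after))

print(f"implied mean={mean:.4f}, implied std={std:.4f}, max error={max_error:.1e}")
```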

## Max absolute scaling
Below is an example of using max absolute scaling.

In [14]:
scaling_2 = ScalingTransformer(columns = ['NOX', 'CRIM'], scaler = 'max_abs')

In [15]:
scaling_2.fit(boston_df)



ScalingTransformer(columns=['NOX', 'CRIM'], scaler=MaxAbsScaler(copy=True),
                   scaler_kwargs=None)

In [16]:
boston_df_3 = scaling_2.transform(boston_df)

In [17]:
boston_df_3['NOX'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.442021
0.2    0.507669
0.4    0.578645
0.6    0.657865
0.8    0.764868
1.0    1.000000
Name: NOX, dtype: float64

In [18]:
boston_df_3['CRIM'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.000071
0.2    0.000744
0.4    0.001678
0.6    0.006035
0.8    0.063684
1.0    1.000000
Name: CRIM, dtype: float64
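Max absolute scaling divides each value by the largest absolute value in the column, so the maximum maps to 1 and zero stays at zero. As a quick sanity check, the sketch below reproduces the scaled `NOX` quantiles from the unscaled ones shown earlier. The numbers are copied from the notebook outputs, and the variable names are only illustrative.

```python
# NOX quantiles before and after max absolute scaling, copied from the outputs.
nox_before = [0.38500, 0.44218, 0.50400, 0.57300, 0.66620, 0.87100]
nox_after = [0.442021, 0.507669, 0.578645, 0.657865, 0.764868, 1.000000]

# The largest absolute value in the column is the 1.0 quantile here.
max_abs = max(abs(x) for x in nox_before)

# Max absolute scaling: x / max(|x|).
recomputed = [x / max_abs for x in nox_before]
max_error = max(abs(r - a) for r, a in zip(recomputed, nox_after))

print(f"max_abs={max_abs}, max error={max_error:.1e}")
```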

## Min max scaling
Below is an example of using min max scaling.

In [19]:
scaling_3 = ScalingTransformer(columns = ['NOX', 'CRIM'], scaler = 'min_max')

In [20]:
scaling_3.fit(boston_df)



ScalingTransformer(columns=['NOX', 'CRIM'],
                   scaler=MinMaxScaler(copy=True, feature_range=(0, 1)),
                   scaler_kwargs=None)

In [21]:
boston_df_4 = scaling_3.transform(boston_df)

In [22]:
boston_df_4['NOX'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.000000
0.2    0.117654
0.4    0.244856
0.6    0.386831
0.8    0.578601
1.0    1.000000
Name: NOX, dtype: float64

In [23]:
boston_df_4['CRIM'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.000000
0.2    0.000673
0.4    0.001607
0.6    0.005965
0.8    0.063618
1.0    1.000000
Name: CRIM, dtype: float64
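With the default `feature_range=(0, 1)`, min max scaling subtracts the column minimum and divides by the range, so the minimum maps to 0 and the maximum to 1. The sketch below checks this against the `NOX` quantiles, again using values copied from the notebook outputs and illustrative variable names.

```python
# NOX quantiles before and after min max scaling, copied from the outputs.
nox_before = [0.38500, 0.44218, 0.50400, 0.57300, 0.66620, 0.87100]
nox_after = [0.000000, 0.117654, 0.244856, 0.386831, 0.578601, 1.000000]

# Column minimum and maximum are the 0.0 and 1.0 quantiles.
lo, hi = min(nox_before), max(nox_before)

# Min max scaling to [0, 1]: (x - min) / (max - min).
recomputed = [(x - lo) / (hi - lo) for x in nox_before]
max_error = max(abs(r - a) for r, a in zip(recomputed, nox_after))

print(f"min={lo}, max={hi}, max error={max_error:.1e}")
```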