# ScalingTransformer
This notebook demonstrates the `ScalingTransformer` class, which applies one of several types of scaling to the input data `X`. <br>
The transformer gives access to three scalers from the `sklearn.preprocessing` module:
- min max scaling from the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
- max absolute scaling from the [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler)
- standardisation from the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)
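
As a rough guide to what the three scalers compute, the formulas can be sketched in plain NumPy. This is an illustration on a made-up array `x`, not the code `ScalingTransformer` or sklearn runs internally; it assumes each scaler's default settings.

```python
import numpy as np

# Made-up example values (not taken from the housing data).
x = np.array([-4.0, 0.0, 2.0, 6.0])

# Min max scaling: the observed minimum maps to 0 and the maximum to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Max absolute scaling: divide by the largest absolute value.
# Signs are kept and zeros stay at zero.
max_abs = x / np.abs(x).max()

# Standardisation: subtract the mean, divide by the population
# standard deviation (ddof=0, which is NumPy's default for std).
standard = (x - x.mean()) / x.std()

print(min_max)
print(max_abs)
print(standard)
```

Min max scaling and max absolute scaling both bound the output, while standardisation centres it on zero with unit variance and leaves it unbounded.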

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

In [2]:
import tubular
from tubular.numeric import ScalingTransformer

In [3]:
tubular.__version__

'0.3.0'

## Load California housing dataset from sklearn

In [4]:
cali = fetch_california_housing()
cali_df = pd.DataFrame(cali["data"], columns=cali["feature_names"])

In [5]:
cali_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [6]:
cali_df.dtypes

MedInc        float64
HouseAge      float64
AveRooms      float64
AveBedrms     float64
Population    float64
AveOccup      float64
Latitude      float64
Longitude     float64
dtype: object

## Simple usage

### Initialising ScalingTransformer

When initialising the `ScalingTransformer` the user must specify:
- `columns`, the columns to apply scaling to
- `scaler`, which sets the type of scaling to use and must be one of 'min_max', 'max_abs' or 'standard'

The user can also specify `scaler_kwargs`, a dictionary of keyword arguments that will be passed to the scaler when it is initialised. See the docs for each scaler ([MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler), [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler), [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)) for the arguments that each accepts.
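
To give a feel for what such keyword arguments do, the sketch below unpacks a dictionary into sklearn's `MinMaxScaler` directly. The dictionary contents and the toy array are made up for illustration. Passing the same dictionary as `scaler_kwargs` together with `scaler="min_max"` should have a comparable effect, going by the description above, but that route is not exercised here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical kwargs of the kind that could be supplied as `scaler_kwargs`.
scaler_kwargs = {"feature_range": (-1, 1)}

# Toy single-column data, not from the housing dataset.
X = np.array([[1.0], [2.0], [5.0]])

# The dictionary is unpacked into the scaler's constructor.
scaler = MinMaxScaler(**scaler_kwargs)
X_scaled = scaler.fit_transform(X)

# The column now spans -1 to 1 instead of the default 0 to 1.
print(X_scaled.ravel())
```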

In [7]:
scaling_1 = ScalingTransformer(columns=["AveRooms", "AveBedrms"], scaler="standard")

### ScalingTransformer fit
The `ScalingTransformer` must be `fit` on data before running `transform` to apply the scaling. The `fit` method determines the scaling values from the input data.

In [8]:
scaling_1.fit(cali_df)

ScalingTransformer(columns=['AveRooms', 'AveBedrms'], scaler=StandardScaler(),
                   scaler_kwargs=None)
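
The split between `fit` and `transform` can be illustrated with sklearn's `StandardScaler` on its own: `fit` stores statistics learned from one dataset, and `transform` reuses them on whatever data it is given. The arrays below are made up, and the sketch does not assume anything about how `ScalingTransformer` stores its fitted scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data and a new, unseen row (both made up).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_new = np.array([[10.0]])

# fit learns the per-column mean and standard deviation.
scaler = StandardScaler().fit(X_train)
print(scaler.mean_, scaler.scale_)

# transform applies the stored values; it does not re-estimate them from X_new.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```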

### ScalingTransformer transform
The `transform` method applies scaling to the input data `X`. The scaled values overwrite the specified columns; no new columns are created.

In [9]:
cali_df["AveRooms"].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0      0.846154
0.2      4.266667
0.4      4.934005
0.6      5.520848
0.8      6.268581
1.0    141.909091
Name: AveRooms, dtype: float64

In [10]:
cali_df["AveBedrms"].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0     0.333333
0.2     0.995448
0.4     1.032452
0.6     1.065969
0.8     1.115385
1.0    34.066667
Name: AveBedrms, dtype: float64

In [11]:
cali_df_2 = scaling_1.transform(cali_df)

In [12]:
cali_df_2["AveRooms"].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    -1.852319
0.2    -0.469798
0.4    -0.200070
0.6     0.037124
0.8     0.339347
1.0    55.163236
Name: AveRooms, dtype: float64

In [13]:
cali_df_2["AveBedrms"].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    -1.610768
0.2    -0.213606
0.4    -0.135521
0.6    -0.064796
0.8     0.039480
1.0    69.571713
Name: AveBedrms, dtype: float64
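
Because standardisation is the affine map `z = (x - mean) / std`, the fitted mean and standard deviation can be backed out from the quantile outputs above. The sketch below does this for `AveRooms` using the displayed minimum and maximum, then checks the result against the 0.2 quantile. The numbers are copied from the printed outputs, so they carry six-decimal rounding.

```python
# AveRooms quantiles before and after standard scaling, copied from the outputs above.
raw_min, raw_max = 0.846154, 141.909091
scaled_min, scaled_max = -1.852319, 55.163236

# Two (x, z) pairs are enough to solve z = (x - mean) / std for both unknowns.
std = (raw_max - raw_min) / (scaled_max - scaled_min)
mean = raw_min - scaled_min * std
print(round(mean, 3), round(std, 3))

# Consistency check: scale the raw 0.2 quantile with the recovered values
# and compare it to the scaled 0.2 quantile shown above (-0.469798).
z_q20 = (4.266667 - mean) / std
print(round(z_q20, 6))
```

Quantiles can be compared this way because an affine map with a positive scale preserves ordering, so the quantile of the scaled column equals the scaled quantile of the raw column.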

## Max absolute scaling
Below is an example of max absolute scaling, which divides each column by its largest absolute value.

In [14]:
scaling_2 = ScalingTransformer(columns=["AveRooms", "AveBedrms"], scaler="max_abs")

In [15]:
scaling_2.fit(cali_df)

ScalingTransformer(columns=['AveRooms', 'AveBedrms'], scaler=MaxAbsScaler(),
                   scaler_kwargs=None)

In [16]:
cali_df_3 = scaling_2.transform(cali_df)

In [17]:
cali_df_3["AveRooms"].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.005963
0.2    0.030066
0.4    0.034769
0.6    0.038904
0.8    0.044173
1.0    1.000000
Name: AveRooms, dtype: float64

In [18]:
cali_df_3["AveBedrms"].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.009785
0.2    0.029221
0.4    0.030307
0.6    0.031291
0.8    0.032741
1.0    1.000000
Name: AveBedrms, dtype: float64
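
The scaled `AveRooms` quantiles above can be reproduced by hand from the raw quantiles shown earlier, which is a quick way to confirm what max absolute scaling does. This is an arithmetic check on the printed values, not a call into the transformer.

```python
# Raw AveRooms quantiles, copied from the earlier output.
raw = {"min": 0.846154, "q20": 4.266667, "max": 141.909091}

# Max absolute scaling divides by the largest absolute value in the column.
max_abs_value = max(abs(v) for v in raw.values())
scaled = {name: value / max_abs_value for name, value in raw.items()}

for name, value in scaled.items():
    print(name, round(value, 6))
```

The minimum is not shifted to zero here: the data is only divided, never translated, so a strictly positive column keeps a strictly positive minimum.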

## Min max scaling
Below is an example of min max scaling, which by default rescales each column to the range 0 to 1.

In [19]:
scaling_3 = ScalingTransformer(columns=["AveRooms", "AveBedrms"], scaler="min_max")

In [20]:
scaling_3.fit(cali_df)

ScalingTransformer(columns=['AveRooms', 'AveBedrms'], scaler=MinMaxScaler(),
                   scaler_kwargs=None)

In [21]:
cali_df_4 = scaling_3.transform(cali_df)

In [22]:
cali_df_4["AveRooms"].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.000000
0.2    0.024248
0.4    0.028979
0.6    0.033139
0.8    0.038440
1.0    1.000000
Name: AveRooms, dtype: float64

In [23]:
cali_df_4["AveBedrms"].quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0])

0.0    0.000000
0.2    0.019628
0.4    0.020725
0.6    0.021718
0.8    0.023183
1.0    1.000000
Name: AveBedrms, dtype: float64
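
The same kind of hand check works for min max scaling. The sketch below applies `(x - min) / (max - min)` to the raw `AveBedrms` quantiles shown earlier and should land on the scaled quantiles above, up to rounding in the printed values. It assumes the default output range of 0 to 1.

```python
# Raw AveBedrms quantiles, copied from the earlier output.
raw = {"min": 0.333333, "q20": 0.995448, "max": 34.066667}

# Min max scaling subtracts the minimum and divides by the range.
lo, hi = raw["min"], raw["max"]
scaled = {name: (value - lo) / (hi - lo) for name, value in raw.items()}

for name, value in scaled.items():
    print(name, round(value, 6))
```

With a maximum this far above the bulk of the data, most scaled values are squeezed close to zero, which is why the 0.8 quantile above is only about 0.023.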