<a href="https://colab.research.google.com/github/i-wizard/ML-sample/blob/main/Feature_Scaling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Load Sample Data

In [23]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

In [39]:
X, y = fetch_california_housing(as_frame=True, return_X_y=True)

In [25]:
X.describe()

statistic,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31


Split data

In [40]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.describe()

statistic,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
count,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0,16512.0
mean,3.876298,28.619065,5.432607,1.096174,1425.437742,3.030518,35.624783,-119.563008
std,1.906783,12.600999,2.523415,0.489529,1143.062756,6.44134,2.136552,2.005251
min,0.4999,1.0,0.846154,0.333333,3.0,0.75,32.54,-124.35
25%,2.566625,18.0,4.447644,1.00576,785.0,2.428016,33.93,-121.8
50%,3.5485,29.0,5.234243,1.04872,1166.0,2.817937,34.25,-118.49
75%,4.747575,37.0,6.059008,1.1,1724.0,3.283243,37.71,-118.0
max,15.0001,52.0,141.909091,34.066667,35682.0,599.714286,41.95,-114.31


Test Scaler Accuracy (R² Score)

In [41]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

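def scaled_model_accuracy(model, scaler_instance):
  """Fit scaler + model on the training split and return the test-set score.

  For regressors such as KNeighborsRegressor, score() returns R^2,
  not classification accuracy.
  """
  # The pipeline refits the scaler on X_train only, so no test data leaks in.
  model_scaled = Pipeline([
      ("scale", scaler_instance),
      ("model", model)
  ])
  model_scaled.fit(X_train, y_train)
  return model_scaled.score(X_test, y_test)

def unscaled_model_accuracy(model):
  """Baseline: fit the model on the raw features and return the test-set R^2."""
  model_unscaled = Pipeline([
      ("model", model)
  ])
  model_unscaled.fit(X_train, y_train)
  return model_unscaled.score(X_test, y_test)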




**Transform data with Standard Scaler**

StandardScaler centers each feature on its mean and divides by its standard deviation (std). Outliers distort both statistics, so it does not perform well on data with a lot of outliers.
It helps distance-based models (KNeighborsRegressor) and has essentially no effect on tree-based models (RandomForestRegressor), which split on feature thresholds rather than distances.
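
The statement that scaling matters for KNeighborsRegressor but not for tree-based models can be checked on a small example. The sketch below is an illustration only: it uses synthetic data (not the housing data) and a single DecisionTreeRegressor as a stand-in for the forest, and the names `X_demo`, `y_demo`, and `fit_predict` are made up for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumption): feature 0 lies in [0, 1], feature 1 in [0, 1000].
X_demo = rng.uniform(size=(300, 2)) * np.array([1.0, 1000.0])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

X_scaled = StandardScaler().fit_transform(X_demo)

def fit_predict(model, X):
    """Fit on (X, y_demo) and return predictions for the same rows."""
    return model.fit(X, y_demo).predict(X)

tree_raw = fit_predict(DecisionTreeRegressor(max_depth=4, random_state=0), X_demo)
tree_scaled = fit_predict(DecisionTreeRegressor(max_depth=4, random_state=0), X_scaled)

knn_raw = fit_predict(KNeighborsRegressor(), X_demo)
knn_scaled = fit_predict(KNeighborsRegressor(), X_scaled)

# Trees split on per-feature thresholds, so a monotonic rescaling keeps the same splits.
print("tree predictions unchanged:", np.allclose(tree_raw, tree_scaled))
# KNN uses distances, which the large-scale feature dominates before scaling.
print("KNN predictions unchanged:", np.allclose(knn_raw, knn_scaled))
```

On the raw data the nearest neighbours are chosen almost entirely by the second feature, which is why the KNN predictions shift once both features share a scale.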

Formula ->

For each feature Xi in the data:

  New Xi = (Xi - Xmean)/Xstd
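
As a quick check of the formula, the sketch below applies it by hand to a tiny made-up matrix (`X_demo` is not part of the notebook's data) and compares the result with StandardScaler. One detail worth knowing: StandardScaler uses the population standard deviation (ddof=0), whereas the std row printed by pandas `describe()` uses the sample standard deviation (ddof=1).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 900.0]])

# Manual z-score per column; np.std defaults to ddof=0, matching StandardScaler.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

scaled = StandardScaler().fit_transform(X_demo)

print("matches manual formula:", np.allclose(manual, scaled))
print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
```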

In [44]:
from sklearn.preprocessing import StandardScaler

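scaler = StandardScaler()
# Scale the training features directly to inspect the result (mean ~0, std ~1).
data = scaler.fit_transform(X_train)
df = pd.DataFrame(data).describe()
# The pipeline inside the helper refits the scaler on X_train.
s_accuracy = scaled_model_accuracy(KNeighborsRegressor(), scaler)
# Baseline score without scaling, for comparison.
un_accuracy = unscaled_model_accuracy(KNeighborsRegressor())
s_accuracy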

0.6898252870009414

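**MinMax Scaler**

Rescales each feature to the range [0, 1] by default. Because it relies on the minimum and maximum, a single outlier can squeeze the remaining values into a narrow band.

Formula

For each feature Xi in the data:

    New Xi = (Xi - Xmin) / (Xmax - Xmin)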
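
The min-max formula subtracts the column minimum and divides by the column range, (Xi - Xmin) / (Xmax - Xmin). A minimal sketch on made-up values (`X_demo` is illustrative only) confirms that this matches MinMaxScaler with its default settings:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made-up values).
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 900.0]])

col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Manual min-max scaling per column.
manual = (X_demo - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(X_demo)

print("matches manual formula:", np.allclose(manual, scaled))
print("column minima:", scaled.min(axis=0))
print("column maxima:", scaled.max(axis=0))
```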

In [43]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data = scaler.fit_transform(X_train)
df = pd.DataFrame(data)
s_accuracy = scaled_model_accuracy(KNeighborsRegressor(), scaler)
un_accuracy = unscaled_model_accuracy(KNeighborsRegressor())
s_accuracy

0.6965930580761679

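**Robust Scaler**

RobustScaler is much less sensitive to outliers because it does not use the mean or the min/max values; instead it centers on the median and scales by the interquartile range (IQR). Like the other linear scalers, it does not change the shape of the distribution, so outliers are still present after scaling.

Formula

New Xi = (Xi - Xmedian) / (X0.75 - X0.25)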
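
To make the formula concrete, here is a small sketch on a made-up column containing one outlier (`X_demo` is illustrative only). It assumes RobustScaler's default settings, which center on the median and scale by the 25th to 75th percentile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a single extreme outlier (made-up values).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(X_demo, axis=0)                 # 3.0
q1, q3 = np.percentile(X_demo, [25, 75], axis=0)   # 2.0 and 4.0, so IQR = 2.0

# Manual robust scaling.
manual = (X_demo - median) / (q3 - q1)

scaled = RobustScaler().fit_transform(X_demo)

print("matches manual formula:", np.allclose(manual, scaled))
print(scaled.ravel())   # [-1.  -0.5  0.   0.5 48.5]
```

The median and IQR are untouched by the value 100, so the four ordinary points keep a sensible spread, while the outlier itself remains far from the rest.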

In [38]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
data = scaler.fit_transform(X_train)
df = pd.DataFrame(data)
s_accuracy = scaled_model_accuracy(KNeighborsRegressor(), scaler)
un_accuracy = unscaled_model_accuracy(KNeighborsRegressor())
s_accuracy

0.679645665190272

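**MaxAbsScaler**

Divides each feature by its largest absolute value, so the result lies in [-1, 1]. It is not so useful when a feature has outliers, because a single extreme value sets the scale.

Formula

New Xi = Xi / max(abs(X))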
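
One subtlety is that the divisor is the largest absolute value in the column, which differs from the absolute value of the maximum whenever the most extreme entry is negative (as with Longitude here). A minimal sketch on made-up values (`X_demo` is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Column 0 is entirely negative; column 1 mixes signs (made-up values).
X_demo = np.array([[-124.0,  2.0],
                   [-118.0, -8.0],
                   [-114.0,  4.0]])

max_abs = np.abs(X_demo).max(axis=0)   # [124., 8.]
abs_of_max = np.abs(X_demo.max(axis=0))  # [114., 4.], a different quantity

# Manual scaling by the largest absolute value per column.
manual = X_demo / max_abs

scaled = MaxAbsScaler().fit_transform(X_demo)

print("matches manual formula:", np.allclose(manual, scaled))
print("largest |value| per column after scaling:", np.abs(scaled).max(axis=0))
```

Dividing by `abs_of_max` instead would push some values outside [-1, 1], which is why the distinction matters.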


In [42]:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
data = scaler.fit_transform(X_train)
df = pd.DataFrame(data)
s_accuracy = scaled_model_accuracy(KNeighborsRegressor(), scaler)
un_accuracy = unscaled_model_accuracy(KNeighborsRegressor())
s_accuracy

0.5566896195690834

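**PowerTransformer**

Applies a power transform to make each feature more Gaussian-like. Two methods are available:

box-cox: requires strictly positive values, so it cannot be used where zero or negative values are present (Longitude is negative in this dataset)

yeo-johnson: also handles zero and negative values, and is the default method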
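
The difference between the two methods can be seen on synthetic data. The sketch below is an illustration under assumptions: `X_pos` and `X_mixed` are made-up skewed samples, not columns from the housing data.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_pos = rng.lognormal(size=(200, 1))   # strictly positive, right-skewed
X_mixed = X_pos - 1.0                  # same shape, but now includes negatives

# Box-Cox works on strictly positive data.
bc = PowerTransformer(method="box-cox").fit_transform(X_pos)

# Box-Cox raises ValueError as soon as non-positive values are present.
try:
    PowerTransformer(method="box-cox").fit_transform(X_mixed)
    box_cox_failed = False
except ValueError as err:
    box_cox_failed = True
    print("box-cox failed:", err)

# Yeo-Johnson accepts the same data without complaint.
yj = PowerTransformer(method="yeo-johnson").fit_transform(X_mixed)

# With the default standardize=True the output has zero mean and unit variance.
print("yeo-johnson mean/std:", yj.mean().round(6), yj.std().round(6))
```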

In [47]:
from sklearn.preprocessing import PowerTransformer
scaler = PowerTransformer(method='yeo-johnson')
data = scaler.fit_transform(X_train)
df = pd.DataFrame(data)
s_accuracy = scaled_model_accuracy(KNeighborsRegressor(), scaler)
un_accuracy = unscaled_model_accuracy(KNeighborsRegressor())
s_accuracy

0.6696330143270286

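**Quantile Transformer**

Maps each feature onto a uniform or normal distribution based on its quantiles (ranks). It is useful when a feature has outliers, because extreme values are pulled into the range of the target distribution. Unlike the linear scalers above, it changes the shape of the distribution.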
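
To illustrate the effect on outliers, the sketch below injects one extreme value into synthetic data and maps it with QuantileTransformer. It uses output_distribution="uniform" so the bounds are easy to read (the notebook cell uses "normal"); `X_demo` and the choice of n_quantiles=100 are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X_demo = rng.exponential(size=(500, 1))   # skewed synthetic feature
X_demo[0, 0] = 1_000.0                    # inject one extreme outlier

qt = QuantileTransformer(n_quantiles=100,
                         output_distribution="uniform",
                         random_state=0)
X_uniform = qt.fit_transform(X_demo)

# Every value, including the outlier, ends up inside [0, 1].
print("min / max after transform:", X_uniform.min(), X_uniform.max())
print("outlier maps to:", X_uniform[0, 0])
```

Because the mapping depends on rank rather than magnitude, the outlier lands at the top of the range instead of stretching the scale for every other point.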



In [48]:
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer(output_distribution="normal")
data = scaler.fit_transform(X_train)
df = pd.DataFrame(data)
s_accuracy = scaled_model_accuracy(KNeighborsRegressor(), scaler)
un_accuracy = unscaled_model_accuracy(KNeighborsRegressor())
s_accuracy

0.7305458236916657
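
The per-scaler cells above repeat the same steps, so the comparison could be condensed into one loop. The sketch below is one possible way to do that; it runs on synthetic data from make_regression so that it is self-contained, and the names `X_syn`, `scalers`, and `scores` are made up for this example. Its scores will not match the housing-data numbers shown above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, PowerTransformer,
                                   QuantileTransformer, RobustScaler, StandardScaler)

# Synthetic stand-in data (assumption); swap in X_train/X_test from the notebook.
X_syn, y_syn = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

scalers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
    "MaxAbsScaler": MaxAbsScaler(),
    "PowerTransformer": PowerTransformer(method="yeo-johnson"),
    # n_quantiles is kept below the number of training rows to avoid a warning.
    "QuantileTransformer": QuantileTransformer(n_quantiles=100,
                                               output_distribution="normal",
                                               random_state=0),
}

# Baseline without any scaling; score() returns R^2 for regressors.
scores = {"unscaled": KNeighborsRegressor().fit(X_tr, y_tr).score(X_te, y_te)}

for name, scaler in scalers.items():
    # The pipeline fits the scaler on the training split only.
    pipe = make_pipeline(scaler, KNeighborsRegressor())
    scores[name] = pipe.fit(X_tr, y_tr).score(X_te, y_te)

for name, r2 in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: R^2 = {r2:.4f}")
```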