# Standardization
- Transform _continuous_ data so that it appears closer to _normally distributed_

## When to standardize
- The model works in a _linear_ space or relies on distances, e.g. kNN, linear regression, K-Means clustering
- Dataset features have high variance
- Features are on different scales, e.g. predicting house prices from the number of bedrooms and the last sale price
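
To make the house-price case concrete, here is a minimal, dependency-free sketch. The three houses and the helper names `standardize_columns` and `bedroom_share` are made up for illustration and are not part of the wine dataset used below. It measures how much of the squared Euclidean distance between two houses comes from the bedroom count, before and after rescaling each column to mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev

# Hypothetical houses: (number of bedrooms, last sale price in dollars)
houses = [(2, 300_000), (5, 301_000), (2, 320_000)]

def standardize_columns(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

def bedroom_share(a, b):
    """Fraction of the squared Euclidean distance that comes from bedrooms."""
    squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared_diffs[0] / sum(squared_diffs)

# Compare the first two houses: 3 bedrooms apart, $1,000 apart in price
raw_share = bedroom_share(houses[0], houses[1])
scaled = standardize_columns(houses)
scaled_share = bedroom_share(scaled[0], scaled[1])

print(f"bedroom share of squared distance, raw:    {raw_share:.6f}")
print(f"bedroom share of squared distance, scaled: {scaled_share:.6f}")
```

On the raw values the price difference swamps the bedroom difference, so a distance-based model such as kNN would effectively ignore bedrooms. After rescaling, both features can contribute.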

In [1]:
import pandas as pd
wine = pd.read_csv('datasets/wine_types.csv')
wine.head()

| | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [2]:
X = wine.drop('Type',axis=1)
y = wine['Type']

# Without standardization

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [4]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn = KNeighborsClassifier()

# Fit the knn model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test,y_test))

0.7777777777777778


# Log normalization
- Useful for features with high variance
- Applies a logarithmic transformation
- Uses the natural log, i.e. base e (≈ 2.718)
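
As a rough illustration of why the log helps, the sketch below uses a handful of made-up positive values (not the wine data) and the standard library only. It compares the sample variance before and after taking the natural log. `statistics.variance` divides by n - 1, like the default of pandas `var()`.

```python
import math
from statistics import variance  # sample variance (divides by n - 1)

# Made-up, strictly positive values spanning several orders of magnitude
values = [10.0, 100.0, 1_000.0, 10_000.0]

# The natural log turns equal ratios (x10 each step) into equal differences
logged = [math.log(v) for v in values]

raw_variance = variance(values)
log_variance = variance(logged)

print("variance before log:", raw_variance)
print("variance after log: ", log_variance)
```

The log is only defined for positive inputs, so a feature that contains zeros or negative values would need separate handling before this transformation.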

In [5]:
# Check each column's variance to find a good candidate for log normalization
wine.var()

Type                                0.600679
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64

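**Proline** has an extremely high variance (about 99,167) compared with the other features, which makes it a good candidate for log normalization.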

In [7]:
import numpy as np

# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542436
0.17231366191842012


# Scaling data for feature comparison
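- Centers each feature around 0 and rescales it to a variance of 1
- Puts features on a comparable scale; this is a linear rescaling, so the shape of each distribution is preserved rather than made normal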
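
The centering and rescaling step can be sketched by hand as `z = (x - mean) / std`. The snippet below is a minimal illustration using only the standard library; the helper name `standardize` is hypothetical, and the input is just the first five `Magnesium` values from `wine.head()`, so the numbers will not match a scaler fitted on the full column. It uses the population standard deviation (`pstdev`), which is the convention `StandardScaler` follows.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# First five Magnesium values from wine.head()
magnesium = [127, 100, 101, 113, 118]
z_scores = standardize(magnesium)

print([round(z, 3) for z in z_scores])
print("mean of z-scores:", round(mean(z_scores), 10))
print("std of z-scores: ", round(pstdev(z_scores), 10))
```

Because every value is shifted and divided by the same constants, the ordering and relative spacing of the values stay the same; only the units change.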

In [8]:
# Inspect features that are on different scales, which could bias a linear or distance-based model
wine[['Ash','Alcalinity of ash','Magnesium']]

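| | Ash | Alcalinity of ash | Magnesium |
|---|---|---|---|
| 0 | 2.43 | 15.6 | 127 |
| 1 | 2.14 | 11.2 | 100 |
| 2 | 2.67 | 18.6 | 101 |
| 3 | 2.50 | 16.8 | 113 |
| 4 | 2.87 | 21.0 | 118 |
| ... | ... | ... | ... |
| 173 | 2.45 | 20.5 | 95 |
| 174 | 2.48 | 23.0 | 102 |
| 175 | 2.26 | 20.0 | 120 |
| 176 | 2.37 | 20.0 | 120 |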


This shows that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the wine dataset are all on different scales, so let's standardize them so that they can be used in a linear model.

In [9]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create the scaler
scaler = StandardScaler()

# Subset the DataFrame you want to scale 
wine_subset = wine[['Ash','Alcalinity of ash','Magnesium']]

# Apply the scaler to wine_subset
wine_subset_scaled = scaler.fit_transform(wine_subset)

In [10]:
wine_subset_scaled

array([[ 2.32052541e-01, -1.16959318e+00,  1.91390522e+00],
       [-8.27996323e-01, -2.49084714e+00,  1.81450206e-02],
       [ 1.10933436e+00, -2.68738198e-01,  8.83583612e-02],
       [ 4.87926405e-01, -8.09251184e-01,  9.30918449e-01],
       [ 1.84040254e+00,  4.51945783e-01,  1.28198515e+00],
       [ 3.05159359e-01, -1.28970717e+00,  8.60705108e-01],
       [ 3.05159359e-01, -1.46987817e+00, -2.62708342e-01],
       [ 8.90013905e-01, -5.69023190e-01,  1.49262517e+00],
       [-7.18336096e-01, -1.65004916e+00, -1.92495001e-01],
       [-3.52802005e-01, -1.04947918e+00, -1.22281661e-01],
       [-2.43141777e-01, -4.48909194e-01,  3.69211724e-01],
       [-1.70034959e-01, -8.09251184e-01, -3.32921683e-01],
       [ 1.58945723e-01, -1.04947918e+00, -7.54201726e-01],
       [ 8.58389045e-02, -2.43079014e+00, -6.13775045e-01],
       [ 4.92854954e-02, -2.25061915e+00,  1.58571702e-01],
       [ 1.21899459e+00, -6.89137187e-01,  8.60705108e-01],
       [ 1.29210141e+00,  1.51660791e-01

# With standardization

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Instantiate a StandardScaler
scaler = StandardScaler()

# Scale the training and test features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train_scaled, y_train)

# Score the model on the test data
print(knn.score(X_test_scaled, y_test))

0.9333333333333333


Standardizing the features improved the model's test accuracy from 77.78% to 93.33%.
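
Note that the last cell calls `fit_transform` on the training split but only `transform` on the test split, so the mean and standard deviation are learned from the training data alone. The sketch below imitates that pattern by hand with made-up numbers and the standard library; it is an illustration of the idea, not the scaler's actual implementation.

```python
from statistics import mean, pstdev

# Made-up feature values for a training and a test split
train = [95.0, 102.0, 120.0, 120.0]
test = [127.0, 100.0]

# "fit": learn the statistics from the training split only
mu = mean(train)
sigma = pstdev(train)

# "transform": apply the same training statistics to both splits
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

print("train mean after scaling:", round(mean(train_scaled), 10))
print("test mean after scaling: ", round(mean(test_scaled), 4))
```

The scaled test values are not expected to have mean 0 exactly, and that is fine: reusing the training statistics keeps information about the test set from leaking into preprocessing.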