In [2]:
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    StandardScaler,
)

## Normalisation

This is the process of converting a range of numerical values into another range, typically $[-1, 1]$ or $[0, 1]$.

Say, for example, a numerical feature ranges from 250 to 1000. If the minimum of this feature (250) is subtracted from every value of the feature, and the result is divided by the difference between the feature's maximum and minimum (1000 - 250 = 750), the feature is normalised to between 0 and 1. This can be expressed as:

\begin{equation}
\bar{x}^{(j)} = \frac{x^{(j)} - \mathrm{min}^{(j)}}{\mathrm{max}^{(j)} - \mathrm{min}^{(j)}}
\end{equation}

where $\mathrm{min}^{(j)}$ and $\mathrm{max}^{(j)}$ are the minimum and maximum of feature $j$, respectively.
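
As a rough check of the formula, here is a minimal NumPy sketch applied to a made-up feature spanning 250 to 1000. The helper name `min_max_normalise` and the sample values are assumptions chosen for illustration; they are not part of sklearn.

```python
import numpy as np


def min_max_normalise(x):
    """Rescale a 1-D array to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # A constant feature has no range, so the formula would divide by zero.
        raise ValueError("cannot normalise a constant feature")
    return (x - x_min) / (x_max - x_min)


# Hypothetical feature values between 250 and 1000
feature = np.array([250.0, 400.0, 625.0, 1000.0])
normalised = min_max_normalise(feature)
print(normalised)  # values 0, 0.2, 0.5 and 1
```

The minimum maps to 0, the maximum maps to 1, and every other value lands in between in proportion to its distance from the minimum.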

Normalising is not strictly required, but it often helps to speed up training. For example, if we train a model with gradient-based updates on two features ranging from 0 to 1 and from 0 to 10,000, the derivative with respect to the weight of the larger feature will dominate the update. Generally speaking, it is good practice to ensure the features are in similar ranges when training a model.
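
To make the point about dominating derivatives concrete, the sketch below compares gradient magnitudes before and after min-max scaling. It assumes a linear model trained with mean squared error on synthetic data; the target coefficients, the sample size, and the helper `mse_gradient` are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two synthetic features on very different scales (assumed for illustration)
x_small = rng.uniform(0, 1, size=n)
x_large = rng.uniform(0, 10_000, size=n)
X = np.column_stack([x_small, x_large])
y = 3.0 * x_small + 0.002 * x_large  # hypothetical target


def mse_gradient(X, y, w):
    """Gradient of mean squared error for the linear model X @ w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)


w0 = np.zeros(2)  # start from zero weights

# Gradient on the raw features
grad_raw = mse_gradient(X, y, w0)

# Gradient after min-max scaling each column to [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
grad_scaled = mse_gradient(X_scaled, y, w0)

ratio_raw = abs(grad_raw[1] / grad_raw[0])
ratio_scaled = abs(grad_scaled[1] / grad_scaled[0])
print(f"raw gradient ratio (large / small feature):    {ratio_raw:.1f}")
print(f"scaled gradient ratio (large / small feature): {ratio_scaled:.1f}")
```

On the raw features the gradient component for the large feature is several orders of magnitude bigger than the other, so a single learning rate cannot suit both weights. After scaling, the two components are of comparable size.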

We can do this with the `MinMaxScaler` in sklearn. An example is shown below:

In [13]:
data = np.array([5, 10, 6, 5, 8, 7, 8, 9, 9, 5, 10]).reshape(-1, 1)
scaler = MinMaxScaler()
scaler.fit(data)

MinMaxScaler()

In [15]:
scaler.data_max_

array([10.])

In [16]:
scaler.transform(data)

array([[0. ],
       [1. ],
       [0.2],
       [0. ],
       [0.6],
       [0.4],
       [0.6],
       [0.8],
       [0.8],
       [0. ],
       [1. ]])

The data has now been scaled to between 0 and 1; the original range was 5 to 10.
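
The other common target range, $[-1, 1]$, can be requested through the `feature_range` parameter of `MinMaxScaler`, and `inverse_transform` maps scaled values back to the original units. The sketch below reuses the same sample data; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([5, 10, 6, 5, 8, 7, 8, 9, 9, 5, 10]).reshape(-1, 1)

# Scale to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)

# Map the scaled values back to the original units
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```

Here the minimum (5) maps to -1 and the maximum (10) maps to 1, and the round trip recovers the original values.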

## Standardisation

This method of scaling rescales a feature so that it has the mean and standard deviation of the standard normal distribution, $\mu=0$ and $\sigma=1$.
- $\mu$: the mean of the feature's values
- $\sigma$: the standard deviation of the feature's values about that mean

The standard score (also known as the z-score) is calculated as follows:

\begin{equation}
\hat{x}^{(j)} = \frac{x^{(j)} -  \mu^{(j)}}{\sigma^{(j)}}
\end{equation}
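
The z-score formula can be sketched directly in NumPy. The helper name `standardise` and the sample values are assumptions for illustration. Note that `np.std` computes the population standard deviation by default (`ddof=0`), which is also the convention `StandardScaler` uses.

```python
import numpy as np


def standardise(x):
    """Return z-scores: (x - mean) / std, computed along the first axis."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)  # population standard deviation (ddof=0)
    # A constant feature has sigma == 0 and is not handled in this sketch.
    return (x - mu) / sigma


# Hypothetical feature with mean 5 and population standard deviation 2
feature = np.array([2, 4, 4, 4, 5, 5, 7, 9])
z = standardise(feature)
print(z)         # values -1.5, -0.5, -0.5, -0.5, 0, 0, 1 and 2
print(z.mean())  # approximately 0
print(z.std())   # approximately 1
```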

Should you normalise or standardise a feature? There is no single correct answer; try both and see what works best on your dataset. If you are short on time, generally speaking:
- Unsupervised methods benefit more from standardisation.
- If the feature is normally distributed, standardisation is preferred.
- For features with extreme outliers, standardisation is preferred, because normalisation compresses the typical values into a very narrow range.
- In other cases it should be okay to use either.
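
The compression caused by an outlier under normalisation can be seen in a small sketch. The sample values, with a single outlier at 1000, are made up for illustration, and the example only demonstrates the min-max effect described above rather than comparing the two methods.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Five typical values plus one extreme outlier (hypothetical data)
values = np.array([10, 11, 12, 13, 14, 1000], dtype=float).reshape(-1, 1)

scaled = MinMaxScaler().fit_transform(values).ravel()

# Width of the interval occupied by the five typical values after scaling
typical_spread = scaled[:5].max() - scaled[:5].min()

print(scaled.round(4))
print(typical_spread)  # 4 / 990, roughly 0.004
```

The outlier takes the value 1, while the five typical values are squeezed into less than 1% of the $[0, 1]$ range, so differences between them become very small.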

An sklearn example of standardisation:

In [3]:
data = [[0, 0], [0, 0], [1, 1], [1, 1]]

In [4]:
scaler = StandardScaler()

In [5]:
scaler.fit(data)

StandardScaler()

In [6]:
scaler.mean_

array([0.5, 0.5])

In [8]:
transformed = scaler.transform(data)
transformed

array([[-1., -1.],
       [-1., -1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [9]:
# The column-wise mean should be 0
transformed.mean(axis=0)

array([0., 0.])

In [11]:
# The column-wise standard deviation should be 1
transformed.std(axis=0)

array([1., 1.])