In [1]:
import numpy as np
from sklearn import preprocessing
import matplotlib
import pandas
import seaborn

## Mean removal

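Mean removal subtracts the mean from each feature so that the feature is centered on zero, which removes any constant offset (bias) from it. `preprocessing.scale` also divides each feature by its standard deviation, so the result has unit variance as well as zero mean.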

In [2]:
input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

In [4]:
data_standardized = preprocessing.scale(input_data)
print("\nMean = ", data_standardized.mean(axis = 0))
print("Std deviation = ", data_standardized.std(axis = 0))


Mean =  [ 5.55111512e-17 -3.70074342e-17  0.00000000e+00 -1.85037171e-17]
Std deviation =  [1. 1. 1. 1.]
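
The same result can be reproduced by hand, which makes the operation explicit. This is a minimal sketch using only NumPy; the names `col_mean`, `col_std` and `data_manual` are illustrative and not part of the original notebook. It assumes the population standard deviation (`ddof=0`), which is NumPy's default for `std`.

```python
import numpy as np

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

# Per-feature (column) statistics
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)  # population standard deviation (ddof=0)

# Center each column on zero, then rescale it to unit standard deviation
data_manual = (input_data - col_mean) / col_std

print("Mean = ", data_manual.mean(axis=0))
print("Std deviation = ", data_manual.std(axis=0))
```

The printed means are not exactly zero but are on the order of floating-point rounding error, as in the output above.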


## Scaling

Different features in a dataset can span very different ranges of values. Scaling maps each feature onto a common, specified range (here 0 to 1) so that all features are measured on comparable terms.

In [7]:
data_scaler = preprocessing.MinMaxScaler(feature_range = (0, 1))
data_scaled = data_scaler.fit_transform(input_data)
print("\nMin max scaled data = ", data_scaled)


Min max scaled data =  [[1.         0.         1.         0.        ]
 [0.         1.         0.27118644 1.        ]
 [0.33333333 0.84444444 0.         0.2       ]]
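
For the range (0, 1), min-max scaling amounts to computing `(x - min) / (max - min)` for each feature column. The following sketch applies that formula directly with NumPy; `col_min`, `col_max` and `data_minmax` are illustrative names, and the sketch assumes no column is constant (otherwise the denominator would be zero).

```python
import numpy as np

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

# Per-feature (column) minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Map each column linearly so its minimum becomes 0 and its maximum becomes 1
data_minmax = (input_data - col_min) / (col_max - col_min)

print("Min max scaled data = ", data_minmax)
```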


## Normalization

Normalization adjusts the values in each feature vector so that they are measured on a common scale. Here, each sample (row) is rescaled so that the absolute values of its entries sum to 1, which is the L1 norm.

In [9]:
data_normalized = preprocessing.normalize(input_data, norm = 'l1')
print("\nL1 normalized data = ", data_normalized)


L1 normalized data =  [[ 0.21582734 -0.10791367  0.21582734 -0.46043165]
 [ 0.          0.35714286 -0.1547619   0.48809524]
 [ 0.0952381   0.21904762 -0.27619048 -0.40952381]]
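
Because the data contains negative values, it is the absolute values in each row that sum to 1, not the raw values. A short NumPy sketch of the same computation makes this visible; `row_l1` and `data_l1` are illustrative names, and the sketch assumes no row consists entirely of zeros.

```python
import numpy as np

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

# L1 norm of each sample (row): the sum of absolute values
row_l1 = np.abs(input_data).sum(axis=1, keepdims=True)

# Divide every row by its own L1 norm
data_l1 = input_data / row_l1

print("L1 normalized data = ", data_l1)
print("Sum of absolute values per row = ", np.abs(data_l1).sum(axis=1))
```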


## One Hot Encoding

Some features take only a few distinct, widely scattered numerical values, and the values themselves may not need to be stored. In such situations you can use the One Hot Encoding technique.

If a feature has k distinct values, it is transformed into a k-dimensional vector in which exactly one entry is 1 and all other entries are 0.

In [12]:
encoder = preprocessing.OneHotEncoder()
encoder.fit([  [0, 2, 1, 12], 
               [1, 3, 5, 3], 
               [2, 3, 2, 12], 
               [1, 2, 4, 3]
])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print("\nEncoded vector =", encoded_vector)


Encoded vector = [[0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
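
The 11 entries of the encoded vector come from the number of distinct values in each column of the fitted data: 3 + 2 + 4 + 2. The sketch below rebuilds the vector by hand with NumPy to show how each column contributes. The names `train`, `sample`, `pieces` and `encoded` are illustrative, and the sketch assumes that each column's categories are its sorted distinct values and that the sample contains only values seen during fitting.

```python
import numpy as np

train = np.array([[0, 2, 1, 12],
                  [1, 3, 5, 3],
                  [2, 3, 2, 12],
                  [1, 2, 4, 3]])
sample = np.array([2, 3, 5, 3])

pieces = []
for col in range(train.shape[1]):
    # Sorted distinct values seen in this column
    categories = np.unique(train[:, col])
    # 1.0 at the position of the sample's value, 0.0 everywhere else
    pieces.append((categories == sample[col]).astype(float))

# Concatenate the per-column one-hot blocks into a single vector
encoded = np.concatenate(pieces)
print("Encoded vector =", encoded)
```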
