* The sklearn.preprocessing package provides utilities that change raw feature vectors into a representation that is more suitable for the downstream estimators.
* In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate.
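
To illustrate the point about outliers, here is a minimal sketch comparing StandardScaler with RobustScaler on a small made-up column (the data in `X_outlier` is an assumption chosen for illustration, not part of the original notebook). With default settings, RobustScaler centers on the median and scales by the interquartile range, so a single extreme value has much less influence on how the remaining points are scaled.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical data: one feature with a single large outlier (100.).
X_outlier = np.array([[1.], [2.], [3.], [4.], [100.]])

# StandardScaler uses the mean and standard deviation, and both are
# pulled toward the outlier.
standard_scaled = StandardScaler().fit_transform(X_outlier)

# RobustScaler (default settings) subtracts the median and divides by
# the interquartile range, which the outlier barely affects.
robust_scaled = RobustScaler().fit_transform(X_outlier)

print(standard_scaled.ravel())
print(robust_scaled.ravel())
```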

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not look roughly like standard normally distributed data, that is, Gaussian with zero mean and unit variance.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variances of the same order of magnitude. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and prevent the estimator from learning from the other features as expected.
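
As a rough illustration of the scale problem, the sketch below standardizes a small made-up array by hand with NumPy (the values in `X_demo` are an assumption for illustration only). Before scaling, the second feature's variance is millions of times larger than the first; after subtracting the column mean and dividing by the column standard deviation, both features have unit variance.

```python
import numpy as np

# Hypothetical data: feature 0 lies in [0, 1], feature 1 is in the thousands.
X_demo = np.array([[0.1, 1000.],
                   [0.9, 3000.],
                   [0.5, 2000.]])

# Per-feature (column-wise) statistics; std uses the population formula (ddof=0).
feature_mean = X_demo.mean(axis=0)
feature_std = X_demo.std(axis=0)

# Standardize: subtract the mean, then divide by the standard deviation.
X_standardized = (X_demo - feature_mean) / feature_std

print(X_demo.var(axis=0))          # variances differ by orders of magnitude
print(X_standardized.var(axis=0))  # both variances are now 1
```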

In [26]:
from sklearn import preprocessing
import numpy as np

In [27]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
print(X_train)

[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]


In [28]:
X_scaled = preprocessing.scale(X_train)
X_scaled
# each column of X_scaled now has zero mean and unit variance

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [29]:
X_scaled.mean(axis=0)

array([0., 0., 0.])

In [30]:
X_scaled.std(axis=0)

array([1., 1., 1.])

The preprocessing module further provides a utility class, StandardScaler, that implements the Transformer API. It computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to the testing set.

In [31]:
scaler = preprocessing.StandardScaler().fit(X_train)

In [32]:
scaler.mean_

array([1.        , 0.        , 0.33333333])

In [34]:
scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])

In [35]:
scaler.transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [40]:
# the fitted scaler can be used to transform another set of data
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)

array([[-2.44948974,  1.22474487, -0.26726124]])
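
To make explicit what `transform` does with new data, the following sketch reproduces the result by hand: the test row is shifted by the training means (`mean_`) and divided by the training standard deviations (`scale_`). The variable name `manual` is introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
X_test = np.array([[-1., 1., 0.]])

# Fit on the training data only; the statistics are stored on the scaler.
scaler = StandardScaler().fit(X_train)

# Apply the training statistics to the test data by hand.
manual = (X_test - scaler.mean_) / scaler.scale_

# The manual result should match scaler.transform(X_test).
print(np.allclose(manual, scaler.transform(X_test)))
```

Because the statistics come from the training set alone, the test set does not influence the transformation.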

Scaling features to a range: an alternative to standardization is to scale each feature so that it lies between a given minimum and maximum value (min, max), often zero and one.


In [42]:
X_new_train = np.array([[ 1., -1.,  2.],
                        [ 2.,  0.,  0.],
                        [ 0.,  1., -1.]])
# by default the target range is min = 0 and max = 1

min_max_scaler = preprocessing.MinMaxScaler()
print(min_max_scaler)

MinMaxScaler(copy=True, feature_range=(0, 1))


In [46]:
min_max_scaler.fit_transform(X_new_train)

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

In [48]:
min_max_scaler.scale_

array([0.5       , 0.5       , 0.33333333])

In [49]:
min_max_scaler.min_

array([0.        , 0.5       , 0.33333333])
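
The `scale_` and `min_` attributes can be read as the slope and offset of a per-feature linear map. The sketch below, written under the assumption of the default `feature_range=(0, 1)`, recomputes the scaled data by hand and checks it against MinMaxScaler; the names `data_min`, `data_max`, and `X_manual` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Target range (the MinMaxScaler default).
range_min, range_max = 0.0, 1.0

# Per-feature minimum and maximum of the training data.
data_min = X.min(axis=0)
data_max = X.max(axis=0)

# Map each feature to [0, 1], then stretch and shift to the target range.
X_unit = (X - data_min) / (data_max - data_min)
X_manual = X_unit * (range_max - range_min) + range_min

scaler = MinMaxScaler(feature_range=(range_min, range_max)).fit(X)

# Both the manual formula and X * scale_ + min_ should match transform().
print(np.allclose(X_manual, scaler.transform(X)))
print(np.allclose(X * scaler.scale_ + scaler.min_, scaler.transform(X)))
```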

MaxAbsScaler works in a very similar fashion, but scales the training data so that it lies within the range [-1, 1] by dividing each feature by its largest absolute value. It does not shift or center the data, so it is meant for data that is already centered at zero or for sparse data.

In [57]:
max_abs_scaler = preprocessing.MaxAbsScaler()

In [58]:
max_abs_scaler

MaxAbsScaler(copy=True)

In [60]:
X_train_maxabs = max_abs_scaler.fit_transform(X_new_train)
X_train_maxabs

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [None]:
# scale_ is an attribute of the fitted scaler, not of the transformed array
max_abs_scaler.scale_
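
As a cross-check of the description above, this sketch divides each column by its largest absolute value with plain NumPy and compares the result with MaxAbsScaler. The names `max_abs` and `X_manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Largest absolute value of each feature (column).
max_abs = np.abs(X).max(axis=0)

# Dividing by it maps every feature into [-1, 1] without shifting the data,
# so zeros stay zeros (which is why this suits sparse data).
X_manual = X / max_abs

scaler = MaxAbsScaler().fit(X)

print(scaler.scale_)
print(np.allclose(X_manual, scaler.transform(X)))
```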