Package: `sklearn.preprocessing`

- Changes raw feature vectors into a representation that is more suitable for downstream estimators.

## Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn. They may behave badly if the individual features do not look more or less like standard normally distributed data: *Gaussian with zero mean and unit variance*.

The function `scale` provides a quick and easy way to perform this operation on a single array-like dataset:

In [1]:
from sklearn import preprocessing
import numpy as np
X = np.array([[1.,-1.,2.],
              [2., 0.,0.],
              [0., 1.,-1.]])
X_scaled = preprocessing.scale(X)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [2]:
X_scaled.mean(axis=0)

array([ 0.,  0.,  0.])

In [3]:
X_scaled.std(axis=0)

array([ 1.,  1.,  1.])
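
To make the operation explicit, here is a minimal sketch of what `scale` computes for this array, assuming the default settings: each column has its mean subtracted and is then divided by its population standard deviation (`ddof=0`). The names `mean`, `std`, `X_manual` and `same` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-feature (column-wise) mean and population standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)  # NumPy's default ddof=0

# Standardize by hand: z = (x - mean) / std
X_manual = (X - mean) / std

# Compare with preprocessing.scale, up to floating-point error
same = np.allclose(X_manual, preprocessing.scale(X))
print(same)
```

Note that this sketch would divide by zero for a constant column; the data above has no such column.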

Another utility is the `StandardScaler` class, which implements the `Transformer` API: it computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to the test set.


In [4]:
scaler = preprocessing.StandardScaler().fit(X)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [5]:
scaler.mean_

array([ 1.        ,  0.        ,  0.33333333])

In [6]:
scaler.scale_

array([ 0.81649658,  0.81649658,  1.24721913])

In [7]:
scaler.transform(X)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [8]:
scaler.transform([[-1.,1.,0.]])

array([[-2.44948974,  1.22474487, -0.26726124]])
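
The point of the fitted scaler is that new data is transformed with the statistics learned from the training data, not with its own. A minimal sketch, assuming the same training array as above; `X_new`, `manual` and `X_back` are illustrative names, and `inverse_transform` is used here only to show the round trip.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_new = np.array([[-1., 1., 0.]])

# Learn mean_ and scale_ from the training data only
scaler = StandardScaler().fit(X_train)

# transform() applies the stored training statistics to unseen rows
X_new_scaled = scaler.transform(X_new)

# The same result written out by hand
manual = (X_new - scaler.mean_) / scaler.scale_

# inverse_transform() maps scaled values back to the original units
X_back = scaler.inverse_transform(X_new_scaled)

print(X_new_scaled)
print(X_back)
```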

### Scaling features to a range
Example: scaling each feature to lie between a given minimum and maximum value, often zero and one, or scaling so that the maximum absolute value of each feature is one.
Use `MinMaxScaler` or `MaxAbsScaler`, respectively.


In [9]:
X_train = np.array([[1., -1., 2.],
                    [2.,  0., 0.],
                    [0.,  1.,-1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

In [10]:
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

In [11]:
min_max_scaler.scale_

array([ 0.5       ,  0.5       ,  0.33333333])

In [12]:
min_max_scaler.min_

array([ 0.        ,  0.5       ,  0.33333333])
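
The `scale_` and `min_` attributes describe an affine map, `X * scale_ + min_`. A minimal sketch of how they can be derived from the training data, assuming the default `feature_range=(0, 1)`; the names `lo`, `hi`, `scale`, `offset` and `X_manual` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

lo, hi = 0.0, 1.0  # target range, matching the default feature_range

# Per-feature minimum and maximum of the training data
data_min = X_train.min(axis=0)
data_max = X_train.max(axis=0)

# Affine map: X_scaled = X * scale + offset
scale = (hi - lo) / (data_max - data_min)
offset = lo - data_min * scale
X_manual = X_train * scale + offset

# Compare with the fitted estimator's attributes and output
mms = MinMaxScaler(feature_range=(lo, hi)).fit(X_train)
print(np.allclose(scale, mms.scale_), np.allclose(offset, mms.min_))
print(np.allclose(X_manual, mms.transform(X_train)))
```

Because the map is fitted on the training data, test values outside the training range land outside the target range, as the `-1.5` and `1.66666667` entries above show.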

`MaxAbsScaler` works in a very similar fashion, but scales so that the training data lies within the range [-1, 1] by dividing each feature by its largest absolute value. It is meant for data that is already centered at zero or for sparse data.
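
A minimal sketch of `MaxAbsScaler` on the same training and test arrays used for `MinMaxScaler` above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

max_abs_scaler = MaxAbsScaler()

# Each column is divided by its largest absolute value: 2, 1 and 2 here
X_train_maxabs = max_abs_scaler.fit_transform(X_train)

# Test data reuses the training maxima, so it can fall outside [-1, 1]
X_test_maxabs = max_abs_scaler.transform(X_test)

print(X_train_maxabs)
print(X_test_maxabs)
print(max_abs_scaler.scale_)
```

Unlike `MinMaxScaler`, no offset is subtracted, so zero entries stay at zero.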


### Scaling sparse data
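
Centering sparse data would turn its zeros into non-zeros and destroy the sparsity, so scalers that only divide are the usual choice. A minimal sketch, assuming SciPy is installed; the small matrix `X_sparse` is made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# A small, mostly-zero matrix stored in CSR format (illustrative data)
X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 0., 0.],
                                       [0., -4., 0.]]))

# MaxAbsScaler only divides, so zeros stay zero and the result stays sparse
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler accepts sparse input only when centering is disabled
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)

print(sparse.issparse(X_maxabs), X_maxabs.nnz)
print(X_maxabs.toarray())
```

With the default `with_mean=True`, `StandardScaler` raises an error on sparse input rather than silently producing a dense array.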

## References

- [scikit-learn: Preprocessing data](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)