Package: `sklearn.preprocessing`

- Changes raw feature vectors into a representation that is more suitable for downstream estimators.
    

## Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn. They might behave badly if the individual features do not more or less look like standard normally distributed data: *Gaussian with zero mean and unit variance*.

The function `scale` provides a quick and easy way to perform this operation on a single array-like dataset:

In [2]:
from sklearn import preprocessing
import numpy as np
X = np.array([[1.,-1.,2.],
              [2., 0.,0.],
              [0., 1.,-1.]])
X_scaled = preprocessing.scale(X)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [3]:
X_scaled.mean(axis = 0)

array([ 0.,  0.,  0.])

In [4]:
X_scaled.std(axis=0)

array([ 1.,  1.,  1.])
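As a minimal sketch of what `scale` computes, the same result can be reproduced with plain NumPy, assuming the default settings (column-wise centering and division by the population standard deviation, i.e. `ddof=0`). The variable names are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-feature (column-wise) mean and population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize by hand: subtract the mean, then divide by the standard deviation.
X_manual = (X - mean) / std

# Compare against the library helper with its default arguments.
same = np.allclose(X_manual, preprocessing.scale(X))
print(same)
```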

The module also provides a utility class, `StandardScaler`, that implements the `Transformer` API: it computes the mean and standard deviation on a training dataset so that the same transformation can later be reapplied to the testing set.


In [5]:
scaler = preprocessing.StandardScaler().fit(X)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [6]:
scaler.mean_

array([ 1.        ,  0.        ,  0.33333333])

In [7]:
scaler.scale_

array([ 0.81649658,  0.81649658,  1.24721913])

In [8]:
scaler.transform(X)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [9]:
scaler.transform([[-1.,1.,0.]])

array([[-2.44948974,  1.22474487, -0.26726124]])
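The point of the fitted scaler is that new data is transformed with the training statistics, not its own. A short sketch, assuming the same toy data as above, checks `transform` against the arithmetic done by hand with the fitted `mean_` and `scale_` attributes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_new = np.array([[-1., 1., 0.]])

# The statistics are estimated from the training data only.
scaler = StandardScaler().fit(X_train)
X_new_scaled = scaler.transform(X_new)

# Same arithmetic by hand, reusing the fitted training statistics.
X_new_manual = (X_new - scaler.mean_) / scaler.scale_
print(np.allclose(X_new_scaled, X_new_manual))
```

Because the test sample is scaled with the training mean and standard deviation, its values are not guaranteed to have zero mean or unit variance.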

### Scaling features to a range
An alternative standardization is to scale each feature to lie between a given minimum and maximum value, often zero and one, or to scale each feature so that its maximum absolute value is one.
Use `MinMaxScaler` or `MaxAbsScaler`, respectively.


In [10]:
X_train = np.array([[1., -1., 2.],
                    [2.,  0., 0.],
                    [0.,  1.,-1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

In [11]:
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

In [12]:
min_max_scaler.scale_

array([ 0.5       ,  0.5       ,  0.33333333])

In [13]:
min_max_scaler.min_

array([ 0.        ,  0.5       ,  0.33333333])
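The `scale_` and `min_` attributes describe the transformation as an affine map, `X * scale_ + min_`. The sketch below, assuming the default `feature_range=(0, 1)`, reproduces the training output from the min-max formula and the test output from the fitted attributes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

# Min-max formula for the default range [0, 1].
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_train_manual = (X_train - col_min) / (col_max - col_min)

scaler = MinMaxScaler().fit(X_train)

# The fitted transformer applies the equivalent affine map to any input.
X_test_affine = X_test * scaler.scale_ + scaler.min_
print(np.allclose(X_test_affine, scaler.transform(X_test)))
```

Test values that fall outside the training minimum and maximum, such as `-3.` and `4.` here, map outside the target range.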

`MaxAbsScaler` works in a very similar fashion, but scales the training data to lie within the range [-1, 1] by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero or for sparse data.
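A brief sketch of `MaxAbsScaler` on the same toy data as the `MinMaxScaler` example; the manual division by the per-column maximum absolute value is included only as a cross-check:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

scaler = MaxAbsScaler()
X_train_maxabs = scaler.fit_transform(X_train)
X_test_maxabs = scaler.transform(X_test)

# Cross-check: divide each column by its largest absolute training value.
max_abs = np.abs(X_train).max(axis=0)
print(np.allclose(X_train_maxabs, X_train / max_abs))
```

No centering is applied, so zeros stay zeros; only the training data is guaranteed to end up inside [-1, 1].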


### Scaling sparse data
`MaxAbsScaler` and `maxabs_scale` were specifically designed for scaling sparse data, and are the recommended way to go about this.
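The following sketch, using a small hand-made `scipy.sparse` matrix as an assumption for illustration, shows that `MaxAbsScaler` returns a sparse result with the same number of stored non-zero entries:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# A mostly-zero matrix stored in compressed sparse row (CSR) format.
X_sparse = sparse.csr_matrix(np.array([[0., 0., 4.],
                                       [2., 0., 0.],
                                       [0., -1., 0.]]))

# Scaling only divides the stored values, so zeros stay implicit.
X_scaled = MaxAbsScaler().fit_transform(X_sparse)

print(sparse.issparse(X_scaled))
print(X_scaled.nnz == X_sparse.nnz)
```

Centering would turn the implicit zeros into non-zero values and destroy the sparsity structure, which is why a scaler that only divides is a natural fit here.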

...
### Scaling data with outliers 

If your data contains many outliers, scaling using the mean and variance of the data is unlikely to work very well. You can use `robust_scale` and `RobustScaler` as drop-in replacements instead.
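As a rough illustration, assuming `RobustScaler`'s default settings (centering on the median and scaling by the interquartile range between the 25th and 75th percentiles), the sketch below scales a single made-up feature that contains one extreme value:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a single extreme outlier (illustrative data).
X = np.array([[1.], [2.], [3.], [4.], [100.]])

scaler = RobustScaler().fit(X)
X_robust = scaler.transform(X)

# Cross-check by hand: subtract the median, divide by the interquartile range.
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
X_manual = (X - median) / (q75 - q25)

print(np.allclose(X_robust, X_manual))
```

The outlier barely affects the median and the quartiles, so the four ordinary values keep a sensible spread after scaling.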
...
### Centering kernel matrices
...
## Normalization 
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples. 

The function `normalize` provides a quick and easy way to perform this operation on a single array-like dataset, using either the `l1` or the `l2` norm:


In [16]:
X = [[ 1., -1., 2.],
     [ 2.,  0., 0.],
     [ 0.,  1.,-1.]]
X_normalized = preprocessing.normalize(X,norm='l2')
X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
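Unlike the scalers above, normalization works row by row. A minimal NumPy sketch of both norms, with illustrative variable names, checked against `normalize`:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# l2: divide each row by its Euclidean length.
l2_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_l2_manual = X / l2_norms

# l1: divide each row by the sum of its absolute values.
l1_norms = np.abs(X).sum(axis=1, keepdims=True)
X_l1_manual = X / l1_norms

print(np.allclose(X_l2_manual, preprocessing.normalize(X, norm='l2')))
print(np.allclose(X_l1_manual, preprocessing.normalize(X, norm='l1')))
```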

### Sparse input
`normalize` and `Normalizer` accept both dense array-like and sparse matrices from scipy.sparse as input. 
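A short sketch of the `Normalizer` transformer on sparse input, assuming a CSR matrix built from the same toy data; the conversion to a dense array is only there to verify the row norms:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import Normalizer

X_sparse = sparse.csr_matrix([[1., -1., 2.],
                              [2., 0., 0.],
                              [0., 1., -1.]])

# Normalizer is stateless: fit() learns nothing from the data.
normalizer = Normalizer(norm='l2')
X_norm = normalizer.fit_transform(X_sparse)

# Each row of the result should have unit Euclidean length.
row_norms = np.linalg.norm(X_norm.toarray(), axis=1)
print(sparse.issparse(X_norm), np.allclose(row_norms, 1.0))
```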

## Binarization

Feature binarization is the process of thresholding numerical features to get boolean values. 
...

Like `Normalizer`, the utility class `Binarizer` is meant to be used in the early stages of `sklearn.pipeline.Pipeline`.


In [17]:
X = [[ 1., -1., 2.],
     [ 2.,  0., 0.],
     [ 0.,  1.,-1.]]
binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer

Binarizer(copy=True, threshold=0.0)

In [18]:
binarizer.transform(X)

array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

In [19]:
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])

The preprocessing module provides a companion function `binarize` to be used when the transformer API is not necessary.

`binarize` and `Binarizer` accept both dense array-like and sparse matrices from scipy.sparse as input. 
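A small sketch of the `binarize` function on both a dense list and a sparse matrix, reusing the threshold of `1.1` from the example above. Values strictly greater than the threshold map to 1 and all others to 0:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import binarize

X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]

# Dense input: returns a new array of zeros and ones.
X_bin = binarize(X, threshold=1.1)

# Sparse input: the result stays sparse.
X_sparse_bin = binarize(sparse.csr_matrix(X), threshold=1.1)

print(X_bin)
print(sparse.issparse(X_sparse_bin))
```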

## Encoding categorical features

Categorical features are often encoded as integers. Such an integer representation cannot be used directly with scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired.

One way to convert categorical features into features that can be used with scikit-learn estimators is a one-of-K or one-hot encoding, which is implemented in `OneHotEncoder`. This estimator transforms each categorical feature with `m` possible values into `m` binary features, exactly one of which is active for each sample.



In [21]:
enc = preprocessing.OneHotEncoder()
enc.fit([[0,0,3],[1,1,0],[0,2,1],[1,0,2]])

OneHotEncoder(categorical_features='all', dtype=<type 'float'>,
       handle_unknown='error', n_values='auto', sparse=True)

In [22]:
enc.transform([[0,1,3]]).toarray()

array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

In the result, the first two numbers encode the first feature, the next set of three numbers the second feature and the last four the third feature. 
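The column layout can be read from the fitted encoder. The sketch below assumes a scikit-learn version that exposes the `categories_` attribute (0.20 or later, which is newer than the version that produced the output above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
enc = OneHotEncoder().fit(X)

# One array of observed categories per input feature.
widths = [len(c) for c in enc.categories_]
print(widths)

# Each feature contributes one block of columns, in order.
encoded = enc.transform([[0, 1, 3]]).toarray()
print(encoded)
```

The block widths correspond to the two, three and four columns described above.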

## Imputation of missing values

## References

- [scikit-learn: Preprocessing data](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)