# Data Preprocessing Techniques
Some common data preprocessing techniques are outlined here. The `preprocessing` module in `sklearn` is the main module used for preprocessing. The functions used here are listed below with links to their documentation:

- [scale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale)


In [2]:
from sklearn import preprocessing
import numpy as np

## Scaling

In [10]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)
print(X_scaled)  
print(X_scaled.mean(axis = 0)) # axis = 0 means evaluated along a column
print(X_scaled.std(axis = 0))

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[ 0.  0.  0.]
[ 1.  1.  1.]
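
To make explicit what `scale` computes, the same result can be reproduced with plain NumPy. This is a minimal sketch that assumes the population standard deviation (`ddof=0`), which is consistent with the output shown above; the names `col_mean`, `col_std` and `X_manual` are illustrative only.

```python
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Per-column mean and population standard deviation (ddof=0)
col_mean = X_train.mean(axis=0)
col_std = X_train.std(axis=0)

# Standardize: subtract the column mean, divide by the column std
X_manual = (X_train - col_mean) / col_std
print(X_manual)
```

Each column of `X_manual` has zero mean and unit standard deviation, matching the `X_scaled` values printed above.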


### Using the `StandardScaler` class
When we scale the test set, we must use the mean and variance computed from the training set. Statistics computed from the test set would differ from one test set to another, so the same raw value would be mapped to different scaled values and would no longer match what the model saw during training. The `StandardScaler` class stores the mean and standard deviation computed from the training set so that they can be applied to other data sets. To disable centering or scaling, set the `with_mean` or `with_std` parameter to `False`, respectively.

In [14]:
# We first fit the StandardScaler to compute the per-column mean and standard deviation
scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler)
print(scaler.mean_) 
print(scaler.scale_)   

# Then we can use transform to scale both the training and test sets
scaler.transform(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)
[ 1.          0.          0.33333333]
[ 0.81649658  0.81649658  1.24721913]


array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

The fitted scaler can then be applied to new data, which is transformed with the same mean and standard deviation that were learned from the training set:

In [15]:
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)

array([[-2.44948974,  1.22474487, -0.26726124]])
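
As a sanity check, the transformation of the test set can be reproduced by hand from the attributes of the fitted scaler. The sketch below assumes default settings (`with_mean=True`, `with_std=True`); `X_test_manual` is an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])

scaler = preprocessing.StandardScaler().fit(X_train)

# Apply the training-set statistics to the test set by hand
X_test_manual = (X_test - scaler.mean_) / scaler.scale_
print(X_test_manual)
```

The test row is centered and scaled with the training statistics only, so its values are not constrained to have zero mean or unit variance.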

Other scalers include [`MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler), which rescales each feature to a given range (by default $[0, 1]$), and [`MaxAbsScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler), which divides each feature by its maximum absolute value.

`MaxAbsScaler` is recommended for scaling sparse matrices.
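
A short sketch of both scalers on the same training matrix used earlier, with default parameters assumed. The variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])

# MinMaxScaler maps each column's training min to 0 and max to 1
min_max = preprocessing.MinMaxScaler().fit(X_train)
X_minmax = min_max.transform(X_train)
X_test_minmax = min_max.transform(X_test)

# MaxAbsScaler divides each column by its maximum absolute value;
# it does not shift the data, so zeros stay zero
max_abs = preprocessing.MaxAbsScaler().fit(X_train)
X_maxabs = max_abs.transform(X_train)

print(X_minmax)
print(X_test_minmax)
print(X_maxabs)
```

Note that `X_test_minmax` contains a negative value: test data that falls outside the range seen during training is mapped outside $[0, 1]$.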

## Binarization
Values less than or equal to a threshold become 0 and values strictly greater than it become 1.

In [20]:
X = [[ 1., -1.,  2.],
      [ 2.,  0.,  0.],
      [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer(threshold=1.1).fit(X)  # default threshold is 0
binarizer.transform(X)

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
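
The same thresholding can be written as a one-line NumPy comparison, which makes the strict inequality visible. This is a sketch of the equivalent operation, not the library's implementation; `X_bin` and `boundary` are illustrative names.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
threshold = 1.1

# Strictly greater than the threshold -> 1.0, otherwise 0.0
X_bin = (X > threshold).astype(float)
print(X_bin)

# A value exactly equal to the threshold maps to 0
boundary = (np.array([1.1]) > threshold).astype(float)
print(boundary)
```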

## One Hot Encoding
[`OneHotEncoder`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

To convert categorical features into features that can be used with scikit-learn estimators, we can use a one-of-$K$ or one-hot encoding, which is implemented in the `OneHotEncoder` class. This transformer converts each categorical feature with $m$ possible values into $m$ binary features, exactly one of which is active for each sample.

In [17]:
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
enc.transform([[0, 1, 3]]).toarray()

array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

In the original data, the first column has two categories, so it is represented by the first two columns of the one-hot encoded data. Similarly, the second and third features have 3 and 4 possible categories and are represented by the next 3 and 4 columns. For the sample `[0, 1, 3]`, the first feature encodes as `[1, 0]`, the second as `[0, 1, 0]` and the third as `[0, 0, 0, 1]`, which concatenate to the row shown above.

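Note that if some category values may be absent from the training data, the number of values per feature has to be set explicitly with `n_values`. (This applies to the scikit-learn version used here; from version 0.20 onward the `categories` parameter serves this purpose, and `n_values` was later removed.) For example,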

In [21]:
enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
# Note that there are missing categorical values for the 2nd and 3rd features
enc.fit([[1, 2, 3], [0, 2, 0]])
enc.transform([[1, 0, 0]]).toarray()

array([[ 0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])
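
For readers on scikit-learn 0.20 or later, the following sketch expresses the same idea with the `categories` parameter, listing the allowed values of each feature explicitly. The category lists are chosen here to mirror `n_values=[2, 3, 4]`.

```python
from sklearn import preprocessing

# One list of allowed values per feature, mirroring n_values=[2, 3, 4]
enc = preprocessing.OneHotEncoder(
    categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]]
)

# Some categories of the 2nd and 3rd features never appear in the training data
enc.fit([[1, 2, 3], [0, 2, 0]])
encoded = enc.transform([[1, 0, 0]]).toarray()
print(encoded)
```

Because the categories are declared up front, the encoder still allocates 2 + 3 + 4 = 9 output columns even though several values are never observed during `fit`.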

## Imputation
[`Imputer`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer)

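The following snippet demonstrates how to replace missing values, encoded as `np.nan`, with the mean of the column (axis 0) in which they occur. The means are computed from the data passed to `fit`, not from the data being transformed. Note that `preprocessing.Imputer` was deprecated in scikit-learn 0.20 and later removed in favor of `sklearn.impute.SimpleImputer`.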

In [23]:
imp = preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X)) 

[[ 4.          2.        ]
 [ 6.          3.66666667]
 [ 7.          6.        ]]


A more complete imputation example is available [here](http://scikit-learn.org/stable/auto_examples/plot_missing_values.html#sphx-glr-auto-examples-plot-missing-values-py).
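
On scikit-learn 0.20 or later, the `Imputer` snippet above can be sketched with `SimpleImputer` from `sklearn.impute`. This class always imputes column-wise, so there is no `axis` argument; `X_imputed` is an illustrative name.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Learn the column means from the training data, ignoring NaNs
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Fill missing entries in new data with the learned column means
X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X)
print(X_imputed)
```

The learned means are 4 for the first column, from (1 + 7) / 2, and about 3.667 for the second, from (2 + 3 + 6) / 3, which reproduces the output shown above.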

## Generating polynomial features
[preprocessing.PolynomialFeatures()](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures)

In [28]:
X = np.arange(6).reshape(3, 2)
print(X)
poly = preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)

[[0 1]
 [2 3]
 [4 5]]


array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])

The first column is a bias vector (the constant vector with all ones).

To see the names of the generated features, and thus which columns are powers and which are interaction terms, we can proceed as follows. (In scikit-learn 1.0 and later this method is named `get_feature_names_out()`.)

In [27]:
poly.get_feature_names()

['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
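
To connect these names to the numbers, the degree-2 expansion can be rebuilt by hand with NumPy, one column per feature name. This is an illustrative sketch for the two-feature case only; `X_poly` is a hypothetical name.

```python
import numpy as np

X = np.arange(6).reshape(3, 2)
x0, x1 = X[:, 0], X[:, 1]

# Columns in the order: 1, x0, x1, x0^2, x0 x1, x1^2
X_poly = np.column_stack([
    np.ones(len(X)),  # bias column
    x0,
    x1,
    x0 ** 2,
    x0 * x1,          # interaction term
    x1 ** 2,
])
print(X_poly)
```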

## Custom transformers
Implement a transformer from an arbitrary function with [FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer).

In [32]:
# taking column wise mean (just to show how the function works) 
transformer = preprocessing.FunctionTransformer(np.mean, kw_args={'axis':0}) 
X = np.array([[0, 1], [2, 3]])
print(transformer.transform(X))

# taking square root (more practical transformation)
transformer = preprocessing.FunctionTransformer(np.sqrt) 
X = np.array([[0, 1], [2, 3]])
print(transformer.transform(X))

[ 1.  2.]
[[ 0.          1.        ]
 [ 1.41421356  1.73205081]]
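
`FunctionTransformer` also accepts an `inverse_func`, which is useful when a transformation must be undone later. The sketch below pairs `np.log1p` with `np.expm1`; the choice of functions is only an example.

```python
import numpy as np
from sklearn import preprocessing

# log(1 + x) forward, exp(x) - 1 backward
transformer = preprocessing.FunctionTransformer(np.log1p, inverse_func=np.expm1)

X = np.array([[0., 1.], [2., 3.]])
X_log = transformer.fit_transform(X)
X_back = transformer.inverse_transform(X_log)

print(X_log)
print(X_back)
```

`X_back` recovers the original matrix up to floating-point error.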
