# 4.3. Preprocessing data

http://scikit-learn.org/stable/modules/preprocessing.html

In [1]:
import numpy as np
from sklearn import preprocessing

## 4.3.1. Standardization

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: **Gaussian with zero mean and unit variance**.

In [2]:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

X_scaled = preprocessing.scale(X)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [3]:
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))

[ 0.  0.  0.] [ 1.  1.  1.]


In [4]:
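# Compute the mean and standard deviation on the training data first,
# so the same statistics can later be applied to the test data
scaler = preprocessing.StandardScaler().fit(X)

# Per-feature mean and standard deviation
scaler.mean_, scaler.scale_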

(array([ 1.        ,  0.        ,  0.33333333]),
 array([ 0.81649658,  0.81649658,  1.24721913]))

In [5]:
scaler.transform(X)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
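
To make the arithmetic behind the scaler explicit, here is a minimal NumPy-only sketch of the same computation. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and `X_test` is a hypothetical unseen sample added purely for illustration.

```python
import numpy as np

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Per-column statistics, computed on the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

# Standardize: subtract the column mean, divide by the column std
X_train_scaled = (X_train - mean) / std

# Hypothetical unseen sample, scaled with the *training* statistics
X_test = np.array([[-1., 1., 0.]])
X_test_scaled = (X_test - mean) / std
print(X_test_scaled)
```

Reusing the training mean and standard deviation on the test sample mirrors what `scaler.transform` does after `fit`, and keeps test-set information out of the preprocessing step.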

## 4.3.2. Normalization

- Normalization is the process of **scaling individual samples to have unit norm**.
- This process can be useful if you plan **to use a quadratic form, such as the dot product or another kernel, to quantify the similarity of any pair of samples**.

In [6]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
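
As a rough illustration of why unit-norm rows matter for similarity, the sketch below repeats the L2 normalization by hand with NumPy and then takes dot products between rows. It assumes no row is all zeros, since that would divide by zero here.

```python
import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Euclidean (L2) norm of each row, kept as a column for broadcasting
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / norms

# With unit-norm rows, the dot product of two rows is their cosine similarity
similarity = X_normalized @ X_normalized.T
print(similarity)
```

The diagonal of `similarity` is 1 because every sample is perfectly similar to itself, and the off-diagonal entries lie in [-1, 1].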

## 4.3.3. Binarization

Feature binarization is the process of **thresholding numerical features to get boolean values**.

In [7]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer.transform(X)

array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
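
The cutoff does not have to be zero. The short sketch below uses the `threshold` parameter of `Binarizer` on the same data. The value 1.1 is an arbitrary choice for illustration.

```python
from sklearn.preprocessing import Binarizer

X = [[1., -1.,  2.],
     [2.,  0.,  0.],
     [0.,  1., -1.]]

# Values strictly greater than the threshold become 1, all others become 0
binarizer = Binarizer(threshold=1.1)
binarized = binarizer.transform(X)
print(binarized)
```

With this threshold only the two entries equal to 2 survive as ones.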

## 4.3.4. Encoding categorical features

In [8]:
# Cat1: ["male", "female"]
# Cat2: ["from Europe", "from US", "from Asia"]
# Cat3: ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]

enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
enc.transform([[0, 1, 3]]).toarray()

array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

In [9]:
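# Specify the possible values of each feature explicitly with the `categories` parameter
# (older scikit-learn releases used `n_values`, which has since been removed)
enc = preprocessing.OneHotEncoder(categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])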

# Note that there are missing categorical values for the 2nd and 3rd features
enc.fit([[1, 2, 3], [0, 2, 0]])  

enc.transform([[0, 1, 3]]).toarray()

array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
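
To see how the nine output columns map back to the three input features, the sketch below inspects the fitted `categories_` attribute. It also passes `handle_unknown='ignore'`, an option assumed here so that a value never seen during `fit` is encoded as all zeros instead of raising an error. The sample `[0, 5, 3]` is hypothetical.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of learned categories per input feature: 2 + 3 + 4 = 9 columns
cats = [c.tolist() for c in enc.categories_]
print(cats)

# Known categories: one column set to 1 within each feature's block
encoded = enc.transform([[0, 1, 3]]).toarray()

# Hypothetical sample whose second feature (5) was never seen during fit:
# its three-column block is left as all zeros
encoded_unknown = enc.transform([[0, 5, 3]]).toarray()
print(encoded_unknown)
```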

## 4.3.5. Imputation of missing values

The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean value of the columns (axis 0) that contain the missing values:

In [10]:
import numpy as np
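from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

# SimpleImputer always imputes column-wise, so it takes no axis argument
imp = SimpleImputer(missing_values=np.nan, strategy='mean')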
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
imp.statistics_

array([ 4.        ,  3.66666667])

In [11]:
X = [[np.nan, 2], [6, np.nan], [7, 6]]
imp.transform(X)

array([[ 4.        ,  2.        ],
       [ 6.        ,  3.66666667],
       [ 7.        ,  6.        ]])
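
The mean is sensitive to outliers, so a different fill statistic is sometimes preferable. The sketch below assumes a recent scikit-learn release, where imputation is provided by `sklearn.impute.SimpleImputer`, and uses `strategy='median'`. The training data, including the outlier 100, is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data; the first column contains an outlier (100)
X_train = [[1, 2], [np.nan, 3], [7, 6], [100, 4]]

imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imp_median.fit(X_train)

# Column medians, ignoring NaN: median(1, 7, 100) and median(2, 3, 4, 6)
print(imp_median.statistics_)

X_new = [[np.nan, 2], [6, np.nan]]
X_filled = imp_median.transform(X_new)
print(X_filled)
```

Here the first column is filled with 7 rather than the mean of 36, which the single outlier would otherwise dominate.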

## 4.3.6. Generating polynomial features

Often it’s useful to add complexity to the model by considering nonlinear features of the input data. A simple and common approach is to use polynomial features, which add higher-order and interaction terms of the original features.

With `PolynomialFeatures(2)`, the features of X are transformed from (X<sub>1</sub>, X<sub>2</sub>) to (1, X<sub>1</sub>, X<sub>2</sub>, X<sub>1</sub><sup>2</sup>, X<sub>1</sub>X<sub>2</sub>, X<sub>2</sub><sup>2</sup>).

In [12]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)

X = np.arange(6).reshape(3, 2)
poly.fit_transform(X)  

array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])
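
When only the cross terms are wanted, the pure powers can be dropped. The sketch below uses the `interaction_only` option of `PolynomialFeatures` on the same input, so each row becomes (1, X1, X2, X1·X2).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)

# interaction_only=True keeps products of distinct features
# and omits squared terms such as X1^2 and X2^2
poly_inter = PolynomialFeatures(degree=2, interaction_only=True)
X_inter = poly_inter.fit_transform(X)
print(X_inter)
```

The result has four columns instead of six, because the two squared terms are no longer generated.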

## 4.3.7. Custom transformers

Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing. You can implement a transformer from an arbitrary function with `FunctionTransformer`.

In [13]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p)

X = np.array([[0, 1], [2, 3]])
transformer.transform(X)

array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])
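
If the transformation needs to be undone later, for example to report results in the original units, an inverse function can be supplied as well. The sketch below pairs `np.log1p` with `np.expm1` through the `inverse_func` parameter. The choice of function pair is an illustration, not a recommendation.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.expm1 is the exact inverse of np.log1p
transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X = np.array([[0., 1.], [2., 3.]])
X_log = transformer.fit_transform(X)

# Map the transformed values back to the original scale
X_back = transformer.inverse_transform(X_log)
print(X_back)
```

Up to floating-point rounding, `X_back` matches the original `X`.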