In the real world, data is rarely perfect. You need to spend a lot of time on data preprocessing, such as cleaning, scaling, and normalizing. Data preprocessing may be the most important step in the whole machine learning process. You may have heard the phrase "garbage in, garbage out": if the data quality is low, then no matter how sophisticated the model is, it will not produce a good result. A commonly quoted rule of thumb is that engineers spend around 70 percent of their time processing data.

The **preprocessing** package of scikit-learn (`sklearn.preprocessing`) provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
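
The transformer classes share the same `fit` / `transform` interface. The sketch below is a minimal illustration of that pattern, assuming a hypothetical split into `X_train` and `X_test` (these arrays are made up for the example): the statistics are learned from the training rows only and then reused on new rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features, three training rows and one test row.
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)                         # learn per-feature min and max from the training rows
X_train_scaled = scaler.transform(X_train)  # apply the learned statistics
X_test_scaled = scaler.transform(X_test)    # reuse the same statistics on unseen rows

print(X_train_scaled)
print(X_test_scaled)  # the second value is above 1 because 400 exceeds the training maximum
```

Fitting on the training data only is a common design choice, because it keeps information from the test rows out of the learned statistics.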

>`Notice`: There are many preprocessing types and many ways to apply them. In this lesson, we will cover some of the most commonly used methods. If you want to learn more, just launch the `Jupyter` file at the end of this lesson.

## Scale numerical features

Most of the time, the features in your dataset vary widely in range. This is a problem because many machine learning algorithms (k-nearest neighbors and k-means, for example) use Euclidean distance to measure how far apart two data points are, so the features with large values dominate the distance. To solve this problem, you need to scale your data so that the features are in a comparable range. There are many ways to do this.
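
To make the problem concrete, here is a small sketch with three hypothetical samples (the values and the `ranges` array are invented for illustration). Before scaling, the large-valued feature decides which point looks closer; after dividing each feature by its assumed range, the ordering flips.

```python
import numpy as np

# Hypothetical samples: feature 0 is a small count, feature 1 is a large amount.
a = np.array([2.0, 9000.0])
b = np.array([9.0, 9010.0])   # very different from a in feature 0, close in feature 1
c = np.array([2.0, 9500.0])   # identical to a in feature 0, further away in feature 1

# Raw Euclidean distances: feature 1 dominates, so b looks much closer to a than c does.
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print(dist_ab, dist_ac)

# Assumed value ranges of the two features (illustrative numbers only).
ranges = np.array([7.0, 9900.0])

# After dividing by the ranges, both features contribute on a comparable scale.
dist_ab_scaled = np.linalg.norm((a - b) / ranges)
dist_ac_scaled = np.linalg.norm((a - c) / ranges)
print(dist_ab_scaled, dist_ac_scaled)
```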

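First, let's create a 2D array for our examples. As you can see from the output below, it is a 4 × 4 matrix: four samples with four features. The first two features take small values (2 to 9), while the last two take large values (100 to 9999). Because the values are random, your own output will differ.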

In [4]:
import numpy as np

X = np.random.randint(2, 10, size=(4,2))
X2 = np.random.randint(100, 10000, size=(4,2))
X = np.concatenate((X, X2), axis=1)
print("The original matrix")
print(X)

The original matrix
[[   5    4 8997 9823]
 [   6    4 8541 1570]
 [   6    2 3740 3341]
 [   8    4 7377 8485]]


### MinMax

It rescales each feature so that its values lie between 0 and 1:

$$\frac{x_{i}-\min (x)}{\max (x)-\min (x)}$$

In [5]:
import sklearn.preprocessing as preprocessing

minmax = preprocessing.MinMaxScaler()
minmax.fit(X)
X_minmax = minmax.transform(X)
print("The transform data using min-max scaler")
print(X_minmax)

The transform data using min-max scaler
[[0.         1.         1.         1.        ]
 [0.33333333 1.         0.91325851 0.        ]
 [0.33333333 0.         0.         0.21458863]
 [1.         1.         0.69183945 0.83787714]]
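
As a sanity check, the formula can be applied by hand with NumPy and compared with the scaler. This is only a sketch: it hard-codes the matrix printed above, and the helper names `col_min` and `col_max` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The matrix printed above, hard-coded so the example is reproducible.
X = np.array([[5, 4, 8997, 9823],
              [6, 4, 8541, 1570],
              [6, 2, 3740, 3341],
              [8, 4, 7377, 8485]], dtype=float)

col_min = X.min(axis=0)   # per-feature (column) minimum
col_max = X.max(axis=0)   # per-feature (column) maximum

# Apply (x - min) / (max - min) column by column via broadcasting.
X_manual = (X - col_min) / (col_max - col_min)
X_sklearn = MinMaxScaler().fit_transform(X)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```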


### Standard

This scaler centers each feature at zero mean and scales it to unit variance. It works best when the feature roughly follows a normal distribution. The mean and standard deviation are calculated separately for each feature you want to scale.

$$\frac{x_{i}-\operatorname{mean}(\boldsymbol{x})}{\operatorname{stdev}(\boldsymbol{x})}$$

In [6]:
std = preprocessing.StandardScaler()
std.fit(X)
X_std = std.transform(X)
print("The transform data using Standard scaler")
print(X_std)

The transform data using Standard scaler
[[-1.14707867  0.57735027  0.88859948  1.16811018]
 [-0.22941573  0.57735027  0.66757051 -1.231047  ]
 [-0.22941573 -1.73205081 -1.65953496 -0.71621514]
 [ 1.60591014  0.57735027  0.10336497  0.77915195]]
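
The sketch below reproduces the first column of the result by hand. One detail worth noting: the values match when the population standard deviation is used (`ddof=0`, which is NumPy's default), not the sample standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First column of the matrix above, kept 2D because scalers expect (n_samples, n_features).
X_col = np.array([[5.0], [6.0], [6.0], [8.0]])

mean = X_col.mean(axis=0)
std = X_col.std(axis=0)          # np.std uses ddof=0 by default (population standard deviation)

X_manual = (X_col - mean) / std
X_sklearn = StandardScaler().fit_transform(X_col)

print(X_manual.ravel())
print(np.allclose(X_manual, X_sklearn))
```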


### Robust

This scaler follows the same idea as the previous ones, but it is built on statistics that are robust to outliers: it subtracts the median and divides by the interquartile range (the difference between the third and first quartiles) instead of using the min and max. The reason is that outliers in your dataset can make the maximum or minimum unusually high or low. To limit this effect, the robust scaler relies on the median and the interquartile range.

$$\frac{x_{i}-\operatorname{median}(\boldsymbol{x})}{Q_{3}(\boldsymbol{x})-Q_{1}(\boldsymbol{x})}$$

In [7]:
robust = preprocessing.RobustScaler()
robust.fit(X)
X_robust = robust.transform(X)
print("The transform data using robust scaler")
print(X_robust)

The transform data using robust scaler
[[-1.33333333  0.          0.47456852  0.66033354]
 [ 0.          0.          0.26608755 -0.73346   ]
 [ 0.         -4.         -1.92890616 -0.43436774]
 [ 2.66666667  0.         -0.26608755  0.43436774]]
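
The output can be reproduced by hand. With its default settings, `RobustScaler` subtracts the per-feature median and divides by the interquartile range, which the following sketch checks against the hard-coded matrix from above.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# The matrix printed above, hard-coded so the example is reproducible.
X = np.array([[5, 4, 8997, 9823],
              [6, 4, 8541, 1570],
              [6, 2, 3740, 3341],
              [8, 4, 7377, 8485]], dtype=float)

median = np.median(X, axis=0)                 # per-feature median
q1, q3 = np.percentile(X, [25, 75], axis=0)   # per-feature first and third quartiles

# Center on the median, then divide by the interquartile range.
X_manual = (X - median) / (q3 - q1)
X_sklearn = RobustScaler().fit_transform(X)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```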


### MaxAbs

This scaler is very similar to the MinMax scaler. The difference is that it does not use the minimum: it simply divides each feature by its maximum absolute value, so the results lie between -1 and 1.

$$\frac{x_{i}}{\max (|\boldsymbol{x}|)}$$

In [8]:
maxabs = preprocessing.MaxAbsScaler()
maxabs.fit(X)
X_maxabs = maxabs.transform(X)
print("The transform data using MaxAbs scaler")
print(X_maxabs)

The transform data using MaxAbs scaler
[[0.625      1.         1.         1.        ]
 [0.75       1.         0.94931644 0.15982897]
 [0.75       0.5        0.41569412 0.34012013]
 [1.         1.         0.81993998 0.86378907]]
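
The matrix above contains only positive values, so the effect of the absolute value is not visible. The sketch below uses a small hypothetical array with negative entries (invented for this example) to show that the result lies in [-1, 1] and that zeros stay zero.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical data with negative values and a zero.
X_signed = np.array([[-4.0,  2.0],
                     [ 2.0,  0.0],
                     [ 1.0, -1.0]])

max_abs = np.abs(X_signed).max(axis=0)   # per-feature maximum absolute value: [4, 2]
X_manual = X_signed / max_abs
X_sklearn = MaxAbsScaler().fit_transform(X_signed)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```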


### Normalization

Normalization is also a form of scaling, but it works on rows rather than columns: it rescales each individual sample to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot product or any other kernel to quantify the similarity of any pair of samples.

The **Normalizer** class used below (and the equivalent function `normalize`) provides a quick and easy way to perform this operation on a single array-like dataset, using either the **l1** or **l2** norm. The default norm is **l2**.

In [9]:
norml2 = preprocessing.Normalizer()
norml2.fit(X)
X_norm = norml2.transform(X)
print("The transform data using Normalizer with L2 norm")
print("------------------------------------------------")
print(X_norm)

print("------------------------------------------------")
print("------------------------------------------------")
print("------------------------------------------------")

norml1 = preprocessing.Normalizer(norm='l1')
norml1.fit(X)
X_norm = norml1.transform(X)
print("The transform data using Normalizer with L1 norm")
print("------------------------------------------------")
print(X_norm)

The transform data using Normalizer with L2 norm
------------------------------------------------
[[3.75359531e-04 3.00287625e-04 6.75421940e-01 7.37431334e-01]
 [6.90917700e-04 4.60611800e-04 9.83521346e-01 1.80790132e-01]
 [1.19641800e-03 3.98805999e-04 7.45767219e-01 6.66205422e-01]
 [7.11524628e-04 3.55762314e-04 6.56114648e-01 7.54660809e-01]]
------------------------------------------------
------------------------------------------------
------------------------------------------------
The transform data using Normalizer with L1 norm
------------------------------------------------
[[2.65547825e-04 2.12438260e-04 4.77826757e-01 5.21695257e-01]
 [5.92826796e-04 3.95217864e-04 8.43888944e-01 1.55123012e-01]
 [8.46381718e-04 2.82127239e-04 5.27577938e-01 4.71293553e-01]
 [5.03968754e-04 2.51984377e-04 4.64722187e-01 5.34521860e-01]]
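
To see what "unit norm" means, the sketch below normalizes a small hypothetical array by hand and with the `normalize` function. Each row is divided by its own norm, so the operation is independent of the other rows.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical data: two samples (rows) with two features each.
X_small = np.array([[3.0, 4.0],
                    [1.0, 0.0]])

# L2: divide each row by the square root of the sum of its squared values.
l2_manual = X_small / np.linalg.norm(X_small, ord=2, axis=1, keepdims=True)
# L1: divide each row by the sum of its absolute values.
l1_manual = X_small / np.abs(X_small).sum(axis=1, keepdims=True)

l2_sklearn = normalize(X_small, norm="l2")
l1_sklearn = normalize(X_small, norm="l1")

print(l2_manual)   # first row becomes [0.6, 0.8]
print(l1_manual)   # first row becomes [3/7, 4/7]
```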


## Non-linear feature mapping for numerical features

Put simply, for some numerical features we want to apply a non-linear mapping, such as binarizing the values according to a threshold or mapping them according to their quantiles.

### Binarizer

**Binarizer** sets feature values to 0 or 1 according to a threshold. In this example, the threshold is 0.7, so values larger than 0.7 are set to 1 and all other values, including 0.7 itself, are set to 0.

In [10]:
Xb = np.array([[0.2, 0.4, 0.9, 0.7, 0.1, 0.8],
               [0.8, 0.1, 0.2, 0.8, 0.1, 0.4]])
binary = preprocessing.Binarizer(threshold=0.7)
X_binary = binary.transform(Xb)
print("The original data")
print(Xb)
print("The transform data using Binarizer with threshold 0.7")
print(X_binary)

The original data
[[0.2 0.4 0.9 0.7 0.1 0.8]
 [0.8 0.1 0.2 0.8 0.1 0.4]]
The transform data using Binarizer with threshold 0.7
[[0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0.]]
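
The same result can be obtained with a plain NumPy comparison, which makes the strict "greater than" rule explicit. This is a sketch for illustration, not a replacement for the transformer.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

Xb = np.array([[0.2, 0.4, 0.9, 0.7, 0.1, 0.8],
               [0.8, 0.1, 0.2, 0.8, 0.1, 0.4]])
threshold = 0.7

# Strict comparison: a value equal to the threshold maps to 0.
X_manual = (Xb > threshold).astype(float)
X_sklearn = Binarizer(threshold=threshold).fit_transform(Xb)

print(X_manual)
print(np.array_equal(X_manual, X_sklearn))
```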


### Quantile

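**QuantileTransformer** transforms features using quantile information. By default, it maps each feature to the range between 0 and 1 according to where each value falls among the quantiles of that feature. In this example, we pass **n_quantiles=3**, so three reference quantiles (the minimum, the median, and the maximum of each feature) are used and the values in between are interpolated.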

In [11]:
Xq = np.random.randint(1, 50, size=(4, 4))
quant = preprocessing.QuantileTransformer(n_quantiles=3)
quant.fit(Xq)
X_quant = quant.transform(Xq)
print("The original data")
print(Xq)
print("The transform data using QuantileTransformer with 3 quantiles")
print(X_quant)

The original data
[[20 36 45 45]
 [ 2 29 31  7]
 [45 24 12 44]
 [33 27 39 36]]
The transform data using QuantileTransformer with 3 quantiles
[[0.36734694 1.         1.         1.        ]
 [0.         0.5625     0.41304348 0.        ]
 [1.         0.         0.         0.9       ]
 [0.67567568 0.375      0.7        0.43939394]]
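
The mapping is non-linear, which is easiest to see on a feature with an outlier. The sketch below uses a hypothetical single-feature array (invented for this example) and compares the quantile transform with min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

# Hypothetical feature with one outlier (100).
X_skewed = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# n_quantiles is set to the number of samples, so every value is its own reference quantile.
quant = QuantileTransformer(n_quantiles=5)
X_quantile = quant.fit_transform(X_skewed)
X_minmax = MinMaxScaler().fit_transform(X_skewed)

print(X_quantile.ravel())  # evenly spread between 0 and 1, regardless of the outlier
print(X_minmax.ravel())    # the first four values are squeezed close to 0
```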


## Deal with categorical features

Until now, all the data we have talked about has been numerical. In fact, a lot of real-world data is categorical. In NLP, for instance, the raw data is text rather than numbers. In this section, we show some methods to transform categorical data into numerical types.

### One-hot encoder

**One-hot** is a widely used method to encode categorical data. For example, suppose you have a feature named **size** which has three values, "big", "medium", and "small". If you encode this feature with one-hot, you get three new features, "big", "medium", and "small". For each sample, exactly one of them is assigned a value of 1 and the others are 0.

In [12]:
feasize = np.array([["big"], ["medium"], ["small"]])
onehot = preprocessing.OneHotEncoder()
onehot.fit(feasize)
feasize_onehot = onehot.transform(feasize).toarray()
print("The original data")
print(feasize)
print("The transform data using OneHotEncoder")
print(feasize_onehot)

The original data
[['big']
 ['medium']
 ['small']]
The transform data using OneHotEncoder
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
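
Two practical details are worth a quick sketch: the fitted encoder stores the column order in its `categories_` attribute, and the `handle_unknown="ignore"` option encodes a category that was not seen during fitting as a row of zeros instead of raising an error. The value "huge" below is a made-up unseen category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

sizes = np.array([["big"], ["medium"], ["small"]])

# With handle_unknown="ignore", unseen categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(sizes)

# One array of categories per input feature, in sorted order.
print(encoder.categories_)

# "huge" was not seen during fit, so its row contains only zeros.
new_sizes = np.array([["medium"], ["huge"]])
encoded = encoder.transform(new_sizes).toarray()
print(encoded)
```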


### Label encoder

Sometimes the labels are strings rather than numbers, and we want to convert them to integers that start from 0. For binary classification, the labels become 0 and 1. If there are three classes, the labels become 0, 1, and 2. **LabelEncoder** encodes labels with values between 0 and n_classes-1.

As you can see from the example below, **Sun** is encoded as **3** and **Moon** is encoded as **2**. The classes are numbered in sorted order, and the misspelled **Monn** in the input counts as a class of its own (encoded as **1**), so this example has five classes rather than four.

In [16]:
targets = np.array(["Sun", "Sun", "Moon", "Earth", "Monn", "Venus"])
labelenc = preprocessing.LabelEncoder()
labelenc.fit(targets)
targets_trans = labelenc.transform(targets)
print("The original data")
print(targets)
print("The transform data using LabelEncoder")
print(targets_trans)

The original data
['Sun' 'Sun' 'Moon' 'Earth' 'Monn' 'Venus']
The transform data using LabelEncoder
[3 3 2 0 1 4]
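
After fitting, the mapping can be inspected through the `classes_` attribute, and `inverse_transform` converts integers back to the original strings. The sketch below uses a hypothetical label list, invented for this example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels.
labels = np.array(["cat", "dog", "cat", "bird"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the index of a label is its code.
print(encoder.classes_)
print(encoded)

# Map integer codes back to the original strings.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```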
