# Preprocessing Techniques

The examples below demonstrate four preprocessing techniques:

- Binarization
- Mean removal
- Scaling
- Normalization

The matrix below is used as the sample data set (rows are samples, columns are features):

In [61]:
import numpy as np
from sklearn import preprocessing as pp

m = np.array([[ 5.1, -2.9,  3.3],
              [-1.2,  7.8, -6.1],
              [ 3.9,  0.4,  2.1],
              [ 7.3, -9.9, -4.5]])

## Binarization

`preprocessing.Binarizer()`

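The `Binarizer` class converts numerical values into `0`/`1` values. Below, the `threshold` parameter is `2.1`: values greater than `2.1` become `1`, and values less than or equal to `2.1` become `0`. Calling `fit()` learns nothing from the data and simply returns the estimator, which is why the first print shows the object with its default `threshold=0.0`.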

In [62]:
b = pp.Binarizer().fit(m)
b2 = pp.Binarizer(threshold = 2.1).transform(m)

print("\nBinarizing - fit:\n", b)
print("\nBinarizing - threshold, transform:\n", b2)


Binarizing - fit:
 Binarizer(copy=True, threshold=0.0)

Binarizing - threshold, transform:
 [[ 1.  0.  1.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]]
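
The thresholding rule can also be written directly in NumPy, which makes the behaviour at the boundary explicit. This is a minimal sketch for illustration, not the library's implementation; the names `threshold` and `binarized` are introduced here and do not appear in the original notebook.

```python
import numpy as np

m = np.array([[ 5.1, -2.9,  3.3],
              [-1.2,  7.8, -6.1],
              [ 3.9,  0.4,  2.1],
              [ 7.3, -9.9, -4.5]])

threshold = 2.1  # illustrative name; same value as in the cell above

# Values strictly greater than the threshold map to 1.0, all others to 0.0.
# The entry equal to 2.1 is not greater than the threshold, so it becomes 0.
binarized = (m > threshold).astype(float)

print(binarized)
```

The result should match the `transform` output printed above, including the `0` in the third row where the value equals the threshold.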


To show the effect of the next step, the per-feature mean and standard deviation of the original matrix are calculated first:

In [63]:
print("\nBefore scaling:")
print("Mean:", m.mean(axis = 0))
print("Standard:", m.std(axis = 0))


Before scaling:
Mean: [ 3.775 -1.15  -1.3  ]
Standard: [ 3.12039661  6.36651396  4.0620192 ]


## Mean Removal

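To center each feature on zero, its mean value is subtracted. With its default settings, `scale()` also divides each feature by its standard deviation, so every column ends up with a mean of approximately `0` and a standard deviation of `1`.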

`preprocessing.scale()`

In [64]:
s = pp.scale(m)

print("\nAfter deviation:")
print("Mean:", s.mean(axis = 0))
print("Standard:", s.std(axis = 0))


After deviation:
Mean: [  1.11022302e-16   0.00000000e+00   2.77555756e-17]
Standard: [ 1.  1.  1.]
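
For reference, the same standardization can be sketched by hand in NumPy. This assumes the default behaviour of `scale()` (column-wise, using the population standard deviation, i.e. `ddof=0`); the variable names are illustrative only.

```python
import numpy as np

m = np.array([[ 5.1, -2.9,  3.3],
              [-1.2,  7.8, -6.1],
              [ 3.9,  0.4,  2.1],
              [ 7.3, -9.9, -4.5]])

# Per-feature (column) statistics.
col_mean = m.mean(axis=0)
col_std = m.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation.
standardized = (m - col_mean) / col_std

print("Mean:", standardized.mean(axis=0))
print("Standard deviation:", standardized.std(axis=0))
```

The means come out as values very close to zero rather than exactly zero because of floating-point rounding, as in the output above.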


## Minimum/Maximum Scaling

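Min/max scaling rescales each feature to a fixed range, here `0` to `1`, so that all features share the same bounds. In each column, the smallest value maps to `0` and the largest to `1`.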

`preprocessing.MinMaxScaler()`

In [65]:
smm = pp.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = smm.fit_transform(m)

print("\nMin/max scaled:\n", data_scaled_minmax)


Min/max scaled:
 [[ 0.74117647  0.39548023  1.        ]
 [ 0.          1.          0.        ]
 [ 0.6         0.5819209   0.87234043]
 [ 1.          0.          0.17021277]]
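
The rescaling can be sketched by hand as `(x - min) / (max - min)` per column, then stretched to the target range. This is an illustrative NumPy version under the assumption that no column is constant (which would cause a division by zero); `lo`, `hi`, and the other names are introduced here for the example.

```python
import numpy as np

m = np.array([[ 5.1, -2.9,  3.3],
              [-1.2,  7.8, -6.1],
              [ 3.9,  0.4,  2.1],
              [ 7.3, -9.9, -4.5]])

lo, hi = 0.0, 1.0  # target range, mirroring feature_range=(0, 1)

# Per-feature (column) minimum and maximum.
col_min = m.min(axis=0)
col_max = m.max(axis=0)

# Map each column onto [0, 1], then stretch and shift to [lo, hi].
scaled = (m - col_min) / (col_max - col_min) * (hi - lo) + lo

print(scaled)
```

For the first entry this gives `(5.1 - (-1.2)) / (7.3 - (-1.2)) = 6.3 / 8.5`, which agrees with the `0.74117647` shown above.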


## Normalization

Normalization rescales each sample (row) so that it has unit norm. With _L1 normalization_, the absolute values in each row add up to `1`. _L2 normalization_ is similar, but the squared values in each row add up to `1`. Because L1 does not square the values, it is generally less sensitive to outliers. If large values should carry more weight, _L2_ might be preferred.

`preprocessing.normalize()`

In [66]:
nl1 = pp.normalize(m, norm='l1')
nl2 = pp.normalize(m, norm='l2')

print("\nL1 normalized:\n", nl1)
print("\nL2 normalized:\n", nl2)


L1 normalized:
 [[ 0.45132743 -0.25663717  0.2920354 ]
 [-0.0794702   0.51655629 -0.40397351]
 [ 0.609375    0.0625      0.328125  ]
 [ 0.33640553 -0.4562212  -0.20737327]]

L2 normalized:
 [[ 0.75765788 -0.43082507  0.49024922]
 [-0.12030718  0.78199664 -0.61156148]
 [ 0.87690281  0.08993875  0.47217844]
 [ 0.55734935 -0.75585734 -0.34357152]]
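
Both norms can be reproduced by dividing each row by its own norm. The following is a minimal NumPy sketch, assuming no row consists entirely of zeros; the names `l1` and `l2` are illustrative.

```python
import numpy as np

m = np.array([[ 5.1, -2.9,  3.3],
              [-1.2,  7.8, -6.1],
              [ 3.9,  0.4,  2.1],
              [ 7.3, -9.9, -4.5]])

# L1: divide each row by the sum of its absolute values.
l1 = m / np.abs(m).sum(axis=1, keepdims=True)

# L2: divide each row by the square root of the sum of its squares.
l2 = m / np.sqrt((m ** 2).sum(axis=1, keepdims=True))

# Each row should now have unit norm.
print("Sum of absolute values per row (L1):", np.abs(l1).sum(axis=1))
print("Sum of squares per row (L2):", (l2 ** 2).sum(axis=1))
```

Signs are preserved, so the entries of an L1-normalized row only add up to `1` once their absolute values are taken.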
