<!--BOOK_INFORMATION-->
<a href="https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv" target="_blank"><img align="left" src="data/cover.jpg" style="width: 76px; height: 100px; background: white; padding: 1px; border: 1px solid black; margin-right:10px;"></a>
*This notebook contains an excerpt from the upcoming book [Machine Learning for OpenCV](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv) by Michael Beyeler (expected Aug 2017).
The code is released under the [MIT license](https://opensource.org/licenses/MIT),
and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*

*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.
If you find this content useful, please consider supporting the work by
[buying the book](https://github.com/mbeyeler/opencv-machine-learning)!*

<!--NAVIGATION-->
< [Representing Data and Engineering Features](04.00-Representing-Data-and-Engineering-Features.ipynb) | [Contents](../README.md) | [Reducing the Dimensionality of the Data](04.02-Reducing-the-Dimensionality-of-the-Data.ipynb) >

# Preprocessing Data

## Standardizing Features

Many machine learning estimators require standardized data: they may perform poorly if the individual features do not look roughly like standard normally distributed data, that is, Gaussian with zero mean and unit variance.

Let's consider a 3x3 data matrix `X`:

In [1]:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -2.,  2.],
              [ 3.,  0.,  0.],
              [ 0.,  1., -1.]])

Then, standardizing the data matrix `X` can be achieved with the function `scale`:

In [2]:
X_scaled = preprocessing.scale(X)
X_scaled

array([[-0.26726124, -1.33630621,  1.33630621],
       [ 1.33630621,  0.26726124, -0.26726124],
       [-1.06904497,  1.06904497, -1.06904497]])

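Let's make sure `X_scaled` is indeed standardized, with zero mean and unit variance in every column (a mean on the order of `1e-17` is floating-point round-off, not a real offset):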

In [3]:
X_scaled.mean(axis=0)

array([  7.40148683e-17,   0.00000000e+00,   0.00000000e+00])

In [4]:
X_scaled.std(axis=0)

array([ 1.,  1.,  1.])
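The same result can be reproduced by hand, which makes explicit what standardization does. The following is a minimal NumPy sketch, assuming that `scale` works column by column and uses the population standard deviation (`ddof=0`); the names `mu`, `sigma`, and `X_manual` are introduced here for illustration only.

```python
import numpy as np

X = np.array([[ 1., -2.,  2.],
              [ 3.,  0.,  0.],
              [ 0.,  1., -1.]])

# Column-wise mean and population standard deviation (ddof=0)
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Subtract each column's mean, then divide by its standard deviation
X_manual = (X - mu) / sigma

# Compare with a tolerance instead of ==, because of floating-point round-off
print(np.allclose(X_manual.mean(axis=0), 0.0))
print(np.allclose(X_manual.std(axis=0), 1.0))
```

The first row of `X_manual` agrees with the first row of `X_scaled` shown above.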

## Normalizing Features

The `normalize` function rescales each sample (that is, each row of `X`) to unit norm. The `norm` keyword selects which norm to use. With the L1 norm, the absolute values in every row sum to 1:

In [5]:
X_normalized_l1 = preprocessing.normalize(X, norm='l1')
X_normalized_l1

array([[ 0.2, -0.4,  0.4],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  0.5, -0.5]])

With `norm='l2'`, every row is scaled to unit Euclidean length instead:

In [6]:
X_normalized_l2 = preprocessing.normalize(X, norm='l2')
X_normalized_l2

array([[ 0.33333333, -0.66666667,  0.66666667],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
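Both normalizations amount to dividing each row by its own norm. Here is a minimal NumPy sketch of that idea, assuming no row is entirely zero (which would cause a division by zero in this simplified version); the variable names are illustrative.

```python
import numpy as np

X = np.array([[ 1., -2.,  2.],
              [ 3.,  0.,  0.],
              [ 0.,  1., -1.]])

# L1 norm of each row: sum of absolute values
l1 = np.abs(X).sum(axis=1, keepdims=True)

# L2 norm of each row: Euclidean length
l2 = np.sqrt((X ** 2).sum(axis=1, keepdims=True))

# keepdims=True keeps the norms as a column, so each row is divided by its own norm
X_l1 = X / l1
X_l2 = X / l2

print(X_l1)
print(X_l2)
```

Note that normalization works on rows (samples), whereas standardization works on columns (features).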

## Scaling Features to a Range

Scaling data to a desired range, such as [0, 1], can be achieved with `MinMaxScaler`:

In [7]:
min_max_scaler = preprocessing.MinMaxScaler()
X_min_max = min_max_scaler.fit_transform(X)
X_min_max

array([[ 0.33333333,  0.        ,  1.        ],
       [ 1.        ,  0.66666667,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

Try a different range by specifying `feature_range`:

In [8]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-10, 10))
X_min_max2 = min_max_scaler.fit_transform(X)
X_min_max2

array([[ -3.33333333, -10.        ,  10.        ],
       [ 10.        ,   3.33333333,  -3.33333333],
       [-10.        ,  10.        , -10.        ]])
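Min-max scaling is a linear map per column: the column minimum is sent to the lower end of the range and the column maximum to the upper end. The sketch below uses a hypothetical helper, `min_max_scale`, that is not part of scikit-learn, and assumes no column is constant (otherwise the denominator is zero).

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Linearly rescale each column of X onto feature_range (illustrative helper)."""
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # First map every column onto [0, 1] ...
    X_unit = (X - col_min) / (col_max - col_min)
    # ... then stretch and shift onto [lo, hi]
    return X_unit * (hi - lo) + lo

X = np.array([[ 1., -2.,  2.],
              [ 3.,  0.,  0.],
              [ 0.,  1., -1.]])

print(min_max_scale(X))
print(min_max_scale(X, feature_range=(-10, 10)))
```

For this `X`, the two printed matrices agree with `X_min_max` and `X_min_max2` above.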

## Binarizing Features

Binarizing the data can be achieved by thresholding. Remember our data matrix:

In [9]:
X

array([[ 1., -2.,  2.],
       [ 3.,  0.,  0.],
       [ 0.,  1., -1.]])

Then threshold the data. Wherever an entry of `X` is strictly greater than 0.5, put a 1; otherwise (including entries exactly equal to the threshold) put a 0:

In [10]:
binarizer = preprocessing.Binarizer(threshold=0.5)
X_binarized = binarizer.transform(X)
X_binarized

array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
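Thresholding is easy to express directly in NumPy, which also makes the boundary behavior visible. This is a minimal sketch; the `edge` array is made up purely to illustrate what happens at the threshold itself.

```python
import numpy as np

X = np.array([[ 1., -2.,  2.],
              [ 3.,  0.,  0.],
              [ 0.,  1., -1.]])

threshold = 0.5

# Boolean comparison, then cast True/False to 1.0/0.0
X_bin = (X > threshold).astype(float)
print(X_bin)

# A value exactly equal to the threshold maps to 0, not 1
edge = np.array([[0.5, 0.75]])
edge_bin = (edge > threshold).astype(float)
print(edge_bin)
```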

## Handling Missing Data

Sometimes datasets are incomplete, with missing entries marked as `nan`:

In [11]:
from numpy import nan
X = np.array([[ nan, 0,   3  ],
              [ 2,   9,  -8  ],
              [ 1,   nan, 1  ],
              [ 5,   2,   4  ],
              [ 7,   6,  -3  ]])

We can replace missing values with one of three strategies, each computed column by column (`axis=0`):
- `'mean'`: Replaces every `nan` with the mean of the remaining values in its column.
- `'median'`: Replaces every `nan` with the median of the remaining values in its column.
- `'most_frequent'`: Replaces every `nan` with the most frequent value in its column.

In [12]:
try:  # scikit-learn >= 0.20; the old `Imputer` was removed in 0.22
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    from sklearn.preprocessing import Imputer
imp = Imputer(strategy='mean')
X2 = imp.fit_transform(X)
X2

array([[ 3.75,  0.  ,  3.  ],
       [ 2.  ,  9.  , -8.  ],
       [ 1.  ,  4.25,  1.  ],
       [ 5.  ,  2.  ,  4.  ],
       [ 7.  ,  6.  , -3.  ]])

Let's verify the math by calculating the mean by hand. Averaging the four known values in the first column should give 3.75, the same as `X2[0, 0]`:

In [13]:
np.mean(X[1:, 0]), X2[0, 0]

(3.75, 3.75)

The `'median'` strategy works, too:

In [14]:
imp = Imputer(strategy='median')
X3 = imp.fit_transform(X)
X3

array([[ 3.5,  0. ,  3. ],
       [ 2. ,  9. , -8. ],
       [ 1. ,  4. ,  1. ],
       [ 5. ,  2. ,  4. ],
       [ 7. ,  6. , -3. ]])

Let's make sure the median of the column evaluates to 3.5 (same as `X3[0, 0]`):

In [15]:
np.median(X[1:, 0]), X3[0, 0]

(3.5, 3.5)
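The mean and median strategies can also be sketched with plain NumPy, using its `nan`-aware reductions. The helper `impute_columns` below is hypothetical and only illustrates the idea; it assumes every column contains at least one known value.

```python
import numpy as np

X = np.array([[ np.nan, 0,      3 ],
              [ 2,      9,     -8 ],
              [ 1,      np.nan, 1 ],
              [ 5,      2,      4 ],
              [ 7,      6,     -3 ]])

def impute_columns(X, reducer):
    """Fill nan entries with a per-column statistic (illustrative helper)."""
    # One fill value per column, computed while ignoring nan entries
    fill = reducer(X, axis=0)
    # Keep known entries; broadcast the fill values into the nan positions
    return np.where(np.isnan(X), fill, X)

X_mean = impute_columns(X, np.nanmean)
X_median = impute_columns(X, np.nanmedian)

print(X_mean)
print(X_median)
```

For this matrix, `X_mean` and `X_median` contain the same values as `X2` and `X3` above.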
