# Data pre-processing

Prof. Dr. Georgios K. Ouzounis<br/>
[georgios.ouzounis@go.kauko.lt](mailto:georgios.ouzounis@go.kauko.lt)

<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/02/17085331/scikit-learn-logo.png" alt="sci-kit learn" width="300" style="float: left; margin-right: 10px;" />

The contents of this session are taken directly from the source site
http://scikit-learn.org/stable/index.html 

## Contents

- feature loading 
- feature scaling, normalization & binarization
- encoding of categorical features  
- imputing missing values

## Load Features From Dictionaries

The sklearn.feature_extraction module extracts features from raw data such as text and images into a format supported by machine learning algorithms.

The class **DictVectorizer** can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.

DictVectorizer implements what is called “one-hot” coding for categorical features. Categorical features are “attribute-value” pairs where the value is restricted to a list of discrete possibilities without ordering (e.g. topic identifiers, types of objects, tags, names…).

In the following example, “city” is a categorical attribute while “temperature” is a traditional numerical feature:

In [None]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.}]

Let's import the dictionary vectorizer and create an instance

In [None]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()

Let's do the conversion now

In [None]:
vec.fit_transform(measurements).toarray()

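Note that the feature names are retained, so each output column can be traced back to its source attribute:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
vec.get_feature_names_out()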
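
The conversion performed in this section can be reproduced by hand. The following is a minimal sketch in plain Python, not the library's implementation. The helper name *dict_one_hot* is hypothetical, and the sketch assumes that string values are categorical and that every other value is numerical.

```python
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.}]

def dict_one_hot(records):
    """Hypothetical helper: one-hot code string values, pass numbers through."""
    # one column per "key=value" pair for strings, one column per key for numbers
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    column = {name: i for i, name in enumerate(names)}

    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                row[column[f"{key}={value}"]] = 1.0   # exactly one active column per category
            else:
                row[column[key]] = float(value)        # numerical feature is copied as-is
        rows.append(row)
    return names, rows

names, rows = dict_one_hot(measurements)
print(names)
for row in rows:
    print(row)
```

Each city becomes its own 0/1 column, while the temperature keeps its numerical value in a single column.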

## Feature Extraction

Visit http://scikit-learn.org/stable/modules/feature_extraction.html to learn more on:

- Feature hashing
- Text feature extraction
- Image feature extraction


## Standardization or Mean Removal and Variance Scaling

+ standardization of datasets is a common requirement
+ machine learning estimators may behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

<img src="http://cs231n.github.io/assets/nn2/prepro1.jpeg"  alt="mean removal and variance scaling" width="600" style="float: left; margin-right: 10px;" />

source [Convolutional Neural Networks for Visual Recognition](http://cs231n.github.io/neural-networks-2/)

- In practice: ignore the distribution shape, just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

- If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
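
The effect described in the two points above can be made concrete with a small sketch. The toy data and the helper names *standardize_columns* and *squared_gaps* are hypothetical; the sketch uses the population standard deviation and leaves constant features unscaled, as an assumption about what "dividing non-constant features" means.

```python
from statistics import fmean, pstdev

# Hypothetical toy data: feature 0 lives on a scale of thousands, feature 1 on a scale of units
X = [[1000.0, 1.0],
     [2000.0, 3.0],
     [3000.0, 2.0]]

def standardize_columns(rows):
    """Center each feature on its mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    stds = [s if s > 0 else 1.0 for s in stds]   # constant features are only centered
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def squared_gaps(a, b):
    """Per-feature contribution to the squared Euclidean distance between two samples."""
    return [(x - y) ** 2 for x, y in zip(a, b)]

raw = squared_gaps(X[0], X[1])
Z = standardize_columns(X)
scaled = squared_gaps(Z[0], Z[1])

print("raw contributions:   ", raw)
print("scaled contributions:", scaled)
```

Before scaling, the distance between the first two samples is driven almost entirely by feature 0. After standardization both features contribute on a comparable scale.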


The function **scale** provides a quick and easy way to perform this operation on a single array-like dataset:

In [None]:
# import necessary libraries
from sklearn import preprocessing
import numpy as np

# create a sample data-set
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [None]:
X_scaled = preprocessing.scale(X_train)

In [None]:
X_scaled  

Scaled data has zero mean and unit variance:

In [None]:
X_scaled.mean(axis=0)

In [None]:
X_scaled.std(axis=0)


The *preprocessing* module further provides a utility class **StandardScaler** that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later re-apply the same transformation on the testing set. 

In [None]:
# create and verify an instance of the StandardScaler
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

In [None]:
scaler.mean_

In [None]:
scaler.scale_

In [None]:
scaler.transform(X_train)

The scaler instance can then be used on new data to transform it the same way it did on the training set:

In [None]:
# create a sample data row
X_test = [[-1., 1., 0.]]

In [None]:
# apply the trained scaler function
scaler.transform(X_test)
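
What the fitted scaler does to new data can be checked by hand. This is a minimal sketch in plain Python, assuming the population standard deviation (the NumPy default) and non-constant features; it is meant to illustrate the arithmetic, not to replace **StandardScaler**.

```python
from statistics import fmean, pstdev

X_train = [[1.0, -1.0, 2.0],
           [2.0, 0.0, 0.0],
           [0.0, 1.0, -1.0]]
X_test = [[-1.0, 1.0, 0.0]]

# "fit": per-feature statistics are computed on the training set only
columns = list(zip(*X_train))
means = [fmean(c) for c in columns]
stds = [pstdev(c) for c in columns]   # population standard deviation

# "transform": the test row is shifted and scaled with the training statistics
X_test_scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)]
                 for row in X_test]

print(means)
print(stds)
print(X_test_scaled)
```

The test row is never used to compute the mean or the standard deviation, which is the point of separating fit from transform.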

It is possible to disable either centering or scaling by either passing *with_mean=False* or *with_std=False* to the constructor of **StandardScaler**.

### Scaling Features to a Range

- An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. 

- This can be achieved using **MinMaxScaler** or **MaxAbsScaler**, respectively.

- The motivations for this kind of scaling include robustness to very small standard deviations of features and the preservation of zero entries in sparse data.


Here is an example to scale a toy data matrix to the [0, 1] range:

In [None]:
# create the training data-set
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [None]:
# instantiate the MinMaxScaler
min_max_scaler = preprocessing.MinMaxScaler()

In [None]:
# train it on the input data
X_train_minmax = min_max_scaler.fit_transform(X_train)

In [None]:
# verify the object
X_train_minmax

The same instance of the transformer can then be applied to some new test data unseen during the fit call: the same scaling and shifting operations will be applied to be consistent with the transformation performed on the train data:

In [None]:
# declare your test data
X_test = np.array([[ -3., -1.,  4.]])

In [None]:
# apply and verify the scaler
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

It is possible to introspect the scaler attributes to find about the exact nature of the transformation learned on the training data:

In [None]:
# get the scaling parameters
min_max_scaler.scale_

In [None]:
# get the min parameters
min_max_scaler.min_

If **MinMaxScaler** is given an explicit feature_range=(min, max) the full formula is:

**X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))**

**X_scaled = X_std * (max - min) + min**
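
The two-step formula can be written out in plain Python as a sketch. The helper names *fit_min_max* and *min_max_transform* are hypothetical, and the code assumes that no feature is constant in the training data (otherwise the denominator would be zero).

```python
X_train = [[1.0, -1.0, 2.0],
           [2.0, 0.0, 0.0],
           [0.0, 1.0, -1.0]]
X_test = [[-3.0, -1.0, 4.0]]

def fit_min_max(rows):
    """Per-feature minimum and maximum of the training data."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def min_max_transform(rows, col_min, col_max, feature_range=(0.0, 1.0)):
    """Apply X_std = (X - min) / (max - min), then X_scaled = X_std * (hi - lo) + lo."""
    lo, hi = feature_range
    return [[(v - mn) / (mx - mn) * (hi - lo) + lo
             for v, mn, mx in zip(row, col_min, col_max)]
            for row in rows]

col_min, col_max = fit_min_max(X_train)
train_01 = min_max_transform(X_train, col_min, col_max)
test_01 = min_max_transform(X_test, col_min, col_max)
test_pm1 = min_max_transform(X_test, col_min, col_max, feature_range=(-1.0, 1.0))

print(train_01)
print(test_01)
print(test_pm1)
```

Test values that fall outside the range seen during training are mapped outside the target range, as the first and third entries of the test row show.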

**MaxAbsScaler** works in a very similar fashion, but scales in a way that the training data lies within the range [-1, 1] by dividing by the largest absolute value in each feature. It is meant for data that is already centered at zero or sparse data.

In [None]:
# create the training data
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [None]:
# instantiate the MaxAbsScaler
max_abs_scaler = preprocessing.MaxAbsScaler()

In [None]:
# train it on the input data-set
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs

In [None]:
# declare your test data
X_test = np.array([[ -3., -1.,  4.]])

In [None]:
# deploy the scaler
X_test_maxabs = max_abs_scaler.transform(X_test)

In [None]:
X_test_maxabs

In [None]:
max_abs_scaler.scale_      
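
The same numbers can be obtained by hand. This is a minimal sketch in plain Python, assuming that every feature has at least one non-zero training value; the helper name *max_abs_transform* is hypothetical.

```python
X_train = [[1.0, -1.0, 2.0],
           [2.0, 0.0, 0.0],
           [0.0, 1.0, -1.0]]
X_test = [[-3.0, -1.0, 4.0]]

# "fit": largest absolute value of each feature in the training data
max_abs = [max(abs(v) for v in column) for column in zip(*X_train)]

def max_abs_transform(rows, scale):
    """Divide each feature by its training-set maximum absolute value."""
    return [[v / s for v, s in zip(row, scale)] for row in rows]

train_scaled = max_abs_transform(X_train, max_abs)
test_scaled = max_abs_transform(X_test, max_abs)

print(max_abs)
print(train_scaled)
print(test_scaled)
```

No value is shifted, only divided, so entries that are zero stay zero. This is why the approach suits sparse data.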

## Normalization

- **Normalization** is the process of *scaling individual samples to have unit norm*. 

- useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

- the function **normalize** provides a quick and easy way to perform this operation on a single array-like dataset, either using the l1 or l2 norms:

In [None]:
# declare your input data
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

In [None]:
# operate the normalizer on X
X_normalized = preprocessing.normalize(X, norm='l2')


In [None]:
X_normalized  

<img src="http://idiomic.com/wp-content/uploads/2017/01/iddy-reading.jpg"  alt="further reading" width="100" style="float: left; margin-right: 10px;"/>


Further reading: [L1 Norms versus L2 Norms - kaggle](https://www.kaggle.com/residentmario/l1-norms-versus-l2-norms)

The preprocessing module further provides a utility class **Normalizer** that implements the same operation using the Transformer API (even though the fit method is useless in this case: the class is stateless as this operation treats samples independently).

In [None]:
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer

In [None]:
# apply the normalizer to the data
normalizer.transform(X)

In [None]:
# and on any given line of new data
normalizer.transform([[-1.,  1., 0.]])
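
To see what unit norm means for each sample, the operation can be sketched in plain Python. The helper name *normalize_rows* is hypothetical; rows whose norm is zero are left unchanged here, which is an assumption of this sketch.

```python
import math

X = [[1.0, -1.0, 2.0],
     [2.0, 0.0, 0.0],
     [0.0, 1.0, -1.0]]

def normalize_rows(rows, norm="l2"):
    """Scale every sample (row) independently so that its norm equals 1."""
    result = []
    for row in rows:
        if norm == "l1":
            length = sum(abs(v) for v in row)             # sum of absolute values
        elif norm == "l2":
            length = math.sqrt(sum(v * v for v in row))   # Euclidean length
        else:
            raise ValueError(f"unsupported norm: {norm}")
        length = length or 1.0                            # all-zero rows stay as they are
        result.append([v / length for v in row])
    return result

X_l2 = normalize_rows(X, norm="l2")
X_l1 = normalize_rows(X, norm="l1")

print(X_l2)
print(X_l1)
```

Unlike the scalers of the previous sections, nothing is learned from other samples: each row is divided by its own norm.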

## Binarization

**Feature binarization** is the process of *thresholding numerical features* to get boolean values. The fit method does nothing, as each sample is treated independently of the others.

In [None]:
# create your input data
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

In [None]:
# instantiate the binarizer 
binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer

In [None]:
# apply the binarizer to the input data
binarizer.transform(X)

# the resulting array contains 0 or 1 entries only 

It is possible to adjust the threshold of the binarizer:

In [None]:
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
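
The thresholding rule itself is short enough to write by hand. A minimal sketch in plain Python, with the hypothetical helper name *binarize*: values strictly greater than the threshold map to 1, all others to 0.

```python
X = [[1.0, -1.0, 2.0],
     [2.0, 0.0, 0.0],
     [0.0, 1.0, -1.0]]

def binarize(rows, threshold=0.0):
    """Map values strictly above the threshold to 1.0 and everything else to 0.0."""
    return [[1.0 if v > threshold else 0.0 for v in row] for row in rows]

default_result = binarize(X)
strict_result = binarize(X, threshold=1.1)

print(default_result)
print(strict_result)
```

With the threshold raised to 1.1, only the two entries equal to 2 remain active.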


## Encoding Categorical Features

Features are often given as categorical rather than continuous values.

For example a person could have features:
- ["male", "female"],
- ["from Europe", "from US", "from Asia"],
- ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. 


Such features can be efficiently coded as integers. For instance, ["male", "from US", "uses Internet Explorer"] could be expressed as [0, 1, 3], while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1].
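
The integer coding can be sketched in plain Python by using the position of each value in its list of possibilities. The names *categories* and *encode_as_integers* are hypothetical and serve only this illustration.

```python
# possible values of each feature, in a fixed order
categories = [
    ["male", "female"],
    ["from Europe", "from US", "from Asia"],
    ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"],
]

def encode_as_integers(sample):
    """Replace every categorical value by its index in the matching category list."""
    return [values.index(value) for values, value in zip(categories, sample)]

first = encode_as_integers(["male", "from US", "uses Internet Explorer"])
second = encode_as_integers(["female", "from Asia", "uses Chrome"])

print(first)
print(second)
```

The integers are labels only: the fact that "from Asia" maps to 2 and "from US" to 1 does not express any ordering between them.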

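Such an integer representation should not be used directly with scikit-learn estimators: they expect continuous input and would interpret the categories as being ordered, which is often not desired.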

Convert categorical features using a one-of-K or one-hot encoding, which is implemented in **OneHotEncoder**. 

This estimator transforms each categorical feature with m possible values into m binary features, with only one active.

In [None]:
# instantiate the encoder
enc = preprocessing.OneHotEncoder()

In [None]:
# train it on sample data
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) 

In [None]:
# convert new data to an encoded array
enc.transform([[0, 1, 3]]).toarray()

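- By default, the values each feature can take are inferred automatically from the training data.

- It is possible to specify them explicitly using the parameter *categories*. Older releases used *n_values* for this purpose; that parameter was removed in scikit-learn 0.22.

- The cells above fit the estimator and then transform a single data point.

If some category values might be absent from the training data, list the categories of every feature explicitly:

In [None]:
# one list of allowed values per feature: 2, 3 and 4 categories respectively
enc = preprocessing.OneHotEncoder(categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])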

In [None]:
# Note that there are missing categorical values for the 2nd and 3rd features
enc.fit([[1, 2, 3], [0, 2, 0]])  

In [None]:
enc.transform([[1, 0, 0]]).toarray()
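
The layout of the encoded row can be verified by hand. A minimal sketch in plain Python, assuming integer-coded features with 2, 3 and 4 possible values; the helper name *one_hot* and the variable *n_categories* are hypothetical.

```python
n_categories = [2, 3, 4]   # number of possible values of each feature

def one_hot(sample, sizes):
    """Expand each integer-coded feature into a block of `size` binary features."""
    encoded = []
    for value, size in zip(sample, sizes):
        if not 0 <= value < size:
            raise ValueError(f"value {value} is outside the range 0..{size - 1}")
        block = [0.0] * size
        block[value] = 1.0          # only one entry per block is active
        encoded.extend(block)
    return encoded

row = one_hot([1, 0, 0], n_categories)
print(row)
```

The result has 2 + 3 + 4 = 9 columns even though the training data in the example never contains some of the values, because the block sizes are fixed up front.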


## Imputation of Missing Values

Real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. 

Such datasets are incompatible with scikit-learn estimators, which assume that all values in an array are numerical.

A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. 

This comes at the price of losing data which may be valuable (even though incomplete). 

A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.
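
Mean imputation, the simplest such strategy, can be sketched in plain Python. The helper names *column_means* and *impute* are hypothetical, and the sketch assumes that every column has at least one known value in the training data.

```python
import math

X_train = [[1.0, 2.0], [math.nan, 3.0], [7.0, 6.0]]
X_new = [[math.nan, 2.0], [6.0, math.nan], [7.0, 6.0]]

def column_means(rows):
    """Mean of each column, computed over the known (non-NaN) entries only."""
    means = []
    for column in zip(*rows):
        known = [v for v in column if not math.isnan(v)]
        means.append(sum(known) / len(known))
    return means

def impute(rows, fill_values):
    """Replace every NaN by the fill value of its column."""
    return [[fill if math.isnan(v) else v for v, fill in zip(row, fill_values)]
            for row in rows]

means = column_means(X_train)      # learned from the training data
X_imputed = impute(X_new, means)   # applied to new data

print(means)
print(X_imputed)
```

As with the scalers, the fill values come from the training data and are then reused unchanged on new data.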

The **SimpleImputer** class from the sklearn.impute module provides basic strategies for imputing missing values, using the mean, the median or the most frequent value of the column in which the missing values are located. It replaces the former **Imputer** class of the preprocessing module, which was removed in scikit-learn 0.22.

This class also allows for different missing value encodings.

The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean value of the columns that contain the missing values:

In [None]:
# import the necessary libraries
import numpy as np
from sklearn.impute import SimpleImputer

In [None]:
# instantiate the imputer (SimpleImputer always works column-wise, so there is no axis argument)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
# train it on sample data
imp.fit([[1, 2], [np.nan, 3], [7, 6]]) # compute the column means here

In [None]:
# create test data-set
X = [[np.nan, 2], [6, np.nan], [7, 6]]

In [None]:
# impute missing values
print(imp.transform(X)) # fill in the blanks with the means computed above

In [None]:
# SimpleImputer also supports sparse matrices
import scipy.sparse as sp

# missing entries are encoded as -1 here: missing_values=0 is rejected for sparse input
X = sp.csc_matrix([[1, 2], [-1, 3], [7, 6]])

In [None]:
# instantiate and train the imputer
imp = SimpleImputer(missing_values=-1, strategy='mean')
imp.fit(X)

In [None]:
# create the test set
X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])

In [None]:
# impute missing values and show the result as a dense array
print(imp.transform(X_test).toarray())