In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Data pre-processing

Machine-learning algorithms are based on optimization or statistical frameworks. These frameworks rely on numerical data and make some assumptions about the data themselves. We will see two transformations that are commonly applied before fitting machine-learning algorithms: (i) scaling and (ii) encoding.

## Scaling numerical data

### Motivation

Let's illustrate the importance of scaling data before fitting a classifier.

In [None]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
clf = LogisticRegression(random_state=42, solver='lbfgs', multi_class='multinomial')
clf.fit(X_train, y_train)
# clf.tol is the stopping tolerance of the solver, not the value of the loss
print('LogisticRegression trained with {} iterations (stopping tolerance: {})'
      .format(clf.n_iter_, clf.tol))

In [None]:
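# Standardize the training data in place: zero mean and unit standard deviation
X_train -= X_train.mean(axis=0)
X_train /= X_train.std(axis=0)
clf.fit(X_train, y_train)
# clf.tol is the stopping tolerance of the solver, not the value of the loss
print('LogisticRegression trained with {} iterations (stopping tolerance: {})'
      .format(clf.n_iter_, clf.tol))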

Algorithms based on an optimization framework (gradient descent, etc.) generally assume that the features are on comparable scales. When they are, learning converges in fewer iterations and the problem is better conditioned. We will now dive into the scikit-learn scalers, which transform the data accordingly.

### Scaler in scikit-learn

A very basic example is the rescaling of our data, which is a requirement for many machine learning algorithms as they are not scale-invariant -- rescaling falls into the category of data pre-processing and can barely be called *learning*. There exist many different rescaling techniques, and in the following example, we will take a look at a particular method that is commonly called "standardization." Here, we will rescale the data so that each feature is centered at zero (mean = 0) with unit variance (standard deviation = 1).

For example, if we have a 1D dataset with the values [1, 2, 3, 4, 5], the standardized values are

- 1 -> -1.41
- 2 -> -0.71
- 3 -> 0.0
- 4 -> 0.71
- 5 -> 1.41

computed via the equation $x_{standardized} = \frac{x - \mu_x}{\sigma_x}$,
where $\mu_x$ is the sample mean and $\sigma_x$ the standard deviation of the feature.

In [None]:
ary = np.array([1, 2, 3, 4, 5])
ary_standardized = (ary - ary.mean()) / ary.std()
ary_standardized

Although standardization is a very basic preprocessing procedure -- as we've seen in the code snippet above -- scikit-learn implements a `StandardScaler` class for this computation. In later sections, we will see why and when the scikit-learn interface comes in handy over the code snippet we executed above.

Applying such a preprocessing step has a very similar interface to the supervised learning algorithms we saw so far.
To get some more practice with scikit-learn's "Transformer" interface, let's start by loading the iris dataset and rescaling it:


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print(X_train.shape)

The iris dataset is not "centered", that is, it has a non-zero mean, and the standard deviation is different for each component:


In [None]:
print("mean : %s " % X_train.mean(axis=0))
print("standard deviation : %s " % X_train.std(axis=0))

To use a preprocessing method, we first import the estimator, here `StandardScaler`, and instantiate it:
    

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

As with the classification and regression algorithms, we call ``fit`` to learn the model from the data. As this is an unsupervised model, we only pass ``X``, not ``y``. This simply estimates mean and standard deviation.

In [None]:
scaler.fit(X_train)

Now we can rescale our data by applying the ``transform`` (not ``predict``) method:

In [None]:
X_train_scaled = scaler.transform(X_train)

``X_train_scaled`` has the same number of samples and features, but the mean was subtracted and all features were scaled to have unit standard deviation:

In [None]:
print(X_train_scaled.shape)

In [None]:
print("mean : %s " % X_train_scaled.mean(axis=0))
print("standard deviation : %s " % X_train_scaled.std(axis=0))

To summarize: via the `fit` method, the estimator is fitted to the data we provide. In this step, the estimator estimates its parameters from the data (here: mean and standard deviation). Then, when we `transform` data, these parameters are used to transform the dataset. Note that the `transform` method does not update these parameters.
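
As a minimal sketch of this summary (the small arrays below are made up for illustration), the parameters estimated by `fit` are stored in the `mean_` and `scale_` attributes of the scaler, and `transform` reuses them on any data without re-estimating them:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration only
X_fit = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_new = np.array([[3.0], [10.0]])

scaler = StandardScaler()
scaler.fit(X_fit)  # estimates the mean and standard deviation

print(scaler.mean_)   # per-feature mean learned during fit
print(scaler.scale_)  # per-feature standard deviation learned during fit

# transform applies (x - mean_) / scale_ using the stored parameters
X_new_scaled = scaler.transform(X_new)
manual = (X_new - scaler.mean_) / scaler.scale_
print(np.allclose(X_new_scaled, manual))

# The stored parameters are unchanged after calling transform
print(scaler.mean_)
```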

It's important to note that the same transformation is applied to the training and the test set. That has the consequence that usually the mean of the test data is not zero after scaling:

In [None]:
X_test_scaled = scaler.transform(X_test)
print("mean test data: %s" % X_test_scaled.mean(axis=0))

It is important for the training and test data to be transformed in exactly the same way, so that the following processing steps can make sense of the data, as is illustrated in the figure below:

In [None]:
from figures import plot_relative_scaling
plot_relative_scaling()

There are several common ways to scale the data. The most common one is the ``StandardScaler`` we just introduced, but rescaling the data to a fixed minimum and maximum value with ``MinMaxScaler`` (usually between 0 and 1), or using more robust statistics like the median and quantiles instead of the mean and standard deviation (with ``RobustScaler``), are also useful.
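
To get a feeling for how these scalers differ, here is a small sketch on a toy feature containing one large outlier (the values are made up for illustration). It relies only on the default settings of each scaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy feature with one large outlier (values made up for illustration)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for toy_scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    X_toy_scaled = toy_scaler.fit_transform(X_toy)
    # Print the scaler name followed by the rescaled values
    print(type(toy_scaler).__name__, X_toy_scaled.ravel().round(2))
```

With such an outlier, the ``StandardScaler`` and ``MinMaxScaler`` outputs squeeze the four small values close together, whereas ``RobustScaler`` keeps them spread out because the median and the interquartile range are less sensitive to the extreme value.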

In [None]:
from figures import plot_scaling
plot_scaling()

## Encoding categorical data

In the previous section, we saw how important it is to normalize numerical data. In data science, another type of data is frequently encountered: categorical data. Such data take their values in a finite set of categories.

In the previous example, we presented the iris dataset. We could imagine an additional feature which could be the color of the flower. The color would be defined by a finite set of known values (purple, blue, yellow, etc.) which will form the categories.

Categorical data are sometimes expressed as strings. Machine-learning algorithms primarily work with numerical data, so string-valued features might not work as expected. Let's check an example.

We will add an additional column by giving a random color to each sample in the iris dataset.

In [None]:
X_train[:10]

In [None]:
color_feature = np.random.choice(
    ['purple', 'yellow', 'blue'], size=X_train.shape[0]
)
color_feature

In [None]:
X_train = np.hstack([X_train, color_feature[:, np.newaxis]])

In [None]:
X_train[:10]

In [None]:
clf = LogisticRegression()
# This call is expected to fail: the color column holds strings, not numbers
clf.fit(X_train, y_train)

Thus, we need to convert the string categories into numerical data. We will present two strategies available in scikit-learn to tackle this issue.

### One-hot encoder

The most common way to deal with categorical data is to one-hot encode the categories. Each category of the original feature will be represented as a column and, for each sample, `1` will be assigned to the column of the corresponding category while the others will be given `0`. We can illustrate this on our toy example.
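
Before using the scikit-learn encoder, the idea can be sketched by hand with NumPy. This is only an illustration on a made-up array of colors, not how `OneHotEncoder` is implemented: each sample is compared with each sorted unique category, and a match gives `1`.

```python
import numpy as np

# Toy categorical feature (values made up for illustration)
colors = np.array(['purple', 'yellow', 'blue', 'yellow'])

# Sorted unique categories: one output column per category
categories = np.unique(colors)

# Compare each sample with each category: 1.0 where they match, 0.0 elsewhere
one_hot = (colors[:, np.newaxis] == categories[np.newaxis, :]).astype(float)

print(categories)
print(one_hot)
```

Each row contains exactly one `1`, located in the column of the category of that sample.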

In [None]:
color_feature = np.random.choice(
    ['purple', 'yellow', 'blue'], size=X_train.shape[0]
).astype(object)
color_feature = color_feature[:, np.newaxis]
color_feature[:10]

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Return a dense array; in scikit-learn >= 1.2 this parameter is named `sparse_output`
ohe = OneHotEncoder(sparse=False)

In [None]:
color_feature_encoded = ohe.fit_transform(color_feature)
color_feature_encoded[:10]

In [None]:
ohe.categories_

### Ordinal encoder

In the previous example, we might have been tempted to assign numbers to these categories instead of strings, e.g. *purple=1, blue=2, yellow=3*, but in general **this is a bad idea**.
Estimators tend to operate under the assumption that numerical features lie on some continuous scale, so, for example, 1 and 2 are more alike than 1 and 3, and this is often not the case for categorical features.

An example of ordinal features would be T-shirt sizes, e.g., XL > L > M > S.
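
One point to keep in mind: by default, `OrdinalEncoder` determines the categories from the data and sorts them, which for strings means alphabetical order rather than the order of the sizes. The sketch below, on a made-up array of sizes, compares the default behaviour with an explicit order passed through the `categories` parameter:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy ordinal feature (values made up for illustration)
sizes = np.array([['M'], ['XL'], ['S'], ['L']], dtype=object)

# Default: categories are sorted, so the codes follow alphabetical order
oe_default = OrdinalEncoder()
codes_default = oe_default.fit_transform(sizes).ravel()
print(oe_default.categories_)
print(codes_default)

# Explicit order S < M < L < XL, so the codes follow the size order
oe_ordered = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
codes_ordered = oe_ordered.fit_transform(sizes).ravel()
print(codes_ordered)
```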

Let's imagine the same type of categories for our flowers.

In [None]:
flower_size = np.random.choice(
    ['S', 'M', 'L', 'XL'], size=X_train.shape[0]
).astype(object)
flower_size = flower_size[:, np.newaxis]
flower_size[:10]

In [None]:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()

In [None]:
flower_size_encoded = oe.fit_transform(flower_size)
flower_size_encoded[:10]

In [None]:
oe.categories_

## Exercise

* Read the titanic dataset located in `datasets/titanic3.csv` using Pandas. Select the following columns: `['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']`

In [None]:
# %load solutions/preprocessing_01.py

In [None]:
titanic.shape

* Remove the rows containing NaN using `pd.DataFrame.dropna`.

In [None]:
# %load solutions/preprocessing_02.py

* Separate the data into two dataframes: one with the numerical columns and one with the categorical columns.

In [None]:
# %load solutions/preprocessing_03.py

* Standardize the numerical dataframe and one-hot encode the categorical dataframe.

In [None]:
# %load solutions/preprocessing_04.py

* Concatenate the encoded arrays using `np.concatenate`.

In [None]:
# %load solutions/preprocessing_05.py

## The `ColumnTransformer` to simplify this pattern

You can imagine that encoding categorical columns and standardizing the numerical ones is a very generic pattern. Scikit-learn provides the `ColumnTransformer` (and the `make_column_transformer` helper) to simplify such processing by assigning a transformer to a specific set of columns and concatenating the results.
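
As a first sketch, here is the `ColumnTransformer` class used directly on a tiny hand-made dataframe. The values and the step names `'categorical'` and `'numerical'` are made up for illustration; each entry of the list is a `(name, transformer, columns)` tuple.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hand-made table (hypothetical values, for illustration only)
df_toy = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female'],
    'age': [29.0, 40.0, 18.0, 53.0],
    'fare': [211.3, 26.0, 7.9, 51.5],
})

# Each entry is (name, transformer, columns); the names are arbitrary labels
toy_preprocessor = ColumnTransformer([
    ('categorical', OneHotEncoder(), ['sex']),
    ('numerical', StandardScaler(), ['age', 'fare']),
])

X_toy_encoded = toy_preprocessor.fit_transform(df_toy)
print(X_toy_encoded.shape)  # 2 one-hot columns + 2 scaled columns
```

The `make_column_transformer` helper builds the same kind of object but generates the step names automatically.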

In [None]:
import os
import pandas as pd

titanic = pd.read_csv(os.path.join('datasets', 'titanic3.csv'))
titanic = titanic[['pclass', 'sex', 'age', 'sibsp',
                   'parch', 'fare', 'embarked']].dropna()
titanic.head()

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

categorical_columns = ['pclass', 'sex', 'embarked']
numerical_columns = ['age', 'sibsp', 'parch', 'fare']

preprocessor = make_column_transformer(
    (OneHotEncoder(), categorical_columns),
    (StandardScaler(), numerical_columns)
)
X_encoded = preprocessor.fit_transform(titanic)

In [None]:
X_encoded.shape

### Exercise

Load the adult dataset located in `./datasets/adult_openml.csv`. Make your own `ColumnTransformer` preprocessor. Let's proceed step by step with the following instructions.

* Read the adult dataset located in `datasets/adult_openml.csv` using `pd.read_csv`.
* Split the dataset into data and target. The target corresponds to the `class` column. For the data, drop the columns `fnlwgt`, `capitalgain`, and `capitalloss`.
* Create a list containing the names of the categorical columns. Do the same for the numerical columns.
* Create a pipeline to one-hot encode the categorical data. Use the `KBinsDiscretizer` for the numerical data. Import it from `sklearn.preprocessing`.
* Create a `preprocessor` by using the `make_column_transformer`. You should apply the right pipeline to the right columns.
* Apply `fit_transform` to obtain preprocessed data.
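
As a hint for the `KBinsDiscretizer` step, here is a small sketch on a made-up numerical feature, unrelated to the adult dataset. The parameter values (`n_bins=3`, `encode='ordinal'`, `strategy='uniform'`) are chosen only to keep the output easy to read; they are not the settings required by the exercise.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy numerical feature (values made up for illustration)
X_num = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Three equal-width bins, each value replaced by the index of its bin
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
X_binned = discretizer.fit_transform(X_num)

print(discretizer.bin_edges_[0])  # edges of the bins learned during fit
print(X_binned.ravel())           # bin index of each sample
```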

In [None]:
# %load solutions/preprocessing_06.py