# Tutorial 3 (Week 4) - Data Preprocessing

## Learning Objectives

After completing this tutorial, you should be able to:

+ Perform data transformation using `sklearn.preprocessing`
  + Perform standardization and normalization
  + Encode ordinal and nominal values as numerical values
  + Perform discretization
  + Generate polynomial features
+ Combine preprocessing steps for heterogeneous data
+ Handle missing values using `sklearn.impute`
+ Perform dimensionality reduction using PCA

References:
- [scikit-tutorials](https://scikit-learn.org/stable/auto_examples/index.html#preprocessing)
- [Preprocessing tutorial](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)
- [Column transformers tutorial](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py)
- [Pipelines tutorial](https://scikit-learn.org/stable/modules/compose.html#combining-estimators)


We learned data visualization in a previous tutorial. In practice, data cleaning and visualization go hand in hand and are usually done together. We will go over a few data cleaning strategies in this tutorial.

In [1]:
import pandas as pd
import numpy as np

# Dataset

Let us work on the heart disease dataset ["Statlog (Heart)"](https://archive.ics.uci.edu/dataset/145/statlog+heart). The csv file and txt files with the dataset info are available from this Tutorial folder.

This dataset has 13 attributes and 1 label column (presence or absence of heart disease). In this tutorial, we focus only on data preprocessing and do not concern ourselves with any model or its predictions. For simplicity, we will therefore work on the whole dataset without splitting it into training and testing sets.

Go ahead and read the dataset using Pandas as usual, loading it into a variable `data`.

In [2]:
# TODO
data = pd.read_csv( "heart-statlog-.csv" )
data.head()

The `class` column indicates the presence and absence of disease, which we are not using for preprocessing in this tutorial. 

Use the `DataFrame.drop` function to drop the `class` column from the data.

In [None]:
# TODO
data = data.drop( 'class', axis=1 )
data.head()

Let's rename some of the columns for easier handling.

```
resting_blood_pressure               --> rest_BP
serum_cholestoral                    --> cholesterol
fasting_blood_sugar                  --> fast_sugar
resting_electrocardiographic_results --> rest_ECG
maximum_heart_rate_achieved          --> max_HR
exercise_induced_angina              --> exer_angina
number_of_major_vessels              --> vessels
```

In [3]:
# TODO
data = data.rename( columns={
    'resting_blood_pressure' : 'rest_BP',
    'serum_cholestoral' : 'cholesterol',
    'fasting_blood_sugar' : 'fast_sugar',
    'resting_electrocardiographic_results' : 'rest_ECG',
    'maximum_heart_rate_achieved' : 'max_HR',
    'exercise_induced_angina' : 'exer_angina',
    'number_of_major_vessels' : 'vessels'
})

The data has no missing values - this is stated in the accompanying txt file. Let's have a quick check on its descriptive statistics.

In [None]:
# TODO
data.describe()

The `heart-statlog.txt` file lists the data type of each attribute in the dataset: numerical (real), ordinal, binary, or categorical (nominal).

We will perform different preprocessing operations:
- Discretization on the `age` data
- Normalization and Polynomial Feature Construction on the other numerical data
- Encoding the ordinal and nominal data

Let's save the column indices for each of these types in a list for our later use. Complete the code below using the information from `heart-statlog.txt`. _(Note that the numbering in the txt file starts from 1, while column indexing starts from 0 -- adjust accordingly.)_

In [None]:
# Discrete (the age column)
disc_features = [0]

# Numerical (the rest of the real columns)
num_features = [3,4,7,9,11]

# TODO

# Ordinal
# ordinal_features = ?
ordinal_features = [10]

# Binary
# bin_features = ?
bin_features = [1,5,8]

# Categorical (nominal)
# cat_features = ?
cat_features = [6,2,12]

# Introduction to scikit-learn

The [`scikit-learn`](https://scikit-learn.org/stable/index.html) library belongs to the SciPy (Scientific Python) ecosystem, a set of libraries created for scientific computing. The first part of its name ("scikit", short for SciPy Toolkit) refers to this origin, while the second part refers to the discipline the library serves: machine learning. It is built on NumPy and provides efficient, reusable code. The library is included in the Anaconda distribution.

## Transformers and Estimators

__Transformers__ is a term used for classes in `scikit-learn` (or `sklearn`) that enable data transformations. `scikit-learn` provides a library of transformers, which may _clean_ (for preprocessing), _reduce_ (for unsupervised dimensionality reduction), _expand_ (for kernel approximation) or _generate_ (for feature extraction) feature representations.

All standard transformers in `sklearn` have the following methods:

- `fit`, which learns model parameters (e.g., mean and standard deviation for standardization) from a training set;
- `transform`, which applies this transformation model to unseen data;
- `fit_transform`, which fits and transforms the training data in one step for convenience and efficiency.
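
The three methods above can be sketched on a small example. The toy arrays `X_train` and `X_new` are made up for illustration and are not part of the heart dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_new = np.array([[4.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_train)                      # learn the per-column mean and std
print(scaler.mean_)                      # column means learned from X_train

# transform reuses the parameters learned from X_train on unseen data
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)

# fit_transform is a shortcut for fit followed by transform on the same data
X_train_scaled = StandardScaler().fit_transform(X_train)
print(X_train_scaled)
```

Note that `X_new` is scaled with the mean and standard deviation of `X_train`, not its own, which is what keeps information from unseen data out of the fitted parameters.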

We will use transformers for scaling (standardization and normalization) as well as for encoding in this tutorial.


__Estimators__ is a term used for classes which manage the estimation and decoding of a model. Estimators must provide a `fit` method, and should provide `set_params` and `get_params`, although these are usually provided by inheritance from `base.BaseEstimator`. 

We will use an estimator for discretization in this tutorial. 

A useful kind of estimator, which we are not using in this tutorial, is the _predictor_: an estimator that supports `predict` and/or `fit_predict`. This category encompasses classifiers, regressors, outlier detectors and clusterers.


## Preprocessing Module

A package in `scikit-learn`, named [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for tasks such as classification, regression, etc.

In [None]:
from sklearn import preprocessing

# Standardization and Normalization

_Standardization_ involves rescaling each feature so that it has a mean of zero and a standard deviation of one, as a standard normal distribution does. (It shifts and scales the values; it does not change the shape of their distribution.) Feature scaling through standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. If a feature has a variance that is orders of magnitude larger than that of the others, it might end up dominating the estimator, which might then not learn well from the other features.

_Normalization_ is the process of scaling individual samples to have unit norm, independently of the distribution of the samples. 

Note that standardization is a _feature-wise_ operation, while normalization is a _sample-wise_ operation. 
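
A minimal sketch of this difference, using a made-up 3 x 2 array rather than the heart dataset: after standardization the column statistics are fixed, while after normalization the row lengths are fixed.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data (hypothetical): 3 samples (rows), 2 features (columns)
X = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]])

X_std = StandardScaler().fit_transform(X)        # works column by column
X_norm = Normalizer(norm="l2").fit_transform(X)  # works row by row

col_means = X_std.mean(axis=0)               # each column now has mean 0
col_stds = X_std.std(axis=0)                 # ... and standard deviation 1
row_norms = np.linalg.norm(X_norm, axis=1)   # each row now has length 1

print(col_means, col_stds)
print(row_norms)
print(X_norm)
```

The first two rows point in the same direction (one is twice the other), so they become identical after normalization, whereas standardization keeps them distinct.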

## Standardization

Let's perform standardization on the numerical columns. We can select these columns by passing the indices list we constructed earlier to `DataFrame.iloc`.

In [None]:
data.iloc[:,num_features]

The `preprocessing` module provides the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) utility class, which is a quick and easy way to perform standardization on an array-like dataset. The scaled data will have zero mean and a unit variance.

The `fit_transform` method of `StandardScaler` works on each feature to first calculate the mean and variance of the feature (_fit_), then transforms the feature using the calculated mean and variance values as scaling parameters (_transform_). The method returns the transformed data as an array.

Let's run this method on the numerical columns.

In [None]:
scaler = preprocessing.StandardScaler()

In [None]:
num_scaled = scaler.fit_transform( data.iloc[:,num_features] )
num_scaled

Verify that the standardization works: what are the mean values of the original columns, and what are the mean values of the transformed columns? Are the latter exactly zero?

_(Note: For the multidimensional array, you will need to specify the axis in order to apply mean computation on individual columns.)_

In [None]:
# TODO: Original means
data.iloc[:,num_features].mean( axis=0 )

In [None]:
# TODO: Transformed mean
num_scaled.mean( axis=0 )

How about the variance?

In [None]:
# TODO: Original variance
data.iloc[:,num_features].var( axis=0 )

In [None]:
# TODO: Transformed variance
num_scaled.var( axis=0 )
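
When comparing variances, it may help to keep in mind that pandas and NumPy use different defaults: `DataFrame.var` computes the sample variance (`ddof=1`), while `ndarray.var` and `StandardScaler` use the population variance (`ddof=0`). The sketch below illustrates this on a small made-up DataFrame `df` (not the heart dataset) and reproduces the scaler's output by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical)
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

scaled = StandardScaler().fit_transform(df)

# Manual z-score using the population standard deviation (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)

sample_var = df.var(axis=0)              # pandas default: ddof=1
population_var = df.var(axis=0, ddof=0)  # matches what the scaler uses
scaled_var = scaled.var(axis=0)          # NumPy default: ddof=0

print(sample_var["a"], population_var["a"])
print(scaled_var)
```

The tiny deviations from exactly zero mean or unit variance that you may see in the transformed data come from floating-point rounding.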

## Normalization

The `preprocessing` module has the [`Normalizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) utility class, which transforms individual samples to unit norm. We can specify which norm to use (i.e., how the unit norm is defined); the default is the `l2` norm (Euclidean).

The `Normalizer` class also provides a `fit_transform` method. Run this method on our numerical columns.

In [None]:
# TODO
preprocessing.Normalizer( norm='l2' ).fit_transform( data.iloc[:,num_features] )

Another way to perform normalization is to use the [`normalize`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html#sklearn.preprocessing.normalize) function from the `preprocessing` module directly. Refer to the [Guide](https://scikit-learn.org/stable/modules/preprocessing.html#normalization) and try it below.

In [None]:
# TODO
preprocessing.normalize( data.iloc[:,num_features], norm='l2' )

# Encoding Ordinal and Nominal Values

Ordinal data can be encoded into numerical data using [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder). This results in a single column of integers (0 to `n_categories - 1`) per feature. 

For our ordinal data, we do not actually need to use the ordinal encoder, as the data is already in integer form. We will just instantiate the class here for later use.

In [None]:
ord_enc = preprocessing.OrdinalEncoder( categories='auto' )
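
As a small illustration of what the encoder does with non-numeric ordinal data, the sketch below uses a made-up `sizes` column. With `categories='auto'` the categories are sorted, which for strings means alphabetically, so the intended order has to be passed explicitly when it matters.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy ordinal feature (hypothetical): one column, four samples
sizes = np.array([["medium"], ["small"], ["large"], ["small"]])

# 'auto' sorts the categories: large < medium < small (alphabetical)
auto_enc = OrdinalEncoder(categories="auto")
auto_codes = auto_enc.fit_transform(sizes)

# Passing the order explicitly preserves the real ranking
explicit_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
explicit_codes = explicit_enc.fit_transform(sizes)

print(auto_enc.categories_)
print(auto_codes.ravel())
print(explicit_codes.ravel())
```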

## One-Hot Encoding

A common technique for encoding categorical variables is [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder). It transforms a categorical feature that has `n` possible values into `n` binary features. Exactly one of the binary features will have value 1 (corresponding to the feature value), and all others 0. 

For example, for a feature that has 4 categories named [1,2,3,4], the one-hot encoding will be:
```
1 -> [1, 0, 0, 0]

2 -> [0, 1, 0, 0]

3 -> [0, 0, 1, 0]

4 -> [0, 0, 0, 1]
```

What will happen if the encoder encounters unknown categories during transform? When `handle_unknown='ignore'` is specified, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros.

In [None]:
oh_enc = preprocessing.OneHotEncoder( categories='auto', handle_unknown='ignore' )
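
The effect of `handle_unknown='ignore'` can be seen on the four-category example from above. In this sketch the encoder `enc` is fitted on the values 1 to 4 and then asked to transform the value 9, which it has never seen; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on a single feature with categories 1, 2, 3, 4
train = np.array([[1], [2], [3], [4]])
enc = OneHotEncoder(categories="auto", handle_unknown="ignore")
enc.fit(train)

# Transform one known value (2) and one unknown value (9)
encoded = enc.transform(np.array([[2], [9]])).toarray()
print(encoded)
```

The known value gets a single 1 in its category's column, while the unknown value produces a row of zeros instead of raising an error.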

Let's apply one-hot encoding on our categorical (nominal) data. Select the relevant columns from the data. 

In [None]:
# TODO
data.iloc[:, cat_features]

Apply the `fit_transform` method of `OneHotEncoder` on those columns. How many columns do you expect to see in the output? (How many possible values does each feature have?)

In [None]:
# TODO
oh_enc.fit_transform( data.iloc[:, cat_features] )

Let's convert that sparse matrix output to a NumPy array so that we can view it, and save it as `data_cat` for further use.

In [None]:
# TODO
data_cat = oh_enc.fit_transform( data.iloc[:, cat_features] ).toarray()
data_cat

As there are three transformed features, we will expect to see three 1 values in each row of the transformed data if there are no unknown categories.

In [None]:
data_cat[0]

We instantiated `OneHotEncoder` earlier with `categories='auto'`, so that it determines the categories automatically from the data. We can check its `categories_` attribute to see them.

In [None]:
# TODO
oh_enc.categories_

# Discretization

Discretization (otherwise known as _quantization_ or _binning_) provides a way to partition continuous features into discrete values. One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. 

In our example, we can perform discretization on the `age` column. Let's check its value range again. 

In [None]:
data['age'].describe()

As the range of the column is 29 to 77, we can do a binning into 5 bins to express different age-groups. 

We will use the estimator class [`KBinsDiscretizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer) for this. Let's instantiate it with 5 bins and one-hot encoding, specify a strategy that will give us equal-width bins, and set the subsample option so that all the samples are used to determine the binning thresholds.

In [None]:
# TODO
# discretizer = ?
discretizer = preprocessing.KBinsDiscretizer( n_bins=5, strategy='uniform', encode='onehot', subsample=None )

As `KBinsDiscretizer` works with a 2-D array, we first need to convert the `age` column to a NumPy array of dimension `num_values` x 1, where the single column corresponds to the single feature. (You can use `reshape` to control the dimension.)

In [None]:
# TODO
# age_arr = ?
age_arr = np.array(data['age']).reshape(-1,1)

Now use the discretizer's `fit` method to compute the bins from the data. We can view the result by checking the `bin_edges_` attribute of the fitted discretizer.

In [None]:
# TODO
discretizer.fit( age_arr ).bin_edges_
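
With the `uniform` strategy, the bin edges are simply evenly spaced between the minimum and maximum of the feature. The sketch below checks this on a handful of made-up ages spanning the same 29 to 77 range; it uses `encode='ordinal'` (instead of one-hot) only so that the bin index of each sample is easy to read.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy ages (hypothetical) covering the range 29..77, as a column vector
ages = np.array([29, 35, 41, 48, 52, 58, 63, 70, 77], dtype=float).reshape(-1, 1)

disc = KBinsDiscretizer(n_bins=5, strategy="uniform", encode="ordinal")
bin_index = disc.fit_transform(ages).ravel()

edges = disc.bin_edges_[0]            # edges for the first (only) feature
expected = np.linspace(29, 77, 6)     # 5 equal-width bins need 6 edges

print(edges)
print(bin_index)
```

Each bin is (77 - 29) / 5 = 9.6 years wide, and the maximum value falls into the last bin.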

Now we can use the discretizer's `transform` method to discretize the data.

In [None]:
# TODO
k = discretizer.transform( age_arr )

The output is a sparse matrix, which we can convert to a NumPy array for viewing. We expect to see the one-hot encoding format that we specified earlier.

In [None]:
# TODO
k.toarray()

# Polynomial Feature Construction

It is often useful to add complexity to the model by considering nonlinear features of the input data. The transformer class [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) allows us to generate higher order terms and interaction terms (representing joint effects of multiple features) to consider this non-linearity. 

Refer to the class documentation for the definitions and default values of the parameters. Let's instantiate this class with degree 2, exclude bias columns, and only produce interaction features.

In [None]:
# TODO
# poly_tfr = ?
poly_tfr = preprocessing.PolynomialFeatures( degree=2, include_bias=False, interaction_only=True )

Apply its `fit_transform` method to our numerical data columns. How many columns are there in the transformed data?

In [None]:
# TODO
# poly_feats = ?
poly_feats = poly_tfr.fit_transform( data.iloc[:,num_features] )

print( data.iloc[:,num_features].shape )
print( poly_feats.shape )

We can view the names of the constructed features using the `get_feature_names_out` method of `PolynomialFeatures`.

In [None]:
# TODO
poly_tfr.get_feature_names_out()

Those form the columns of the transformed data, comprising original and constructed features. Let's see the values on the first row before and after feature construction.

In [None]:
# TODO: Original feature values in first row
data.iloc[0,num_features]

In [None]:
# TODO: Transformed feature values in first row
poly_feats[0]

# Putting It Together: Pipeline and ColumnTransformer

Our dataset contains heterogeneous data types, and we have applied different preprocessing steps to different columns. How do we put it all together? A simple approach is to stitch everything together in a new DataFrame. The following code snippet assembles the binned `age` column, the one-hot encoded categorical columns, and the remaining columns unchanged.
```
new_data = pd.DataFrame()

# One-hot encoded age bins from the discretizer output k (5 bins)
data_disc = k.toarray()
for i in range(5):
    new_data['age_'+str(i)] = data_disc[:,i]
new_data['sex'] = data.sex
# data_cat follows the order of cat_features = [6,2,12]:
# rest_ECG (columns 0-2), chest pain (columns 3-6), thal (columns 7-9)
for i in range(4):
    new_data['chest_pain_'+str(i)] = data_cat[:,3+i]
new_data['rest_BP'] = data.rest_BP
new_data['cholesterol'] = data.cholesterol
new_data['fast_sugar'] = data.fast_sugar
for i in range(3):
    new_data['rest_ECG_'+str(i)] = data_cat[:,i]
new_data['max_HR'] = data.max_HR
new_data['exer_angina'] = data.exer_angina
new_data['oldpeak'] = data.oldpeak
new_data['slope'] = data.slope
new_data['vessels'] = data.vessels
for i in range(3):
    new_data['thal_'+str(i)] = data_cat[:,7+i]
    
new_data.head()
```

However, as the number of preprocessing steps increases and the steps change, this approach becomes difficult to scale. To rescue us from this difficulty, sklearn has the [`sklearn.pipeline`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline) and [`sklearn.compose`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose) packages.

In [None]:
from sklearn import pipeline
from sklearn import compose

[__`pipeline.Pipeline`__](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) can be used to chain multiple _fixed_ steps into one. 

For example, our preprocessing steps for numeric columns are fixed: scaling, and doing polynomial feature creation. So we can essentially encapsulate these into a pipeline.

Let's instantiate `Pipeline` to chain our `StandardScaler` and `PolynomialFeatures` transformers from earlier.

In [None]:
# TODO
# numeric_transformer = ?
numeric_transformer = pipeline.Pipeline( steps=[('scaler', scaler), ('poly', poly_tfr)] )

'''
This is the same as:

numeric_transformer = pipeline.Pipeline( steps=[('scaler', preprocessing.StandardScaler()), 
                                                ('poly', preprocessing.PolynomialFeatures( degree=2, include_bias=False, interaction_only=True ))] )
'''
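
A pipeline of transformers behaves like applying the steps by hand, one after the other. The sketch below checks this on random made-up data (`X`, generated with a fixed seed) rather than on the heart dataset; the step names `scaler` and `poly` mirror the ones used above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (hypothetical): 20 samples, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(20, 4))

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)),
])
via_pipeline = pipe.fit_transform(X)

# The same two steps applied manually, in the same order
X_scaled = StandardScaler().fit_transform(X)
by_hand = PolynomialFeatures(
    degree=2, include_bias=False, interaction_only=True
).fit_transform(X_scaled)

print(np.allclose(via_pipeline, by_hand))
print(pipe.named_steps["scaler"].mean_)   # fitted steps stay accessible by name
```

The order of the steps matters: here the interaction terms are products of the already standardized features.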

We can blindly apply this numeric transformer (with `fit_transform`) to all the columns of our data. Starting with 13 columns, how many columns will the transformed output have?

In [None]:
# TODO
numeric_transformer.fit_transform( data ).shape

_(13 original columns + C(13, 2) = 78 pairwise interaction columns, giving 13 + 78 = 91 columns)_

However, this is not very useful -- what is the meaning of a polynomial variable comprising a nominal variable multiplied by a real variable?

[__`compose.ColumnTransformer`__](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) helps to perform different transformations for different columns of the data, within a Pipeline that is safe from data leakage and that can be parameterized. To each column, a different transformation can be applied, such as preprocessing for different types of data.

Let's instantiate `ColumnTransformer` to combine the following transformers that we have seen earlier:
- apply the numeric transformation `Pipeline` to our numeric features;
- apply the discretizer to our discrete feature;
- apply the one-hot encoder to our categorical features;
- apply the ordinal encoder to our ordinal features.

We can specify that we want all remaining columns that were not specified in transformers, but present in the data passed to fit, to be automatically passed through.

In [None]:
# TODO
# preprocessor = ?
preprocessor = compose.ColumnTransformer(
                transformers=[
                    ('num', numeric_transformer, num_features),
                    ('disc', discretizer, disc_features),
                    ('cat', oh_enc, cat_features),
                    ('ord', ord_enc, ordinal_features)
                ], remainder="passthrough"
                )

Now run its `fit_transform` and check the result.

In [None]:
# TODO
preprocd_data = preprocessor.fit_transform(data)
print( preprocd_data.shape )
preprocd_data

Finally, we can convert the preprocessed data back to DataFrame format for our use in analysis.

In [None]:
# TODO
data_preprcd = pd.DataFrame(preprocd_data)
data_preprcd.index = data.index
data_preprcd.head()

# Dealing with Missing Values using `SimpleImputer`

Since our dataset has no missing values, let us randomly remove some `age` values for the purpose of this tutorial.

In [None]:
data_drop = data.copy()
# Pick 10 distinct row positions at random and blank out their age (column 0)
data_drop.iloc[ np.random.choice( len(data_drop), size=10, replace=False ), 0 ] = np.nan
data_drop.age.to_numpy()

The `sklearn.impute` package provides a [__`SimpleImputer`__](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) that can help us fill these missing values.

In [None]:
from sklearn.impute import SimpleImputer

Let's instantiate the class to replace missing values with the mean of the column.

In [None]:
# TODO
# imp_mean = ?
imp_mean = SimpleImputer( missing_values=np.nan, strategy='mean' )

We can run the `fit` method first and see the computed values, in particular the mean `age` which will be used to replace the values we removed.

In [None]:
# TODO
imp_mean.fit( data_drop ).statistics_
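
The mean computed by `fit` ignores the missing entries. A minimal sketch on a made-up 4 x 2 array `X` with one missing value per column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical) with one NaN in each column
X = np.array([
    [40.0, 1.0],
    [np.nan, 0.0],
    [60.0, 1.0],
    [50.0, np.nan],
])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(X)
print(imp.statistics_)     # per-column means over the non-missing values

filled = imp.transform(X)  # NaNs replaced by the learned column means
print(filled)
```

The first column's mean is taken over 40, 60 and 50 only, so the missing entry is filled with 50.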

Now we can run the `transform` method to actually do the filling, and convert the result back to a DataFrame.

In [None]:
# TODO
# data_filled = ?
data_filled = pd.DataFrame( imp_mean.transform( data_drop ))
data_filled.columns, data_filled.index = data_drop.columns, data_drop.index
data_filled

The NumPy array format will enable us to see the entire column and check the replaced missing values.

In [None]:
data_filled.age.to_numpy()

# Image Data and PCA (Feature Decomposition)

## Dataset

Now let us work on image data, as we have already explored tabular, hierarchical and array data in the previous tutorials. Let us use the [Olivetti dataset](https://cam-orl.co.uk/facedatabase.html), which was used in the context of a face recognition project at AT&T Laboratories Cambridge. This dataset contains a set of face images of 40 different subjects. This dataset is available in `sklearn` itself. 

The code below will fetch the dataset.

In [None]:
from sklearn.datasets import fetch_olivetti_faces

faces, targets = fetch_olivetti_faces( return_X_y=True )

The returned data `faces` is an array representation of the images, where each row corresponds to a ravelled face image of original size 64 x 64 pixels.

In [None]:
print( faces.shape )
faces

The returned `targets` are the labels associated with each face image. They range from 0 to 39 and correspond to the Subject IDs.

In [None]:
targets

We can use Matplotlib's function [`imshow`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html) to display data as an image. For example, let's display the 25th sample.

In [None]:
import matplotlib.pyplot as plt

image_shape = (64,64)
plt.imshow( faces[24].reshape(image_shape), cmap=plt.cm.gray )

Let's perform standardization on this image data using `StandardScaler`.

In [None]:
# TODO
scaler = preprocessing.StandardScaler()
faces_scaled = scaler.fit_transform( faces )
print( faces_scaled.shape )
faces_scaled

The transformation will also be visible when we display the resulting data using Matplotlib.

In [None]:
# TODO: Display the scaled 25th sample
plt.imshow( faces_scaled[24].reshape(image_shape), cmap=plt.cm.gray )

## Principal Component Analysis

Principal Component Analysis (PCA) is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance. It is a technique that helps us reduce the dimensionality of our dataset.

The [`sklearn.decomposition`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) module provides the transformer [`PCA`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA). It learns `n_components` in its `fit` method, and can be used on new data to project it on these components. 

In [None]:
from sklearn.decomposition import PCA

# TODO: instantiate PCA and apply it to the standardized faces data
# pca = ?
pca = PCA()
pca.fit( faces_scaled )

Let us find out how many components are sufficient to explain our faces dataset, by plotting the cumulative explained variance against the number of components.

In [None]:
plt.plot( np.cumsum( pca.explained_variance_ratio_ ))
plt.xlabel( 'number of components' )
plt.ylabel( 'cumulative explained variance' )
plt.grid()
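
Instead of reading the number of components off the plot, it can also be computed from the cumulative explained variance. The sketch below does this for a 90% threshold on synthetic low-rank data (generated here with a fixed seed as a stand-in, so that the example runs without downloading the faces). It also shows that passing a fraction as `n_components`, together with `svd_solver='full'`, lets `PCA` make the same selection itself.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): 200 samples, 50 features,
# generated from 5 latent factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

pca_full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.90
n_needed = int(np.searchsorted(cumulative, 0.90, side="right") + 1)
print(n_needed)

# Alternatively, let PCA choose: a float in (0, 1) is a variance threshold
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X)
print(pca_90.n_components_)
```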

We see that 100 components explain about 90% of the variance in the dataset. Thus, those 100 components might be sufficient for our downstream tasks like prediction. 

As we have image data however, we can actually view these orthogonal components that PCA has learnt. These are called _Eigenfaces_. A combination of these eigenfaces is usually sufficient to recreate the original sample. 

In [None]:
n_components=100
h = w = 64

print( "Extracting the top %d eigenfaces from %d faces" % (n_components, faces_scaled.shape[0]) )
pca = PCA( n_components=n_components, svd_solver='randomized', whiten=True ).fit( faces_scaled )

eigenfaces = pca.components_.reshape(( n_components, h, w ))

In [None]:
pca.components_.shape

In [None]:
eigenfaces.shape

Let's see what these eigenfaces look like. The code below plots a gallery of portraits, with preset numbers of rows and columns.

In [None]:
def plot_gallery( images, titles, h, w, n_row=3, n_col=5 ):
    "Helper function to plot a gallery of portraits"
    plt.figure( figsize=(1.8 * n_col, 2.4 * n_row) )
    plt.subplots_adjust( bottom=0, left=.01, right=.99, top=.90, hspace=.35 )
    for i in range( n_row * n_col ):
        plt.subplot( n_row, n_col, i + 1 )
        plt.imshow( images[i].reshape((h, w)), cmap=plt.cm.gray )
        plt.title( titles[i], size=12 )
        plt.xticks(())
        plt.yticks(())

eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery( eigenfaces, eigenface_titles, h, w )

plt.show()

Eigenfaces are the eigenvectors used in the computer vision problem of human face recognition. They are the principal components of a distribution of faces, that is, the directions along which the faces in a dataset vary the most. A face can be encoded as its coordinates along these directions and decoded again (approximately) as the corresponding combination of eigenfaces.
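
Encoding and decoding can be sketched with the `transform` and `inverse_transform` methods of `PCA`. The example below uses random 8 x 8 "images" as a stand-in for the faces (an assumption made so that it runs without downloading the dataset) and a hypothetical helper, `reconstruction_error`, to show that the reconstruction improves as more components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for face images (hypothetical): 100 flattened 8 x 8 images
rng = np.random.default_rng(0)
images = rng.normal(size=(100, 64))

def reconstruction_error(n_components):
    """Mean squared error after encoding and decoding with n_components."""
    pca = PCA(n_components=n_components).fit(images)
    codes = pca.transform(images)             # encode: coordinates along the components
    restored = pca.inverse_transform(codes)   # decode: back to pixel space
    return float(np.mean((images - restored) ** 2))

err_few = reconstruction_error(5)
err_many = reconstruction_error(40)
err_all = reconstruction_error(64)   # all components: essentially lossless

print(err_few, err_many, err_all)
```

Random noise has no structure to exploit, so the error drops slowly here; real face images are highly correlated across pixels, which is why comparatively few eigenfaces can capture most of their variance.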