# 3. Data pre-processing and pipelines

In this notebook, we will review:
- Transformer objects in _scikit-learn_.
- The concepts of preprocessing, feature selection and feature extraction; some examples of them and how to implement them using _scikit-learn_.
- Pipelines in _scikit-learn_ and how to create them.

---

# Transformers

Many real-world datasets don't come ready to serve as the input of a machine learning (ML) analysis and need additional __preprocessing__. For example, in classification problems it is very common that the labels are provided in string format (e.g. `dog`, `cat`) instead of the numerical format many algorithms expect. Fortunately, _scikit-learn_ provides many different functions to transform these datasets and make them ready for analysis. These types of functions belong to a broader type of estimators called __transformers__.
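As a minimal sketch of the label example above, the strings can be converted to integer codes with `LabelEncoder`. The toy labels and the variable names are made up for illustration and are not part of the dataset used later in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels in string format (made up for illustration)
labels = ["dog", "cat", "cat", "dog", "bird"]

# Fit the encoder and convert the strings to integer codes
encoder = LabelEncoder()
labels_encoded = encoder.fit_transform(labels)

# Classes found in the data, stored in sorted order
print(encoder.classes_)

# One integer code per original label
print(labels_encoded)

# The mapping can be reversed to recover the original strings
print(encoder.inverse_transform(labels_encoded))
```

Like the other transformers in this notebook, the encoder first learns something from the data (the set of classes) and then uses it to transform the data.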

Transformers are estimator objects that, besides learning from the data, can also transform it in some way. Besides preprocessing, transformers can also perform __feature selection__ or __feature engineering/extraction__ steps.

In feature selection, we select a subset of the feature columns present in `X`. In feature engineering/extraction, we create a new set of features from the existing ones. _Scikit-learn_ provides many different methods for achieving those aims.
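The difference between the two can be sketched as follows. This is only an illustration under some assumptions: it uses the wine dataset bundled with _scikit-learn_, `SelectKBest` as the selection method and `PCA` as the extraction method, and the `_demo` variable names are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Load features as a DataFrame and targets as a Series
X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)

# Feature selection: keep 3 of the original columns
selector_demo = SelectKBest(k=3).fit(X_demo, y_demo)
selected_names = list(X_demo.columns[selector_demo.get_support()])
print(f"Selected columns: {selected_names}")

# Feature extraction: build 3 new features from all the original ones
X_pca = PCA(n_components=3).fit_transform(X_demo)
print(f"Shape after extraction: {X_pca.shape}")
```

After selection, every remaining column is still one of the original features. After extraction, the new columns are combinations of the original ones and no longer correspond to a single named feature.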

Both methods come in useful when we want to perform __dimensionality reduction__ on our data.

(...)
- Why do this? Because of the curse of dimensionality.
- The curse of dimensionality can lead to fitting our model to noise; we will explain this concept in detail in the next notebook.
- With this approach we can remove noisy data that might be affecting the performance of our model.


This notebook will show you how to implement transformers in _scikit-learn_ based on some of these examples.

# Standard Scaler
A popular transformation consists of standardizing the dataset. This removes the mean of each feature (also called __centering__) and scales it to unit variance (`std=1`). This is an important step when features are on different scales, and/or when the observations in each feature do not roughly follow a standard normal distribution. These properties can make some algorithms behave badly.
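In other words, each value $x$ of a feature is replaced by $(x - \text{mean}) / \text{std}$, where the mean and standard deviation are computed per feature. A minimal sketch on a made-up toy array (not the dataset used below) compares this manual computation with `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Standardize by hand: subtract each column's mean, divide by its std
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Same operation using StandardScaler
X_scaled = StandardScaler().fit_transform(X_toy)

# Check that both approaches agree
print(np.allclose(X_manual, X_scaled))
```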

In this tutorial we will learn how to standardize our data. For this example we will use a popular dataset stored in _scikit-learn_. Let's load it:

In [None]:
from sklearn.datasets import load_wine

# Load and print dataset
dataset = load_wine(as_frame=True)
dataset['frame']

What is this dataset encoding? We can read a description of the dataset in the following manner:

In [None]:
print(dataset['DESCR'])

With this dataset we will be trying to predict the wine class from some of its chemical/structural attributes. In particular, the description of the dataset above reveals that:
1. The example contains three classes as targets of prediction. That is, it is a multi-class classification problem.
2. We are trying to predict the class of wine using 13 features.
3. If you take a look at the mean and standard deviation of these 13 features, you will notice they are scaled differently. 

We will thus scale each of the features so that they have a mean of 0 and a standard deviation of 1. In _scikit-learn_ we can achieve this using `StandardScaler` (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)):

In [None]:
from sklearn.preprocessing import StandardScaler

# Define X and y
X = dataset['data']
y = dataset['target']

# Create and fit scaler
scaler = StandardScaler().fit(X)

Notice that for fitting the scaler we only need the input features (`X`), and not the target (`y`). But calling the `fit` function is not enough to transform our input to the model. For this we need to use the `transform` function: 

In [None]:
# Transform X
X_tr = scaler.transform(X)
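To make the split between `fit` and `transform` more concrete, the sketch below inspects what a fitted scaler has learned. It reloads the wine data so that it runs on its own, and the `_demo` names are arbitrary; `mean_` and `scale_` are attributes that `StandardScaler` stores after fitting.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_demo = load_wine(as_frame=True)['data']

# fit only learns the statistics of each feature
scaler_demo = StandardScaler().fit(X_demo)
print(f"Learned means: {np.round(scaler_demo.mean_, 2)}")
print(f"Learned scales: {np.round(scaler_demo.scale_, 2)}")

# transform applies the learned statistics to the data
X_demo_tr = scaler_demo.transform(X_demo)
X_by_hand = (X_demo.to_numpy() - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(X_demo_tr, X_by_hand))
```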

Notice that we only transform the input `X`. Also, `transform` returns a new array rather than modifying `X` in place, so we store its output in a different variable (in this case `X_tr`) and keep the original `X` for later comparisons. Let's make sure the data got scaled by computing the mean and standard deviation of each feature, which should be 0 and 1 respectively:

In [None]:
import numpy as np

# Compute mean and std of features
means = np.mean(X_tr, axis=0)
stds = np.std(X_tr, axis=0)

print(f"means: {np.round(means, 2)}")
print(f"stds: {np.round(stds, 2)}")

One type of model that can behave badly if features are not standardized is the support vector machine (SVM) classifier, commonly used in cognitive neuroscience research. We won't describe what these models are in detail, but you can read more about them here {add}.

Let's compare the performance of a support vector classifier (implemented with `SVC` in _scikit-learn_) trained on a non-scaled dataset with that of one trained on a scaled dataset:

In [None]:
from sklearn.svm import SVC

# Create and fit model with original data
svc = SVC().fit(X, y)

# Create and fit model with transformed data
svc_scaled_data = SVC().fit(X_tr, y)

# Print scores of both models
print(f"SVC accuracy with non-scaled data: {np.round(svc.score(X, y), 2)}")
print(f"SVC accuracy with scaled data: {np.round(svc_scaled_data.score(X_tr, y), 2)}")

Indeed, the performance of SVC increased when we scaled our data.

#### ✍️ Exercise

The fitting and transformation steps can be simplified using `fit_transform`. Read how to use it [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and implement `StandardScaler` this way. Write your answer in the cell below, and press the three dots to reveal the solution.

In [None]:
# Answer
scaler = StandardScaler()
X_tr = scaler.fit_transform(X)

# Select K best

`SelectKBest` is one type of feature selection step. It selects the $k$ features that have the best score when evaluated with a pre-defined performance metric. _Scikit-learn_ provides many different types of scoring functions that can be used, and by default uses the ANOVA F-value of the sample (see [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)).

Let's select the 5 features with the highest F-value:

In [None]:
from sklearn.feature_selection import SelectKBest

# Create and fit selector
selector = SelectKBest(k=5).fit(X_tr, y)

# Transform data
X_sel = selector.transform(X_tr)

Let's inspect the `selector` object after fitting:

In [None]:
vars(selector).keys()

We can inspect the scores obtained by each feature:

In [None]:
selector.scores_
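These scores are the F-values returned by the default scoring function, `f_classif`. As a small sketch (reloading the data so that it runs on its own, with arbitrary `_demo` names), we can call the scoring function directly and compare its output with the selector's `scores_`:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)

# Call the scoring function directly: one F-value and p-value per feature
f_values, p_values = f_classif(X_demo, y_demo)

# Fit a selector that uses the same scoring function
selector_demo = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)

# The scores stored by the selector are the F-values computed above
print(np.allclose(selector_demo.scores_, f_values))
```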

Let's check that `X_sel` now has fewer feature columns than `X`:

In [None]:
print(f"Number of columns in original data: {X.shape[1]}")
print(f"Number of columns in data after feature selection: {X_sel.shape[1]}")
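Knowing how many columns were kept is often less informative than knowing which ones. One possible way to find out, sketched here on the unscaled data for brevity and with arbitrary `_demo` names, is the selector's `get_support` method, which returns a boolean mask over the original columns:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest

X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)
selector_demo = SelectKBest(k=5).fit(X_demo, y_demo)

# Boolean mask with True for the columns that were kept
mask = selector_demo.get_support()
print(f"Selected features: {list(X_demo.columns[mask])}")

# The kept columns are the ones with the 5 highest scores
top5 = np.argsort(selector_demo.scores_)[-5:]
print(set(top5) == set(np.flatnonzero(mask)))
```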

#### ✍️ Exercise

Use another score metric in `SelectKBest` for selecting the $k$ features with the best performance. Write your answer in the cell below, and press the three dots to reveal the solution.

In [None]:
# Answer (example using chi squared)
from sklearn.feature_selection import chi2

# Create and fit selector (chi2 requires non-negative features, so we use the unscaled X)
selector = SelectKBest(score_func=chi2, k=5).fit(X, y)

# _Optional material:_ Principal Component Analysis (PCA)

Another type of dimensionality reduction technique common in cognitive neuroscience research is Principal Component Analysis (PCA). If you have time after finishing this notebook, you can read more about it and learn how to implement it using _scikit-learn_ [here](./optional/pca.ipynb).

# Pipelines

[Pipelines](https://scikit-learn.org/stable/modules/compose.html) are objects in _scikit-learn_ that chain together transformers and a final estimator in a convenient manner.

Creating a pipeline is also a useful way to avoid leaking information when splitting the data into training/testing sets or doing cross-validation. Don't worry if you don't know what these terms mean; they will be explained in [Notebook 4](./04-preventing_overfitting.ipynb).

Let's implement a pipeline that chains together a standard scaler and a support vector classifier model, using the class `Pipeline` (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)):

In [None]:
from sklearn.pipeline import Pipeline

# Create pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])

Let's use this pipeline to transform, fit and score our model:

In [None]:
# Define X and y
X = dataset['data']
y = dataset['target']

# Fit pipeline (scales X, then fits the SVC on the scaled data)
pipe = pipe.fit(X, y)

# Score model
pipe.score(X, y)
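To see what the pipeline does under the hood, the sketch below compares it with the manual steps used earlier in this notebook. It reloads the data so that it runs on its own, and the `_demo` names are arbitrary; `named_steps` is the `Pipeline` attribute that gives access to each fitted step by name.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)

# Pipeline version: scaling and classification in one object
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
]).fit(X_demo, y_demo)

# Manual version: scale first, then fit the classifier
scaler_demo = StandardScaler().fit(X_demo)
X_demo_tr = scaler_demo.transform(X_demo)
svc_demo = SVC().fit(X_demo_tr, y_demo)

# Both versions should give the same predictions
same = np.array_equal(pipe_demo.predict(X_demo), svc_demo.predict(X_demo_tr))
print(same)

# Each fitted step can be accessed by the name we gave it
print(np.round(pipe_demo.named_steps['scaler'].mean_[:3], 2))
```

Note that with the pipeline we pass the original `X_demo` to `predict`; the scaling is applied internally before the data reaches the classifier.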

#### ✍️ Exercise

We can also use `make_pipeline` for a faster approach to creating a pipeline. Read the documentation of this function [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) and use it to create a pipeline. Write your answer in the cell below, and press the three dots to reveal the solution.

In [None]:
# Answer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC())
print(pipe)

# ✏️ Check your knowledge

Load the ABIDE 2 dataset and create two pipelines:
1. _Pipeline 1_: Standardize your data, and perform classification analysis using `SVC`.
2. _Pipeline 2_: Standardize your data, select $k$ best features using F-score, and perform classification analysis using `SVC`.

Answer the following questions:
1. Which pipeline achieves the best performance?
2. Vary $k$. How does the performance change?


## Additional resources