# 03 - Data pre-processing and pipelines

In this notebook, we will review:
- Some common pre-processing steps for ML analyses in Cognitive Neuroscience, and why these are useful/needed.
- Why common pre-processing steps are implemented as so-called transformers in scikit-learn, and how transformers are also estimators.
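
As a minimal sketch of the last point, the snippet below checks that `StandardScaler` is both an estimator and a transformer, whereas a classifier such as `SVC` is an estimator only. `BaseEstimator` and `TransformerMixin` are scikit-learn base classes; the variable names are illustrative.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

scaler = StandardScaler()
svc = SVC()

# Every scikit-learn estimator derives from BaseEstimator and exposes `fit`
scaler_is_estimator = isinstance(scaler, BaseEstimator)
svc_is_estimator = isinstance(svc, BaseEstimator)

# Transformers additionally expose `transform` (and `fit_transform`)
scaler_is_transformer = isinstance(scaler, TransformerMixin)
svc_is_transformer = hasattr(svc, "transform")

print(scaler_is_estimator, scaler_is_transformer)
print(svc_is_estimator, svc_is_transformer)
```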

---

# Standard Scaler

(...)
- Explain what standardizing your data means.
- Some models, like support vector machine classifiers, benefit from the data being standardized.
    - Talk about SVMs and mention we will not discuss them in detail, but they can read more about them (here...)
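
To make the idea concrete, here is a minimal NumPy sketch of standardization on a small made-up array (the values and variable names are illustrative, not taken from any dataset). Each column is centered on its mean and divided by its standard deviation, so that features on very different scales become comparable.

```python
import numpy as np

# Toy data: 4 samples, 2 features on very different scales (made-up values)
X_toy = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
])

# z = (x - mean) / std, computed separately for each feature column
col_means = X_toy.mean(axis=0)
col_stds = X_toy.std(axis=0)
X_toy_std = (X_toy - col_means) / col_stds

print(X_toy_std.mean(axis=0))  # approximately 0 for each column
print(X_toy_std.std(axis=0))   # approximately 1 for each column
```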

Let's standardize our data using scikit-learn. For this example, we will use the wine dataset that ships with scikit-learn. Let's load it:

In [None]:
from sklearn.datasets import load_wine

dataset = load_wine(as_frame=True)
dataset['frame']

We can also read the description of the dataset that sklearn provides:

In [None]:
print(dataset['DESCR'])

The description of the dataset reveals that:
1. The dataset contains three classes as prediction targets. That is, it is a multi-class classification problem.
2. We are trying to predict the class of wine using 13 features.
3. If you take a look at the mean and standard deviation of these 13 features, you will notice their scale is different. 

We will thus scale each of the features so that they have a mean of 0 and a standard deviation of 1. In scikit-learn we can achieve this by calling the `StandardScaler` class (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)).

In [None]:
from sklearn.preprocessing import StandardScaler

# Define X and y
X = dataset['data']
y = dataset['target']

# Create and fit scaler
scaler = StandardScaler().fit(X)

Notice that for fitting the scaler we only need the input features (`X`), and not the target (`y`). Calling `fit` only learns the mean and standard deviation of each feature; it does not transform the data. For that, we also need to call the `transform` method:

In [None]:
# Transform X
X_tr = scaler.transform(X)

Notice that we only transform the input `X`. Also, `transform` returns a new array instead of modifying `X` in place, so we assign its output to a new variable (here `X_tr`) and keep the original `X` for later comparison. Let's verify that the data got scaled by computing the mean and standard deviation of each feature, which should be 0 and 1 respectively:

In [None]:
import numpy as np

# Compute mean and std of features
means = np.mean(X_tr, axis=0)
stds = np.std(X_tr, axis=0)

print(f"means: {np.round(means, 2)}")
print(f"stds: {np.round(stds, 2)}")
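
As a short, self-contained sketch of what the fitted scaler stores: `StandardScaler` keeps the per-feature statistics it learned in its `mean_` and `scale_` attributes, and `inverse_transform` maps scaled values back to the original units. The variable names below (`wine_X`, `wine_scaler`, and so on) are illustrative and chosen so as not to overwrite the variables used above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine_X = load_wine(as_frame=True)["data"]
wine_scaler = StandardScaler().fit(wine_X)

# Statistics learned during `fit`, one value per feature
learned_means = wine_scaler.mean_
learned_stds = wine_scaler.scale_

# `transform` applies (x - mean) / std with those stored statistics
wine_X_tr = wine_scaler.transform(wine_X)
by_hand = (wine_X.to_numpy() - learned_means) / learned_stds

# `inverse_transform` undoes the scaling
wine_X_back = wine_scaler.inverse_transform(wine_X_tr)

print(np.allclose(wine_X_tr, by_hand))
print(np.allclose(wine_X_back, wine_X.to_numpy()))
```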

As we stated before, support vector machines can be affected by the scale of the data. Let's compare the performance of such a model with and without scaling:

In [None]:
from sklearn.svm import SVC

# Create and fit model with original data
svc = SVC().fit(X, y)

# Create and fit model with transformed data
svc_scaled_data = SVC().fit(X_tr, y)

# Print scores of both models
print(f"SVC accuracy with non-scaled data: {np.round(svc.score(X, y), 2)}")
print(f"SVC accuracy with scaled data: {np.round(svc_scaled_data.score(X_tr, y), 2)}")

Indeed, the accuracy of the support vector classifier (measured here on the same data used for fitting) increased when we scaled our data.

#### ✍️ Exercise

The fitting and transformation steps can be simplified using `fit_transform`. Read how to use it [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and implement `StandardScaler` this way. Write your answer in the cell below, and press the three dots to reveal the solution.

In [None]:
# Answer
scaler = StandardScaler()
X_tr = scaler.fit_transform(X)

# Dimensionality reduction

- !! Explain what dimensionality reduction is
- !! Explain why do it (curse of dimensionality)
    - Mention this can lead to fitting our model to noise, but we will explain this concept in detail in the next notebook
    - With this approach we can remove noisy data that might be affecting the performance of our model
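
One way to get an intuition for the curse of dimensionality is to look at distances between random points: as the number of features grows, the nearest and the farthest point from a given reference become almost equally far away. The sketch below illustrates this with uniformly sampled synthetic data; the helper `distance_contrast` and the sample sizes are illustrative assumptions.

```python
import numpy as np

def distance_contrast(n_features, n_samples=500, seed=0):
    """Relative gap between the farthest and nearest point from a reference point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_samples, n_features))
    # Euclidean distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(f"{d:>4} features: contrast = {distance_contrast(d):.2f}")
```

The contrast shrinks as the number of features increases, which is one reason distance-based methods can struggle when there are many features and few samples.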

- There are two main types of dimensionality reduction: 
    - _Feature selection_: We select a subset of our features based on some method.
    - _Feature extraction_: We create new (and usually fewer) features based on the existing ones.

We will now see some examples of feature selection and feature extraction, and how to compute them using scikit-learn.

## Feature selection

In feature selection, we select a subset of the feature columns present in `X`, based on some selection criterion.

In scikit-learn we find many methods to perform feature selection, as can be seen [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection). In this example, we will illustrate how `SelectKBest` works (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)).

`SelectKBest` selects the __k__ features that have the highest score when evaluated with a pre-defined scoring function. Scikit-learn provides several scoring functions, and by default `SelectKBest` uses the ANOVA F-value between each feature and the target (`f_classif`, see [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)).

Let's select the 5 features with the highest F-value using this function:

In [None]:
from sklearn.feature_selection import SelectKBest

# Create and fit selector
selector = SelectKBest(k=5).fit(X_tr, y)

# Transform data
X_sel = selector.transform(X_tr)

Let's inspect the `selector` object after fitting:

In [None]:
vars(selector).keys()

In [None]:
selector.scores_
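
To connect the scores to the selection itself, the following self-contained sketch ranks the features by F-value by hand and compares the result with the columns reported by `get_support`. Variable names are illustrative. The F-value of a feature does not change when that feature is standardized, so the unscaled data is used here for brevity.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

wine = load_wine(as_frame=True)
wine_X, wine_y = wine["data"], wine["target"]

k = 5
kbest = SelectKBest(score_func=f_classif, k=k).fit(wine_X, wine_y)

# Names of the columns kept by the selector
kept_by_selector = list(wine_X.columns[kbest.get_support()])

# Same selection by hand: rank features by F-value and keep the top k
f_values, p_values = f_classif(wine_X, wine_y)
top_k_idx = np.argsort(f_values)[-k:]
kept_by_hand = list(wine_X.columns[np.sort(top_k_idx)])

print(kept_by_selector)
print(kept_by_hand)
```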

Let's check that `X_sel` now has fewer feature columns than `X`:

In [None]:
print(f"Number of columns in original data: {X.shape[1]}")
print(f"Number of columns in data after feature selection: {X_sel.shape[1]}")

#### ✍️ Exercise

Use another score metric in `SelectKBest` for selecting the $k$ features with the best performance. Write your answer in the cell below, and press the three dots to reveal the solution.

In [None]:
# Answer
from sklearn.feature_selection import chi2

# chi2 requires non-negative features, so we use the original (unscaled) X
selector = SelectKBest(score_func=chi2, k=5).fit(X, y)
X_sel = selector.transform(X)

## Feature extraction

- We transform our features into a new set of features living in a lower-dimensional space
- There are linear/non-linear methods
- !! Explain PCA (link to notebook?) (link to documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA))
 

In scikit-learn, dimensionality reduction methods are transformer objects. In this example, we will use `PCA` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)) to perform dimensionality reduction through feature extraction.

Let's first create a synthetic classification problem:

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=300, n_features=300, 
    n_informative=5, n_redundant=100, 
    random_state=0
)

Let's compute `PCA` over `X`: 

In [None]:
from sklearn.decomposition import PCA

# Fit PCA
pca = PCA(n_components=5, random_state=0).fit(X)

# Transform data
X_pca = pca.transform(X)

We will not cover the fitted attributes of `PCA` and their rationale here, but you can read more about them in detail here (link to notebook). For now, let's check the shape of the transformed data:

In [None]:
X_pca.shape
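
For readers who want a rough idea of what `PCA` computes, here is a minimal sketch under stated assumptions: it centers the data, takes a singular value decomposition with NumPy, and projects onto the leading directions. It uses `svd_solver="full"` so that scikit-learn's result is exact and comparable; the variable names are illustrative and do not overwrite the ones used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X_demo, _ = make_classification(
    n_samples=300, n_features=300,
    n_informative=5, n_redundant=100,
    random_state=0
)

n_components = 5
pca_demo = PCA(n_components=n_components, svd_solver="full").fit(X_demo)
X_demo_pca = pca_demo.transform(X_demo)

# By hand: center the data, then project onto the top right singular vectors
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_demo_manual = X_centered @ Vt[:n_components].T

# Fraction of the total variance captured by each component
var_ratio_manual = (S ** 2 / np.sum(S ** 2))[:n_components]

# Components are only defined up to a sign flip, so compare absolute values
print(np.allclose(np.abs(X_demo_pca), np.abs(X_demo_manual)))
print(np.allclose(pca_demo.explained_variance_ratio_, var_ratio_manual))
```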

#### ✍️ Exercise

{To-Do}

Write your answer in the cell below, and press the three dots to reveal the solution.

# Pipelines

- The convenience of pipelines
- How to implement a pipeline in sklearn
- Creating a pipeline is also a useful tool for avoiding information leakage when splitting the data into training/testing sets or doing cross-validation, but this will be explained in the next chapter

The documentation of `Pipeline` can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC())
])

In [None]:
# Define X and y
X = dataset['data']
y = dataset['target']

# Fit all steps of the pipeline (scaler, then classifier)
pipe = pipe.fit(X, y)

# Score model
pipe.score(X, y)
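
To illustrate what the pipeline does when fitted, the sketch below compares it with the equivalent manual steps: fit and apply the scaler, then fit the classifier on the scaled output. It is self-contained and uses illustrative variable names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine_X, wine_y = load_wine(return_X_y=True)

# With a pipeline: one object, one call to `fit`
wine_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC()),
]).fit(wine_X, wine_y)

# By hand: fit and apply the scaler, then fit the classifier on its output
manual_scaler = StandardScaler().fit(wine_X)
manual_svc = SVC().fit(manual_scaler.transform(wine_X), wine_y)

same_predictions = np.array_equal(
    wine_pipe.predict(wine_X),
    manual_svc.predict(manual_scaler.transform(wine_X)),
)
print(same_predictions)

# Individual steps stay accessible by name
print(wine_pipe.named_steps["scaler"].mean_[:3])
```

With the pipeline, the scaling is applied automatically inside `fit`, `predict`, and `score`, so there is no intermediate scaled array to keep track of.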

We can also use `make_pipeline` as a shortcut for creating a pipeline without naming each step (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)):

In [None]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC())
pipe

In [None]:
pipe.fit(X, y).score(X, y)
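
Since `make_pipeline` does not take step names, it generates them from the class names. The short sketch below (with an illustrative variable name, `auto_pipe`) shows those names and how they can be used to set the parameters of a step.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

auto_pipe = make_pipeline(StandardScaler(), SVC())

# Step names are the lowercased class names
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)  # ['standardscaler', 'svc']

# The names are used to address step parameters as <step>__<parameter>
auto_pipe.set_params(svc__C=10.0)
print(auto_pipe.named_steps["svc"].C)  # 10.0
```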

#### ✍️ Exercise

{To-Do}
Write your answer in the cell below, and press the three dots to reveal the solution.

In [None]:
# Answer

# ✏️ Check your knowledge

Load the ABIDE 2 dataset and create two pipelines:
1. _Pipeline 1_: Standardize your data, perform `PCA` selecting $k$ components, and perform classification analysis using `SVC`.
2. _Pipeline 2_: Standardize your data, select the $k$ best features using the F-score, and perform classification analysis using `SVC`.

Answer the following questions:
1. Which pipeline achieves the best performance?
2. Vary $k$. How does the performance change?


## Additional resources