# CRC Workshop: Machine Learning with Functional Connectivity (FC) Data


## The AOMIC Data

First, let's load the data and inspect it a bit. The [AOMIC dataset](https://nilab-uva.github.io/AOMIC.github.io/) is a collection of data obtained in three different studies (**PIOP1**, **PIOP2**, **ID1000**). Here, we will only be concerned with the data from the **ID1000** study, which aimed to collect 1000 fMRI scans during movie-watching. The next cells load the dependencies and define the path to all the **ID1000**-specific data, along with the names of the two files we are interested in. One of these files contains the preprocessed **functional connectivity (FC)** data, whereas the other contains the **demographic** information. Let's start by loading the dependencies:

In [None]:
import pandas as pd
from pathlib import Path

In [None]:
# Path to ID1000 data within the AOMIC datalad dataset
ID1000_path = (
    Path("..") / "aomic-fc" / "junifer_storage" /
    "JUNIFER_AOMIC_TSV_CONNECTOMES" / "ID1000"
)

# Path to the demographics data file
demographics_path = ID1000_path / "ID1000_participants.tsv"

# Path to the connectomes data file
# The name of this file is a bit of a mouthful but contains important
# information
connectomes_path = ID1000_path / (
    "ID1000_BOLD_parccortical-Schaefer100x17FSLMNI_"
    "parcsubcortical-TianxS2x3TxMNInonlinear2009cAsym_"
    "marker-empiricalFC_moviewatching.tsv.gz"
)

### Demographic Data

Now that we have defined these paths, let's load each file and look at them one by one. Let's start with the demographics. We will load it using pandas. As you might see from the file extensions, both files are **tsv** files, so we load them using a tab as the delimiter. In addition, we will load the first column as the index of the dataframe, as it happens to contain the subject IDs.

In [None]:
demographics = pd.read_csv(demographics_path, sep="\t", index_col=0)
demographics

We can see some of the standard demographic variables, like "age", "sex", "BMI", and so on. As you might be able to tell, however, this file contains not *only* "demographic" information but also other participant data, such as cognitive measurements (e.g. "IST_memory", "IST_fluid").

### Connectomes

Let us now check the connectomes out to see for which subjects we have preprocessed functional connectivity data available.

In [None]:
connectomes = pd.read_csv(connectomes_path, sep="\t", index_col=0, compression="gzip")
connectomes

In this dataframe again, **each row** corresponds to *one subject* from the study. **Each column** represents a *unique pairwise relationship* between two brain areas (also called an *edge* in graph theory terminology). A brain parcellation with **N** areas results in an **NxN** symmetric correlation matrix per subject. Because the matrix is symmetric, one half of each subject's matrix is redundant and is discarded. The diagonal is typically discarded as well, since the correlation of an area with itself is always 1. The remaining entries can be stacked to form one row of this dataframe, so each row contains **N x (N-1) / 2** entries. In our case, the connectomes were processed with a combination of the Schaefer 100 cortical parcellation and the Tian 32 subcortical parcellation, which gives N = 100 + 32 = 132 areas and therefore **132 x (132 - 1) / 2 = 8646** columns. This concept is illustrated by the graphic below:

![title](images/connectomes.png)
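As a rough sketch of this vectorisation step (not the exact code used to preprocess the AOMIC connectomes), the snippet below builds a correlation matrix from random time series for a hypothetical parcellation with 132 areas and keeps only the entries above the diagonal. The random data and all variable names are illustrative assumptions.

```python
import numpy as np

n_areas = 132  # 100 cortical (Schaefer) + 32 subcortical (Tian) areas
rng = np.random.default_rng(0)

# Fake time series: 200 time points for each area (illustrative only)
timeseries = rng.standard_normal((200, n_areas))

# N x N symmetric correlation matrix with ones on the diagonal
fc_matrix = np.corrcoef(timeseries, rowvar=False)

# Indices of the upper triangle, excluding the diagonal (k=1)
upper_rows, upper_cols = np.triu_indices(n_areas, k=1)

# Stack the remaining entries into one row vector for this "subject"
edges = fc_matrix[upper_rows, upper_cols]

print(fc_matrix.shape)  # (132, 132)
print(edges.shape)      # (8646,)
```

The length of `edges` matches the number of columns in the connectomes dataframe.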


### Subsetting the data

As you can see, the first dataframe on demographics contains 928 rows (i.e. subjects), whereas the second dataframe contains 877 rows. For further analyses, let us only select subjects for which we actually have connectomes. But first, let's identify any 'NaN' values in the functional connectivity data so that we can remove subjects with 'NaN' entries.

The pandas **isna()** method will check for each entry in the dataframe whether it is 'NaN' or not. That is, if an entry is 'NaN' it will return True and otherwise it will return False. We can use this to identify the indices (i.e. subjects) for which there are 'NaN' entries by combining it with the **any()** method provided by pandas.

First see the output from **isna()**:

In [None]:
isna = connectomes.isna()
isna

The **any()** method will return whether any element along a given axis (i.e. along a row or a column) is True. The following output should be "True" therefore, if a subject has 'NaN' values and "False" otherwise:

In [None]:
isna_any = isna.any(axis=1)
isna_any

We can do calculations on these boolean values as if they were 0s and 1s. That is, "True" will be counted as 1 and "False" will be counted as 0. We can therefore use the **.sum()** method to determine the number of subjects that have at least one 'NaN' value:

In [None]:
isna_any.sum()

The output ("0") shows us that no subject has any 'NaN' values, so we can simply proceed with the data we have here. Let us therefore now subset the demographic data to the subjects for which we have connectomes. That is, we will index the demographics dataset using the index from the connectomes dataset:

In [None]:
subsampled_demographics = demographics.loc[connectomes.index]
subsampled_demographics

Indexing with the **.loc[]** indexer importantly also ensures that the rows in both dataframes are in the same order. This will be important later when we convert them to numpy arrays, a data structure that scikit-learn understands.
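To make the row alignment concrete, here is a small sketch with made-up toy dataframes (the subject IDs, column names, and values are invented for illustration). It shows that indexing with another dataframe's index both selects and reorders the rows.

```python
import pandas as pd

# Toy demographics with more subjects than we have connectomes for
toy_demographics = pd.DataFrame(
    {"age": [21, 25, 23, 22]},
    index=["sub-0001", "sub-0002", "sub-0003", "sub-0004"],
)

# Toy connectomes for a subset of subjects, in a different order
toy_connectomes = pd.DataFrame(
    {"edge_1": [0.3, 0.1], "edge_2": [0.5, 0.2]},
    index=["sub-0003", "sub-0001"],
)

# Select and reorder the demographics rows to match the connectomes
toy_matched = toy_demographics.loc[toy_connectomes.index]

# Both dataframes now list the same subjects in the same order
print(toy_matched.index.equals(toy_connectomes.index))  # True
```

Note that this kind of indexing raises a `KeyError` if a subject in the connectomes index is missing from the demographics, which is a useful safety check in itself.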

### Exploring our sample:

Now that the samples in the connectome data and the demographics data are matched, let's take a quick look at sex and age to get an overview of our sample.

The **value_counts()** method takes a pandas series (i.e. a column from the dataframe) and counts the number of times each value occurs in the column. This is a good way of discovering which values a specific variable takes, and how many instances there are of each value. It is especially useful for categorical variables such as "sex":

In [None]:
subsampled_demographics["sex"].value_counts()

The **plot.hist()** pandas method provides a quick way of making a histogram that we can also group by "sex" to look at each distribution separately:

In [None]:
subsampled_demographics.plot.hist(column=["age"], by="sex", figsize=(10, 8))

As you can see, the age range is quite narrow, and limited to young people. This is a common problem in neuroimaging or psychology studies, which often sample students from their universities for convenience. It is always good to be aware of these limitations before starting any complicated machine learning pipeline.

## Doing some ML

Now, let's try to build a classifier that can distinguish between males and females given a functional connectome. That is, the connectomes will be the features ('X') and the sex will be the target ('y'). Since "sex" in our data is encoded as "male" and "female" and scikit-learn only understands numeric data, we have to convert "sex" to a numeric, categorical variable. Since it's a binary target, this is relatively straightforward. We can do this simply by adding another column to our demographics data as follows:


In [None]:
subsampled_demographics["sex_numeric"] = subsampled_demographics["sex"].map(lambda x: 1 if x == "female" else 0)

We can inspect the output as follows:


In [None]:
sex_info = subsampled_demographics[["sex", "sex_numeric"]]
sex_info

As you can see, there is a 1 where sex is given as "female" and a 0 where sex is given as "male". Let's finalise the target as a numpy array, which is a data structure scikit-learn understands:

In [None]:
import numpy as np

y = np.array(sex_info["sex_numeric"])
y
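One thing to be aware of is that the lambda used above maps every value other than "female" to 0, which would include missing or misspelled entries. A slightly more defensive sketch, shown here on a made-up toy series, uses an explicit dictionary so that unexpected values become 'NaN' and can be detected. The names `toy_sex` and `sex_mapping` are illustrative and not part of the workshop code.

```python
import pandas as pd

# Toy column with one missing entry (illustrative only)
toy_sex = pd.Series(["female", "male", None, "female"])

# Explicit mapping: values that are not listed become NaN instead of 0
sex_mapping = {"female": 1, "male": 0}
toy_sex_numeric = toy_sex.map(sex_mapping)

# Count entries that could not be mapped (here: the missing value)
n_unmapped = toy_sex_numeric.isna().sum()
print(n_unmapped)  # 1
```

If `n_unmapped` is greater than zero, those subjects would need to be inspected or removed before fitting a model.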

### Train-test split

Let us first split the data so we have one hold-out validation set that will be left untouched for now. Let us also finalise our features as a numpy array:

In [None]:
from sklearn.model_selection import train_test_split

X = np.array(connectomes)

X_model_selection, X_holdout, y_model_selection, y_holdout = train_test_split(X, y, random_state=25)

We will then do another train-test split on the model selection data, so that we can train models on the inner training set, compare their performance on the inner test set, and then evaluate our final model on the hold-out validation set.

In [None]:
X_inner_train, X_inner_test, y_inner_train, y_inner_test = \
    train_test_split(X_model_selection, y_model_selection, random_state=30)
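To see how much data ends up in each part, the following sketch repeats the two splits on synthetic stand-in data with 100 samples (the data and the shortened variable names are assumptions for illustration). When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 "subjects" with 10 features each
X_toy = np.random.default_rng(0).standard_normal((100, 10))
y_toy = np.array([0, 1] * 50)

# Outer split: model selection data vs. hold-out data
X_ms, X_ho, y_ms, y_ho = train_test_split(X_toy, y_toy, random_state=25)

# Inner split: only the model selection data is split again
X_tr, X_te, y_tr, y_te = train_test_split(X_ms, y_ms, random_state=30)

# Sizes of the hold-out set, the inner training set, and the inner test set
print(len(X_ho), len(X_tr), len(X_te))
```

The three parts are disjoint and together cover all samples, so the hold-out set plays no role while models are being compared.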

### Fitting a bunch of models

Every classification or regression task is different, and may therefore require a different model. That is, which model works best depends on the underlying processes that generate the distribution of our X and on the "true" function that maps X to y. In other words, we cannot really know which model will work best before we try it out. Let us therefore test a few popular options, starting with a **Ridge Classifier**.

In [None]:
from sklearn.linear_model import RidgeClassifier

We can initialise the RidgeClassifier object, and then fit it on the training data:

In [None]:
ridge = RidgeClassifier()

In [None]:
ridge.fit(X_inner_train, y_inner_train)

We can now test the accuracy of this classifier by making predictions on our test set that was so far unseen. The predictions can then be compared to the true values of y in this test set using some metric (for example "accuracy" in our case):

In [None]:
predictions_ridge = ridge.predict(X_inner_test)

In [None]:
predictions_ridge

In [None]:
y_inner_test

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_inner_test, predictions_ridge)
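For intuition, accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on made-up toy labels and also computes the accuracy one would get by always predicting the most frequent class, which is a useful reference point when classes are imbalanced. The arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions (illustrative only)
y_true_toy = np.array([1, 0, 1, 1, 0, 1])
y_pred_toy = np.array([1, 0, 0, 1, 0, 0])

# Accuracy is the fraction of predictions that match the true labels
manual_accuracy = np.mean(y_true_toy == y_pred_toy)

# Baseline: always predict the most frequent class in y_true_toy
majority_class = np.bincount(y_true_toy).argmax()
baseline_accuracy = np.mean(y_true_toy == majority_class)

print(manual_accuracy, accuracy_score(y_true_toy, y_pred_toy), baseline_accuracy)
```

A classifier is only informative if its accuracy is clearly above this baseline.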

However, if we check out the **sklearn documentation**, we see that the **RidgeClassifier** has a parameter called **alpha**, which can be set to a positive floating point value. This is a **hyperparameter**: *it must be set by the user and is not learned from the data during fitting*. How can we know which value is the best one? First, run the code in the next cell to see the documentation:

In [None]:
?RidgeClassifier

A simple approach could be to define a grid of potential candidate values, repeat the fitting and testing procedure for each of them, and then simply select the one that yields the highest accuracy. One might do so as follows:

In [None]:
scores = {}
alpha_candidates = [0.001, 0.01, 0.1, 1, 2, 10, 50, 100]
for alpha in alpha_candidates:
    ridge = RidgeClassifier(alpha=alpha)
    ridge.fit(X_inner_train, y_inner_train)
    # make predictions on the test data
    predictions_ridge = ridge.predict(X_inner_test)
    # compare the predictions to actual observations of the y
    scores[str(alpha)] = accuracy_score(y_inner_test, predictions_ridge)

In [None]:
scores
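Picking the winner from such a dictionary can be done with Python's built-in `max`. The sketch below uses placeholder accuracies that are invented for illustration and are not results from the AOMIC data.

```python
# Placeholder accuracies for illustration only, not real results
example_scores = {"0.001": 0.70, "0.1": 0.74, "10": 0.72}

# Return the key whose value (the accuracy) is highest
best_alpha = max(example_scores, key=example_scores.get)
print(best_alpha, example_scores[best_alpha])  # 0.1 0.74
```

If several candidates share the highest score, `max` returns the first one it encounters.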

Most of the time, however, we want to compare not only **hyperparameters** within one **model family**, but also accuracy across **different model families**, since we cannot know beforehand which model family will perform best. Let's import some other model families:

In [None]:
from sklearn.svm import SVC

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

We can organise these models in a python dictionary, such that the **"keys"** of the dictionary tell us the **name of a classifier**, and the **"values"** of the dictionary hold the actual scikit-learn objects. Note that we can also use some model families multiple times with different hyperparameters. For example, we define three different **Support Vector Classifiers (SVC)**, each with a different **kernel** (this is one of the SVC's hyperparameters):

In [None]:
classifiers = {
    "LogReg": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=50),
    "linear_SVC": SVC(kernel="linear"),
    "rbf_SVC": SVC(kernel="rbf"),
    "poly_SVC": SVC(kernel="poly", degree=2),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "GNB": GaussianNB(),
    "Ridge": RidgeClassifier(),
}

When looping through dictionaries, the key and corresponding value can be retrieved at the same time using the **items()** method. That means, in each iteration, we obtain the name and the object of a specific classifier:

In [None]:
scores = {}
for classifier_name, classifier_object in classifiers.items():
    # fit the model on the training data
    classifier_object.fit(X_inner_train, y_inner_train)
    # make predictions on the test data
    classifier_predictions = classifier_object.predict(X_inner_test)
    # compare the predictions to actual observations of the y
    scores[classifier_name] = accuracy_score(y_inner_test, classifier_predictions)

In [None]:
scores

# Exercises

1. Take the model that obtained the highest score in the model selection process, refit it on the model selection data (**X_model_selection, y_model_selection**) and test it on the holdout data (**X_holdout, y_holdout**). How does it perform now? Better, worse, or the same? What about other models that performed well (but not best) in the model selection process?
2. Check out the documentation for some of the other model families. Can you see some more hyperparameters that sound interesting to you? Add more models with different hyperparameters. How do they affect accuracy on the test set? Do they perform better than the models already outlined above? How about on the hold-out set?

# Additional Reading

* [This article talks a bit more about train-test split evaluation](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/)

* The scikit-learn user guide has some explanations and demonstrations of the above mentioned models [here](https://scikit-learn.org/stable/supervised_learning.html) and it usually links to some relevant papers or books as well.
It is well worth trying to read and understand as much as possible about the individual algorithms you are planning to use in your research.

## Tuning Hyperparameters using Cross-Validation

In the previous example we compared different models regarding their accuracy on some test data, and selected the model that achieved the highest score on the test set. We have seen that a **model** can be defined by its **model family** and by the associated **hyperparameters** of that model family. We have seen a naive strategy to select the best model for a problem at hand, but perhaps we have also noticed some problems with it. Importantly, we have noticed that the performance of the selected model can change on the holdout set. That is, the model that we selected based on its performance on the test set may not actually be the best model for the **general problem at hand** but may just be the model **that happened to work best on the test set**. In other words, when using just one test set to assess the accuracy of a model, we risk **"overfitting"** our model selection process to **this specific test set.** A better strategy for model selection may thus be to use cross-validation, in which the data is split into training and test sets several times and the scores are averaged across the splits.
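As a minimal sketch of the idea, the snippet below scores a single ridge classifier on five different train-test splits using scikit-learn's `cross_val_score`. It uses synthetic data from `make_classification` as a stand-in for the connectomes, so the numbers it prints say nothing about the AOMIC data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the AOMIC connectomes)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# 5 different train-test splits instead of a single one
cv = KFold(n_splits=5, shuffle=True, random_state=100)
fold_scores = cross_val_score(
    RidgeClassifier(alpha=1.0), X_toy, y_toy, cv=cv, scoring="accuracy"
)

# One accuracy per split, plus their mean and spread
print(fold_scores)
print(fold_scores.mean(), fold_scores.std())
```

The spread across folds gives a feeling for how much a score from a single split can vary by chance.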

Scikit-learn allows us to perform **model selection using a cross-validated gridsearch** using the **GridSearchCV** object. Let's import it:

In [None]:
from sklearn.model_selection import GridSearchCV

Let's check out the documentation:

In [None]:
?GridSearchCV

We can see many parameters, but there are 4 which we care about predominantly:

1. **"estimator"** -> our sklearn estimator object (i.e. the model class)
2. **"param_grid"** -> the grid of hyperparameters to search
3. **"scoring"** -> which scoring metric to use
4. **"cv"** -> the cross-validation scheme to use
   

Let's for the moment go again with the example of the ridge classifier, for which we want to tune the alpha value. We can define the estimator as:

In [None]:
ridge = RidgeClassifier()

The **param_grid** parameter typically is handed over as a **dictionary** in which the **keys** consist of the names of the parameters that are to be set for the estimator, and the **values** of the dictionary each yield an **iterable** (for example a **list**) of **possible candidate values** for each parameter.

For example, for our ridge classifier, we can define a very simple grid, that only searches the value for one parameter (the alpha parameter) as follows:

In [None]:
param_grid_ridge = {
    "alpha": [0.001, 0.01, 0.1, 1, 2, 10, 50, 100]
}
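To see which candidates a grid actually expands to, scikit-learn's `ParameterGrid` can enumerate them. The sketch below adds a second parameter (`fit_intercept`, chosen here purely for illustration) to show that the candidates are all combinations of the listed values.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid with two parameters of the ridge classifier
toy_grid = {
    "alpha": [0.001, 0.01, 0.1, 1, 2, 10, 50, 100],
    "fit_intercept": [True, False],
}

# Every combination of the listed values is one model candidate
candidates = list(ParameterGrid(toy_grid))
print(len(candidates))  # 8 alpha values x 2 fit_intercept values = 16
print(candidates[0])
```

The number of candidates grows multiplicatively with each added parameter, which is worth keeping in mind for computation time.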

The scoring parameter defines which scoring metric we care about, i.e. which scoring metric should be optimised by the grid search. It can be a string if the metric is already in-built in sklearn. For simplicity we will use "accuracy" here.
To see what other metrics are available check out this: https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
scoring = "accuracy"

Lastly, the "cv" parameter can be any scikit-learn compatible cross-validation scheme. Here we will use a simple 5-fold cross-validation. We should also make sure that the KFold cv shuffles the data, but with a specific random state, so that the results are reproducible:

In [None]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=100)
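To illustrate what such a scheme does, this sketch prints the train and test indices that a shuffled 5-fold split produces for ten toy samples. The toy array and the name `kfold_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features
kfold_toy = KFold(n_splits=5, shuffle=True, random_state=100)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(kfold_toy.split(X_toy)):
    # Each fold uses 8 samples for training and 2 for testing
    print(fold, "train:", train_idx, "test:", test_idx)
    test_indices.extend(int(i) for i in test_idx)

# Every sample is used for testing exactly once across the 5 folds
print(sorted(test_indices) == list(range(10)))  # True
```

Because the random state is fixed, running the cell again yields the same folds.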

We can then initialise the GridSearchCV object, and fit it like any other scikit-learn estimator:

In [None]:
gridsearchcv = GridSearchCV(
    estimator=ridge,
    param_grid=param_grid_ridge,
    scoring=scoring,
    cv=kfold
)

gridsearchcv.fit(X_model_selection, y_model_selection)

The GridSearchCV has an **attribute** called **cv_results_** which we can access as follows:

In [None]:
gridsearchcv.cv_results_

As you can see it is a dictionary with quite a lot of stuff, and somewhat difficult to read. However, it can be easily converted into a pandas dataframe for easier inspection: 

In [None]:
cv_results = pd.DataFrame(gridsearchcv.cv_results_)
cv_results

We can see results for each of our 8 model candidates (remember, we used 8 alpha values to define our grid). That is, each row represents the results for 1 model candidate (for which you can see the parameters in the **"params"** column). Perhaps most interesting are the **"mean_test_score"** and **"std_test_score"** columns, which show us the mean accuracy and its standard deviation across the different train-test splits.

In [None]:
cv_results[["params", "mean_test_score", "std_test_score"]]
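The results table also contains a **"rank_test_score"** column, which can be used to sort the candidates from best to worst. The sketch below demonstrates this on a small grid search over synthetic data, so the scores it prints are not related to the AOMIC results. All `toy_` names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (not the AOMIC connectomes)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

toy_search = GridSearchCV(
    estimator=RidgeClassifier(),
    param_grid={"alpha": [0.01, 1, 100]},
    scoring="accuracy",
    cv=KFold(n_splits=5, shuffle=True, random_state=100),
)
toy_search.fit(X_toy, y_toy)

toy_results = pd.DataFrame(toy_search.cv_results_)

# rank_test_score is 1 for the candidate with the highest mean_test_score
ranked = toy_results.sort_values("rank_test_score")[
    ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(ranked)
```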

Conveniently, since the GridSearchCV has the **"refit"** parameter set to True by default, it also selects the best scoring model and refits it on all of the data we gave it, so that we can now use it directly for further testing. We can check the best model and its parameters as follows:

In [None]:
gridsearchcv.best_estimator_

In [None]:
gridsearchcv.best_params_

Obviously, it's a RidgeClassifier, but we can also see that the alpha value selected for the best model corresponds to the alpha value with the highest score in the **cv_results** table. Let us now evaluate this best model on the final holdout set:

In [None]:
holdout_predictions_gscv = gridsearchcv.predict(X_holdout)

In [None]:
accuracy_score(y_holdout, holdout_predictions_gscv)

# Exercises

1. Do a similar grid search for other model families of your choice. Can you search grids with multiple hyperparameters (rather than just the one alpha parameter that we used in the example)?


# Additional Reading or Resources


The scikit-learn user guide has some good explanations on these topics. For example, you can look at:

* [computing cross-validated metrics](https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics)
* [Tuning hyper-parameters using grid search](https://scikit-learn.org/stable/modules/grid_search.html)

# Sklearn's pipeline

Since we often want to apply preprocessing steps before fitting our models, and we want to do so in a cross-validation consistent way, so that we avoid data leakage and over-optimistic accuracy estimates, we need an easy way to chain different pipeline steps. The scikit-learn **pipeline** object provides an easy way to do just that.

Check out its documentation below:

In [None]:
from sklearn.pipeline import Pipeline
?Pipeline

The main parameter we care about is the **"steps"** parameter. As the name (and the description) suggests, it is a list of the steps that we want to apply in our pipeline. Let's imagine for example, that we want to make a pipeline in which we first perform a **principal component analysis (PCA)** to extract 5 components, and then fit a Logistic Regression on only those 5 components. This may be quite complicated to implement in code especially if we also want to do some hyperparameter tuning in a cross-validated grid search, but combining the two steps in a pipeline makes it much more convenient.

First let's prepare the PCA and the LogisticRegression:

In [None]:
from sklearn.decomposition import PCA

pca_5comps = PCA(n_components=5)
log_reg = LogisticRegression(max_iter=1000)

Now, the steps parameter of the pipeline actually takes a list of tuples. That is, each step of the pipeline is a tuple, that indicates the name of the step, and hands over the actual scikit-learn compatible object. Importantly, the last step must always be an estimator:

In [None]:
steps = [("pca_5comps", pca_5comps), ("logistic_regression", log_reg)]

We then simply hand this over to the pipeline, which we can then itself use like an estimator:

In [None]:
pipeline = Pipeline(steps=steps)
pipeline.fit(X_model_selection, y_model_selection)

In [None]:
pipeline_predictions = pipeline.predict(X_holdout)
pipeline_score = accuracy_score(y_holdout, pipeline_predictions)
pipeline_score

As you can see, accuracy is quite bad compared to what we achieved without the PCA. This is not really surprising since we are reducing the information from a few thousand features into 5 features only. Luckily we can perform cross-validated hyperparameter tuning now using GridSearchCV to find the "optimal" number of components. However, to define the **param_grid** we now need to specify the name of the pipeline step and the name of the parameter, separated by a double underscore such as:

**"stepname__parametername"**
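If you are unsure which names are available, the pipeline's `get_params()` method lists them. The following sketch builds a small pipeline (with step names chosen for illustration) and shows that nested parameters follow this naming scheme and can also be set with `set_params`.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

toy_pipeline = Pipeline(
    steps=[("pca", PCA()), ("logistic_regression", LogisticRegression(max_iter=1000))]
)

# All parameter names, including the nested "stepname__parametername" ones
param_names = sorted(toy_pipeline.get_params().keys())
print([name for name in param_names if name.startswith("pca__")])

# The same naming scheme works with set_params
toy_pipeline.set_params(pca__n_components=10)
print(toy_pipeline.named_steps["pca"].n_components)  # 10
```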

We also want to compare the pipeline with PCA to a pipeline where PCA is not applied.
We can do this by adding a second param_grid with the "pca": "passthrough" key-value pair. That is, we pass two param_grids in a list.

Check out the example below and you can see it is really quite simple:

In [None]:
steps = [("pca", PCA()), ("logistic_regression", LogisticRegression(max_iter=1000))]
pipeline = Pipeline(steps=steps)
gridsearch_pipeline = GridSearchCV(
    estimator=pipeline,
    param_grid=[
        {"pca__n_components": [5, 10, 50, 100, 250]},
        {"pca": ["passthrough"]},
    ],
    cv=KFold(n_splits=5, shuffle=True, random_state=100),
    scoring="accuracy",
)
gridsearch_pipeline.fit(X_model_selection, y_model_selection)
cv_results = pd.DataFrame(gridsearch_pipeline.cv_results_)
cv_results

As you can see, the GridSearchCV results again display the scores for each candidate, i.e. for each number of components we extracted in a PCA as well as for the pipeline without PCA. Again we can evaluate the best model on our holdout dataset:

In [None]:
gridsearch_pipeline.best_estimator_

In [None]:
holdout_predictions = gridsearch_pipeline.predict(X_holdout)
pipeline_score = accuracy_score(y_holdout, holdout_predictions)
pipeline_score

# Exercise:

1. One very popular method of preprocessing where data leakage can happen quite easily is feature selection. Take one of the feature selection methods [in-built in scikit-learn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection) and add it to a pipeline with a classifier of your choice. Can you do hyperparameter tuning for the feature selection process and the estimator simultaneously?
Hint: check out the scikit-learn user guide to see [how you can use an F-test to select relevant features here](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection).

# Bonus Material: How to perform a cross-validated grid search for both **model family** and **hyperparameters** simultaneously?

Since we usually want to select the best model for a given problem from a set of different model families and their associated hyperparameters, you may now wonder how to do this with the GridSearchCV. We can see easily how it is done with one estimator, but it is not easy to see how to do this with a set of different estimators. This is another use case where the scikit-learn pipeline object can come in quite handy!

We can initialise a pipeline with an estimator as a step, and then replace **this estimator** in the **pipeline** with **other types of estimators** using different parameter grids in our **GridSearchCV**, very similar to the way in which we tested the pipeline with and without PCA in the previous example:

In [None]:
# We only define one step, we care only about the classifier and its hyperparameters.
# We arbitrarily initialise the pipeline with ridge:
pipeline = Pipeline(steps=[("classifier", RidgeClassifier())])

# parameters for the ridge classifier
ridge_params = {
    "classifier": [RidgeClassifier()], # parameters have to be handed over as iterables!
    "classifier__alpha": [0.001, 0.01, 0.1, 1, 2, 10, 50, 100, 200, 500, 1000],
}

# parameters for the support vector classifier:
svc_params = {
    "classifier": [SVC()],
    "classifier__C": [0.001, 0.01, 0.1, 1, 2, 10],
    "classifier__kernel": ["linear", "rbf"],
}

# parameters for the random forest classifier:
rf_params = {
    "classifier": [RandomForestClassifier()],
    "classifier__n_estimators": [10, 50, 100], 
}

After defining the estimators, the pipeline, and the parameter grids we can put it all together as follows and run the search. This may take a few minutes:

In [None]:
gridsearch_cv = GridSearchCV(
    estimator=pipeline,
    param_grid=[ridge_params, svc_params, rf_params],
    scoring="accuracy",
    cv=kfold,
    n_jobs=-1  # to speed up computation, '-1' means it will use all available CPUs
)
gridsearch_cv.fit(X_model_selection, y_model_selection)
pd.DataFrame(gridsearch_cv.cv_results_)

Let's evaluate again the best model on the holdout data:

In [None]:
gscv_predictions = gridsearch_cv.predict(X_holdout)
accuracy_score(y_holdout, gscv_predictions)
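To find out which model family won such a search, one can look at the "classifier" entry of `best_params_`. The sketch below shows this on a deliberately small search over synthetic data with only two families, so it runs quickly. The data, the reduced grids, and the `toy_` names are assumptions for illustration, and the outcome says nothing about the AOMIC data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (not the AOMIC connectomes)
X_toy, y_toy = make_classification(n_samples=150, n_features=10, random_state=0)

toy_pipeline = Pipeline(steps=[("classifier", RidgeClassifier())])
toy_search = GridSearchCV(
    estimator=toy_pipeline,
    param_grid=[
        {"classifier": [RidgeClassifier()], "classifier__alpha": [0.1, 10]},
        {"classifier": [SVC()], "classifier__C": [0.1, 10]},
    ],
    scoring="accuracy",
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# The winning model family is stored under the "classifier" key
best_classifier = toy_search.best_params_["classifier"]
print(type(best_classifier).__name__)
print(toy_search.best_params_)
```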