# Machine learning Part 1: Supervised learning

## Exercise 1: Training and testing a model

### Load data

We will use data from the Autism Brain Imaging Data Exchange (ABIDE) dataset. 
The data has already been downloaded and preprocessed into two TSV tables:
- [`participants_nbsub-100.tsv`](../../data/participants_nbsub-100.tsv)
    - Phenotypic information: participant age, sex, scan site, diagnosis, etc.
- [`abide_nbsub-100_atlas-ho_meas-correlation_relmat.tsv`](../../data/abide_nbsub-100_atlas-ho_meas-correlation_relmat.tsv)
    - Flattened functional connectivity matrices

The script used to create these files can be found [here](../../data/build_datasets.py).
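
To make "flattened" concrete: a correlation matrix is symmetric, so one common way to flatten it is to keep only the entries above the diagonal. The sketch below illustrates this on a toy 4-region example with random data. It assumes the diagonal is discarded, which may not match what `build_datasets.py` actually does, and the names `timeseries`, `corr`, and `flat` are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 50 time points for 4 regions (hypothetical, not ABIDE data)
n_regions = 4
timeseries = rng.normal(size=(50, n_regions))

# region-by-region correlation matrix (symmetric, ones on the diagonal)
corr = np.corrcoef(timeseries, rowvar=False)

# keep only the entries strictly above the diagonal
# (assumption: the diagonal is dropped because it is always 1)
upper_idx = np.triu_indices(n_regions, k=1)
flat = corr[upper_idx]

print(corr.shape)  # (4, 4)
print(flat.shape)  # (6,), i.e. n_regions * (n_regions - 1) / 2
```

Under this assumption, each participant's matrix becomes one row of features, which is the shape a scikit-learn model expects.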

Let's load and display the phenotypic data:

In [2]:
import pandas as pd

data_dir = "../../data"

# phenotypic data (including diagnosis group DX_GROUP)
pheno_data_tsv = f"{data_dir}/participants_nbsub-100.tsv"
pheno_df = pd.read_csv(pheno_data_tsv, sep="\t", index_col=0)
pheno_df

SUB_ID,X,subject,SITE_ID,FILE_ID,DX_GROUP,DSM_IV_TR,AGE_AT_SCAN,SEX,HANDEDNESS_CATEGORY,HANDEDNESS_SCORES,...,qc_notes_rater_1,qc_anat_rater_2,qc_anat_notes_rater_2,qc_func_rater_2,qc_func_notes_rater_2,qc_anat_rater_3,qc_anat_notes_rater_3,qc_func_rater_3,qc_func_notes_rater_3,SUB_IN_SMP
50003,2,50003,PITT,Pitt_0050003,1,1,24.45,1,R,,...,,OK,,OK,,OK,,OK,,1
50004,3,50004,PITT,Pitt_0050004,1,1,19.09,1,R,,...,,OK,,OK,,OK,,OK,,1
50005,4,50005,PITT,Pitt_0050005,1,1,13.73,2,R,,...,,OK,,maybe,ic-parietal-cerebellum,OK,,OK,,0
50006,5,50006,PITT,Pitt_0050006,1,1,13.37,1,L,,...,,OK,,maybe,ic-parietal slight,OK,,OK,,1
50007,6,50007,PITT,Pitt_0050007,1,1,17.78,1,R,,...,,OK,,maybe,ic-cerebellum_temporal_lob,OK,,OK,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50162,112,50162,OHSU,OHSU_0050162,2,-9999,8.94,1,R,,...,,OK,,OK,,OK,,OK,,1
50163,113,50163,OHSU,OHSU_0050163,2,-9999,9.40,1,R,,...,,OK,,OK,,OK,,OK,,1
50164,114,50164,OHSU,OHSU_0050164,2,-9999,8.86,1,R,,...,,OK,,OK,,OK,,OK,,1
50167,117,50167,OHSU,OHSU_0050167,2,-9999,10.08,1,R,,...,,OK,,OK,,OK,,OK,,1


And now the brain data:

In [3]:
# brain data: functional connectivity matrices (flattened)
brain_data_tsv = f"{data_dir}/abide_nbsub-100_atlas-ho_meas-correlation_relmat.tsv"
brain_df = pd.read_csv(brain_data_tsv, sep="\t", index_col=0)
brain_df

SUB_ID,0,1,2,3,4,5,6,7,8,9,...,6095,6096,6097,6098,6099,6100,6101,6102,6103,6104
50003,0.616131,0.631333,0.536934,0.579913,0.486430,0.674588,0.419927,0.320891,0.567708,0.482514,...,0.528068,0.501324,0.492328,0.383765,0.433528,0.445759,0.432495,0.563743,0.538968,0.794665
50004,0.469488,0.555710,0.382993,0.438907,0.351902,0.460364,0.418160,0.222128,0.303198,0.225140,...,0.169667,0.274514,0.240135,0.147265,0.130083,0.167236,0.173157,0.339317,0.085935,0.578523
50005,0.477262,0.444849,0.406490,0.430605,0.333766,0.652202,0.675285,0.316591,0.351914,0.333831,...,0.168143,0.232346,0.481708,0.282241,0.324252,0.341667,0.381824,0.527614,0.515012,0.829215
50006,0.507016,0.661879,0.520613,0.585333,0.375889,0.625316,0.341862,0.102353,0.270784,0.290639,...,0.274579,0.275056,0.169083,0.294372,0.409430,0.410919,0.377898,0.496621,0.187485,0.810404
50007,0.618285,0.753630,0.629141,0.643313,0.468474,0.716065,0.479454,0.432951,0.470434,0.449635,...,0.358740,0.378375,0.390896,0.392232,0.381120,0.428890,0.525672,0.713384,0.249400,0.883117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50162,0.606567,0.765567,0.623567,0.816146,0.628667,0.793389,0.681872,0.350646,0.669347,0.632564,...,0.568262,0.129996,0.198904,0.522232,0.562450,0.603381,0.657473,0.672187,0.482378,0.882056
50163,-0.107793,-0.020293,0.368421,0.329429,0.006038,0.372814,0.305910,0.102373,0.346785,0.193353,...,0.212725,0.407415,0.192493,0.493806,0.209915,0.502917,0.556228,0.501843,0.226621,0.633680
50164,0.283339,0.445784,0.269522,0.557545,0.122739,0.643369,0.427661,0.337729,0.128677,0.166567,...,0.229128,0.388615,0.383393,0.283484,0.147303,0.361896,0.283394,0.523763,0.442114,0.786180
50167,0.441499,-0.161474,0.076545,0.333590,0.299839,0.398375,0.119602,-0.248906,0.219889,0.206871,...,0.436012,0.444572,0.524950,0.387880,0.494431,0.367273,0.473550,0.375159,0.129914,0.367245


### Try predicting diagnosis

Now we will try to predict diagnosis using the functional connectivity measures.

First, let's write a helper function that takes as input a matrix `X`, a vector `y` and an `sklearn` model `model`. The function should:
1. Split the data into train and test sets
2. Fit the model on the train data
3. Compute and print model performance on the train and test sets
4. Return the fitted classifier model

Write/modify the code in the cell below where it says `TODO`:

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split


def train_and_test_model(
    model,
    X: pd.DataFrame,
    y: pd.Series,
    test_subset_fraction=0.2,
    shuffle=True,
    do_stratify=True,
    random_state=123,
):
    """Train and test a scikit-learn model.

    Parameters
    ----------
    model :
        The scikit-learn model to be trained and tested
    X : pd.DataFrame
        Input features
    y : pd.Series
        Output labels
    test_subset_fraction : float, optional
        Fraction of the dataset to use for testing, by default 0.2
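    shuffle : bool, optional
        Whether to shuffle the data before splitting, by default True
    do_stratify : bool, optional
        Whether to keep class proportions similar in the train and test
        sets (requires shuffle=True), by default True
    random_state : int, optional
        Seed to force the shuffle to be the same each time, by default 123

    Returns
    -------
    model
        The fitted model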
    """
    if do_stratify:
        # ensure similar distribution of classes in train and test sets
        stratify = y
    else:
        stratify = None

    # TODO divide the data into train/test sets
    # Hint: use sklearn's train_test_split function here
    #       make sure to use test_subset_fraction, shuffle, stratify, and random_state
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=test_subset_fraction,
        shuffle=shuffle,
        stratify=stratify,
        random_state=random_state,
    )

    # TODO fit the model
    model.fit(X_train, y_train)

    # TODO compute the train and test accuracies
    acc_train = model.score(X_train, y_train)
    acc_test = model.score(X_test, y_test)

    # print accuracies
    print(f"Train accuracy: {acc_train:.3f}")
    print(f"Test accuracy:  {acc_test:.3f}")

    return model

Now use this function to test a logistic regression model ([`sklearn.linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)):

In [None]:
from sklearn.linear_model import LogisticRegression

X = brain_df
y = pheno_df["DX_GROUP"]  # diagnosis

# TODO define your model
model = LogisticRegression()

# TODO call train_and_test_model() with the appropriate arguments
train_and_test_model(model, X, y)

Train accuracy: 1.000
Test accuracy:  0.550


#### Questions

What do the train and test accuracies tell us? 
- The model overfits and does not generalize well on new data
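
One way to see why a perfect train accuracy is not reassuring here: with far more features than participants, a linear model can fit labels that carry no signal at all. The sketch below uses purely synthetic noise (no ABIDE data), and the names `X_noise`, `y_noise`, and `noise_model` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# synthetic noise: 100 "participants", 2000 features
X_noise = rng.normal(size=(100, 2000))
# balanced labels that are unrelated to the features
y_noise = np.tile([0, 1], 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_noise, y_noise, test_size=0.2, stratify=y_noise, random_state=123
)

# fit on noise: there is nothing real to learn
noise_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc_train = noise_model.score(X_train, y_train)
acc_test = noise_model.score(X_test, y_test)

print(f"Train accuracy: {acc_train:.3f}")
print(f"Test accuracy:  {acc_test:.3f}")
```

A large gap between the two numbers on data like this suggests that the train accuracy mostly reflects the model's capacity to memorize, not its ability to generalize.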

What can we try to improve test performance?

### Try predicting scan site instead of diagnosis

Let's try predicting a different variable instead: scan site.

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

X = brain_df  # same input data
y = pheno_df["SITE_ID"]  # scan site
y = pd.Series(LabelEncoder().fit_transform(y))  # encode sites as integers

# TODO define your model
model = LogisticRegression()

# TODO call train_and_test_model() with the appropriate arguments
train_and_test_model(model, X, y)

Train accuracy: 1.000
Test accuracy:  0.800


#### Questions

What do these performance metrics tell us?

How do we know if this is a good model? What is the chance performance?
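
As a rough sketch of how chance performance could be estimated, scikit-learn's `DummyClassifier` ignores the features and predicts from the label distribution alone. The labels below are hypothetical site counts chosen for illustration, not the real `SITE_ID` distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# toy labels for 3 imbalanced "sites" (hypothetical counts)
y_sites = np.array(["A"] * 50 + ["B"] * 30 + ["C"] * 20)
X_dummy = np.zeros((len(y_sites), 1))  # features are ignored by DummyClassifier

# baseline 1: always predict the most frequent class
majority = DummyClassifier(strategy="most_frequent")
majority_scores = cross_val_score(majority, X_dummy, y_sites, cv=5)

# baseline 2: uniform guessing has expected accuracy 1 / n_classes
uniform_chance = 1 / len(np.unique(y_sites))

print(f"Majority-class baseline: {majority_scores.mean():.3f}")
print(f"Uniform-guess chance:    {uniform_chance:.3f}")
```

With imbalanced classes the majority-class baseline is higher than `1 / n_classes`, so it is usually the more honest number to compare a model against.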

## Exercise 2: Model selection and cross-validation

We can get a more stable estimate of a model's generalization ability by using cross-validation instead of a single test set.
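
Before writing the full function, it may help to see what a stratified splitter produces. The sketch below runs `StratifiedKFold` on toy labels (60 samples of one class, 40 of the other, invented for illustration) and prints the size and class balance of each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels: 60 samples of class 0 and 40 of class 1
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

fold_fractions = []
for i_fold, (train_index, test_index) in enumerate(cv_demo.split(X_demo, y_demo)):
    # fraction of class-1 samples in this test fold
    frac_class1 = y_demo[test_index].mean()
    fold_fractions.append(frac_class1)
    print(
        f"Fold {i_fold}: {len(train_index)} train, {len(test_index)} test, "
        f"class-1 fraction in test = {frac_class1:.2f}"
    )
```

Every sample lands in exactly one test fold, and each fold keeps roughly the same class proportions as the full dataset.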

Update/complete the `cross_validate_model` function below to use the [`StratifiedKFold` class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) to create cross-validation splits for the dataset, then train and test models separately for each split.

In [11]:
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold


def cross_validate_model(
    model,
    X: pd.DataFrame,
    y: pd.Series,
    n_splits=5,
    shuffle=True,
    random_state=123,
):
    """Train and test a scikit-learn model with cross-validation.

    Parameters
    ----------
    model :
        The scikit-learn model to be trained and tested
    X : pd.DataFrame
        Input features
    y : pd.Series
        Output labels
    n_splits : int, optional
        Number of folds for cross-validation, by default 5
    shuffle : bool, optional
        Whether to shuffle the data before splitting, by default True
    random_state : int, optional
        Seed to force the shuffle to be the same each time, by default 123

    Returns
    -------
    list
        The fitted models
    """
    # create lists to store results for each fold
    fitted_models = []
    accs_train = []
    accs_test = []

    # TODO create a StratifiedKFold cross-validator object
    # make sure to use the n_splits, shuffle, and random_state variables
    cv = StratifiedKFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)

    # TODO iterate over the folds
    # Hint: look at the documentation of StratifiedKFold
    for train_index, test_index in cv.split(X, y):

        # TODO get the train and test sets using the indices
        # hint: X and y are pandas objects, you can use .iloc[] on them
        X_train = X.iloc[train_index]
        X_test = X.iloc[test_index]
        y_train = y.iloc[train_index]
        y_test = y.iloc[test_index]

        # get a fresh (unfitted) model
        model = clone(model)

        # TODO fit the model
        model.fit(X_train, y_train)

        # TODO get the train and test accuracies
        acc_train = model.score(X_train, y_train)
        acc_test = model.score(X_test, y_test)

        # append results
        fitted_models.append(model)
        accs_train.append(acc_train)
        accs_test.append(acc_test)

    # report the mean and standard deviation of the accuracies across folds
    accs_train = np.array(accs_train)
    accs_test = np.array(accs_test)
    print(f"Train accuracy: {accs_train.mean():.3f} ± {accs_train.std():.3f}")
    print(f"Test accuracy:  {accs_test.mean():.3f} ± {accs_test.std():.3f}")

    return fitted_models

We can now use our new cross-validation function to predict scan site again, but let's switch it up a bit and use Support Vector Machines (implemented as `sklearn.svm.SVC`) instead of logistic regression.

In [None]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder

X = brain_df
y = pheno_df["SITE_ID"]  # scanning site
y = pd.Series(LabelEncoder().fit_transform(y))  # encode as integers

# TODO define your model
model = SVC()

# TODO call cross_validate_model() with the appropriate parameters
cross_validate_model(model, X, y)

Train accuracy: 0.965 ± 0.009
Test accuracy:  0.760 ± 0.073


[SVC(), SVC(), SVC(), SVC(), SVC()]

You may have noticed that the models we are using accept arguments when they are instantiated. For example, `SVC` has a regularization **hyperparameter** called `C`. Changing the value of `C` may change the model's performance, but tuning it by trial and error against the test scores leads to implicit data leakage (can you think of why?).

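Scikit-learn's `GridSearchCV` class lets us specify a grid of hyperparameter values to search over when fitting the model. The search only uses the data passed to `fit` (here, the training data), which it splits further by cross-validation, so the test data is never used during that process. Let's use `GridSearchCV` to test several values of `C`: `[0.01, 0.1, 1, 10, 100]`.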

Note: what `GridSearchCV` does is called **inner** cross-validation, which should not be confused with the cross-validation done in `cross_validate_model`. Inner cross-validation is used to choose the best set of hyperparameters for a model, while outer cross-validation is used for getting a measure of model generalizability on new data.
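
The inner/outer distinction can be sketched compactly with scikit-learn's own helpers. This is a minimal illustration on synthetic data from `make_classification` (not ABIDE), and it uses `cross_val_score` for the outer loop in place of `cross_validate_model`. The fold counts and seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# synthetic stand-in data (not ABIDE)
X_syn, y_syn = make_classification(n_samples=100, n_features=20, random_state=0)

# inner loop: choose C using only the training portion of each outer fold
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
tuned_svc = GridSearchCV(
    SVC(), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=inner_cv
)

# outer loop: estimate how well the whole "tune, then fit" procedure generalizes
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
outer_scores = cross_val_score(tuned_svc, X_syn, y_syn, cv=outer_cv)

print(f"Outer test accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

Each outer test fold is held out from both the hyperparameter search and the final fit, which is what keeps the outer score an honest estimate.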

In [None]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV

X = brain_df
y = pheno_df["SITE_ID"]  # scanning site
y = pd.Series(LabelEncoder().fit_transform(y))  # encode as integers

# TODO: define your model
gridsearch_model = GridSearchCV(SVC(), param_grid={"C": [0.01, 0.1, 1, 10, 100]})

# TODO call cross_validate_model() with the appropriate parameters
models = cross_validate_model(gridsearch_model, X, y)

Train accuracy: 1.000 ± 0.000
Test accuracy:  0.870 ± 0.093


Print the `C` that was chosen for each CV fold. Is it the same for all folds?

Hint: What does the `cross_validate_model` function return?

Hint: Look at the [`GridSearchCV` documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to see how to access the best estimator for each fold.

Hint: The documentation for the [`SVC` class](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) may also be useful.

In [21]:
for i_model, gridsearch_model in enumerate(models):
    svc_model = gridsearch_model.best_estimator_
    print(f"Best C for model at index {i_model}: {svc_model.C}")

Best C for model at index 0: 10
Best C for model at index 1: 10
Best C for model at index 2: 10
Best C for model at index 3: 10
Best C for model at index 4: 10


### Question

How should we go about comparing different models (e.g. `LogisticRegression` vs `SVC`)?
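
One possible approach, offered as a sketch and not as the only answer, is to evaluate every candidate on exactly the same cross-validation folds so that differences in score are not caused by differences in the splits. The example below uses synthetic data, and the `candidates` dictionary is an illustrative construct.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# synthetic stand-in data (not ABIDE)
X_syn, y_syn = make_classification(n_samples=100, n_features=20, random_state=0)

# a fixed random_state means both models are scored on identical folds
cv_shared = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
}
scores = {
    name: cross_val_score(estimator, X_syn, y_syn, cv=cv_shared)
    for name, estimator in candidates.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")

# per-fold differences show whether one model wins consistently
diff = scores["LogisticRegression"] - scores["SVC"]
print(f"Per-fold difference (LogisticRegression - SVC): {diff}")
```

If the candidates have hyperparameters to tune, each one would be wrapped in `GridSearchCV` first so that tuning stays inside the training folds. With only a handful of folds, small differences in mean accuracy should be interpreted with caution.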