# Assignment 9: Multivariate and Machine Learning Analysis for Intracranial EEG Data
Please submit this assignment to Canvas as a Jupyter notebook (.ipynb). In this assignment you will use machine learning techniques to classify memory states.

In [2]:
# imports
import numpy as np
import pandas as pd
import cmlreaders as cml
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import KFold
from scipy.stats import ttest_ind

Compare and contrast the performances of classifiers with different penalization schemes such as L1, L2, and elastic net.

## Assignment Overview
This project is designed to familiarize you with multivariate analysis of intracranial EEG data. For each subject, you will train a logistic-regression classifier to discriminate subsequently recalled vs non-recalled studied items using the distribution of spectral power across electrodes as the features. After completing the assignment you should be able to do the following:
* Fit an L2-penalized logistic regression classifier to intracranial electrophysiological recordings. 
* Construct a receiver operating characteristic (ROC) curve and compute area under the curve (AUC) to assess classifier performance. Compare train and test performance.
* Optimize model penalization parameters using nested cross-validation, specifically focusing on L2 penalization.

Use data from the following 20 FR1/catFR1 subjects in the intracranial EEG (iEEG) dataset.

In [3]:
subs = ['R1380D', 'R1111M', 'R1332M', 'R1377M',
        'R1065J', 'R1385E', 'R1189M', 'R1108J', 'R1390M', 
        'R1236J', 'R1391T', 'R1401J', 'R1361C', 'R1060M', 
        'R1350D', 'R1378T', 'R1375C', 'R1383J', 'R1354E', 
        'R1292E']

For each of these subjects, use the following processing steps:
* Load EEG with CMLReader.load_eeg from a bipolar montage loaded using CMLReader.load('pairs').
* Apply a Butterworth notch filter around 60 Hz (freqs = [58, 62]) when extracting the voltage.
* Calculate power at the frequencies specified in Question 1 below with a Morlet wavelet with wavenumber (keyword “width”) of 6 for each encoding event (from time 0 until 1.6 seconds after the encoding event onset) using a 1 second buffer.
* For each frequency, channel, and encoding event, average the power over the entire 1600 ms encoding period (but not over the buffer period!)
* Log-transform the average encoding power values as in the final step of the previous problem.
* In some cases you may notice artifacts in the data that manifest in power values of zero. These would produce problems in the transformation and classification, so please exclude any events with this issue from all analyses.
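The last three processing steps (dropping the buffer, averaging over the encoding window, log-transforming, and excluding zero-power events) can be sketched on a plain NumPy array. This is only a minimal illustration: the array layout `(n_freqs, n_events, n_channels, n_samples)`, the function name `mean_log_power`, and the choice of `np.log10` are assumptions of the sketch, so use whichever log transform you used in the previous problem and adapt the axes to the output of your own wavelet step.

```python
import numpy as np

def mean_log_power(power, sample_rate, buffer_s=1.0):
    """Average power over the encoding window, then log-transform it.

    power: array shaped (n_freqs, n_events, n_channels, n_samples); the time
           axis is assumed to still contain `buffer_s` seconds of buffer on
           each side of the encoding window.
    Returns (log_power, keep): log_power is (n_kept_events, n_freqs,
    n_channels); keep is a boolean mask over the original events.
    """
    n_buffer = int(round(buffer_s * sample_rate))
    # Drop the buffer samples on both ends before averaging over time.
    trimmed = power[..., n_buffer:power.shape[-1] - n_buffer]
    mean_power = trimmed.mean(axis=-1)             # (freqs, events, channels)
    mean_power = np.moveaxis(mean_power, 1, 0)     # (events, freqs, channels)
    # Keep only events whose mean power is finite and strictly positive
    # for every frequency and channel.
    keep = np.all(np.isfinite(mean_power) & (mean_power > 0), axis=(1, 2))
    return np.log10(mean_power[keep]), keep

# Synthetic example (not real data): 8 freqs, 5 events, 3 channels,
# 100 Hz sampling, 1 s buffer + 1.6 s window + 1 s buffer = 360 samples.
rng = np.random.default_rng(0)
demo_power = rng.uniform(0.5, 2.0, size=(8, 5, 3, 360))
demo_power[:, 2, 1, :] = 0.0                       # simulate an artifact in event 2
log_power, keep = mean_log_power(demo_power, sample_rate=100)
```

The returned mask `keep` should also be applied to the event table so that the labels stay aligned with the features.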

## Generating Features, Cross-Validation, ROCs and AUCs
For the first part of this assignment, you will train an L2-penalized logistic regression classifier on the time-frequency (TF) data obtained during item encoding for every subject. Throughout this assignment, unless otherwise specified, we will use the default parameters for the *LogisticRegression* classifier in sklearn.

## Question 1
1) For each subject, create an $X_{N \times p}$ matrix of spectral power patterns ($N$ = number of encoding events concatenated across sessions; $p$ = number of frequencies × number of channels) and obtain the $y_{N \times 1}$ vector of labels (1: recalled, 0: non-recalled). The pair $(X, y)$ will be our dataset.
* For your input features, extract spectral power with Morlet wavelets at 8 frequencies logarithmically spaced between 3 and 180 Hz (np.logspace(np.log10(3), np.log10(180),8)) for each recorded electrode pair (“channel”). For each of these frequencies, average the power over the 1600 ms word encoding period.
* Z-score the features across observations (i.e., events) within each session. Since we're performing leave-one-session-out cross-validation in this assignment, z-scoring within-session prevents information from leaking between train sessions and test sessions through the z-score statistics.
* Some subjects will have different sets of electrodes for different recording sessions. For these subjects, drop sessions so that you keep the largest possible set of sessions that all share the same recording contacts (there could be several groups of sessions for the same subject, each with a different electrode set). The reasons a subject might have different active recording electrodes across sessions are:
    * some subjects have so many electrodes implanted that not all of them could be recorded from simultaneously; these subjects will sometimes then have different "montages" in which different electrodes are connected or disconnected, or
    * some subjects have multiple implant surgeries, with different electrodes being in place after each operation, meaning again that the same subject will have different sets of electrodes. 
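The session selection and within-session z-scoring described above can be sketched as follows. The helper names `largest_matching_sessions` and `zscore_within_session`, and the dictionary of channel labels per session, are hypothetical; in practice the labels would come from each session's pairs table. The data here are synthetic.

```python
import numpy as np
from collections import defaultdict

def largest_matching_sessions(channels_by_session):
    """Return the sessions in the largest group that shares one electrode set.

    channels_by_session: dict mapping session id -> iterable of channel labels.
    """
    groups = defaultdict(list)
    for session, labels in channels_by_session.items():
        groups[frozenset(labels)].append(session)
    return sorted(max(groups.values(), key=len))

def zscore_within_session(X, sessions):
    """Z-score every column of X separately within each session."""
    X = np.asarray(X, dtype=float)
    sessions = np.asarray(sessions)
    Z = np.empty_like(X)
    for s in np.unique(sessions):
        rows = sessions == s
        Z[rows] = (X[rows] - X[rows].mean(axis=0)) / X[rows].std(axis=0)
    return Z

# Hypothetical channel labels: sessions 0 and 1 share a montage, session 2 does not.
channels = {0: ["A1-A2", "A2-A3"], 1: ["A1-A2", "A2-A3"], 2: ["A1-A2"]}
kept_sessions = largest_matching_sessions(channels)

# Synthetic log-power shaped (events, freqs, channels), flattened to N x p.
rng = np.random.default_rng(1)
log_power = rng.normal(size=(12, 2, 3))
sessions = np.repeat([0, 1], [7, 5])
X = log_power.reshape(len(log_power), -1)          # p = n_freqs * n_channels
Z = zscore_within_session(X, sessions)
```

If two groups of sessions are tied in size, this sketch keeps whichever group it encounters first; you may prefer a different tie-break, such as the group with more events.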

In [4]:
freqs = np.logspace(np.log10(3), np.log10(180), 8)

In [5]:
# Question 1.1
### YOUR CODE HERE

## Question 2
1) Use leave-one-session-out cross-validation to train and test L2-penalized logistic regression classifiers.

* For each cross-validation iteration, you will (1) leave out one session ($X_{test}$, $y_{test}$), (2) train the L2-penalized logistic regression classifier on the encoding events from all other sessions ($X_{train}$, $y_{train}$), and (3) use the fitted model to predict recall for the encoding events in the held-out session. For each encoding event in the held-out session, you should get a predicted probability that the item will be subsequently recalled. Repeat this procedure until every session of the subject has been held out once. At the end of the leave-one-session-out procedure, you will have a predicted probability for every encoding event, each produced by a model that was trained without any events from the same session. After doing this separately for each subject, you should have cross-validated predictions for all encoding events. Use the default penalty parameter (C) of 1.0; in sklearn, C is the inverse of the regularization strength.

2) For the first three subjects in the list above, plot a histogram of the predicted cross-validated probabilities across all encoding events, giving different colors to predictions for encoding events of words that were subsequently recalled and for encoding events for words that were not recalled. 

3) Based on your results from (1) and the visualizations in (2), how strongly do the neural features predict subsequent recall?

* Hint: since different sessions have different numbers of events (subjects can discontinue a session partway through), you'll need to think carefully about how to implement leave-one-session-out cross validation.
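One way to handle sessions of unequal length is to split on a per-event session label rather than on equally sized blocks. The sketch below uses sklearn's `LeaveOneGroupOut` for this, which is a choice of this example and not something the assignment prescribes; the function name `loso_predict` and the data are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

def loso_predict(X, y, sessions, C=1.0):
    """Cross-validated P(recalled) per event, holding out one session at a time."""
    probs = np.full(len(y), np.nan)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sessions):
        # Default LogisticRegression settings correspond to an L2 penalty.
        model = LogisticRegression(C=C)
        model.fit(X[train_idx], y[train_idx])
        # Column 1 of predict_proba is the probability of class 1 (recalled).
        probs[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return probs

# Synthetic data: three sessions with different numbers of events.
rng = np.random.default_rng(2)
sessions = np.repeat([0, 1, 2], [30, 25, 20])
X = rng.normal(size=(len(sessions), 6))
y = (X[:, 0] + rng.normal(size=len(sessions)) > 0).astype(int)
cv_probs = loso_predict(X, y, sessions)
```

Because the split is driven by the session label of each event, the number of events per session never has to match.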

In [8]:
# Question 2.1
### YOUR CODE HERE

In [9]:
# Question 2.2
### YOUR CODE HERE

Question 2.3

**YOUR ANSWER HERE**

## Question 3
To assess the performance of a classifier, we will use the area under the receiver operating characteristic (ROC) curve (AUC).

1) Using sklearn’s ROC curve function, calculate the ROC and the corresponding AUC for each subject. Plot all the subjects’ ROC curves in one plot, and plot all the subjects’ AUCs in one histogram. 

2) Compute a subject-level ROC curve and AUC value by pooling the predictions from all cross-validation folds and computing the ROC curve and AUC on the pooled predictions.

3) How good is the performance? Run a statistical test to determine if the between-subject average performance is reliably above chance.
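As a rough illustration of the calls involved, the sketch below computes an ROC curve and AUC per simulated subject and then tests the subject AUCs against the chance level of 0.5. The one-sample, one-sided t-test (`scipy.stats.ttest_1samp`) is one reasonable choice and an assumption of this example, not necessarily the test you should report. All labels and predictions are simulated.

```python
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.default_rng(3)

def simulate_subject(n_events=300, signal=0.8):
    """Synthetic labels and classifier outputs (not real data)."""
    y = rng.integers(0, 2, size=n_events)
    scores = signal * y + rng.normal(size=n_events)
    return y, 1.0 / (1.0 + np.exp(-scores))

subject_aucs = []
for _ in range(6):
    y_true, probs = simulate_subject()
    # fpr/tpr trace out the ROC curve; plot tpr against fpr for each subject.
    fpr, tpr, _thresholds = roc_curve(y_true, probs)
    subject_aucs.append(roc_auc_score(y_true, probs))

# Chance performance corresponds to AUC = 0.5.
t_stat, p_value = ttest_1samp(subject_aucs, popmean=0.5, alternative="greater")
```

Integrating the curve with `auc(fpr, tpr)` gives the same value as `roc_auc_score`, which is a handy sanity check when plotting.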

In [10]:
# Question 3.1
### YOUR CODE HERE

In [11]:
# Question 3.2
### YOUR CODE HERE

In [12]:
# Question 3.3
### YOUR CODE HERE

## Question 4
1) Report mean train AUCs and mean test AUCs across cross-validation folds for all subjects with two overlapping histograms (two histograms in the same plot). 
* In contrast to how the test AUCs were computed at the subject level in previous problems, here you can compute the train and test AUCs for a given fold with just the predictions from that fold, and then average those fold-level AUCs together.
2) What is the mean difference across subjects in cross-validated AUC scores between training and testing?
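A minimal sketch of fold-level train and test AUCs for one subject, assuming the same session-label split as before. The function name `fold_aucs` and the data are synthetic placeholders. Note that `roc_auc_score` raises an error if a held-out session contains only one class, so real data may need a guard for that case.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def fold_aucs(X, y, sessions, C=1.0):
    """Return (train AUCs, test AUCs), one entry per held-out session."""
    train_aucs, test_aucs = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sessions):
        model = LogisticRegression(C=C).fit(X[train_idx], y[train_idx])
        # Train AUC: score the same events the model was fit on.
        p_train = model.predict_proba(X[train_idx])[:, 1]
        # Test AUC: score only the held-out session.
        p_test = model.predict_proba(X[test_idx])[:, 1]
        train_aucs.append(roc_auc_score(y[train_idx], p_train))
        test_aucs.append(roc_auc_score(y[test_idx], p_test))
    return np.array(train_aucs), np.array(test_aucs)

# Synthetic data for a single subject with three sessions.
rng = np.random.default_rng(4)
sessions = np.repeat([0, 1, 2], [30, 25, 20])
X = rng.normal(size=(len(sessions), 6))
y = (X[:, 0] + rng.normal(size=len(sessions)) > 0).astype(int)
train_aucs, test_aucs = fold_aucs(X, y, sessions)
mean_train_auc, mean_test_auc = train_aucs.mean(), test_aucs.mean()
```

Collecting `mean_train_auc` and `mean_test_auc` for every subject gives the two distributions to overlay in the histogram.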

In [13]:
# Question 4.1
### YOUR CODE HERE

Question 4.2

**YOUR ANSWER HERE**

## Optimizing Hyperparameters using Nested Cross-Validation

In the previous questions you used a fixed penalty parameter (C = 1) for all subjects. Often we want to tune the hyperparameters of a model to optimize its performance. Note that the aim of cross-validation is not to obtain one or more trained models for inference, but to obtain an unbiased estimate of generalization performance. We can use a grid search in which we search over 10 values of C logarithmically spaced between $10^{-6}$ and $10^2$ (np.logspace(-6,2,10)). One naive approach is to repeat the earlier analyses for every C value and select the C that maximizes the average AUC across folds. The problem is that if the test set is used repeatedly to choose among trained models, it “leaks” information into the model selection, so the resulting performance estimate is optimistically biased.

To rigorously select the optimal parameter C and correctly estimate the prediction error of the optimal model, we use a nested cross-validation procedure. As the name suggests, you will perform two rounds of cross-validation, with the inner CV nested in the outer CV. The outer CV is responsible for obtaining the prediction error for the model, and the inner CV is responsible for selecting the optimal hyperparameter for each outer CV fold. We apply the nested cross-validation procedure to our dataset as follows:
* For each subject, divide the dataset into K folds corresponding to K sessions.
* For each fold $k = 1, \dots, K$ (outer loop for evaluation of the model):
    * Let *test* be the $k$th session (hold-out fold), and *train_val* be all other sessions except the $k$th fold. *train_val* should have $K - 1$ sessions. We next perform cross-validation on the train-validation data (inner CV), while leaving the test data alone.
    * For each $l = 1, \dots, K - 1$ (inner loop for hyperparameter tuning):
        - Let *val_inner* be the held-out fold for the inner CV (the $l$th session of *train_val*), and *train_inner* be the remaining $K - 2$ sessions.
        - For each value C in the grid, train a classifier on the *train_inner* data set and obtain predictions for the inner held-out fold, *val_inner*.
        - Repeat the procedure until you sweep through every session of *train_val*.
    * Select the optimal C from the inner cross-validation that maximizes the inner AUC for the *train_val* data.
    * Retrain the model using the entire *train_val* data with the optimal C.
    * Finally, test the optimal model on the outer held-out fold $k$.
* Repeat the procedure, holding out each fold in turn as the test fold.

As you may notice, the procedure above is computationally intensive and requires a lot of data. For instance, R1065J has 10 sessions of data. We perform 10 iterations for the outer CV, and for each held-out session in the outer CV we perform an inner CV procedure on the outer training data, which entails an additional 9 iterations. On top of that, we need to perform the inner CV for all 10 values of C in the grid. As a result, you will train 10 × 9 × 10 = 900 classifiers in the inner loops for a single subject, plus one final refit per outer fold.
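The nested procedure above can be sketched as follows. This is one possible reading, in which the inner AUC for each C is computed on the inner predictions pooled over all of *train_val*; averaging fold-level inner AUCs would be another. The function name `nested_loso` and the data are synthetic placeholders, and the subject needs at least three sessions so that the inner loop has something to hold out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def nested_loso(X, y, sessions, Cs):
    """Nested leave-one-session-out CV for one subject.

    Returns outer-fold predictions, the C chosen per outer fold, and the
    number of classifiers trained in the inner loops.
    """
    logo = LeaveOneGroupOut()
    probs = np.full(len(y), np.nan)
    chosen, n_inner_fits = [], 0
    for tv_idx, test_idx in logo.split(X, y, groups=sessions):       # outer loop
        X_tv, y_tv, s_tv = X[tv_idx], y[tv_idx], sessions[tv_idx]
        inner_aucs = []
        for C in Cs:
            inner_probs = np.full(len(y_tv), np.nan)
            for tr, val in logo.split(X_tv, y_tv, groups=s_tv):      # inner loop
                clf = LogisticRegression(C=C).fit(X_tv[tr], y_tv[tr])
                inner_probs[val] = clf.predict_proba(X_tv[val])[:, 1]
                n_inner_fits += 1
            inner_aucs.append(roc_auc_score(y_tv, inner_probs))
        best_C = Cs[int(np.argmax(inner_aucs))]
        chosen.append(best_C)
        # Refit on all of train_val with the selected C, then score the test fold.
        final = LogisticRegression(C=best_C).fit(X_tv, y_tv)
        probs[test_idx] = final.predict_proba(X[test_idx])[:, 1]
    return probs, chosen, n_inner_fits

# Synthetic data: four sessions of unequal length, ten C values.
rng = np.random.default_rng(5)
sessions = np.repeat([0, 1, 2, 3], [20, 18, 22, 20])
X = rng.normal(size=(len(sessions), 5))
y = (X[:, 0] + rng.normal(size=len(sessions)) > 0).astype(int)
Cs = np.logspace(-6, 2, 10)
nested_probs, chosen_Cs, n_inner_fits = nested_loso(X, y, sessions, Cs)
```

With four sessions and ten C values the inner loops train 4 × 3 × 10 = 120 classifiers, the same K × (K − 1) × |grid| count as in the R1065J example.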

## Question 5
1) Perform nested CV on subjects with at least 6 sessions of data and compare the performance (AUC) of the optimal classifier (with the optimal C) to the default classifier (with C = 1.0) using L2-penalized logistic regression.

2) Does optimizing the regularization hyperparameter help? Use barplots, scatterplots, and appropriate statistical tests to support your conclusions.

In [14]:
# Question 5.1
### YOUR CODE HERE

In [15]:
# Question 5.2
### YOUR CODE HERE