# Standard CSP pipelines

This notebook implements multiple standard CSP pipelines and tests their performance on the data from the database provided by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211).
The knowledge and utilities obtained from the experimental notebooks four to five are used throughout this notebook.

This notebook works in an offline fashion and uses epochs with a length of 3 seconds.
Each epoch starts 1 second before the visual cue was given, includes the 1 second during which the visual cue was shown, and ends 1 second after the visual cue was hidden, totalling 3 seconds.
Baseline correction was done on the first second of the epoch, meaning the second before the visual cue was shown.
The effective training and testing are done on a 2-second window, starting 0.5 seconds before the 1-second visual cue and ending 0.5 seconds after it.
A window of 2 seconds was chosen as it is a common size for sliding window approaches in online systems.
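
The timing described above can be translated into sample indices as sketched below. This is only an illustration, not the preprocessing used in this notebook (which relies on MNE and `CLA_dataset.py`). The sampling rate of 200 Hz and the helper names `window_to_samples` and `baseline_correct` are assumptions made for the example.

```python
def window_to_samples(t_start, t_end, epoch_start, sfreq):
    """Convert a time window (seconds, relative to cue onset) to sample indices in the epoch."""
    first = round((t_start - epoch_start) * sfreq)
    last = round((t_end - epoch_start) * sfreq)
    return first, last


def baseline_correct(signal, epoch_start, sfreq, baseline_end=0.0):
    """Subtract the mean of the pre-cue segment from a single-channel signal."""
    n_baseline = round((baseline_end - epoch_start) * sfreq)
    baseline_mean = sum(signal[:n_baseline]) / n_baseline
    return [sample - baseline_mean for sample in signal]


SFREQ = 200         # Assumed sampling rate in Hz (hypothetical value for this sketch)
EPOCH_START = -1.0  # The epoch starts 1 s before cue onset
CUE_DURATION = 1.0  # The cue is shown for 1 s

# 2-second window: 0.5 s before cue onset until 0.5 s after the cue is hidden
first, last = window_to_samples(-0.5, CUE_DURATION + 0.5, EPOCH_START, SFREQ)
print(first, last, (last - first) / SFREQ)  # → 100 500 2.0
```

Under these assumptions, a 3-second epoch holds 600 samples and the 2-second window corresponds to `signal[100:500]` of the baseline-corrected signal.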


Instructions on where to get the data are available on [the GitHub repository of the BCI master thesis project](https://www.github.com/pikawika/bci-master-thesis). These instructions are under `bci-master-thesis/code/data/CLA/README.md`. We will use the utility file `bci-master-thesis/code/utils/CLA_dataset.py` to work with this data. The data was stored as FIF files, which are included in [the GitHub repository of the BCI master thesis project](https://www.github.com/pikawika/bci-master-thesis).

<hr><hr>

## Table of Contents

- Checking requirements
   - Correct Anaconda environment
   - Correct module access
   - Correct file access
- Same subject, same session
   - Same subject, same session: LDA classifier 
- Same subject, new session
- New subject
- New subject with calibration
- Cleaning residual notebook variables

<hr><hr>

## Checking requirements

### Correct Anaconda environment

The `bci-master-thesis` Anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the BCI master thesis project](https://www.github.com/pikawika/bci-master-thesis).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version
from pathlib import Path
from copy import copy

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'bci-master-thesis'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: bci-master-thesis
Correct environment: True

Python version: 3.8.10
Correct Python version: True


<hr>

### Correct module access

The following code block will load in all required modules.

In [2]:
####################################################
# LOADING MODULES
####################################################

# Load util function file
import sys
sys.path.append('../utils')
import CLA_dataset

# IO functions
from IPython.utils import io

# Set logging level for MNE before loading MNE
os.environ['MNE_LOGGING_LEVEL'] = 'WARNING'

# Modules tailored for EEG data
import mne; print(f"MNE version (1.0.2 recommended): {mne.__version__}")
from mne.decoding import CSP
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV

# ML libraries
import sklearn;  print(f"Scikit-learn version (1.0.2 recommended): {sklearn.__version__}")
from sklearn.pipeline import Pipeline

# Data manipulation modules
import numpy as np; print(f"Numpy version (1.21.5 recommended): {np.__version__}")

# Plotting
import matplotlib; print(f"Matplotlib version (3.5.1 recommended): {matplotlib.__version__}")
import matplotlib.pyplot as plt

# Storing files
import pickle;  print(f"Pickle version (4.0 recommended): {pickle.format_version}")

MNE version (1.0.2 recommended): 1.0.2
Scikit-learn version (1.0.2 recommended): 1.0.2
Numpy version (1.21.5 recommended): 1.21.5
Matplotlib version (3.5.1 recommended): 3.5.1
Pickle version (4.0 recommended): 4.0


<hr>

### Correct file access

As mentioned, this notebook uses the database provided by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211), in particular the CLA dataset. Instructions on where to get the data are available on [the GitHub repository of the BCI master thesis project](https://www.github.com/pikawika/bci-master-thesis). These instructions are under `bci-master-thesis/code/data/CLA/README.md`. The following code block checks if all required files are available.

In [3]:
####################################################
# CHECKING FILE ACCESS
####################################################

# Use util to determine if we have access
print("Full Matlab CLA file access: " + str(CLA_dataset.check_matlab_files_availability()))
print("Full MNE CLA file access: " + str(CLA_dataset.check_mne_files_availability()))

Full Matlab CLA file access: True
Full MNE CLA file access: True


<hr><hr>

## Same subject, same session

As discussed in the master's thesis, training and testing a classification system can happen using multiple strategies.
A classifier may be trained on a singular subject, using a singular session and testing on that same session.
This is an overly optimistic testing scenario with a high risk of overfitting and poor generalisation to new sessions or new subjects, but it serves as a baseline test to see if *at least something* can be learned.
We do this for three different traditional machine learning classifiers: linear discriminant analysis (LDA), support vector machines (SVM) and random forest (RF).
K-nearest neighbour (KNN) is not considered as its predictions are too time-consuming. Complex models such as a multilayer perceptron (MLP) are not considered either, as they are an integral part of the deep learning models considered in later notebooks.

<hr>

### Same subject, same session: LDA classifier

This experiment works as follows:
   - We use participants with at least three recordings
      - Participants: B, C, E
      - NOTE: participant F has three files provided, but one of those files contains fewer than three MI classes, hence participant F is not considered here
   - We use the last recorded session of each of these participants, thus the one where the participant has the most experience
   - We get epochs of 3 seconds, which includes one second before and after the visual cue
      - We use a two-second window including 0.5 seconds before and after the visual cue for training
   - We split the data in a train/test dataset with 20% test data balanced over all MI classes
   - We use grid search on the 2-second windows of each baseline-corrected epoch from the train split to find the best parameters for the pipeline
      - The frequency filtering uses fixed parameters. This limits the training time, and CSP alternatives that perform automatic filtering exist and are recommended over manually searching for the best frequencies through grid search
         - According to [Afrakhteh and RezaMosavi](https://doi.org/10.1016/B978-0-12-819045-6.00002-9), the desired frequency band for MI classification is 8-30 Hz.
         - However, the neutral task isn't a specific MI task and is more likely to correspond with a relaxed state, which is associated with lower frequencies.
         - To accommodate the neutral task and to have a general configuration that suits all participants, the overlap-add FIR filter passes frequencies from 2 to 35 Hz
      - The pipeline that is hyperparameter tuned is as follows
         - CSP -> LDA
      - The following hyperparameters are tested
         - For CSP:
            - Number of components: 2 | 3 | 4 | 6 | 10
         - For LDA:
            - The solver: svd | lsqr | eigen
            - When using the svd solver, the tolerance (`tol`): 0.0001 | 0.00001 | 0.001 | 0.0004 | 0.00007
   - We use the test split for final validation on the best-found parameters
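
To get a feeling for the cost of this search, the hyperparameter options listed above can be enumerated with the standard library alone. The sketch below only counts candidates and does not train anything. The helper `expand` is a hypothetical name introduced for this example, and the fold count of four is taken from the cross-validation configured in the code cell that follows.

```python
from itertools import product

# Same options as listed above: tol is only varied for the svd solver
param_grid = [
    {"CSP__n_components": [2, 3, 4, 6, 10],
     "LDA__solver": ["svd"],
     "LDA__tol": [0.0001, 0.00001, 0.001, 0.0004, 0.00007]},
    {"CSP__n_components": [2, 3, 4, 6, 10],
     "LDA__solver": ["lsqr", "eigen"]},
]


def expand(grid):
    """Return every parameter combination of a single grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]


# Combine the candidates of both sub-grids
candidates = [candidate for grid in param_grid for candidate in expand(grid)]

n_folds = 4  # Number of cross-validation folds
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)  # → 35 140
```

The first sub-grid yields 5 x 1 x 5 = 25 candidates and the second 5 x 2 = 10, so 35 candidates are evaluated on four folds each, plus one final refit of the best candidate on the full training split.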

In [4]:
####################################################
# GRID SEARCHING BEST PIPELINE FOR EACH SUBJECT
####################################################

# Configure global parameters for all experiments
subject_ids_to_test = ["B", "C", "E"] # Subjects with three recordings
start_offset = -1 # One second before visual cue
end_offset = 1 # One second after visual cue
baseline = (None, 0) # Baseline correction using data before the visual cue
filter_lower_bound = 2 # Filter out any frequency below 2Hz 
filter_upper_bound = 35 # Filter out any frequency above 35Hz

# Loop over the subjects and perform the grid search for finding the best parameters
# NOTE: the [:1] slice limits this run to the first subject; remove it to process all subjects
for subject_id in subject_ids_to_test[:1]:
    # Get MNE raw object for latest recording of that subject
    mne_raw = CLA_dataset.get_last_raw_mne_data_for_subject(subject_id= subject_id)
    # Get epochs for that MNE raw
    mne_epochs = CLA_dataset.get_usefull_epochs_from_raw(mne_raw,
                                                         start_offset= start_offset,
                                                         end_offset= end_offset,
                                                         baseline= baseline)
    
    # Only keep epochs from the MI tasks
    mne_epochs = mne_epochs['task/neutral', 'task/left', 'task/right']
    
    # Load epochs into memory
    mne_epochs.load_data()
    
    # Get the labels
    labels = mne_epochs.events[:, -1]
    
    # Use a fixed filter
    mne_epochs.filter(l_freq= filter_lower_bound,
                      h_freq= filter_upper_bound,
                      picks= "all",
                      phase= "minimum",
                      fir_window= "blackman",
                      fir_design= "firwin",
                      pad= 'median', 
                      n_jobs= -1,
                      verbose= False)
    
    # Create a test and train split
    X_train, X_test, y_train, y_test = train_test_split(mne_epochs,
                                                        labels,
                                                        test_size = 0.2,
                                                        shuffle= True,
                                                        stratify= labels,                                                    
                                                        random_state= 1998)
    
    # Configure the pipeline components by specifying the default parameters
    csp = CSP(norm_trace=False,
              component_order="mutual_info",
              cov_est= "epoch")
    
    lda = LinearDiscriminantAnalysis(shrinkage= None,
                                     priors=[1/3, 1/3, 1/3])
    
    # Configure the pipeline
    pipeline = Pipeline([('CSP', csp), ('LDA', lda)])
    
    # Configure cross validation to use
    cv = StratifiedKFold(n_splits=4,
                         shuffle= True,
                         random_state= 2022)
    
    # Configure the hyperparameters to test
    # NOTE: these are somewhat limited due to limited computational resources
    param_grid = [{"CSP__n_components": [2, 3, 4, 6, 10],
                   "LDA__solver": ["svd"],
                   "LDA__tol": [0.0001, 0.00001, 0.001, 0.0004, 0.00007]
                   },
                  {"CSP__n_components": [2, 3, 4, 6, 10],
                   "LDA__solver": ["lsqr" , "eigen"]
                   }]
    
    # Configure the grid search
    grid_search = GridSearchCV(estimator= pipeline,
                               param_grid= param_grid,
                               scoring= "balanced_accuracy",
                               n_jobs= -1,
                               refit= True,
                               cv= cv,
                               verbose= 10,
                               return_train_score= True)

    # Do the grid search on the training data
    grid_search.fit(X= X_train, 
                    y= y_train)
    
    # Store the results of the grid search
    with open(f"saved_variables/4/gridsearch_samesubject_samesession_csplda_subject{subject_id}.pickle", 'wb') as file:
        pickle.dump(grid_search, file)
    
    # Store the best model and the test data
    with open(f"saved_variables/4/bestmodel_samesubject_samesession_csplda_subject{subject_id}.pickle", 'wb') as file:
        pickle.dump(grid_search.best_estimator_, file)
    with open(f"saved_variables/4/testdata-x_samesubject_samesession_csplda_subject{subject_id}.pickle", 'wb') as file:
        pickle.dump(X_test, file)
    with open(f"saved_variables/4/testdata-y_samesubject_samesession_csplda_subject{subject_id}.pickle", 'wb') as file:
        pickle.dump(y_test, file)
    
    # Delete vars after singular experiment
    del mne_raw
    del mne_epochs
    del csp
    del lda
    del pipeline
    del labels
    del cv
    del file
    del X_train
    del X_test
    del y_train
    del y_test 
    del grid_search
    del param_grid
    
# Delete vars after all experiments
del baseline
del end_offset
del start_offset
del subject_id
del subject_ids_to_test
del filter_lower_bound
del filter_upper_bound

<hr><hr>

## Same subject, new session

TODO

In [None]:
# TODO: foresee as needed in paper.

<hr><hr>

## New subject

TODO

In [None]:
# TODO: foresee as needed in paper.

<hr><hr>

## New subject with calibration

TODO

In [None]:
# TODO: foresee as needed in paper.

<hr><hr>

## Cleaning residual notebook variables

This last code block cleans any residual notebook variables.

In [None]:
####################################################
# CLEAN NOTEBOOK VARIABLES
####################################################
