Run the following cells to mount your Google Drive and move into the tutorial folder.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
%cd /content/gdrive/My\ Drive/TD_Dreem_MasterBin/Dreem_Master_Bin
! ls

This tutorial is about machine learning methods for sleep stage classification.


In [3]:
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score, confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from dreem_master_bin.hypnogram import plot_hypnogram, stage_colors

To save you some time, train and test datasets are already available in the data folder. They consist of preprocessed data, not raw records. You have two types of datasets:

- Spectral dataset = data containing spectral power (spectrogram matrix): train and test
- Features dataset = data containing precomputed features: train and test

In [4]:
from dreem_master_bin.load_data import load_spectral_datasets, load_feature_datasets

# load spectrogram dataset and shuffle train data
x_train_spect, y_train_spect, x_test_spect, y_test_spect = load_spectral_datasets()
spectral_names = ['index_window', '1Hz', '2Hz', '3Hz', '4Hz', '5Hz', '6Hz', '7Hz',
                  '8Hz', '9Hz', '10Hz', '11Hz', '12Hz', '13Hz', '14Hz', '15Hz',
                  '16Hz', '17Hz', '18Hz']
# shuffle train dataset
p = np.random.permutation(len(y_train_spect))
x_train_spect, y_train_spect = x_train_spect[p], y_train_spect[p]


# load precomputed features dataset and shuffle train data
x_train_feat, y_train_feat, x_test_feat, y_test_feat = load_feature_datasets()
features_name = ['index_window', 'delta', 'delta_r', 'theta', 'theta_r',
       'lowfreq', 'lowfreq_r', 'alpha', 'alpha_r', 'sigma', 'sigma_r', 'beta',
       'beta_r', 'kcomp', 'kcomp_r', 'SC', 'SEF90', 'SEF95', 'Nb spindles',
       'spindles magnitude', 'spindles duration', 'Nb slow waves',
       'slow waves magnitude', 'slow waves duration', 'AccelerometerVar',
       'little movement', 'strong movement']
# shuffle train dataset
p = np.random.permutation(len(y_train_feat))
x_train_feat, y_train_feat = x_train_feat[p], y_train_feat[p]

We have just loaded the spectral data:

- x_train_spect, x_test_spect: spectral data used to predict sleep stages. Each is an array of shape n_samples x n_features
    - n_samples = number of sleep epochs
    - n_features = number of features for each of these epochs. The features are: [index_window, power_frequency_1Hz, power_frequency_2Hz, ..., power_frequency_18Hz], where index_window is the position of the sample in its sleep record.
- y_train_spect, y_test_spect: labels (sleep stages)

Then, we have loaded the features dataset:

- x_train_feat, x_test_feat: arrays of shape n_samples x n_features
    - n_features = number of precomputed features for each epoch. The features are ['index_window', 'delta', 'delta_r', 'theta', 'theta_r', 'lowfreq', 'lowfreq_r', 'alpha', 'alpha_r', 'sigma', 'sigma_r', 'beta', 'beta_r', 'kcomp', 'kcomp_r', 'SC', 'SEF90', 'SEF95', 'Nb spindles', 'spindles magnitude', 'spindles duration', 'Nb slow waves', 'slow waves magnitude', 'slow waves duration', 'AccelerometerVar', 'little movement', 'strong movement']
- y_train_feat, y_test_feat: labels (sleep stages)
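To make the array layout described above concrete, here is a minimal sketch that uses a small synthetic array in place of the real spectral data. The random values, the integer stage codes, and the names `x_demo`, `y_demo` and `column_names` are assumptions for illustration only; they are not taken from the datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for x_train_spect: 6 epochs x 19 columns
# (index_window followed by 18 spectral power values). Values are random.
column_names = ['index_window'] + [f'{f}Hz' for f in range(1, 19)]
x_demo = np.column_stack([np.arange(6), rng.random((6, 18))])

# Synthetic stand-in for y_train_spect; these stage codes are hypothetical.
y_demo = np.array([0, 1, 2, 2, 3, 2])

n_samples, n_features = x_demo.shape
print(n_samples, n_features)

# Select one column by name, e.g. the power at 10 Hz for every epoch
power_10hz = x_demo[:, column_names.index('10Hz')]

# Count how many epochs fall in each stage
stages, counts = np.unique(y_demo, return_counts=True)
print(dict(zip(stages.tolist(), counts.tolist())))
```

The same two checks (`x.shape` and `np.unique(y, return_counts=True)`) can be run on the real arrays to see how many epochs are available and how unbalanced the sleep stages are.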



Let's start with the spectral dataset!

In [5]:
x_train, y_train = x_train_spect, y_train_spect
x_test, y_test = x_test_spect, y_test_spect

1 - Dimension reduction + (linear) classifier

- Choose an algorithm for dimension reduction (e.g. PCA)
- Choose a (linear) classifier (e.g. SVM classifier)


You can search the online documentation of the scikit-learn library for similar functions, using the keywords:
- multi class classifier
- dimension reduction
- decomposition
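When choosing a dimension-reduction method such as PCA, a common question is how many components to keep. The sketch below shows one possible heuristic: keep the smallest number of components that explains at least 95% of the variance. It runs on synthetic data (the arrays `latent`, `mixing` and `x_demo` and the 95% threshold are assumptions for illustration), so the number it prints says nothing about the sleep datasets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 19 features driven by 3 latent factors plus noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 19))
x_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 19))

# Scale, then fit a PCA that keeps all components
reducer = make_pipeline(StandardScaler(), PCA(random_state=10))
reducer.fit(x_demo)

# Cumulative fraction of variance explained by the first k components
cumulative = np.cumsum(reducer.named_steps['pca'].explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95%
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components_95)
```

The resulting value can then be passed as `n_components` to `PCA`. Whether 95% is an appropriate threshold depends on the data and on the downstream classifier.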


In [None]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# scale input data and reduce dimension
pca = make_pipeline(StandardScaler(),
                    PCA(n_components=5, random_state=10))
pca.fit(x_train, y_train)

# linear classifier
classifier = SVC(kernel='linear')
# training: fit the model to the data
classifier.fit(pca.transform(x_train), y_train)

# test it
predictions = classifier.predict(pca.transform(x_test))
scores = {'balanced_accuracy': balanced_accuracy_score(y_test, predictions),
            'cohen_kappa': cohen_kappa_score(y_test, predictions),
            'confusion_matrix': confusion_matrix(y_test, predictions)}

scores
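To help read the scores above, here is a small worked example of how balanced accuracy and Cohen's kappa follow from a confusion matrix. The 3-class matrix `cm` is made up for illustration and is not a result from this tutorial.

```python
import numpy as np

# Hypothetical confusion matrix for 3 classes (rows: true class, columns: predicted class)
cm = np.array([[50,  5,  5],
               [10, 20, 10],
               [ 0,  5, 15]])
n = cm.sum()

# Balanced accuracy: mean of the per-class recalls, so every class weighs the same
recalls = np.diag(cm) / cm.sum(axis=1)
balanced_accuracy = recalls.mean()

# Cohen's kappa: observed agreement corrected for the agreement expected by chance
p_observed = np.trace(cm) / n
p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f'balanced accuracy: {balanced_accuracy:.3f}')  # 0.694
print(f"Cohen's kappa: {kappa:.3f}")                  # 0.533
```

Plain accuracy on this matrix would be 85/120, about 0.708. Balanced accuracy is lower because the largest class is also the best-recognized one, which matters for sleep staging where some stages are much rarer than others.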

2 - Ensemble learning
https://scikit-learn.org/stable/modules/ensemble.html

Here we use the Random Forest method. You can also try the other ensemble learning estimators of the scikit-learn library.
This time, we work with the precomputed features dataset.

Refer to the online documentation of these estimators to set their parameters.



In [None]:
# switch to the precomputed features dataset (already loaded and shuffled above)
x_train, y_train = x_train_feat, y_train_feat
x_test, y_test = x_test_feat, y_test_feat

# select a classifier and train it
from sklearn.ensemble import RandomForestClassifier

clf_rf = make_pipeline(StandardScaler(),
                       RandomForestClassifier(max_depth=10, random_state=42))
print('training...')
clf_rf.fit(x_train, y_train)

# test it
predictions = clf_rf.predict(x_test)
scores = {'balanced_accuracy': balanced_accuracy_score(y_test, predictions),
            'cohen_kappa_score': cohen_kappa_score(y_test, predictions),
            'confusion_matrix': confusion_matrix(y_test, predictions)}

scores
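As a starting point for trying other ensemble methods, the sketch below compares three scikit-learn ensemble classifiers with 3-fold cross-validation. It uses a synthetic 5-class problem from `make_classification` as a stand-in for the features dataset, and the hyperparameters are arbitrary assumptions, so the printed scores are not indicative of performance on sleep data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic 5-class problem standing in for the features dataset
x_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=8,
                                     n_classes=5, random_state=0)

# Candidate ensemble models (hyperparameters chosen arbitrarily for the demo)
candidates = {
    'random_forest': RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42),
    'extra_trees': ExtraTreesClassifier(n_estimators=50, max_depth=10, random_state=42),
    'gradient_boosting': GradientBoostingClassifier(n_estimators=30, random_state=42),
}

# Mean balanced accuracy over 3 cross-validation folds for each candidate
cv_scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, x_demo, y_demo, cv=3,
                                  scoring='balanced_accuracy')
    cv_scores[name] = fold_scores.mean()
    print(f'{name}: {cv_scores[name]:.3f}')
```

Comparing models with cross-validation on the training set keeps the test set untouched until the final evaluation.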

3 - Stack multiple estimators

It is possible to combine multiple machine learning algorithms to improve performance.

> Use the StackingClassifier to stack estimators with a final classifier


In [None]:
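# select base estimators, stack them and train the stacked model
# the experimental import below is only required for scikit-learn < 1.0
from sklearn.experimental import enable_hist_gradient_boosting  # noqa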
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

rf_pipeline = make_pipeline(StandardScaler(),
                            RandomForestClassifier(n_estimators=10, random_state=42))
gradient_pipeline = make_pipeline(StandardScaler(),
                                  HistGradientBoostingClassifier(learning_rate=0.01, random_state=30))
estimators = [('Random Forest', rf_pipeline),
              ('Gradient Boosting', gradient_pipeline)]
stacking_classifier = StackingClassifier(estimators=estimators,
                                         final_estimator=LogisticRegression(max_iter=200))

print('training...')
stacking_classifier.fit(x_train, y_train)

# test it
predictions = stacking_classifier.predict(x_test)
scores = {'balanced_accuracy': balanced_accuracy_score(y_test, predictions),
          'cohen_kappa_score': cohen_kappa_score(y_test, predictions),
          'confusion_matrix': confusion_matrix(y_test, predictions)}

scores
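To see what the final classifier of a stacked model actually receives, the sketch below fits a small `StackingClassifier` on synthetic 3-class data and inspects the output of its `transform` method. The data, the two base estimators and their settings are assumptions chosen to keep the example fast; they are not the models trained above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (stand-in data, not the sleep datasets)
x_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=20, random_state=0)),
                ('tree', DecisionTreeClassifier(max_depth=3, random_state=0))],
    final_estimator=LogisticRegression(max_iter=500),
    cv=3,  # the final estimator is trained on cross-validated base predictions
)
stack.fit(x_demo, y_demo)

# Inputs of the final estimator: one probability per class and per base estimator
meta_features = stack.transform(x_demo)
print(meta_features.shape)  # (300, 6): 2 base estimators x 3 classes

# The final logistic regression learns one weight per class and per meta-feature
print(stack.final_estimator_.coef_.shape)  # (3, 6)
```

During `fit`, the final estimator is trained on out-of-fold predictions of the base estimators, which limits the risk that it simply trusts whichever base model overfits the training set the most.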

Throughout this tutorial, we have tried to predict sleep stages from precomputed features (spectral or other features).

Scikit-learn provides methods to assess the importance of each feature for the prediction.

In [None]:
# use the train split of the precomputed features dataset
x_train, y_train = x_train_feat, y_train_feat
list_features = features_name

# let's take an already trained classifier
clf = stacking_classifier

# permutation importance: a model-agnostic way to measure feature importance
from sklearn.inspection import permutation_importance
print('permutations...')
result = permutation_importance(clf, x_train, y_train, n_repeats=10, random_state=0)

# sort by importance
sorted_idx = result.importances_mean.argsort()

# Plot
fig, ax = plt.subplots(figsize=(25, 10))
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=[list_features[i] for i in sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()
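To clarify what permutation importance measures, here is a minimal hand-written sketch of the idea: shuffle one column at a time and record how much the model's score drops. The helper `manual_permutation_importance` is hypothetical and written for illustration; it is not scikit-learn's implementation, and it runs on synthetic data rather than on the sleep features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the 2 informative features are columns 0 and 1,
# the remaining 4 columns are pure noise
x_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=2,
                                     n_redundant=0, shuffle=False, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_demo, y_demo)


def manual_permutation_importance(model, x, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn (illustrative helper)."""
    rng = np.random.default_rng(seed)
    baseline = model.score(x, y)
    importances = np.zeros(x.shape[1])
    for j in range(x.shape[1]):
        drops = []
        for _ in range(n_repeats):
            x_perm = x.copy()
            # Break the link between feature j and the labels
            x_perm[:, j] = rng.permutation(x_perm[:, j])
            drops.append(baseline - model.score(x_perm, y))
        importances[j] = np.mean(drops)
    return importances


importances = manual_permutation_importance(model, x_demo, y_demo)
print(np.argsort(importances)[::-1])  # feature indices, most important first
```

A feature whose shuffling barely changes the score contributes little to the predictions. Computing the importances on held-out data instead of the training set shows which features matter for generalization rather than for fitting the training samples.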


You've reached the end of this second tutorial.

Let's move on to the last part, about deep learning methods.
Open the **Tutorial_Sleep_Staging_C** tutorial.