# CLA dataset exploration notebook

In this notebook, we perform some initial data exploration of raw EEG data.

This experimental notebook uses the CLA dataset from the database provided by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211). Instructions on where to get the data are available on [the GitHub repository of the BCI master thesis project](https://www.github.com/pikawika/bci-master-thesis), under `bci-master-thesis/code/data/CLA/README.md`. A Python variant of the files will be made available in the same repository.

## Table of Contents

- Checking requirements
    - Correct anaconda environment
    - Correct module access
    - Correct file access
- Loading in data
    - Exploring data structure
    - Classification labels
    - Making MNE object
    - Storing MNE objects for all CLA data

<hr><hr>

## Checking requirements

### Correct anaconda environment

The `bci-master-thesis` anaconda environment should be active to ensure proper support. Installation instructions are available on [the GitHub repository of the BCI master thesis project](https://www.github.com/pikawika/bci-master-thesis).

In [1]:
####################################################
# CHECKING FOR RIGHT ANACONDA ENVIRONMENT
####################################################

import os
from platform import python_version

print(f"Active environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Correct environment: {os.environ['CONDA_DEFAULT_ENV'] == 'bci-master-thesis'}")
print(f"\nPython version: {python_version()}")
print(f"Correct Python version: {python_version() == '3.8.10'}")

Active environment: bci-master-thesis
Correct environment: True

Python version: 3.8.10
Correct Python version: True
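
The check above reads `os.environ['CONDA_DEFAULT_ENV']` directly, which raises a `KeyError` when that variable is unset, for example when the notebook is started outside a conda environment. A minimal sketch of a more forgiving check follows. The helper name `check_environment` is hypothetical, and comparing only the major and minor Python version is an assumption rather than a requirement of the project.

```python
import os
import sys

EXPECTED_ENV = "bci-master-thesis"   # environment name used in this notebook
EXPECTED_PYTHON = (3, 8)             # assumption: only major.minor is compared


def check_environment(environ=os.environ, version_info=sys.version_info):
    """Return (env_name, env_ok, python_ok) without raising outside conda."""
    env_name = environ.get("CONDA_DEFAULT_ENV")  # None when the variable is unset
    env_ok = env_name == EXPECTED_ENV
    python_ok = tuple(version_info[:2]) == EXPECTED_PYTHON
    return env_name, env_ok, python_ok


env_name, env_ok, python_ok = check_environment()
print(f"Active environment: {env_name}")
print(f"Correct environment: {env_ok}")
print(f"Correct Python version: {python_ok}")
```

Passing the environment mapping and version tuple as parameters keeps the helper easy to test without changing the real environment.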


<hr>

### Correct module access

The following code block loads all required modules.

In [2]:
####################################################
# LOADING MODULES
####################################################

# Performs IO operations
import os
import scipy.io; print(f"Scipy version (1.8.0 recommended): {scipy.__version__}")

# Modules tailored for EEG data
import mne; print(f"MNE version (1.0.2 recommended): {mne.__version__}")

# Data manipulation modules
import numpy as np; print(f"Numpy version (1.21.5 recommended): {np.__version__}")

# Datetime object
import datetime
import pytz

# Print progress
print("\n\n... LOADED ALL MODULES ...")

Scipy version (1.8.0 recommended): 1.8.0
MNE version (1.0.2 recommended): 1.0.2
Numpy version (1.21.5 recommended): 1.21.5


... LOADED ALL MODULES ...


<hr>

### Correct file access

As mentioned, this experimental notebook uses the CLA dataset from the database provided by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211). Instructions on where to get the data are available on [the GitHub repository of the BCI master thesis project](https://www.github.com/pikawika/bci-master-thesis), under `bci-master-thesis/code/data/CLA/README.md`.

In [3]:
####################################################
# CHECKING FILE AVAILABILITY
####################################################

# You can specify the data directory here; by default it is "../data/CLA/"
data_directory = r'../data/CLA/'

# Files needed for this notebook
filenames = ["CLASubjectA1601083StLRHand.mat",
             "CLASubjectB1510193StLRHand.mat",
             "CLASubjectB1510203StLRHand.mat",
             "CLASubjectB1512153StLRHand.mat",
             "CLASubjectC1511263StLRHand.mat",
             "CLASubjectC1512163StLRHand.mat",
             "CLASubjectC1512233StLRHand.mat",
             "CLASubjectD1511253StLRHand.mat",
             "CLASubjectE1512253StLRHand.mat",
             "CLASubjectE1601193StLRHand.mat",
             "CLASubjectE1601223StLRHand.mat",
             "CLASubjectF1509163StLRHand.mat",
             "CLASubjectF1509173StLRHand.mat",
             "CLASubjectF1509283StLRHand.mat"]

# Append data directory to filenames
filenames = [data_directory + filename for filename in filenames]

# Check if all files are available, if not display file name of missing file
all_files_available = True

for filename in filenames:
    if (not os.path.isfile(filename)):
        print(filename + " not available!")
        all_files_available = False

# Display success message if all files are available
if (all_files_available):
    print("All files are available")
    
# Cleaning up redundant variables from this codeblock
del all_files_available
del data_directory
del filename

All files are available
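
The availability check can also be written with `pathlib`, which avoids joining paths by string concatenation. The following is a minimal sketch; the helper name `find_missing_files` is hypothetical, and it assumes the same default data directory as the cell above.

```python
from pathlib import Path


def find_missing_files(directory, names):
    """Return the names in `names` that are not regular files inside `directory`."""
    directory = Path(directory)
    return [name for name in names if not (directory / name).is_file()]


# Example with two of the CLA file names; "../data/CLA/" is the notebook default
missing = find_missing_files("../data/CLA/", ["CLASubjectA1601083StLRHand.mat",
                                              "CLASubjectB1510193StLRHand.mat"])
print("All files are available" if not missing else f"Missing: {missing}")
```

Returning the list of missing names, rather than printing inside the loop, lets later cells decide whether to stop or to continue with the files that are present.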


<hr><hr>

## Loading in data

In this step, we load the data. The data is provided as `.mat` files and was thus originally meant for use with MATLAB. However, SciPy can read these files in Python as well: `scipy.io.loadmat` loads a `.mat` file as a dictionary. From the article by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211) we know:

> The data in each file are represented as an instance of a Matlab structure named `o`, having the following key fields `id`, `nS`, `sampFreq`, `marker` and `data`.

We will explore one file to see how these files are structured.

In [4]:
####################################################
# LOADING 1 FILE AND CHECKING CONTENTS
####################################################

# Load in the first file
first_file_full = scipy.io.loadmat(filenames[0], struct_as_record=False, squeeze_me=True)

# show keys of the dictionary
print(f"File structure of first mat file: {first_file_full.keys()}")

# The data is stored inside the matlab structure named "o"
first_file_o = first_file_full['o']

# show fieldnames of the o structure
print(f"Fieldnames of o structure: {first_file_o._fieldnames}")

# Cleaning up redundant variables from this codeblock
del first_file_full

File structure of first mat file: dict_keys(['__header__', '__version__', '__globals__', 'o'])
Fieldnames of o structure: ['id', 'tag', 'sampFreq', 'nS', 'marker', 'data', 'chnames', 'binsuV']
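
To see what `struct_as_record=False` and `squeeze_me=True` do without needing the dataset, a small round trip through an in-memory `.mat` file can help. This is only a sketch: the record below is synthetic, with made-up values and a subset of the field names listed above, and it is not taken from a real CLA file.

```python
import io

import numpy as np
import scipy.io

# Synthetic stand-in for a CLA record (values are made up for illustration)
record = {"id": "demo", "sampFreq": 200, "nS": 4,
          "marker": np.array([0, 1, 1, 0]),
          "data": np.zeros((4, 22))}

# Write a struct named "o" to an in-memory .mat file (dicts are saved as structs)
buffer = io.BytesIO()
scipy.io.savemat(buffer, {"o": record})
buffer.seek(0)

# struct_as_record=False yields an object with attribute access;
# squeeze_me=True removes the singleton dimensions MATLAB adds
o = scipy.io.loadmat(buffer, struct_as_record=False, squeeze_me=True)["o"]

print(o._fieldnames)
print(o.sampFreq, o.marker.shape, o.data.shape)
```

Without `squeeze_me=True`, the one-dimensional `marker` array would come back as a two-dimensional array, because MATLAB stores every array with at least two dimensions.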


<hr>

### Exploring data structure

From loading in the data we see that the following data is available:

- id: A unique alphanumeric identifier of the record
- tag: Unknown field
   - Was not specified in the article
- sampFreq: Sampling frequency of the EEG data
- nS: Number of EEG data samples
- marker: The eGUI interaction record of the recording session
- data: The raw EEG data of the recording session
- chnames: Probably channel names of the EEG data sensors/channels in 10/20 configuration
   - Was not specified in the article
- binsuV: Probably bins per microvolt
   - Was not specified in the article

We will analyse this data for the first file now.

In [5]:
####################################################
# CHECKING DATA CONTENTS OF FIRST FILE
####################################################

print("id: " + str(first_file_o.id))

print("\ntag: " + str(first_file_o.tag))

print("\nsampFreq: " + str(first_file_o.sampFreq))

print("\nnS: " + str(first_file_o.nS))

print("\nmarker: " + str(first_file_o.marker))
print("marker shape: " + str(first_file_o.marker.shape))

print("\ndata: " + str(first_file_o.data))
print("data shape: " + str(first_file_o.data.shape))

print("\nchnames: " + str(first_file_o.chnames))
print("chnames shape: " + str(first_file_o.chnames.shape))

print("\nbinsuV: " + str(first_file_o.binsuV))

# Validating right output
print("\n\nAmount of channel names correspond with amount of channels available (" + str(first_file_o.data.shape[1]) +"): " + str(first_file_o.chnames.size == first_file_o.data.shape[1]))
print("Number of samples corresponds with amount of data records (" + str(first_file_o.data.shape[0]) +"): " + str(first_file_o.nS == first_file_o.data.shape[0]))

id: 201601081851.951FEF1D

tag: NK-data import (auto)

sampFreq: 200

nS: 671600

marker: [0 0 0 ... 0 0 0]
marker shape: (671600,)

data: [[ -0.    -0.    -0.   ...  -0.    -0.    -0.  ]
 [ -0.    -0.    -0.   ...  -0.    -0.    -0.  ]
 [ -0.    -0.    -0.   ...  -0.    -0.    -0.  ]
 ...
 [ 23.8  -28.4    4.31 ...  -8.31  -6.    -0.23]
 [ 10.74 -37.39   5.51 ...  -9.34  -5.99  -0.16]
 [  0.76 -47.95   3.66 ...  -7.32  -4.9   -0.41]]
data shape: (671600, 22)

chnames: ['Fp1' 'Fp2' 'F3' 'F4' 'C3' 'C4' 'P3' 'P4' 'O1' 'O2' 'A1' 'A2' 'F7' 'F8'
 'T3' 'T4' 'T5' 'T6' 'Fz' 'Cz' 'Pz' 'X5']
chnames shape: (22,)

binsuV: 1


Amount of channel names correspond with amount of channels available (22): True
Number of samples corresponds with amount of data records (671600): True
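
The fields `nS` and `sampFreq` together determine how long the recording lasted. A minimal sketch of that arithmetic, using the values printed above for the first file; the helper name `recording_duration_seconds` is hypothetical.

```python
def recording_duration_seconds(n_samples, sampling_frequency):
    """Duration covered by `n_samples` samples taken at `sampling_frequency` Hz."""
    return n_samples / sampling_frequency


n_samples = 671600   # nS of the first file
samp_freq = 200      # sampFreq of the first file

duration = recording_duration_seconds(n_samples, samp_freq)
last_sample_time = (n_samples - 1) / samp_freq  # time stamp of the final sample

print(f"Duration: {duration:.1f} s ({duration / 60:.1f} min)")
print(f"Time of last sample: {last_sample_time:.3f} s")
```

The first file therefore covers 3358 seconds, a little under an hour of recording.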


<hr>

### Classification labels

From the article by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211) we know:

>The `marker` field contains the recording sessions’ interaction record. This record is in the form of 1D
Matlab array of size nSx1, which contains integer values from 0 to 99. Each value encodes the state of the
eGUI at the time mapping to the corresponding EEG data sample in the “data” array at the same timeindex location.

We see the following codes in the CLA datasets:
- 0: “blank” or nothing is displayed in eGUI
    - Can be seen as a break between stimuli, thus random EEG data that should probably be ignored
- 1: Left hand action
    - EEG data for MI of the left hand
- 2: Right hand action
    - EEG data for MI of the right hand
- 3: Passive/neutral
    - EEG data for forced relax state, no MI activity
- 91: inter-session rest break period
- 92: experiment end
- 99: initial relaxation period
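
The code list above can be kept as a small lookup table so that counts are reported with readable names. The sketch below assumes the names given in the list; the dictionary `MARKER_NAMES` and the helper `summarise_markers` are hypothetical, and the example sequence is made up.

```python
from collections import Counter

# Marker codes as listed above for the CLA paradigm
MARKER_NAMES = {
    0: "blank",
    1: "left hand",
    2: "right hand",
    3: "passive/neutral",
    91: "inter-session rest break",
    92: "experiment end",
    99: "initial relaxation",
}


def summarise_markers(marker):
    """Count samples per marker code and attach a readable name to each code."""
    counts = Counter(int(code) for code in marker)
    return {code: (MARKER_NAMES.get(code, "unknown"), n)
            for code, n in sorted(counts.items())}


# Tiny made-up marker sequence, one value per EEG sample
print(summarise_markers([99, 99, 0, 1, 1, 1, 0, 3, 3, 0, 2, 2]))
```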

In [6]:
####################################################
# CHECKING LABELS FOR FIRST FILE
####################################################

unique, counts = np.unique(first_file_o.marker, return_counts=True)
print("labels present in first file: " + str(dict(zip(unique, counts))))

# Cleaning up redundant variables from this codeblock
del unique
del counts

labels present in first file: {0: 476168, 1: 61490, 2: 69202, 3: 64740}
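
These counts are numbers of samples, not numbers of trials, because `marker` holds one value per EEG sample. A minimal sketch of how trial onsets could be recovered, assuming that a trial is a contiguous run of identical non-zero codes; the helper name `marker_onsets` is hypothetical and the example sequence is made up.

```python
import numpy as np


def marker_onsets(marker):
    """Return (sample_index, code) pairs where a non-zero marker run starts."""
    marker = np.asarray(marker)
    # Prepend a zero so a run starting at sample 0 is also detected
    previous = np.concatenate(([0], marker[:-1]))
    starts = np.flatnonzero((marker != previous) & (marker != 0))
    return [(int(i), int(marker[i])) for i in starts]


# Made-up marker sequence: a left-hand run, a blank, then a right-hand run
demo_marker = [0, 0, 1, 1, 1, 0, 0, 2, 2, 0]
print(marker_onsets(demo_marker))
```

Dividing a sample index by the sampling frequency gives the onset time in seconds.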


We know from the paper by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211) that:

> CLA-SubjectF-150916-3St-LRHand is one of our earliest BCI experiments and for that reason contains only two BCI signals - left and right hand movements, without passive imagery.

To check which files contain all three signals, we print the unique marker counts for every file.

In [7]:
####################################################
# CHECKING LABELS FOR ALL FILES
####################################################

for filename in filenames:
    # Load in file o structure
    file = scipy.io.loadmat(filename, struct_as_record=False, squeeze_me=True)['o']
    
    # Print count
    unique, counts = np.unique(file.marker, return_counts=True)
    print(f"labels present in {filename}: " + str(dict(zip(unique, counts))))
    
# Cleaning up redundant variables from this codeblock
del unique
del counts
del file
del filename

labels present in ../data/CLA/CLASubjectA1601083StLRHand.mat: {0: 476168, 1: 61490, 2: 69202, 3: 64740}
labels present in ../data/CLA/CLASubjectB1510193StLRHand.mat: {0: 534120, 1: 65596, 2: 63935, 3: 65749}
labels present in ../data/CLA/CLASubjectB1510203StLRHand.mat: {0: 472267, 1: 65511, 2: 63892, 3: 65730}
labels present in ../data/CLA/CLASubjectB1512153StLRHand.mat: {0: 472328, 1: 65574, 2: 63953, 3: 65945}
labels present in ../data/CLA/CLASubjectC1511263StLRHand.mat: {0: 471330, 1: 65585, 2: 63945, 3: 65940}
labels present in ../data/CLA/CLASubjectC1512163StLRHand.mat: {0: 485680, 1: 65646, 2: 63928, 3: 65946}
labels present in ../data/CLA/CLASubjectC1512233StLRHand.mat: {0: 473299, 1: 65373, 2: 63931, 3: 65971, 99: 826}
labels present in ../data/CLA/CLASubjectD1511253StLRHand.mat: {0: 472424, 1: 65620, 2: 63757, 3: 65799}
labels present in ../data/CLA/CLASubjectE1512253StLRHand.mat: {0: 468751, 1: 65352, 2: 63946, 3: 65951}
labels present in ../data/CLA/CLASubjectE1601193StLRHan

We notice that `CLASubjectF1509163StLRHand.mat` indeed does not contain the `3` marker.
We also notice that some files contain the `91`, `92` and `99` markers whilst others don't.
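
The same check can be automated by comparing the codes found in a file against the three classes of interest. The sketch below is an illustration only: `missing_classes` is a hypothetical helper, the counts for subject A are copied from the output above, and the two-class counts are made up.

```python
REQUIRED_CODES = {1, 2, 3}  # left hand, right hand, passive


def missing_classes(label_counts):
    """Return the required marker codes that do not occur in `label_counts`."""
    return REQUIRED_CODES - set(label_counts)


# Counts copied from the output above for subject A
subject_a = {0: 476168, 1: 61490, 2: 69202, 3: 64740}
# Hypothetical counts for a two-class recording (values are made up)
two_class = {0: 1000, 1: 100, 2: 100}

print(missing_classes(subject_a))
print(missing_classes(two_class))
```

A file for which the returned set is non-empty lacks at least one of the three classes and can be excluded from a three-class analysis.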

<hr>

### Making MNE object

Having access to all of the data, we can manually make MNE-Python data structures per specification of the [MNE documentation](https://mne.tools/dev/auto_tutorials/simulation/10_array_objs.html). This consists of first making an `MNE info` object, which can then be used to create a `raw MNE` object.

#### Making the info object

From the article by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211) we know that:
> The “data” array describes the measured voltage time-series from 19 EEG leads in 10/20 configuration, two ground leads A1-A2, and one synchronization channel X3, as detailed previously in this document

<center>
<div>
    <img src="../images/10-20.svg" width="300"/>
    <br>
    <small>By トマトン124 (talk) - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=10489987</small>
</div>
</center>

In [8]:
####################################################
# CHECKING CHANNEL NAMES
####################################################

# Check the present channels, and conclude X5 is not in 10/20 convention but remember it is the synchronization channel
first_file_o.chnames

array(['Fp1', 'Fp2', 'F3', 'F4', 'C3', 'C4', 'P3', 'P4', 'O1', 'O2', 'A1',
       'A2', 'F7', 'F8', 'T3', 'T4', 'T5', 'T6', 'Fz', 'Cz', 'Pz', 'X5'],
      dtype=object)
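
Following the description in the article, the 22 channel names can be split into EEG leads, ground leads and the synchronization channel. A minimal sketch, assuming that `X5` is the synchronization channel in this file; the constant names are hypothetical.

```python
ch_names = ['Fp1', 'Fp2', 'F3', 'F4', 'C3', 'C4', 'P3', 'P4', 'O1', 'O2', 'A1',
            'A2', 'F7', 'F8', 'T3', 'T4', 'T5', 'T6', 'Fz', 'Cz', 'Pz', 'X5']

GROUND_LEADS = {'A1', 'A2'}  # ground leads according to the article
SYNC_CHANNELS = {'X5'}       # assumption: synchronization channel in this file

eeg_leads = [ch for ch in ch_names
             if ch not in GROUND_LEADS and ch not in SYNC_CHANNELS]
ground_leads = [ch for ch in ch_names if ch in GROUND_LEADS]
sync_channels = [ch for ch in ch_names if ch in SYNC_CHANNELS]

print(len(eeg_leads), "EEG leads:", eeg_leads)
print(len(ground_leads), "ground leads:", ground_leads)
print(len(sync_channels), "sync channel:", sync_channels)
```

The split yields 19 EEG leads, which agrees with the count given in the article.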

Using the data we have at our disposal from the article by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211), we make a list of metadata entries, one per CLA file, each containing:

- File location: String
- Subject ID: String
- Date of recording: Date
- Sex of subject: String
- Age of subject (highest value of age range): Int
- Health condition: String
- Prior BCI experience: Bool
- BCI literacy level: String

In [9]:
####################################################
# META DATA FOR ALL CLA DATA FILES
####################################################

# Class to store data information
class MneDataInfo:
    def __init__(self, file_location, subject, date_of_recording, sex, age, health_condition, prior_bci_experience, bci_literacy):
        self.file_location = file_location
        self.subject = subject
        self.date_of_recording = date_of_recording
        self.sex = sex
        self.age = age
        self.health_condition = health_condition
        self.prior_bci_experience= prior_bci_experience
        self.bci_literacy = bci_literacy
    
# Create data information for each useful record
meta_data_list = [
    # CLASubjectA1601083StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectA1601083StLRHand.mat",
                subject= "A",
                date_of_recording= datetime.datetime(2016, 1, 8, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 25,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - High"),
    
    # CLASubjectB1510193StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectB1510193StLRHand.mat",
                subject= "B",
                date_of_recording= datetime.datetime(2015, 10, 19, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 25,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
    
    # CLASubjectB1510203StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectB1510203StLRHand.mat",
                subject= "B",
                date_of_recording= datetime.datetime(2015, 10, 20, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 25,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
    
    # CLASubjectB1512153StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectB1512153StLRHand.mat",
                subject= "B",
                date_of_recording= datetime.datetime(2015, 12, 15, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 25,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
    
    # CLASubjectC1511263StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectC1511263StLRHand.mat",
                subject= "C",
                date_of_recording= datetime.datetime(2015, 11, 26, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 30,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - High"),
    
    # CLASubjectC1512163StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectC1512163StLRHand.mat",
                subject= "C",
                date_of_recording= datetime.datetime(2015, 12, 16, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 30,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - High"),
    
    # CLASubjectC1512233StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectC1512233StLRHand.mat",
                subject= "C",
                date_of_recording= datetime.datetime(2015, 12, 23, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 30,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - High"),
    
    # CLASubjectD1511253StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectD1511253StLRHand.mat",
                subject= "D",
                date_of_recording= datetime.datetime(2015, 11, 25, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 30,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
    
    # CLASubjectE1512253StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectE1512253StLRHand.mat",
                subject= "E",
                date_of_recording= datetime.datetime(2015, 12, 25, 0, 0, 0, 0, pytz.UTC),
                sex= "F",
                age= 25,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
    
    # CLASubjectE1601193StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectE1601193StLRHand.mat",
                subject= "E",
                date_of_recording= datetime.datetime(2016, 1, 19, 0, 0, 0, 0, pytz.UTC),
                sex= "F",
                age= 25,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
    
    # CLASubjectE1601223StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectE1601223StLRHand.mat",
                subject= "E",
                date_of_recording= datetime.datetime(2016, 1, 22, 0, 0, 0, 0, pytz.UTC),
                sex= "F",
                age= 25,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
    
    # CLASubjectF1509163StLRHand.mat - removed due to only 2 recorded signals
    
    # CLASubjectF1509173StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectF1509173StLRHand.mat",
                subject= "F",
                date_of_recording= datetime.datetime(2015, 9, 17, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 35,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
    
    # CLASubjectF1509283StLRHand.mat
    MneDataInfo(file_location= "../data/CLA/CLASubjectF1509283StLRHand.mat",
                subject= "F",
                date_of_recording= datetime.datetime(2015, 9, 28, 0, 0, 0, 0, pytz.UTC),
                sex= "M",
                age= 35,
                health_condition= "Healthy",
                prior_bci_experience = False,
                bci_literacy= "Intermediate - Low"),
]

print("... LOADED MNE META DATA FOR ALL CLA DATASETS ...")

... LOADED MNE META DATA FOR ALL CLA DATASETS ...
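
The subject letter and recording date in the metadata above also appear in each file name, so they can be cross-checked automatically. The sketch below assumes the naming pattern `CLASubject<letter><YYMMDD>3StLRHand.mat`, which is inferred from the file names used in this notebook; `parse_cla_filename` is a hypothetical helper.

```python
import datetime
import re

# Assumed naming pattern: CLASubject<letter><YYMMDD>3StLRHand.mat
CLA_PATTERN = re.compile(r"CLASubject([A-Z])(\d{2})(\d{2})(\d{2})3StLRHand\.mat$")


def parse_cla_filename(file_location):
    """Extract (subject, recording date in UTC) from a CLA file name."""
    match = CLA_PATTERN.search(file_location)
    if match is None:
        raise ValueError(f"Not a CLA file name: {file_location}")
    subject, yy, mm, dd = match.groups()
    date = datetime.datetime(2000 + int(yy), int(mm), int(dd),
                             tzinfo=datetime.timezone.utc)
    return subject, date


print(parse_cla_filename("../data/CLA/CLASubjectA1601083StLRHand.mat"))
```

Comparing the parsed values with `subject` and `date_of_recording` of each metadata entry would catch copy-paste mistakes in the hand-written list.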


We now create MNE-Python data structures from scratch, based upon the [MNE documentation](https://mne.tools/stable/auto_tutorials/simulation/10_array_objs.html).

In [10]:
####################################################
# CREATE MNE RAW OBJECTS
####################################################

def create_mne_raw(data_info: MneDataInfo):
    # Get file as matlab file
    file = scipy.io.loadmat(data_info.file_location,
                            struct_as_record=False,
                            squeeze_me=True)
    
    # Keep only the matlab "o" structure
    file = file["o"]
    
    # Get channel names but drop last sync channel
    ch_names = file.chnames[:-1].tolist()
    
    # All channels are EEG
    ch_types = ["eeg"] * len(ch_names)
    
    # Sample frequency is stored
    sfreq = file.sampFreq
    
    # Create an MNE info object
    mne_info = mne.create_info(ch_names= ch_names,
                               ch_types= ch_types,
                               sfreq= sfreq)
    
    # Update MNE info object further    
    mne_info['description'] = f"Data from {data_info.file_location} | age {data_info.age} | health {data_info.health_condition} | prior BCI experience {data_info.prior_bci_experience} | BCI literacy {data_info.bci_literacy}"
    
    mne_info.set_montage('standard_1020')
    
    mne_info['experimenter'] = 'Kaya et al.'
    
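    # Map the sex letter to the MNE code: 0 = unknown, 1 = male, 2 = female
    # Compare case-insensitively, since the metadata above stores "M" and "F"
    if (data_info.sex.lower() == "m"):
        sex = 1
    elif (data_info.sex.lower() == "f"):
        sex = 2
    else:
        sex = 0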
    
    mne_info['subject_info'] = {
        "his_id": data_info.subject,
        "sex": sex
    }
    
    # Get data in correct format
    data = file.data # RAW data
    data = data[:, :-1] # Remove last sync channel
    data = data.transpose() # Right order
    data = data * 1e-6  # From microvolts to volts (note: 10e-6 would be 1e-5)
    
    # Create raw object
    mne_raw = mne.io.RawArray(data, mne_info)
    
    # Set date
    mne_raw.set_meas_date(data_info.date_of_recording.astimezone(datetime.timezone.utc))
    
    # Return created MNE raw object
    return mne_raw

first_raw_mne = create_mne_raw(meta_data_list[0])
first_raw_mne
    

Creating RawArray with float64 data, n_channels=21, n_times=671600
    Range : 0 ... 671599 =      0.000 ...  3357.995 secs
Ready.


| Field | Value |
|---|---|
| Measurement date | January 08, 2016 00:00:00 GMT |
| Experimenter | Kaya et al. |
| Digitized points | 0 points |
| Good channels | 21 EEG |
| Bad channels | |
| EOG channels | Not available |
| ECG channels | Not available |
| Sampling frequency | 200.00 Hz |
| Highpass | 0.00 Hz |
| Lowpass | 100.00 Hz |
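
Two details of the data preparation are easy to get wrong: `mne.io.RawArray` expects an array of shape `(n_channels, n_times)`, and MNE stores EEG values in volts. The sketch below shows both steps on a made-up array using only NumPy; the helper name `to_mne_layout` is hypothetical. Note that the Python literal `10e-6` equals `1e-5`, so the microvolt factor is written as `1e-6` here.

```python
import numpy as np


def to_mne_layout(data_uv, drop_last_channel=True):
    """Convert (n_samples, n_channels) microvolt data to (n_channels, n_samples) volts."""
    data_uv = np.asarray(data_uv, dtype=float)
    if drop_last_channel:
        data_uv = data_uv[:, :-1]   # remove the trailing synchronization channel
    return data_uv.T * 1e-6         # 1 microvolt = 1e-6 volt


# Made-up recording: 5 samples, 22 channels, every value 10 microvolts
demo = np.full((5, 22), 10.0)
converted = to_mne_layout(demo)
print(converted.shape)
```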


<hr>

### Storing MNE objects for all CLA data

We now convert each CLA MATLAB file to an MNE file (FIF) so it can be reused later.

In [11]:
####################################################
# CREATE FIF FILES FOR ALL CLA DATASET
####################################################

for meta_data in meta_data_list:
    mne_file = create_mne_raw(meta_data)
    filename = meta_data.file_location[:-4] + "_raw.fif"
    mne_file.save(filename, overwrite=True)
    
# Clear unused variables
del meta_data
del mne_file
del filename

Creating RawArray with float64 data, n_channels=21, n_times=671600
    Range : 0 ... 671599 =      0.000 ...  3357.995 secs
Ready.
Overwriting existing file.
Writing /Users/lennertbontinck/Documents/GitHub/VUB-BCI-thesis/code/experimental-notebooks/../data/CLA/CLASubjectA1601083StLRHand_raw.fif
Closing /Users/lennertbontinck/Documents/GitHub/VUB-BCI-thesis/code/experimental-notebooks/../data/CLA/CLASubjectA1601083StLRHand_raw.fif
[done]
Creating RawArray with float64 data, n_channels=21, n_times=729400
    Range : 0 ... 729399 =      0.000 ...  3646.995 secs
Ready.
Overwriting existing file.
Writing /Users/lennertbontinck/Documents/GitHub/VUB-BCI-thesis/code/experimental-notebooks/../data/CLA/CLASubjectB1510193StLRHand_raw.fif
Closing /Users/lennertbontinck/Documents/GitHub/VUB-BCI-thesis/code/experimental-notebooks/../data/CLA/CLASubjectB1510193StLRHand_raw.fif
[done]
Creating RawArray with float64 data, n_channels=21, n_times=667400
    Range : 0 ... 667399 =      0.000 ...  3336.995
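
The output file name above is built by slicing off the last four characters of the `.mat` path. A slightly more defensive variant with `pathlib` is sketched below; `fif_path_for` is a hypothetical helper, and it keeps the `_raw.fif` suffix used in the cell above.

```python
from pathlib import Path


def fif_path_for(mat_location):
    """Return the `_raw.fif` path next to a `.mat` file, checking the extension."""
    mat_path = Path(mat_location)
    if mat_path.suffix != ".mat":
        raise ValueError(f"Expected a .mat file, got: {mat_location}")
    return mat_path.with_name(mat_path.stem + "_raw.fif")


print(fif_path_for("../data/CLA/CLASubjectA1601083StLRHand.mat"))
```

Checking the extension first avoids silently producing a wrong name if a path with a different suffix ends up in the metadata list.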

In [12]:
####################################################
# CLEAR VARIABLES
####################################################

# End of notebook, clear all variables
del filenames
del first_file_o
del first_raw_mne
del meta_data_list
