# Initial EEG data exploration

This notebook performs an initial exploration of the EEG data. It is not directly linked to the paper; it was used to become familiar with the data.

The GitHub repository of this project is available [here](https://www.github.com/pikawika/bci-master-thesis). We use the dataset provided by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211), which can be downloaded from [here](https://doi.org/10.6084/m9.figshare.c.3917698.v1). Some of the code is inspired by other projects that use this dataset, such as:

- [MotorImageryPreprocessing by zafeiriou-arg](https://github.com/zafeiriou-arg/MotorImageryPreprocessing)
- [Motor Imagery Classification by sauriii98](https://github.com/sauriii98/Motor-Imagery-Classification)

## Table of Contents

- Checking requirements
    - Correct anaconda environment
    - Correct module access
    - Correct file access
- Loading in data
    - Exploring data structure
    - Classification labels

## Checking requirements

### Correct anaconda environment

The `bci-master-thesis` anaconda environment should be active to ensure proper support. Installation instructions are available on the GitHub repository of this project.

In [1]:
import os
print("Active environment: "+ os.environ['CONDA_DEFAULT_ENV'])
print("Correct environment: " + str(os.environ['CONDA_DEFAULT_ENV'] == "bci-master-thesis"))

Active environment: bci-master-thesis
Correct environment: True
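
Indexing `os.environ` directly raises a `KeyError` when the notebook is started outside a conda environment, because `CONDA_DEFAULT_ENV` is then not set. A minimal sketch of a more forgiving check follows; the helper name `check_conda_env` and its `environ` parameter are illustrative assumptions, not part of the project code.

```python
import os


def check_conda_env(expected, environ=None):
    """Return (active_env, is_correct) without raising when conda is inactive."""
    # `environ` is a hypothetical parameter that makes the helper easy to test;
    # by default the real process environment is used.
    environ = os.environ if environ is None else environ
    active = environ.get("CONDA_DEFAULT_ENV")  # None when no conda env is active
    return active, active == expected


active_env, is_correct = check_conda_env("bci-master-thesis")
print("Active environment: " + str(active_env))
print("Correct environment: " + str(is_correct))
```

Outside conda this prints `None` and `False` instead of stopping with a traceback.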


### Correct module access

The following code block loads all required modules.

In [2]:
# Performs IO operations
import os
import scipy.io

# MNE package for processing of EEG data
import mne

# Data manipulation modules
import scipy
import numpy as np
import pandas as pd
import math

# Plotting modules
import matplotlib.pyplot as plt

# Other helpful modules
import itertools
import random

# from mne.io import RawArray
# import scipy.io
# from mne.filter import filter_data
# from scipy.stats import pearsonr
# from math import sqrt
# from scipy import signal
# from scipy.signal import butter, lfilter
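
If one of the imports fails, it can be useful to see all missing packages at once instead of stopping at the first `ImportError`. The sketch below uses the standard-library function `importlib.util.find_spec`; the helper name `find_missing_modules` is a hypothetical choice, and the list of names is simply taken from the third-party imports in this notebook.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Top-level third-party packages imported in this notebook
required = ["mne", "scipy", "numpy", "pandas", "matplotlib"]
missing = find_missing_modules(required)
print("Missing modules: " + (", ".join(missing) if missing else "none"))
```

This only checks that the packages can be located; it does not verify their versions.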

### Correct file access

The dataset provided by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211) is used. It can be downloaded from [here](https://doi.org/10.6084/m9.figshare.c.3917698.v1). The CLA variants are used, and all 16 files should be saved in a `data` subfolder next to this notebook. Only a few of them are needed for this notebook; their filenames are given in the list `filenames`.

In [3]:
# You can specify the data directory here; by default it is the subfolder data relative to this file's location
data_directory = r'data/'

In [4]:
# Files needed for this notebook
filenames = ["CLA-SubjectJ-170504-3St-LRHand-Inter.mat",
            "CLA-SubjectJ-170508-3St-LRHand-Inter.mat",
            "CLA-SubjectJ-170510-3St-LRHand-Inter.mat",
            "CLASubjectA1601083StLRHand.mat",
            "CLASubjectB1510193StLRHand.mat",
            "CLASubjectB1512153StLRHand.mat"]

# Check if all files are available, if not display file name of missing file
all_files_available = True

for filename in filenames:
    if not os.path.isfile(data_directory + filename):
        print(data_directory + filename + " not available!")
        all_files_available = False

# Display success message if all files are available
if all_files_available:
    print("All files are available")
    
# Cleaning up redundant variables from this codeblock
del all_files_available
del filenames
del filename
    


All files are available


## Loading in data

In this step we load the data. The data is provided as `.mat` files and was thus originally meant for use with MATLAB. However, SciPy can read these files in Python as well: `scipy.io.loadmat` returns the contents of a `.mat` file as a dictionary. From the article by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211) we know:

> The data in each file are represented as an instance of a Matlab structure named “o,” having the following key fields “id,” “nS,” “sampFreq,” “marker” and “data”.

In [5]:
# You can specify the data file to use here; by default it is CLASubjectA1601083StLRHand.mat
data_file_name = r'CLASubjectA1601083StLRHand.mat'

In [6]:
# Load in the file
data_raw_full = scipy.io.loadmat(data_directory + data_file_name, struct_as_record=False, squeeze_me=True)

# show keys of the dictionary
print(data_raw_full.keys())

# The data is stored inside the matlab structure named "o"
data_raw = data_raw_full['o']

# Cleaning up redundant variables from this codeblock
del data_raw_full

dict_keys(['__header__', '__version__', '__globals__', 'o'])
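
The two `loadmat` options used here change how the MATLAB structure is exposed in Python. The sketch below illustrates this on a tiny, made-up structure that is written to an in-memory buffer and read back; the field values are invented for illustration and only mimic the field names of the real files.

```python
import io

import numpy as np
import scipy.io

# Build a tiny stand-in for the structure "o" (values are made up for illustration)
toy = {"o": {"id": "toy-record",
             "nS": 4,
             "sampFreq": 200,
             "marker": np.array([0, 1, 1, 0]),
             "data": np.zeros((4, 2))}}

# Write the .mat content to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
scipy.io.savemat(buffer, toy)

# Default loading: the struct arrives as a 1x1 structured array,
# so nested indexing is needed to reach a field
buffer.seek(0)
plain = scipy.io.loadmat(buffer)
print(plain["o"].shape)
print(plain["o"][0, 0]["sampFreq"])

# Loading as in this notebook: fields become attributes, singleton dimensions are dropped
buffer.seek(0)
o = scipy.io.loadmat(buffer, struct_as_record=False, squeeze_me=True)["o"]
print(o.id, o.sampFreq, o.marker.shape, o.data.shape)
```

With `struct_as_record=False` and `squeeze_me=True`, fields can be read as `o.sampFreq` rather than through nested indexing, which is what allows the MATLAB-like access used in the rest of this notebook.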


### Exploring data structure

The following data is available:

- id: A unique alphanumeric identifier of the record
- tag: Unknown field
   - Not specified in the article
- binsuV: Probably bins per microvolt
   - Not specified in the article
- nS: Number of EEG data samples
- sampFreq: Sampling frequency of the EEG data
- marker: The eGUI interaction record of the recording session
- chnames: Probably the names of the EEG channels in the 10/20 configuration
   - Not specified in the article
- data: The raw EEG data of the recording session

In [7]:
# We can now access the data in a MATLAB like fashion thanks to the configuration of loadmat
print("id: " + str(data_raw.id))
print()
print("tag: " + str(data_raw.tag))
print()
print("binsuV: " + str(data_raw.binsuV))
print()
print("nS: " + str(data_raw.nS))
print()
print("sampFreq: " + str(data_raw.sampFreq))
print()
print("marker: " + str(data_raw.marker))
print("marker shape: " + str(data_raw.marker.shape))
print()
print("chnames: " + str(data_raw.chnames))
print("chnames shape: " + str(data_raw.chnames.shape))
print()
print("data: " + str(data_raw.data))
print("data shape: " + str(data_raw.data.shape))

id: 201601081851.951FEF1D

tag: NK-data import (auto)

binsuV: 1

nS: 671600

sampFreq: 200

marker: [0 0 0 ... 0 0 0]
marker shape: (671600,)

chnames: ['Fp1' 'Fp2' 'F3' 'F4' 'C3' 'C4' 'P3' 'P4' 'O1' 'O2' 'A1' 'A2' 'F7' 'F8'
 'T3' 'T4' 'T5' 'T6' 'Fz' 'Cz' 'Pz' 'X5']
chnames shape: (22,)

data: [[ -0.    -0.    -0.   ...  -0.    -0.    -0.  ]
 [ -0.    -0.    -0.   ...  -0.    -0.    -0.  ]
 [ -0.    -0.    -0.   ...  -0.    -0.    -0.  ]
 ...
 [ 23.8  -28.4    4.31 ...  -8.31  -6.    -0.23]
 [ 10.74 -37.39   5.51 ...  -9.34  -5.99  -0.16]
 [  0.76 -47.95   3.66 ...  -7.32  -4.9   -0.41]]
data shape: (671600, 22)


In [8]:
# If all is correct, both of the following checks should print True
print("Number of channel names corresponds with number of channels available (" + str(data_raw.data.shape[1]) + "): " + str(data_raw.chnames.size == data_raw.data.shape[1]))

print("Number of samples corresponds with number of data records (" + str(data_raw.data.shape[0]) + "): " + str(data_raw.nS == data_raw.data.shape[0]))

Number of channel names corresponds with number of channels available (22): True
Number of samples corresponds with number of data records (671600): True
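
The fields `nS` and `sampFreq` together give the length of the recording: the duration in seconds is the number of samples divided by the sampling frequency. A small sketch using the values printed for this file follows; the helper name `recording_duration_seconds` is an assumption for illustration, and the sampling frequency is assumed to be expressed in Hz.

```python
def recording_duration_seconds(n_samples, samp_freq):
    """Duration of a recording in seconds, given sample count and sampling rate in Hz."""
    return n_samples / samp_freq


# Values printed for CLASubjectA1601083StLRHand.mat (nS and sampFreq)
duration_s = recording_duration_seconds(671600, 200)
print("Duration: " + str(duration_s) + " s (about " + str(round(duration_s / 60, 1)) + " min)")
```

For this file that comes to 3358 seconds, or roughly 56 minutes of recording.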


### Classification labels

From the article by [Kaya et al.](https://doi.org/10.1038/sdata.2018.211) we know:

>The “marker” field contains the recording sessions’ interaction record. This record is in the form of 1D
Matlab array of size nSx1, which contains integer values from 0 to 99. Each value encodes the state of the
eGUI at the time mapping to the corresponding EEG data sample in the “data” array at the same timeindex location.

We see the following codes in the CLA datasets:

- 0: “blank”, nothing is displayed in the eGUI
    - Can be seen as a break between stimuli; the EEG data here is not tied to a task and should probably be ignored
- 1: Left hand
    - EEG data for motor imagery (MI) of the left hand
- 2: Right hand
    - EEG data for MI of the right hand
- 3: Passive/neutral
    - EEG data for MI of neither the left nor the right hand

In [9]:
unique, counts = np.unique(data_raw.marker, return_counts=True)
dict(zip(unique, counts))

{0: 476168, 1: 61490, 2: 69202, 3: 64740}
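
Because every sample carries its own marker value, these counts are sample counts, and a single eGUI state shows up as a run of consecutive identical markers. The sketch below converts the counts into seconds and splits a marker sequence into such runs with `itertools.groupby`. The helper name `marker_runs` and the toy marker sequence are made up for illustration; on the real data one would pass `data_raw.marker` instead.

```python
import itertools

SAMP_FREQ = 200  # Hz, as reported by data_raw.sampFreq for this file

# Sample counts per marker value, copied from the output of np.unique
counts = {0: 476168, 1: 61490, 2: 69202, 3: 64740}
seconds = {code: n / SAMP_FREQ for code, n in counts.items()}
print(seconds)


def marker_runs(markers):
    """Return (code, start_index, length) for each run of identical marker values."""
    runs = []
    start = 0
    for code, group in itertools.groupby(markers):
        length = sum(1 for _ in group)  # number of samples in this run
        runs.append((int(code), start, length))
        start += length
    return runs


# Toy sequence: blank, left hand, blank, passive
toy_marker = [0, 0, 1, 1, 1, 0, 3, 3]
print(marker_runs(toy_marker))
```

Keeping only the runs with codes 1 to 3 would give candidate trial windows. Whether each run corresponds to exactly one trial of the paradigm is an assumption that should be checked against the article.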