We provide the following data to support experimentation with signal representation and classification techniques. Please acknowledge us in any communications of your work involving our data.

We have worked primarily with EEG data recorded by Zak Keirn at Purdue University for his work on his Masters of Science thesis in the Electrical Engineering Department at Purdue. We make that data available here as a 23 MB, binary Matlab mat-file. After downloading this file, load it into matlab using load eegdata. You should then have these two variables defined:

>> whos

Name     |   Size    |    Bytes | Class
---------|-----------|----------|-----------
data     |   1x325   | 22917020 | cell array
readme   |   1x1379  |     2758 | char array

Grand total is 5699264 elements using 22919778 bytes
The variable readme is a string containing the following explanation:

data is a cell array of cell arrays. Each individual cell array is made up of a subject string, task string, trial string, and data array. Each data array is 7 rows by 2500 columns. The 7 rows correspond to channels c3, c4, p3, p4, o1, o2, and EOG. Across columns are samples taken at 250 Hz for 10 seconds, for 2500 samples. For example, the first cell array looks like 'subject 1' 'baseline' 'trial 1' [7x2500 single]. Recordings were made with reference to electrically linked mastoids A1 and A2. EOG was recorded between the forehead above the left browline and another on the left cheekbone. Recording was performed with a bank of Grass 7P511 amplifiers whose bandpass analog filters were set at 0.1 to 100 Hz. data 1 and 2 were employees of a university and were left-handed age 48 and right-handed age 39, respectively. data 3 through 7 were right-handed college students between the age of 20 and 30 years old. All were mail data with the exception of Subject 5. data performed five trials of each task in one day. They returned to do a second five trials on another day. data 2 and 7 completed only one 5-trial session. Subject 5 completed three sessions. For more information see Alternative Modes of Communication Between Man and Machine, Zachary A. Keirn, Masters Thesis in Electrical Engineering, Purdue University, December, 1988.

Here is a file named makesubset.m that will extract five 7x2500 matrices for Subject 1, Trial 1 of each of the five tasks and plots them. This can be run only after loading the eegdata.mat file.

If you don't have access to Matlab, or want more data, here is a 12.8 MB file named alleegdata.ascii.gz that contains data for all data. Values are stored with a precision of three decimal places to save space.

In [2]:
import numpy as np
import pandas as pd

import shutil

from os import listdir, mkdir

import re

In [7]:


# The file (alleegdata.ascii) has the following contents:
# Line 1: "subject 1, baseline, trial 1"
# Line 2: 2500 samples of channel c3
# Line 3: 2500 samples of channel c4
# ...
# Line 8: 2500 samples of channel EOG
# Line 9: "subject 1, baseline, trial 2"
# ...
# Line 56: "subject 1, baseline, trial 5"
# Line 57: "subject 1, task 1, trial 1"
# ...
# And so on

# separate each trial 
with open('data/raw/alleegdata.ascii') as f:
    lines = f.read().split("\n\n")

# we only want resting EEG, so we use the baseline
data = [line for line in lines if "baseline" in line]

subjects_data = []
subjects = []

past = ""

# clear df
df = None

for trial in data:
    rows = trial.split("\n")
    head = rows[0].strip().split(",")
    rows.pop(0)
    
    for i in range(0, len(rows)):
        rows[i] = np.array(rows[i].strip().split(" "), dtype=float)
        
    rows.append(np.repeat(head[2], len(rows[0])))

    if len(rows) == 8:
        df_ = pd.DataFrame(rows).transpose()
        if past == head[0] and df is not None:
            # concat the dataframes
            df = pd.concat([df, df_], axis=0)
        else:
            # Rename the columns
            subjects.append(head[0])
            if df is not None:
                df.columns = ['c3', 'c4', 'p3', 'p4', 'o1', 'o2', 'eog', 'trial']
                # Add oz, which is the normalized sum of o1 and o2
                df['oz'] = (df['o1'] + df['o2']) / 2
                subjects_data.append(df)
            # append the dataframe
            df = df_

        # keep track of the subject
        past = head[0]

# add the last subject
subjects.append(past[0])

In [8]:
# Save the dataframes
for subject in subjects_data:
    # save each task separately
    trials = subject['trial'].unique()
    columns = ['o1', 'o2', 'oz']
    for trial, i in zip(trials, subjects):
        subject[subject['trial'] == trial][columns].to_csv('data/processed/' + i.replace("subject", 's').replace(" ", "") + '_' + trial.replace(" ", "") + '.csv', index=False)


## Final processing

In [9]:
# Processed files at: data/processed/*.csv
# Read all the files, and create 4 second (1000 samples) windows for each one, use 0.5s of overlap

path = "data/processed/"
files = listdir(path)

fs = 250

w_len = 4 * fs

overlap = int(0.5 * fs)

for file in files:
    df = pd.read_csv(path + file)
    # 4 seconds of data, 0.5s of overlap
    for i in range(0, len(df), w_len - overlap):
        if i + w_len < len(df):
            df_ = df[i:i+w_len]
            out = df_.to_numpy()
            # save the file (npy)
            np.save('data/final/0/' + file.replace(".csv", "") + '_' + str(i) + '.npy', out)
            

## Train test split

In [4]:
def is_perfect_score_subject(file: str, subjects: list):
    # re.match(f"^{subject}_.\.npy$", file)
    for subject in subjects:
        if re.match(f"^S{subject}_.\.npy$", file) is not None:
            return True
        
    return False

In [5]:
# Data is on data/final/*/*.npy
# Each folder represents a label (data/final/0, data/final/1, data/final/2, data/final/3 ...)

# Find the folder with the least amount of data
files = listdir("data/staged/")


src_path = "data/staged/"
dst_path_train = "data/train/"
dst_path_test  = "data/val/"

perfect_score_subjects = [1, 2, 3, 5, 6, 7, 8, 10, 15, 17, 20, 22, 25, 27, 28, 30, 32, 34, 35]

for file in files:
    print(file)
    files_ = listdir(src_path + file)

    # create folders on train and test paths if they don't exist
    if not file in listdir(dst_path_train):
        mkdir(dst_path_train + file)
    if not file in listdir(dst_path_test):
        mkdir(dst_path_test + file)

    N = len(files_)

    for i in range(N):
        if is_perfect_score_subject(files_[i], perfect_score_subjects):
            print(f"perfect score subject: {files_[i]}")
            # copy the file to the train folder
            shutil.copy(src_path + file + "/" + files_[i], dst_path_train + file + "/" + files_[i])
        else:
            # copy the file to the test folder
            shutil.copy(src_path + file + "/" + files_[i], dst_path_test + file + "/" + files_[i])

0
perfect score subject: S10_0.npy
perfect score subject: S10_1.npy
perfect score subject: S10_2.npy
perfect score subject: S10_3.npy
perfect score subject: S10_4.npy
perfect score subject: S10_5.npy
perfect score subject: S10_6.npy
perfect score subject: S15_0.npy
perfect score subject: S15_1.npy
perfect score subject: S15_2.npy
perfect score subject: S15_3.npy
perfect score subject: S15_4.npy
perfect score subject: S15_5.npy
perfect score subject: S15_6.npy
perfect score subject: S17_0.npy
perfect score subject: S17_1.npy
perfect score subject: S17_2.npy
perfect score subject: S17_3.npy
perfect score subject: S17_4.npy
perfect score subject: S17_5.npy
perfect score subject: S17_6.npy
perfect score subject: S1_0.npy
perfect score subject: S1_1.npy
perfect score subject: S1_2.npy
perfect score subject: S1_3.npy
perfect score subject: S1_4.npy
perfect score subject: S1_5.npy
perfect score subject: S1_6.npy
perfect score subject: S20_0.npy
perfect score subject: S20_1.npy
perfect score s

In [3]:
files = listdir("data/final/")
min_ = 10000000
for file in files:
    print(f"FIle: {file}, has: {len(listdir('data/final/' + file))}")

FIle: 0, has: 16
FIle: 1, has: 245
FIle: 2, has: 245
