<h2><center> EEG - N. 9 </center></h2>

<h3><center> MANU 465 101 </center></h3>

## Dataset
The dataset was constructed from approximately 90 test subjects. Each subject was asked to perform 4 tasks:
- 1) drawing in a circle with left hand
- 2) drawing in a circle with right hand
- 3) writing a sentence with left hand
- 4) writing a sentence with right hand

The data from each of these tasks were saved in a CSV file and labeled as left-hand or right-hand dominant according to the test subject. Four additional features were also added for our analysis:
- 1) Participant ID
- 2) Gender
- 3) Native English speaker
- 4) Left- or right-handed

# Setup phase

### Import libraries

In [11]:
import glob
import pandas as pd
import numpy as np
import scipy.stats as stats

Get the names of all the CSV files in the current working directory.

In [12]:
extension = 'csv'
all_filenames = glob.glob('*.{}'.format(extension))

In [13]:
len(all_filenames)

365
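
`glob` returns matches in arbitrary order, and the notebook later writes `final_df.csv` into the same folder, so a second run would pick up that output file as if it were a recording. The sketch below is one possible way to guard against both; the helper name `list_input_files`, its `exclude` parameter, and the file names in the demo are assumptions made for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path

def list_input_files(folder=".", exclude=("final_df.csv",)):
    """Return the CSV file names in `folder`, sorted, skipping excluded names."""
    folder = Path(folder)
    return sorted(p.name for p in folder.glob("*.csv") if p.name not in exclude)

# Demo on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ["RHS-101.csv", "LHC-101.csv", "final_df.csv", "notes.txt"]:
        (Path(tmp) / fname).touch()
    found = list_input_files(tmp)

print(found)  # ['LHC-101.csv', 'RHS-101.csv']
```

Sorting makes the row order of the final dataset reproducible across machines, at the cost of no longer matching whatever order the file system happens to return.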

# Features Generation

To combine all 365 files into a single dataset, we summarize each CSV file as one row of summary statistics.

Each wave corresponds to 4 columns in each file. For each wave we pool the values of those columns and compute one mean, one standard deviation, and so on. All these values are then stored in the final dataset.

In [14]:
from scipy import stats 
def mean(x):
    return np.mean(x, axis=0)
def std(x):
    return np.std(x, axis=0)
def ptp(x):
    return np.ptp(x, axis=0)
def var(x):
    return np.var(x, axis=0)
def minim(x):
    return np.min(x, axis=0)
def maxim(x):
    return np.max(x, axis=0)
def argminim(x):
    return np.argmin(x, axis=0)
def argmaxim(x):
    return np.argmax(x, axis=0)
def rms(x):
    return np.sqrt(np.mean(x**2, axis=0))
def skewness(x):
    return stats.skew(x,axis=0)
def kurtosis(x):
    return stats.kurtosis(x,axis=0)

def concatenate_features(x):
    """Apply the summary functions defined above to a NumPy array.

    Returns a tuple with one value per function, in this order: mean, std,
    ptp, var, min, max, argmin, argmax, rms, skewness, kurtosis.
    """
    return mean(x),std(x),ptp(x),var(x),minim(x),maxim(x),argminim(x),argmaxim(x),rms(x),skewness(x),kurtosis(x)
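
As a small illustration of what these statistics return, the sketch below applies the same eleven functions to a toy array and keeps each value next to its name. The `FEATURES` dictionary and the `named_features` helper are hypothetical names introduced here; the notebook itself returns a plain tuple and tracks the names in a separate list.

```python
import numpy as np
from scipy import stats

# Hypothetical name -> function mapping; the statistics mirror the cell above.
FEATURES = {
    "mean": np.mean,
    "std": np.std,
    "ptp": np.ptp,
    "var": np.var,
    "minim": np.min,
    "maxim": np.max,
    "argminim": np.argmin,
    "argmaxim": np.argmax,
    "rms": lambda x: np.sqrt(np.mean(x ** 2)),
    "skewness": stats.skew,
    "kurtosis": stats.kurtosis,  # Fisher definition: a normal distribution gives 0
}

def named_features(x):
    """Return a dict with one summary value per feature name for a 1-D array."""
    x = np.asarray(x, dtype=float)
    return {name: float(func(x)) for name, func in FEATURES.items()}

toy = named_features([1.0, 2.0, 3.0, 4.0])
print(toy["mean"], toy["ptp"], toy["argmaxim"])  # 2.5 3.0 3.0
```

Keeping names and functions in one mapping means the column labels cannot drift out of step with the order of the computed values.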

# Import Procedure

## Adding Features

The final dataset should have a few columns describing the participant and the test (Gender, Dominance, Participant, Test, ...), plus 11 columns for each wave (mean, std, and the other statistics defined above, 11 features in total).

Since each participant was tested 4 times, the dimensions of the final dataset will be:
- number of rows = 4 * number of participants = number of CSV files
- number of columns = 11 * 5 (features * number of waves) + 5 fixed columns (personal data) = 60
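
The column count can be checked with a few lines of arithmetic, counting the eleven summary functions defined in the feature-generation cell and the five profile columns the notebook stores. This is only a sketch: `profile_columns`, `feature_columns`, and `expected_columns` are names introduced here for illustration, while the wave and function names are the ones used elsewhere in the notebook.

```python
waves = ["Delta", "Theta", "Alpha", "Beta", "Gamma"]
functions = ["mean", "std", "ptp", "var", "minim", "maxim",
             "argminim", "argmaxim", "rms", "skewness", "kurtosis"]
# Hypothetical list of the fixed, per-recording columns.
profile_columns = ["Test", "English", "Gender", "Participant", "Dominance"]

# One column per (wave, function) pair, named "<wave>_<function>".
feature_columns = [f"{wave}_{function}" for wave in waves for function in functions]
expected_columns = profile_columns + feature_columns

print(len(feature_columns), len(expected_columns))  # 55 60
```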

In [15]:
waves = ["Delta", "Theta", "Alpha", "Beta", "Gamma"] # names of waves we are interested in

test_list = []
dominance_list = []
english_list = []
gender_list = []
participant_list = []

# dictionary to store all the values for one dataframe
final_dic = {}

## Final Raw Dataset Creation

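Each CSV file is read, cleaned, and reduced to one list of summary features per wave. The profile values (Test, Dominance, English, Gender, Participant) are collected in separate lists.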

In [16]:
for name in all_filenames: # file 6 contains string number
    print(name)
    df = pd.read_csv(name)

    df = df.drop(["Elements"], axis=1)
    df = df.dropna() # drop Nan values
    df = df.reset_index(drop=True) # restart index
    
    # Add to lists the values related to patient profile
    test_list.append(df["Test"][0])
    dominance_list.append(df["Dominance"][0])
    english_list.append(df["English"][0])
    gender_list.append(df["Gender"][0])
    participant_list.append(df["Participant"][0])

    # create empty dictionary to collect mean, std, var, ... for each wave
    gen_features = {}
    
    #print(name)
    
    for wave in waves: # for each wave (Delta, Theta, Alpha, Beta, Gamma)
        
        # create empty list to collect all the data in the four columns for a fixed wave
        all_values = []
        
        for col in df.columns: # for each column
            if col.split("_")[0] == wave: # if the wave is in the column name, then:
                
                # to clean data, we delete values equal to the one 4 time points before
                for i in range(4,len(df[col])):
                    if not df[col][i] == df[col][i-4]:
                        all_values.append(df[col][i])
                #if len(all_values)!=0: # if the values are not constant, we want to add also the first 4 values
                #    for i in range(4):
                #        all_values.append(df[col][i])
                for i in range(4):
                    all_values.append(df[col][i])  
                    
        # add a list with the new features associated to the name of the wave in a dictionary
        gen_features[wave] = list( concatenate_features(np.array(all_values)) )
    
    final_dic[name] = gen_features

LHC-101.csv
LHC-102.csv
LHC-103.csv
LHC-105.csv
LHC-106.csv
LHC-107.csv
LHC-108.csv
LHC-109.csv
LHC-110.csv
LHC-111.csv
LHC-112.csv
LHC-113.csv
LHC-114.csv
LHC-115.csv
LHC-116.csv
LHC-117.csv
LHC-118.csv
LHC-119.csv
LHC-120.csv
LHC-121.csv
LHC-122.csv
LHC-123.csv
LHC-124.csv
LHC-126.csv
LHC-128.csv
LHC-129.csv
LHC-130.csv
LHC-131.csv
LHC-132.csv
LHC-133.csv
LHC-134.csv
LHC-135.csv
LHC-136.csv
LHC-137.csv
LHC-138.csv
LHC-139.csv
LHC-207.csv
LHC-209.csv
LHC-210.csv
LHC-211.csv
LHC-212.csv
LHC-213.csv
LHC-214.csv
LHC-215.csv
LHC-216.csv
LHC-217.csv
LHC-218.csv
LHC-219.csv
LHC-220.csv
LHC-221.csv
LHC-222.csv
LHC-223.csv
LHC-224.csv
LHC-225.csv
LHC-226.csv
LHC-227.csv
LHC-228.csv
LHC-229.csv
LHC-230.csv
LHC-231.csv
LHC-232.csv
LHC-233.csv
LHC-234.csv
LHC-235.csv
LHC-301.csv
LHC-302.csv
LHC-303.csv
LHC-304.csv
LHC-305.csv
LHC-306.csv
LHC-307.csv
LHC-308.csv
LHC-309.csv
LHC-310.csv
LHC-311.csv
LHC-312.csv
LHC-313.csv
LHC-314.csv
LHC-315.csv
LHC-316.csv
LHC-317.csv
LHC-318.csv
LHC-319.csv
LHC-
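
The cleaning rule in the loop above (drop every value equal to the one 4 time points earlier, then append the first 4 values at the end) can also be written without an explicit Python loop. The following is a minimal sketch for a single column, assuming the values are plain numbers with no NaN; the function name `drop_repeats_at_lag` and its `lag` parameter are hypothetical.

```python
import numpy as np

def drop_repeats_at_lag(values, lag=4):
    """Drop values equal to the one `lag` steps earlier; append the first `lag` values last."""
    values = np.asarray(values, dtype=float)
    head, tail = values[:lag], values[lag:]
    # tail[k] is values[k + lag]; compare it with values[k].
    keep = tail != values[:-lag]
    # Same ordering as the loop: surviving later values first, then the head.
    return np.concatenate([tail[keep], head])

cleaned = drop_repeats_at_lag([1, 2, 3, 4, 1, 9, 3, 8])
print(cleaned.tolist())  # [9.0, 8.0, 1.0, 2.0, 3.0, 4.0]
```

Note that, as in the loop, the first values end up at the end of the array, so `argmin` and `argmax` refer to positions in this reordered array rather than to original time points. Unlike the loop, this sketch returns inputs shorter than `lag` unchanged.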

The following script assembles the final dataset, one row per CSV file, from the lists and the dictionary built above.

In [17]:
# create an empty dataframe
final_df = pd.DataFrame()

# assign to the column 'Test' of the final df all the values which are in the test_list
final_df["Test"] = test_list
final_df["English"] = english_list
final_df["Gender"] = gender_list
final_df["Participant"] = participant_list
final_df["Dominance"] = dominance_list


functions = ["mean", "std", "ptp","var","minim","maxim","argminim","argmaxim","rms","skewness","kurtosis"] 

for i in range(len(all_filenames)): # i indicates the row (index for each file)
    # Change the class from Left to Dominant or NonDominant
    if final_df.at[i, "Dominance"] == 'Left':
        if final_df.at[i, "Test"] =='LHC' or final_df.at[i, "Test"] =='LHS':
            final_df.at[i, "Dominance"] = 'Dominant'
        else:
            final_df.at[i, "Dominance"] = 'NonDominant'
    else: # change the class from Right to Dominant or NonDominant
        if final_df.at[i, "Test"] =='RHC' or final_df.at[i, "Test"] =='RHS':
            final_df.at[i, "Dominance"] = 'Dominant'
        else:
            final_df.at[i, "Dominance"] = 'NonDominant'
    name = all_filenames[i]
    
    for wave in waves: # for each wave
        for j, function in enumerate(functions): # j is the position of the feature in the list
            # at row i, and column specified by the name of the wave and features
            final_df.at[i, wave + "_" + function] = final_dic[name][wave][j]
            
final_df

| Unnamed: 0 | Test | English | Gender | Participant | Dominance | Delta_mean | Delta_std | Delta_ptp | Delta_var | Delta_minim | ... | Gamma_std | Gamma_ptp | Gamma_var | Gamma_minim | Gamma_maxim | Gamma_argminim | Gamma_argmaxim | Gamma_rms | Gamma_skewness | Gamma_kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LHC | Yes | Female | 101.0 | Dominant | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | -3.000000 |
| 1 | LHC | Yes | Male | 102.0 | Dominant | 0.526442 | 0.294538 | 1.336331 | 0.086752 | 0.000000 | ... | 0.424252 | 1.609542 | 0.179990 | -0.325524 | 1.284018 | 46.0 | 50.0 | 0.589387 | 0.181548 | -1.023259 |
| 2 | RHS | Yes | Male | 103.0 | NonDominant | 0.340579 | 0.413665 | 1.775144 | 0.171119 | -0.550673 | ... | 0.463500 | 1.632933 | 0.214832 | -0.614941 | 1.017991 | 35.0 | 108.0 | 0.638468 | -0.662233 | -1.064268 |
| 3 | RHS | Yes | Male | 139.0 | Dominant | 0.933517 | 0.513017 | 1.899951 | 0.263186 | -0.091274 | ... | 0.397164 | 1.218180 | 0.157739 | -0.067039 | 1.151140 | 19.0 | 41.0 | 0.682415 | -0.174877 | -1.284745 |
| 4 | LHC | Yes | Male | 106.0 | NonDominant | 0.567189 | 0.504123 | 2.265507 | 0.254140 | -0.570804 | ... | 0.630338 | 1.762350 | 0.397326 | -0.568534 | 1.193816 | 47.0 | 91.0 | 0.761039 | -0.310182 | -1.689339 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 360 | RHS | Yes | Male | 322.0 | NonDominant | 0.582510 | 0.434822 | 1.406913 | 0.189070 | 0.000000 | ... | 0.339918 | 1.001319 | 0.115544 | 0.000000 | 1.001319 | 18.0 | 14.0 | 0.673433 | -0.633925 | -0.811194 |
| 361 | RHS | No | Female | 323.0 | Dominant | 0.504330 | 0.501552 | 1.719051 | 0.251554 | -0.540604 | ... | 0.473516 | 1.444715 | 0.224217 | -0.550697 | 0.894018 | 14.0 | 31.0 | 0.494641 | 0.231307 | -1.265068 |
| 362 | RHS | Yes | Female | 324.0 | NonDominant | 0.204568 | 0.381588 | 1.164488 | 0.145609 | -0.190121 | ... | 0.247382 | 0.789973 | 0.061198 | -0.507092 | 0.282881 | 8.0 | 19.0 | 0.269017 | 0.040762 | -1.022177 |
| 363 | RHS | Yes | Female | 325.0 | Dominant | 0.472270 | 0.334002 | 0.989795 | 0.111557 | 0.000000 | ... | 0.216456 | 0.577910 | 0.046853 | 0.000000 | 0.577910 | 0.0 | 17.0 | 0.331766 | 0.118204 | -1.506182 |
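
The relabeling done row by row in the script (a recording is `Dominant` when the hand used in the test matches the participant's handedness) can be expressed in vectorized form. This is a sketch under two assumptions that are not checked in the notebook: every `Test` code starts with `L` or `R`, and `Dominance` contains exactly `Left` or `Right`. The function name `relabel_dominance` is hypothetical.

```python
import numpy as np
import pandas as pd

def relabel_dominance(df):
    """Return a copy of df with Dominance recoded as 'Dominant' or 'NonDominant'."""
    # The first letter of the test code (LHC, LHS, RHC, RHS) gives the hand used.
    hand_used = df["Test"].str[0].map({"L": "Left", "R": "Right"})
    out = df.copy()
    out["Dominance"] = np.where(hand_used == df["Dominance"], "Dominant", "NonDominant")
    return out

toy = pd.DataFrame({
    "Test": ["LHC", "RHS", "LHS", "RHC"],
    "Dominance": ["Left", "Left", "Right", "Right"],
})
relabeled = relabel_dominance(toy)
print(relabeled["Dominance"].tolist())
# ['Dominant', 'NonDominant', 'NonDominant', 'Dominant']
```

One difference from the loop: the loop treats any value other than `Left` as right-handed, whereas this sketch marks such rows as `NonDominant`.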


In [18]:
# save the dataframe in a csv file
final_df.to_csv("final_df.csv")

The final dataset has 365 rows and 60 columns.
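
One detail to keep in mind when reloading the saved file: `to_csv` writes the row index as an unnamed first column by default, and `read_csv` then labels it `Unnamed: 0`, which adds one column to the count. The sketch below shows the effect on a toy frame written to an in-memory buffer; the toy values are made up for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Test": ["LHC", "RHS"], "Delta_mean": [0.1, 0.2]})

# Default behaviour: the index is written and comes back as an extra column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))  # ['Unnamed: 0', 'Test', 'Delta_mean']

# Writing with index=False keeps the column count unchanged on reload.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)
print(without_index.shape)  # (2, 2)
```

Reading the file back with `pd.read_csv(..., index_col=0)` is an alternative if the file has already been written with the index.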