**<center><font size=5>EEG Data Analysis and Machine Learning (SVM)</font></center>**
***<center>Alcoholic vs Control Groups</center>***
***
**Author**: Iván García Alvarez

**Date**: March 23, 2019

**[GitHub Repository](https://github.com/ruslan-kl/EEG-data-analysis)**

#### Table of Contents
- <a href='#intro'>1. Project Overview</a> 
- <a href='#env'>2. Setting up the Environment</a>
- <a href='#import'>2.1. Data Import</a>
- <a href='#Preprocessing'>3. Preprocessing</a>
- <a href='#Training'>4. Training</a>

# <a id='intro'>1. Project Overview</a>

This project was published on Kaggle as [EEG-Alcohol](https://www.kaggle.com/ruslankl/eeg-data-analysis), a challenge for applying machine learning. We took up that challenge and implement Support Vector Machines here. Like the Kaggle study, I chose the [EEG-Alcohol](https://www.kaggle.com/nnair25/Alcoholics) dataset, which contains [EEG (electroencephalography)](https://en.wikipedia.org/wiki/Electroencephalography) data for two groups: an alcoholic group and a control group.
![](https://i.imgur.com/ZrmxJRu.jpg)
There are 8 subjects in each group. 64 electrodes were placed on each subject's scalp to measure the electrical activity of the brain. The response values were sampled at 256 Hz (3.9 ms epoch) for 1 second. Each subject was exposed to either a single stimulus (S1) or two stimuli (S1 and S2), which were pictures of objects chosen from the [1980 Snodgrass and Vanderwart picture set](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.294.1979&rep=rep1&type=pdf). When two stimuli were shown, they were presented either in a matched condition, where S1 was identical to S2, or in a non-matched condition, where S1 differed from S2.
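
The sampling figures above can be checked with a little arithmetic. The sketch below assumes the long data layout used later in this notebook, with one row per electrode and sample; the constant names are illustrative.

```python
# Sampling figures quoted in the overview.
SAMPLING_RATE_HZ = 256   # samples per second, per electrode
DURATION_S = 1           # length of one recording, in seconds
N_ELECTRODES = 64        # electrodes placed on the scalp

# Time between two consecutive samples, in milliseconds.
sample_period_ms = 1000 / SAMPLING_RATE_HZ

# Samples recorded by one electrode, and rows produced by all electrodes,
# during one 1-second recording (assuming one row per electrode and sample).
samples_per_electrode = SAMPLING_RATE_HZ * DURATION_S
rows_per_recording = samples_per_electrode * N_ELECTRODES

print(f"sample period: {sample_period_ms:.2f} ms")
print(f"samples per electrode: {samples_per_electrode}")
print(f"rows per recording: {rows_per_recording}")
```

The 3.9 ms figure is simply 1/256 of a second, rounded.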

The purpose of the algorithm is to detect differences in the response values to the different stimuli between the control group and the alcoholic group.

# <a id='env'>2. Setting up the Environment</a>
## <a id='import'>2.1. Data Import</a>

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, naive_bayes, metrics, svm
from IPython.display import Image
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline 

# <a id='Preprocessing'>3. Preprocessing</a>
First we preprocess the data, following these guidelines:
* We are not interested in the subject "id", since it can cause overfitting.

* We care a great deal about which sensors are the important ones. For this we take the conclusions of the previous study, which are 'FPZ', 'FP2', 'AF3', 'AFZ', 'AF4', 'F5', 'F3', 'F1', 'FZ', 'FC5', 'FC3', 'FCZ', 'T7', 'CZ', 'C4', 'C6', 'TP7', 'CP5', 'CP3', 'CP1', 'CPZ', 'CP2', 'CP4', 'CP8', 'P5', 'P1', 'PZ', 'P2', 'P4', 'P6', 'P8', 'PO7', 'PO4', 'O1'.

* Analyzing the data, we noticed that the sensor position and the channel carry the same information, so for processing reasons we decided to drop the channel field.

* For better performance of the algorithm, we decided to convert everything to numeric values.
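
The guidelines above can be sketched with pandas on a tiny in-memory table. This is an illustration rather than the notebook's actual pipeline: the sample rows are made up, `GOOD_SENSORS` is a shortened stand-in for the full sensor list, and the trailing comma on one condition label is an assumed irregularity.

```python
import pandas as pd

# Made-up rows in the same long format as the dataset.
raw = pd.DataFrame({
    "sensor_position": ["FP2", "AF1", "nd", "O1"],
    "sample_num": [0, 0, 0, 0],
    "sensor_value": [0.834, -1.2, 3.3, 2.1],
    "subject_identifier": ["a", "a", "c", "c"],
    "matching_condition": ["S1 obj", "S2 match", "S1 obj", "S2 nomatch,"],
    "channel": [1, 2, 31, 62],
})

# Old-to-new electrode names, as used elsewhere in this notebook.
RENAME = {"AF1": "AF3", "AF2": "AF4", "PO1": "PO3", "PO2": "PO4"}
# Shortened stand-in for the list of relevant sensors (assumption).
GOOD_SENSORS = ["FPZ", "FP2", "AF3", "O1"]
SENSOR_CODES = {name: i for i, name in enumerate(GOOD_SENSORS)}
CONDITION_CODES = {"S1 obj": 1, "S2 nomatch": 2, "S2 match": 3}

df = raw.copy()
# Rename sensors, keep only the relevant ones, then encode them as integers.
df["sensor_position"] = df["sensor_position"].replace(RENAME)
df = df[df["sensor_position"].isin(GOOD_SENSORS)].copy()
df["sensor_position"] = df["sensor_position"].map(SENSOR_CODES)
# Encode the stimulus condition, tolerating a stray trailing comma.
df["matching_condition"] = (
    df["matching_condition"].str.rstrip(",").map(CONDITION_CODES)
)
# The channel duplicates the sensor position, so drop it.
df = df.drop(columns=["channel"])

print(df)
```

Encoding each sensor as its position in the list mirrors the `good_sensors.index(...)` approach used in the file-based code of this section.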

In [None]:
from tqdm import tqdm_notebook as tqdm
import os


def filter_columns():
    # Keep only columns 2-7 of every raw CSV file in ../Train and ../Test.
    columns = [2, 3, 4, 5, 6, 7]
    # (source directory, output file); paths are relative to the notebook folder.
    jobs = [('../Train', 'filtered_train_data.csv'),
            ('../Test', 'filtered_test_data.csv')]

    for path, out_name in jobs:
        set_headers = True
        with open(out_name, 'w') as fw:
            for file in tqdm(os.listdir(path)):
                with open(os.path.join(path, file)) as f:
                    flines = f.readlines()
                if len(flines) <= 1:
                    continue
                if set_headers:
                    # Write the header once per output file, on its own line.
                    hsplit = flines[0].split(',')
                    fw.write(",".join(hsplit[col].strip() for col in columns) + '\n')
                    set_headers = False
                for line in flines[1:]:
                    xsplit = line.split(',')
                    try:
                        fw.write(",".join(xsplit[col].strip() for col in columns) + '\n')
                    except IndexError:
                        # Row is shorter than expected: report the file and skip the row.
                        print(file)
        print("done:", path)

def change_sensors():
    # Rename the electrode labels AF1, AF2, PO1 and PO2 to AF3, AF4, PO3 and PO4.
    sensors_dict = {
        'AF1': 'AF3',
        'AF2': 'AF4',
        'PO1': 'PO3',
        'PO2': 'PO4'
    }
    f = open('filtered_train_data.csv')
    fw = open('changed_train_sensor_data.csv', 'w')
    f2 = open('filtered_test_data.csv')
    fw2 = open('changed_test_sensor_data.csv', 'w')

    flines = f.readlines()
    fw.write(flines[0])
    for i in tqdm(range(1, len(flines))):
        if len(flines[i]) > 1:
            xsplit = flines[i].split(',')
            if xsplit[0] in sensors_dict:
                xsplit[0] = sensors_dict.get(xsplit[0])

            fw.write(",".join(xsplit))
            
    flines = f2.readlines()
    fw2.write(flines[0])
    for i in tqdm(range(1, len(flines))):
        if len(flines[i]) > 1:
            xsplit = flines[i].split(',')
            if xsplit[0] in sensors_dict:
                xsplit[0] = sensors_dict.get(xsplit[0])

            fw2.write(",".join(xsplit))
    

    f.close()
    fw.close()
    f2.close()
    fw2.close()

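def filter_sensor():
    # Keep only rows from the relevant sensors and replace each sensor name
    # with its index in good_sensors.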
    good_sensors = ['FPZ', 'FP2', 'AF3', 'AFZ', 'AF4', 'F5', 'F3', 'F1', 'FZ', 'FC5', 'FC3', 'FCZ', 'T7', 'CZ', 'C4', 'C6', 'TP7', 'CP5', 'CP3', 'CP1', 'CPZ', 'CP2', 'CP4', 'CP8', 'P5', 'P1', 'PZ', 'P2', 'P4', 'P6', 'P8', 'PO7', 'PO4', 'O1']

    f = open('changed_train_sensor_data.csv')
    fw = open('changed_train_sensor_data_cool.csv', 'w')
    f2 = open('changed_test_sensor_data.csv')
    fw2 = open('changed_test_sensor_data_cool.csv', 'w')

    flines = f.readlines()
    fw.write(flines[0])
    for i in tqdm(range(1, len(flines))):
        if len(flines[i]) > 3:
            xsplit = flines[i].split(',')
            if xsplit[0] in good_sensors:
                xsplit[0] = str(good_sensors.index(xsplit[0]))
                fw.write(",".join(xsplit))
        
    flines = f2.readlines()
    fw2.write(flines[0])
    for i in tqdm(range(1, len(flines))):
        if len(flines[i]) > 3:
            xsplit = flines[i].split(',')
            if xsplit[0] in good_sensors:
                xsplit[0] = str(good_sensors.index(xsplit[0]))
                fw2.write(",".join(xsplit))

    f.close()
    fw.close()
    f2.close()
    fw2.close()

def juntar_test_data():
    # Concatenate test_data.csv and train_data.csv into data.csv with a single header.
    f_new = open('data.csv', 'w')
    f1 = open('test_data.csv')
    f2 = open('train_data.csv')
    
    flines = f1.readlines()
    
    f_new.write(flines[0])
    for i in tqdm(range(1, len(flines))):
        if len(flines[i]) > 3:
            xsplit = flines[i].split(',')
            f_new.write(",".join(xsplit))
                
    flines = f2.readlines()
    for i in tqdm(range(1, len(flines))):
        if len(flines[i]) > 3:
            xsplit = flines[i].split(',')
            f_new.write(",".join(xsplit))

    f1.close()
    f2.close()
    f_new.close()

def limpiar_s():
    # Encode the matching condition (column 4) as a number:
    # 'S1 obj' -> 1, 'S2 nomatch' -> 2, 'S2 match' -> 3.
    def encode(xsplit):
        if xsplit[4] == 'S1 obj':
            xsplit[4] = '1'
        elif 'S2 nomatch' in xsplit[4]:
            xsplit[4] = '2'
        elif xsplit[4] == 'S2 match':
            xsplit[4] = '3'
        return xsplit

    # Train rows go to train_data.csv and test rows to test_data.csv.
    # The original string comparison xsplit[5] < '64' is omitted: assuming
    # channels are numbered 0-63, a numeric check would keep every row.
    jobs = [('changed_train_sensor_data_cool.csv', 'train_data.csv'),
            ('changed_test_sensor_data_cool.csv', 'test_data.csv')]
    for src, dst in jobs:
        with open(src) as f, open(dst, 'w') as f_new:
            flines = f.readlines()
            f_new.write(flines[0])  # header goes first
            for line in tqdm(flines[1:]):
                if len(line) > 3:
                    f_new.write(",".join(encode(line.split(','))))
                

print('Starting')
print('filtering columns')
#filter_columns()
print('renaming sensors')
#change_sensors()
print('filtering sensors')
#filter_sensor()
print('encoding stimulus conditions')
#limpiar_s()
print('merging test and train')
#juntar_test_data()
print('done')
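
The functions above split each line on commas by hand. If any field in the raw files is quoted and contains a comma (an assumption here, not something verified against the data), a plain `split(',')` breaks that field in two, while the standard `csv` module keeps it whole. A minimal sketch with a made-up two-row file:

```python
import csv
import io

# Made-up file contents; the quoted field contains a comma (assumed case).
text = (
    'sensor position,matching condition,channel\n'
    'FP1,"S2 nomatch,",0\n'
    'FP2,S1 obj,1\n'
)

# Naive splitting cuts the quoted field into two pieces.
naive = text.splitlines()[1].split(',')

# csv.reader honours the quotes and returns the field intact.
rows = list(csv.reader(io.StringIO(text)))

print(len(naive), naive)
print(len(rows[1]), rows[1])
```

With real files, `csv.reader(open(path, newline=''))` would play the role that `io.StringIO` plays here.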

Next we look at the data in order to apply the necessary filters.

In [6]:
data = pd.read_csv("data.csv",encoding='latin-1')
data.head(n=10)

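| Unnamed: 0 | sensor_position | sample_num | sensor_value | subject_identifier | matching_condition | channel |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.834 | a | 1 | 1 |
| 1 | 1 | 1 | 3.276 | a | 1 | 1 |
| 2 | 1 | 2 | 5.717 | a | 1 | 1 |
| 3 | 1 | 3 | 7.67 | a | 1 | 1 |
| 4 | 1 | 4 | 9.623 | a | 1 | 1 |
| 5 | 1 | 5 | 9.623 | a | 1 | 1 |
| 6 | 1 | 6 | 8.647 | a | 1 | 1 |
| 7 | 1 | 7 | 5.229 | a | 1 | 1 |
| 8 | 1 | 8 | 1.322 | a | 1 | 1 |
| 9 | 1 | 9 | -2.096 | a | 1 | 1 |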


Next we encode the alcoholic subjects ('a') as 1 and the control group as 0 (this is the field to classify).
We also use only the fields "sensor_position", "sample_num", "sensor_value" and "matching_condition".

In [None]:
data["subject_identifier"] = data["subject_identifier"].map({'a': 1, 'c': 0})
# Feature matrix with one row per sample and one column per feature.
# Selecting the columns directly keeps each row's values together; stacking
# the four Series and reshaping to (n, 4) would mix values from different rows.
features = ["sensor_position", "sample_num", "sensor_value", "matching_condition"]
train = data[features].to_numpy()
# Target as a 1-D array (1 = alcoholic, 0 = control).
y = data["subject_identifier"].to_numpy()

X_train, X_test, y_train, y_test = model_selection.train_test_split(train, y, test_size=0.00001, random_state=42)
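
When a feature matrix is assembled from separate columns, the layout matters: each row must hold the features of a single sample. The small NumPy sketch below, with made-up values, contrasts stacking columns with reshaping a stacked array, which does not transpose the data.

```python
import numpy as np

# Three made-up samples with two features each.
position = np.array([1, 2, 3])
value = np.array([0.5, 0.6, 0.7])

# Stacking as columns keeps each sample's features on one row.
X = np.column_stack([position, value])

# Reshaping the stacked features only re-cuts the flat buffer,
# so rows end up mixing values from different samples.
scrambled = np.array([position, value]).reshape(3, 2)

print(X)
print(scrambled)
```

Only `X` pairs each position with its own value; `scrambled` has the right shape but the wrong contents.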


# <a id='Training'>4. Training</a>
With the training and test datasets in hand, we proceed to train the model.

In [None]:
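svc = svm.SVC()
# Note: the model is fitted on X_test, the tiny split (test_size=0.00001,
# roughly 67 of the 6,621,824 rows). Scores computed on X_test, as in the
# commented lines below, would therefore be measured on the training rows.
svc.fit(X_test, np.ravel(y_test))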
#score_test = svc.score(X_test, y_test)
#recall_test = metrics.recall_score(y_test, svc.predict(X_test))
#precision_test = metrics.precision_score(y_test, svc.predict(X_test))
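
For reference, here is a minimal sketch of fitting an SVM on one split and scoring it on a separate held-out split. It runs on synthetic data rather than the EEG dataset, and the feature scaling step and the 25% test size are assumptions that are not part of this notebook.

```python
import numpy as np
from sklearn import metrics, model_selection, svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 4 features, two well-separated classes.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 1.0, (100, 4)), rng.normal(3.0, 1.0, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale the features, then fit an SVM (default RBF kernel) on the training split.
model = make_pipeline(StandardScaler(), svm.SVC())
model.fit(X_train, y_train)

# Evaluate only on rows the model has not seen.
y_pred = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)

print("accuracy:", accuracy)
print("recall:", recall)
print("precision:", precision)
```

Kernel SVM training time grows quickly with the number of samples, so on a dataset with millions of rows a subsample is usually needed; keeping the evaluation rows separate from the fitted rows is what makes the scores meaningful.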

