# Introduction
This project uses the confusion EEG dataset hosted on Kaggle at https://www.kaggle.com/wanghaohan/eeg-brain-wave-for-confusion. The dataset consists of EEG recordings taken from students watching MOOC videos of varying difficulty. The goal is to identify signals in the EEG that indicate whether or not the student is confused by the subject matter. In theory, confusing material should require additional concentration, or at least a different type of focus, from the student, and that difference may be visible in the EEG data.

In [39]:
from pandas import read_csv
from matplotlib.pyplot import plot, scatter
%matplotlib inline
import numpy as np
# sklearn.cross_validation was removed from scikit-learn; the same helpers live in model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
raw_data = read_csv("data/EEG data.csv")

In [3]:
raw_data.head()

Unnamed: 0,subject ID,Video ID,Attention,Meditation,Raw,Delta,Theta,Alpha 1,Alpha 2,Beta 1,Beta 2,Gamma1,Gamma2,predefined label,Self-defined label
0,0,0,56,43,278,301963,90612,33735,23991,27946,45097,33228,8293,0,0
1,0,0,40,35,-50,73787,28083,1439,2240,2746,3687,5293,2740,0,0
2,0,0,47,48,101,758353,383745,201999,62107,36293,130536,57243,25354,0,0
3,0,0,47,57,-5,2012240,129350,61236,17084,11488,62462,49960,33932,0,0
4,0,0,44,53,-8,1005145,354328,37102,88881,45307,99603,44790,29749,0,0


For this analysis I will group the data by session (one subject watching one video) and use the brainwave power reported in each frequency band. I will not use the raw signal, as it does not appear to have enough resolution to show individual brain waves. I will also not use the proprietary measurements (the Attention and Meditation columns). They are likely computed as a function of the rest of the data, so they are unlikely to add information on their own, and they are hard to interpret without knowing how they are computed.
For the label, I will use the self-defined label, since I feel it is more trustworthy.

In [4]:
feature_data = raw_data[['subject ID', 'Video ID', "Delta", "Theta", "Alpha 1", "Alpha 2", "Beta 1", "Beta 2", "Gamma1", "Gamma2", 'Self-defined label']]
sessions = feature_data.groupby(['subject ID', 'Video ID'])
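The grouping step can be illustrated on a small made-up table. This is only a sketch: the `toy` frame and its values are hypothetical and merely mimic the column layout of the CSV. It shows that averaging per session yields one row per (subject, video) pair, and that the label survives averaging as long as it is constant within a session.

```python
import pandas as pd

# Hypothetical toy rows that mimic the layout of the EEG CSV (values are made up).
toy = pd.DataFrame({
    "subject ID": [0, 0, 0, 1, 1],
    "Video ID": [0, 0, 1, 0, 0],
    "Delta": [10.0, 30.0, 50.0, 20.0, 40.0],
    "Self-defined label": [0, 0, 1, 1, 1],
})

toy_sessions = toy.groupby(["subject ID", "Video ID"])

# One row per (subject, video) session, holding the mean of every column.
toy_averages = toy_sessions.mean()

# If the label is constant within a session, its mean is still the original 0/1 value.
label_is_constant = (toy_sessions["Self-defined label"].nunique() == 1).all()

print(toy_averages)
print("label constant per session: %s" % label_is_constant)
```

Running the same `nunique()` check on the real `sessions` object would confirm whether the averaged label can safely be used as a classification target.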

Let's first try a simple logistic regression on the average value of each brainwave band.

In [42]:
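# One row per session: the mean of every band (and of the label) within that session
averages = sessions.aggregate(np.mean)
# All columns except the last are features; the last column is the label
features = averages[averages.columns[:-1]]
target = averages[averages.columns[-1]]
# Hold-out split (default 25% test); run_model below cross-validates on the full data instead
x_train, x_test, y_train, y_test = train_test_split(features, target, random_state=1337)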

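def run_model(model):
    """
    Cross-validate the given model on the session averages and print the mean score.

    cross_val_score fits clones of the model, so the returned model is not fitted.
    """
    scores = cross_val_score(model, features, target)
    print("score is %f" % np.mean(scores))
    return model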

In [43]:
print("Logistic Regression")
lr = run_model(LogisticRegression(random_state=42))

Logistic Regression
score is 0.519905
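`run_model` relies on the defaults of `cross_val_score`, so the number of folds and the scoring metric are implicit, and the default fold count differs between scikit-learn versions. A minimal sketch of making those choices explicit, assuming synthetic stand-in data from `make_classification` rather than the real session averages, could look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the session averages (an assumption, not the EEG data).
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

# Explicit, reproducible folds that preserve the class balance in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print("fold accuracies: %s" % np.round(scores, 3))
print("mean %.3f, std %.3f" % (scores.mean(), scores.std()))
```

Reporting the spread across folds alongside the mean helps judge whether a difference of a few percentage points between models is meaningful on a dataset this small.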


This gives a mean cross-validated accuracy of about 52%, which is only slightly better than chance and short of the 65% accuracy that the researchers report as good. Nevertheless, let's try a few more models.

In [44]:
print("Gradient Boosting")
gbc = run_model(GradientBoostingClassifier(random_state=42))

Gradient Boosting
score is 0.609923


In [36]:
print("KNN")
knn = run_model(KNeighborsClassifier(n_neighbors=1))

KNN
             precision    recall  f1-score   support

        0.0       0.73      0.57      0.64        14
        1.0       0.57      0.73      0.64        11

avg / total       0.66      0.64      0.64        25

accuracy is 0.640000
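The report above has the form produced by `classification_report` on a held-out test split (support of 25), rather than the single cross-validated score that `run_model` prints. A minimal sketch of that style of evaluation, assuming synthetic stand-in data and a hypothetical `knn_demo` model rather than the notebook's own variables, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the session averages (an assumption, not the EEG data).
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

# Default split keeps 25% of the rows for testing, i.e. 25 of 100 here.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1337)

# Fit on the training rows only, then score on rows the model has not seen.
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
predictions = knn_demo.predict(X_test)

print(classification_report(y_test, predictions))
holdout_accuracy = accuracy_score(y_test, predictions)
print("accuracy is %f" % holdout_accuracy)
```

A single hold-out split of this size gives a noisy estimate, so its accuracy is not directly comparable with the cross-validated scores printed for the other models.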
