### ECG Signal Classification using Machine Learning - Health Monitoring Signal Analysis

This project analyzes the application of various machine learning algorithms to health monitoring signal data. The dataset is taken from the UCI Machine Learning Repository and can be found at this link (https://archive.ics.uci.edu/ml/datasets/arrhythmia)

Ground Truth:

The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it into one of 16 groups. Class 01 refers to 'normal' ECG, classes 02 to 15 refer to different classes of arrhythmia, and class 16 refers to the rest of the unclassified ones. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classifications. Taking the cardiologist's as the gold standard, we aim to minimise this difference by means of machine learning tools.

Algorithms Applied and their results:

1) Support Vector Machine Classifier - Accuracy = 65.44%

2) Logistic Regression - Accuracy = 66.18%

3) Random Forest Classifier - Accuracy = 72.06%

4) Multi-Layer Perceptron - Accuracy = 63.97%

5) Gradient Boosting Classifier - Accuracy = 71.32%

#### The Random Forest Classifier shows the best performance when accuracy is taken as the performance metric
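Accuracy can look reasonable on an imbalanced label set even when the rare classes are never predicted correctly. As a minimal sketch of how other metrics react, the example below scores a classifier that always predicts the majority class. The labels are toy values, not taken from the arrhythmia data, and the `zero_division` argument assumes scikit-learn 0.22 or later.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels (hypothetical, not from the arrhythmia data):
# class 1 dominates, classes 2 and 3 are rare.
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
# A classifier that always predicts the majority class.
y_pred = [1] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
# Balanced accuracy is the mean of the per-class recalls.
balanced = balanced_accuracy_score(y_true, y_pred)
# Macro F1 averages the per-class F1 scores with equal weight per class.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy          = {accuracy:.3f}")
print(f"balanced accuracy = {balanced:.3f}")
print(f"macro F1          = {macro_f1:.3f}")
```

Here accuracy is 0.6 while balanced accuracy and macro F1 drop to about 0.33 and 0.25, because two of the three classes are never recovered.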

#### Step1: Importing Data and Dependencies

In [78]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

Data Analysis

In [3]:
df1 = pd.read_csv("../CNN/data_clean_imputed.csv")
df1.head()

Unnamed: 0,49,1,162,54,78,0,376,157,70,67,...,0.146,8.2,-1.9,0.147,0.148,0.1.4,0.5.1,15.8,19.8,1.2
0,66,1,160,70,76,160,368,153,75,0,...,-0.4,14.3,-1.7,0.0,0,-0.4,1.7,28.1,44.0,16
1,50,1,174,90,81,105,362,197,70,-31,...,0.0,5.3,-2.9,0.0,0,-0.3,1.6,3.5,18.5,1
2,29,0,172,69,93,129,390,137,60,62,...,-0.6,11.9,-1.3,0.0,0,0.3,1.7,20.4,30.6,6
3,64,0,170,70,94,162,405,237,95,76,...,0.0,10.3,-3.2,0.0,0,1.2,-1.9,14.3,-6.2,2
4,45,1,165,86,77,143,373,150,65,12,...,0.0,4.4,-2.2,0.0,0,0.5,1.5,4.9,17.2,1


In [4]:
df1["1.2"].value_counts()

1     244
10     50
2      44
6      25
16     22
4      15
3      15
5      13
9       9
15      5
14      4
7       3
8       2
Name: 1.2, dtype: int64

Observations:

a) The dataset has missing values, which I imputed using the mean of each feature

b) The last column is the label. It takes values from 1 to 16 (13 distinct classes appear in the counts above), which makes this a multi-class classification problem

c) The value_counts cell shows the distribution of each class label in the dataset. Class 1 ('normal') dominates with 244 rows, while several classes have fewer than 10
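The mean imputation mentioned in observation (a) is not shown in this notebook, since the file loaded here is already imputed. A minimal sketch of how it could be done with pandas follows. It assumes the raw file has no header row and marks missing values with `?`; the four-row CSV is a made-up miniature, not real data.

```python
import io
import pandas as pd

# Hypothetical miniature of a raw file: no header row, '?' marks a missing
# value, and the last column is the class label.
raw_csv = io.StringIO(
    "75,0,190,?,1\n"
    "56,1,165,64,2\n"
    "54,0,172,?,1\n"
    "55,0,175,94,10\n"
)

# header=None keeps the first data row from being read as column names.
df = pd.read_csv(raw_csv, header=None, na_values="?")

# Fill each feature column's missing entries with that column's mean,
# leaving the label column untouched.
feature_cols = df.columns[:-1]
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())

print(df)
print(df[df.columns[-1]].value_counts())
```

In the toy data, the two missing entries in the fourth column are replaced by 79.0, the mean of 64 and 94.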

#### Step 2: Data Preprocessing

The code below builds the dataset in the required format and splits it into train and test subsets, with features and labels separated.

In [5]:
# Read every row up front; the context manager closes the file afterwards
# (alternative input: "../data/pca.csv")
with open("../CNN/data_clean_imputed.csv", "r") as input_file:
    lines = input_file.readlines()

TRAINING_SIZE = 316  # the first 316 rows are used for training, the rest for testing
# 1 for binary, 2 for multiclass
CLASSIFICATION_TYPE = 2
NUM_PCA = 270        # number of leading columns used as features

def encode_label(raw_label):
    """Return the class unchanged (multiclass) or as 0 = normal, 1 = arrhythmia (binary)."""
    label = int(raw_label)
    if CLASSIFICATION_TYPE == 2:
        return label
    return 0 if label == 1 else 1

train_X = []
train_y = []

test_X = []
test_y = []

for count, line in enumerate(lines):
    tokens = line.strip().split(",")
    features = [float(s) for s in tokens[0:NUM_PCA]]
    label = encode_label(tokens[-1])  # the label is the last column
    if count < TRAINING_SIZE:
        train_X.append(features)
        train_y.append(label)
    else:
        test_X.append(features)
        test_y.append(label)

print ("train_X: ", len(train_X), "\ttrain_Y: ",len(train_y),"\ttest_X: ",len(test_X),"\ttest_y: ",len(test_y))

train_X:  316 	train_Y:  316 	test_X:  136 	test_y:  136
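The same row-order split can also be written with pandas and NumPy slicing. The sketch below is one possible formulation, not the notebook's own code: `split_rows` is a hypothetical helper, and the ten-row frame is synthetic stand-in data.

```python
import numpy as np
import pandas as pd

def split_rows(df, training_size, num_features, binary=False):
    """Use the first `training_size` rows for training and the rest for testing."""
    X = df.iloc[:, :num_features].to_numpy(dtype=float)
    y = df.iloc[:, -1].to_numpy(dtype=int)
    if binary:
        # Class 1 is 'normal' -> 0; every other class -> 1 (arrhythmia).
        y = (y != 1).astype(int)
    return X[:training_size], y[:training_size], X[training_size:], y[training_size:]

# Synthetic stand-in: 10 rows, 4 feature columns, 1 label column.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(10, 4)))
demo["label"] = [1, 2, 1, 10, 1, 6, 1, 1, 16, 1]

demo_train_X, demo_train_y, demo_test_X, demo_test_y = split_rows(
    demo, training_size=7, num_features=3, binary=True
)
print(demo_train_X.shape, demo_train_y.shape, demo_test_X.shape, demo_test_y.shape)
print(demo_train_y.tolist(), demo_test_y.tolist())
```

Because the split follows row order, the class mix of the two subsets depends on how the file is ordered; a shuffled or stratified split would be a different design choice.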


#### Step 3: Model Building and Evaluation

##### 1. Performance of Support Vector Machine Classifier

In [71]:
import warnings
warnings.filterwarnings('ignore')

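# One-vs-rest linear SVM with default hyperparameters
y_predicted = OneVsRestClassifier(LinearSVC()).fit(train_X, train_y).predict(test_X)
# Arguments are passed as (predicted, true), so rows are predicted classes and columns are true classes
confusion_matrix(y_predicted, test_y)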

array([[71,  3,  1,  1,  4,  4,  1,  1,  1,  5,  1,  2,  5],
       [ 3,  2,  0,  1,  1,  1,  0,  0,  0,  0,  0,  1,  1],
       [ 0,  0,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 2,  0,  0,  0,  0,  0,  0,  0,  1,  0,  1,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0,  1,  0,  0,  0, 11,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0]])

In [72]:
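print ("Accuracy of the Model on the Test Dataset is {} \n".format(accuracy_score(y_predicted, test_y)))
# With the (predicted, true) argument order, the precision and recall columns are swapped
# relative to scikit-learn's (y_true, y_pred) convention, and support counts predictions per class
print ("Classification Report \n",classification_report(y_predicted, test_y))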


Accuracy of the Model on the Test Dataset is 0.6544117647058824 

Classification Report 
              precision    recall  f1-score   support

          1       0.91      0.71      0.80       100
          2       0.29      0.20      0.24        10
          3       0.80      1.00      0.89         4
          4       0.33      0.33      0.33         3
          5       0.00      0.00      0.00         0
          6       0.00      0.00      0.00         4
          7       0.00      0.00      0.00         0
          8       0.00      0.00      0.00         0
          9       0.00      0.00      0.00         0
         10       0.65      0.85      0.73        13
         14       0.00      0.00      0.00         0
         15       0.00      0.00      0.00         0
         16       0.00      0.00      0.00         2

avg / total       0.78      0.65      0.71       136
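The cells in this notebook call `confusion_matrix(y_predicted, test_y)` and `classification_report(y_predicted, test_y)`, whereas scikit-learn documents both as taking `(y_true, y_pred)`. The toy example below, with made-up labels, sketches what the argument order changes: the confusion matrix is transposed, and per-class precision and recall trade places. Accuracy is unaffected because it is symmetric in its arguments.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical).
y_true = [1, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 2]

# Documented order: rows are true classes, columns are predicted classes.
cm_true_first = confusion_matrix(y_true, y_pred)
# Swapped order: rows are predicted classes, columns are true classes.
cm_pred_first = confusion_matrix(y_pred, y_true)

print(cm_true_first)
print(cm_pred_first)

# Per-class precision in the documented order equals per-class recall
# when the arguments are swapped.
precision_correct = precision_score(y_true, y_pred, average=None)
recall_swapped = recall_score(y_pred, y_true, average=None)
print(np.allclose(precision_correct, recall_swapped))
```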



##### 2. Performance of Logistic Regression

In [49]:
y_predicted = OneVsRestClassifier(LogisticRegression()).fit(train_X, train_y).predict(test_X)
confusion_matrix(y_predicted, test_y)

array([[66,  3,  0,  2,  3,  3,  1,  1,  0,  2,  1,  0,  3],
       [ 6,  2,  0,  0,  2,  1,  0,  0,  0,  0,  0,  1,  1],
       [ 0,  0,  5,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0],
       [ 1,  1,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 2,  0,  0,  0,  0,  1,  0,  0,  1,  0,  1,  0,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0],
       [ 3,  0,  0,  0,  0,  1,  0,  0,  0, 14,  0,  1,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0]])

In [50]:
print ("Accuracy of the Model on the Test Dataset is {} \n".format(accuracy_score(y_predicted, test_y)))
print ("Classification Report \n",classification_report(y_predicted, test_y))

Accuracy of the Model on the Test Dataset is 0.6617647058823529 

Classification Report 
              precision    recall  f1-score   support

          1       0.85      0.78      0.81        85
          2       0.29      0.15      0.20        13
          3       1.00      0.83      0.91         6
          4       0.33      0.33      0.33         3
          5       0.00      0.00      0.00         0
          6       0.17      0.14      0.15         7
          7       0.00      0.00      0.00         0
          8       0.00      0.00      0.00         1
          9       0.50      1.00      0.67         1
         10       0.82      0.74      0.78        19
         14       0.00      0.00      0.00         0
         15       0.00      0.00      0.00         0
         16       0.00      0.00      0.00         1

avg / total       0.73      0.66      0.69       136



##### 3. Performance of Random Forest Classifier

In [51]:
y_predicted = OneVsRestClassifier(RandomForestClassifier()).fit(train_X, train_y).predict(test_X)
confusion_matrix(y_predicted, test_y)

array([[74,  0,  0,  2,  4,  6,  1,  1,  0,  6,  2,  2,  6],
       [ 3,  5,  1,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0],
       [ 0,  0,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  2,  0,  0,  0,  0],
       [ 1,  1,  0,  0,  0,  0,  0,  0,  0, 11,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])

In [52]:
print ("Accuracy of the Model on the Test Dataset is {} \n".format(accuracy_score(y_predicted, test_y)))
print ("Classification Report \n",classification_report(y_predicted, test_y))

Accuracy of the Model on the Test Dataset is 0.7205882352941176 

Classification Report 
              precision    recall  f1-score   support

          1       0.95      0.71      0.81       104
          2       0.71      0.50      0.59        10
          3       0.80      1.00      0.89         4
          4       0.33      1.00      0.50         1
          5       0.20      1.00      0.33         1
          6       0.00      0.00      0.00         0
          7       0.00      0.00      0.00         0
          8       0.00      0.00      0.00         0
          9       1.00      1.00      1.00         2
         10       0.65      0.85      0.73        13
         14       0.00      0.00      0.00         0
         15       0.00      0.00      0.00         0
         16       0.00      0.00      0.00         1

avg / total       0.88      0.72      0.78       136
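`GridSearchCV` is imported in Step 1 but not used in the cells shown, and every model runs with default hyperparameters. As a hedged sketch of how a small search over Random Forest settings might look, the example below uses synthetic data from `make_classification`. The grid values are illustrative assumptions, not tuned choices for the arrhythmia data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the real features and labels.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the data passed to fit
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With the real data, fitting the search on `train_X` and `train_y` only would keep the test rows out of the tuning step.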



##### 4. Performance of Multi-Layer Perceptron Classifier

In [53]:
y_predicted = OneVsRestClassifier(MLPClassifier()).fit(train_X, train_y).predict(test_X)
confusion_matrix(y_predicted, test_y)

array([[67,  3,  1,  2,  3,  4,  1,  1,  0,  2,  0,  1,  5],
       [ 7,  3,  0,  0,  1,  0,  0,  0,  0,  0,  0,  1,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  1,  3,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  1,  0,  0,  0,  0,  0,  2,  0,  0,  0,  0],
       [ 4,  0,  0,  1,  0,  2,  0,  0,  0, 14,  2,  1,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])

In [54]:
print ("Accuracy of the Model on the Test Dataset is {} \n".format(accuracy_score(y_predicted, test_y)))
print ("Classification Report \n",classification_report(y_predicted, test_y))

Accuracy of the Model on the Test Dataset is 0.6397058823529411 

Classification Report 
              precision    recall  f1-score   support

          1       0.86      0.74      0.80        90
          2       0.43      0.23      0.30        13
          3       0.00      0.00      0.00         0
          4       0.00      0.00      0.00         0
          5       0.20      1.00      0.33         1
          6       0.00      0.00      0.00         5
          7       0.00      0.00      0.00         0
          8       0.00      0.00      0.00         0
          9       1.00      0.67      0.80         3
         10       0.82      0.58      0.68        24
         14       0.00      0.00      0.00         0
         15       0.00      0.00      0.00         0
         16       0.00      0.00      0.00         0

avg / total       0.78      0.64      0.70       136



##### 5. Performance of Gradient Boosting Classifier

In [90]:
y_predicted = OneVsRestClassifier(GradientBoostingClassifier()).fit(train_X, train_y).predict(test_X)
confusion_matrix(y_predicted, test_y)

array([[74,  2,  0,  2,  4,  3,  1,  1,  2,  7,  0,  0,  4],
       [ 3,  3,  0,  1,  0,  0,  0,  0,  0,  0,  0,  1,  0],
       [ 0,  0,  5,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0,  3,  0,  0,  0,  1,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0],
       [ 0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0]])

In [91]:
print ("Accuracy of the Model on the Test Dataset is {} \n".format(accuracy_score(y_predicted, test_y)))
print ("Classification Report \n",classification_report(y_predicted, test_y))

Accuracy of the Model on the Test Dataset is 0.7132352941176471 

Classification Report 
              precision    recall  f1-score   support

          1       0.95      0.74      0.83       100
          2       0.43      0.38      0.40         8
          3       1.00      0.83      0.91         6
          4       0.00      0.00      0.00         0
          5       0.20      1.00      0.33         1
          6       0.50      0.60      0.55         5
          7       0.00      0.00      0.00         1
          8       0.00      0.00      0.00         0
          9       0.00      0.00      0.00         0
         10       0.53      0.90      0.67        10
         14       0.50      1.00      0.67         1
         15       0.33      0.33      0.33         3
         16       0.00      0.00      0.00         1

avg / total       0.84      0.71      0.76       136

