# Classification

## K-Nearest Neighbours
![](k-nearest-neighbours/1-k-nearest-neighbours.png)
![](k-nearest-neighbours/2-k-nearest-neighbours.png)

#### Calculating similarity/distance in a multi-dimensional space
For example: Euclidean distance. Feature values have to be normalized first, otherwise the feature with the largest numeric range dominates the distance.
![](k-nearest-neighbours/3-k-nearest-neighbours.png)
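
A minimal sketch of why normalization matters before measuring Euclidean distance. The customer data (age, income) and the helper names `standardize` and `euclidean` are made up for illustration; z-score scaling is one common choice, not the only one.

```python
import math

def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-score)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers: (age in years, income in dollars).
customers = [
    [25, 40_000],
    [60, 41_000],   # very different age, similar income
    [27, 48_000],   # similar age, somewhat different income
    [40, 90_000],
    [50, 20_000],
]

# On raw values income dominates, so customer 1 looks closer to customer 0.
raw_01 = euclidean(customers[0], customers[1])
raw_02 = euclidean(customers[0], customers[2])

# After scaling both features contribute, and customer 2 becomes the closer one.
scaled = standardize(customers)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.3f}  d(0,2)={scaled_02:.3f}")
```

On the raw values the nearest neighbour of customer 0 is decided almost entirely by income; after scaling, the 35-year age gap counts as well.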

## Classification accuracy
![](k-nearest-neighbours/4-k-nearest-neighbours.png)

#### Jaccard index (Jaccard similarity coefficient)
![](k-nearest-neighbours/5-k-nearest-neighbours.png)
With 10 true labels and 10 predicted labels, the index depends on how many predictions match:
j(10,9) = 9/(10+10-9) ≈ 0.82
j(10,5) = 5/(10+10-5) ≈ 0.33
j(10,2) = 2/(10+10-2) ≈ 0.11
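
The arithmetic above can be sketched as a small function. It assumes the definition used in these notes (matching labels divided by the size of the union of both label sets); `jaccard_index` is an illustrative name, and library implementations such as scikit-learn's `jaccard_score` use a per-class definition, so their numbers can differ.

```python
def jaccard_index(y_true, y_pred):
    """Matching labels / (len(y_true) + len(y_pred) - matching labels)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    intersection = sum(t == p for t, p in zip(y_true, y_pred))
    union = len(y_true) + len(y_pred) - intersection
    return intersection / union

y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # 9 of 10 labels match

print(round(jaccard_index(y_true, y_pred), 2))   # 9 / 11
```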

#### F1-score (Confusion matrix)
![](k-nearest-neighbours/6-k-nearest-neighbours.png)
Rows are the actual values, columns are the predicted values.
Out of 15 actual ch1 samples, the model correctly predicted 6 and misclassified 9 as ch0.
Out of 25 actual ch0 samples, the model correctly predicted 24 and misclassified 1 as ch1.

ch1: predicted = 7, actual = 15
ch0: predicted = 33, actual = 25

Few of the actual ch1 samples were picked up (6 of 15), but almost everything classified as ch1 really was ch1 (6 of 7).
precision = probability that a sample predicted as a class really belongs to it: TP / (TP + FP)

Almost all actual ch0 samples were picked up (24 of 25), but some of the samples classified as ch0 were actually ch1 (9 of 33).
recall = probability that a sample of a class is picked up by the model: TP / (TP + FN)

Imagine we are classifying apples and oranges: if everything is predicted as orange, then orange recall is high, but orange precision is low and apple recall is zero.
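
As a rough check of the numbers above, precision and recall can be computed per class straight from the confusion matrix. The nested-dict layout and the function names are illustrative choices, not part of any library.

```python
# Confusion matrix from the notes: outer key = actual class, inner key = predicted class.
confusion = {
    "ch1": {"ch1": 6, "ch0": 9},
    "ch0": {"ch1": 1, "ch0": 24},
}

def precision(cm, label):
    """Share of samples predicted as `label` that really are `label`."""
    predicted_as_label = sum(cm[actual][label] for actual in cm)
    return cm[label][label] / predicted_as_label

def recall(cm, label):
    """Share of actual `label` samples that the model picked up."""
    actual_label = sum(cm[label].values())
    return cm[label][label] / actual_label

for label in confusion:
    print(label,
          "precision =", round(precision(confusion, label), 2),
          "recall =", round(recall(confusion, label), 2))
```

For ch1 this gives a precision of 6/7 and a recall of 6/15; for ch0 a precision of 24/33 and a recall of 24/25.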

$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

(F1 is the harmonic mean of precision and recall, so a low value in either one pulls it down more than the arithmetic mean would)
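
A short sketch comparing F1 with the plain arithmetic mean, using the ch1 values from the confusion matrix above (precision 6/7, recall 6/15). The function name `f1_score` here is a local helper, not the scikit-learn function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

prc, rec = 6 / 7, 6 / 15          # ch1 precision and recall
f1 = f1_score(prc, rec)
mean = (prc + rec) / 2

print(f"F1 = {f1:.2f}, arithmetic mean = {mean:.2f}")

# A more lopsided, hypothetical case: perfect precision, very poor recall.
lopsided_f1 = f1_score(1.0, 0.1)
lopsided_mean = (1.0 + 0.1) / 2
print(f"F1 = {lopsided_f1:.2f}, arithmetic mean = {lopsided_mean:.2f}")
```

The more unbalanced precision and recall are, the further F1 falls below the arithmetic mean.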

Calculated score:
![](k-nearest-neighbours/7-k-nearest-neighbours.png)

## Logarithmic loss
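Measures the performance of a classifier whose output is a probability between 0 and 1. Lower is better; confident predictions that turn out wrong are penalized most.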
![](k-nearest-neighbours/9-k-nearest-neighbours.png)
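
A minimal sketch of binary log loss, assuming the usual definition: the negative average of `y*log(p) + (1-y)*log(1-p)`, where `y` is the true label (0 or 1) and `p` the predicted probability of class 1. The clipping constant `eps` is an assumption added here to avoid `log(0)`.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy; lower means better probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

confident_right = log_loss([1, 0], [0.9, 0.1])
unsure = log_loss([1, 0], [0.5, 0.5])
confident_wrong = log_loss([1, 0], [0.1, 0.9])

print(f"{confident_right:.3f} < {unsure:.3f} < {confident_wrong:.3f}")
```

Confidently wrong probabilities cost far more than hedged ones, which is the behaviour the metric is designed to reward.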

# Practice

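It is important to choose the value of __K__ carefully: a very small K makes the model sensitive to noise, while a very large K smooths over real class boundaries.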
![](k-nearest-neighbours/10-k-nearest-neighbours.png)
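
One common way to pick K is to try several values and compare accuracy on held-out data. The sketch below does this with a tiny pure-Python KNN on made-up data (two overlapping 2-D Gaussian clusters); the dataset, the seed, and the helper names are all assumptions for illustration and have nothing to do with the EEG data used further down.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], query),
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(knn_predict(train_x, train_y, q, k) == y
               for q, y in zip(test_x, test_y))
    return hits / len(test_y)

# Hypothetical toy data: class 0 centred at (0, 0), class 1 centred at (2, 2).
rng = random.Random(4)
points = [([rng.gauss(c, 1.0), rng.gauss(c, 1.0)], label)
          for label, c in ((0, 0.0), (1, 2.0)) for _ in range(100)]
rng.shuffle(points)
train, test = points[:160], points[160:]
train_x, train_y = [p for p, _ in train], [l for _, l in train]
test_x, test_y = [p for p, _ in test], [l for _, l in test]

# Odd values of K avoid tied votes in a two-class problem.
scores = {k: accuracy(train_x, train_y, test_x, test_y, k)
          for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"K={k:2d}  test accuracy={score:.2f}")
print("best K:", best_k)
```

With so few test points the ranking of neighbouring K values is noisy, so in practice the comparison is usually repeated over several splits.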

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics, preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline

In [12]:
ds = pd.read_csv('k-nearest-neighbours/EEG_Eye_State_Classification.csv')
# ds['eyeDetection'].value_counts()
ds.head()

Unnamed: 0,AF3,F7,F3,FC5,T7,P7,O1,O2,P8,T8,FC6,F4,F8,AF4,eyeDetection
0,4329.23,4009.23,4289.23,4148.21,4350.26,4586.15,4096.92,4641.03,4222.05,4238.46,4211.28,4280.51,4635.9,4393.85,0
1,4324.62,4004.62,4293.85,4148.72,4342.05,4586.67,4097.44,4638.97,4210.77,4226.67,4207.69,4279.49,4632.82,4384.1,0
2,4327.69,4006.67,4295.38,4156.41,4336.92,4583.59,4096.92,4630.26,4207.69,4222.05,4206.67,4282.05,4628.72,4389.23,0
3,4328.72,4011.79,4296.41,4155.9,4343.59,4582.56,4097.44,4630.77,4217.44,4235.38,4210.77,4287.69,4632.31,4396.41,0
4,4326.15,4011.79,4292.31,4151.28,4347.69,4586.67,4095.9,4627.69,4210.77,4244.1,4212.82,4288.21,4632.82,4398.46,0


In [None]:
ds.hist(column='P7', bins=50)

In [None]:
# for a in ds.columns:
#     ds.hist(column=a)

In [None]:
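# Features: the 14 EEG channels. The target column 'eyeDetection' must not be
# included here, otherwise the model sees the label it is supposed to predict.
x = ds[['AF3', 'F7', 'F3', 'FC5', 'T7', 'P7', 'O1', 'O2', 'P8', 'T8', 'FC6',
        'F4', 'F8', 'AF4']]
print("before \n" + str(x[:3]))
# x = preprocessing.StandardScaler().fit(x).transform(x.astype(float))
# print("after \n" + str(x[:3]))
y = ds['eyeDetection']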

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)
print('Train set:', x_train.shape, y_train.shape)
print('Test set:', x_test.shape, y_test.shape)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
def estimate(k):
    """Fit KNN with k neighbours; return predictions for the test and train sets."""
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)

    y_est = model.predict(x_test)
    y_est_train = model.predict(x_train)

    return y_est, y_est_train

In [None]:
y_est, y_est_train = estimate(3)

# Compare train and test scores: a large gap suggests overfitting.
print("Train set Accuracy: ", metrics.accuracy_score(y_train, y_est_train))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, y_est), "\n")

print("Train set Precision: ", metrics.precision_score(y_train, y_est_train))
print("Test set Precision: ", metrics.precision_score(y_test, y_est), "\n")

print("Train set Recall: ", metrics.recall_score(y_train, y_est_train))
print("Test set Recall: ", metrics.recall_score(y_test, y_est))