# k-Nearest Neighbor Classifier

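- KNN is a non-parametric, lazy learning algorithm
- Non-parametric means it makes no assumption about the underlying data distribution
- Lazy means no explicit model is built up front: all of the training data is kept and used in the testing phase
- The training phase is quick because it essentially only stores the data
- The testing phase is slow because, in the basic algorithm, each query has to be compared with all of the training samples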

<img src='img/knn.png'>
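
To make the "lazy" behaviour described above concrete, here is a minimal from-scratch sketch in plain Python. It is not how scikit-learn implements kNN; the class name `TinyKNN`, the Euclidean distance, and the toy points are assumptions made up for illustration. Note that `fit` only stores the data, while `predict_one` does all of the distance work.

```python
import math
from collections import Counter


class TinyKNN:
    """Illustrative k-nearest-neighbor classifier (hypothetical helper, not sklearn)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" is just memorizing the data, which is why it is fast.
        self.X = [tuple(row) for row in X]
        self.y = list(y)
        return self

    def predict_one(self, query):
        # Prediction compares the query with every stored training point,
        # which is why it gets slow on large training sets.
        distances = [
            (math.dist(query, row), label)
            for row, label in zip(self.X, self.y)
        ]
        distances.sort(key=lambda pair: pair[0])
        nearest_labels = [label for _, label in distances[: self.k]]
        # Majority vote among the k nearest neighbors.
        return Counter(nearest_labels).most_common(1)[0][0]

    def predict(self, X):
        return [self.predict_one(row) for row in X]


# Toy data (made up): two well-separated clusters.
X_toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_toy = [0, 0, 0, 1, 1, 1]

model = TinyKNN(k=3).fit(X_toy, y_toy)
toy_predictions = model.predict([(1.1, 1.0), (5.1, 5.0)])
print(toy_predictions)  # [0, 1]
```

Each prediction costs one distance computation per stored training point, so the cost of the testing phase grows with the size of the training set.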

In [1]:
import pickle as pkl

with open('../data/titanic_tansformed.pkl', 'rb') as f:
    df_data = pkl.load(f)

In [2]:
df_data.head()

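| | Survived | Age | SibSp | Parch | Fare | 1 | 2 | 3 | female | male | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |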


In [3]:
df_data.shape

(889, 13)

In [4]:
data = df_data.drop("Survived",axis=1)
label = df_data["Survived"]

In [5]:
from sklearn.model_selection import train_test_split  
data_train, data_test, label_train, label_test = train_test_split(data, label, test_size = 0.2, random_state = 101)

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
import time

# Fitting kNN mostly just stores (and optionally indexes) the training data, so it is fast
tic = time.time()
knn_cla = KNeighborsClassifier()  # default n_neighbors=5
knn_cla.fit(data_train, label_train)
print('Time taken for training kNN', (time.time()-tic), 'secs')

predictions = knn_cla.predict(data_test)
print('Accuracy', knn_cla.score(data_test, label_test))

# Confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(label_test, predictions))
print(classification_report(label_test, predictions))

Time taken for training kNN 0.0014171600341796875 secs
Accuracy 0.7247191011235955
[[81 26]
 [23 48]]
             precision    recall  f1-score   support

          0       0.78      0.76      0.77       107
          1       0.65      0.68      0.66        71

avg / total       0.73      0.72      0.73       178



### Hyperparameters for kNN
- The main hyperparameter is k (`n_neighbors` in scikit-learn), the number of nearest neighbors whose labels are considered when classifying a sample
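
The number of neighbors can change the predicted class for the same query point. The sketch below illustrates this with made-up 1-D points; the `vote` helper and the data are hypothetical and are not part of this notebook's Titanic pipeline.

```python
from collections import Counter

# Made-up 1-D training points as (feature value, class label).
train = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (4.2, "B"), (6.5, "B")]


def vote(query, k):
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(train, key=lambda point: abs(point[0] - query))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]


query = 4.0
for k in (1, 3, 5):
    print(k, vote(query, k))
# 1 B
# 3 A
# 5 A
```

With k=1 the single closest point (4.2, class B) decides the outcome, while larger k lets the more numerous class A neighbors outvote it. For two classes, odd values of k avoid tied votes.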

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values of k to try
n_neighbors = [2, 3, 4, 5, 6, 7, 8, 9]
score_func = 'accuracy'

knn_cla = KNeighborsClassifier()
# cv=5 runs 5-fold cross-validation (stratified by default for classifiers)
knn_grid = GridSearchCV(estimator=knn_cla,
                        param_grid=[{'n_neighbors': n_neighbors}],
                        cv=5,
                        scoring=score_func)
knn_grid.fit(data_train, label_train)
print('Best Score', knn_grid.best_score_)
print('Best value for k', knn_grid.best_estimator_.n_neighbors)

Best Score 0.7229254571026723
Best value for k 7
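
Conceptually, the grid search above scores each candidate k by cross-validated accuracy and keeps the best one. The following plain-Python sketch mimics that idea on synthetic data. It is a simplification, not scikit-learn's implementation: the helpers `knn_predict` and `cv_accuracy`, the interleaved fold assignment, and the random clusters are all assumptions for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points nearest to query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5):
    """Mean accuracy of kNN over n_folds folds (sample i goes to fold i % n_folds)."""
    fold_scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / n_folds


# Synthetic, made-up data: two noisy 2-D clusters with 50 points each.
rng = random.Random(0)
X_demo = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c in (0.0, 2.0) for _ in range(50)]
y_demo = [label for label in (0, 1) for _ in range(50)]

# Score every candidate k, then keep the one with the highest mean accuracy.
scores = {k: cv_accuracy(X_demo, y_demo, k) for k in range(2, 10)}
best_k = max(scores, key=scores.get)
print("Best value for k", best_k, "with mean accuracy", round(scores[best_k], 3))
```

The best cross-validation score is computed only from the training split, so the selected model should still be checked against the held-out test set before drawing conclusions.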
