#### References
- http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
- http://scikit-learn.org/stable/auto_examples/exercises/plot_digits_classification_exercise.html#sphx-glr-auto-examples-exercises-plot-digits-classification-exercise-py
- http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py
- http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py

<b> Why KNN? </b>

KNN (k-nearest neighbors) matches a point with its k closest neighbors in a multi-dimensional space. It can be applied to continuous, discrete, ordinal, and categorical data, which makes it useful for handling many kinds of missing data.

The assumption behind using KNN for missing values is that a point's value can be approximated by the values of the points closest to it, where closeness is measured on the other variables.
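
The assumption can be illustrated with a minimal from-scratch sketch. The function name `knn_impute` and the toy rows are made up for illustration and are not part of this notebook's pipeline, which imputes with the column median further down. scikit-learn 0.22 and later also provides `sklearn.impute.KNNImputer` for the same idea.

```python
import math

def knn_impute(rows, target_row, target_col, k=2):
    """Estimate rows[target_row][target_col] from the k nearest rows (hypothetical helper)."""
    query = rows[target_row]
    # Distance is measured on every column except the one being imputed.
    feature_cols = [c for c in range(len(query)) if c != target_col]

    candidates = []
    for i, row in enumerate(rows):
        if i == target_row or row[target_col] is None:
            continue  # skip the query itself and rows missing the same value
        dist = math.sqrt(sum((row[c] - query[c]) ** 2 for c in feature_cols))
        candidates.append((dist, row[target_col]))

    # Average the target column over the k closest rows.
    nearest = sorted(candidates)[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Toy data (made up): the last row is missing its third value.
data = [
    [1.0, 1.0, 10.0],
    [1.0, 2.0, 12.0],
    [8.0, 9.0, 50.0],
    [1.0, 1.5, None],
]
imputed = knn_impute(data, target_row=3, target_col=2, k=2)
print(imputed)  # 11.0
```

The two closest rows have values 10.0 and 12.0 in the third column, so the missing entry is estimated as their mean. The distant row with value 50.0 has no influence.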

<b> KNN Implementation: </b>

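<b> KNeighborsClassifier(): </b> This is scikit-learn's KNN classifier class and the main entry point for the algorithm. Some important parameters are:

- n_neighbors: the number of neighbors (K) that vote on each prediction; the default is 5.
- weights: 'uniform' gives every neighbor an equal vote, while 'distance' weights each vote by the inverse of the neighbor's distance.
- algorithm: the neighbor-search method, one of 'auto', 'ball_tree', 'kd_tree', or 'brute'; 'auto' picks one based on the training data.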
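
To make the roles of `n_neighbors` and `weights` concrete, here is a minimal from-scratch sketch of KNN voting. It is not scikit-learn's implementation. The function `knn_predict` and the toy points are assumptions made for illustration, and only the parameter names mirror `KNeighborsClassifier`.

```python
import math
from collections import defaultdict

def knn_predict(X_train, y_train, query, n_neighbors=3, weights="uniform"):
    """Predict the label of `query` by voting among its nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(x, query), label) for x, label in zip(X_train, y_train)
    ]
    nearest = sorted(distances)[:n_neighbors]

    votes = defaultdict(float)
    for dist, label in nearest:
        if weights == "uniform":
            votes[label] += 1.0                    # every neighbor counts equally
        else:                                      # "distance"
            votes[label] += 1.0 / (dist + 1e-12)   # closer neighbors count more
    return max(votes, key=votes.get)

# Toy data (made up): one very close point of class 4, two distant points of class 2.
X_toy = [[1.0, 1.0], [5.0, 1.0], [1.0, 5.0]]
y_toy = [4, 2, 2]
query = [1.5, 1.5]

uniform_label = knn_predict(X_toy, y_toy, query, n_neighbors=3, weights="uniform")
distance_label = knn_predict(X_toy, y_toy, query, n_neighbors=3, weights="distance")
print(uniform_label, distance_label)  # 2 4
```

With uniform weights the two distant class-2 points outvote the single nearby class-4 point. With distance weights the nearby point dominates, so the prediction flips.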

<b> Accuracy Score: </b> 

accuracy_score(): This function computes the accuracy of the KNN predictions, that is, the ratio of correctly predicted data points to all predicted data points. Accuracy as a metric helps us judge how effective the algorithm is.
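
The ratio described above can be written out by hand. This is a small sketch with made-up labels; the helper name `accuracy` is hypothetical and is only meant to show what `accuracy_score` returns by default (a fraction between 0 and 1).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: 4 of the 5 predictions are correct.
y_true_toy = [2, 2, 4, 4, 2]
y_pred_toy = [2, 4, 4, 4, 2]
acc = accuracy(y_true_toy, y_pred_toy)
print(acc)  # 0.8
```

Multiplying the result by 100 gives the percentage form printed later in this notebook.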

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [1]:
import numpy as np
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

#Customized Utilities
from Utilities.Summarize import Summarize as smr
from Utilities.color import color

In [22]:
#Read Data
#Missing entries (marked '?' in the file) cannot be parsed as float and are read as NaN
cancer_data = np.genfromtxt(fname='./Dataset/breast-cancer-wisconsin.data', delimiter=',', dtype=float)

In [23]:
print("Dataset Length:: ", len(cancer_data))
print("Dataset:: ", str(cancer_data))
print("Dataset Shape:: ", cancer_data.shape)

Dataset Length::  699
Dataset::  [[1.000025e+06 5.000000e+00 1.000000e+00 ... 1.000000e+00 1.000000e+00
  2.000000e+00]
 [1.002945e+06 5.000000e+00 4.000000e+00 ... 2.000000e+00 1.000000e+00
  2.000000e+00]
 [1.015425e+06 3.000000e+00 1.000000e+00 ... 1.000000e+00 1.000000e+00
  2.000000e+00]
 ...
 [8.888200e+05 5.000000e+00 1.000000e+01 ... 1.000000e+01 2.000000e+00
  4.000000e+00]
 [8.974710e+05 4.000000e+00 8.000000e+00 ... 6.000000e+00 1.000000e+00
  4.000000e+00]
 [8.974710e+05 4.000000e+00 8.000000e+00 ... 4.000000e+00 1.000000e+00
  4.000000e+00]]
Dataset Shape::  (699, 11)


In [24]:
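#The first column of the cancer dataset is the patient ID. It is an arbitrary identifier with no
#predictive value, so we remove it to keep it from influencing the distance calculation.
cancer_data = np.delete(arr=cancer_data, obj=0, axis=1)

#The first 9 remaining columns are the features; the last column is the class label (2 = benign, 4 = malignant)
X = cancer_data[:, :9]
Y = cancer_data[:, 9]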

In [25]:
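# Impute missing values with the column median
# (SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='median')
X = imp.fit_transform(X)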
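
The median strategy used above fills each missing entry with the median of the observed values in the same column. A minimal sketch of that rule, using a made-up toy matrix and a hypothetical helper `impute_median` rather than scikit-learn itself:

```python
import math
from statistics import median

def impute_median(rows):
    """Replace NaN entries with the median of the observed values in the same column."""
    n_cols = len(rows[0])
    medians = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if not math.isnan(row[c])]
        medians.append(median(observed))
    return [
        [medians[c] if math.isnan(v) else v for c, v in enumerate(row)]
        for row in rows
    ]

nan = float("nan")
# Toy matrix (made up): one missing value in each column.
toy = [
    [1.0, 10.0],
    [nan, 2.0],
    [3.0, nan],
    [10.0, 4.0],
]
filled = impute_median(toy)
print(filled)  # [[1.0, 10.0], [3.0, 2.0], [3.0, 4.0], [10.0, 4.0]]
```

The median is less sensitive to extreme values than the mean: the 10.0 entries in the toy matrix do not pull the filled values upward.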

In [27]:
#Train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100)
y_train = y_train.ravel()
y_test = y_test.ravel()

In [30]:
#Fit and evaluate a KNN classifier for K = 1..5
for K_value in range(1, 6):
    neigh = KNeighborsClassifier(n_neighbors=K_value, weights='uniform', algorithm='auto')
    neigh.fit(X_train, y_train)
    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:",K_value)

Accuracy is  95.23809523809523 % for K-Value: 1
Accuracy is  93.33333333333333 % for K-Value: 2
Accuracy is  95.71428571428572 % for K-Value: 3
Accuracy is  95.23809523809523 % for K-Value: 4
Accuracy is  95.71428571428572 % for K-Value: 5
