K-nearest neighbors (KNN) is a supervised classification algorithm. 

Reference: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
        
<img src="knn1.png">

Explanation of KNN

From the figure, we notice that there are two classes: Class A (red stars) and Class B (green triangles).

Our new point is the yellow square with a question mark.

Find the distance between the new point and every other point in the dataset.

Example: 

distance - 0.4, 0.49, 0.9, 1.5,...

class    -  A,    B,   A,   B, ...


k = 1: find the single nearest neighbor. Here it belongs to Class A, so the new point is assigned to Class A.

k = 2: find the two nearest neighbors. One belongs to Class A and one to Class B, so the vote is a tie.

k = 3: find the three nearest neighbors. Two belong to Class A and one belongs to Class B, so the new point is assigned to Class A.
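
The voting rule above can be sketched in a few lines of plain Python. This is only an illustration: the distances and classes are the four values from the example, and `knn_vote` is a hypothetical helper, not part of scikit-learn. It returns `None` on a tie, which is one possible way to handle it.

```python
from collections import Counter

# Distances and classes copied from the example above.
distances = [0.4, 0.49, 0.9, 1.5]
labels = ["A", "B", "A", "B"]

def knn_vote(distances, labels, k):
    """Return the majority class among the k nearest points, or None on a tie."""
    # Sort the (distance, label) pairs by distance and keep the k closest.
    nearest = sorted(zip(distances, labels))[:k]
    counts = Counter(label for _, label in nearest).most_common()
    # A tie occurs when the two most frequent classes have the same count.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

for k in (1, 2, 3):
    print(k, knn_vote(distances, labels, k))
# prints:
# 1 A
# 2 None
# 3 A
```

An odd k avoids ties in a two-class problem, which is one reason odd values are often tried first.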

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import scipy
import seaborn as sea

In [None]:
dproj = pd.read_csv('project.csv')

In [None]:
print(dproj.head())

In [None]:
print(dproj.shape)

In [None]:
print(dproj.columns)

In [None]:
print(dproj["TARGET CLASS"].unique())

In [None]:
print(dproj.isnull().sum())

In [None]:
print(dproj.describe())

In [None]:
dproj.hist(figsize=(20,20))

In [None]:
dprojx = dproj[['XVPM', 'GWYH', 'TRAT', 'TLLZ', 'IGGA', 'HYKR', 'EDFS', 'GUUB', 'MGJM',
       'JHZC']]
dprojy = np.array(dproj['TARGET CLASS']).reshape(-1,1)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
x = np.array(scaler.fit_transform(dprojx))
y = dprojy

In [None]:
# Sanity check: after standardization the mean should be ~0 and the standard deviation ~1
print(np.mean(x), np.std(x))

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42)
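
Note that the scaler above is fit on the full dataset before the split, so the test rows influence the means and variances used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows one way to do that with a scikit-learn pipeline. It uses synthetic data from `make_classification` as a stand-in for project.csv, and the names `X_demo`, `y_demo`, and `model` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for project.csv (assumption: 10 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only and reuses
# those training statistics when it transforms the test rows.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

demo_accuracy = model.score(X_te, y_te)
print(demo_accuracy)
```

The value `n_neighbors=5` here is arbitrary and is not a recommendation for the project data.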

In [None]:
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [None]:
"""
We are fitting models for different values of k. We are then finding accuracy for each model 
and storing it in scores_list.
"""
k_range = range(1, 20)
scores_list = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, np.ravel(y_train))
    y_pred = knn.predict(x_test)
    scores_list.append(metrics.accuracy_score(y_test, y_pred))

In [None]:
# We are plotting accuracy for different values of k
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(k_range, scores_list)
plt.xlabel("Value of k in KNN")
plt.ylabel("Test Accuracy")

In [None]:
# from the above plot, we can say that k=8 gives the best accuracy
# then we build a model with 8 neighbors

knn = KNeighborsClassifier(n_neighbors = 8)
knn.fit(x_train, np.ravel(y_train))

In [None]:
y_pred = knn.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))
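
Choosing k from the test-accuracy plot means the test set is used twice: once to pick k and once to report accuracy, which can make the reported number optimistic. One possible alternative, sketched below under stated assumptions, is to pick k with cross-validation on the training set and keep the test set for the final score. The data is synthetic, and the names `X_demo`, `cv_means`, and `best_k` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the project data (assumption: binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Mean 5-fold cross-validated accuracy on the training set for each k.
k_values = list(range(1, 20))
cv_means = []
for k in k_values:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)
    cv_means.append(scores.mean())

# Pick the k with the highest cross-validated accuracy.
best_k = k_values[int(np.argmax(cv_means))]

# Refit on the whole training set and score once on the held-out test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(best_k, final_model.score(X_te, y_te))
```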

In [None]:
from sklearn.metrics import confusion_matrix
# Get the predictions from the classifier fit above
y_pred = knn.predict(x_test)
print(confusion_matrix(y_test,y_pred))
y_pred_1d = y_pred.flatten()
y_test_1d = y_test.flatten()
pd.crosstab(y_test_1d, y_pred_1d, rownames=['True'], colnames=['Predicted'], margins=True)
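
To show how the confusion matrix relates to the usual metrics, here is a small sketch with made-up labels. It assumes a binary target coded as 0 and 1; `y_true_demo` and `y_pred_demo` are hypothetical values, not results from the project data.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For a binary problem, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

In scikit-learn's layout, rows are the true classes and columns are the predicted classes, which matches the `pd.crosstab` call above.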

In [None]:
#!pip install --upgrade scikit-learn

In [None]:
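from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it.
ConfusionMatrixDisplay.from_estimator(knn, x_test, y_test, cmap="Blues")
plt.show()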

In [None]:
from sklearn.metrics import roc_curve
y_pred_proba = knn.predict_proba(x_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

In [None]:
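# The dashed diagonal is the ROC curve of a random classifier
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr, label='KNN')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('KNN (n_neighbors=8) ROC curve')
plt.legend()
plt.show()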

In [None]:
# Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred_proba)


In [None]:
knn = KNeighborsClassifier(n_neighbors = 14)
knn.fit(x_train, np.ravel(y_train))

y_pred = knn.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))

In [None]:
"""
In-class activity: Load customers.csv and perform KNN on it. For this, you can use the spending score
as the target label.
"""

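Conclusion: KNN tends to perform poorly on multiclass problems with a large number of class labels.

KNN also scales poorly to very large datasets, because it stores the entire training set and every prediction has to search it for the nearest neighbors.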