# K-Nearest Neighbours

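K-Nearest Neighbours (KNN) is a non-parametric method that can be used for both classification and regression. For classification, a new data point is assigned the class chosen by a plurality vote of its k nearest neighbours; for regression, the prediction is the average of the neighbours' values.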
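To make the two prediction modes concrete, here is a minimal sketch. The neighbour labels and target values are made-up numbers for illustration only, not taken from the dataset used below.

```python
from collections import Counter
from statistics import mean

# Hypothetical labels and target values of the k=5 nearest neighbours
# of some query point (illustrative numbers only).
neighbour_labels = ["benign", "malignant", "benign", "benign", "malignant"]
neighbour_values = [2.0, 4.0, 2.5, 3.0, 3.5]

# Classification: the most common label among the neighbours wins.
predicted_label = Counter(neighbour_labels).most_common(1)[0][0]

# Regression: the prediction is the mean of the neighbours' target values.
predicted_value = mean(neighbour_values)

print(predicted_label, predicted_value)
```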

The most commonly used distance metric is Euclidean distance.
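As a quick illustration, the Euclidean distance between two points is the square root of the sum of squared coordinate differences. The vectors `a` and `b` below are arbitrary examples; `np.linalg.norm` computes the same quantity in one call.

```python
import numpy as np

# Two arbitrary example points in 3-D space.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Written out by hand: sqrt of the sum of squared differences.
manual = np.sqrt(np.sum((a - b) ** 2))

# The same distance via NumPy's vector norm (2-norm by default).
builtin = np.linalg.norm(a - b)

print(manual, builtin)
```

Note that the difference vector is passed to `np.linalg.norm` as-is; squaring it first would give a different quantity.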

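One major drawback appears when the class distribution is skewed: the more frequent class tends to dominate the vote. One way to overcome this is to weight each of the k nearest neighbours by its distance, so that closer neighbours count for more.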
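The sketch below shows one possible form of distance weighting, where each neighbour contributes `1 / distance` to its class score. The helper name `weighted_vote`, the `eps` guard against division by zero, and the neighbour list are all assumptions made for illustration. In scikit-learn, a comparable effect is available through `KNeighborsClassifier(weights='distance')`.

```python
from collections import Counter, defaultdict

def weighted_vote(neighbours, eps=1e-9):
    """Return the label with the largest sum of inverse-distance weights.

    neighbours: list of (distance, label) pairs for the k nearest points.
    eps: small constant (an assumption here) to avoid dividing by zero.
    """
    scores = defaultdict(float)
    for distance, label in neighbours:
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)

# Hypothetical neighbours: one very close point of class 4,
# two more distant points of class 2.
neighbours = [(0.5, 4), (2.0, 2), (2.5, 2)]

# Unweighted plurality vote picks the majority class among the neighbours.
plain = Counter(label for _, label in neighbours).most_common(1)[0][0]

# Weighted vote lets the single close neighbour outweigh the two far ones.
weighted = weighted_vote(neighbours)

print(plain, weighted)
```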

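Another drawback is its time complexity on large datasets: in a brute-force implementation, each prediction requires computing the distance to every training point.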
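One common mitigation is to index the training points in a tree structure so that not every distance has to be computed. As a rough sketch, scikit-learn's `KNeighborsClassifier` exposes this through its `algorithm` parameter (`'brute'`, `'kd_tree'`, `'ball_tree'`, or `'auto'`). The synthetic data below is an assumption for illustration and is unrelated to the breast cancer dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (illustrative only): 2000 points with 9 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 9))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
queries = rng.normal(size=(100, 9))

# Brute force compares every query against every training point.
brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)

# A k-d tree partitions the space so many points can be skipped per query.
tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)

# Both search strategies look for the same neighbours, so their
# predictions should agree; only the search cost differs.
agreement = np.mean(brute.predict(queries) == tree.predict(queries))
print(agreement)
```

Tree-based search tends to help most when the number of features is small; with many features, brute force can be just as fast.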

In this notebook, I use the breast cancer dataset to predict whether a new data point is malignant (class 4) or benign (class 2).

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
import pickle

**Reading data from a CSV file**

In [2]:
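df=pd.read_csv('/content/drive/My Drive/Colab Notebooks/ML Algorithms/DataSets/breast-cancer-wisconsin.data')
# Missing values are marked with '?'; replace them with a large outlier value
df.replace('?', -99999, inplace=True)
# The id column carries no predictive information, so drop it
df.drop(['id'], axis=1, inplace=True)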

**Getting train and test set data from the pandas dataframe**

In [3]:
# Features are every column except the label; the label column is ' class'
x=np.array(df.drop([' class'], axis=1))
y=np.array(df[' class'])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

**Training the model**

In [4]:
clf=neighbors.KNeighborsClassifier()
clf.fit(x_train, y_train)

acc=clf.score(x_test, y_test)
print(acc)

0.9857142857142858


In [5]:
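# Two new samples, each with the same 9 features as the training data
example_measures=np.array([[4,2,1,1,1,2,3,2,1], [4,2,1,2,2,2,3,2,1]])
# One row per sample, regardless of how many samples are passed
example_measures=example_measures.reshape(len(example_measures), -1)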

prediction = clf.predict(example_measures)

In [6]:
print(prediction)

[2 2]


# K-Nearest Neighbour from scratch

In [7]:
import warnings
from collections import Counter as cntr
import random

In [8]:
def kNearestNeighbour(data, predict, k=3):
  # data maps each class label to a list of feature vectors
  if len(data)>k:
    warnings.warn("K is less than total voting groups")

  dst=[]
  for grp in data:
    for pnt in data[grp]:
      # Euclidean distance between the query point and a training point
      ecl_dst=np.linalg.norm(np.array(predict)-np.array(pnt))
      dst.append([ecl_dst, grp])

  # Plurality vote among the k closest training points
  votes=[i[1] for i in sorted(dst)[:k]]
  result = cntr(votes).most_common(1)[0][0]
  return result
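The function above can be sanity-checked on a tiny two-dimensional dataset before running it on real data. The sketch below restates the same voting logic with plain Euclidean distance under a hypothetical name, `knn_predict`, so that it runs on its own; the toy points and class labels `'k'` and `'r'` are made up for illustration.

```python
from collections import Counter
import numpy as np

def knn_predict(data, predict, k=3):
    """Compact restatement of the voting logic (hypothetical helper)."""
    distances = []
    for group, points in data.items():
        for point in points:
            # Euclidean distance from the query to this training point.
            dist = np.linalg.norm(np.array(predict) - np.array(point))
            distances.append((dist, group))
    # Labels of the k closest points, then a plurality vote.
    votes = [group for _, group in sorted(distances)[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy dataset: two well-separated clusters.
toy = {
    'k': [[1, 2], [2, 3], [3, 1]],
    'r': [[6, 5], [7, 7], [8, 6]],
}

# The query point lies close to the 'r' cluster.
toy_result = knn_predict(toy, [5, 7], k=3)
print(toy_result)
```

All three nearest neighbours of `[5, 7]` belong to the `'r'` cluster, so the vote is unanimous.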

In [9]:
df=pd.read_csv('/content/drive/My Drive/Colab Notebooks/ML Algorithms/DataSets/breast-cancer-wisconsin.data')
df.replace('?', -99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)
data=df.astype('float').values.tolist()
random.shuffle(data)

test_size=0.2
train={2:[], 4:[]}
test={2:[], 4:[]}

x_train=data[:-int(len(data)*test_size)]
x_test=data[-int(len(data)*test_size):]

for i in x_train:
  train[i[-1]].append(i[:-1])

for i in x_test:
  test[i[-1]].append(i[:-1])

correct=0
total=0

for group in test:
  for pnt in test[group]:
    result=kNearestNeighbour(train, pnt, k=5)
    if result==group:
      correct += 1
    total += 1

print("accuracy = ", correct/total)

accuracy =  0.9712230215827338
