# K-Nearest Neighbor

---

* K-Nearest Neighbour (K-NN) is one of the simplest Machine Learning algorithms and is based on the Supervised Learning technique.

* The K-NN algorithm assumes that similar cases lie close to one another: it measures the similarity between a new case and the available cases and puts the new case into the category of the cases it is most similar to.

* The K-NN algorithm stores all the available data and classifies a new data point based on its similarity to the stored points. This means that when new data appears, it can be assigned to a well-suited category using the K-NN algorithm.

* The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.

* K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data distribution.

* It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs the computation on it at classification time.

* At the training phase the K-NN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
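
To make the "lazy learner" idea concrete, here is a minimal sketch. The class name `LazyKNN` and its method names are hypothetical and do not come from any library. `fit` only stores the data, and all distance computation is deferred to prediction time. The regression variant is assumed here to return the mean of the k nearest targets, which is one common choice rather than the only one.

```python
import math
from collections import Counter


class LazyKNN:
    """Hypothetical minimal K-NN: 'training' just stores the data."""

    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.targets = []

    def fit(self, points, targets):
        # No model is built here; the data is simply remembered.
        self.points = list(points)
        self.targets = list(targets)
        return self

    def _nearest_targets(self, query):
        # All the work happens at prediction time: rank stored points by distance.
        ranked = sorted(
            zip(self.points, self.targets),
            key=lambda pair: math.dist(pair[0], query),  # math.dist needs Python 3.8+
        )
        return [target for _, target in ranked[: self.k]]

    def predict_class(self, query):
        # Classification: majority vote among the k nearest targets.
        return Counter(self._nearest_targets(query)).most_common(1)[0][0]

    def predict_value(self, query):
        # Regression: mean of the k nearest numeric targets.
        nearest = self._nearest_targets(query)
        return sum(nearest) / len(nearest)


clf = LazyKNN(k=3).fit(
    [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)], ["a", "a", "a", "b", "b"]
)
print(clf.predict_class((1.5, 1.5)))

reg = LazyKNN(k=2).fit([(0,), (1,), (10,)], [0.0, 2.0, 20.0])
print(reg.predict_value((0.4,)))
```

Because nothing is precomputed, every prediction scans the whole stored dataset, which is the usual cost of a lazy learner.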


**Iris dataset**

* This dataset consists of the petal and sepal lengths and widths of 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray.

* The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width.

**Loading Data**

In [None]:
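import csv
import random

def loadDataset(filename, split, trainingSet, testSet):
  """Read the CSV and append each row to trainingSet or testSet in place."""
  with open(filename, 'r') as csvfile:
    dataset = list(csv.reader(csvfile))
  # start at row 1, skipping the first row (assumed to be a header)
  for x in range(1, len(dataset)):
    # convert the first four columns from strings to floats
    for y in range(4):
      dataset[x][y] = float(dataset[x][y])
    # each row goes to the training set with probability `split`
    if random.random() < split:
      trainingSet.append(dataset[x])
    else:
      testSet.append(dataset[x])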

In [None]:
trainingSet = []
testSet = []
# the r prefix makes a raw string literal, so backslashes are kept as-is
loadDataset(r'/content/drive/MyDrive/Colab Notebooks/Iris.csv', 0.66, trainingSet, testSet)
# repr() returns a printable representation of an object
print('Train: ' + repr(len(trainingSet)))
print('Test: ' + repr(len(testSet)))

Train: 102
Test: 48
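
Because each row is assigned with `random.random() < split`, the sizes of the two sets change from run to run (102 and 48 in the run above). If a fixed-size, repeatable split is wanted, one possible sketch is to shuffle with a seeded generator and cut the list. The function name `split_rows` and the `seed` parameter are hypothetical, and the rows below are stand-in data rather than the real Iris file.

```python
import random


def split_rows(rows, train_fraction, seed=0):
    """Shuffle a copy of rows with a fixed seed and cut it into (train, test)."""
    rng = random.Random(seed)  # private generator: global random state is untouched
    shuffled = list(rows)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


rows = [[float(i), "label"] for i in range(150)]  # stand-in data, not the Iris rows
train, test = split_rows(rows, 0.66)
print(len(train), len(test))
```

Calling `split_rows` again with the same `seed` returns the same two lists, which makes later accuracy figures comparable between runs.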


**Find Similarity**

In [None]:
import math
def euclideanDistance(instance1, instance2, length):
  # distance = sqrt(sum of (x1 - x2)^2) over the first `length` values
  distance = 0
  for x in range(length):
    distance += pow((instance1[x]-instance2[x]), 2)
  return math.sqrt(distance)

In [None]:
# testing above function: compare the 3 numeric values, ignoring the label
data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = euclideanDistance(data1, data2, 3)
print('Distance: ' + repr(distance))

Distance: 3.4641016151377544
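
As a cross-check on the formula, d(a, b) = sqrt(sum over i of (a_i - b_i)^2), here is a small self-contained sketch on a pair of points whose distance is easy to verify by hand. The helper name `euclidean` is hypothetical; `math.dist` is the standard-library equivalent and is available from Python 3.8.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


p, q = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(euclidean(p, q))  # sqrt(3^2 + 4^2 + 0^2) = 5.0
print(math.dist(p, q))  # standard-library equivalent
```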


**Finding K nearest neighbors**

In [None]:
import operator
def getNeighbors(trainingSet, testInstance, k):
  distances = []
  # the last element of an instance is its class label, so leave it out of the distance
  length = len(testInstance)-1
  for trainInstance in trainingSet:
    dist = euclideanDistance(testInstance, trainInstance, length)
    distances.append((trainInstance, dist))
  # sort by distance (tuple index 1), nearest first
  distances.sort(key=operator.itemgetter(1))
  # keep the k closest training instances
  return [pair[0] for pair in distances[:k]]


In [None]:
# testing above function
trainSet = [[2,2,2,'a'], [4,4,4,'b']]
# no label is given here, so the last value sits in the label position and is ignored
testInstance = [5,5,5]
k = 1
neighbors = getNeighbors(trainSet, testInstance, k)
print(neighbors)

[[4, 4, 4, 'b']]
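
Sorting every distance is more work than needed when only the k smallest are kept. As an alternative sketch, `heapq.nsmallest` from the standard library selects the k closest rows directly. The function name `k_nearest` is hypothetical, each training row is assumed to end with its label, and `math.dist` requires Python 3.8 or later.

```python
import heapq
import math


def k_nearest(training_rows, query_features, k):
    """Return the k training rows closest to query_features, nearest first.

    Each training row is assumed to be [feature, ..., feature, label].
    """
    return heapq.nsmallest(
        k,
        training_rows,
        key=lambda row: math.dist(row[:-1], query_features),
    )


rows = [[2, 2, 2, "a"], [4, 4, 4, "b"], [9, 9, 9, "c"]]
print(k_nearest(rows, [5, 5, 5], 2))  # [[4, 4, 4, 'b'], [2, 2, 2, 'a']]
```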


In [None]:
import operator
def getResponse(neighbors):
  # count one vote per neighbor for its class label (the last element)
  classVotes = {}
  for x in range(len(neighbors)):
    response = neighbors[x][-1]
    if response in classVotes:
      classVotes[response] += 1
    else:
      classVotes[response] = 1
  # after all votes are counted, sort the (label, count) pairs by count, highest first
  sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
  return sortedVotes[0][0]

In [None]:
# testing above function
neighbors = [[1,1,1,'a'], [2,2,2,'a'],[3,3,3,'b']]
response = getResponse(neighbors)
print(response)

a
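
The same majority vote can be sketched with `collections.Counter`, which also makes the tie-breaking rule explicit. The function name `majority_label` is hypothetical, and each neighbor is assumed to carry its label as the last element.

```python
from collections import Counter


def majority_label(neighbors):
    """Return the most common label, where each neighbor is [features..., label]."""
    votes = Counter(row[-1] for row in neighbors)
    # most_common orders by count; labels with equal counts keep first-seen order,
    # so a tie goes to the label that appeared first in `neighbors`.
    return votes.most_common(1)[0][0]


print(majority_label([[1, 1, 1, "a"], [2, 2, 2, "a"], [3, 3, 3, "b"]]))  # a
print(majority_label([[1, 1, 1, "b"], [2, 2, 2, "a"]]))                  # tie: b
```

When the neighbors are ordered nearest first, this tie rule favours the label of the closer neighbor. Choosing an odd k is a common way to avoid ties in two-class problems.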


**Find Accuracy**

In [None]:
def getAccuracy(testSet, predictions):
  correct = 0
  for x in range(len(testSet)):
    # compare label values with ==; `is` tests object identity and can fail for equal strings
    if testSet[x][-1] == predictions[x]:
      correct += 1
  print(correct)
  return (correct/float(len(testSet))) * 100.0

In [None]:
# testing above function
testSet = [[1,1,1,'a'], [2,2,2,'a'],[3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)

2
66.66666666666666
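
Labels parsed from a file are separate string objects, so an accuracy measure should compare them by value. A compact sketch under that assumption follows. The function name `accuracy_percent` is hypothetical, and it takes two plain lists of labels rather than full rows.

```python
def accuracy_percent(actual_labels, predicted_labels):
    """Percentage of positions where the predicted label equals the actual one."""
    if len(actual_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not actual_labels:
        raise ValueError("cannot compute accuracy of an empty test set")
    # == compares values; `is` would compare object identity instead.
    hits = sum(a == p for a, p in zip(actual_labels, predicted_labels))
    return 100.0 * hits / len(actual_labels)


# A label built at run time (as when parsing a file) equals the literal by value.
runtime_label = "-".join(["Iris", "setosa"])
print(accuracy_percent([runtime_label, "Iris-virginica"], ["Iris-setosa", "Iris-setosa"]))
```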


**Code Driver**

In [None]:
# prepare data
trainingSet = []
testSet = []
split = 0.67

loadDataset('/content/drive/MyDrive/Colab Notebooks/Iris.csv', split, trainingSet, testSet)
print ('Train set:' + repr(len(trainingSet)))
print ('Test set:' + repr(len(testSet)))

# generate predictions
predictions = []
k = 3
for x in range(len(testSet)):
  neighbors = getNeighbors(trainingSet, testSet[x], k)
  result = getResponse(neighbors)
  predictions.append(result)
  print('> Predicted=' + repr(result) + ',  Actual=' + repr(testSet[x][-1]))

accuracy = getAccuracy(testSet, predictions)
# note: getAccuracy must compare labels with == rather than `is`; otherwise equal label strings read from the file may not be counted as correct
print('Accuracy: ' + repr(accuracy) + '%')

Train set:96
Test set:54
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-setosa',  Actual='Iris-setosa'
> Predicted='Iris-versicolor',  Actual='Iris-versicolor'
> Predicted='Iris-versicolor',  Actual='Iris-versicolor'
> Predicted='Iris-versicolor',  Actual='Iris-versicolor'
> Predicted='Iris-versicolor',  Actual='Iris-versicolor'
> Predicted='Iris-versicolor',  Actual='Iris-versicolor'
> Predicted='Iris-versicolor',  Actual='Iris-versicol

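# Gaussian Naive Bayes using scikit learn

In [None]:
from sklearn import datasets
from sklearn import metrics
# note: GaussianNB is scikit-learn's Gaussian Naive Bayes classifier, not K-NN;
# scikit-learn's K-NN classifier is sklearn.neighbors.KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB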

In [None]:
dataset = datasets.load_iris() # iris dataset from sklearn's datasets

In [None]:
model = GaussianNB()
model.fit(dataset.data, dataset.target)

GaussianNB()

In [None]:
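# note: predictions are made on the same data the model was fitted on,
# so the report below measures training performance, not generalization
expected = dataset.target
predicted = model.predict(dataset.data)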

In [None]:
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.94      0.94      0.94        50
           2       0.94      0.94      0.94        50

    accuracy                           0.96       150
   macro avg       0.96      0.96      0.96       150
weighted avg       0.96      0.96      0.96       150

[[50  0  0]
 [ 0 47  3]
 [ 0  3 47]]
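
The scikit-learn cells above fit a `GaussianNB` model. For a like-for-like comparison with the hand-written K-NN, a sketch using `KNeighborsClassifier` might look like the following. The 33% hold-out, `random_state=0`, and the stratified split are assumptions made for this sketch, not settings taken from the notebook, and no output is shown because the scores depend on the split.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Hold out part of the samples so the score is not measured on training data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=0, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3, the same k as the code driver
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)

print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
```

By default `KNeighborsClassifier` uses the Minkowski metric with `p=2`, which is the Euclidean distance used in the manual implementation.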
