**What is k-Nearest Neighbours?**    
k-Nearest Neighbours finds the k training examples most similar to a test example, where similarity is measured here by Euclidean distance (smaller distance means more similar).     
It then classifies the test example by a majority vote among those k most similar training examples.  
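
To make the idea concrete before the full implementation, here is a minimal sketch on made-up 2-D points. The data, the function name `knn_vote`, and the labels `"a"`/`"b"` are illustrative assumptions, not part of the Iris notebook, and `math.dist` requires Python 3.8 or newer.

```python
import math
from collections import Counter

# Hypothetical toy data: (features, label) pairs, not taken from Iris
toy_train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
]

def knn_vote(train, point, k):
    # Sort training examples by Euclidean distance to the query point
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], point))
    # Keep the labels of the k closest examples and take the majority
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

toy_result = knn_vote(toy_train, (1.1, 1.0), k=3)
print(toy_result)  # prints: a
```

With k=3 the three nearest points carry the labels a, a, b, so the vote returns "a".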


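k-Nearest Neighbours is a non-parametric and lazy algorithm. 
>Non-parametric: Does not make any assumptions about the distribution of the underlying data.    
>Lazy: Does not build a generalized model from the training data points. There is no real training phase, so "training" is fast, but every prediction has to compute distances to the stored training examples. 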
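
The "lazy" property can be sketched as a class whose `fit` step only stores the data, while `predict` does all the distance work. `LazyKNN` and its method names are hypothetical, chosen for illustration; this is not the structure used in the rest of the notebook.

```python
import math
from collections import Counter

class LazyKNN:
    """Hypothetical minimal classifier, for illustration only."""

    def fit(self, examples, labels):
        # "Training" just stores the data; no model parameters are estimated
        self.examples = list(examples)
        self.labels = list(labels)
        return self

    def predict(self, point, k=3):
        # All the work happens here: one distance per stored example
        order = sorted(range(len(self.examples)),
                       key=lambda i: math.dist(self.examples[i], point))
        votes = Counter(self.labels[i] for i in order[:k])
        return votes.most_common(1)[0][0]

lazy_model = LazyKNN().fit([(0, 0), (0, 1), (5, 5), (6, 5)],
                           ["x", "x", "y", "y"])
print(lazy_model.predict((0.2, 0.1), k=3))
```

The cost of each prediction grows with the number of stored examples, which is the trade-off for having no training phase.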



**Implement k-Nearest Neighbours from Scratch in Python**

1) Load the Iris data set   
2) Split into training and test datasets and convert to numpy arrays   
3) Functions: 
> 3a) Create a function to find the Euclidean distance between a test example and a training example      
> 3b) Create a function to find the k nearest training examples to a test example       
> 3c) Create a function to count the class votes of the k nearest training examples; the class with the most votes is the prediction      
> 3d) Create a function to get the accuracy from the predictions and the test labels   

4) Create a function that runs functions 3a to 3c for each test example and then reports the accuracy (3d)  

In [0]:
# Import iris datasets 
from sklearn.datasets import load_iris
data = load_iris()


In [48]:
# Import libraries 
import pandas as pd
import numpy as np

# Import data to dataframe 
df = pd.DataFrame(data= np.c_[data['data'], data['target']], columns= data['feature_names'] + ['target'])
df['label'] = df.target.replace(dict(enumerate(data.target_names)))
df = df.drop(['target'], axis=1)
df.head()


,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [0]:
# Shuffle the rows, then split into training (80%) and test (20%) sets

from sklearn.utils import shuffle
df = shuffle(df)

split = 0.8
end_of_training = int(len(df) * split)

train_set = df.iloc[:end_of_training]
test_set = df.iloc[end_of_training:]

# Convert the DataFrames to numpy arrays
train_set = train_set.values
test_set = test_set.values
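
Because the shuffle above is unseeded, the rows that end up in the test set (and therefore the accuracy) change from run to run. One way to make the split repeatable is to shuffle row indices with a fixed seed. The sketch below uses only the standard library; `split_indices` and its parameters are hypothetical names, and the seed value is arbitrary.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=0):
    # Shuffle row indices with a fixed seed so the split is the same on every run
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

# Iris has 150 rows, so an 80/20 split gives 120 training and 30 test rows
train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))  # prints: 120 30
```

The resulting index lists could then be used with `df.iloc[train_idx]` and `df.iloc[test_idx]` in place of the slices above.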

In [81]:
# array[start:stop] => first 5 rows
print(train_set[:5])

# second row (row index 1; indexing starts at 0)
print(train_set[1])

# second row, first 3 columns (column indexing also starts at 0)
print(train_set[1][:3])

# array shape (120, 5) => 120 rows and 5 columns
print(train_set.shape)

# number of rows
print(train_set.shape[0])

# number of columns
print(train_set.shape[1])


[[5.2 3.4 1.4 0.2 'setosa']
 [5.6 2.9 3.6 1.3 'versicolor']
 [4.8 3.0 1.4 0.1 'setosa']
 [6.5 3.0 5.5 1.8 'virginica']
 [6.5 2.8 4.6 1.5 'versicolor']]
[5.6 2.9 3.6 1.3 'versicolor']
[5.6 2.9 3.6]
(120, 5)
120
5


In [0]:
# Find the Euclidean distance between a test example and a training example
import math


def euclideandistance(example1, example2):
  # The last element of each example is the label, so it is excluded
  n_features = example1.shape[0] - 1
  squared_diffs = [(example1[i] - example2[i]) ** 2 for i in range(n_features)]
  return math.sqrt(sum(squared_diffs))
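
The same distance can also be computed without an explicit loop. The sketch below is an alternative, not the notebook's method: `euclidean_distance_np` is a hypothetical name, and it assumes each row is laid out as four numeric features followed by a string label, like the rows of `train_set`. The two sample rows are the first two rows of the `df.head()` output.

```python
import numpy as np

def euclidean_distance_np(example1, example2):
    # Drop the trailing label and convert the features to floats
    a = np.asarray(example1[:-1], dtype=float)
    b = np.asarray(example2[:-1], dtype=float)
    # The norm of the difference vector is the Euclidean distance
    return float(np.linalg.norm(a - b))

row_a = np.array([5.1, 3.5, 1.4, 0.2, "setosa"], dtype=object)
row_b = np.array([4.9, 3.0, 1.4, 0.2, "setosa"], dtype=object)
print(euclidean_distance_np(row_a, row_b))
```

Here the squared differences are 0.04 and 0.25 (the petal measurements are identical), so the result is the square root of 0.29, roughly 0.54.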

In [0]:
# Find the k training examples closest to a test example
import operator


def getk_closest(train_set, test_example, k):
  # Collect (label, distance) pairs for every training example
  distances = []
  for train_number in range(train_set.shape[0]):
    train_example = train_set[train_number]
    distances.append((train_example[-1], euclideandistance(test_example, train_example)))
  # Sort by distance (ascending) and keep the k nearest
  distances.sort(key=operator.itemgetter(1))
  return distances[:k]
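
Sorting every distance is more work than needed when only the k smallest are used. As a possible alternative, `heapq.nsmallest` from the standard library selects the k nearest rows directly. The sketch below assumes rows are plain tuples of features followed by a label; `k_closest_heap` and the toy rows are hypothetical.

```python
import heapq
import math

def k_closest_heap(train_rows, test_row, k):
    # Each row is (feature, ..., label); the distance ignores the trailing label
    def distance(row):
        return math.dist(row[:-1], test_row[:-1])
    # nsmallest returns the k rows with the smallest distance, nearest first
    nearest = heapq.nsmallest(k, train_rows, key=distance)
    return [(row[-1], distance(row)) for row in nearest]

toy_rows = [(1.0, 1.0, "a"), (5.0, 5.0, "b"), (1.2, 0.8, "a")]
toy_neighbours = k_closest_heap(toy_rows, (1.1, 1.0, "?"), k=2)
print(toy_neighbours)
```

For a data set as small as Iris the difference is negligible; it matters more when the training set is large and k is small.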

  

In [0]:
# Get the prediction by majority vote

from collections import Counter


def prediction(k_closest):
  # k_closest holds (label, distance) pairs; vote on the labels only
  labels = [neighbour[0] for neighbour in k_closest]
  return Counter(labels).most_common(1)[0][0]
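
With three classes, a vote can end in a tie (for example k=2 with one vote each). On Python 3.7 and newer, `Counter.most_common` orders equal counts by first appearance, so with a neighbour list sorted nearest first the tie goes to the label of the nearest neighbour. The neighbour list below is hypothetical and only illustrates that behaviour.

```python
from collections import Counter

# Hypothetical neighbour list, sorted nearest first: (label, distance)
tied_neighbours = [("versicolor", 0.2), ("virginica", 0.3)]

labels = [label for label, _ in tied_neighbours]
# Both labels have one vote; the first one encountered wins the tie
tie_winner = Counter(labels).most_common(1)[0][0]
print(tie_winner)  # prints: versicolor
```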


In [0]:
# Get the accuracy (percentage of correct predictions)

def score(test_set, predictions):
  # Count the test examples whose label (last column) matches the prediction
  correct = 0
  for test_number in range(len(test_set)):
    if test_set[test_number][-1] == predictions[test_number]:
      correct += 1
  return float(correct) / len(test_set) * 100
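
As a small worked example of the accuracy arithmetic, the sketch below compares two short label lists. `accuracy_percent` is a hypothetical helper, and the labels are the first four actual/predicted pairs from the run output at the end of this notebook.

```python
def accuracy_percent(actual_labels, predicted_labels):
    # Count positions where the prediction equals the actual label
    matches = sum(a == p for a, p in zip(actual_labels, predicted_labels))
    return 100.0 * matches / len(actual_labels)

actual = ["setosa", "virginica", "virginica", "versicolor"]
predicted = ["setosa", "virginica", "virginica", "virginica"]
print(accuracy_percent(actual, predicted))  # prints: 75.0
```

Three of the four predictions match, so the accuracy is 3 / 4 * 100 = 75%.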


In [89]:
# Run the functions above for each test example to find the prediction
# Output the accuracy of the predictions

k = 3
predictions = []
def main():
  for x in range(len(test_set)):
    k_closest = getk_closest(train_set, test_set[x], k)
    # Compute the prediction once and reuse it
    predicted = prediction(k_closest)
    predictions.append(predicted)
    print('Test Number', x, '=> Predicted:', predicted, ': Actual', test_set[x][-1])
  print('Accuracy:', score(test_set, predictions), '%')


main()

Test Number 0 => Predicted: setosa : Actual setosa
Test Number 1 => Predicted: virginica : Actual virginica
Test Number 2 => Predicted: virginica : Actual virginica
Test Number 3 => Predicted: virginica : Actual versicolor
Test Number 4 => Predicted: setosa : Actual setosa
Test Number 5 => Predicted: versicolor : Actual versicolor
Test Number 6 => Predicted: versicolor : Actual versicolor
Test Number 7 => Predicted: versicolor : Actual versicolor
Test Number 8 => Predicted: setosa : Actual setosa
Test Number 9 => Predicted: setosa : Actual setosa
Test Number 10 => Predicted: versicolor : Actual versicolor
Test Number 11 => Predicted: versicolor : Actual versicolor
Test Number 12 => Predicted: virginica : Actual virginica
Test Number 13 => Predicted: virginica : Actual virginica
Test Number 14 => Predicted: versicolor : Actual versicolor
Test Number 15 => Predicted: setosa : Actual setosa
Test Number 16 => Predicted: setosa : Actual setosa
Test Number 17 => Predicted: virginica : Actual