### [KNN from scratch](https://medium.com/analytics-vidhya/implementing-k-nearest-neighbours-knn-without-using-scikit-learn-3905b4decc3c)
#### So let’s start with the implementation of KNN. It really involves just 3 simple steps:
1. Calculate the distance (Euclidean, Manhattan, etc.) between a test data point and every training data point. This tells us how near or far each training point is from the test point.
2. Sort the distances and pick the K smallest (the first K entries). These are the K closest neighbors of the given test data point.
3. Get the labels of the selected K neighbors. The most common label (the one with the majority vote) will be the predicted label for our test data point.
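
Before building the class, the three steps above can be sketched for a single query point in plain Python. This is only a minimal illustration: the helper name `knn_predict_one` and the toy 2-D data are made up for this example, and `math.dist` assumes Python 3.8 or newer.

```python
import math
from collections import Counter

def knn_predict_one(train_points, train_labels, query, k):
    """Predict the label of one query point with plain k-nearest neighbours."""
    # Step 1: distance from the query to every training point
    distances = [(math.dist(p, query), idx) for idx, p in enumerate(train_points)]
    # Step 2: sort by distance and keep the k closest entries
    nearest = sorted(distances)[:k]
    # Step 3: majority vote over the labels of those k neighbours
    votes = [train_labels[idx] for _, idx in nearest]
    return Counter(votes).most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters in 2-D
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]

print(knn_predict_one(points, labels, (1.1, 1.0), k=3))  # a
print(knn_predict_one(points, labels, (5.1, 5.0), k=3))  # b
```

The second query shows the vote at work: its three nearest neighbors are two "b" points and one distant "a" point, so "b" wins 2 to 1.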

In [62]:
import pandas as pd
import numpy as np
import scipy.spatial
from collections import Counter

In [63]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 42, test_size = 0.2)

#### We will define a class ‘KNN’ inside which we will define every essential function that will make our algorithm work. We will be having the following methods inside our class.
1. `fit`: As discussed earlier, it just stores the data, since KNN does not perform any explicit training process.
2. `distance`: We will calculate the Euclidean distance here.
3. `predict`: This is the phase where we predict the class of each test instance using the complete training data. We will implement the three-step process discussed above in this method.
4. `score`: Finally, we’ll have a score method to calculate the accuracy of our model on the test data.
5. What about ‘K’? K is the most important parameter here. We will pass it as an argument when creating an object of our KNN class (inside `__init__`).

In [64]:
class KNN:
    def __init__(self, k):
        self.k = k
        
############################################################

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y
        
############################################################

    def distance(self, X1, X2):
        # Euclidean distance between two points
        return scipy.spatial.distance.euclidean(X1, X2)
    
###########################################################

    def predict(self, X_test):
        final_output = []
        for i in range(len(X_test)):
            d = []
            votes = []
            # Step 1: distance from this test point to every stored training point
            for j in range(len(self.X_train)):
                dist = self.distance(self.X_train[j], X_test[i])
                d.append([dist, j])
            # Step 2: sort by distance and keep the k nearest
            d.sort()
            d = d[0:self.k]
            # Step 3: majority vote over the labels of those neighbors
            for _, j in d:
                votes.append(self.y_train[j])
            ans = Counter(votes).most_common(1)[0][0]
            final_output.append(ans)
            
        return final_output
    
####################################################################
    
    def score(self, X_test, y_test):
        # Convert to arrays so that == compares element by element
        predictions = np.array(self.predict(X_test))
        return (predictions == np.asarray(y_test)).sum() / len(y_test)

In [65]:
X_train[0]

array([4.6, 3.6, 1. , 0.2])

In [66]:
X_test[0]

array([6.1, 2.8, 4.7, 1.2])

In [67]:
np.sqrt((4.6 - 6.1)**2 + (3.6 - 2.8)**2 + (1 - 4.7)**2 + (.2 - 1.2)**2)

4.192851058647326

In [68]:
distance = scipy.spatial.distance.euclidean(X_train[0], X_test[0])
distance

4.192851058647326
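
As a further cross-check, the same distance can be computed with NumPy alone, since the Euclidean distance is the L2 norm of the difference vector. The sketch below reuses the two points printed above; the two-row `train` array is a made-up stand-in for a training set, only there to show how broadcasting gives the distances to many rows at once.

```python
import numpy as np

a = np.array([4.6, 3.6, 1.0, 0.2])   # X_train[0] as printed above
b = np.array([6.1, 2.8, 4.7, 1.2])   # X_test[0] as printed above

# Euclidean distance is the L2 norm of the difference vector
d_single = np.linalg.norm(a - b)

# Broadcasting: distance from one query point to every row of a 2-D array
train = np.vstack([a, b])             # hypothetical 2-row "training set"
d_all = np.linalg.norm(train - b, axis=1)

print(d_single)   # about 4.19285, matching the values above
print(d_all)      # first entry matches d_single, second is 0.0
```

The broadcast form could replace the inner Python loop over training points, although the class here keeps the explicit loop for readability.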

#### See what’s happening:
- We pass K when creating an object of the class ‘KNN’.
- The fit method just stores the training data, nothing else.
- We use scipy.spatial.distance.euclidean to calculate the distance between two points.
- The predict method loops over every test data point, each time calculating the distance between the test instance and every training instance. It stores each distance together with the index of the training point in a 2D list, sorts that list by distance, and then keeps only the K shortest distances (along with their indices).
- It then pulls out the labels of those K nearest data points and uses Counter to find which label has the majority. That majority label becomes the label of the test data point.
- The score method compares our predictions with the actual labels to find the accuracy of our prediction.
- Kudos! That’s it. It’s really that simple! Now let’s run our model and test our algorithm on the test data we split apart earlier.
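
One detail of the Counter-based vote is worth seeing in isolation: what happens when two labels tie. The votes below are made-up values, listed nearest neighbor first as in the predict method. On Python 3.7 and newer, `most_common` orders elements with equal counts by first appearance, so a tie goes to the label whose neighbor was encountered first.

```python
from collections import Counter

# Labels of the k = 4 nearest neighbours, ordered nearest first (made-up values)
votes = [2, 1, 1, 2]

counts = Counter(votes)
print(counts.most_common())         # [(2, 2), (1, 2)]
print(counts.most_common(1)[0][0])  # 2
```

Here labels 1 and 2 both receive two votes, and label 2 wins only because the nearest neighbor carries it. Picking a K that makes ties less likely (for example an odd K for a two-class problem) avoids depending on this behavior.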

In [None]:
clf = KNN(3)
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
for i in prediction:
    print(i)

In [None]:
prediction == y_test # all predictions are true

In [None]:
clf.score(X_test, y_test)
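
The accuracy reported by score is simply the fraction of predictions that match the true labels. A standalone sketch of that computation is shown below; the function name `accuracy` and the sample labels are made up for illustration. Converting both inputs to arrays matters because comparing two plain Python lists with `==` yields a single bool rather than an element-wise result.

```python
import numpy as np

def accuracy(predictions, targets):
    """Fraction of predictions that match the targets."""
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    # Element-wise comparison gives a boolean array; its mean is the accuracy
    return float(np.mean(predictions == targets))

print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 3 of 4 correct -> 0.75
```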

## `END -----------------------------------------`