# KNN-K Nearest Neighbor Algorithm

### About KNN

1. It is a Supervised Learning Model - which means that it uses Labelled Data
2. Used for both Classification and Regression - To classify data points to the categories and Regression is about predicting numbers example - Does a person get a job on the basis of educational qualification and what is salary he is going to make.
3. Can be used for non-linear data
4. we have to define the K value in K-Neighbors
5. In KNN we Try to find the Euclidean Distance or Manhattan Distance to calculate the distance between datapoint
6. In Regression we will take the mean of 5 datapoints to make the prediction

### Adavantages of KNN

1. Works well with smaller datasets with less number of features
2. Can be used for both Classification and Regression
3. Easy to implement for multi-class classification Problems
4. Different distance criteria can be used eg:manhattan distance,Euclidean Distance

### Disadvantages of KNN

1. Choosing Optimum K value
2. Less efficient with high dimensional data
3. Dosen't Perform well on imbalanced Data - imbalanced data means when we are having a very less number of instance in the minority class than the majority class
4. Very sensitive towards Outliers

## Math Behind KNN Classifier

We actually use 2 types of measurements that is Euclidean Disatance,Manhattan Distance.

The Euclidean distance formula is used to find the distance between two points on a plane. This formula says the distance between two points (x1 1 , y1 1 ) and (x2 2 , y2 2 ) is d = √(x2 – x1)^2 + (y2 – y1)^2.It is mainly used when there is low dimensionality in the features.it can have a slanted line between the datapoints.if we have more feature we will add (z2 -z1)^2 in the above formula.

Manhattan Distance = | x 1 − x 2 | + | y 1 − y 2 | is the Formula used to find the distance between the datapoints when the data is having high dimensionality in features.it always moves in a 90 degree lines.

###### Referred From siddardhan Youtube.

### Calculating Euclidean Distance and Manhattan Distance

In [1]:
#Importing the Dependencies
import numpy as np 

In [2]:
#Consider two points with 2 dimensions
p1 = (1,1)
p2 = (2,2)

In [4]:
#Calculating the Euclidean Distance
dist = (p1[0]-p2[0])**2 + (p1[1]-p2[1])**2 
eucli_dist = np.sqrt(dist)
eucli_dist

1.4142135623730951

In [5]:
#Consider two points with 3 dimensions
p1 = (1,1,1)
p2 = (2,2,2)

#Calculating the Euclidean Distance
dist = (p1[0]-p2[0])**2 + (p1[1]-p2[1])**2 + (p1[2]-p2[2])**2
eucli_dist = np.sqrt(dist)
eucli_dist

1.7320508075688772

In [8]:
#Consider two points with 4 dimensions
p1 = (1,1,1,1)
p2 = (2,2,2,2)


dist = 0 

for i in range(len(p1)):
    dist = dist + (p1[i]-p2[i])**2
    
eucli_dist = np.sqrt(dist)
print(eucli_dist)

2.0


##### Creating a Function for Euclidean Distance

In [26]:
#Both the p1 and p2 should be in same dimension
def get_eucli_dist(p1,p2):
    dist = 0 

    for i in range(len(p1)):
        dist = dist + (p1[i]-p2[i])**2
    
    eucli_dist = np.sqrt(dist)
    print(eucli_dist)

In [12]:
get_eucli_dist((1,1,1),(2,2,2))

1.7320508075688772


### Manhattan Distance

Formula for Manhattan Distance : Manhattan distance = |x2 - x1| + |y2 - y1|

In [27]:
#Creating a function for Manhattan Distance 
def get_manhattan_dist(p1,p2):
    dist = 0 

    for i in range(len(p1)):
        dist = dist + abs(p1[i]-p2[i])
    man_dist = dist
    print(man_dist)

In [19]:
get_manhattan_dist((1,1,1,1),(2,2,2,2))

4


### Calculating the distance between 2 data points from heart Dataset which is similar

In [35]:
get_eucli_dist((60,0,3,150,240,0,1,171,0,0.9,2,0,2),(66,1,0,160,228,0,0,138,0,2.3,2,0,1))

37.188170162028676


In [36]:
get_manhattan_dist((60,0,3,150,240,0,1,171,0,0.9,2,0,2),(66,1,0,160,228,0,0,138,0,2.3,2,0,1))

68.4


### Calculating the distance between 2 data points from heart Dataset which is not similar

In [33]:
get_eucli_dist((55,0,1,132,342,0,1,166,0,1.2,2,0,2),(67,1,0,160,286,0,0,108,1,1.5,1,3,2))

86.26754893933176


In [34]:
get_manhattan_dist((55,0,1,132,342,0,1,166,0,1.2,2,0,2),(67,1,0,160,286,0,0,108,1,1.5,1,3,2))

162.3


## Building K-Nearest Neighbor from Scratch

In [37]:
#Importing the Dependencies
import numpy as np 
import statistics

In [89]:
#Knn Classifier 
class KNN_Classifier():
    
    #Initiating the Parameters
    def __init__(self,distance_metric):
        
        self.distance_metric = distance_metric
        
    #getting the distance metric     
    def get_distance_metric(self,training_data,testing_data):
        
        if (self.distance_metric == 'euclidean'):
            
            dist = 0
            
            for i in range(len(training_data)-1):   #we need to take all the features excluding the target column
                dist = dist + (training_data[i] - testing_data[i])**2
                
            euclidean_dist = np.sqrt(dist)
            return euclidean_dist
        
        elif (self.distance_metric == 'manhattan'):
            
            dist=0
            
            for i in range(len(training_data)-1):
                dist = dist + abs(training_data[i]-testing_data[i])
                
            manhattan_dist = dist
            return manhattan_dist
                
        
    #Getting the nearest neighbors for the new datapoint. X_train is having all the data point and in test_data the new point.
    #We are actually using this function in inside a function which is Predict function
    def nearest_neighbors(self, X_train, test_data, k ):
        
        distance_list = []
        
        for training_data in X_train:
            
            distance = self.get_distance_metric(training_data,test_data)
            distance_list.append((training_data,distance))
            
        distance_list.sort(key=lambda x:x[1]) # this line actually sort the array distance_list according to distance
        
        
        neighbors_list = []
        
        for j in range(k):
            neighbors_list.append(distance_list[j][0])
            
        return neighbors_list
    
    
    #Predict the class of a new data point     
    def predict(self,X_train,test_data,k):
        
        neighbors = self.nearest_neighbors(X_train, test_data, k)
        
        for data in neighbors:
            label = []  #we are actually creating a list for storing the target variable values
            label.append(data[-1]) #the first 5 values of target variable  label is stored in label array 
            
        predicted_class = statistics.mode(label) #from that actually we are taking the most repeated value from the label array.
        
        return predicted_class
        

## Implementation of KNN algorithm

In [38]:
#Diabetics Prediction
#Importing the Dependencies
import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [41]:
#Loading the diabetes data to the pandas dataframe
diabetes_dataset = pd.read_csv('E:\ML\diabetes.csv')

In [42]:
diabetes_dataset.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [43]:
#No of rows and columns in dataset
diabetes_dataset.shape

(768, 9)

In [44]:
#Seperating the features and target
X = diabetes_dataset.drop(['Outcome'],axis = 1)
Y = diabetes_dataset['Outcome']

In [49]:
# Converting the dataset into numpy array 
X = np.array(X)
Y = np.array(Y)

In [50]:
print(X)
print(Y)

[[  6.    148.     72.    ...  33.6     0.627  50.   ]
 [  1.     85.     66.    ...  26.6     0.351  31.   ]
 [  8.    183.     64.    ...  23.3     0.672  32.   ]
 ...
 [  5.    121.     72.    ...  26.2     0.245  30.   ]
 [  1.    126.     60.    ...  30.1     0.349  47.   ]
 [  1.     93.     70.    ...  30.4     0.315  23.   ]]
[1 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0
 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 0
 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1
 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0
 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 1
 0 0 0 0 0 1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0
 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 0 0
 1 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0

In [74]:
#Splitting the data into training and testing
X_train, X_test, Y_train,Y_test = train_test_split(X , Y, test_size = 0.2, stratify = Y, random_state = 2)

In [75]:
print(X.shape,X_train.shape,X_test.shape)

(768, 8) (614, 8) (154, 8)


In [76]:
print(X_train)

[[0.00e+00 1.19e+02 0.00e+00 ... 3.24e+01 1.41e-01 2.40e+01]
 [6.00e+00 1.05e+02 7.00e+01 ... 3.08e+01 1.22e-01 3.70e+01]
 [1.00e+00 1.89e+02 6.00e+01 ... 3.01e+01 3.98e-01 5.90e+01]
 ...
 [1.10e+01 8.50e+01 7.40e+01 ... 3.01e+01 3.00e-01 3.50e+01]
 [4.00e+00 1.12e+02 7.80e+01 ... 3.94e+01 2.36e-01 3.80e+01]
 [0.00e+00 8.60e+01 6.80e+01 ... 3.58e+01 2.38e-01 2.50e+01]]


In [77]:
#From the above cell we understand that we dont have the target column so we have to add the target column to the X_train
X_train = np.insert(X_train,8,Y_train,axis=1)

In [78]:
print(X_train)

[[0.00e+00 1.19e+02 0.00e+00 ... 1.41e-01 2.40e+01 1.00e+00]
 [6.00e+00 1.05e+02 7.00e+01 ... 1.22e-01 3.70e+01 0.00e+00]
 [1.00e+00 1.89e+02 6.00e+01 ... 3.98e-01 5.90e+01 1.00e+00]
 ...
 [1.10e+01 8.50e+01 7.40e+01 ... 3.00e-01 3.50e+01 0.00e+00]
 [4.00e+00 1.12e+02 7.80e+01 ... 2.36e-01 3.80e+01 0.00e+00]
 [0.00e+00 8.60e+01 6.80e+01 ... 2.38e-01 2.50e+01 0.00e+00]]


In [79]:
X_train.shape

(614, 9)

In [80]:
print(X_train[:,8])

[1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0.
 0. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1.
 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1.
 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0.
 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0.
 0. 0. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0.
 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 0.
 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0.

### X_train --> Training data with features and target
### X_test --> Test Data without target

## Model Training 

In [90]:
#Loading the created Model
classifier = KNN_Classifier(distance_metric='euclidean')

#### Note : KNN can only predict one data Point at a Time

In [91]:
prediction = classifier.predict(X_train,X_test[0],5)

In [95]:
print(X_train,X_test[0])

[[0.00e+00 1.19e+02 0.00e+00 ... 1.41e-01 2.40e+01 1.00e+00]
 [6.00e+00 1.05e+02 7.00e+01 ... 1.22e-01 3.70e+01 0.00e+00]
 [1.00e+00 1.89e+02 6.00e+01 ... 3.98e-01 5.90e+01 1.00e+00]
 ...
 [1.10e+01 8.50e+01 7.40e+01 ... 3.00e-01 3.50e+01 0.00e+00]
 [4.00e+00 1.12e+02 7.80e+01 ... 2.36e-01 3.80e+01 0.00e+00]
 [0.00e+00 8.60e+01 6.80e+01 ... 2.38e-01 2.50e+01 0.00e+00]] [  3.    106.     72.      0.      0.     25.8     0.207  27.   ]


In [96]:
X_test.shape

(154, 8)

In [98]:
#Now the values predicted from the classifier is one output so if we need multiple test we can use a for loop for it
X_test_size = X_test.shape[0]

In [99]:
X_test_size

154

In [100]:
y_pred = []
for i in range(X_test_size):
    prediction = classifier.predict(X_train,X_test[i],k=5)
    y_pred.append(prediction)


In [101]:
print(y_pred)

[0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]


In [102]:
y_true = y_pred

## Model Evaluation

In [104]:
accuracy = accuracy_score(y_true,Y_test)

In [106]:
print(accuracy*100)

69.48051948051948
