## Introduction

The KNN (k-nearest neighbors) algorithm classifies a point based on its surroundings: we find the points nearest to the point we want to classify and assign it to the group that those nearby points belong to.

For example, if a point is surrounded by groups of red points and yellow points, we categorize it as yellow or red based on the points closest to it.
In our case, if the distance between our point and the red points is 1.22 and the distance to the yellow points is 3.45, we would categorize our point as red, because the red points are closer.

We can also set the k value, which is the number of neighbors we wish to consider when classifying our point. We calculate the distance from our point to the labelled points, keep the k closest ones, and then make a decision. For example, if k is 5, we find the 5 nearest points and give our point the color that is most common among them.
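
To make the role of k concrete, here is a minimal sketch on made-up data. The points, the query location and the `classify` helper are all hypothetical and only illustrate the idea; they are not part of the notebook's own code. The same query point gets a different color depending on how many neighbors are consulted.

```python
from collections import Counter
from math import dist

# Hypothetical labelled points: (x, y, colour)
points = [
    (1.0, 1.0, "red"),
    (1.5, 2.0, "red"),
    (2.0, 1.0, "red"),
    (5.0, 5.0, "yellow"),
    (6.0, 5.5, "yellow"),
    (5.5, 6.5, "yellow"),
    (2.5, 2.5, "yellow"),
]
query = (2.4, 2.4)

def classify(query, points, k):
    # Sort the labelled points by their distance to the query point
    nearest = sorted(points, key=lambda p: dist(query, p[:2]))[:k]
    # Majority vote among the k closest labels
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

label_k1 = classify(query, points, 1)  # only the single closest point votes
label_k5 = classify(query, points, 5)  # the five closest points vote
print(label_k1, label_k5)
```

With k=1 the lone yellow point next to the query decides the result, while with k=5 the three nearby red points outvote it, which is why the choice of k matters.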


## Behind the algorithm

We calculate the distance between each point in the training data and the point we want to predict.

$$\text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

With more than two features, one squared difference per feature is added under the square root.

Then we arrange the distances in ascending order.
Then, based on the k value, we take that many of the closest points and assign the class that appears most often among them, i.e. if k=5 we look at the first 5 points and if it is 10 we look at the first 10 points.
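
As a worked instance of the distance step, take two hypothetical points (1, 2) and (4, 6), chosen only for illustration. The squared differences are added under the square root:

```latex
d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
  = \sqrt{(4 - 1)^2 + (6 - 2)^2}
  = \sqrt{9 + 16}
  = 5
```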


### Importing all necessary libraries
##### Here we import all the libraries that we will use in our code

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from math import sqrt
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 

### The code below is used to import the dataset

#### Here we also drop the "Name" column: it is a string variable that would not help us in this problem, and removing it lets us treat all of our data as floats.

In [2]:
# import the dataset
df=pd.read_csv("/Users/sarankaja/downloads/NBA/nba_logreg.csv")
df=df.drop('Name',axis=1)

### We use the command below to see what our data looks like

In [3]:

df.head()

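| | GP | MIN | PTS | FGM | FGA | FG% | 3P Made | 3PA | 3P% | FTM | FTA | FT% | OREB | DREB | REB | AST | STL | BLK | TOV | TARGET_5Yrs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 27.4 | 7.4 | 2.6 | 7.6 | 34.7 | 0.5 | 2.1 | 25.0 | 1.6 | 2.3 | 69.9 | 0.7 | 3.4 | 4.1 | 1.9 | 0.4 | 0.4 | 1.3 | 0.0 |
| 1 | 35 | 26.9 | 7.2 | 2.0 | 6.7 | 29.6 | 0.7 | 2.8 | 23.5 | 2.6 | 3.4 | 76.5 | 0.5 | 2.0 | 2.4 | 3.7 | 1.1 | 0.5 | 1.6 | 0.0 |
| 2 | 74 | 15.3 | 5.2 | 2.0 | 4.7 | 42.2 | 0.4 | 1.7 | 24.4 | 0.9 | 1.3 | 67.0 | 0.5 | 1.7 | 2.2 | 1.0 | 0.5 | 0.3 | 1.0 | 0.0 |
| 3 | 58 | 11.6 | 5.7 | 2.3 | 5.5 | 42.6 | 0.1 | 0.5 | 22.6 | 0.9 | 1.3 | 68.9 | 1.0 | 0.9 | 1.9 | 0.8 | 0.6 | 0.1 | 1.0 | 1.0 |
| 4 | 48 | 11.5 | 4.5 | 1.6 | 3.0 | 52.4 | 0.0 | 0.1 | 0.0 | 1.3 | 1.9 | 67.4 | 1.0 | 1.5 | 2.5 | 0.3 | 0.3 | 0.4 | 0.8 | 1.0 |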


## Functions

### Euclidean distance
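#### In this function we calculate the distance between a row of our test data and a row of our train data by using the distance formula. The last column of each row holds the class label, so it is left out of the sum.

$$\text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$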

### Locate Neighbors
#### Here we sort the distances calculated with the Euclidean distance function, store them, and return the list of the nearest neighbours. These are the training rows that are closest to our test row.

### Making Prediction
#### Here we use the data from our neighbours to make a prediction by returning the class that appears the most times among them.

In [4]:

# Calculate the Euclidean distance
def euclidean_distance(row1, pred):
    dist = 0.0
    for i in range(len(row1)-1):
        dist += (row1[i] - pred[i])**2
    return sqrt(dist)
 
# Locate neighbors
def get_neighbors(train, test, neighbors):
    distances = list()
    for train_row in train:
        distance1 = euclidean_distance(test, train_row)
        distances.append((train_row, distance1))
    distances.sort(key=lambda tup: tup[1])
    neighbors_list = list()
    for i in range(neighbors):
        neighbors_list.append(distances[i][0])
    return neighbors_list
 
# Make a prediction 
def predict(train, test, neighbors):
    neighbors = get_neighbors(train, test, neighbors)
    output = [row[-1] for row in neighbors]
    prediction = max(set(output), key=output.count)
    return prediction
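
For comparison, the same three steps (distance, nearest neighbours, majority vote) can be sketched in vectorized form with NumPy. This is only an illustrative alternative, not the notebook's method: `predict_numpy` and the toy rows are hypothetical, the last column is assumed to hold the class label, and ties may be broken differently than in the functions above.

```python
import numpy as np

def predict_numpy(train, test_row, k):
    """Predict a label by majority vote among the k nearest training rows.

    Assumes the last column of every row is the class label.
    """
    train = np.asarray(train, dtype=float)
    test_row = np.asarray(test_row, dtype=float)
    # Euclidean distance from the test row to every training row (features only)
    distances = np.sqrt(((train[:, :-1] - test_row[:-1]) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances, kind="stable")[:k]
    # Majority vote over the labels of those rows
    labels, counts = np.unique(train[nearest, -1], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical toy data: two features followed by a 0/1 label
toy_train = [[1.0, 1.0, 0.0], [1.2, 0.8, 0.0], [0.9, 1.1, 0.0],
             [5.0, 5.0, 1.0], [5.2, 4.9, 1.0], [4.8, 5.1, 1.0]]
toy_test = [4.9, 5.0, 1.0]
toy_prediction = predict_numpy(toy_train, toy_test, 3)
print(toy_prediction)
```

Computing all distances in one array operation avoids the Python-level loop over training rows, which tends to matter once the training set grows.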
 


### Here we split our data set into train and test sets: the first 1000 rows of our dataset form the train set and the remaining rows form the test set.

### Then we call all the functions defined above to run the KNN model for different k values, i.e. 5, 25 and 51, and compare the predictions with our actual y values to get the accuracy and the confusion matrix (which counts the true positives, false positives, true negatives and false negatives).

In [5]:


# Split the rows: the first 1000 form the training set, the rest the test set
rows = df.values.tolist()
trainSet = rows[:1000]
testSet = rows[1000:]

# True labels of the test rows (TARGET_5Yrs is the last column, index 19)
actual_class = [row[19] for row in testSet]

# Run the KNN model for each k and compare the predictions with the true labels
for k in (5, 25, 51):
    predicted = [predict(trainSet, row, k) for row in testSet]
    results = confusion_matrix(actual_class, predicted)
    print('confusion matrix for %d:' % k, results)
    print('Accuracy Score for k=%d :' % k, accuracy_score(actual_class, predicted))
        

confusion matrix for 5: [[ 66  45]
 [ 85 144]]
Accuracy Score for k=5 : 0.6176470588235294
confusion matrix for 25: [[ 77  34]
 [ 70 159]]
Accuracy Score for k=25 : 0.6941176470588235
confusion matrix for 51: [[ 73  38]
 [ 66 163]]
Accuracy Score for k=51 : 0.6941176470588235


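### We can see the accuracies and confusion matrices for the various k values. The accuracy is higher for the larger k values: about 0.69 for k=25 and k=51, compared with about 0.62 for k=5. Another interesting fact is that the accuracy is the same for both 25 and 51 even though their confusion matrices differ: each gets 236 of the 340 test rows right, but the errors are split differently between the two classes.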
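
To see how the accuracy follows from a confusion matrix, the short sketch below recomputes it from the matrices printed above. It relies on scikit-learn's layout, where rows are the actual classes and columns are the predicted classes, so the correct predictions sit on the diagonal. The `matrices` and `accuracies` names are only illustrative.

```python
import numpy as np

# Confusion matrices copied from the printed output
# (rows: actual class 0/1, columns: predicted class 0/1)
matrices = {
    5: np.array([[66, 45], [85, 144]]),
    25: np.array([[77, 34], [70, 159]]),
    51: np.array([[73, 38], [66, 163]]),
}

accuracies = {}
for k, cm in matrices.items():
    # Accuracy = correct predictions (diagonal) / all predictions
    accuracies[k] = np.trace(cm) / cm.sum()
    print(k, accuracies[k])
```

For k=25 and k=51 the diagonals both add up to 236, which is why the two accuracies agree although the individual cells do not.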