# KNN Classification

K-nearest neighbors (KNN) is an algorithm that predicts or classifies a result from labeled examples, that is, data inputs paired with their known outputs. For this project, I am trying to determine how accurately KNN can classify the distance between an RFID reader (transmitter) and an RFID tag (receiver), using the data collected for the project.

The data was collected by placing an antenna connected to an RFID reader at different points in a 28 sq ft grid. The RFID reader received backscattered RSSI data from a passive RFID tag, which was placed in a fixed location. The antenna was mounted on a rotating platform and transmitted and received at 30-degree intervals. Once the antenna had completed a full 360-degree rotation, it was moved to the next point in the grid and the process was repeated. After cleaning, 16,183 data points remained for use with the algorithm.

The distance class (`dist1`) gives the distance range in which the signal was received.


## Explanation of Column Names
- x.coord = X-coordinate of grid used for testing  
- y.coord = Y-coordinate of grid used for testing  
- angle = Current compass angle of the antenna (fixed 30-degree steps)  
- distance = Measured distance from receiver to transmitter  
- RSSI = Received Signal Strength Indicator, a measure of how strong the signal from the transmitter is when it reaches the receiver  
- rel = Angle of the antenna relative to the RFID tag  
- dist1 = Distance range in which the receiver was placed (class label)  
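
The `dist1` labels in the sample rows, such as `(4,5]`, look like half-open intervals that exclude the lower edge and include the upper edge. As a rough sketch of how a distance could map to such a label, the snippet below assumes 1-unit-wide bins. The bin width is an assumption (only one label is visible in the sample rows), and `distance_bin` is a hypothetical helper, not part of the project code.

```python
import math

def distance_bin(distance: float) -> str:
    """Return an interval label such as '(4,5]' for a positive distance.

    Assumes 1-unit-wide bins that exclude the lower edge and include the
    upper edge. The width is an assumption, not taken from the dataset.
    """
    upper = math.ceil(distance)  # a distance on the edge belongs to the lower bin
    return f"({upper - 1},{upper}]"

print(distance_bin(5.0))       # (4,5]
print(distance_bin(4.472136))  # (4,5]
```

Both distances shown in the sample rows fall in the same class under this assumption, which matches the `dist1` values listed for them.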

In [2]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing, utils

#import dataset
df = pd.read_csv(r'D:\Data Practice\KnnData2.csv')
df.head()

Unnamed: 0,ID,x.coord,y.coord,angle,distance,RSSI,rel,dist1
0,6,0,0,150,5.0,-57,113.130102,"(4,5]"
1,7,0,0,180,5.0,-59,143.130102,"(4,5]"
2,9,0,0,240,5.0,-59,-156.869898,"(4,5]"
3,11,0,0,300,5.0,-58,-96.869898,"(4,5]"
4,14,0,1,30,4.472136,-59,3.434949,"(4,5]"


For the KNN analysis, the inputs are the angle at which the transmission occurs and the RSSI at that angle. The output is the distance class, `dist1`. The angle was originally recorded at fixed compass positions. To capture the angle of the antenna relative to the RFID tag, the column `rel` is used instead. Both angle measures will be tried in order to compare which one gives the more accurate model.
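
The sample rows shown earlier are consistent with `rel` being the compass angle minus the bearing from the antenna's grid position to the tag, wrapped into the range -180 to 180 degrees. The sketch below reproduces those rows under the assumption that the tag sits at grid point (4, 3). That location is inferred from the sample rows rather than stated in the project, and `relative_angle` is a hypothetical helper.

```python
import math

TAG_POSITION = (4, 3)  # assumed tag location, inferred from the sample rows

def relative_angle(x, y, angle, tag=TAG_POSITION):
    """Compass angle minus the bearing to the tag, wrapped into [-180, 180)."""
    bearing = math.degrees(math.atan2(tag[1] - y, tag[0] - x))
    return (angle - bearing + 180) % 360 - 180

# (x.coord, y.coord, angle) taken from the sample rows
print(round(relative_angle(0, 0, 150), 6))  # sample row lists rel = 113.130102
print(round(relative_angle(0, 0, 240), 6))  # sample row lists rel = -156.869898
print(round(relative_angle(0, 1, 30), 6))   # sample row lists rel = 3.434949
```

If the real tag position or angle convention differs, only `TAG_POSITION` and the sign of the subtraction would need to change.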

In [8]:
#inputs and outputs

#Compass Angle
feature_cols1 = ['angle', 'RSSI']
x = df[feature_cols1]

y = df.dist1

## KNN

> K-nearest neighbors is an algorithm used for classification or
regression analysis. The algorithm is used to determine the
closest amount, k, of training instances to a specific parameter.
In both the regression and classification analysis, due to the use
of continuous variables, the identification of the nearest training
instances is found through the Euclidean distance defined as:  
$$\sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2} \tag{2}$$
where $p = (p_1, p_2)$ and $q = (q_1, q_2)$ are defined as two points
in a Cartesian, Euclidean space.
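
As a small illustration of the distance in the quoted definition, the sketch below computes it for two 2-D points. The point values are made up for the example and `euclidean_distance` is a hypothetical helper name.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two 2-D points p and q."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Two illustrative (rel, RSSI) points; the values are invented for this example.
p = (3.0, -59.0)
q = (0.0, -55.0)
print(euclidean_distance(p, q))  # 5.0
```

Because both coordinates enter the sum directly, a feature with a larger numeric range has more influence on which neighbors count as nearest.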

For classification, KNN uses a majority voting system among the k nearest neighbors. [This article](https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/) goes more in depth on the KNN algorithm. In the image provided, each circle encloses the k nearest neighbors of the point being classified. The predicted class is the plurality class inside the circle, so for:

- k = 3, the result is a Red Triangle  
- k = 5, the result is a Blue Square
- k = 11, the result is a Blue Square  
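
The voting rule can be sketched in a few lines of plain Python. This is a minimal illustration on made-up toy points, not the scikit-learn implementation used in this project, and `knn_predict` is a hypothetical helper.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Label `query` with the most common label among its k nearest points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (made up): two triangles sit close to the query, squares farther out.
points = [(1, 0), (0, 1), (2, 0), (0, 2), (2, 2), (3, 0), (0, 3)]
labels = ["triangle", "triangle", "square", "square", "square", "square", "square"]

print(knn_predict(points, labels, query=(0, 0), k=3))  # triangle
print(knn_predict(points, labels, query=(0, 0), k=5))  # square
```

As in the list above, the prediction flips once k grows large enough for the more numerous class to outvote the closest points.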

Similar to my [Damwon League of Legends Project](https://github.com/inm2/Damwon-Analysis), the data must be split into training and testing sets so that the model is scored on data it was not fit on, which exposes overfitting. I chose an 80/20 split.

I chose a value of k = 5 for the number of neighbors in every model, to keep the measurements standard.

In [11]:
#(80/20 Split - Compass Angle)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = .2, random_state=1)

#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors = 5)

#Fit the classifier to the data
knn.fit(xtrain, ytrain)

#first 5 model predictions on the test data
#(only displayed if this is the last expression in the cell)
knn.predict(xtest)[:5]

#check accuracy of the model on the test data
print("Knn Score (Compass Angle) = ")
print(knn.score(xtest, ytest))

Knn Score (Compass Angle) = 
0.5146740809391411


The accuracy (KNN score) using the compass angle as a feature is 51.47%. This value is pretty low, which is a result of the compass angle being held at the same fixed positions at every grid point, so it does not reflect where the antenna points relative to the tag. The process is repeated for the relative angle, and the result is about 90.8% accuracy.

In [12]:
#inputs and outputs

#Relative Angle
feature_cols2 = ['rel', 'RSSI']
x = df[feature_cols2]

y = df.dist1

#(80/20 Split - Relative Angle)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = .2, random_state=1)

#Create KNN Classifier (k = 5 in this case)
knn = KNeighborsClassifier(n_neighbors = 5)

#Fit the classifier to the data
knn.fit(xtrain, ytrain)

#first 5 model predictions on the test data
#(only displayed if this is the last expression in the cell)
knn.predict(xtest)[:5]

#check accuracy of the model on the test data
print("Knn Score (Relative Angle) = ")
print(knn.score(xtest, ytest))

Knn Score (Relative Angle) = 
0.9076305220883534


## Cross-Validation

In the example above, `train_test_split` shuffles the data before splitting it (with `random_state=1` so the split is reproducible), and the model is scored on a test set it never saw during fitting, which guards against an overly optimistic accuracy estimate. A single split can still be lucky or unlucky, however. One option is to hold out yet another part of the data as [a validation set](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation). Instead of carving the data into ever smaller subsets, k-fold cross-validation splits it into k smaller sets (folds); each fold is used once for scoring while the model is fit on the remaining k - 1 folds.

For this project, the relative angle data is put through 5-fold and 10-fold cross-validation. For each fold, the KNN score (accuracy) of the model is computed. The mean of the fold accuracies is then used as the overall estimate of the model's accuracy.
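
To make the fold mechanics concrete, here is a minimal sketch of plain k-fold index splitting followed by the averaging step. `k_fold_indices` is a hypothetical helper written for illustration. scikit-learn's `cross_val_score`, when given an integer `cv` and a classifier, uses stratified folds that preserve class proportions, which this sketch does not attempt.

```python
import statistics

def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # first folds absorb the remainder
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Each of the 10 samples lands in exactly one test fold.
for train, test in k_fold_indices(n_samples=10, n_folds=5):
    print("test:", test, "train:", train)

# The summary figure is the mean of the per-fold accuracies, shown here with
# the 5-fold scores printed by the cross-validation cell in this section.
reported_scores = [0.89935165, 0.90358467, 0.8974042, 0.92212608, 0.91131026]
mean_accuracy = statistics.mean(reported_scores)
print(round(mean_accuracy, 4))  # 0.9068
```

The spread of the individual fold scores is worth reading alongside the mean, since it shows how much the estimate depends on which rows are held out.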

In [17]:
#Using cv = 5 cross-validation
knn_cv = KNeighborsClassifier(n_neighbors = 5)
cv_scores1 = cross_val_score(knn_cv, x, y, cv = 5)

print('Cross Validation Score (cv = 5): ')
print(cv_scores1)
print('cv_scores1 (Relative Angle 5-fold) mean:{}'.format(np.mean(cv_scores1)))

#Using cv = 10 cross-validation
knn_cv = KNeighborsClassifier(n_neighbors = 5)
cv_scores2 = cross_val_score(knn_cv, x, y, cv = 10)

print('\nCross Validation Score (cv = 10): ')
print(cv_scores2)
print('cv_scores2 (Relative Angle 10-fold) mean:{}'.format(np.mean(cv_scores2)))

Cross Validation Score (cv = 5): 
[0.89935165 0.90358467 0.8974042  0.92212608 0.91131026]
cv_scores1 (Relative Angle 5-fold) mean:0.906755373612161

Cross Validation Score (cv = 10): 
[0.91178285 0.91352687 0.92279185 0.90426189 0.90179123 0.90543881
 0.91898578 0.92269635 0.91960421 0.89486704]
cv_scores2 (Relative Angle 10-fold) mean:0.9115746868347154


## Hypothesis Testing

Judging by the accuracy of the compass angle model versus the relative angle model, the relative angle is clearly more accurate. Since we can already infer that the relative angle is the superior feature, a formal hypothesis test is not strictly necessary here.

In general, however, testing whether an observed difference is statistically significant is an important step when comparing an existing approach with a new one, as in an A/B test. Since the fold accuracies for the relative angle are already available, two tests can be applied to the fold accuracies of the two models: a t-test, to determine whether the two samples are significantly different, and a correlation test, to determine whether they depend on each other.  

A t-test is used to determine whether the means of two independent samples are significantly different.

Hypothesis:  
H0: The means of both data sets are equal. (Not rejected if p > 0.05.)  
H1: The means of both data sets are unequal.  

A correlation test is used to determine whether there is a linear relationship between two data sets.

Hypothesis:  
H0: Both data sets are independent of each other. (Not rejected if p > 0.05.)  
H1: Both data sets have some sort of dependency.  

The relative angle model has already been evaluated with 10-fold cross-validation, so only the compass angle model still needs to be evaluated the same way.

The code for the tests below is adapted from [this cheat sheet of statistical hypothesis tests in Python](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/), which also covers more evaluation tests that can be done.

In [19]:
#Compass Angle
feature_cols1 = ['angle', 'RSSI']
x_angle = df[feature_cols1]
cv_scores3 = cross_val_score(knn_cv, x_angle, y, cv = 10)

print('\nCross Validation Score (cv = 10): ')
print(cv_scores3)
print('cv_scores3 (Compass Angle 10-fold) mean:{}'.format(np.mean(cv_scores3)))


Cross Validation Score (cv = 10): 
[0.50771129 0.52439778 0.48239654 0.51945645 0.51080914 0.48207664
 0.53741497 0.54421769 0.54050711 0.51205937]
cv_scores3 (Compass Angle 10-fold) mean:0.5161046974878054


In [23]:
from scipy.stats import ttest_ind, pearsonr

In [25]:
#t-test
stat, p = ttest_ind(cv_scores2, cv_scores3)
print('stat=%.3f, p=%.3f' % (stat, p))

if p > 0.05:
    print('Near same distribution')
else:
    print('Different distributions')

#Correlation test
stat2, p2 = pearsonr(cv_scores2, cv_scores3)
print('stat=%.3f, p=%.3f' % (stat2, p2))
if p2 > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')

stat=52.190, p=0.000
Different distributions
stat=0.318, p=0.370
Probably independent


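For the t-test, the p-value is less than 0.05, so the null hypothesis, H0, is rejected: the mean accuracies of the relative angle and compass angle models are significantly different.

For the correlation test, the p-value (0.370) is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence that the fold accuracies of the two models are correlated.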
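
As a cross-check, both statistics can be recomputed from the printed fold scores using only the standard library. This sketch assumes the pooled-variance (equal variances) form of the t statistic, which is what `ttest_ind` uses by default. `students_t` and `pearson_r` are hypothetical helper names, and the p-values are not recomputed here.

```python
import math
import statistics

# 10-fold accuracies copied from the outputs above
rel_scores = [0.91178285, 0.91352687, 0.92279185, 0.90426189, 0.90179123,
              0.90543881, 0.91898578, 0.92269635, 0.91960421, 0.89486704]
compass_scores = [0.50771129, 0.52439778, 0.48239654, 0.51945645, 0.51080914,
                  0.48207664, 0.53741497, 0.54421769, 0.54050711, 0.51205937]

def students_t(a, b):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_err

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

print(f"t = {students_t(rel_scores, compass_scores):.3f}")
print(f"r = {pearson_r(rel_scores, compass_scores):.3f}")
```

The t statistic is large because the gap between the two mean accuracies (about 0.40) is many times the standard error of that gap.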