# K-Nearest Neighbors

We now turn to what is perhaps the easiest supervised learning model to understand, and one that is surprisingly effective. Its simplicity limits how far it extends to more complex problems, but where it fits, K-Nearest Neighbors is an important model to keep in mind. Let's take a look:

K-Nearest Neighbors (KNN) is a supervised learning algorithm that can be used for both classification and regression. To classify a new data point, KNN finds the k nearest neighbors in the training dataset, where k is a user-defined parameter, and assigns the most frequent class among those neighbors. In regression, it predicts the average target value of the k nearest neighbors.

Training KNN is trivial: the algorithm simply stores the training dataset in memory so that distances to new data points can be computed at prediction time. When a new data point arrives, the algorithm calculates the distance between that point and every point in the training dataset, then selects the k nearest neighbors based on these distances.
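The store-then-search procedure described above can be sketched in a few lines of NumPy. This is a minimal, brute-force illustration, not the scikit-learn implementation; the class name `SimpleKNN`, its method names, and the toy data are all made up for this example.

```python
import numpy as np
from collections import Counter


class SimpleKNN:
    """Illustrative brute-force KNN (hypothetical helper, not a library class)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no parameters are learned.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def _nearest(self, x):
        # Euclidean distance from x to every stored training point.
        x = np.asarray(x, dtype=float)
        dists = np.sqrt(((self.X_train - x) ** 2).sum(axis=1))
        # Indices of the k smallest distances.
        return np.argsort(dists)[: self.k]

    def predict_class(self, x):
        # Classification: most frequent label among the k neighbors.
        labels = self.y_train[self._nearest(x)]
        return Counter(labels.tolist()).most_common(1)[0][0]

    def predict_value(self, x):
        # Regression: mean target value among the k neighbors.
        return float(self.y_train[self._nearest(x)].mean())


# Toy data (invented for illustration): two small, well-separated clusters.
X_toy = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels_toy = ["a", "a", "a", "b", "b", "b"]
targets_toy = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

clf = SimpleKNN(k=3).fit(X_toy, labels_toy)
reg = SimpleKNN(k=3).fit(X_toy, targets_toy)

pred_label = clf.predict_class([0.5, 0.5])  # all three neighbors are labeled "a"
pred_value = reg.predict_value([5.5, 5.5])  # mean of 10.0, 11.0 and 12.0
print(pred_label, pred_value)
```

Note that every prediction scans the whole training set, which is why the cost of KNN sits at prediction time rather than at training time.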

The distance between two data points can be calculated using various metrics, such as Euclidean distance, Manhattan distance, or the more general Minkowski distance, of which the first two are special cases. Euclidean distance is the most commonly used metric; it is defined as the square root of the sum of the squared differences between the corresponding features of the two data points.
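As a small sketch of these metrics, the Minkowski distance of order p can be written once and specialized: p = 1 gives Manhattan distance and p = 2 gives Euclidean distance. The function names below are chosen for this illustration only.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


def euclidean_distance(a, b):
    # Special case p = 2: square root of the sum of squared differences.
    return minkowski_distance(a, b, p=2)


def manhattan_distance(a, b):
    # Special case p = 1: sum of absolute differences.
    return minkowski_distance(a, b, p=1)


u = [0.0, 0.0]
v = [3.0, 4.0]
d_euclid = euclidean_distance(u, v)      # sqrt(9 + 16) = 5.0
d_manhattan = manhattan_distance(u, v)   # 3 + 4 = 7.0
d_cubic = minkowski_distance(u, v, p=3)  # lies between the max difference (4) and 5
print(d_euclid, d_manhattan, d_cubic)
```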

One potential modification to the KNN algorithm is to use weighted voting instead of simple majority voting. Instead of counting the number of neighbors in each class, the algorithm weights each neighbor by its distance to the new data point. Closer neighbors receive higher weights and therefore contribute more to the final classification decision.
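To make the difference concrete, here is a small sketch comparing a plain majority vote with an inverse-distance weighted vote. Inverse distance is only one possible weighting scheme, and the neighbor distances and labels below are invented for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    # Each neighbor counts equally.
    return Counter(labels).most_common(1)[0][0]


def weighted_vote(distances, labels, eps=1e-12):
    # Each neighbor contributes 1 / distance; eps guards against division by zero.
    totals = defaultdict(float)
    for d, label in zip(distances, labels):
        totals[label] += 1.0 / (d + eps)
    return max(totals, key=totals.get)


# Hypothetical neighbor set for k = 5: two close "a" points, three distant "b" points.
neighbor_dists = [0.5, 0.6, 2.0, 2.5, 3.0]
neighbor_labels = ["a", "a", "b", "b", "b"]

plain = majority_vote(neighbor_labels)                    # "b" wins 3 votes to 2
weighted = weighted_vote(neighbor_dists, neighbor_labels)  # "a" wins on total weight
print(plain, weighted)
```

In scikit-learn, the same idea is available by passing `weights='distance'` to `KNeighborsClassifier`.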

KNN is suitable for datasets where the decision boundaries are irregular or nonlinear. It can also work well with small datasets where the number of training examples is relatively low. However, KNN may not perform well on datasets with a large number of features, because distances become less informative as dimensionality grows, or where the number of classes is very large.

The advantages of KNN include its simplicity, flexibility, and ability to handle both classification and regression tasks. It makes no assumptions about the underlying data distribution and, with a sufficiently large k, can tolerate some noise in the data. The disadvantages of KNN include its computational cost at prediction time, as the algorithm needs to calculate the distance between each new data point and every training data point, which can be time-consuming for large datasets. Another potential drawback is that KNN can be sensitive to the choice of distance metric and the value of k.
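One common way to handle the sensitivity to k and to the distance metric is to compare a few settings with cross-validation. The sketch below does this on synthetic data from `make_blobs` (an assumption for illustration, not the basketball dataset used later); the candidate values for k and for the Minkowski order p are arbitrary choices.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, overlapping 2-D clusters (illustrative data only).
X_syn, y_syn = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)

k_values = [1, 3, 5, 11, 21]  # candidate neighbor counts (arbitrary)
p_values = [1, 2]             # Minkowski order: 1 = Manhattan, 2 = Euclidean

mean_scores = {}
for k in k_values:
    for p in p_values:
        model = KNeighborsClassifier(n_neighbors=k, p=p)
        # 5-fold cross-validated accuracy for this (k, p) pair.
        scores = cross_val_score(model, X_syn, y_syn, cv=5)
        mean_scores[(k, p)] = scores.mean()

best_k, best_p = max(mean_scores, key=mean_scores.get)
for (k, p), score in mean_scores.items():
    print(f"k={k:>2}, p={p}: mean CV accuracy = {score:.3f}")
print("best setting:", best_k, best_p)
```

The scores usually differ noticeably between settings, which is the sensitivity described above; the best pair depends on the data and should not be read as a general recommendation.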

We will only look at KNN as a classifier. Let's look at how well this model can predict a team's seed based on two of its other features: ADJOE and BARTHAG.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import sklearn as sk

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/joshyaffee/A-First-Semester-of-Machine-Learning---INDE-577/main/Datasets/cbb.csv")
df = df.dropna()
df = df[df['SEED'].isin([1, 5, 9, 13, 16])]
df.head()

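| Unnamed: 0 | TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | ... | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | North Carolina | ACC | 40 | 33 | 123.3 | 94.9 | 0.9531 | 52.6 | 48.1 | 15.4 | ... | 30.4 | 53.9 | 44.6 | 32.7 | 36.2 | 71.7 | 8.6 | 2ND | 1.0 | 2016 |
| 1 | Wisconsin | B10 | 40 | 36 | 129.1 | 93.6 | 0.9758 | 54.8 | 47.7 | 12.4 | ... | 22.4 | 54.8 | 44.7 | 36.5 | 37.5 | 59.3 | 11.3 | 2ND | 1.0 | 2015 |
| 4 | Gonzaga | WCC | 39 | 37 | 117.8 | 86.3 | 0.9728 | 56.6 | 41.1 | 16.2 | ... | 26.9 | 56.3 | 40.0 | 38.2 | 29.0 | 71.5 | 7.7 | 2ND | 1.0 | 2017 |
| 7 | Duke | ACC | 39 | 35 | 125.2 | 90.6 | 0.9764 | 56.6 | 46.5 | 16.3 | ... | 23.9 | 55.9 | 46.3 | 38.7 | 31.4 | 66.4 | 10.7 | Champions | 1.0 | 2015 |
| 8 | Virginia | ACC | 38 | 35 | 123.0 | 89.9 | 0.9736 | 55.2 | 44.7 | 14.7 | ... | 26.3 | 52.5 | 45.7 | 39.5 | 28.9 | 60.7 | 11.1 | Champions | 1.0 | 2019 |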


In [3]:
from sklearn.model_selection import train_test_split
X = df.loc[:, ['ADJOE', 'BARTHAG']]
# select SEED as a 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning
y = df['SEED']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify = y)

In [4]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, ytrain)
y_pred_train = knn.predict(Xtrain)
y_pred_test = knn.predict(Xtest)


In [5]:
# create a scatter plot of the correct labels
fig = px.scatter(df, x='ADJOE', y='BARTHAG', color='SEED', hover_data=['TEAM'])
fig.show()

In [6]:
# collect actual and predicted labels alongside the features, then plot the actual training labels
train_df = Xtrain.copy()
train_df['ytrain'] = ytrain
test_df = Xtest.copy()
test_df['ytest'] = ytest
train_df['y_pred'] = y_pred_train
test_df['y_pred'] = y_pred_test

fig = px.scatter(train_df, x='ADJOE', y='BARTHAG', color='ytrain', title = 'Training Data, Actual Labels')
fig.show()

In [7]:
fig = px.scatter(train_df, x='ADJOE', y='BARTHAG', color='y_pred', title = 'Training Data, Predicted Labels')
fig.show()

In [8]:
# calculate accuracy: the proportion of correctly predicted labels
from sklearn.metrics import accuracy_score
accuracy_score(ytrain, y_pred_train)

0.6434782608695652

In [9]:
fig = px.scatter(test_df, x='ADJOE', y='BARTHAG', color='ytest', title = 'Testing Data, Actual Labels')
fig.show()

In [10]:
fig = px.scatter(test_df, x='ADJOE', y='BARTHAG', color='y_pred', title = 'Testing Data, Predicted Labels')
fig.show()

In [11]:
# calculate accuracy: the proportion of correctly predicted labels
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_pred_test)

0.46153846153846156

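As you can see, the KNN model captures the general shape of the decision boundaries but falls short where classes overlap heavily in the feature space: accuracy drops from about 0.64 on the training data to about 0.46 on the testing data. Just imagine what would happen if we looked at all seeds! Still, KNN is a very powerful tool for supervised learning and takes minimal training, since fitting only stores the data and no model parameters are learned. The hyperparameter k and the distance metric still have to be chosen, however.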