# **K-Nearest Neighbors | Jun Lee**

K-Nearest Neighbors (KNN) is a classification algorithm. It works by comparing a new data point to the `k` closest data points in the training set. The new data point is then assigned the majority class among those `k` nearest neighbors. In essence, we calculate the distance between points to classify data.

### **Let's get into some theory...**
1. **Distance Calculation**: The core idea of KNN is to calculate the distance between data points. Today, we are going to use the Euclidean distance.
   - **Euclidean Distance**: This is the most common distance metric. For two points $x$ and $y$ with $n$ features each, it is calculated as:
     
$$
d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$
     

2. **Choosing `k`**: The number of neighbors `k` is crucial. A small `k` may lead to overfitting, while a large `k` might oversimplify the model. The `k` value should be carefully selected to balance accuracy and generalization. Our `k` variable is the number of neighbors that are consulted when classifying a new data point.

3. **Voting Mechanism**: The algorithm assigns the new data point to the class that is most common among its `k` nearest neighbors.
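
The three steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements KNN internally; the function names `euclidean_distance` and `knn_predict` and the toy data are made up for this example.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Square root of the summed squared differences between two points."""
    return math.sqrt(sum((x_i - y_i) ** 2 for x_i, y_i in zip(a, b)))


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point
    distances = [
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, for illustration only)
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(euclidean_distance((0, 0), (3, 4)))                     # 5.0
print(knn_predict(train_points, train_labels, (2, 2), k=3))   # a
```

With an even `k`, or with more than two classes, votes can tie. This sketch simply returns whichever tied label was counted first, which is one reason an odd `k` is a common choice.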

**Library Imports** <br>
<br>
Various libraries are imported for data handling (pandas, numpy), preprocessing (preprocessing), and classification (KNeighborsClassifier). Metrics like confusion_matrix, f1_score, and accuracy_score are used to evaluate the model.

In [4]:
import sklearn
import sklearn.model_selection  # explicit import so sklearn.model_selection.train_test_split resolves
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

**Taking in our Dataset** <br>
<br>
Here we load the dataset we will use for KNN, `car.data`.
We parse the CSV into the variable `data` and print the first few rows with the `head()` function to check that everything loaded correctly.

In [5]:
data = pd.read_csv("car.data")
print(data.head())

  buying  maint doors persons lug_boot safety  class
0  vhigh  vhigh     2       2    small    low  unacc
1  vhigh  vhigh     2       2    small    med  unacc
2  vhigh  vhigh     2       2    small   high  unacc
3  vhigh  vhigh     2       2      med    low  unacc
4  vhigh  vhigh     2       2      med    med  unacc


**Label Encoding to Numerical values** <br>
<br>
`LabelEncoder` is used to convert categorical variables (like `buying`, `maint`, etc.) into numerical values. This is necessary because KNN works with numerical data. Each feature in the dataset is encoded into integers.

In [6]:
le = preprocessing.LabelEncoder()
buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
doors = le.fit_transform(list(data["doors"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
cls = le.fit_transform(list(data["class"]))
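
It can help to check which integer each label received. The sketch below is a standalone illustration using a short hand-written list of the four class labels rather than the real dataset; the variable names are made up. It relies on `LabelEncoder` storing the labels in sorted order in its `classes_` attribute, so the position of a label is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# A few class labels in arbitrary order (hand-written, not read from car.data)
labels = ["unacc", "acc", "vgood", "good", "unacc"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the labels in sorted order; the index is the integer code
print(encoder.classes_.tolist())  # ['acc', 'good', 'unacc', 'vgood']
print(encoded.tolist())           # [2, 0, 3, 1, 2]

# inverse_transform maps integer codes back to the original strings
print(encoder.inverse_transform([0, 1, 2, 3]).tolist())
```

Note that refitting the same encoder on another column replaces its `classes_`, so decoding a column later requires an encoder that was last fitted on that column.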

**What are we predicting?** <br>
We have chosen to predict `class`, which is the car's overall rating given all of these variables.

In [7]:
predict = "class"

**Setting x and y** <br>
<br>
Again, our y variable is what we are predicting, which is the class of the car, and our x variables are everything we take into account, such as `safety`, `persons`, etc. We pass in the numerical values we converted above using `LabelEncoder()`.
<br>
<br>
The data is split into training (90%) and testing (10%) sets using `train_test_split`. The training set is used to train the model, and the test set is used to evaluate its performance.

In [8]:
# one tuple of encoded feature values per car
x = list(zip(buying, maint, doors, persons, lug_boot, safety))
y = list(cls)

# hold out 10% of the rows for testing
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
print(x_train, y_test)

[(2, 1, 0, 2, 0, 2), (1, 2, 1, 2, 1, 1), (3, 3, 3, 2, 1, 2), (0, 0, 3, 1, 1, 1), (2, 3, 1, 0, 2, 2), (1, 1, 0, 1, 0, 1), (2, 0, 3, 0, 1, 2), (3, 1, 0, 2, 0, 1), (0, 1, 0, 2, 0, 0), (2, 0, 3, 2, 0, 1), (0, 0, 2, 2, 1, 2), (0, 1, 0, 1, 0, 1), (2, 0, 0, 0, 0, 2), (0, 2, 0, 2, 2, 2), (0, 0, 0, 2, 0, 1), (1, 1, 2, 2, 1, 2), (2, 0, 2, 0, 1, 0), (1, 1, 0, 0, 1, 1), (2, 3, 2, 1, 2, 0), (0, 0, 3, 2, 0, 0), (1, 0, 2, 1, 0, 0), (1, 2, 1, 0, 1, 1), (0, 3, 1, 1, 2, 0), (3, 3, 1, 0, 2, 2), (1, 2, 3, 1, 2, 0), (1, 1, 2, 2, 2, 1), (0, 1, 3, 0, 2, 1), (3, 0, 2, 0, 0, 2), (3, 2, 1, 1, 0, 0), (2, 0, 3, 0, 2, 2), (0, 1, 2, 1, 1, 1), (2, 0, 0, 0, 2, 2), (1, 2, 1, 2, 0, 1), (1, 2, 2, 1, 2, 2), (0, 1, 3, 0, 1, 2), (2, 1, 3, 0, 2, 0), (1, 3, 1, 1, 2, 1), (1, 0, 3, 2, 0, 2), (3, 2, 3, 0, 2, 1), (0, 3, 0, 1, 0, 1), (2, 2, 2, 0, 0, 0), (1, 0, 2, 2, 1, 0), (3, 0, 2, 2, 1, 1), (3, 3, 3, 1, 1, 0), (0, 1, 1, 1, 1, 2), (1, 0, 2, 0, 0, 2), (2, 1, 1, 0, 0, 1), (1, 2, 1, 1, 1, 1), (1, 0, 3, 1, 0, 0), (1, 2, 0, 0, 1, 2),
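
Because the split is random, the reported accuracy changes from run to run. A possible variation, which the notebook itself does not use, is to fix the seed with `random_state` and keep the class proportions equal in both sets with `stratify`. The sketch below uses made-up imbalanced toy data in place of the car dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded data: 100 samples with imbalanced classes
features = [(i % 4, i % 3) for i in range(100)]
targets = [0] * 70 + [1] * 20 + [2] * 10

x_train, x_test, y_train, y_test = train_test_split(
    features,
    targets,
    test_size=0.1,      # same 90/10 split as in the notebook
    random_state=42,    # fixed seed so the split is identical on every run
    stratify=targets,   # keep class proportions the same in train and test
)

print(len(x_train), len(x_test))  # 90 10
print(Counter(y_test))            # class counts in the test set
```

With stratification the ten test samples contain 7, 2 and 1 samples of the three classes, which mirrors the 70/20/10 proportions of the full toy data.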

**Training the Model and Evaluating** <br>
<br>
The `KNeighborsClassifier` is initialized with the Euclidean distance metric `(metric="euclidean")` and `n_neighbors=9` (meaning the 9 closest neighbors are considered). The call also passes `p=3`, but `p` is the power parameter of the Minkowski metric, so it should have no effect when the metric is set to Euclidean. The model is then trained using the training data `(x_train, y_train)`.

The accuracy (acc) of the model is calculated using the test data. The score() function returns the proportion of correctly predicted data points. Predictions are made using predict().

In [9]:
model = KNeighborsClassifier(metric="euclidean", p=3, n_neighbors=9)
model.fit(x_train, y_train)
acc = model.score(x_test, y_test)
print(acc)

0.9653179190751445
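
The value `n_neighbors=9` is one choice among many. One common way to compare candidates is cross-validation. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the car dataset, and the candidate list and variable names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the car dataset)
features, targets = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

candidate_ks = [1, 3, 5, 7, 9, 11, 15]
mean_scores = {}
for k in candidate_ks:
    candidate_model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: mean accuracy over five train/validation splits
    scores = cross_val_score(candidate_model, features, targets, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print("best k:", best_k)
```

Averaging over several folds gives a steadier estimate than a single 90/10 split, which matters when some classes have only a handful of test samples.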


In [10]:
y_pred = model.predict(x_test)
y_pred

array([0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 0,
       2, 0, 2, 2, 0, 0, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 0, 2, 3, 0, 2, 2,
       2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2,
       2, 0, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0, 2, 3, 2, 2, 0, 2,
       0, 2, 2, 0, 3, 2, 0, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0, 0, 2],
      dtype=int64)

**Predictions** <br>
<br>
The code maps predictions to human-readable class names (`unacc`, `acc`, `good`, `vgood`). It also finds the 9 nearest neighbors for each test point using `kneighbors()`. <br>
<br>
The `model.predict(x_test)` line generates predictions for each data point in the test set. <br>
The predictions are numerical values corresponding to the encoded class labels `(0, 1, 2, 3)`. `LabelEncoder` assigns these codes in alphabetical order of the labels, so the `names` list must translate them back in the order `("acc", "good", "unacc", "vgood")`. <br>
The loop iterates through each prediction, printing out: <br>

- The predicted class.
- The feature data of the corresponding test sample `(x_test[i])`.
- The actual class label from the test set `(y_test[i])`.

Finding Neighbors: `model.kneighbors([x_test[i]], n_neighbors=9, return_distance=True)` returns the distances and indices of the 9 nearest neighbors of a test point. This helps in understanding which training points influenced the prediction. Because the final `print` sits outside the loop, only the neighbors of the last test point are shown.

In [11]:
predicted = model.predict(x_test)
# LabelEncoder assigns codes in alphabetical order of the labels,
# so the names must be listed in that same order.
names = ["acc", "good", "unacc", "vgood"]
for i in range(len(predicted)):
    print("predicted: ", names[predicted[i]], "Data:", x_test[i], "actual", names[y_test[i]])
    # distances and indices of the 9 nearest training points for this sample
    n = model.kneighbors([x_test[i]], n_neighbors=9, return_distance=True)
# outside the loop: prints the neighbors of the last test sample only
print("N: ", n)

predicted:  acc Data: (3, 1, 0, 2, 0, 0) actual acc
predicted:  unacc Data: (3, 1, 2, 2, 0, 2) actual acc
predicted:  unacc Data: (1, 1, 3, 2, 2, 1) actual unacc
predicted:  unacc Data: (2, 1, 1, 1, 2, 1) actual unacc
predicted:  acc Data: (1, 0, 3, 1, 1, 2) actual acc
predicted:  unacc Data: (2, 3, 3, 2, 2, 2) actual unacc
predicted:  unacc Data: (0, 3, 0, 2, 1, 0) actual unacc
predicted:  unacc Data: (1, 0, 0, 2, 2, 2) actual unacc
predicted:  unacc Data: (2, 2, 2, 0, 1, 2) actual unacc
predicted:  unacc Data: (0, 2, 2, 0, 2, 0) actual unacc
predicted:  acc Data: (3, 2, 2, 2, 2, 0) actual acc
predicted:  acc Data: (2, 0, 1, 1, 0, 2) actual acc
predicted:  vgood Data: (1, 1, 2, 2, 1, 0) actual vgood
predicted:  unacc Data: (0, 3, 0, 1, 1, 1) actual unacc
predicted:  unacc Data: (3, 2, 0, 2, 2, 0) actual unacc
predicted:  unacc Data: (0, 1, 3, 0, 0, 1) actual unacc
predicted:  unacc Data: (1, 2, 2, 2, 1, 1) actual unacc
predicted:  unacc Data: (3, 3, 0, 0, 0, 0) actual unacc
predicted:  unacc Data: (0, 3, 0, 2,
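
To see what `kneighbors` returns, here is a small standalone sketch on made-up toy data (the names `toy_model`, `train_points` and `train_labels` are hypothetical). It returns two arrays, distances and training-set indices, each with one row per query point and one column per neighbor.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (hypothetical, for illustration only)
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]

toy_model = KNeighborsClassifier(n_neighbors=3)
toy_model.fit(train_points, train_labels)

# kneighbors expects a 2D input: a list containing one query point
distances, indices = toy_model.kneighbors(
    [(0, 0)], n_neighbors=3, return_distance=True
)

print(distances.shape, indices.shape)         # (1, 3) (1, 3)
# Look up the labels of the neighbors that took part in the vote
print([train_labels[i] for i in indices[0]])  # [0, 0, 0]
```

The indices refer to rows of the training data, so they can be used to look up both the neighbors' features and their labels.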

In [12]:
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

[[ 34   0   3   0]
 [  2   1   0   0]
 [  0   0 125   0]
 [  1   0   0   7]]
0.9653179190751445


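**To interpret the results in the confusion matrix** <br>
(Source: Mr Zampogna) <br>
For a two-class example with the matrix `[[94, 13], [15, 32]]`, rows are the actual classes and columns are the predicted classes: <br>

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 94 | 13 |
| Actual Negative | 15 | 32 |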

The overall accuracy of my model, calculated as the proportion of correct predictions to the total number of predictions, is approximately 96.5% (167 of 173 test samples). This indicates that the model is highly accurate on this split. However, the confusion matrix reveals that the second class (encoded label 1) is the most challenging to predict: only 1 of its 3 test samples is classified correctly, while the other 2 are assigned to the first class. We can use this to identify where the model may need improvement, for example by adjusting the value of `k` or using different features.
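
Per-class figures make this easier to see than overall accuracy alone. The sketch below recomputes precision and recall with NumPy from the four-class confusion matrix printed above, where rows are actual classes and columns are predicted classes. The variable names are made up for this example.

```python
import numpy as np

# Confusion matrix from the notebook output: rows = actual, columns = predicted
cm = np.array([
    [34, 0, 3, 0],
    [2, 1, 0, 0],
    [0, 0, 125, 0],
    [1, 0, 0, 7],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)     # share of each actual class that was found
precision = correct / cm.sum(axis=0)  # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()

for label in range(len(cm)):
    print(f"class {label}: precision={precision[label]:.3f} recall={recall[label]:.3f}")
print(f"accuracy={accuracy:.4f}")
```

Class 1 has a recall of about 0.33 even though overall accuracy is about 0.965, which is the kind of gap that accuracy alone hides when classes are imbalanced.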