# K-Nearest Neighbors
Load the `mnist` dataset and split it into training and test sets. Train and test a k-nearest neighbors (k-NN) classifier using scikit-learn. Consult the documentation to identify the model's most important hyperparameters, attributes, and methods, then use them in practice.

## Importing Modules

In [9]:
import pandas as pd
import sklearn.model_selection
import sklearn.neighbors
import sklearn.metrics
import plotly.express as px

## Loading the Dataset

In [2]:
df = pd.read_csv("../../datasets/mnist.csv")
df = df.set_index("id")
df.head()

id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
31953,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34452,8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60897,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36953,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1981,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Splitting the Dataset into Train and Test Sets

In [3]:
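# Features: the 784 pixel columns. Target: the digit label in "class".
x = df.drop(["class"], axis=1)
y = df["class"]
# With no test_size given, train_test_split holds out 25% of the rows for testing.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)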

print("df:", df.shape)
print("x_train:", x_train.shape)
print("x_test:", x_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

df: (4000, 785)
x_train: (3000, 784)
x_test: (1000, 784)
y_train: (3000,)
y_test: (1000,)
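
The split above is random, so the accuracies reported below will vary from run to run. As a minimal sketch of one way to make a split reproducible and class-balanced, the example below passes `random_state` and `stratify` to `train_test_split`. The toy frame and the `toy_*` names are hypothetical stand-ins, not part of the notebook's data.

```python
import pandas as pd
import sklearn.model_selection

# Hypothetical toy data: 8 rows of class 0 and 4 rows of class 1.
toy_df = pd.DataFrame({"class": [0] * 8 + [1] * 4, "pixel1": range(12)})
toy_x = toy_df.drop(["class"], axis=1)
toy_y = toy_df["class"]

# stratify keeps the 2:1 class ratio in both subsets;
# random_state makes the shuffle repeatable.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = sklearn.model_selection.train_test_split(
    toy_x, toy_y, test_size=0.25, stratify=toy_y, random_state=42
)

print("train class counts:", toy_y_train.value_counts().to_dict())
print("test class counts:", toy_y_test.value_counts().to_dict())
```

Whether stratification matters for this dataset depends on how balanced the digit classes are, which the notebook does not show.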


## Training a Model

In [14]:
model = sklearn.neighbors.KNeighborsClassifier()
model.fit(x_train, y_train);
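
The task asks for the model's main attributes and methods. The sketch below shows a few that a fitted `KNeighborsClassifier` exposes: `classes_`, `n_features_in_`, `n_samples_fit_`, `effective_metric_`, `kneighbors()`, and `predict_proba()`. It assumes scikit-learn's bundled 8x8 digits dataset as a stand-in for `mnist.csv`, and the `demo_*` names are illustrative only.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors

# Assumption: the bundled 8x8 digits stand in for ../../datasets/mnist.csv.
digits_x, digits_y = sklearn.datasets.load_digits(return_X_y=True)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = sklearn.model_selection.train_test_split(
    digits_x, digits_y, random_state=0
)

demo_model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
demo_model.fit(demo_x_train, demo_y_train)

# Attributes that fit() sets on the model.
print("classes_:", demo_model.classes_)
print("n_features_in_:", demo_model.n_features_in_)
print("n_samples_fit_:", demo_model.n_samples_fit_)
print("effective_metric_:", demo_model.effective_metric_)

# kneighbors() returns the distances to, and the row positions of,
# the nearest training samples for each query row.
distances, indices = demo_model.kneighbors(demo_x_test[:1])
neighbor_labels = demo_y_train[indices[0]]
print("neighbor labels:", neighbor_labels)

# With the default uniform weights, predict_proba() is the share of
# neighbors that voted for each class.
probabilities = demo_model.predict_proba(demo_x_test[:1])
print("class probabilities:", probabilities[0])
```

Inspecting `neighbor_labels` alongside `probabilities` makes the voting rule concrete: the predicted class is the one with the most neighbors.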

## Testing the Trained Model

In [6]:
y_predicted = model.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
accuracy

0.906
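
A single accuracy figure does not show which digits get confused with each other. As a minimal sketch, `confusion_matrix` and `classification_report` from `sklearn.metrics` break the result down per class. The labels below are hypothetical values chosen for illustration, not predictions from the model above.

```python
import sklearn.metrics

# Hypothetical labels for illustration only, not results from this notebook.
demo_y_true = [0, 1, 2, 2, 1, 0, 2, 1]
demo_y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

# Rows are true classes, columns are predicted classes.
matrix = sklearn.metrics.confusion_matrix(demo_y_true, demo_y_pred)
print(matrix)

# Per-class precision, recall, and F1 score.
print(sklearn.metrics.classification_report(demo_y_true, demo_y_pred))
```

In the notebook itself, the same two calls would take `y_test` and `y_predicted` as arguments.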

## Hyperparameter Tuning

In [7]:
k_list = range(1, 10)
metric_list = ["euclidean", "manhattan", "chebyshev"]
rows = []

# Evaluate every (K, metric) combination on the test set.
for k in k_list:
    for metric in metric_list:
        model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k, metric=metric)
        model.fit(x_train, y_train)
        y_predicted = model.predict(x_test)
        accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
        rows.append({"K": k, "Metric": metric, "Accuracy": accuracy})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows.
result_df = pd.DataFrame(rows, columns=["K", "Metric", "Accuracy"])
result_df

,K,Metric,Accuracy
0,1,euclidean,0.915
1,1,manhattan,0.903
2,1,chebyshev,0.616
3,2,euclidean,0.896
4,2,manhattan,0.868
5,2,chebyshev,0.566
6,3,euclidean,0.906
7,3,manhattan,0.888
8,3,chebyshev,0.564
9,4,euclidean,0.909
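
The loop above scores each setting on the test set, so the best accuracy it reports is an optimistic estimate. One common alternative, sketched here, is to choose hyperparameters by cross-validation on the training data with `GridSearchCV` and to use the test set only once at the end. The sketch assumes scikit-learn's bundled 8x8 digits dataset as a stand-in for `mnist.csv`, and the grid is deliberately smaller than the one above.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors

# Assumption: the bundled 8x8 digits stand in for ../../datasets/mnist.csv.
digits_x, digits_y = sklearn.datasets.load_digits(return_X_y=True)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = sklearn.model_selection.train_test_split(
    digits_x, digits_y, random_state=0
)

# Each of the 9 combinations is scored by 3-fold cross-validation
# on the training data only.
search = sklearn.model_selection.GridSearchCV(
    sklearn.neighbors.KNeighborsClassifier(),
    param_grid={
        "n_neighbors": [1, 3, 5],
        "metric": ["euclidean", "manhattan", "chebyshev"],
    },
    cv=3,
    scoring="accuracy",
)
search.fit(demo_x_train, demo_y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)

# The best model is refit on the full training set, then scored once on the held-out test set.
test_accuracy = search.score(demo_x_test, demo_y_test)
print("test accuracy:", test_accuracy)
```

The per-combination scores are available in `search.cv_results_`, which can be passed to `pd.DataFrame` to get a table like `result_df`.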


In [13]:
# Keep only the Euclidean rows and plot accuracy against K.
k_df = result_df[result_df["Metric"] == "euclidean"]

fig = px.line(k_df, x="K", y="Accuracy")
fig.show()