# KNN Exercises

## Create a new notebook, knn_model, and work with the titanic dataset to answer the following:

### 1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
import acquire as ac
import prepare as prep

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [2]:
t = prep.titanic()

train, val, test = prep.train_val_test(t,'survived')

x_train, y_train = prep.split_x_y(train,'survived')
x_val, y_val = prep.split_x_y(val,'survived')

mms = MinMaxScaler()

x_train[['age', 'fare']] = mms.fit_transform(x_train[['age', 'fare']])
x_val[['age', 'fare']] = mms.transform(x_val[['age', 'fare']])

In [3]:
knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(x_train, y_train)

In [4]:
y_pred = knn.predict(x_train)

### 2. Evaluate your results using the model score, confusion matrix, and classification report.

In [5]:
accuracy = knn.score(x_train, y_train)
TN, FP, FN, TP = confusion_matrix(y_train, y_pred).ravel()
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.90      0.87       384
           1       0.82      0.74      0.78       239

    accuracy                           0.84       623
   macro avg       0.84      0.82      0.83       623
weighted avg       0.84      0.84      0.84       623



### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [6]:
print(f'Accuracy is {round(accuracy*100,2)}%.')
print(f'True positive rate: {round(TP/(TP + FN)*100,2)}%.')
print(f'False positive rate: {round(FP/(FP+TN)*100,2)}%.')
print(f'True negative rate: {round(TN/(FP+TN)*100,2)}%.')
print(f'False negative rate: {round(FN/(FN+TP)*100,2)}%.')
print(f'Precision: {round(TP/(FP+TP)*100,2)}%.')
print(f'Recall: {round(TP/(TP+FN)*100,2)}%.')

Accuracy is 84.11%.
True positive rate: 74.48%.
False positive rate: 9.9%.
True negative rate: 90.1%.
False negative rate: 25.52%.
Precision: 82.41%.
Recall: 74.48%.


For the positive class (survived = 1), the f1-score is 0.78 and the support is 239, as read from the classification report above.
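The f1-score and support can also be computed from the same four confusion-matrix counts rather than read off the report. The sketch below is one way to do that. The helper name `binary_metrics` is hypothetical, and the counts passed to it are inferred from the rates and supports printed above (the notebook never prints the raw matrix), so treat them as an assumption.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return common binary-classification metrics as fractions in [0, 1]."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "true_positive_rate": recall,
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "precision": precision,
        "recall": recall,
        # f1 is the harmonic mean of precision and recall
        "f1_score": 2 * precision * recall / (precision + recall),
        # support is the number of actual positives
        "support": tp + fn,
    }


# Assumed counts, back-calculated from the printed rates and supports
# for the default model (384 actual negatives, 239 actual positives).
metrics = binary_metrics(tn=346, fp=38, fn=61, tp=178)

for name, value in metrics.items():
    if name == "support":
        print(f"{name}: {value}")
    else:
        print(f"{name}: {value:.2%}")
```

In the notebook itself, the same function could be called with the `TN, FP, FN, TP` values unpacked from `confusion_matrix(...).ravel()`.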

### 4. Run through steps 1-3 setting k to 10

In [7]:
knn10 = KNeighborsClassifier(n_neighbors=10)
knn10.fit(x_train, y_train)

In [8]:
y_pred = knn10.predict(x_train)
accuracy = knn10.score(x_train, y_train)
TN, FP, FN, TP = confusion_matrix(y_train, y_pred).ravel()
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.92      0.87       384
           1       0.84      0.67      0.75       239

    accuracy                           0.83       623
   macro avg       0.83      0.80      0.81       623
weighted avg       0.83      0.83      0.82       623



In [9]:
print(f'Accuracy is {round(accuracy*100,2)}%.')
print(f'True positive rate: {round(TP/(TP + FN)*100,2)}%.')
print(f'False positive rate: {round(FP/(FP+TN)*100,2)}%.')
print(f'True negative rate: {round(TN/(FP+TN)*100,2)}%.')
print(f'False negative rate: {round(FN/(FN+TP)*100,2)}%.')
print(f'Precision: {round(TP/(FP+TP)*100,2)}%.')
print(f'Recall: {round(TP/(TP+FN)*100,2)}%.')

Accuracy is 82.5%.
True positive rate: 66.95%.
False positive rate: 7.81%.
True negative rate: 92.19%.
False negative rate: 33.05%.
Precision: 84.21%.
Recall: 66.95%.


### 5. Run through steps 1-3 setting k to 20

In [10]:
knn20 = KNeighborsClassifier(n_neighbors=20)
knn20.fit(x_train, y_train)

In [11]:
y_pred = knn20.predict(x_train)
accuracy = knn20.score(x_train, y_train)
TN, FP, FN, TP = confusion_matrix(y_train, y_pred).ravel()
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.89      0.84       384
           1       0.78      0.64      0.70       239

    accuracy                           0.79       623
   macro avg       0.79      0.76      0.77       623
weighted avg       0.79      0.79      0.78       623



In [12]:
print(f'Accuracy is {round(accuracy*100,2)}%.')
print(f'True positive rate: {round(TP/(TP + FN)*100,2)}%.')
print(f'False positive rate: {round(FP/(FP+TN)*100,2)}%.')
print(f'True negative rate: {round(TN/(FP+TN)*100,2)}%.')
print(f'False negative rate: {round(FN/(FN+TP)*100,2)}%.')
print(f'Precision: {round(TP/(FP+TP)*100,2)}%.')
print(f'Recall: {round(TP/(TP+FN)*100,2)}%.')

Accuracy is 78.97%.
True positive rate: 63.6%.
False positive rate: 11.46%.
True negative rate: 88.54%.
False negative rate: 36.4%.
Precision: 77.55%.
Recall: 63.6%.


### 6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

In [None]:
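# The models with fewer neighbors performed better in-sample: accuracy falls from 84.11% (k=5, the default) to 82.5% (k=10) to 78.97% (k=20), and recall falls from 74.48% to 66.95% to 63.6%.
# This is likely because a small k lets each prediction depend on only the closest training points (including the point itself when predicting on the training sample), so the model follows the training data closely.
# With more neighbors, each vote averages over more distant points, so the fit to the training sample becomes looser.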
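To illustrate why a small k tends to score well in-sample, here is a minimal pure-Python sketch of majority-vote KNN run on synthetic data. It is not scikit-learn's implementation, since tie-breaking and distance handling are simplified, and the data, seed, and function names are all made up for illustration. The point to notice is that with k=1 every training point is its own nearest neighbor, so training accuracy is perfect as long as no two points coincide.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    # Sort training indices by squared Euclidean distance to the query point.
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    # On a tie, Counter keeps first-seen order, so the closer label wins.
    return votes.most_common(1)[0][0]


def train_accuracy(train_x, train_y, k):
    """Accuracy when predicting on the same points the model was fit to."""
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y
        for x, y in zip(train_x, train_y)
    )
    return hits / len(train_x)


# Synthetic two-feature data with noisy labels (illustrative only).
random.seed(0)
xs = [(random.random(), random.random()) for _ in range(200)]
ys = [int(x[0] + random.gauss(0, 0.3) > 0.5) for x in xs]

for k in (1, 5, 10, 20):
    print(f"k={k}: training accuracy = {train_accuracy(xs, ys, k):.3f}")
```

A perfect in-sample score at k=1 says nothing about out-of-sample performance, which is why the comparison on validate matters.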

### 7. Which model performs best on our out-of-sample data from validate?

In [19]:
knn.score(x_val, y_val)

0.746268656716418

In [18]:
knn10.score(x_val, y_val)

0.753731343283582

In [16]:
knn20.score(x_val, y_val)

0.7835820895522388
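Reading the three scores, the k=20 model does best on validate (about 0.784), even though it was the weakest in-sample. A short sketch that puts the numbers side by side follows. The values are copied from the outputs above, with training accuracy rounded to four decimals, and the dictionary layout is just one convenient choice.

```python
# Accuracy values copied from the notebook outputs above.
scores = {
    "k=5 (default)": {"train": 0.8411, "validate": 0.746268656716418},
    "k=10": {"train": 0.8250, "validate": 0.753731343283582},
    "k=20": {"train": 0.7897, "validate": 0.7835820895522388},
}

# Pick the model with the highest out-of-sample accuracy.
best = max(scores, key=lambda name: scores[name]["validate"])

for name, s in scores.items():
    gap = s["train"] - s["validate"]  # larger gap suggests more overfitting
    print(f"{name}: train={s['train']:.4f} validate={s['validate']:.4f} gap={gap:.4f}")

print(f"Best on validate: {best}")
```

The train-validate gap shrinks as k grows, which is consistent with the smaller-k models fitting the training sample more tightly than they generalize.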