# K Nearest Neighbors Model

A simple classification model. This notebook can be used as a template for others.

Remark from Will: I may add diagnostic plots later, but from working with the numbers it is already fairly clear that, at least when test_size = 1/5, k=6 is the optimal choice: it obtains the best accuracy and also the best value for two of the three recall scores and two of the three precision scores, so it is substantially better than the alternatives. The accuracy of this optimal model is 0.716. With a little experimentation, test_size = 1/5 also seems to avoid overfitting.

# Imports

In [84]:
## Standard imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from seaborn import set_style
import csv

set_style("whitegrid")

In [85]:
## More specific imports for this notebook

import joblib ## For saving trained models
from sklearn.neighbors import KNeighborsClassifier ## Import the model here
from sklearn.model_selection import train_test_split ## Import train_test_split
from sklearn.metrics import confusion_matrix ## Import confusion_matrix
from sklearn.metrics import accuracy_score

# Initial Settings

In [86]:
data_fp = '../../data/processed_data/specgram_db0.npy' ## Path to the processed spectrogram data

## Read the raw classifications corresponding to the data
with open('../../data/processed_data/metadatadata.csv', 'r') as file:
    data_cat = list(csv.reader(file, delimiter=','))

# Load Data

In [87]:
# Load data (currently without classifications attached)
x_data = np.load(data_fp)

# Flatten the 2-dim matrices into vectors for the kNN
X = [x.flatten() for x in x_data]
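
The flattening step above turns each 2-dim spectrogram into one feature vector per sample. As a minimal sketch, assuming every spectrogram has the same shape (the shape of the real array is not stated in this notebook), the same result could be obtained with a single reshape. The toy array and its dimensions are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for x_data: 4 spectrograms of shape 3 x 5
# (the real array's shape is not stated in this notebook).
toy_specgrams = np.arange(4 * 3 * 5, dtype=float).reshape(4, 3, 5)

# Per-sample flattening, as in the cell above
X_loop = [s.flatten() for s in toy_specgrams]

# Single reshape: one row per sample, one column per spectrogram entry
X_reshaped = toy_specgrams.reshape(len(toy_specgrams), -1)

print(X_reshaped.shape)
print(np.array_equal(np.array(X_loop), X_reshaped))
```

The reshape keeps the data in one 2-dim array, which is the layout scikit-learn estimators expect for their feature matrix.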

In [153]:
df = pd.read_csv("../../data/processed_data/metadata.csv")

situation_to_number = {'brushing': 0, 'food': 1, 'isolation': 2}
breed_to_number = {'european_shorthair': 0, 'maine_coon': 1}
sex_to_number = {0:0, 1:1} ## Sex is already encoded as a number; the identity mapping keeps the code uniform

# Create a new column with numerical values based on the mapping
model_type = 'sex'
category_to_number = sex_to_number
df['numerical_'+model_type] = df[model_type].map(category_to_number)

y = df['numerical_'+model_type]
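
One thing to watch with the mapping step above: `Series.map` with a dictionary returns NaN for any value that is not a key of the dictionary, so a misspelled category would silently become a missing label. The following is a small sketch of a check for that. The toy DataFrame and the misspelled value are invented for illustration and are not taken from the metadata file.

```python
import pandas as pd

situation_to_number = {'brushing': 0, 'food': 1, 'isolation': 2}

# Hypothetical metadata rows; the last value is deliberately misspelled
toy_df = pd.DataFrame({'situation': ['food', 'brushing', 'isolation', 'isolaton']})

mapped = toy_df['situation'].map(situation_to_number)

# Values absent from the dictionary come back as NaN
unmapped = toy_df.loc[mapped.isna(), 'situation'].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every row received a numerical label.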

# Train-Test Split

In [154]:
## Set up the train test split

# Use these variables to automate saving runs with different filenames
test_size = 1/5
random_state = 440
x_train, x_test, y_train, y_test = train_test_split(X.copy(), y,
                                        shuffle = True,
                                        random_state = random_state,
                                        test_size=test_size)
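
The confusion matrices recorded later in this notebook show unequal class sizes (for example 72 versus 16 test samples for sex). With a plain shuffled split, the class ratio in the test set can drift from the ratio in the full data. The sketch below shows the `stratify` parameter of `train_test_split`, which keeps the ratio the same in both splits. This is an optional variation, not the split used for the recorded runs, and the toy data is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 160 samples of class 0, 40 of class 1
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 4))
toy_y = np.array([0] * 160 + [1] * 40)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    shuffle=True,
    random_state=440,
    test_size=1/5,
    stratify=toy_y,  # keep the class ratio the same in both splits
)

# Fraction of class 1 in each split
print(y_tr.mean(), y_te.mean())
```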

# Fit Model

In [175]:
## Use this variable to automate saving runs with different filenames
k = 10

## Make the model object
knn = KNeighborsClassifier(k)

## "Fit" the model object
knn.fit(x_train, y_train)
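
The run logs at the end of this notebook cover k = 3 through 10, one value of `k` per run. A loop along the following lines could collect the same accuracies in one pass. This is only a sketch on synthetic data from `make_classification`, not on the spectrograms, and the names `accuracy_by_k` and `best_k` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the spectrograms used in this notebook)
toy_X, toy_y = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=440)
x_tr, x_te, y_tr, y_te = train_test_split(toy_X, toy_y, shuffle=True,
                                          random_state=440, test_size=1/5)

# Fit one model per k and record its test accuracy
accuracy_by_k = {}
for k_candidate in range(3, 11):
    model = KNeighborsClassifier(k_candidate)
    model.fit(x_tr, y_tr)
    accuracy_by_k[k_candidate] = accuracy_score(y_te, model.predict(x_te))

# k with the highest recorded accuracy (ties go to the smallest k)
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(best_k, accuracy_by_k[best_k])
```

Choosing `k` by test accuracy tends to make the reported accuracy optimistic; a separate validation split or cross-validation would be a more careful option.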

# Assess Model Performance
Compute the test accuracy and the confusion matrix.

In [176]:
## Predict on the test set
y_test_pred = knn.predict(x_test)

## Compute confusion matrix for model
conf_mat = confusion_matrix(y_test, y_test_pred)

## Compute accuracy for the model
acc = accuracy_score(y_test, y_test_pred)

print(acc)
print(conf_mat)

0.8295454545454546
[[70  2]
 [13  3]]
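
The remark at the top of this notebook mentions overfitting. One common way to look for it is to compare the accuracy on the training set with the accuracy on the test set: a training accuracy far above the test accuracy suggests the model does not generalize well. The sketch below runs on synthetic data and is only meant to show the comparison, not to reproduce any result from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the spectrograms used in this notebook)
toy_X, toy_y = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=440)
x_tr, x_te, y_tr, y_te = train_test_split(toy_X, toy_y, shuffle=True,
                                          random_state=440, test_size=1/5)

toy_knn = KNeighborsClassifier(10)
toy_knn.fit(x_tr, y_tr)

# Accuracy on data the model has seen versus data it has not
train_acc = accuracy_score(y_tr, toy_knn.predict(x_tr))
test_acc = accuracy_score(y_te, toy_knn.predict(x_te))
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```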


# Save Trained Model

In [177]:
# Build model_filename based on characteristics of test

model_filename = '../../data/trained_models/' ## Save location destination
model_filename += (model_type+'_')
model_filename += ('k'+str(k)) ## Save k-value used for model
model_filename += ('s'+str(test_size)) ## Save test_size used for train test split
model_filename += ('r'+str(random_state)) ## Save random_state used for train test split
model_filename += '.pkl'
print(model_filename)

# Save the model to disk
joblib.dump(knn, model_filename)

../../data/trained_models/sex_k10s0.2r440.pkl


['../../data/trained_models/sex_k10s0.2r440.pkl']
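
The filename logic above could be wrapped in a small helper so that every run is named the same way. This is a sketch only; `build_model_filename` is a hypothetical name, and the default directory is the one used in the cell above.

```python
def build_model_filename(model_type, k, test_size, random_state,
                         directory='../../data/trained_models/'):
    """Encode the run settings in the filename, mirroring the cell above."""
    return f"{directory}{model_type}_k{k}s{test_size}r{random_state}.pkl"

# Settings of the run saved above
print(build_model_filename('sex', 10, 1/5, 440))
```

A model saved with `joblib.dump` can be read back with `joblib.load(model_filename)`.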

# Discussion of Data Runs

The remainder of the file is markdown copies of the key results from many kNN runs on the different classification targets.

Main Takeaways:
- Extremely high reliability for predicting the breed
- Pretty reliable for predicting the sex (without knowing the neutering status)
- Limited but significant ability to predict the situation

# Runs on Situation Data
kNN runs for situation data:

k=3
- Accuracy = 0.6363636363636364
- [[16  5 10]
 [ 7 10  4]
 [ 5  1 30]]
 
k=4
- Accuracy = 0.6477272727272727
- [[19  3  9]
 [ 8  8  5]
 [ 5  1 30]]

k=5 (Highest Accuracy)
- Accuracy = 0.6704545454545454
- [[19  1 11]
 [ 7  9  5]
 [ 4  1 31]]

k=6
- Accuracy = 0.6022727272727273
- [[16  3 12]
 [ 8  9  4]
 [ 7  1 28]]

k=7
- Accuracy = 0.6477272727272727
- [[16  4 11]
 [ 7  9  5]
 [ 2  2 32]]

k=8
- Accuracy = 0.625
- [[14  4 13]
 [ 6 10  5]
 [ 3  2 31]]

k=9
- Accuracy = 0.6363636363636364
- [[14  4 13]
 [ 7  9  5]
 [ 2  1 33]]

k=10
- Accuracy = 0.6363636363636364
- [[13  4 14]
 [ 5 11  5]
 [ 2  2 32]]
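
Per-class precision and recall can be recovered from the confusion matrices recorded above. The sketch below assumes scikit-learn's convention for `confusion_matrix` (rows are true classes, columns are predicted classes) and that the class order follows `situation_to_number` (brushing, food, isolation). The function name `per_class_precision_recall` is hypothetical.

```python
import numpy as np

def per_class_precision_recall(conf_mat):
    """Rows are true classes, columns are predicted classes."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    true_pos = np.diag(conf_mat)
    precision = true_pos / conf_mat.sum(axis=0)  # column sums = predicted counts
    recall = true_pos / conf_mat.sum(axis=1)     # row sums = true counts
    accuracy = true_pos.sum() / conf_mat.sum()
    return precision, recall, accuracy

# Confusion matrix recorded above for k=5 on the situation data
k5_situation = [[19, 1, 11],
                [7, 9, 5],
                [4, 1, 31]]

precision, recall, accuracy = per_class_precision_recall(k5_situation)
print(np.round(precision, 3), np.round(recall, 3), round(accuracy, 4))
```

A class that is never predicted has a column sum of zero, in which case its precision is undefined and the division yields NaN with a warning.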

# Runs on Breed Data
kNN runs for breed data:

k=3
- Accuracy = 0.9318181818181818
- [[51  2]
 [ 4 31]]

 
k=4 (Highest Accuracy)
- Accuracy = 0.9545454545454546
- [[53  0]
 [ 4 31]]

k=5
- Accuracy = 0.9090909090909091
- [[48  5]
 [ 3 32]]

k=6
- Accuracy = 0.9318181818181818
- [[50  3]
 [ 3 32]]

k=7
- Accuracy = 0.9090909090909091
- [[47  6]
 [ 2 33]]

k=8
- Accuracy = 0.9090909090909091
- [[48  5]
 [ 3 32]]

k=9
- Accuracy = 0.9090909090909091
- [[47  6]
 [ 2 33]]

k=10
- Accuracy = 0.9090909090909091
- [[47  6]
 [ 2 33]]

# Runs on Sex Data
kNN runs for sex data:

k=3
- Accuracy = 0.8409090909090909
- [[65  7]
 [ 7  9]]

 
k=4 (Highest Accuracy)
- Accuracy = 0.8863636363636364
- [[71  1]
 [ 9  7]]

k=5
- Accuracy = 0.8409090909090909
- [[66  6]
 [ 8  8]]

k=6
- Accuracy = 0.8522727272727273
- [[69  3]
 [10  6]]

k=7
- Accuracy = 0.8636363636363636
- [[68  4]
 [ 8  8]]

k=8
- Accuracy = 0.8295454545454546
- [[69  3]
 [12  4]]

k=9
- Accuracy = 0.8068181818181818
- [[67  5]
 [12  4]]

k=10
- Accuracy = 0.8295454545454546
- [[70  2]
 [13  3]]
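
Because the two sex classes are unequal in size (72 versus 16 test samples in every matrix above), it can help to read each accuracy next to the accuracy of always predicting the larger class. The sketch below computes both numbers from the k=4 matrix recorded above; the variable names are hypothetical.

```python
import numpy as np

# Confusion matrix recorded above for k=4 on the sex data
# (rows are true classes, columns are predicted classes)
k4_sex = np.array([[71, 1],
                   [9, 7]])

total = k4_sex.sum()
accuracy = np.trace(k4_sex) / total

# Accuracy of always predicting the most common true class
majority_baseline = k4_sex.sum(axis=1).max() / total

print(round(accuracy, 4), round(majority_baseline, 4))
```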