In this notebook we will be covering a new modeling technique called K-Nearest Neighbors. 

As a quick reminder from Dr. Sheng's videos, K-Nearest Neighbors requires standardized data, with every column on the same scale. Columns on very different scales distort the distance calculations, because the column with the largest values dominates the result.

We will demonstrate this before diving deeper.
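To make the reminder concrete before touching the real data, here is a minimal sketch with made-up passengers. The values and the names `a`, `b`, `c`, `d` are illustrative assumptions, not rows from the Titanic dataset. It compares Euclidean distances before and after standardizing each column by hand.

```python
import numpy as np

# Hypothetical passengers: [age in years, a large made-up numeric column]
a = np.array([22.0, 50_000_000.0])
b = np.array([70.0, 50_100_000.0])   # very different age, almost the same big value
c = np.array([23.0, 52_000_000.0])   # almost the same age, somewhat different big value
d = np.array([45.0, 90_000_000.0])   # extra row so the columns have a realistic spread

def euclidean(u, v):
    """Straight-line distance between two points."""
    return np.sqrt(np.sum((u - v) ** 2))

# Raw scale: the big column dominates, so b looks like a's closest neighbor.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Standardize by hand: subtract each column's mean, divide by its standard deviation.
points = np.vstack([a, b, c, d])
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print("raw    d(a,b), d(a,c):", raw_ab, raw_ac)
print("scaled d(a,b), d(a,c):", scaled_ab, scaled_ac)
```

On the raw scale the age difference barely registers. After scaling, both columns contribute, and the passenger with nearly the same age becomes the nearer neighbor.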


https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

## Setup

In [None]:
import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix,\
recall_score, precision_score, f1_score, accuracy_score, make_scorer,\
precision_recall_fscore_support, mean_absolute_error, mean_squared_error

from sklearn.model_selection import train_test_split, cross_validate

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
import seaborn as sns

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyClassifier


drive.mount('/content/drive')

## Data

In [None]:
! ls /content/drive/

In [None]:
titanic_cleaned = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/6482_to_4482/titanic_cleaned.csv').drop('Cabin',axis=1)

In [None]:
titanic_cleaned.head()

In [None]:
X = pd.get_dummies(titanic_cleaned.drop('Survived', axis=1))
y = titanic_cleaned.Survived
print(X.shape, y.shape)

# Model assumption testing preparation

## Demonstrate a random number generator

In [None]:
import random
random.randint(100000,9999999) # generate a really big number randomly

## Apply random numbers to a new column

In [None]:
X_random_col = X.copy()
X_random_col['a_random_big'] = random.sample(range(9999999, 99999999), X_random_col.shape[0]) # now generate a bunch of random numbers one new number for each row
X_random_col[['Age','a_random_big']].head()

# Standard Scaler

StandardScaler converts unscaled data into scaled data. The quick demonstration below shows it in action on our dataframe. Notice that the columns are on different scales before StandardScaler transforms them.

Before

In [None]:
X

After

Notice that we have lost our column names, but the data is now much more consistent. Technically, we don't need to scale our one-hot encoded columns because their scale was already close to that of the scaled data, but we let StandardScaler transform them here for ease of reading.

In [None]:
pd.DataFrame(StandardScaler().fit_transform(X))
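What StandardScaler does to each column can be sketched by hand. The small frame `toy` below is made up for illustration and only stands in for `X`. StandardScaler subtracts each column's mean and divides by its population standard deviation, and passing `columns=` when rebuilding the dataframe keeps the column names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up frame standing in for X (values are illustrative only)
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0],
                    "Fare": [7.25, 71.28, 7.93, 53.10]})

# Manual version: z = (x - column mean) / column standard deviation (ddof=0)
by_hand = (toy - toy.mean()) / toy.std(ddof=0)

# StandardScaler returns a numpy array, so restore the column names and index
scaler = StandardScaler()
scaled_toy = pd.DataFrame(scaler.fit_transform(toy),
                          columns=toy.columns, index=toy.index)

print(scaled_toy.round(3))
print("matches manual calculation:", np.allclose(by_hand, scaled_toy))
```

After the transform, every column has mean 0 and standard deviation 1, which is what puts them on the same scale.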

## Now do the same with several random columns

In [None]:
X_random_columns_scaled = X.copy()
X_random_columns_scaled['a_random_big'] = random.sample(range(9999999, 99999999), X_random_col.shape[0])
X_random_columns_scaled['b_random_big'] = random.sample(range(9999999, 99999999), X_random_col.shape[0])
X_random_columns_scaled['c_random_big'] = random.sample(range(9999999, 99999999), X_random_col.shape[0])
X_random_columns_scaled['d_random_big'] = random.sample(range(9999999, 99999999), X_random_col.shape[0])
X_random_columns_scaled['e_random_big'] = random.sample(range(9999999, 99999999), X_random_col.shape[0])

# the lines above generate one new random number per row for each new column
print('prescaled',"\n")
display(X_random_columns_scaled[['Age','a_random_big','b_random_big','c_random_big','d_random_big','e_random_big']].head())

#now scale

# keep the column names since we will lose them
column_names = X_random_columns_scaled.columns

X_random_columns_scaled = StandardScaler().fit_transform(X_random_columns_scaled)
X_random_columns_scaled = pd.DataFrame(X_random_columns_scaled,columns = column_names )
print("\n",'scaled')
X_random_columns_scaled.head()
 

# Dummy Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html

Dummy classifiers are useful because they show how a naive classification strategy would perform. This establishes a baseline that we can try to beat with our model. It's often surprisingly hard to do better than some of these strategies.

In [None]:
print("uniform f1 cv score:",round(cross_val_score(DummyClassifier(strategy="uniform"),X_random_col, y,scoring='f1').mean(),2))
print("stratified f1 cv score:",round(cross_val_score(DummyClassifier(strategy="stratified"),X_random_col, y,scoring='f1').mean(),2))
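As a rough illustration of what these strategies do, the sketch below fits three dummy classifiers on made-up labels. The 60/40 class split and the names `X_toy` and `y_toy` are assumptions for the example, not the notebook's data. `most_frequent` always predicts the majority class, `stratified` guesses in proportion to the training class frequencies, and `uniform` guesses each class with equal probability.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with a 60/40 class split (an assumption for illustration)
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))   # DummyClassifier ignores the feature values

shares = {}
for strategy in ["most_frequent", "stratified", "uniform"]:
    dummy = DummyClassifier(strategy=strategy, random_state=0).fit(X_toy, y_toy)
    preds = dummy.predict(X_toy)
    shares[strategy] = preds.mean()   # share of rows predicted as class 1
    print(strategy, "-> share predicted positive:", shares[strategy])
```

Because none of these strategies looks at the features, any model that uses the features sensibly should be able to beat them.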

In [None]:
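# n_neighbors must be at least 1 and p must be positive, so neither list starts at 0
parameters = {'n_neighbors': list(range(1, 50)),
              'p': [1, 2]  # 1 = Manhattan distance, 2 = Euclidean distance
              }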
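The `p` hyperparameter is the order of the Minkowski distance that KNeighborsClassifier uses by default. The helper `minkowski` below is a hypothetical function written for this sketch, not part of scikit-learn. It shows how the same pair of points gets a different distance for each value of `p`.

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two points."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

manhattan = minkowski(u, v, p=1)   # |3| + |4|
euclidean = minkowski(u, v, p=2)   # sqrt(3^2 + 4^2)
print("p=1 (Manhattan):", manhattan)
print("p=2 (Euclidean):", euclidean)
```

Searching over both values lets the grid search pick whichever notion of distance works better for the data.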

# KNN with unscaled data with random large columns

Notice the performance here. Because the data is unscaled, the large random column dominates the distance calculation and our model performs no better than the stratified dummy classifier. The model performs poorly because the data has not been prepared properly.

In [None]:
clf = GridSearchCV(KNeighborsClassifier(), parameters,scoring='f1').fit(X_random_col, y)
result_df = pd.DataFrame(clf.cv_results_)
result_df[result_df['rank_test_score']==1]['mean_test_score']

# KNN with scaled data random columns

Notice the performance here. Now that we have scaled our data, our model performs substantially better than both the unscaled model and the dummy classifiers.

In [None]:
clf = GridSearchCV(KNeighborsClassifier(), parameters,scoring='f1').fit(X_random_columns_scaled, y)
result_df = pd.DataFrame(clf.cv_results_)
result_df[result_df['rank_test_score']==1]['mean_test_score']

# KNN with scaled original data

Notice the performance here. Eliminating the random columns increases the performance slightly. 

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)

X_scaled = scaler.transform(X)
X_scaled[:2]

# notice that StandardScaler has converted the data to a numpy array. we have lost the column names. 

In [None]:
clf = GridSearchCV(KNeighborsClassifier(), parameters,scoring='f1').fit(X_scaled, y)
result_df = pd.DataFrame(clf.cv_results_)
result_df[result_df['rank_test_score']==1]['mean_test_score']
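The cells above scale the full dataset once and then cross-validate. One alternative, sketched here on synthetic data from `make_classification` (an assumption, not the Titanic data), is to put the scaler and the classifier in a `Pipeline` so that each cross-validation fold fits the scaler on its own training portion only. The step names `scale` and `knn` and the small grid are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features and target (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# The scaler is refit inside every training fold, then applied to the held-out fold
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = {"knn__n_neighbors": [3, 5, 11, 21], "knn__p": [1, 2]}

search = GridSearchCV(pipe, grid, scoring="f1", cv=5).fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

`best_params_` and `best_score_` give the same information as filtering `cv_results_` for `rank_test_score == 1`, in a more compact form.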

## KNN hyperparameter exploration

In [None]:
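# n_neighbors must be at least 1 and p must be positive, so neither list starts at 0
parameters = {'n_neighbors': list(range(1, 150)),
              'p': [1, 2]  # 1 = Manhattan distance, 2 = Euclidean distance
              }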

clf = GridSearchCV(KNeighborsClassifier(), parameters,scoring='f1').fit(X_scaled, y)
result_df = pd.DataFrame(clf.cv_results_)
#result_df[result_df['rank_test_score']==1]['mean_test_score']

Consider the model default of 5 nearest neighbors. Would that default have achieved good results for us in this modeling task?

In [None]:
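# cast the parameter column to int so seaborn treats it as numeric
result_df['param_n_neighbors'] = result_df['param_n_neighbors'].astype(int)
# recent seaborn versions require x and y to be passed as keyword arguments
sns.lmplot(x='param_n_neighbors', y='mean_test_score', data=result_df, fit_reg=False)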
plt.title("all models mean F1 score optimized by n_neighbors")
plt.show()

In [None]:
result_df[['param_n_neighbors','param_p','mean_test_score']].sort_values(by='mean_test_score',ascending=False).head(10)

In [None]:
!cp "/content/drive/My Drive/Colab Notebooks/4482_KNN_scaled.ipynb" ./

# run the second shell command, jupyter nbconvert --to html "file name of the notebook"
# create html from ipynb

!jupyter nbconvert --to html "4482_KNN_scaled.ipynb"