# KNN - Practice Notebook

In [None]:
# Run this cell unchanged
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Below, we import the Titanic dataset. To view the documentation for this dataset, please see [this Kaggle competition](https://www.kaggle.com/c/titanic/data).

In [None]:
# Run this cell unchanged
raw = pd.read_csv('data/titanic.csv')

Next, we drop features that are not useful for a predictive model, along with any rows that contain missing values.

In [None]:
drop = ['Name','PassengerId', 'Ticket', 'Cabin']

data = raw.drop(drop, axis=1)
data = data.dropna()

The target of this dataset is `Survived`, which indicates whether or not a passenger survived the sinking of the Titanic.

In the cell below, we separate the target from the predictors.

In [None]:
X = data.drop('Survived', axis = 1)
y = data.Survived

## 1. Create a train test split

In the cell below, create a train-test split for our dataset using `2021` as the `random_state`.

In [None]:
# Import train_test_split
# YOUR CODE HERE

# Create a train test split
# YOUR CODE HERE
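
As a rough illustration of what a train-test split does, here is a minimal sketch on a small toy array rather than the Titanic data. The names `X_toy`, `y_toy`, the `test_size` of 0.3, and the `random_state` of 0 are assumptions chosen for the example, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, 2 features, and a binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 30% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes results comparable from one run of the notebook to the next.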

The goal of this notebook is to focus on KNN, so we will take care of the preprocessing for the categorical columns.

In the cell below, we one hot encode the categorical columns of our dataset.

In [None]:
# Run this cell unchanged
# ================== One Hot Encoding Functions ===========================

def ohe_transform(encoder, data):
    # Transform categorical data
    encoded = encoder.transform(data)
    # Collect encoded feature_names
    encoded_features = encoder.get_feature_names()
    # Create a vectorized function for replacing string values
    replace = lambda string, original, new: string.replace(original, new)
    replace = np.vectorize(replace)
    # Loop over the original column names
    for idx in range(len(data.columns)):
        # Create original string value to be replaced
        original = f'x{idx}'
        # Isolate column name
        new = data.columns[idx]
        # Replace string values
        encoded_features = replace(encoded_features, original, new)
    
    encoded = pd.DataFrame(encoded, 
                           columns=encoded_features)
    
    return encoded

def concat_encoded(model_data, model_target, encoded_data):
    # Create copies of modeling data
    data = model_data.copy()
    target = model_target.copy()
    # Reset the index for all datasets
    data = data.reset_index(drop=True)
    target = target.reset_index(drop=True)
    encoded_data = encoded_data.reset_index(drop=True)
    # Concatenate the encoded data to the modeling data
    concat = pd.concat([data, encoded_data], axis = 1)
    
    return concat, target

# ================== One Hot Encoder Preprocessing ===========================

from sklearn.preprocessing import OneHotEncoder

# Initialize encoder
encoder = OneHotEncoder(sparse = False)
# Isolate categorical features
train_cat = X_train.select_dtypes('object')
test_cat = X_test.select_dtypes('object')
# Fit encoder
encoder.fit(train_cat)

# Encode categorical features
train_encoded = ohe_transform(encoder, train_cat)
test_encoded = ohe_transform(encoder, test_cat)

# Drop categorical features from modeling data
X_train_no_categoricals = X_train.drop(train_cat.columns, axis = 1)
X_test_no_categoricals = X_test.drop(test_cat.columns, axis = 1)

# Add encoded features to modeling data
X_train, y_train = concat_encoded(X_train_no_categoricals,
                                 y_train, train_encoded)
X_test, y_test = concat_encoded(X_test_no_categoricals,
                                 y_test, test_encoded)
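
To make the idea of one hot encoding concrete, here is a small sketch on a toy frame. It is not the helper code above: the frames `toy_train` and `toy_test`, the function `encode`, and the choice of `handle_unknown='ignore'` are all assumptions made for illustration. The key point it shows is that the encoder is fit on the training data only and then reused on the test data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data (hypothetical values)
toy_train = pd.DataFrame({'Embarked': ['S', 'C', 'S'],
                          'Sex': ['male', 'female', 'female']})
toy_test = pd.DataFrame({'Embarked': ['Q'], 'Sex': ['male']})

# Assumption: ignore categories that were not seen during fitting
toy_encoder = OneHotEncoder(handle_unknown='ignore')
# Fit on the training data only
toy_encoder.fit(toy_train)

def encode(frame):
    # transform returns a sparse matrix by default, so convert it to dense
    dense = toy_encoder.transform(frame).toarray()
    # Build readable column names from the learned categories
    names = [f'{col}_{cat}'
             for col, cats in zip(frame.columns, toy_encoder.categories_)
             for cat in cats]
    return pd.DataFrame(dense, columns=names)

train_ohe = encode(toy_train)
test_ohe = encode(toy_test)
print(train_ohe)
print(test_ohe)
```

In this sketch the test row contains `'Q'`, a category absent from the training data, so both `Embarked` columns are zero for that row.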

**Great.** Let's take a look at our training data. 

In [None]:
# Run this cell unchanged
X_train.head(3)

## 2. Create a KNN model

In the cell below:
* Initialize a KNN model with the default settings.
* Fit the KNN model to the training data.

In [None]:
# Import the KNN classifier model from sklearn
# YOUR CODE HERE

# Initialize the model with default settings
# YOUR CODE HERE

# Fit the model to the training data
# YOUR CODE HERE
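
Before using sklearn's classifier, it can help to see what a KNN prediction involves. The sketch below is a simplified NumPy version written for illustration, not sklearn's implementation. The function `knn_predict` and the arrays `X_demo` and `y_demo` are hypothetical, and it assumes Euclidean distance with a plain majority vote.

```python
import numpy as np

def knn_predict(X_fit, y_fit, x_new, k=5):
    # Euclidean distance from x_new to every stored training point
    distances = np.sqrt(((X_fit - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    # (ties go to the smaller label in this sketch)
    votes = np.bincount(y_fit[nearest])
    return int(votes.argmax())

# Two well-separated toy clusters
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_demo, y_demo, np.array([5.5, 5.5]), k=3))
```

Note that "fitting" here amounts to storing the training data; the work happens at prediction time. Because the vote depends on raw distances, features on larger scales have more influence.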

## 3. Generate training and validation scores

Let's evaluate our model using `f1`. 

In the cell below:
* Import the necessary tools for calculating the f1 score
* Import `cross_val_score` from sklearn's `model_selection` module
* Calculate the f1 score for the training data
    * Store this metric in the variable `train_score`
* Calculate the *mean* validation f1 score by passing the training data into `cross_val_score`
    * Store this metric in the variable `val_score`.

In [None]:
# Import the necessary tools for calculating the f1 score
# YOUR CODE HERE

# Import cross_val_score from sklearn's model_selection module
# YOUR CODE HERE


# Calculate the f1 score for the training data
# YOUR CODE HERE

# Calculate the mean validation f1 score 
# by passing the training data into cross_val_score
# YOUR CODE HERE


print('Train:', train_score)
print('Validation:', val_score)
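
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand on a tiny made-up example, then shows the general shape of a cross-validated F1 score on synthetic data. Everything here (the label arrays, `make_classification` settings, and the 5 folds) is an assumption for illustration and is unrelated to the Titanic results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hand-computed F1 on made-up labels
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])
tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)
print(manual_f1, f1_score(y_true, y_pred))

# Cross-validated F1 on synthetic data (hypothetical settings)
X_syn, y_syn = make_classification(n_samples=200, n_features=5,
                                   random_state=0)
fold_scores = cross_val_score(KNeighborsClassifier(), X_syn, y_syn,
                              scoring='f1', cv=5)
mean_val = fold_scores.mean()
print(fold_scores.shape, mean_val)
```

`cross_val_score` refits the model on each fold, so every validation score is computed on rows the model did not see while fitting. A training score, by contrast, is computed on the same rows used for fitting.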

## 4. Multiple Choice

In [None]:
# Run this cell unchanged
from src.questions import *
question_4.display()

## 5. Find the best `k`.

A preliminary interpretation of the above metrics is that the model may be overfit to the training data. Let's see what happens to the training and validation scores as we change the number of neighbors used to generate a prediction.

In [None]:
# Create an array of integers from 1 to 150
ks = np.arange(1, 151)

# Create an empty list for train scores
train_scores = []
# Create an empty list for validation scores
val_scores = []
# Loop over the different options for k
for k in ks:
    # Initialize a knn model with a specific k
    # YOUR CODE HERE
    # Fit the model to the training data
    # YOUR CODE HERE
    # Calculate the f1 score for the training data
    # YOUR CODE HERE
    # Calculate the f1 score for the validation data
    # YOUR CODE HERE
    # Append the train score to the train scores list
    # YOUR CODE HERE
    # Append the validation score to the validation scores list
    # YOUR CODE HERE

Now let's plot the training and validation scores!

In [None]:
# Run this cell unchanged
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(ks, train_scores, label='Train', lw=3)
ax.plot(ks, val_scores, label='Validation', lw=3)
best_k = sorted(list(zip(val_scores, ks)), reverse=True)[0][1]
ax.vlines(best_k, 0.4, 1, color='black', lw=2, label='Best K')
ax.set_xlabel('K', fontsize=15)
ax.set_ylabel('F1-Score', fontsize=15)
ax.legend(fontsize=15)
ax.grid();
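
As a small sketch of how a best `k` can be read off a list of validation scores, the example below uses made-up numbers. The names `toy_ks` and `toy_val_scores` are hypothetical. It shows two equivalent approaches: sorting `(score, k)` pairs in descending order and taking the first pair, or using `np.argmax`.

```python
import numpy as np

# Made-up validation scores for k = 1..5
toy_ks = np.arange(1, 6)
toy_val_scores = [0.61, 0.70, 0.74, 0.72, 0.69]

# Sort (score, k) pairs from highest to lowest score;
# index 0 holds the pair with the highest score
ranked = sorted(zip(toy_val_scores, toy_ks), reverse=True)
best_from_sort = ranked[0][1]

# Equivalent lookup: position of the maximum score
best_from_argmax = toy_ks[int(np.argmax(toy_val_scores))]

print(best_from_sort, best_from_argmax)
```

The two approaches can differ when scores tie: the descending sort prefers the larger `k` among tied scores, while `np.argmax` returns the first (smallest) one.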

## 6. Multiple Choice

In [None]:
# Run this cell unchanged
question_6.display()

## 7. Text answer

In [None]:
# Run this cell unchanged
question_7.display()

### A quick, fun aside about KNN

KNN is most frequently used as a machine learning prediction tool, but at its core, it is simply an algorithm for finding similarity. 

Because of that, KNN has a remarkable number of interesting applications.

Below, let's look at a simple one:
> Using KNN to improve a visualization. 

In the cell below, we import a dataset containing the population of every Illinois municipality from 2010-2018.

In [None]:
# Run this cell unchanged
df = pd.read_csv('data/illinois-populations.csv')\
.set_index('city')
df.head()

We could plot the population of every municipality if we wanted to, but the populations differ so widely in magnitude that the visualization becomes very **un**informative.

In [None]:
# Run this cell unchanged
df.T.plot(figsize=(15,6), legend=False);

Instead, we can use `NearestNeighbors` to find the 5 municipalities that are *most similar* to a given municipality, and only plot the populations for those 6 communities (the community we have selected plus the five most similar).

In [None]:
# Run this cell unchanged
# Import NearestNeighbors
from sklearn.neighbors import NearestNeighbors
# Initialize
neighbors = NearestNeighbors()
# Fit to the entire dataset
neighbors.fit(df)
# Use .kneighbors to return
# The 6 most similar observations for
# each data point
distance, indices = neighbors.kneighbors(df, 6)
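
The reason for requesting 6 neighbors rather than 5 can be seen on a toy example. When the query points are the same points the model was fit on, each point's closest neighbor is itself, at distance 0. The array `points` below is hypothetical, and the sketch assumes there are no duplicate rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Five toy points on a line (hypothetical data)
points = np.array([[0.0], [1.0], [10.0], [11.0], [50.0]])

nn = NearestNeighbors()
nn.fit(points)

# Ask for the 2 nearest neighbors of every fitted point
dist, idx = nn.kneighbors(points, 2)

# Column 0 is each point itself; column 1 is its nearest other point
print(idx)
print(dist)
```

So asking for `n + 1` neighbors of a fitted point yields that point plus its `n` most similar other observations.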

Now, using the indices generated above, we can plot the population data for any selected town, and show the five communities whose populations from 2010-2018 are most similar.

In [None]:
# Run this cell unchanged
from src.plotter import PlotPopulation
city_plotter = PlotPopulation(df, indices)
city_plotter.display()