# Exercise KNN
Fill the ellipses `...` with code, and don't remove `assert` lines.

## We will use the Titanic dataset.

### Overview

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).



### Data Dictionary

| VARIABLE  | DESCRIPTIONS | 
|---|---|
| **Survived** | Survival (0 = No, 1 = Yes) |
| **Pclass** | Ticket class	(1 = 1st, 2 = 2nd, 3 = 3rd) |
| **Name** | Name of Passenger |
| **Sex** | sex |
| **Age** | Age in years |
| **Sibsp** | # of siblings / spouses aboard the Titanic	 |
| **Parch** | # of parents / children aboard the Titanic |
| **Ticket** | Ticket number |
| **Fare** | Passenger fare |
| **Cabin** | Cabin number |
| **Embarked** | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |

### Variable Notes
**Pclass**: A proxy for socio-economic status (SES)

- 1st = Upper

- 2nd = Middle

- 3rd = Lower

**Age**: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

**Sibsp**: The dataset defines family relations in this way...

- Sibling = brother, sister, stepbrother, stepsister

- Spouse = husband, wife (mistresses and fiancés were ignored)

**Parch**: The dataset defines family relations in this way...

- Parent = mother, father

- Child = daughter, son, stepdaughter, stepson

- Some children travelled only with a nanny, therefore parch=0 for them.



In [1]:
# load our dataset
import pandas as pd 

data = pd.read_csv("data.csv")
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [2]:
X = data.drop(columns=['Survived'])
Y = data['Survived']

# split our data into training and testing set with 90:10 ratio
# use a fixed random state for reproducible results
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)

### Data Preprocess
We will preprocess the data !!! Remember the training and test set separately to avoid data snooping

In [3]:
# Remove category features
for i in x_train:
    if(x_train[i].dtypes == object):
        x_train = x_train.drop(columns=i)
        x_test = x_test.drop(columns=i)
        
# Fill na values with 0
x_train = x_train.fillna(0)
x_test = x_test.fillna(0)

x_train 

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
165,166,3,9.0,0,2,20.5250
541,542,3,9.0,4,2,31.2750
625,626,1,61.0,0,0,32.3208
388,389,3,0.0,0,0,7.7292
76,77,3,0.0,0,0,7.8958
...,...,...,...,...,...,...
106,107,3,21.0,0,0,7.6500
270,271,1,0.0,0,0,31.0000
860,861,3,41.0,2,0,14.1083
435,436,1,14.0,1,2,120.0000


In [4]:
# Convert x_train, x_test, y_train, y_test to numpy array
import numpy as np
x_train_np = x_train.to_numpy()
x_test_np = x_test.to_numpy()
y_train_np = y_train.to_numpy()
y_test_np = y_test.to_numpy()

# Scale the training and test set with sklearn libary
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train_np)
x_train_np = scaler.transform(x_train_np)
x_test_np = scaler.transform(x_test_np)

### KNN: $k$-Nearest Neighbors
Evaluate the test set with data from the training set.

In case of ties, pick the smallest class (i.e. we prefer class 0 to class 1 to class 2).

In [49]:
# Remember, no training is needed for KNN!
import math
def evaluateKNN_single(k, x_train, y_train, data):
    '''
    Evaluate the classification for `data` with k-nearest neighbor
    given training set (x_train, y_train).

    Note that this function takes in one input instead of the whole
    testing set.
    
    Input:
        k      : hyperparameter for KNN
        x_train: features of training set
        y_train: labels of training set
        data   : features of the data point to be evaluated
    Output:
        Classification of the input data point.
    '''
    # IMPLEMENT HERE
    distances =[]
    for i in range (x_train.shape[0]) : 
        distance = math.sqrt(sum([( float(data[j]) - float(x_train[i][j]) ) ** 2 for j in range (x_train.shape[1])]))
        distances.append({ 
            "label" : y_train[i], 
            "value" : distance
        })
    distances.sort(key = lambda x : x["value"])
    labels = [item["label"] for item in distances]
    return max(set(labels[:k]), key=lambda x: (labels.count(x), -x))

In [50]:
# Evaluation code for the whole dataset
def evaluateKNN(k, x_train=x_train_np, y_train=y_train_np, x_test=x_test_np, y_test=y_test_np):    
    correct = sum(list(map(lambda x: evaluateKNN_single(k, x_train, y_train, x[0]) == x[1], zip(x_test, y_test))))
    print(f'Test accuracy with k={k}: {correct/len(y_test)*100:.4f}% ({correct}/{len(y_test)})')
    

In [51]:
assert evaluateKNN_single(10, x_train_np, y_train_np, x_test_np[0]) in [0, 1], "The return value is not of the correct type"
evaluateKNN(5)

Test accuracy with k=5: 67.7778% (61/90)


In [None]:
assert evaluateKNN_single(10, x_train_np, y_train_np, x_test_np[0]) in [0, 1], "The return value is not of the correct type"
evaluateKNN(1)

### (Optional) Try other things

NameError: name 'distance' is not defined