# Predicting Titanic survivors with *k*-NN

In this notebook we're going to predict whether passengers on the Titanic survived, using the *k*-NN algorithm. This is a classic dataset and you can find it on [Kaggle](https://www.kaggle.com/c/titanic).

In [1]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

## Data set

Let's first look at the dataset and see which variables we can use.

In [5]:
df = pd.read_csv("titanic.csv")
df.head(30) #show a bit more of the dataset

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Cabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,C85
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,C123
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,E46
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,


* *PassengerId* is just an ID variable, we don't use it
* *Survived* is our dependent variable
* There are 5 variables that are easy to work with: *Pclass*, *Sex*, *Age* (though it contains some NaNs), *SibSp* (number of siblings and spouses), *Parch* (number of parents and children).
* The others would require a lot more clever data manipulation to be useful. If you check out the Kaggle page you can see how people approach this.

## Data cleaning

Let's select the variables. We also need to drop the rows that contain NaNs, because our *k*-NN algorithm won't work with them. Dealing with missing values is a complicated topic within statistics, so for now we simply drop those rows and then check how many of the remaining passengers survived.

In [6]:
df = df[['Survived','Pclass', 'Age', 'SibSp', 'Sex', 'Parch']]
df = df.dropna() #get rid of rows with empty cells
df.head()
df['Survived'].value_counts()

0    424
1    290
Name: Survived, dtype: int64

Let's add dummy variables for the variable *Sex*.

In [8]:
dummies = pd.get_dummies(df['Sex'])
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns (axis=0 is rows)
df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Sex,Parch,female,male,female.1,male.1
0,0,3,22.0,1,male,0,0,1,0,1
1,1,1,38.0,1,female,0,1,0,1,0
2,1,3,26.0,0,female,0,1,0,1,0
3,1,1,35.0,1,female,0,1,0,1,0
4,0,3,35.0,0,male,0,0,1,0,1
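
To see what the dummy columns encode, here is a minimal pure-Python sketch. It is not how pandas implements `get_dummies`; the helper `one_hot` is a hypothetical name used only for illustration, and the input is the *Sex* value of the five rows shown above.

```python
def one_hot(values):
    """Return {category: [0/1 indicator per value]} (illustrative helper)."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Sex values of the five rows shown in the table above
sexes = ["male", "female", "female", "female", "male"]
dummies = one_hot(sexes)
print(dummies["female"])  # [0, 1, 1, 1, 0]
print(dummies["male"])    # [1, 0, 0, 0, 1]

# With only two categories, each column is fully determined by the other.
redundant = all(f + m == 1 for f, m in zip(dummies["female"], dummies["male"]))
print(redundant)  # True
```

Because the two indicators always sum to 1, keeping one of them loses no information.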


## Building the model

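Let's build the model. Remember that we can only add one of the variables *male* and *female*. Each is fully determined by the other (*male* is always 1 - *female*), so the second one adds no information. Note that the table above shows the dummy columns twice (*female.1*, *male.1*), which suggests the cell that adds them was run more than once; as a result, selecting *female* below returns both copies. Restarting and running the cells in order avoids this.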

In [9]:
X = df[['Age', 'Pclass', 'SibSp', 'Parch', 'female']] #create the X matrix

y = df['Survived'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables

X_train.head() #show the head of the training set

Unnamed: 0,Age,Pclass,SibSp,Parch,female,female.1
641,24.0,1,0,0,1,1
433,17.0,3,0,0,0,0
202,34.0,3,0,0,0,0
585,18.0,1,0,2,1,1
544,50.0,1,1,0,0,0
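
As a rough illustration of what a 70/30 split does, the sketch below shuffles row indices and cuts off a test portion. The helper `shuffle_split` is a hypothetical name, and this is not scikit-learn's implementation: it will not select the same rows as `train_test_split`, it only shows the idea and the resulting sizes for the 714 rows left after dropping NaNs.

```python
import math
import random

def shuffle_split(n_rows, test_size, seed):
    """Shuffle row indices and cut off a test portion (illustrative helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed makes the split repeatable
    n_test = math.ceil(test_size * n_rows)    # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(n_rows=714, test_size=0.3, seed=1)
print(len(train_idx), len(test_idx))  # 499 215

# No row ends up in both sets.
overlap = set(train_idx) & set(test_idx)
print(len(overlap))  # 0
```

The test size of 215 matches the total in the confusion matrix further down.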


Let's use the *KNeighborsClassifier* class from sklearn:

In [32]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier() #create a KNN-classifier with 5 neighbors (default)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
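
To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of *k*-NN prediction: compute the distance from a new point to every training point, take the *k* closest, and return the majority label. It assumes plain Euclidean distance and is not sklearn's implementation. The function `knn_predict` and the query passenger are hypothetical; the training rows are taken from the head of the dataset shown earlier (skipping the row with a missing *Age*).

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Predict the majority label among the k training points closest to query."""
    distances = [(math.dist(row, query), label) for row, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])           # closest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Columns: (Age, Pclass, SibSp, Parch, female)
toy_X = [
    (22, 3, 1, 0, 0),
    (38, 1, 1, 0, 1),
    (26, 3, 0, 0, 1),
    (35, 1, 1, 0, 1),
    (35, 3, 0, 0, 0),
    (54, 1, 0, 0, 0),
]
toy_y = [0, 1, 1, 1, 0, 0]  # Survived

# Hypothetical passenger: 30-year-old woman in first class, travelling alone
prediction = knn_predict(toy_X, toy_y, query=(30, 1, 0, 0, 1), k=3)
print(prediction)  # 1
```

Here two of the three nearest neighbors survived, so the prediction is 1.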



## Model evaluation

Let's start by calculating accuracy. As always, we do the evaluation on the test data.

In [33]:
knn.score(X_test, y_test) #calculate the fit on the *test* data

0.827906976744186

Accuracy is 82.8%. A simple point of comparison is the best baseline guess: always predict "Not survived". That would give us 424 / (424 + 290) = 59.4% (see *value_counts* above). So the model does a lot better than the baseline guess. Let's create a confusion matrix to evaluate precision and recall.

In [34]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[118,  16],
       [ 21,  60]], dtype=int64)

Let's pretty print that.

In [35]:
conf_matrix = pd.DataFrame(cm, index=['Not survived (actual)', 'Survived (actual)'], columns = ['Not survived (predicted)', 'Survived (predicted)']) 
conf_matrix

Unnamed: 0,Not survived (predicted),Survived (predicted)
Not survived (actual),118,16
Survived (actual),21,60


#### Accuracy

Let's start with the accuracy. We already calculated it with the *.score* method, but let's check it against the confusion matrix.

In [36]:
(118+60)/(118+21+16+60)


0.827906976744186

Indeed, accuracy is 82.8%. 
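
For comparison, the majority-class baseline mentioned earlier can be computed the same way. The sketch below uses only numbers already shown in this notebook (the *value_counts* output and the confusion matrix); the variable names are my own. It also computes the baseline on the test set alone, which is slightly higher.

```python
not_survived, survived = 424, 290  # counts from value_counts above

# Always guessing the most common outcome, on the full cleaned dataset
baseline_all = max(not_survived, survived) / (not_survived + survived)

# Same guess on the test set: the first row of the confusion matrix
# holds the 118 + 16 passengers who actually did not survive
baseline_test = (118 + 16) / (118 + 21 + 16 + 60)

model_accuracy = (118 + 60) / (118 + 21 + 16 + 60)

print(f"baseline (all rows): {baseline_all:.3f}")    # baseline (all rows): 0.594
print(f"baseline (test set): {baseline_test:.3f}")   # baseline (test set): 0.623
print(f"model accuracy:      {model_accuracy:.3f}")  # model accuracy:      0.828
```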

#### Precision (survived)

Remember that precision and recall can only be calculated for an *outcome*, not for an entire variable (unlike accuracy). I'll calculate precision and recall for *survived* and leave the others for the reader to try.

Let's start with precision. This is the number of correctly predicted survivors, divided by the total number of predicted survivors. Remember: how "precise" am I in saying people survived?

In [38]:
60/(16+60)

0.7894736842105263

Precision is 78.9%, so a little bit worse than accuracy overall but not much.

#### Recall (survived)

Now recall. This is the number of correctly predicted survivors, divided by the total number of actual survivors. Remember: how many of the people who survived do I "recall"?

In [39]:
60/(21+60)

0.7407407407407407

So recall is 74.1%, again a bit worse than overall accuracy but not by much. How is it possible that both precision and recall are worse than accuracy, you might ask. Remember that *survived* is just one outcome: apparently the model is better at predicting *not survived*. It's typical, though not always the case, that the more common outcome is predicted better.
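
For readers who want to check their own calculation for the other outcome, the sketch below computes precision and recall for both classes directly from the confusion matrix above. The helper `precision_recall` is a hypothetical name, not an sklearn function.

```python
# Rows: actual, columns: predicted (index 0 = not survived, 1 = survived)
cm = [[118, 16],
      [21, 60]]

def precision_recall(cm, cls):
    """Precision and recall for one class of a square confusion matrix."""
    true_pos = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column total: predicted as cls
    actual = sum(cm[cls])                    # row total: actually cls
    return true_pos / predicted, true_pos / actual

for cls, name in enumerate(["not survived", "survived"]):
    p, r = precision_recall(cm, cls)
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
# not survived: precision=0.849, recall=0.881
# survived: precision=0.789, recall=0.741
```

Both numbers for *not survived* are above the overall accuracy, which is why the numbers for *survived* can both be below it.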

## Parameter setting

Finally, let's try out different settings for the most important parameter *k*. I'll use a for-loop to do a simple parameter grid search. I'll use a built-in function *classification_report* in sklearn to print out accuracy, precision and recall quickly.

In [40]:
from sklearn.metrics import classification_report

for i in range(1,11):
    knn_new = KNeighborsClassifier(n_neighbors = i) #make a new kNN model with i (1-10) neighbors
    knn_new = knn_new.fit(X_train, y_train) #fit new model on train data
    y_test_pred_new = knn_new.predict(X_test) #predict using new model, with test data
    print(f"With {i} neighbors the result is:")
    print(classification_report(y_test, y_test_pred_new)) #use a built-in function to print out accuracy, precision and recall


With 1 neighbors the result is:
              precision    recall  f1-score   support

           0       0.86      0.79      0.82       134
           1       0.70      0.79      0.74        81

    accuracy                           0.79       215
   macro avg       0.78      0.79      0.78       215
weighted avg       0.80      0.79      0.79       215

With 2 neighbors the result is:
              precision    recall  f1-score   support

           0       0.80      0.93      0.86       134
           1       0.84      0.60      0.71        81

    accuracy                           0.81       215
   macro avg       0.82      0.77      0.78       215
weighted avg       0.81      0.81      0.80       215

With 3 neighbors the result is:
              precision    recall  f1-score   support

           0       0.87      0.83      0.85       134
           1       0.74      0.79      0.76        81

    accuracy                           0.81       215
   macro avg       0.80      0.8
[output truncated]

The scores seem broadly similar, but 6 or 7 neighbors seem to give the best result. However, with such a small dataset this could well be coincidence, and we don't know whether the result would generalize. To find out, we could repeat the evaluation over several different train-test splits (a method called cross-validation).
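
As a rough sketch of the splitting idea behind cross-validation, the code below divides the 714 rows into 5 folds so that every row is used for testing exactly once. The helper `k_fold_indices` is a hypothetical name and the sketch does no shuffling; in practice one would more likely use sklearn's own helpers such as `cross_val_score`.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, test_indices) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // n_folds + (1 if f < n_rows % n_folds else 0)
                  for f in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_rows=714, n_folds=5))
print([len(test) for _, test in folds])  # [143, 143, 143, 143, 142]
```

Fitting and scoring the model once per fold, then averaging the scores, gives an estimate that depends less on one particular split.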