# Predicting Titanic survivors with *k*-NN

In this notebook we're going to predict whether passengers on the Titanic survived, using the *k*-nearest neighbors (*k*-NN) algorithm. This is a classic dataset, and you can find it on [Kaggle](https://www.kaggle.com/c/titanic).

In [3]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

## Data set

Let's first look at the dataset and see which variables we can use.

In [7]:
df = pd.read_csv("titanic.csv")
df.head(30) #show a bit more of the dataset

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Cabin |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | NaN |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | C85 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | NaN |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | C123 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | NaN |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | NaN |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | E46 |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | NaN |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | NaN |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | NaN |


* *PassengerId* is just an ID variable, we don't use it
* *Survived* is our dependent variable
* There are 5 variables that are easy to work with: *Pclass*, *Sex*, *Age* (though it contains some NaNs), *SibSp* (number of siblings and spouses), *Parch* (number of parents and children).
* The others would require a lot more clever data manipulation to be useful. If you check out the Kaggle page you can see how people approach this.

## Data cleaning

Let's select the variables. Our *k*-NN algorithm can't handle NaNs, so we also need to do something about the rows that contain them. Dealing with missing values is actually a very complicated topic within statistics. For now, let's just drop the rows with NaNs and then see how many people survived.

In [8]:
df = df[['Survived','Pclass', 'Age', 'SibSp', 'Sex', 'Parch']]
df = df.dropna() #get rid of rows with empty cells
df['Survived'].value_counts() #count survivors (1) and non-survivors (0)

0    424
1    290
Name: Survived, dtype: int64
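Before dropping rows, it can help to see how much data we are about to lose. Here is a minimal sketch of that check. It uses a small made-up DataFrame (`toy`, a hypothetical stand-in for the real file) so it runs on its own; on the real data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titanic.csv
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Pclass": [3, 1, 3, 3, 1],
})

missing_per_column = toy.isna().sum()  # number of NaNs in each column
rows_before = len(toy)
cleaned = toy.dropna()                 # keep only the complete rows
rows_after = len(cleaned)

print(missing_per_column)
print(f"dropped {rows_before - rows_after} of {rows_before} rows")
```

If a large share of the rows disappears, dropping them may bias the results, which is one reason missing values are such a tricky topic.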

Let's add dummy variables for the variable *Sex*.

In [9]:
dummies = pd.get_dummies(df['Sex'])
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns (axis=0 is rows)
df.head()

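| | Survived | Pclass | Age | SibSp | Sex | Parch | female | male |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | male | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | female | 0 | 1 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | female | 0 | 1 | 0 |
| 3 | 1 | 1 | 35.0 | 1 | female | 0 | 1 | 0 |
| 4 | 0 | 3 | 35.0 | 0 | male | 0 | 0 | 1 |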
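The two dummy columns carry the same information: every row has exactly one 1, so *male* is always 1 minus *female*. The sketch below checks this on a tiny made-up column (`toy` is hypothetical). Note that recent pandas versions may return `True`/`False` instead of 0/1 by default; passing `dtype=int` is one way to keep the 0/1 form.

```python
import pandas as pd

# Hypothetical toy column standing in for df['Sex']
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
dummies = pd.get_dummies(toy["Sex"], dtype=int)  # dtype=int gives 0/1 rather than booleans

# Each row has exactly one 1, so the two columns always sum to 1
row_sums = dummies["female"] + dummies["male"]
male_from_female = 1 - dummies["female"]

print(dummies)
print((male_from_female == dummies["male"]).all())
```

Since one column can be rebuilt from the other, keeping only one of them loses nothing.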


## Building the model

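Let's build the model. Remember that we should only add one of the variables *male* and *female*. They are perfectly (negatively) correlated in this dataset: *male* is always 1 minus *female*, so the second one adds no new information. For *k*-NN, including both would simply count sex twice in the distance between passengers.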

In [17]:
from sklearn.preprocessing import normalize
X = df[['Age', 'Pclass', 'SibSp', 'Parch', 'female']] #create the X matrix
y = df['Survived'] #create the y-variable
X = normalize(X) #note: this rescales each row (passenger) to unit length, not each column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
X_train

array([[0.99794027, 0.04536092, 0.        , 0.        , 0.04536092],
       [0.99450545, 0.10468478, 0.        , 0.        , 0.        ],
       [0.99764151, 0.06139332, 0.        , 0.        , 0.03069666],
       ...,
       [0.99922929, 0.02775637, 0.        , 0.02775637, 0.        ],
       [0.99503719, 0.09950372, 0.        , 0.        , 0.        ],
       [0.99902487, 0.03121953, 0.        , 0.        , 0.03121953]])
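Notice that the first column (*Age*) is close to 1 in every row. That is because `normalize` rescales each *row* to length 1, so the largest variable in a row dominates. If the goal is to put the *columns* on a comparable scale, one common alternative is `StandardScaler`. The sketch below contrasts the two on a few made-up rows (the `toy` array is hypothetical); it is an illustration, not a replacement for the cell above, and the accuracy reported later in this notebook comes from the `normalize` version.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical rows in the order Age, Pclass, SibSp, Parch, female
toy = np.array([
    [22.0, 3, 1, 0, 0],
    [38.0, 1, 1, 0, 1],
    [26.0, 3, 0, 0, 1],
    [35.0, 1, 1, 0, 1],
])

row_scaled = normalize(toy)                        # default norm="l2", applied per row
row_lengths = np.linalg.norm(row_scaled, axis=1)   # every row now has length 1

col_scaled = StandardScaler().fit_transform(toy)   # per column: mean 0, unit variance
col_means = col_scaled.mean(axis=0)

print(row_scaled.round(3))
print(col_scaled.round(3))
```

When column-wise scaling is used together with a train/test split, the scaler is usually fitted on the training data only and then applied to the test data.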

Let's use the *KNeighborsClassifier* class from sklearn:

In [16]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier() #create a KNN-classifier with 5 neighbors (default)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
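To make clear what the classifier does with a new passenger, here is a rough sketch of the idea behind *k*-NN: measure the distance to every training point, take the *k* closest ones, and let them vote. The data (`X_toy`, `y_toy`, `new_point`) is made up for illustration, and the hand-written version is a simplification; sklearn's implementation is more efficient and handles ties and other details.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, well-separated toy data: two features, two classes
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
new_point = np.array([0.8, 0.9])
k = 3

distances = np.linalg.norm(X_toy - new_point, axis=1)     # Euclidean distance to every training point
nearest = np.argsort(distances)[:k]                       # indices of the k closest points
manual_prediction = np.bincount(y_toy[nearest]).argmax()  # majority vote among their labels

# The same prediction with sklearn, for comparison
sk_model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
sk_prediction = sk_model.predict([new_point])[0]

print(manual_prediction, sk_prediction)
```

Because the prediction depends entirely on distances, the scale of each variable directly affects which neighbors count as "nearest".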



## Model evaluation

Let's start by calculating accuracy. As always, we do the evaluation on the test data.

In [19]:
knn.score(X_test, y_test) #calculate the accuracy on the *test* data

0.8093023255813954

Accuracy is 80.9%. A simple point of comparison is the best baseline guess: always predict "Not Survived". On the full cleaned dataset that guess would be right for 424 / (424 + 290) = 59.4% of the passengers (see *value_counts* above). So the model does a lot better than the baseline guess.
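The baseline number can also be reproduced in code. The sketch below rebuilds the labels from the class counts shown by *value_counts* and scores sklearn's `DummyClassifier`, which always predicts the most frequent class. The `X_placeholder` array is a hypothetical stand-in, since this classifier ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the value_counts output: 424 did not survive, 290 survived
y_all = np.array([0] * 424 + [1] * 290)
X_placeholder = np.zeros((len(y_all), 1))  # DummyClassifier ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
baseline_accuracy = baseline.score(X_placeholder, y_all)  # share of the majority class

print(round(baseline_accuracy, 3))
```

This uses all 714 passengers. For a stricter comparison you could fit the same dummy classifier on `X_train, y_train` and score it on `X_test, y_test`; that baseline may differ slightly because the test split has its own class balance.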