# Kaggle Competition | Titanic Machine Learning from Disaster

> The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
>
> One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
>
> In this contest, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
>
> This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning.

From the competition [homepage](https://www.kaggle.com/c/titanic).

## This notebook

The aim of this notebook is to give you some insight into a more "real world" problem, where our data isn't cleaned for us. We need to explore the data, clean it and then apply some ML methods.

First we'll need to install some dependencies and then import them.

In [1]:
pip install -r requirements.txt



Note: you may need to restart the kernel to use updated packages.


In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from sklearn import datasets, svm
from statsmodels.nonparametric import smoothers_lowess
from statsmodels.nonparametric.kde import KDEUnivariate

In [34]:
%matplotlib inline

## The data

First let's read in the data and see what we're dealing with.

In [41]:
df = pd.read_csv("data/train.csv") 
print(fr"{df.shape[0]} rows, {df.shape[1]} columns")
df.head()

891 rows, 12 columns


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [59]:
f"The survived column is a boolean with values: {set(df['Survived'].unique())}"

'The survived column is a boolean with values: {0, 1}'

In [62]:
f"The Pclass column represents the class of the ticket with values: {set(df['Pclass'].unique())}"

'The Pclass column represents the class of the ticket with values: {1, 2, 3}'

In [66]:
f"The SibSp column takes values: {set(df['SibSp'].unique())}"

'The SibSp column takes values: {0, 1, 2, 3, 4, 5, 8}'

In [69]:
f"The Parch column takes values: {set(df['Parch'].unique())}"

'The Parch column takes values: {0, 1, 2, 3, 4, 5, 6}'

In [71]:
f"The Embarked column takes values: {set(df['Embarked'].dropna().unique())}"

"The Embarked column takes values: {'Q', 'C', 'S'}"

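The cell-by-cell checks above could also be collapsed into a single helper. The following is a minimal sketch, not part of the original notebook: `sample` is a tiny made-up frame that only borrows the column names from `train.csv`, and `distinct_values` is a hypothetical helper name.

```python
import pandas as pd

# Tiny made-up sample reusing column names from train.csv (values are illustrative only).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Embarked": ["S", "C", None, "Q"],
})


def distinct_values(frame, columns):
    """Map each column name to the sorted list of its non-null distinct values."""
    return {col: sorted(frame[col].dropna().unique().tolist()) for col in columns}


summary = distinct_values(sample, ["Survived", "Pclass", "Embarked"])
for col, values in summary.items():
    print(f"The {col} column takes values: {values}")
```

On the real data the same helper could be called as `distinct_values(df, ["Survived", "Pclass", "SibSp", "Parch", "Embarked"])`.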
## Missing values

One thing we want to watch out for is how complete our dataset is. First, let's look at what fraction of the entries in each column is non-null.

In [25]:
percentage_complete = df.notnull().sum() / df.shape[0]
percentage_complete.map("{:.1%}".format)

PassengerId    100.0%
Survived       100.0%
Pclass         100.0%
Name           100.0%
Sex            100.0%
Age             80.1%
SibSp          100.0%
Parch          100.0%
Ticket         100.0%
Fare           100.0%
Cabin           22.9%
Embarked        99.8%
dtype: object
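It can also help to see the absolute number of missing entries next to the completeness fraction. Here is a small sketch of one way to build such a report, assuming a made-up frame `toy` (its values are illustrative and not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# One row per column: count of missing entries and fraction of non-null entries,
# sorted so that the least complete column comes first.
report = pd.DataFrame({
    "missing": toy.isna().sum(),
    "complete": toy.notna().mean(),
}).sort_values("complete")

print(report)
```

Sorting by completeness puts the columns that need attention at the top, which is convenient once a dataset has more than a handful of columns.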

Clearly _Cabin_ stands out: fewer than a quarter of its entries are present. How can we deal with that? At this point it's probably simplest to remove the column from our dataset.

In [26]:
df.drop(["Cabin"], axis=1, inplace=True)
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


Now that we've removed _Cabin_ we're safe to drop all remaining rows that contain NaN values. If we'd run this line first, our dataset would have shrunk to at most ~23% of its original size!

In [28]:
df.dropna(inplace=True)
df.shape

(712, 11)
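To see why the order of these two steps matters, here is a minimal sketch on a made-up five-row frame. The values in `toy` are illustrative and not taken from `train.csv`; only the column names are borrowed.

```python
import numpy as np
import pandas as pd

# Made-up frame: "Cabin" is mostly missing, "Age" has a single gap.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Dropping rows first: any row with a missing Cabin is lost.
rows_if_dropna_first = len(toy.dropna())

# Dropping the sparse column first: only the row with a missing Age is lost.
rows_if_column_dropped_first = len(toy.drop(columns=["Cabin"]).dropna())

print(rows_if_dropna_first, rows_if_column_dropped_first)
```

If a sparse column needs to stay in the frame, `dropna(subset=[...])` restricts the null check to the listed columns instead.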

## Exploring the data visually

In [None]:
# specifies the parameters of our graphs
# (dpi is left at its default: dpi=1600 on an 18x6 inch figure would mean a 28800x9600 pixel image)
fig = plt.figure(figsize=(18, 6))
alpha_scatterplot = 0.2
alpha_bar_chart = 0.55

# lets us plot many differently shaped graphs together
ax1 = plt.subplot2grid((2,3),(0,0))
# plots a bar graph of those who survived vs those who did not.
df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
# widens the x-axis limits to leave a margin around the bars (a workaround for a bug in matplotlib 1.3.1)
ax1.set_xlim(-1, 2)
# puts a title on our graph
plt.title("Distribution of Survival, (1 = Survived)")

plt.subplot2grid((2,3),(0,1))
plt.scatter(df.Survived, df.Age, alpha=alpha_scatterplot)
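# sets the y axis label
plt.ylabel("Age")
# turns on the major grid lines along the y axis
# (True is passed positionally because the `b` keyword was renamed to `visible` in Matplotlib 3.5)
plt.grid(True, which='major', axis='y')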
plt.title("Survival by Age,  (1 = Survived)")

ax3 = plt.subplot2grid((2,3),(0,2))
df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart)
ax3.set_ylim(-1, len(df.Pclass.value_counts()))
plt.title("Class Distribution")

plt.subplot2grid((2,3),(1,0), colspan=2)
# plots a kernel density estimate of the passengers' ages, one curve per class
df.Age[df.Pclass == 1].plot(kind='kde')
df.Age[df.Pclass == 2].plot(kind='kde')
df.Age[df.Pclass == 3].plot(kind='kde')
# sets the x axis label
plt.xlabel("Age")
plt.title("Age Distribution within classes")
# sets our legend for our graph.
plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')

ax5 = plt.subplot2grid((2,3),(1,2))
df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
ax5.set_xlim(-1, len(df.Embarked.value_counts()))
# puts a title on our graph
plt.title("Passengers per boarding location");
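The cell above relies on pyplot's implicit "current axes". As an alternative, the same kind of layout can be written against explicit `Axes` objects, which some readers find easier to follow. The sketch below is an assumption-laden illustration rather than the notebook's method: it reproduces only two of the five panels, and `toy` is a made-up frame whose values are not taken from `train.csv`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up frame reusing two column names from train.csv (illustrative values only).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 2, 1, 3],
})

# One row, two columns of axes; each plot call is told which axes to draw on.
fig, (ax_survived, ax_class) = plt.subplots(1, 2, figsize=(10, 4))

toy["Survived"].value_counts().sort_index().plot(kind="bar", ax=ax_survived, alpha=0.55)
ax_survived.set_title("Distribution of Survival, (1 = Survived)")

toy["Pclass"].value_counts().sort_index().plot(kind="barh", ax=ax_class, alpha=0.55)
ax_class.set_title("Class Distribution")

# Adjusts spacing so titles and tick labels do not overlap.
fig.tight_layout()
```

Passing `ax=` to each pandas plot call removes the dependence on the order in which subplots were created, so panels can be rearranged without touching the plotting lines.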