# Titanic: a Machine Learning Case Study

![](titanic_main.png)

by Dr. Kristian Rother

*This tutorial is available under the Creative Commons Attribution-ShareAlike 4.0 License.*

## Goal

We want to use passenger data to predict whether a given passenger will survive a trip on the Titanic.

## Preparations

![](boarding.png)

Import a few Python libraries typically used in Machine Learning:

In [None]:
import pandas as pd  # handling of tabular data
import numpy as np   # number crunching
import matplotlib.pyplot as plt  # plotting

In [None]:
%matplotlib inline

### Load passenger data

Use `pandas` to load the file `titanic.csv`.

In [None]:
df = pd.read_csv('titanic.csv', index_col=0)

You can find detailed documentation of the dataset at [www.kaggle.com/c/titanic](https://www.kaggle.com/c/titanic).


## Part 1: Exploratory Data Analysis

### 1.1 Inspect the data

Show the contents of the pandas DataFrame.

In [None]:
df.head(5)

In [None]:
df['Survived'].value_counts()
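As a small aside, `value_counts()` can also report relative frequencies instead of absolute counts. The sketch below uses a made-up toy column (not the real Titanic data) to show both variants:

```python
import pandas as pd

# Hypothetical toy column standing in for df['Survived']
survived = pd.Series([0, 1, 0, 0, 1, 0, 1, 0], name='Survived')

counts = survived.value_counts()                 # absolute count per value
shares = survived.value_counts(normalize=True)   # fraction of rows per value

print(counts)
print(shares)
```

Relative frequencies are handy for judging at a glance how balanced the two classes are.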

### Challenge
Examine the distribution of values in two other columns of the dataset using the `value_counts()` function.

### 1.2 Draw a histogram
Create a histogram grouping the passengers by age:

In [None]:
df['Age'].hist()

### Challenge
Explain the following line.

In [None]:
df[df['Survived']==1]['Age'].hist()

### 1.3 Scatterplot

In [None]:
df.plot.scatter('Age', 'Fare')

In [None]:
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True))  # skip text columns such as Name (needs pandas 1.5+)

### 1.4 Bar plot

Create a bar plot that counts the passengers in each class, grouped by survival:

In [None]:
g = df.groupby(['Survived', 'Pclass'])
g = g['Name'].count()
g = g.unstack()
g.plot.bar()
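To see what the `groupby` / `count` / `unstack` chain produces before it is plotted, here is a minimal sketch on a hand-made table. The six rows and the variable names `toy`, `counts` and `table` are invented for illustration:

```python
import pandas as pd

# Hypothetical mini passenger list (not real Titanic data)
toy = pd.DataFrame({
    'Name':     ['A', 'B', 'C', 'D', 'E', 'F'],
    'Survived': [0, 0, 1, 1, 1, 0],
    'Pclass':   [3, 3, 1, 2, 1, 1],
})

# Count the passengers in every (Survived, Pclass) combination
counts = toy.groupby(['Survived', 'Pclass'])['Name'].count()

# Move Pclass from the row index to the columns;
# combinations that never occur are filled with 0
table = counts.unstack(fill_value=0)
print(table)
```

Each row of `table` becomes one group of bars and each column one bar within the group when `table.plot.bar()` is called.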

### Challenge
Create another bar plot, this time grouping the bars by gender.

### 1.5 Hypotheses
Collect ideas about which **features** of a passenger increase the chance of survival and which decrease it. Only then start building a model.

## Part 2: Preprocessing Data

### 2.1 Data cleaning
At this point we need to clean and reshape the data a bit.

First, we keep only four columns and remove all rows that contain missing data:

In [None]:
cleaned = df[['Pclass', 'Age', 'Sex', 'Survived']]
cleaned = cleaned.dropna()
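Before dropping rows it can be useful to check how much data would be lost. The following sketch uses a made-up four-row table (the values are assumptions, not taken from `titanic.csv`) to show the effect of `dropna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing age
toy = pd.DataFrame({
    'Pclass':   [1, 3, 2, 3],
    'Age':      [38.0, np.nan, 27.0, 22.0],
    'Sex':      ['female', 'male', 'male', 'female'],
    'Survived': [1, 0, 0, 1],
})

missing = toy.isna().sum()     # number of missing values per column
cleaned_toy = toy.dropna()     # drops every row with at least one missing value

print(missing)
print(len(toy), 'rows before,', len(cleaned_toy), 'rows after')
```

Applied to the real data, the same check shows how many passengers are removed because of a missing entry.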

### 2.2 More features
We will add one more feature to the prediction: gender. Because the `Sex` column contains text, we need to convert it to numbers using **one-hot encoding**.

In [None]:
gender = pd.get_dummies(cleaned['Sex'])
gender

Next, we add the encoded column to the input table. One column is enough, because with two categories the other column is its exact complement.

In [None]:
cleaned['female'] = gender['female']
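The following sketch illustrates on a made-up column why a single dummy column carries all the information when there are only two categories. The `dtype=int` argument is an optional choice made here so that the output shows 0/1 rather than booleans:

```python
import pandas as pd

# Hypothetical toy column standing in for cleaned['Sex']
sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# One column per category, with exactly one 1 in every row
dummies = pd.get_dummies(sex, dtype=int)
print(dummies)

# With two categories, each column is the complement of the other
print((dummies['female'] == 1 - dummies['male']).all())
```

With more than two categories this shortcut no longer works, and all but one of the dummy columns would be needed.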

### 2.3 Reshape the data
* Convert all **input features** to a matrix `X`.
* Convert the **target column** to a 1D array `y`.

In [None]:
X = cleaned[['Pclass', 'Age']].values
y = cleaned['Survived'].values

In [None]:
X.shape, y.shape

## Part 3: Modeling

### 3.1 Create a Training/Test set

Split the data into a training and a test set:

In [None]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=42)

In [None]:
Xtrain.shape, Xtest.shape
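The sketch below shows on synthetic arrays what `train_test_split` does by default and what the optional `stratify` argument changes. The arrays `X_toy` and `y_toy` are invented for this illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, 60 negatives and 40 positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# By default, 25 % of the samples go into the test set
Xa, Xb, ya, yb = train_test_split(X_toy, y_toy, random_state=42)
print(Xa.shape, Xb.shape)

# stratify keeps the class proportions equal in both parts
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)
print(ys_tr.mean(), ys_te.mean())
```

Fixing `random_state` makes the split reproducible, so repeated runs of the notebook give the same training and test sets.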

### Question
* Why do we need to create a separate test set?

### 3.2 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

m = LogisticRegression()
m.fit(Xtrain, ytrain)

Inspect the coefficients of the logistic regression:

In [None]:
print('Pclass :', m.coef_[0][0])
print('Age    :', m.coef_[0][1])
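A logistic regression coefficient is the change in the log-odds of the positive class per unit increase of a feature, so `np.exp` turns it into an odds ratio that is often easier to read. The sketch below demonstrates this on synthetic data. The generated features and labels are assumptions made for illustration and do not reproduce the Titanic numbers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features: a class from 1 to 3 and an age in years
pclass = rng.integers(1, 4, size=200)
age = rng.uniform(1, 70, size=200)
X_toy = np.column_stack([pclass, age])

# Synthetic labels: a lower class number makes label 1 more likely
y_toy = (rng.uniform(size=200) < 1 / pclass).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X_toy, y_toy)

# exp(coefficient) = factor by which the odds change per unit increase
odds_ratios = np.exp(model.coef_[0])
for name, coef, ratio in zip(['Pclass', 'Age'], model.coef_[0], odds_ratios):
    print(f'{name:7s} coef={coef:+.3f}  odds ratio={ratio:.3f}')
```

An odds ratio below 1 means that the feature lowers the odds of the positive class, a value above 1 means it raises them.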

## Part 4: Evaluate the model

Calculate the accuracy of the model for the training data:

In [None]:
m.score(Xtrain, ytrain)

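With a *skewed dataset* (one class is much more frequent than the other), accuracy alone can be misleading. A confusion matrix is more informative because it shows each type of error separately: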

In [None]:
from sklearn.metrics import confusion_matrix

ypred = m.predict(Xtrain)
confusion_matrix(ytrain, ypred)
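To read the matrix, it helps to know its layout: in scikit-learn the rows are the true classes and the columns are the predicted classes. The sketch below uses six hand-made labels (not model output) to unpack the four cells and derive a few metrics from them:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 1, 1, 0]
y_hat  = [0, 0, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)
print(cm)

# For two classes, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of correct predictions
precision = tp / (tp + fp)        # how many predicted positives are correct
recall = tp / (tp + fn)           # how many actual positives were found
print(accuracy, precision, recall)
```

Precision and recall look only at the positive class, which is why they can reveal problems that a high accuracy hides.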

### Challenge
Calculate the accuracy for the test data as well. Explain the differences.

### Question

Is this a good result? Why or why not?

### Challenge
Re-run the training and evaluation above with the additional feature `female` included in `X`. How does the accuracy change?

## Part 5: Try a Random Forest

Let's try a different model: the Random Forest (an **ensemble of decision trees**).

In [None]:
from sklearn.ensemble import RandomForestClassifier

m = RandomForestClassifier()

### Challenge
Fit the Random Forest model to the training data yourself and evaluate it on the test set.

Compare how the following parameters affect prediction quality:

In [None]:
m1 = RandomForestClassifier(max_depth=2)
m2 = RandomForestClassifier(max_depth=3)
m3 = RandomForestClassifier(max_depth=10)

Limiting the complexity of a model is called **regularization**.
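One way to observe the effect of `max_depth` is to compare training and test accuracy side by side. The sketch below does this on a noisy synthetic dataset generated with `make_classification`. The dataset, its parameters and the names `X_toy`, `y_toy` and `scores` are assumptions for illustration, so the numbers will differ from those on the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical noisy dataset (flip_y randomly reassigns some labels)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, n_informative=3,
    n_redundant=0, flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

scores = {}
for depth in (2, 3, 10):
    forest = RandomForestClassifier(max_depth=depth, random_state=0)
    forest.fit(X_tr, y_tr)
    # store (training accuracy, test accuracy) for each depth
    scores[depth] = (forest.score(X_tr, y_tr), forest.score(X_te, y_te))
    print(depth, scores[depth])
```

A large gap between training and test accuracy suggests that the model has memorized noise in the training data, which a smaller `max_depth` counteracts.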

## Part 6: Prediction

Create a data set for additional passengers and predict whether they will survive:

In [None]:
leo = np.array([[22, 3, 0]])
kate = np.array([[25, 1, 1]])

print(m.predict(leo))
print(m.predict(kate))

### Challenge
There is (at least) one error in the definition of the data for prediction. Can you find and fix it?