# Titanic competition

This notebook builds a predictive model to answer the question "What sorts of people were more likely to survive the Titanic?" using passenger data, as defined by [Kaggle's Titanic competition](https://www.kaggle.com/c/titanic).

In [31]:
# Load dependencies.
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Load train and test data.
test_data = pd.read_csv('./data/test.csv', sep=',')
train_data = pd.read_csv('./data/train.csv', sep=',')

In [32]:
# Take a quick look at the data.
train_data.head(8)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


## Women and children first

In [33]:
# Calculate the survival rates of men, women and children (children: age 8 or younger).
children = train_data.loc[train_data.Age <= 8]
men = train_data.loc[(train_data.Sex == 'male') & (train_data.Age > 8)]
women = train_data.loc[(train_data.Sex == 'female') & (train_data.Age > 8)]

rate_children = sum(children['Survived'])/len(children)
rate_men = sum(men['Survived'])/len(men)
rate_women = sum(women['Survived'])/len(women)

print("Men survival rate:", rate_men)
print("Women survival rate:", rate_women)
print("Children survival rate:", rate_children)

Men survival rate: 0.17882352941176471
Women survival rate: 0.7574468085106383
Children survival rate: 0.6666666666666666
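Rows with a missing `Age` (such as passenger 6 in the preview above) fail both `Age <= 8` and `Age > 8`, so the cell above leaves them out of all three groups. The following is a minimal sketch of the same split done with a single `groupby`, which also makes that leftover group visible. It runs on a made-up toy frame, not the Kaggle data, and the helper `passenger_group` and the label `unknown_age` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (made-up rows, NOT the Kaggle data).
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Age": [30.0, 25.0, 5.0, np.nan, np.nan, 40.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

def passenger_group(df):
    """Label each row as child, man, woman or unknown_age (hypothetical helper)."""
    conditions = [
        df["Age"] <= 8,
        (df["Sex"] == "male") & (df["Age"] > 8),
        (df["Sex"] == "female") & (df["Age"] > 8),
    ]
    # Comparisons against NaN are False, so rows without an age hit the default.
    labels = np.select(conditions, ["child", "man", "woman"], default="unknown_age")
    return pd.Series(labels, index=df.index, name="group")

groups = passenger_group(toy)

# The mean of a 0/1 column is the survival rate of each group.
rates = toy.groupby(groups)["Survived"].mean()
print(rates)
```

On the real data, `passenger_group(train_data)` could be used the same way. Whether passengers of unknown age should be imputed or reported separately is a modelling choice this sketch does not settle.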


In [46]:
# Calculate the accuracy of a prediction assuming all men died and all women and children survived.
pred_survived = pd.Series(0, index=train_data.index)
pred_survived[
  train_data[(train_data.Sex == 'female') | (train_data.Age <= 8)].index
] = 1

true_survived = train_data['Survived']

accuracy_score(y_true = true_survived, y_pred = pred_survived)

0.7934904601571269
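The rule used above can also be written as a small function, so the same logic can be applied to any frame with `Sex` and `Age` columns. This is only a sketch on made-up toy rows. The function name `women_and_children_rule` and its `child_age` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy stand-in for train_data (made-up rows, NOT the Kaggle data).
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Age": [30.0, 25.0, 5.0, np.nan, np.nan],
    "Survived": [0, 1, 1, 0, 0],
})

def women_and_children_rule(df, child_age=8):
    """Predict 1 (survived) for females and for anyone aged child_age or younger."""
    survives = (df["Sex"] == "female") | (df["Age"] <= child_age)
    # A missing age compares as False, so males of unknown age are predicted 0.
    return survives.astype(int)

pred = women_and_children_rule(toy)
acc = accuracy_score(y_true=toy["Survived"], y_pred=pred)
print(acc)
```

Keeping the rule in one place avoids repeating the boolean mask when the threshold or the conditions change.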

## Wealthy people first

In [57]:
# Calculate survival rate according to classes.
rates = np.zeros(3)
for i in range(3):
    subdata = train_data.loc[train_data.Pclass == (i+1)]
    rates[i] = sum(subdata['Survived'])/len(subdata)

print("First class survival rate:", rates[0])
print("Second class survival rate:", rates[1])
print("Third class survival rate:", rates[2])

First class survival rate: 0.6296296296296297
Second class survival rate: 0.47282608695652173
Third class survival rate: 0.24236252545824846
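The per-class loop above can be expressed as a `groupby`, and grouping by two keys gives a class-by-sex table that helps when combining both signals in one rule. A minimal sketch on made-up toy rows (not the Kaggle data):

```python
import pandas as pd

# Toy stand-in for train_data (made-up rows, NOT the Kaggle data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "Sex": ["male", "female", "male", "female", "male", "female", "male", "male"],
    "Survived": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Survival rate per passenger class (equivalent to the loop over classes 1 to 3).
by_class = toy.groupby("Pclass")["Survived"].mean()

# Survival rate per class and sex, with one column per sex.
by_class_and_sex = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")

print(by_class)
print(by_class_and_sex)
```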


In [67]:
# Calculate the accuracy of a prediction assuming that only men (older than 8) from the second and third classes died.
pred_survived = pd.Series(0, index=train_data.index)
pred_survived[
  train_data[(train_data.Sex == 'female') | (train_data.Age <= 8) | (train_data.Pclass == 1)].index
] = 1

true_survived = train_data['Survived']

accuracy_score(y_true = true_survived, y_pred = pred_survived)

0.755331088664422

## Decision tree

In [83]:
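# Remove unused columns and fill missing ages with the column mean.
tree_train_data = train_data.drop(['Cabin', 'Fare', 'Name', 'Ticket'], axis=1)
# numeric_only=True skips the text columns (Sex, Embarked), which have no mean.
tree_train_data = tree_train_data.fillna(tree_train_data.mean(numeric_only=True))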

# Separate output from input columns.
X = tree_train_data.drop('Survived', axis=1)
y = tree_train_data['Survived'].copy()

# Convert categorical values into indicator variables.
X = pd.get_dummies(X)

# Separate data into training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

# Create and train a classifier using a decision tree.
titanic_tree_classifier = DecisionTreeClassifier(max_leaf_nodes=7, random_state=0)
titanic_tree_classifier.fit(X_train, y_train)

# Make predictions.
y_pred_dt = titanic_tree_classifier.predict(X_test)

# Check the accuracy of the decision tree predictions.
accuracy_score(y_true = y_test, y_pred = y_pred_dt)

0.8033898305084746
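The accuracy above comes from a single train/test split, so it depends on the chosen `random_state`. One common way to get a less split-dependent estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` rather than the Titanic features, and `cv=5` is an arbitrary choice made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (NOT the Titanic data).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# Same tree settings as the notebook cell above.
clf = DecisionTreeClassifier(max_leaf_nodes=7, random_state=0)

# One accuracy value per fold; the spread shows how much the score varies by split.
scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

With the notebook's own data, `X` and `y` from the cell above would take the place of `X_toy` and `y_toy`.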

## Random forest

In [84]:
# Create and train a classifier using a random forest.
titanic_forest_classifier = RandomForestClassifier(n_estimators=3, max_leaf_nodes=4, random_state=324)  
titanic_forest_classifier.fit(X_train, y_train)

# Make predictions (again).
y_pred_rf = titanic_forest_classifier.predict(X_test)

# Check the accuracy of the random forest predictions.
accuracy_score(y_true = y_test, y_pred = y_pred_rf)

0.7898305084745763

## Results

In [96]:
# Make the predictions using the decision tree.
tree_test_data = test_data.drop(['Cabin', 'Fare', 'Name', 'Ticket'], axis=1)
# Fill missing test values with the training means (numeric columns only).
tree_test_data = tree_test_data.fillna(tree_train_data.mean(numeric_only=True))
tree_test_data = pd.get_dummies(tree_test_data)

tree_test_predictions = titanic_tree_classifier.predict(tree_test_data)

# Generate the submission file (to be uploaded to Kaggle).
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': tree_test_predictions})
output.to_csv('my_submission.csv', index=False)
print("The submission was successfully saved!")

The submission was successfully saved!
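Calling `pd.get_dummies` separately on the training and test frames only yields matching columns if both contain the same categories. A hedged sketch of one way to guard against a mismatch is to reindex the test columns to the training columns. The toy frames below are made up for illustration and are not the Kaggle data.

```python
import pandas as pd

# Toy frames (made-up rows): the test frame lacks the "C" and "Q" categories.
train_toy = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "S"]})

X_train_toy = pd.get_dummies(train_toy)
X_test_toy = pd.get_dummies(test_toy)

# Reorder the test columns like the training ones and add missing indicators as 0.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(X_test_aligned)
```

In the notebook, the same call would be `tree_test_data.reindex(columns=X.columns, fill_value=0)` before calling `predict`.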
