# Predict Survival from the Titanic Disaster

## August 2017, by Jude Moon
Python3


# Project Overview

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this project, I will analyze what sorts of people were likely to survive. In particular, I will apply the tools of machine learning to predict which passengers survived the tragedy.

This document records my notes as I work through the project, showing my thought process and my approach to the problem.

***

# Part 1. Data Exploration


In [1]:
%pylab inline
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import os
import re
import sys
import pprint
import operator
import scipy.stats
from time import time

Populating the interactive namespace from numpy and matplotlib


In [4]:
# load data set
titanic_df = pd.read_csv("train.csv")

In [5]:
# the first 5 rows
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [10]:
# data type of each column
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [8]:
# count the missing values (NaN) in each column
titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
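
The counts above are easier to compare as shares of the 891 rows. The following is a minimal sketch that re-enters the three non-zero counts by hand so that it runs without `train.csv`; the names `missing_counts`, `n_rows` and `missing_pct` are illustrative only. On the real frame, `titanic_df.isnull().mean()` should give the same fractions directly.

```python
import pandas as pd

# Non-zero missing-value counts copied from the output above
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # number of rows in the training set

# Percentage of rows with a missing value, rounded to one decimal
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```

Roughly 20% of the Age values and 77% of the Cabin values are missing, while Embarked is missing in only 2 rows.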

In [6]:
# statistics of central tendency and variability
titanic_df.describe()



| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | NaN | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | NaN | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | NaN | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


From these summaries I got a general picture of the passengers:
- the training data set contains 891 passengers
- the survival rate is about 38%
- Pclass is stored as an integer, but it is really a categorical variable
- the median Pclass is 3, so at least half of the passengers travelled in 3rd class
- the mean age is 29.7, computed from 714 passengers; 177 ages are missing
- SibSp and Parch are a little tricky because both are mostly zeros (the median of each is 0)
- the mean fare is about 32 (the currency unit is not given), well above the median of about 14.5
- Cabin has many missing values (687 of 891)
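
The point about Pclass being a category rather than a number can be sketched as follows. This is only an illustration on a toy frame (`toy_df` and its values are made up, not rows from `train.csv`), assuming an ordered categorical is an acceptable representation of passenger class.

```python
import pandas as pd

# Toy stand-in for the Pclass column (illustrative values, not real passengers)
toy_df = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3, 2, 3]})

# Store the class as an ordered categorical instead of an integer
toy_df["Pclass"] = pd.Categorical(toy_df["Pclass"], categories=[1, 2, 3], ordered=True)

# Share of passengers in each class, listed in class order
class_share = toy_df["Pclass"].value_counts(normalize=True).sort_index()
print(class_share)
```

With a categorical column, summaries such as `value_counts` describe class membership, whereas a mean of 2.31 for an integer Pclass has no direct interpretation.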

## Investigate Missing Values

### Would NaN introduce bias to the features?

In [51]:
# survival rate for each individual Cabin value
titanic_df.groupby('Cabin')['Survived'].mean()

Cabin
A10      0.000000
A14      0.000000
A16      1.000000
A19      0.000000
A20      1.000000
A23      1.000000
A24      0.000000
A26      1.000000
A31      1.000000
A32      0.000000
A34      1.000000
A36      0.000000
A5       0.000000
A6       1.000000
A7       0.000000
B101     1.000000
B102     0.000000
B18      1.000000
B19      0.000000
B20      1.000000
B22      0.500000
B28      1.000000
B3       1.000000
B30      0.000000
B35      1.000000
B37      0.000000
B38      0.000000
B39      1.000000
B4       1.000000
B41      1.000000
           ...   
E12      1.000000
E121     1.000000
E17      1.000000
E24      1.000000
E25      1.000000
E31      0.000000
E33      1.000000
E34      1.000000
E36      1.000000
E38      0.000000
E40      1.000000
E44      0.500000
E46      0.000000
E49      1.000000
E50      1.000000
E58      0.000000
E63      0.000000
E67      0.500000
E68      1.000000
E77      0.000000
E8       1.000000
F E69    1.000000
F G63    0.000000
F G73    0.000000
F2  

In [83]:
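# first attempt: take the first character of Cabin as the deck letter
# (this fails on missing values, see the error below)
def sort_cabin(cabin):
    if cabin != 'NaN':
        return cabin[0]

titanic_df['cabin'] = titanic_df['Cabin'].apply(sort_cabin)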

TypeError: 'float' object is not subscriptable
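
The `TypeError` most likely arises because `read_csv` stores a missing Cabin as the float `NaN`, not as the string `'NaN'`, so the comparison `cabin != 'NaN'` is true for missing values and `cabin[0]` is then applied to a float. A minimal sketch of one possible fix is below; the toy `cabins` series and the helper name `cabin_deck` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cabin column; np.nan marks a missing cabin
cabins = pd.Series(["C85", np.nan, "C123", "F G73", np.nan])

def cabin_deck(cabin):
    """Return the first character of the cabin, or NaN when it is missing."""
    if pd.isnull(cabin):
        return np.nan
    return cabin[0]

deck = cabins.apply(cabin_deck)

# Vectorised alternative: the .str accessor leaves NaN entries as NaN
deck_vectorised = cabins.str[0]
print(deck)
```

On the real frame the same idea would read `titanic_df['Cabin'].str[0]`, which avoids the explicit missing-value check altogether.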

In [82]:
Cabins = ['A10', 'B22', 'C24']

for cabin in Cabins:
    print(cabin[0])

A
B
C


In [75]:
'cabin'[0:1]

'c'

In [46]:
# survival rate grouped by whether Cabin is missing; True means the value is missing
titanic_df.groupby(titanic_df['Cabin'].isnull())['Survived'].mean()

Cabin
False    0.666667
True     0.299854
Name: Survived, dtype: float64

In [44]:
# survival rate grouped by whether Age is missing; True means the value is missing
titanic_df.groupby(titanic_df['Age'].isnull())['Survived'].mean()

Age
False    0.406162
True     0.293785
Name: Survived, dtype: float64
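
The two checks above follow the same pattern, so they can be wrapped in one helper that reports the survival rate for present versus missing values of several columns at once. This is a sketch on made-up toy data; the function name `survival_by_missingness` and the frame `toy_df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with illustrative values only, not taken from train.csv
toy_df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Age": [22.0, np.nan, 38.0, np.nan, 35.0, 4.0],
    "Cabin": ["C85", np.nan, "C123", np.nan, np.nan, "G6"],
})

def survival_by_missingness(df, columns, target="Survived"):
    """Mean of `target` for rows where each column is present vs. missing."""
    rates = {}
    for col in columns:
        is_missing = df[col].isnull().rename("missing")
        rates[col] = df.groupby(is_missing)[target].mean()
    # Rows: False (present) and True (missing); columns: the features checked
    return pd.DataFrame(rates)

rates = survival_by_missingness(toy_df, ["Age", "Cabin"])
print(rates)
```

A large gap between the two rows for a column suggests that missingness in that column is related to the target.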

In [50]:
# replace NaN with the median of Age and create new column called age
titanic_df['age'] = titanic_df['Age'].fillna(titanic_df["Age"].median())

titanic_df.describe()



| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | age |
|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 29.361582 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 | 13.019697 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.42 |
| 25% | 223.5 | 0.0 | 2.0 | NaN | 0.0 | 0.0 | 7.9104 | 22.0 |
| 50% | 446.0 | 0.0 | 3.0 | NaN | 0.0 | 0.0 | 14.4542 | 28.0 |
| 75% | 668.5 | 1.0 | 3.0 | NaN | 1.0 | 0.0 | 31.0 | 35.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 80.0 |
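
Median imputation leaves the median at 28 but narrows the spread: in the summary above the standard deviation falls from 14.53 for `Age` to 13.02 for `age`, because 177 identical values now sit at the centre of the distribution. One possible way to keep track of which rows were imputed is an indicator column. The sketch below uses toy ages and a hypothetical `age_was_missing` name, neither of which comes from the notebook.

```python
import numpy as np
import pandas as pd

# Toy ages (illustrative values); NaN marks a missing age
ages = pd.Series([2.0, 22.0, np.nan, 28.0, np.nan, 35.0, 60.0], name="Age")

# Median imputation, mirroring the fillna call in the cell above
age_filled = ages.fillna(ages.median())

# Hypothetical indicator: 1 where the age was imputed, 0 where it was observed
age_was_missing = ages.isnull().astype(int)

# std() skips NaN, so the first number describes only the observed ages
print(ages.std(), age_filled.std())
```

The filled series has a smaller standard deviation than the observed ages, which is the same effect seen in the `age` column.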


Cabin has a large number of missing (NaN) values: 687 of 891. As the survival rates above show, missingness in 'Cabin' is not neutral: passengers with a missing 'Cabin' survived at a rate of about 30%, compared with about 67% for passengers with a recorded cabin. If a supervised classification algorithm used 'Cabin' as a feature, it could therefore interpret NaN in 'Cabin' as a clue that a passenger did not survive. So, 