# Predicting Survival in the Titanic Disaster

## August-September 2017, by Jude Moon
Python3


# Project Overview

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. 

In this project, I will analyze what sorts of people were likely to survive. In particular, I will apply the tools of machine learning to predict which passengers survived the tragedy.

This document records my notes as I work through the project and shows my thought process and approach to the problem. It consists of:

Part1. Data Exploration
- Missing Value (NaN) Investigation
- Outliers Investigation
- Summary of Data Exploration

Part2. Feature Engineering
- Creating New Features


***

# Part1. Data Exploration


In [1]:
%pylab inline
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import os
import re
import sys
import pprint
import operator
import scipy.stats
from time import time

Populating the interactive namespace from numpy and matplotlib


In [2]:
# load data set
titanic_df = pd.read_csv("train.csv")
#titanic_df = pd.read_csv("../input/train.csv")

In [3]:
# the first 5 rows
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# data type of each column
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [5]:
# check any numpy NaN by column
titanic_df.isnull().sum(axis=0) # sum by column

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
# statistics of central tendency and variability
titanic_df.describe()



Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,,0.0,0.0,7.9104
50%,446.0,0.0,3.0,,0.0,0.0,14.4542
75%,668.5,1.0,3.0,,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


From this summary I learned the following about the passengers:
- the training data set contains 891 passengers
- about 38% of them survived
- Pclass is stored as an integer, but it is really a categorical variable
- the median Pclass is 3, so at least half of the passengers travelled in 3rd class
- the average age is 29.7, with 177 ages missing
- SibSp and Parch are a bit tricky because they contain many zeros
- the mean fare is 32 units
- Cabin has many missing values (687 of 891)
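
The point about Pclass being a category rather than a quantity can be sketched as follows. This is only a minimal illustration: `toy_df`, `pclass_cat`, and `class_share` are hypothetical names, and the rows are made up rather than taken from train.csv.

```python
import pandas as pd

# Toy stand-in for titanic_df; values are illustrative, not from train.csv
toy_df = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3], "Survived": [0, 1, 1, 0, 0]})

# Store the class as an ordered categorical so that 1 < 2 < 3 is kept as a
# ranking instead of a number that can be averaged
toy_df["pclass_cat"] = pd.Categorical(
    toy_df["Pclass"], categories=[1, 2, 3], ordered=True
)

# Share of passengers in each class, sorted by class rather than by frequency
class_share = toy_df["pclass_cat"].value_counts(normalize=True).sort_index()
print(class_share)
```

With the real data, the same `value_counts` call would show directly how large the 3rd class share is, instead of inferring it from the median.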

## Missing Value (NaN) Investigation

### Would NaN introduce bias to 'Cabin'?

The 'Cabin' column has 687 missing values, which is 77% of the total. This might introduce bias, so I would like to compare the average survival rate of the group with a missing 'Cabin' value to that of the group with a 'Cabin' value.

In [7]:
# survival rate by group with missing value on Cabin; True means missing value
titanic_df.groupby(titanic_df['Cabin'].isnull())['Survived'].mean()

Cabin
False    0.666667
True     0.299854
Name: Survived, dtype: float64

The survival rate of the group with a missing 'Cabin' value (True) is below the overall average (0.38), while the rate of the group with a value (False) is above it. In other words, missingness in 'Cabin' is itself associated with survival. If a supervised classification algorithm used 'Cabin' as a feature, it could treat "NaN" as a clue that a passenger did not survive. So, I have to use 'Cabin' carefully as a feature for supervised classification algorithms. I am not going to worry about filling NaN in 'Cabin' for now: the column holds text rather than numbers, so I can simply convert NaN to the string 'NaN' if I need to.
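
A minimal sketch of the two ways to make the missingness explicit, assuming a toy frame in place of `titanic_df`. The column names `cabin_filled` and `cabin_missing` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; values are illustrative, not from train.csv
toy_df = pd.DataFrame({
    "Cabin": ["C85", np.nan, "E46", np.nan],
    "Survived": [1, 0, 1, 0],
})

# Option 1: keep the column as text and turn NaN into an explicit label
toy_df["cabin_filled"] = toy_df["Cabin"].fillna("NaN")

# Option 2: keep a separate boolean flag that records the missingness itself
toy_df["cabin_missing"] = toy_df["Cabin"].isnull()

# Survival rate per group, computed on the 'Survived' column only
rate_by_missing = toy_df.groupby("cabin_missing")["Survived"].mean()
print(rate_by_missing)
```

Either form lets a classifier see the missingness, which is exactly why the feature needs care: the flag may stand in for something else, such as passenger class.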

### What about missing value on column 'Age'? How to deal with missing values?

In [8]:
# survival rate by group with missing value on Age; True means missing value
titanic_df.groupby(titanic_df['Age'].isnull())['Survived'].mean()

Age
False    0.406162
True     0.293785
Name: Survived, dtype: float64

About 20% of the 'Age' data is missing. There is a possibility of NaN-driven bias, but it is not as strong as the bias for 'Cabin'. I chose to fill NaN with the median of the sample.

In [9]:
# replace NaN with the median of Age and create new column called age
titanic_df['age'] = titanic_df['Age'].fillna(titanic_df["Age"].median())

titanic_df['age'].describe()

count    891.000000
mean      29.361582
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: age, dtype: float64
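
Median filling pulls the filled rows to the centre of the distribution, which is why the standard deviation drops from 14.5 for 'Age' to 13.0 for 'age'. One possible way to keep the information that a value was imputed is a separate flag. The sketch below assumes a toy frame, and `age_missing` is a hypothetical column name, not one used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; values are illustrative, not from train.csv
toy_df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

# Remember which rows were missing before filling them
toy_df["age_missing"] = toy_df["Age"].isnull()

# Fill NaN with the median of the observed ages (26.0 in this toy frame)
toy_df["age"] = toy_df["Age"].fillna(toy_df["Age"].median())

# The filled column is less spread out than the observed values
print(toy_df["Age"].std(), toy_df["age"].std())
```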

### What about missing value on column 'Embarked'? How to deal with missing values?
Only two observations are missing for 'Embarked'. I could ignore them or replace them with the most frequent port.

In [10]:
titanic_df.groupby(titanic_df['Embarked']).count()['PassengerId']

Embarked
C    168
Q     77
S    644
Name: PassengerId, dtype: int64

In [11]:
# replace NaN with the dominant port (S, 644 passengers) and create new column called embarked
titanic_df['embarked'] = titanic_df['Embarked'].fillna('S')

titanic_df['embarked'].isnull().sum()

0
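
The most frequent port can also be looked up programmatically instead of being typed by hand. A minimal sketch, assuming a toy frame in place of `titanic_df`; `most_common_port` is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; values are illustrative, not from train.csv
toy_df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", "S"]})

# mode() ignores NaN and returns a Series; its first entry is the top value
most_common_port = toy_df["Embarked"].mode()[0]

# Fill the missing ports with the most frequent one
toy_df["embarked"] = toy_df["Embarked"].fillna(most_common_port)
print(most_common_port, toy_df["embarked"].isnull().sum())
```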

## Outliers Investigation

### Is there an observation with a lot of NaN?

In [12]:
# check any numpy NaN by row
#titanic_df.isnull().sum(axis=1) # sum by row
titanic_df.isnull().sum(axis=1).max() # find the max

2

No, no observation has more than two missing values. So, we can keep all the observations.

### Are there any outliers in the dataset?

In [13]:
# I defined outliers as being above the 99th percentile here
# get lists of people above the 99th percentile for each feature
highest = {}
for column in titanic_df.columns:
    if titanic_df[column].dtypes != "object": # exclude string data typed columns
        highest[column]=[]
        q = titanic_df[column].quantile(0.99)
        highest[column] = titanic_df[titanic_df[column] > q].index.tolist()
    
pprint.pprint(highest)

{'Age': [],
 'Fare': [27, 88, 258, 311, 341, 438, 679, 737, 742],
 'Parch': [13, 25, 610, 638, 678, 885],
 'PassengerId': [882, 883, 884, 885, 886, 887, 888, 889, 890],
 'Pclass': [],
 'SibSp': [159, 180, 201, 324, 792, 846, 863],
 'Survived': [],
 'age': [33, 96, 116, 493, 630, 672, 745, 851]}




In [14]:
# delete 'PassengerId' from dictionary highest
highest.pop('PassengerId', 0)

[882, 883, 884, 885, 886, 887, 888, 889, 890]

### Which outliers show up repeatedly across the features?

In [15]:
# summarize the previous dictionary, highest
# create a dictionary of outliers and the frequency of being outlier
highest_count = {}
for feature in highest:
    for person in highest[feature]:
        if person not in highest_count:
            highest_count[person] = 1
        else:
            highest_count[person] += 1
             
highest_count

{13: 1,
 25: 1,
 27: 1,
 33: 1,
 88: 1,
 96: 1,
 116: 1,
 159: 1,
 180: 1,
 201: 1,
 258: 1,
 311: 1,
 324: 1,
 341: 1,
 438: 1,
 493: 1,
 610: 1,
 630: 1,
 638: 1,
 672: 1,
 678: 1,
 679: 1,
 737: 1,
 742: 1,
 745: 1,
 792: 1,
 846: 1,
 851: 1,
 863: 1,
 885: 1}

In [16]:
# This time, I defined outliers as being below the 1st percentile
# get lists of people below the 1st percentile for each feature
lowest = {}
for column in titanic_df.columns:
    if titanic_df[column].dtypes != "object": # exclude string data typed columns
        lowest[column]=[]
        q = titanic_df[column].quantile(0.01)
        lowest[column] = titanic_df[titanic_df[column] < q].index.tolist()

# delete 'PassengerId' from dictionary lowest
lowest.pop('PassengerId', 0)

pprint.pprint(lowest)

{'Age': [],
 'Fare': [],
 'Parch': [],
 'Pclass': [],
 'SibSp': [],
 'Survived': [],
 'age': [78, 305, 469, 644, 755, 803, 831]}




In [17]:
for person in lowest['age']:
    if person not in highest_count:
        highest_count[person] = 1
    else:
        highest_count[person] += 1
 
highest_count

{13: 1,
 25: 1,
 27: 1,
 33: 1,
 78: 1,
 88: 1,
 96: 1,
 116: 1,
 159: 1,
 180: 1,
 201: 1,
 258: 1,
 305: 1,
 311: 1,
 324: 1,
 341: 1,
 438: 1,
 469: 1,
 493: 1,
 610: 1,
 630: 1,
 638: 1,
 644: 1,
 672: 1,
 678: 1,
 679: 1,
 737: 1,
 742: 1,
 745: 1,
 755: 1,
 792: 1,
 803: 1,
 831: 1,
 846: 1,
 851: 1,
 863: 1,
 885: 1}

Overall, no observation shows up as an outlier in more than one feature.

We can focus on age and Fare for continuous values and on Parch and SibSp for integer values to investigate the outliers further.
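
The per-feature loops and the counting dictionary above can also be written in vectorized form. This is a sketch under assumptions, not the procedure used in this notebook: `count_extreme_flags` is a hypothetical helper, and the toy rows are made up so that one passenger is extreme in two features.

```python
import pandas as pd

# Toy stand-in for titanic_df; values are illustrative, not from train.csv
toy_df = pd.DataFrame({
    "Fare": [7.25, 8.05, 7.9, 512.33, 13.0, 26.0],
    "age": [22.0, 35.0, 0.42, 80.0, 28.0, 30.0],
    "Name": list("abcdef"),  # text column, skipped below
})

def count_extreme_flags(df, lower=0.01, upper=0.99):
    """Count, per row, the numeric columns whose value lies outside the
    [lower, upper] quantile range of that column."""
    numeric = df.select_dtypes(include="number")
    low = numeric.quantile(lower)    # one threshold per column
    high = numeric.quantile(upper)
    flags = numeric.lt(low, axis="columns") | numeric.gt(high, axis="columns")
    return flags.sum(axis=1)

extreme_counts = count_extreme_flags(toy_df)
print(extreme_counts[extreme_counts > 1])  # rows that are extreme more than once
```

On the real data, `PassengerId` and `Survived` would need to be dropped first, as in the cells above.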

### Take a look at outliers

In [18]:
# fare above 99% percentile
titanic_df.loc[highest['Fare'],['Fare', 'Survived']]

Unnamed: 0,Fare,Survived
27,263.0,0
88,263.0,1
258,512.3292,1
311,262.375,1
341,263.0,1
438,263.0,0
679,512.3292,1
737,512.3292,1
742,262.375,1


The mean fare is 32 units, but some passengers paid 262, 263, or 512 units, which is 8 to 16 times the mean fare. I am going to keep these outliers because such extreme cases might help an algorithm such as a decision tree to classify survival.

In [19]:
# age above 99% percentile
titanic_df.loc[highest['age'],['age', 'Survived']]

Unnamed: 0,age,Survived
33,66.0,0
96,71.0,0
116,70.5,0
493,71.0,0
630,80.0,1
672,70.0,0
745,70.0,0
851,74.0,0


In [20]:
# age below 1% percentile
titanic_df.loc[lowest['age'],['age','Survived']]

Unnamed: 0,age,Survived
78,0.83,1
305,0.92,1
469,0.75,1
644,0.75,1
755,0.67,1
803,0.42,1
831,0.83,1


The average age is 29, but there are outliers who are 66 to 80 years old, which is 2.3 to 2.8 times the average age, and others who are younger than one year old. I am going to keep these outliers for now because the extreme cases of age might be useful.

## Summary of Data Exploration


# Part2. Feature Engineering

## Brainstorming

The target is 'Survived', and the remaining columns are candidate features.
- 'Fare', 'age', 'Sex', and 'embarked' are ready to go without engineering
- 'Name', 'Ticket', and 'Cabin' are text variables and might require variation reduction. For example, use only the last name instead of the full name with title.
- 'SibSp' and 'Parch' have a lot of zeros, where zero truly means absence. A square transformation can be considered. Another option is to create a new feature like 'is_family' by combining 'SibSp', 'Parch', and 'family_name' from 'Name'.

## Creating New Features

### family_name

In [21]:
# a procedure to create a column with family name only
def get_familyname(name):
    full_name = name.split(',')
    return full_name[0]

# apply get_familyname procedure to the column of 'Name'
familyname = titanic_df['Name'].apply(get_familyname)

# add familyname to the DataFrame as new column
titanic_df['family_name'] = familyname.values

In [22]:
# survival rate by group with same family name
titanic_df.groupby(titanic_df['family_name'])['Survived'].mean().shape # how many unique family names?

(667,)
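
With 667 unique names among 891 passengers, most names occur only once. A possible way to see how many passengers share each name is sketched below, using pandas string methods instead of `apply`. The frame is a toy stand-in, and `name_group_size` is a hypothetical column name that is not used later in this notebook.

```python
import pandas as pd

# Toy stand-in for titanic_df with a few illustrative names
toy_df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Futrelle, Mr. Jacques Heath",
]})

# Family name is the part before the first comma
toy_df["family_name"] = toy_df["Name"].str.split(",").str[0].str.strip()

# Number of passengers sharing each family name, repeated on every row
toy_df["name_group_size"] = (
    toy_df.groupby("family_name")["family_name"].transform("count")
)
print(toy_df[["family_name", "name_group_size"]])
```

Sharing a family name does not guarantee that two passengers are related, so this count is only a rough proxy.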

### ticket_prefix

In [23]:
# understand 'Ticket' values
titanic_df['Ticket'].head(10)

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
5              330877
6               17463
7              349909
8              347742
9              237736
Name: Ticket, dtype: object

In [24]:
# survival rate by group with same 'Ticket'
titanic_df.groupby(titanic_df['Ticket'])['Survived'].mean().shape # how many unique Ticket?

(681,)

The 'Ticket' values are not consistent in format: some mix letters and numbers, some contain symbols (/ or .), and others consist of numbers only. Not all tickets are unique, either.

In [25]:
if ' ' in 'A/5 21171':
    print(True)
else:
    print(False)
    
if ' ' in '113803':
    print(True)
else:
    print(False)

True
False


In [26]:
# a procedure to create a column with ticket prefix only
def get_prefix(ticket):
    if ' ' in ticket:
        prefix = ticket.split(' ')
        return prefix[0]
    else:
        return 'None'

# apply get_prefix procedure to the column of 'Ticket'
ticketprefix = titanic_df['Ticket'].apply(get_prefix)

# add ticketprefix to the DataFrame as new column
titanic_df['ticket_prefix'] = ticketprefix.values

In [27]:
# count of ticket prefix; False means ticket_prefix == 'None'
titanic_df.groupby(titanic_df['ticket_prefix'] != 'None').count()['PassengerId']

ticket_prefix
False    665
True     226
Name: PassengerId, dtype: int64

In [28]:
# survival rate by group with ticket prefix; False means ticket_prefix == 'None'
titanic_df.groupby(titanic_df['ticket_prefix'] != 'None')['Survived'].mean()

ticket_prefix
False    0.383459
True     0.384956
Name: Survived, dtype: float64

I found no difference in survival rate in the group with vs. without ticket prefix, and both rates are similar to the average of the total.

In [29]:
# frequency of ticket_prefix
titanic_df.groupby(titanic_df['ticket_prefix']).count()['PassengerId']

ticket_prefix
A./5.           2
A.5.            2
A/4             3
A/4.            3
A/5            10
A/5.            7
A/S             1
A4.             1
C               5
C.A.           27
C.A./SOTON      1
CA              6
CA.             8
F.C.            1
F.C.C.          5
Fa              1
None          665
P/PP            2
PC             60
PP              3
S.C./A.4.       1
S.C./PARIS      2
S.O./P.P.       3
S.O.C.          5
S.O.P.          1
S.P.            1
S.W./PP         1
SC              1
SC/AH           3
SC/PARIS        5
SC/Paris        4
SCO/W           1
SO/C            1
SOTON/O.Q.      8
SOTON/O2        2
SOTON/OQ        7
STON/O         12
STON/O2.        6
SW/PP           1
W./C.           9
W.E.P.          1
W/C             1
WE/P            2
Name: PassengerId, dtype: int64

I found inconsistencies in the formatting of the prefixes. For example, A./5., A.5., A/5, and A/5. could be the same prefix, and A/S could be a typo for A/5. I am not sure whether making the formatting consistent would help to classify survival better, or whether the differences in formatting actually carry information.
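
If the formatting were to be made consistent, one possible rule is to drop the '.' and '/' symbols and upper-case the rest. The sketch below is an assumption about how that could look, not something this notebook applies: `normalize_prefix` is a hypothetical helper, and it deliberately leaves possible typos such as A/S untouched.

```python
import re

def normalize_prefix(prefix):
    """Hypothetical helper: drop '.' and '/' and upper-case the prefix.
    The placeholder 'None' (ticket without prefix) is returned unchanged."""
    if prefix == "None":
        return prefix
    return re.sub(r"[./]", "", prefix).upper()

# A few of the prefix variants listed in the frequency table above
variants = ["A./5.", "A.5.", "A/5", "A/5.", "SC/PARIS", "SC/Paris", "None"]
normalized = [normalize_prefix(p) for p in variants]
print(normalized)
```

Keeping both the raw and the normalized prefix as separate columns would make it possible to compare which one is more useful for classification.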

### cabin_initial

In [30]:
# understand 'Cabin' values
titanic_df['Cabin'].head(10)

0     NaN
1     C85
2     NaN
3    C123
4     NaN
5     NaN
6     E46
7     NaN
8     NaN
9     NaN
Name: Cabin, dtype: object

The 'Cabin' value consists of a capital letter followed by numbers. I am not sure what the letter and numbers stand for, but my intuition is that the letter could represent a room location or a room price, so the letter alone can be used as a feature.

In [31]:
# how many unique Cabin?
titanic_df.groupby(titanic_df['Cabin'])['Survived'].mean().shape

(147,)

Out of 204 known 'Cabin' values, 147 are unique, so some people share the same 'Cabin' value.

In [32]:
def get_initial_letter(cabin):
    cabin = str(cabin) # change data type to string because nan is float
    return cabin[0]

titanic_df['cabin_initial'] = titanic_df['Cabin'].apply(get_initial_letter)
titanic_df['cabin_initial'].head(10)

0    n
1    C
2    n
3    C
4    n
5    n
6    E
7    n
8    n
9    n
Name: cabin_initial, dtype: object

In [33]:
# survival rate by cabin initial; 'n' comes from 'nan' and means missing Cabin
titanic_df.groupby(titanic_df['cabin_initial'])['Survived'].mean()

cabin_initial
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
T    0.000000
n    0.299854
Name: Survived, dtype: float64

In [34]:
# how many unique 'cabin_initial'?
titanic_df.groupby(titanic_df['cabin_initial'])['Survived'].mean().shape

(9,)
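
The label 'n' for a missing cabin is a side effect of converting NaN to the string 'nan'. A sketch of an alternative that gives the missing group an explicit label is shown below. It assumes a toy frame, and the label 'Unknown' is my own choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; values follow the pattern of the first rows
toy_df = pd.DataFrame({"Cabin": [np.nan, "C85", "C123", "E46"]})

# .str[0] takes the first character and leaves NaN as NaN,
# so the missing group can be given an explicit label afterwards
toy_df["cabin_initial"] = toy_df["Cabin"].str[0].fillna("Unknown")
print(toy_df["cabin_initial"].tolist())
```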

### w_family

In [35]:
# SibSp and Parch are combined as family by vectorized addition
sibsp = titanic_df['SibSp']
parch = titanic_df['Parch']

family = sibsp + parch

# change datatype to categories with 2 groups:
# True if the passenger travels with at least one family member
w_family = family != 0

# add w_family to the DataFrame as new column
titanic_df['w_family'] = w_family.values

In [36]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,age,embarked,family_name,ticket_prefix,cabin_initial,w_family
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,S,Braund,A/5,n,True
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,C,Cumings,PC,C,True
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,S,Heikkinen,STON/O2.,n,False
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,S,Futrelle,,C,True
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0,S,Allen,,n,False
