# Titanic -- Analyzing Raw Data to Verify Popular Myth

A popular belief about the sinking of the Titanic is that the crew observed the Birkenhead Drill, directing women and children into the limited number of lifeboats first.  Can this be verified in the data?

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
import plotly.plotly as py
import cufflinks as cf

titanic_df = pd.read_csv('titanic_data.csv')

First, an overview of what is present in this data set:

In [2]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
#Basic stats on the overall data
#Age contains NaN values, so some versions of Pandas may show a warning and report NaN for the Age quartiles
print(titanic_df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000         NaN    0.000000   
50%     446.000000    0.000000    3.000000         NaN    0.000000   
75%     668.500000    1.000000    3.000000         NaN    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


In [5]:
#Survived is a binary flag (0/1); map it to a string for display purposes
titanic_df['Survival'] = titanic_df['Survived'].apply(lambda s: 'Survivor' if s == 1 else 'Victim')

#map gender to a number for statistics (2 marks any unexpected value)
def numeric_gender(g):
    if g == 'male':
        return 0
    elif g == 'female':
        return 1
    else:
        return 2

#encode Sex once as a Series; a bare map() is lazy in Python 3 and cannot be passed to numpy directly
sex_numeric = titanic_df['Sex'].apply(numeric_gender)

#calculate Pearson's r for Sex:Survival using corrcoef
r = np.corrcoef(sex_numeric, titanic_df['Survived'])[0,1]

print('r: ' + str(r))

print('scipy calculation of r: ' + str(pearsonr(sex_numeric, titanic_df['Survived'])))

print('r squared: ' + str(r**2))

r: 0.543351380658
scipy calculation of r: (0.54335138065775523, 1.4060661308795969e-69)
r squared: 0.295230722863


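With n = 891 passengers, the correlation test has n - 2 = 889 degrees of freedom. The two-tailed p-value reported by scipy is about 1.4e-69, far below any conventional threshold, so the correlation between gender and survival is extremely statistically significant.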

In [6]:
#Create grouped data to display the number of Survivors/Victims by Sex
titanic_groupby_sex_survived = titanic_df.groupby(['Sex','Survived'], as_index=False)

titanic_survival_gender = titanic_groupby_sex_survived['PassengerId'].count().reset_index()

#define a function that builds a clean display name for each Sex+Survival pair, e.g. 'Female Survivor'
def survived_gender_trans(sex, survived):
    survived_str = ' %s'%('Survivor' if survived == 1 else 'Victim')
    return (sex + survived_str).title()

#Create new column on grouped data for clean names for Sex+Survival pair
titanic_survival_gender['SurvivedGender'] = titanic_survival_gender.apply(lambda row: survived_gender_trans(row['Sex'], row['Survived']), axis=1)

In [7]:
#Display plotly graph of Survival By Gender (Sex:Survival pair)
titanic_survival_gender[['SurvivedGender','PassengerId']].iplot(kind='bar', filename='cufflinks/bar-chart', 
                                                                x='SurvivedGender', title = 'Survival by Gender',
                                                                yTitle = 'Number of Passengers')

The graph above shows that female passengers were far more likely to survive than male passengers.
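The gap in the bar chart can also be expressed as survival rates. The sketch below is a minimal, hedged illustration: the four counts are an assumption, taken from the standard Kaggle training file rather than from any output shown in this notebook, although they are consistent with the 891 passengers, the mean of `Survived` (0.383838), and the r of about 0.543 reported above. For two binary variables, Pearson's r reduces to the phi coefficient of the 2x2 table, which gives a quick cross-check.

```python
import math

# ASSUMPTION: counts from the standard Kaggle train.csv, not printed in this notebook.
# Keys are (Sex, Survived) pairs, matching the groupby used for the bar chart.
counts = {
    ('female', 1): 233,
    ('female', 0): 81,
    ('male', 1): 109,
    ('male', 0): 468,
}

def survival_rate(counts, sex):
    """Fraction of passengers of the given sex who survived."""
    survived = counts[(sex, 1)]
    total = survived + counts[(sex, 0)]
    return survived / float(total)

female_rate = survival_rate(counts, 'female')
male_rate = survival_rate(counts, 'male')

# Phi coefficient of the 2x2 table; equals Pearson's r for two 0/1 variables
a, b = counts[('female', 1)], counts[('female', 0)]
c, d = counts[('male', 1)], counts[('male', 0)]
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print('female survival rate: %.3f' % female_rate)
print('male survival rate:   %.3f' % male_rate)
print('phi coefficient:      %.4f' % phi)
```

On the real DataFrame, `titanic_df.groupby('Sex')['Survived'].mean()` should give the same two rates directly.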

Because the age data is incomplete, we'll limit the data to what contains age and compare that to survival:

In [8]:
#filter data to non-NaN age data
titanic_with_ages_df = titanic_df[titanic_df['Age'].notnull()]

In [9]:
#calculate Pearson's r for Age:Survived correlation
print(np.corrcoef(titanic_with_ages_df['Age'], titanic_with_ages_df['Survived'])[0,1])

print(pearsonr(titanic_with_ages_df['Age'], titanic_with_ages_df['Survived']))

-0.0772210945722
(-0.077221094572177656, 0.039124654013483327)


Pearson's r for age and survival is much closer to zero than the value for gender, so the linear relationship is much weaker.  With 714 passengers of known age, the test has 712 degrees of freedom (n - 2), and the two-tailed p-value is 0.039, which is statistically significant at the 0.05 level but far less decisive than the result for gender.
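The degrees of freedom and test statistics behind both correlations can be reproduced from r and n alone, using the standard result that t = r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 degrees of freedom when the true correlation is zero. The sketch below is illustrative only: the helper name `correlation_t_test` is hypothetical, and the p-value uses a normal approximation from the standard library, which is a rough check for large samples rather than the exact value that `pearsonr` reports.

```python
import math

def correlation_t_test(r, n):
    """Return (t, df, approx_p) for testing a zero correlation given r and sample size n."""
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r ** 2))
    # Two-tailed p-value from the normal approximation to the t distribution.
    # Reasonable near the 0.05 range when df is large; for extreme t it understates the exact tail.
    approx_p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, df, approx_p

# r and n values copied from the notebook output above
sex_t, sex_df, sex_p = correlation_t_test(0.543351380658, 891)
age_t, age_df, age_p = correlation_t_test(-0.0772210945722, 714)

print('sex: t = %.2f on %d degrees of freedom' % (sex_t, sex_df))
print('age: t = %.2f on %d degrees of freedom, approximate p = %.3f' % (age_t, age_df, age_p))
```

The approximate p-value for age lands close to the 0.039 reported by scipy, while the t statistic for sex is so large that any reasonable method puts its p-value at effectively zero.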

In [10]:
#Create grouped data to display the number of Survivors/Victims by Age
titanic_groupby_age_survived = titanic_with_ages_df.groupby(['Age','Survival'])

#unstack the data so it will show in 'two' stacked graphs
titanic_survival_age = titanic_groupby_age_survived['PassengerId'].count().unstack()

In [11]:
titanic_survival_age.iplot(kind='area', title='Age and Survival for Titanic Passengers',
                           xTitle='Age of Passengers', yTitle = 'Number of Passengers', 
                           fill=True, filename='cufflinks/filled-area')

The "Age and Survival for Titanic Passengers" chart appears to show that victims were more likely to be adults than seniors or small children.  This is consistent with the claim that the Birkenhead drill was followed: there is a strong correlation between sex and survival, and a weak but statistically significant correlation between age and survival.

However, before assuming causality, other factors would need to be taken into account to ensure that these two correlations are not the result of a confounding variable that drives both.
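One common way to probe for such a variable is to stratify: compute the survival rate separately within each level of a candidate factor and see whether the gap between the sexes persists. The sketch below assumes passenger class (`Pclass`) as the candidate, which is a choice made for illustration and not something the analysis above tested. To avoid inventing data, it uses only the five rows shown by `titanic_df.head()`, so the printed rates demonstrate the mechanics and say nothing about the full data set.

```python
from collections import defaultdict

# (Pclass, Sex, Survived) for the five preview rows shown by titanic_df.head()
preview_rows = [
    (3, 'male', 0),
    (1, 'female', 1),
    (3, 'female', 1),
    (1, 'female', 1),
    (3, 'male', 0),
]

def stratified_survival(rows):
    """Return {(pclass, sex): survival rate} for an iterable of (pclass, sex, survived) rows."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for pclass, sex, survived in rows:
        totals[(pclass, sex)] += 1
        survivors[(pclass, sex)] += survived
    return dict((key, survivors[key] / float(totals[key])) for key in totals)

rates = stratified_survival(preview_rows)
for key in sorted(rates):
    print('class %d, %s: %.2f' % (key[0], key[1], rates[key]))
```

On the full DataFrame, `titanic_df.groupby(['Pclass', 'Sex'])['Survived'].mean()` is the pandas equivalent; if the female advantage holds within every class, class alone cannot explain the correlation.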

## References

Cufflinks. (n.d.). Retrieved September 13, 2016, from https://plot.ly/ipython-notebooks/cufflinks/

Pandas: How to use apply function to multiple columns. Retrieved September 13, 2016, from http://stackoverflow.com/questions/16353729/pandas-how-to-use-apply-function-to-multiple-columns

Plot different DataFrames in the same figure. Retrieved September 14, 2016, from http://stackoverflow.com/questions/13872533/plot-different-dataframes-in-the-same-figure

Pearson product-moment correlation coefficient. (n.d.). Retrieved September 13, 2016, from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
    
Plotly. (n.d.). Retrieved September 14, 2016, from https://plot.ly/pandas/
    
Titanic: Machine Learning from Disaster. (n.d.). Retrieved September 13, 2016, from https://www.kaggle.com/c/titanic/data
        
pandas.Series.notnull. (n.d.). Retrieved September 13, 2016, from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html
    
Women and children first. (n.d.). Retrieved September 13, 2016, from https://en.wikipedia.org/wiki/Women_and_children_first