## The investigated question

The sinking of the Titanic is probably the most notorious peacetime maritime disaster in history. As such, it is not only the source of many popular legends but also a primary source of popular ideas about the naval code of conduct in such tragedies. For instance, the popular belief that the captain should be the last person to leave the ship likely originates from dramatic portrayals of the Titanic disaster.

Another instance of maritime chivalry dictates that 'women and children (are) first' to board the lifeboats. However, the 1997 Titanic movie tells a different story. According to the movie, the class and status of passengers took precedence over chivalry, and passengers from the lower classes were simply left to drown regardless of their sex and age.

My goal is to investigate whether the data supports either the chivalry-based or the class-based interpretation of the story.

**Does the data provide strong support for one of these claims?**

**Question 1. Did women and children have a higher chance of surviving the Titanic disaster regardless of their class?**

**Question 2. Did first-class passengers have a higher chance of surviving than women and children from the lower classes?**

## Loading and type casting the data

In [6]:
# Read the Kaggle Titanic CSV into a DataFrame
import pandas as pd

from IPython.display import display

passangers_df = pd.read_csv("titanic_data.csv")
display(passangers_df.head())

# Report the sample size and the dtypes pandas inferred for two columns
print("size {}".format(len(passangers_df)))
print("'Name' type {}".format(passangers_df["Name"].dtype))
print("'Survived' type {}".format(passangers_df["Survived"].dtype))

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


size 891
'Name' type object
'Survived' type int64


## Quick Exploration of Data

There are 891 entries in the data set. It appears we can rely on the type inference done by pandas. The 'Survived' column is int64 instead of boolean, but that is not a problem. Other columns which should be strings ('Name', 'Sex') have the generic object dtype, but their entries can be manipulated like strings. For now, further type casting does not seem necessary.

In [7]:
def count_values(df, col, normalize=True, dropna=False):
    # Frequency of each value in df[col], most frequent first; NaN is kept by default
    return df[col].value_counts(normalize=normalize, sort=True, ascending=False, bins=None, dropna=dropna)


def print_value_counts(df, col, normalize=True, max_nr=-1):
    # Print a short summary of df[col]; max_nr == -1 means "show all values"
    print("\n")
    print(col)
    print("total size: {}".format(len(df[col])))
    print("unique values: {}".format(len(df[col].unique())))

    if max_nr == -1:
        max_nr = len(df[col])
    else:
        print("top {} most frequent values:".format(max_nr))

    print(count_values(df, col, normalize).iloc[:max_nr])


print_value_counts(passangers_df, "Sex")
print_value_counts(passangers_df, "Survived")



Sex
total size: 891
unique values: 2
male      0.647587
female    0.352413
Name: Sex, dtype: float64


Survived
total size: 891
unique values: 2
0    0.616162
1    0.383838
Name: Survived, dtype: float64


According to the Wikipedia section [Casualties and survivors](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic#Casualties_and_survivors), there were 1207 passengers and 442 survivors, which gives a 37% survival rate (since the Kaggle dataset contains only passengers, we consider only passenger information and exclude crew information). The provided Kaggle data set has only 891 data points.
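
The arithmetic behind these figures can be checked directly. The following is a minimal sketch that uses only the numbers quoted above; the variable names are illustrative and not part of the original notebook.

```python
# Figures quoted in the text above (Wikipedia passenger counts, Kaggle sample size).
wiki_passengers = 1207
wiki_survivors = 442
kaggle_rows = 891

# Survival rate implied by the Wikipedia figures.
wiki_survival_rate = wiki_survivors / float(wiki_passengers)

# Number of passengers that do not appear in the Kaggle sample.
missing_rows = wiki_passengers - kaggle_rows

# Share of the Wikipedia passenger count covered by the Kaggle sample.
kaggle_coverage = kaggle_rows / float(wiki_passengers)

print("Wikipedia survival rate: {:.2f}%".format(wiki_survival_rate * 100))
print("Rows missing from Kaggle: {}".format(missing_rows))
print("Kaggle coverage: {:.1f}%".format(kaggle_coverage * 100))
```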

So about 300 data points are missing from the Kaggle data set. This raises the question of why they were removed. **Specifically, were those data points removed (from the Kaggle sample) to accentuate trends which were not so clear in the entire (Wikipedia) dataset?**

It should be pointed out that any data from the Titanic disaster (and passenger data in particular) is quite noisy, due to the historical era and the tragic circumstances. According to the Wikipedia page referenced earlier, the death toll was estimated at between 1,490 and 1,635. This range represents about 7% of the people aboard (passengers + crew). So there is no clear ground truth to compare to.

I also do not know whether the same data was used to generate the report from Wikipedia and to create the Kaggle sample. But it seems reasonable to assume that the general trends should be the same in both datasets.

Let us adopt the following assumption, which we will revisit later.

**Assumption 1: The Kaggle data is a representative sample of the (population of) passengers of the Titanic (as described in Wikipedia).**

**Question 3: Does Assumption 1 hold?**

As a quick check, the survival rate is very similar in both data sets (38% in Kaggle vs. 37% in Wikipedia). The proportion of women is also very similar: 35% in Kaggle vs. 38% in Wikipedia. Note that Wikipedia lists adults and children separately and reports only 33% female passengers. About 9% of passengers are listed as children. Using the [Passenger List](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic#Passenger_list) from Wikipedia and assuming that passengers aged 12 and below are considered children (more on that in the Data Wrangling section), I establish that there were 53 girls on the Titanic. So the total percentage of female passengers (women and girls) in the Wikipedia dataset rises to 38%.

For now, Assumption 1 seems to hold. If it did not, the applicability of conclusions drawn from the Kaggle data to the historical Titanic disaster would be questionable.

To investigate my question, I need to know the age limit for a child in 1912. Ideally, I would use the same value as the Wikipedia article, so that the results are comparable. Unfortunately, Wikipedia does not explicitly state such an age limit. I will address this issue in the Data Wrangling phase.

For my investigation I will need the values for 'Age', 'Pclass' (which itself is a proxy for 'Fare'), 'Sex' and of course 'Survived' for each passenger.

'Age' proves to be problematic, since about 20% of passengers (177 passengers) don't have an entry for 'Age'.


In [8]:
print_value_counts(passangers_df,"Age", max_nr =5)



Age
total size: 891
unique values: 89
top 5 most frequent values:
NaN      0.198653
 24.0    0.033670
 22.0    0.030303
 18.0    0.029181
 30.0    0.028058
Name: Age, dtype: float64
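
The number of missing ages can also be read off directly with `isnull()`. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the Kaggle data) so that it runs on its own; on the real data the same two expressions would be applied to `passangers_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Kaggle frame.
toy_df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 4.0]})

# isnull() marks missing entries; sum() counts them, mean() gives their share.
n_missing = int(toy_df["Age"].isnull().sum())
share_missing = float(toy_df["Age"].isnull().mean())

print("missing ages: {} ({:.0%} of rows)".format(n_missing, share_missing))
```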


The remaining relevant fields 'Pclass', 'Sex' and 'Survived' nicely partition the data into two or three (for class) partitions.

To be on the safe side, I have checked whether 'Name' and 'PassengerId' each provide a unique identifier for each data point. They do.
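
The check itself is not shown above. One way it could be done is with `Series.is_unique`; this is a sketch on a hypothetical toy frame (`toy_df`, built from three of the names displayed earlier), not necessarily the check that was actually run.

```python
import pandas as pd

# Hypothetical toy frame; on the real data, use passangers_df instead.
toy_df = pd.DataFrame({
    "PassengerId": [1, 3, 5],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
})

# is_unique is True when no value occurs more than once in the column.
id_is_unique = bool(toy_df["PassengerId"].is_unique)
name_is_unique = bool(toy_df["Name"].is_unique)

print("PassengerId unique: {}".format(id_is_unique))
print("Name unique: {}".format(name_is_unique))
```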


## Data Wrangling

Data wrangling consists of two steps:

1. Remove the 177 passengers with unspecified age from the dataset
2. Add a new (boolean) field 'Child' to the dataset

The first step is straightforward:

In [9]:
passangers_df_age = passangers_df.dropna(subset=["Age"])
# Check that the cleaned dataset has the expected number of entries: 891 - 177 = 714
print(len(passangers_df_age))

714
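
The second step could look like the following sketch. It assumes a cutoff of 12 years, which is the value argued for in the 'Age limit for children' discussion; `CHILD_MAX_AGE` and `toy_df` are illustrative names, and the ages are made up.

```python
import pandas as pd

# Assumed cutoff: passengers aged 12 and below count as children.
CHILD_MAX_AGE = 12

# Hypothetical ages standing in for the cleaned Kaggle frame.
toy_df = pd.DataFrame({"Age": [2.0, 12.0, 13.0, 35.0]})

# assign() returns a new frame with the boolean 'Child' column added,
# leaving the input frame unchanged.
with_child = toy_df.assign(Child=toy_df["Age"] <= CHILD_MAX_AGE)

print(with_child)
```

Using `assign` rather than writing into a filtered frame in place avoids modifying a slice of the original data.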


I was curious why there are so many NaN values. The [Passenger List](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic#Passenger_list) on Wikipedia lists only two passengers of unknown age. My first idea was that the majority of passengers of unknown age did not survive to report their age. The data shows this is the case for 70% of them.

In [10]:
print_value_counts( passangers_df[passangers_df['Age'].isnull()],"Survived")



Survived
total size: 177
unique values: 2
0    0.706215
1    0.293785
Name: Survived, dtype: float64


**Note 1** The Kaggle dataset contains 177 passengers of unknown age. The Wikipedia Passenger List has only 2.

Do the proportions of women and survivors change when passengers of unknown age are removed from the dataset?

In [11]:
def print_pretty_percent(prefix, field, val, dataset):
    # Print the share of `val` in dataset[field] as a percentage
    print("{0:s} {1:2.2f}%".format(prefix, count_values(dataset, field)[val] * 100))


print_pretty_percent("Woman before\t\t", 'Sex', 'female', passangers_df)
print_pretty_percent("Woman after\t\t", 'Sex', 'female', passangers_df_age)
print("Woman Wikipedia\t\t {0:2.2f}%\n".format(37.70))

print_pretty_percent("Survived before\t\t", 'Survived', 1, passangers_df)
print_pretty_percent("Survived after\t\t", 'Survived', 1, passangers_df_age)
print("Survived Wikipedia\t {0:2.2f}%\n".format(36.62))

Woman before		 35.24%
Woman after		 36.55%
Woman Wikipedia		 37.70%

Survived before		 38.38%
Survived after		 40.62%
Survived Wikipedia	 36.62%



Both proportions increase slightly. The proportion of survivors (40.62%) is now considerably higher than the one reported in Wikipedia (36.62%).

**Age limit for children**

As for the age, the situation is as follows. [Casualties and survivors](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic#Casualties_and_survivors) reports summary data for children split by class. The [Passenger List](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic#Passenger_list) on Wikipedia provides, for each class, a list of passengers that includes survival, age and gender (the latter indirectly, through the titles 'Miss' or 'Master').

We can infer the cutoff age used in 'Casualties and survivors' by comparing its counts with the cumulative counts for different cutoff ages from the 'Passenger List'. Effectively, we perform a small data acquisition step from a different data source ('Passenger List') to disambiguate the existing data source (finding the exact cutoff age, i.e. the operational definition of 'child', used in 'Casualties and survivors').

By ordering the entries in the 'Passenger List' by age, we obtain the following counts:

In [12]:
wiki_children = pd.DataFrame({"Wiki": [6,24,79,109],
                              "<=11": [5,22,76,103],
                              "<=12": [5,24,80,109],
                              "<=13": [7,25,83,115],
                              "<=14": [7,27,89,123]},
                              index = ["Class I","Class II", "Class III","Total"])
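
One way to compare the candidate cutoffs with the Wikipedia counts is to sum the absolute per-class differences. This is a sketch of such a comparison, not something the original analysis ran; the frame is repeated so that the block runs on its own, and `abs_diff` and `best_cutoff` are illustrative names.

```python
import pandas as pd

# Same counts as in the cell above, repeated to keep this block self-contained.
wiki_children = pd.DataFrame({"Wiki": [6, 24, 79, 109],
                              "<=11": [5, 22, 76, 103],
                              "<=12": [5, 24, 80, 109],
                              "<=13": [7, 25, 83, 115],
                              "<=14": [7, 27, 89, 123]},
                             index=["Class I", "Class II", "Class III", "Total"])

# Compare per-class rows only; the 'Total' row would double count the differences.
class_rows = wiki_children.drop("Total")
candidates = class_rows.drop(columns="Wiki")

# Sum of absolute differences between each candidate cutoff and the Wikipedia counts.
abs_diff = candidates.sub(class_rows["Wiki"], axis=0).abs().sum()
best_cutoff = abs_diff.idxmin()

print(abs_diff)
print("closest cutoff: {}".format(best_cutoff))
```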


**TODO: play with highlighting and styles here**

Age 12 and below seems to be the most likely cutoff. We can validate this by checking a strong trend apparent in 'Casualties and survivors': all but one child from Class I and II survived the disaster. Can we observe this trend in passengers aged 12 and below and NOT in passengers aged 13 and below?

In [20]:
passangers_df_age[passangers_df_age["Age"]<=12][["Age","Pclass","Survived"]]

Unnamed: 0,Age,Pclass,Survived
7,2.00,3,0
10,4.00,3,1
16,2.00,3,0
24,8.00,3,0
43,3.00,2,1
50,7.00,3,0
58,5.00,2,1
59,11.00,3,0
63,4.00,3,0
78,0.83,2,1
