## The investigated question

The sinking of the Titanic is probably the most notorious peacetime naval disaster in history. As such, it is not only a source of many popular legends but also the primary illustration of the naval code of conduct in such tragic cases. For instance, the popular belief that the captain should be the last person to leave the ship likely originates from dramatic portrayals of the Titanic disaster.

Another instance of maritime chivalry dictates that 'women and children (are) first' to board the lifeboats. However, the 1997 movie Titanic tells a different story. According to the movie, the class and status of passengers took precedence over chivalry, and passengers from the lower classes were simply left to drown regardless of their sex and age.

My goal is to investigate whether the data supports either the chivalry-based or the class-based interpretation of the story.

**Does the data provide strong support for one of these claims?**

**Question 1. Did women and children have a higher chance of surviving the Titanic disaster regardless of their class?**

**Question 2. Did first-class passengers have a higher chance of surviving than women and children from the lower classes?**

## Loading and type casting the data

In [1]:
#reading in csv

import unicodecsv

passangers_header = []

def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        header = reader.fieldnames
        return header,list(reader)
    

passangers_header, passangers_csv = read_csv("titanic_data.csv")

In [2]:
#parsing the data, code adapted from the lecture

# Takes a string which is either an empty string or represents an integer,
# and returns an int or None.
def parse_maybe_int(i):
    if i == '':
        return None
    else:
        return int(i)

def parse_maybe_float(f):
    if f == '':
        return None
    else:
        return float(f)
    
    
# Clean up the data types in the passengers table
for passanger in passangers_csv:
    passanger['Age'] = parse_maybe_float(passanger['Age'])    
    passanger['Fare'] = parse_maybe_float(passanger['Fare'])
    
    passanger['Parch'] = parse_maybe_int(passanger['Parch'])
    passanger['PassengerId'] = parse_maybe_int(passanger['PassengerId'])
    passanger['Pclass'] = parse_maybe_int(passanger['Pclass'])
    passanger['SibSp'] = parse_maybe_int(passanger['SibSp'])
    
    passanger['Survived'] = bool(parse_maybe_int(passanger['Survived'])) 

passangers_csv[0]

{u'Age': 22.0,
 u'Cabin': u'',
 u'Embarked': u'S',
 u'Fare': 7.25,
 u'Name': u'Braund, Mr. Owen Harris',
 u'Parch': 0,
 u'PassengerId': 1,
 u'Pclass': 3,
 u'Sex': u'male',
 u'SibSp': 1,
 u'Survived': False,
 u'Ticket': u'A/5 21171'}

## Quick Exploration of Data

Let us explore the distinct values of each field and their frequencies.

In [36]:
from collections import defaultdict

def value_freq(field, csv, normalize = True, verbose = False):
    result_dict = defaultdict(int)
    total_count = 0
    
    #keeps track whether keys provide a unique identifier
    all_unique = True
    
    for entry in csv:
        key = entry[field] 
        
        result_dict[key] += 1
        if all_unique and key is not None and result_dict[key]==2:
            all_unique = False
        
        total_count += 1
    
    #normalize/ calculate frequency
    if normalize:
        for key in result_dict:
            result_dict[key] /= float(total_count)
    
    
    if verbose:
        if None in result_dict:
            print "%s %s" % ("None", result_dict[None])

        print "%s %s" % ("All keys unique ", all_unique)
        print "%s %s" % ("different keys ", len(result_dict))
        print "%s %s" % ("total", total_count)
    
    return result_dict

attr_list = ['Age','Fare','Pclass','Sex','Survived']
value_freq('Sex',passangers_csv,normalize = True, verbose = True)

All keys unique  False
different keys  2
total 891


defaultdict(int, {u'female': 0.35241301907968575, u'male': 0.6475869809203143})

According to Wikipedia's [Casualties and survivors](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic#Casualties_and_survivors) section, there were 1207 passengers and 442 survivors, which gives a 37% survival rate (since the Kaggle dataset contains only passengers, I consider only passenger information and exclude crew information). The provided Kaggle dataset has only 891 data points.

So about 300 data points are missing from the Kaggle dataset. This raises the question of why they were removed. **Specifically, were those data points removed (from the Kaggle sample) to accentuate trends which were not so clear from the entire (Wikipedia) dataset?**
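The figures quoted above can be retraced with a few lines of arithmetic. This is only a sketch: the three counts are the ones given in the text, and the variable names are my own.

```python
# Counts quoted in the text: Wikipedia passengers and survivors, Kaggle rows.
wiki_passengers = 1207
wiki_survivors = 442
kaggle_rows = 891

# Survival rate among passengers according to the Wikipedia figures.
wiki_survival_rate = wiki_survivors / float(wiki_passengers)

# Passengers counted by Wikipedia but absent from the Kaggle file.
missing_rows = wiki_passengers - kaggle_rows
missing_share = missing_rows / float(wiki_passengers)

print("Wikipedia survival rate: {0:.2f}%".format(wiki_survival_rate * 100))
print("Rows missing from Kaggle: {0} ({1:.1f}%)".format(missing_rows, missing_share * 100))
```

On these figures the Kaggle file covers roughly three quarters of the passengers counted by Wikipedia.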

It should be pointed out that any data from the Titanic disaster (and passenger data in particular) is quite noisy, due to the historical era and the tragic circumstances. According to the Wikipedia page referenced earlier, the death toll was estimated at between 1,490 and 1,635. The width of this range represents about 7% of the people aboard (passengers + crew). So there is no clear ground truth to compare to.

I also do not know whether the same data was used to generate the report from Wikipedia and to create the Kaggle sample. But it seems reasonable to assume that the general trends should be the same in both datasets.

Let us adopt the following assumption, which we will revisit later.

**Assumption 1: The Kaggle data is a representative sample of the (population of) passengers of the Titanic (as described in Wikipedia).**

**Question 3: Does Assumption 1 hold?**

As a quick check, the survival rate is very similar in both datasets (38% in Kaggle vs 37% in Wikipedia). The proportion of women is also similar: 35% in Kaggle vs 38% in Wikipedia. Note that Wikipedia lists adults and children separately and reports only 33% (adult) female passengers. About 9% of passengers are listed as children. Using the [Passenger List](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic#Passenger_list) from Wikipedia and assuming that passengers aged 12 and below are considered children (more on that in the Data Wrangling section), I established that there were 53 girls on the Titanic. So the total percentage of female passengers in the Wikipedia dataset rises to 38%.
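The step from 33% to 38% can be retraced with a short calculation. Note the assumption: the number of adult women (402) is not quoted in the text. It is back-calculated here from the 37.70% that the notebook later prints for Wikipedia, so treat it as illustrative rather than as a sourced figure.

```python
wiki_passengers = 1207  # passenger count quoted in the text
girls = 53              # girls aged 12 and below, counted from the Passenger List

# ASSUMPTION: adult women, back-calculated from the 37.70% figure, not quoted in the text.
adult_women = 402

adult_share = adult_women / float(wiki_passengers)
female_share = (adult_women + girls) / float(wiki_passengers)

print("Adult women only:  {0:.2f}%".format(adult_share * 100))
print("Women plus girls:  {0:.2f}%".format(female_share * 100))
```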

For now Assumption 1 seems to hold. If it did not, the applicability of conclusions drawn from the Kaggle data to the historical Titanic disaster would be questionable.

To investigate my question, I need to know the age limit for a child in 1912. Ideally I would use the same value as the Wikipedia article, so that the results are comparable. Unfortunately, Wikipedia does not explicitly state such an age limit. I will address this issue in the Data Wrangling phase.

For my investigation I will need the values of 'Age', 'Pclass' (which itself is a proxy for 'Fare'), 'Sex' and of course 'Survived' for each passenger.

'Age' proves to be problematic, since about 20% of passengers (177 passengers) don't have an entry for 'Age'.


In [29]:
value_freq('Age',passangers_csv,normalize = False, verbose = True);

None 177
All keys unique  False
different keys  89
total 891


'Age' also takes 88 distinct values (other than None), so I will have to use histograms for visualisation.

The remaining relevant fields 'Pclass', 'Sex' and 'Survived' neatly partition the data into two groups each (three for 'Pclass').

To be on the safe side, I checked whether 'Name' and 'PassengerId' provide a unique identifier for each data point. They do.
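The notebook does not show how the uniqueness check was run. One possible way is sketched below. The helper `is_unique` and the three-row sample are hypothetical and only mimic the shape of `passangers_csv`.

```python
def is_unique(field, dataset):
    """Return True if no two entries share the same value for `field`."""
    seen = set()
    for entry in dataset:
        value = entry[field]
        if value in seen:
            return False
        seen.add(value)
    return True

# Hypothetical sample; only the first name comes from the real data.
sample = [
    {'PassengerId': 1, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male'},
    {'PassengerId': 2, 'Name': 'Placeholder passenger B', 'Sex': 'female'},
    {'PassengerId': 3, 'Name': 'Placeholder passenger C', 'Sex': 'male'},
]

for field in ['PassengerId', 'Name', 'Sex']:
    print("{0}: unique = {1}".format(field, is_unique(field, sample)))
```

On the real data the call would be `is_unique('PassengerId', passangers_csv)`. Calling `value_freq` with `verbose = True` reports the same information in its "All keys unique" line, except that it ignores repeated None values.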


## Data Wrangling

Data wrangling consists of two steps:

1. Remove the 177 passengers with unspecified age from the dataset
2. Add a new (boolean) field 'Child' to the dataset

The first step is straightforward:

In [30]:
def filter_val_entries(field, dataset):
    dataset_val = []
    dataset_none = []
    
    for entry in dataset:
        if entry[field] is not None:
            dataset_val.append(entry)
        else:
            dataset_none.append(entry)
    
    return dataset_val, dataset_none

passangers_csv_val, passangers_csv_none = filter_val_entries('Age', passangers_csv);

#check whether the clean dataset has the right nr of entries 891-177 = 714
print len(passangers_csv_val)
print

#another test whether all are None
value_freq('Age',passangers_csv_none,normalize = True, verbose = True);


714

None 1.0
All keys unique  True
different keys  1
total 177
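With the unknown ages filtered out, the second step (adding the 'Child' field) could look roughly like the sketch below. The helper `add_child_field` is hypothetical, and the limit of 12 is only the working assumption from the quick exploration; the final age limit may well differ.

```python
# ASSUMPTION: "age 12 and below" counts as a child, as in the quick exploration.
CHILD_AGE_LIMIT = 12

def add_child_field(dataset, age_limit=CHILD_AGE_LIMIT):
    """Add a boolean 'Child' field to every entry, in place.

    Every entry must have a numeric 'Age', so run this on the filtered data only.
    """
    for entry in dataset:
        entry['Child'] = entry['Age'] <= age_limit
    return dataset

# Hypothetical entries in the same shape as passangers_csv_val.
sample = [{'Age': 22.0}, {'Age': 12.0}, {'Age': 0.75}]
add_child_field(sample)
print([entry['Child'] for entry in sample])
```

The function modifies the entries in place, which matches how the type-casting loop earlier updates `passangers_csv`.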


I kept the entries with None values because I was curious why there are so many of them. The [Passenger List](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic#Passenger_list) on Wikipedia lists only two passengers of unknown age. My first idea was that the majority of passengers of unknown age did not survive to report their age. The data shows this is the case for about 70% of them.

In [31]:
value_freq('Survived',passangers_csv_none, normalize = True, verbose = False)

defaultdict(int, {False: 0.7062146892655368, True: 0.2937853107344633})

**Note 1** The Kaggle dataset contains 177 passengers of unknown age. The Wikipedia Passenger List has only 2.

Does the proportion of women and survivors change when passengers of unknown age are removed from the dataset?

In [76]:
def print_pretty_percent(prefix, field, val, dataset):
    print "{0:s} {1:2.2f}%".format(prefix, (value_freq(field, dataset, normalize = True, verbose = False)[val])*100)

print_pretty_percent("Woman before\t\t",'Sex','female',passangers_csv);
print_pretty_percent("Woman after\t\t",'Sex','female',passangers_csv_val);
print "Woman Wikipedia\t\t {0:2.2f}%\n".format(37.70)

print_pretty_percent("Survived before\t\t",'Survived',True,passangers_csv);
print_pretty_percent("Survived after\t\t",'Survived',True,passangers_csv_val);
print "Survived Wikipedia\t {0:2.2f}%\n".format(36.62)

Woman before		 35.24%
Woman after		 36.55%
Woman Wikipedia		 37.70%

Survived before		 38.38%
Survived after		 40.62%
Survived Wikipedia	 36.62%



Both proportions increase slightly. The proportion of survivors is now about 4 percentage points higher than the figure reported in Wikipedia (40.62% vs 36.62%).
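Whether a gap of this size is more than sampling noise can be gauged with a rough one-sample z-score for a proportion. This is a sketch under a strong assumption, namely that the 714 passengers with a known age behave like a simple random sample of the Wikipedia population. It is not part of the original analysis.

```python
import math

n = 714            # Kaggle passengers with a known age
p_sample = 0.4062  # their survival rate, from the output above
p_wiki = 0.3662    # survival rate from the Wikipedia figures

# Standard error of a sample proportion if the true rate were p_wiki.
# ASSUMPTION: the 714 passengers are a simple random sample.
std_err = math.sqrt(p_wiki * (1 - p_wiki) / n)

# How many standard errors the observed rate lies above the Wikipedia rate.
z = (p_sample - p_wiki) / std_err

print("difference = {0:.2f} points, z = {1:.2f}".format((p_sample - p_wiki) * 100, z))
```

The z-score comes out a little above 2, which hints that removing the unknown ages shifts the survival rate by more than chance alone would usually produce. Since the random-sample assumption is exactly what Question 3 probes, this is an indication at best.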

**Age limit for children**

As for age, [Casualties and survivors](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic#Casualties_and_survivors) reports a strong pattern among children: all but one child from the First and Second Class survived. When we compare that to 