## The investigated question

The sinking of the Titanic is probably the most notorious peacetime naval disaster in history. As such, it is not only the source of many popular legends but also a primary illustration of the naval code of conduct in such tragic cases. For instance, the popular belief that the captain should be the last person to leave the ship likely originates from dramatic portrayals of the Titanic disaster.

Another rule of maritime chivalry dictates that 'women and children (are) first' to board the rescue boats. However, the 1997 Titanic movie tells a different story. According to the movie, the class and status of passengers took precedence over chivalry, and passengers from the lower classes were simply left to drown regardless of their sex and age.

My goal is to investigate whether the data supports either the chivalry or class based interpretation of the story.

**Does the data provide strong support for one of these claims?**

**Question 1. Did women and children have a higher chance of surviving the Titanic disaster regardless of their class?**

**Question 2. Did first-class passengers have a higher chance of surviving than women and children from lower classes?**

## Loading and type casting the data

In [88]:
#reading in csv
import pandas as pd

from IPython.display import display

passangers_df = pd.read_csv("titanic_data.csv")
display(passangers_df.head())

print "size {}".format(len(passangers_df))
print "'Name' type {}".format(passangers_df["Name"].dtype)
print "'Survived' type {}".format(passangers_df["Survived"].dtype)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


size 891
'Name' type object
'Survived' type int64


## Quick Exploration of Data

There are 891 entries in the data set. It appears we can rely on the type inference done by pandas. The 'Survived' column is int64 instead of boolean, but that is not a problem. Other columns that should be strings ('Name', 'Sex') are generic objects, but their entries can be manipulated like strings. For now, further type casting does not seem necessary.
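To illustrate why the inferred types are good enough, here is a minimal sketch on a two-row toy frame (the frame `toy` and its values are made up for illustration; only the column names come from the dataset). It shows that 0/1 integers already work for rate calculations and that object columns accept string operations.

```python
import pandas as pd

# Toy rows in the shape of the Kaggle file (illustrative values only).
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female"],
    "Survived": [0, 1],
})

# 0/1 integers average to the share of ones, i.e. the survival rate.
survival_rate = toy["Survived"].mean()

# An explicit cast is available if a boolean mask is preferred.
survived_mask = toy["Survived"].astype(bool)

# Object columns holding strings expose the vectorised .str accessor.
surnames = toy["Name"].str.split(",").str[0]

print(survival_rate)
print(survived_mask.tolist())
print(surnames.tolist())
```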

In [89]:
def count_values(df,col, normalize = True, dropna=False):
    return df[col].value_counts(normalize = normalize, sort=True, ascending=False, bins=None, dropna = dropna)
 
def print_value_counts(df,col, normalize = True, max_nr = -1):
    print "\n"
    print col
    print "total size: {}".format(len(df[col]))
    print "unique values: {}".format(len(df[col].unique()))
    
    if max_nr == -1:
        max_nr = len(df[col])
    else:
        print "top {} most frequent values:".format(max_nr)
        
    print count_values(df,col,normalize).iloc[:max_nr]

print_value_counts(passangers_df,"Sex")
print_value_counts(passangers_df,"Survived")



Sex
total size: 891
unique values: 2
male      0.647587
female    0.352413
Name: Sex, dtype: float64


Survived
total size: 891
unique values: 2
0    0.616162
1    0.383838
Name: Survived, dtype: float64


According to Wikipedia's [Casualties and survivors](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic#Casualties_and_survivors), there were 1207 passengers and 442 survivors, which gives a 37% survival rate (since the Kaggle dataset contains only passengers, we consider only passenger information and exclude crew information). The provided Kaggle data set has only 891 data points.

So about 300 data points are missing from the Kaggle data set. This raises the question of why they were removed. **Specifically, were those data points removed (from the Kaggle sample) to accentuate trends that were not so clear in the entire (Wikipedia) dataset?**

It should be pointed out that any data from the Titanic disaster (and passenger data in particular) is quite noisy, due to the historic era and the tragic circumstances. According to the Wikipedia page referenced earlier, the death toll was estimated at between 1,490 and 1,635. This range represents about 7% of the people aboard (passengers + crew). So there is no clear ground truth to compare to.

I also do not know whether the same data was used to generate the report from Wikipedia and to create the Kaggle sample. But it seems reasonable to assume that the general trends should be the same in both datasets.

Let us adopt the following assumption, which we will revisit later.

**Assumption 1: The Kaggle data is a representative sample of the (population of) passengers of the Titanic (as described in Wikipedia).**

**Question 3: Does Assumption 1 hold?**

As a quick check, the survival rate is very similar in both datasets (38% in Kaggle vs. 37% in Wikipedia). The proportion of women is also similar: 35% in Kaggle vs. 38% in Wikipedia. Note that Wikipedia lists adults and children separately and reports only 33% adult female passengers; about 9% of passengers are listed as children. Using the [Passenger List](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic#Passenger_list) from Wikipedia and assuming that passengers aged 12 and below are considered children (more on that in the Data Wrangling section), I established that there were 53 girls on the Titanic. Counting them raises the total share of female passengers in the Wikipedia data to 38%.
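The arithmetic behind these Wikipedia percentages can be retraced in a few lines. This is only a sketch using the figures quoted in the text; because the 33% share of adult women is itself rounded, the combined share comes out at roughly 37 to 38% rather than an exact value.

```python
# Figures quoted in the text (Wikipedia, passengers only).
wiki_passengers = 1207
wiki_survivors = 442
wiki_girls = 53            # female passengers aged 12 or below
adult_women_share = 0.33   # rounded share of adult women quoted in the text

survival_rate = wiki_survivors / float(wiki_passengers)
girls_share = wiki_girls / float(wiki_passengers)

# Adding the girls to the adult women gives the total female share.
female_share = adult_women_share + girls_share

print("survival rate: {:.2f}%".format(survival_rate * 100))
print("girls: {:.2f} percentage points".format(girls_share * 100))
print("female share incl. girls: {:.1f}%".format(female_share * 100))
```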

For now Assumption 1 seems to hold. If it did not, the applicability of conclusions drawn from the Kaggle data to the historical Titanic disaster would be questionable.
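One way to make "very similar" more concrete is a one-sample z-test for a proportion. The sketch below is not part of the original analysis and rests on strong assumptions: it treats the Kaggle rows as a simple random sample from the 1207 Wikipedia passengers, which the text explicitly does not know to be true. Since the sample would cover most of that population, the result is shown both without and with a finite population correction.

```python
import math

n_kaggle = 891
p_kaggle = 0.383838          # share of 'Survived' == 1 printed earlier
n_wiki = 1207
p_wiki = 442 / 1207.0        # population rate from the Wikipedia figures

# Standard error of a sample proportion if the true rate were p_wiki.
# Assumption: simple random sampling with replacement.
std_err = math.sqrt(p_wiki * (1 - p_wiki) / n_kaggle)
z_plain = (p_kaggle - p_wiki) / std_err

# Finite population correction for sampling without replacement.
fpc = math.sqrt((n_wiki - n_kaggle) / float(n_wiki - 1))
z_fpc = (p_kaggle - p_wiki) / (std_err * fpc)

def two_sided_p(z):
    """Two-sided p-value under the normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

print("without correction: z = {:.2f}, p = {:.2f}".format(z_plain, two_sided_p(z_plain)))
print("with correction:    z = {:.2f}, p = {:.2f}".format(z_fpc, two_sided_p(z_fpc)))
```

The two variants point in different directions at the usual 5% level, so this quick check does not settle Question 3 either way; it mainly shows how sensitive the answer is to the sampling assumption.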

To investigate my question, I need to know the age limit for a child in 1912. Ideally I would like to use the same value as the Wikipedia article, so that the results are comparable. Unfortunately, Wikipedia does not explicitly state such an age limit. I will address this issue in the Data Wrangling phase.

For my investigation I will need the values of 'Age', 'Pclass' (which itself is a proxy for 'Fare'), 'Sex' and of course 'Survived' for each passenger.

'Age' proves to be problematic, since about 20% of passengers (177 passengers) don't have an entry for 'Age'.


In [90]:
print_value_counts(passangers_df,"Age", max_nr =5)



Age
total size: 891
unique values: 89
top 5 most frequent values:
NaN      0.198653
 24.0    0.033670
 22.0    0.030303
 18.0    0.029181
 30.0    0.028058
Name: Age, dtype: float64


The remaining relevant fields 'Pclass', 'Sex' and 'Survived' nicely partition the data into two or three (for class) partitions.

To be on the safe side, I checked whether 'Name' and 'PassengerId' each provide a unique identifier for each data point. They do.
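The uniqueness check and the missing-value count are not shown in the notebook. A minimal sketch of how they could be done follows; the helper names `is_unique_key` and `missing_count` and the toy frame are hypothetical, and in the notebook the calls would be made on `passangers_df`.

```python
import pandas as pd

# Toy frame standing in for passangers_df (illustrative values only).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Allen, Mr. William Henry"],
    "Age": [22.0, None, 35.0],
})

def is_unique_key(df, col):
    """True if every row has a distinct, non-missing value in `col`."""
    return bool(df[col].notnull().all() and df[col].is_unique)

def missing_count(df, col):
    """Number of rows with no value in `col`."""
    return int(df[col].isnull().sum())

print(is_unique_key(toy, "PassengerId"))
print(is_unique_key(toy, "Name"))
print(missing_count(toy, "Age"))
```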


## Data Wrangling

Data wrangling consists of two steps:

1. Remove the 177 passengers with unspecified age from the dataset
2. Add a new (boolean) field 'Child' to the dataset

The first step is straightforward:

In [91]:
passangers_df_age = passangers_df.dropna(subset =  ["Age"])
#check whether the clean dataset has the right nr of entries 891-177 = 714
print len(passangers_df_age)

714


I was curious why there are so many NaN values. The [Passenger List](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic#Passenger_list) on Wikipedia lists only two passengers of unknown age. My first idea was that the majority of passengers of unknown age did not survive to report their age. The data shows that 70% of them died, compared with 62% of all passengers in the sample.

In [92]:
print_value_counts( passangers_df[passangers_df['Age'].isnull()],"Survived")



Survived
total size: 177
unique values: 2
0    0.706215
1    0.293785
Name: Survived, dtype: float64


**Note 1** The Kaggle dataset contains 177 passengers of unknown age. The Wikipedia Passenger List has only 2.

Does the proportion of women and survivors change when passengers of unknown age are removed from the dataset?

In [93]:
def print_pretty_percent(prefix, field, val, dataset):
    print "{0:s} {1:2.2f}%".format(prefix, (count_values(dataset,field)[val])*100)

print_pretty_percent("Woman before\t\t",'Sex','female',passangers_df);
print_pretty_percent("Woman after\t\t",'Sex','female',passangers_df_age);
print "Woman Wikipedia\t\t {0:2.2f}%\n".format(37.70)

print_pretty_percent("Survived before\t\t",'Survived',1,passangers_df);
print_pretty_percent("Survived after\t\t",'Survived',1,passangers_df_age);
print "Survived Wikipedia\t {0:2.2f}%\n".format(36.62)

Woman before		 35.24%
Woman after		 36.55%
Woman Wikipedia		 37.70%

Survived before		 38.38%
Survived after		 40.62%
Survived Wikipedia	 36.62%



Both proportions increase slightly. The proportion of survivors is now considerably higher than reported in Wikipedia.

**Age limit for children - Wikipedia**

As for the age, the situation is as follows. [Casualties and survivors](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic#Casualties_and_survivors) reports summary data for children split by class. The [Passenger List](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic#Passenger_list) on Wikipedia lists the passengers of each class together with survival, age and gender (indirectly, through the titles 'Miss' and 'Master').

We can infer the cutoff age used in 'Casualties and survivors' by comparing its counts with cumulative counts for different cutoff ages from the 'Passenger List'. Effectively, this is a small data acquisition step from a second data source ('Passenger List') to disambiguate the existing one (finding the exact cutoff age, i.e. the operational definition of 'child', used in 'Casualties and survivors').

By ordering the passengers in the 'Passenger List' by age, we obtain the following counts:

In [94]:
wiki_children = pd.DataFrame({"Wiki": [6,24,79,109],
                              "<=11": [5,22,76,103],
                              "<=12": [5,24,80,109],
                              "<=13": [7,25,83,115],
                              "<=14": [7,27,89,123]},
                              index = ["Class I","Class II", "Class III","Total"])

display(wiki_children)

#subtract Wiki column from the others
diff_df = wiki_children.sub(wiki_children["Wiki"],axis = 0)
#drop the Wiki column by name (column order is not guaranteed) + take absolute values
diff_df = diff_df.drop("Wiki", axis=1).abs()
# highlight min in each row
diff_df.style.highlight_min(axis=1)


Unnamed: 0,<=11,<=12,<=13,<=14,Wiki
Class I,5,5,7,7,6
Class II,22,24,25,27,24
Class III,76,80,83,89,79
Total,103,109,115,123,109


Unnamed: 0,<=11,<=12,<=13,<=14
Class I,1,1,1,1
Class II,2,0,1,3
Class III,3,1,4,10
Total,6,0,6,14


A cutoff of 12 years and below seems the most likely, although the differences from the other candidate ages are not large.
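The "Total" row can hide offsetting errors: for the 12-year cutoff the total matches exactly even though Class I and Class III are each off by one. A slightly stricter comparison, sketched below with the counts from the table, sums the absolute per-class differences instead. The variable names are illustrative.

```python
# Child counts per class: Wikipedia summary vs. cumulative passenger-list counts.
wiki = {"Class I": 6, "Class II": 24, "Class III": 79}
candidates = {
    "<=11": {"Class I": 5, "Class II": 22, "Class III": 76},
    "<=12": {"Class I": 5, "Class II": 24, "Class III": 80},
    "<=13": {"Class I": 7, "Class II": 25, "Class III": 83},
    "<=14": {"Class I": 7, "Class II": 27, "Class III": 89},
}

# Sum of absolute per-class deviations from the Wikipedia summary.
total_abs_diff = {
    cutoff: sum(abs(counts[c] - wiki[c]) for c in wiki)
    for cutoff, counts in candidates.items()
}

# The candidate with the smallest total deviation.
best_cutoff = min(total_abs_diff, key=total_abs_diff.get)

print(total_abs_diff)
print(best_cutoff)
```

Under this measure the 12-year cutoff still comes out ahead of its neighbours.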

I have attempted to validate the selected age by checking a strong trend apparent in 'Casualties and survivors': all but one child from Class I and II survived the disaster. Can I observe this trend in passengers of age 12 and below and NOT in passengers of age 13 and below?

In [95]:
def format_percentage(p):
    return "{0:2.2f}%".format(p*100)

def analyze_children_survival (df,age):
    print "Child of <{} years".format(age)
    
    grouped = df[df["Age"]<=age][["Pclass","Survived"]].groupby(["Pclass"])
    
    res = pd.concat([grouped.sum(),grouped.count(),grouped.mean()],axis = 1)
    res.columns = ["Survived","Total","Percentage"]
    res["Percentage"] = res["Percentage"].apply(format_percentage)
    
    return res

display(analyze_children_survival(passangers_df_age, 11))
display(analyze_children_survival(passangers_df_age, 12))
display(analyze_children_survival(passangers_df_age, 15))
display(analyze_children_survival(passangers_df_age, 16))

Child of <11 years


Pclass,Survived,Total,Percentage
1,3,4,75.00%
2,17,17,100.00%
3,19,47,40.43%


Child of <12 years


Pclass,Survived,Total,Percentage
1,3,4,75.00%
2,17,17,100.00%
3,20,48,41.67%


Child of <15 years


Pclass,Survived,Total,Percentage
1,5,6,83.33%
2,19,19,100.00%
3,25,58,43.10%


Child of <16 years


Pclass,Survived,Total,Percentage
1,8,9,88.89%
2,19,21,90.48%
3,28,70,40.00%


The trend holds for every cutoff age up to and including 15 and only weakens at 16 (the table titles print "<", but the filter is Age <= cutoff). Observing it therefore provides no additional support for 12 years as the cutoff age.

**Age limit for children - Kaggle**

Wikipedia also mentions the cost of the ticket. For Class I and II only the price range and the average price are reported. But a Class III ticket cost £3 for a child and £7 for an adult.

The ticket price often included the cost of a train ticket to the port, so the final price is slightly higher. I can use this information to estimate the cutoff age using the 'Fare' column. I would expect to see a sharp increase of mean and minimum ticket price above the cutoff age. 

In [96]:
def analyze_children_fare(df, max_age, pclass):
    # fares of passengers in the given class who are at most max_age years old
    selected = df[(df["Pclass"] == pclass) & (df["Age"] <= max_age)][["Age", "Fare"]]
    grouped = selected.groupby(["Age"])
    # per-age minimum, mean and maximum fare plus the number of passengers
    res = pd.concat([grouped.min(), grouped.mean(), grouped.max(), grouped.count()], axis=1)
    res.columns = ["Min", "Mean", "Max", "Count"]
    return res

def highligh_cheap_fare(val):
    if (7.0 <= val and val <=9.0):
        color = 'red' 
    else:
        color = 'black'
    return 'color: %s' % color

child_fare_df = analyze_children_fare(passangers_df_age,16,3)
#highlight
display(child_fare_df.style.applymap(highligh_cheap_fare, subset = ["Min","Mean"]))





Age,Min,Mean,Max,Count
0.42,8.5167,8.5167,8.5167,1
0.75,19.2583,19.2583,19.2583,2
1.0,11.1333,26.8075,46.9,5
2.0,10.4625,24.5446,39.6875,7
3.0,15.9,22.7875,31.3875,3
4.0,11.1333,21.6536,31.275,7
5.0,12.475,21.0403,31.3875,3
6.0,12.475,21.875,31.275,2
7.0,29.125,34.4062,39.6875,2
8.0,21.075,25.1,29.125,2


Indeed there are some passengers travelling on cheap adult tickets (highlighted in red). Surprisingly, fares for young children are much higher than the expected £3 and £7. It turns out that these passengers travel on group or family tickets, or for free if they are really small. The following view demonstrates the existence of group tickets.

In [97]:
#group tickets
display(passangers_df_age[passangers_df_age["Pclass"]== 3].groupby(["Ticket","Age","Fare","SibSp","Parch"]).size().iloc[:50])

Ticket   Age    Fare     SibSp  Parch
14973    22.00  8.0500   0      0        1
1601     26.00  56.4958  0      0        1
         28.00  56.4958  0      0        1
         32.00  56.4958  0      0        2
21440    51.00  8.0500   0      0        1
2223     18.00  8.3000   0      0        1
2620     22.00  7.2250   0      0        1
2623     40.00  7.2250   0      0        1
2625     0.42   8.5167   0      1        1
2627     17.00  14.4583  0      0        1
2628     45.50  7.2250   0      0        1
2648     20.00  4.0125   0      0        1
2650     29.00  15.2458  0      2        1
2651     12.00  11.2417  1      0        1
         14.00  11.2417  1      0        1
2653     1.00   15.7417  0      2        1
         20.00  15.7417  1      1        1
2659     15.00  14.4542  1      0        1
         27.00  14.4542  1      0        1
2663     20.00  7.2292   0      0        1
2665     14.50  14.4542  1      0        1
2666     0.75   19.2583  2      1        2
         5.00   
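Ticket "1601" above lists the same fare of 56.4958 on each of its rows, which suggests (but does not prove) that 'Fare' records the price of the whole ticket rather than of one passenger. Under that assumption, a per-person fare could be estimated by dividing by the number of rows sharing a ticket. The sketch below uses made-up toy rows, and the column names 'GroupSize' and 'FarePerPerson' are hypothetical additions.

```python
import pandas as pd

# Toy third-class rows (illustrative values, not taken from the dataset).
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "A1", "B2", "C3"],
    "Age": [30.0, 4.0, 2.0, 22.0, 13.0],
    "Fare": [21.0, 21.0, 21.0, 8.05, 7.25],
})

# Number of passengers sharing each ticket, aligned row by row.
toy["GroupSize"] = toy.groupby("Ticket")["Ticket"].transform("size")

# Assumption: 'Fare' is the price of the whole ticket, repeated on every row.
toy["FarePerPerson"] = toy["Fare"] / toy["GroupSize"]

print(toy)
```

Note that group sizes computed on a filtered frame, such as one with unknown ages removed, can undercount the passengers on a ticket.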

Let me consider only single tickets now.

In [98]:
#returns list of tickets associated with only one passanger
def filter_single_tickets(df):
    #Series with single ticket names
    grouped = df.groupby(["Ticket"]).size()
    single_t = grouped[grouped == 1].index.tolist()

    return df[df["Ticket"].isin(single_t)]

df_single_ticket_only = filter_single_tickets(passangers_df_age)
df_selected_age = df_single_ticket_only[(df_single_ticket_only["Age"] >=0.0) & (df_single_ticket_only["Age"] <=15)]
display(df_selected_age[["Ticket","Age","Pclass","Fare","SibSp","Parch"]].sort_values("Age").style.applymap(highligh_cheap_fare, subset = ["Fare"]))

Unnamed: 0,Ticket,Age,Pclass,Fare,SibSp,Parch
803,2625,0.42,3,8.5167,0,1
479,3101298,2.0,3,12.2875,0,1
184,315153,4.0,3,22.025,0,2
445,33638,4.0,1,81.8583,0,2
691,349256,4.0,3,13.4167,0,1
750,29103,4.0,2,23.0,1,1
852,2678,9.0,3,15.2458,1,1
780,2687,13.0,3,7.2292,0,0
14,350406,14.0,3,7.8542,0,0
111,2665,14.5,3,14.4542,1,0


It seems that a 13-year-old can already travel on her own (ticket "2687"). So 12 years indeed seems to be the right cutoff age.

Clearly there are some small children (tickets "3101298", "315153", "349256") who could not possibly have travelled on their own. They likely travelled with their parents (family-ticket fare, age <= 4, Parch >= 1, and children from Class III would likely not have had a nanny). But there is no record of their parents or of other passengers travelling on the same ticket. Therefore I suspect that the Kaggle dataset is incomplete and some passenger records are missing.

In [99]:
passangers_df_age[passangers_df_age["Ticket"].isin(["3101298","315153","349256"])]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
184,185,1,3,"Kink-Heilmann, Miss. Luise Gretchen",female,4.0,0,2,315153,22.025,,S
479,480,1,3,"Hirvonen, Miss. Hildur E",female,2.0,0,1,3101298,12.2875,,S
691,692,1,3,"Karun, Miss. Manca",female,4.0,0,1,349256,13.4167,,C
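Before concluding that records are missing, it may be worth ruling out that the companions were simply dropped together with the unknown-age rows, since the lookup above runs on `passangers_df_age`. A hedged sketch of a broader search, by shared ticket or shared surname, is shown below. The helper `find_companions` and the toy rows with invented names are hypothetical; in the notebook it would be called on the full `passangers_df`.

```python
import pandas as pd

# Toy stand-in for the FULL frame, including a row with unknown age.
toy = pd.DataFrame({
    "Name": ["Doe, Miss. Anna", "Doe, Mrs. Jane", "Roe, Mr. John", "Poe, Miss. Eva"],
    "Age": [4.0, None, 40.0, 2.0],
    "Ticket": ["111", "111", "222", "333"],
    "Parch": [1, 1, 0, 1],
})

def find_companions(df, row_label):
    """Other rows sharing the ticket or the surname of the given row."""
    # The surname is the part of 'Name' before the first comma.
    surname = df["Name"].str.split(",").str[0]
    same_ticket = df["Ticket"] == df.loc[row_label, "Ticket"]
    same_surname = surname == surname.loc[row_label]
    not_self = df.index != row_label
    return df[(same_ticket | same_surname) & not_self]

print(find_companions(toy, 0)["Name"].tolist())
print(find_companions(toy, 3)["Name"].tolist())
```

If such a search on the full frame still returns nothing for the three children, the suspicion of missing records is better supported.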


**Age limit for children - Conclusion**

**A passenger is considered a child if he or she is 12 or younger.**

I will add a new column 'ResCat' (Rescue Category), which takes the value 'child', 'female' or 'male' depending on the sex and age of the passenger.

In [100]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
passangers_df_age = passangers_df_age.assign(ResCat=passangers_df_age["Sex"])


In [118]:
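# passengers aged 12 or below become 'child'; everyone else keeps their sex
is_child = passangers_df_age["Age"] <= 12
passangers_df_age = passangers_df_age.assign(
    ResCat=passangers_df_age["ResCat"].where(~is_child, "child"))

display(passangers_df_age[["Age", "ResCat"]].head(10))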

Unnamed: 0,Age,ResCat
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male
6,54.0,male
7,2.0,child
8,27.0,female
9,14.0,female
10,4.0,child


In [124]:
 passangers_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
