# Women and children first?

## Preliminaries

In [31]:
# Run this cell to start.
import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Background

We are going to look at the details of who was lost, and who survived, in the sinking of the RMS Titanic.

We first read the dataset containing information about the passengers and crew
who were on the RMS Titanic when it sank.

The data file is `titanic_stlearn.csv`.

See the [Titanic dataset page](https://github.com/matthew-brett/datasets/tree/master/titanic) for more detail.

You might also want to look at [Encyclopedia
Titanica](https://www.encyclopedia-titanica.org/titanic-statistics.html) for
more background.

In [2]:
titanic = pd.read_csv('titanic_stlearn.csv')
titanic.head()

,name,gender,age,class,embarked,country,ticketno,fare,sibsp,parch,survived
0,"Abbing, Mr. Anthony",male,42.0,3rd,Southampton,United States,5547.0,7.11,0.0,0.0,no
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,Southampton,United States,2673.0,20.05,0.0,2.0,no
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,Southampton,United States,2673.0,20.05,1.0,1.0,no
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,Southampton,England,2673.0,20.05,1.0,1.0,yes
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,Southampton,Norway,348125.0,7.13,0.0,0.0,yes


This data file contains the following columns:

* `name`: a string with the name of the passenger.
* `gender`: a string with one of two labels: "male" and "female".
* `age`: a numeric value with the person's age on the day of the sinking. The
  age of babies (under 12 months) is given as a fraction of one year, rounded
  to the nearest month (2 months = 2/12 = 0.1667).
* `class`: a string specifying the class for passengers: "1st", "2nd", "3rd";
  or the type of service aboard for crew members. See below for discussion of
  passengers, crew and the crew service types.
* `embarked`: a string with the person's port of embarkation, one of:
  "Belfast", "Cherbourg", "Queenstown" or "Southampton".
* `country`: a string with the person's home country.
* `ticketno`: a numeric value specifying the person's ticket number (NA for crew
  members, also see below).
* `fare`: a numeric value with the ticket price (NA for crew members, musicians
  and employees of the shipyard company, also see below).
* `sibsp`: an integer specifying the number of siblings/spouses aboard; adopted
  from Vanderbilt data set (see below).  Always NA for crew, sometimes NA for
  passengers.
* `parch`: an ordered factor specifying the number of parents/children aboard;
  adopted from Vanderbilt data set (see below).  Always NA for crew, sometimes
  NA for passengers.
* `survived`: a string with one of two labels: "no" and "yes". It specifies
  whether the person survived the sinking.

## Women and children first

The RMS Titanic sank on 15th April 1912. A standard rule of evacuation at the
time was [Women and Children
First](https://en.wikipedia.org/wiki/Women_and_children_first).  Wikipedia
claims that the original suggestion for this rule was from a French passenger
of a ship in danger, in 1840.

How strictly was that rule applied in the evacuation of the Titanic?

In [4]:
gender_by_survived = pd.crosstab(titanic['gender'], titanic['survived'])
# Show the table in the notebook
gender_by_survived

survived,no,yes
gender,,
female,130,359
male,1366,352


These counts are useful, but even more useful would be the *proportions* of
women and of men who were lost and who survived.

In [6]:
gender_by_survived_p = pd.crosstab(titanic['gender'], titanic['survived'], normalize = 'index')

# Show the table in the notebook
gender_by_survived_p

survived,no,yes
gender,,
female,0.265849,0.734151
male,0.795111,0.204889
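
As a check on what `normalize='index'` does, the sketch below rebuilds the proportions by hand from the counts in the earlier table: each count is divided by its row total. The names `counts`, `row_totals` and `proportions` are introduced here only for illustration.

```python
import pandas as pd

# Counts copied from the gender-by-survived table above.
counts = pd.DataFrame({'no': [130, 1366], 'yes': [359, 352]},
                      index=['female', 'male'])

# Total number of people in each row (each gender).
row_totals = counts.sum(axis='columns')

# Divide every count by the total for its row.
proportions = counts.div(row_totals, axis='index')
print(proportions.round(6))
```

Each row of `proportions` sums to 1, and the values should match the `normalize='index'` table above.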


This should look like pretty convincing evidence that the crew largely followed
the "women" part of the "Women and children first" rule.  Next we investigate
the "children" part.

We need a Series that allows us to categorize the passenger as a `male`, a
`female` or a `child`.

First we make a new series `mwc` (Man Woman Child) that has a copy of the data
from the `gender` column.

In [32]:
mwc = titanic['gender'].copy()
mwc.head()

0      male
1      male
2      male
3    female
4    female
Name: gender, dtype: object

In [33]:
is_child = titanic['age'] < 15
mwc[is_child] = 'child'
# Show the unique values and counts for the "mwc" Series.
mwc.value_counts()

male      1651
female     432
child      124
Name: gender, dtype: int64
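
The boolean Series `is_child` acts as a mask: the assignment changes only the elements where the mask is True. The sketch below shows this on a small made-up Series (the values are invented for illustration). It also shows one thing to be aware of: a missing age compares as False in `age < 15`, so a person with unknown age keeps their gender label.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only.
gender = pd.Series(['male', 'female', 'male', 'female'])
age = pd.Series([42.0, 9.0, np.nan, 30.0])

labels = gender.copy()
is_child = age < 15          # NaN < 15 evaluates to False
labels[is_child] = 'child'   # Only the second element changes
print(labels.tolist())
```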

Create a cross-tabulation data frame called `mwc_by_survived_p` that has the
proportions of children, females and males that were saved and lost. 

In [34]:
new_titanic = pd.DataFrame()
new_titanic['gender'] = titanic['gender'].copy()
new_titanic['survived'] = titanic['survived'].copy()

In [35]:
new_titanic['gender'] = mwc

In [36]:
mwc_by_survived_p = pd.crosstab(new_titanic['gender'], new_titanic['survived'], normalize = 'index')
mwc_by_survived_p

survived,no,yes
gender,,
child,0.475806,0.524194
female,0.243056,0.756944
male,0.806784,0.193216
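
`pd.crosstab` accepts Series directly, so the intermediate data frame is optional. A minimal sketch, using made-up labels in place of `mwc` and `titanic['survived']`:

```python
import pandas as pd

# Made-up labels standing in for `mwc` and `titanic['survived']`.
group = pd.Series(['child', 'female', 'male', 'male', 'female', 'child'],
                  name='group')
survived = pd.Series(['yes', 'yes', 'no', 'no', 'no', 'no'],
                     name='survived')

# Rows are labelled by `group`, columns by `survived`;
# normalize='index' makes each row sum to 1.
by_survived_p = pd.crosstab(group, survived, normalize='index')
print(by_survived_p)
```

With the real data this would be `pd.crosstab(mwc, titanic['survived'], normalize='index')`; the row label would still read `gender`, because `mwc` keeps the name of the column it was copied from.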


## Being at the front

Another well-known factor is that passengers in higher classes were more
likely to survive.

The problem we have at the moment is that the `class` column in this dataset is a mix of things:

In [37]:
titanic['class'].value_counts()

3rd                 709
victualling crew    431
1st                 324
engineering crew    324
2nd                 284
restaurant staff     69
deck crew            66
Name: class, dtype: int64

The `class` column contains "1st", "2nd", "3rd" for some people, but it has job
titles for others, such as "deck crew".

Worse than that, some of the people in "1st" and "2nd" class were closer to
being crew than passengers.  For example, there were [8
musicians](https://en.wikipedia.org/wiki/Musicians_of_the_RMS_Titanic), who
were all listed as "2nd" class passengers. There were [9 members of the Guarantee
Group](https://en.wikipedia.org/wiki/Crew_of_the_RMS_Titanic#Guarantee_group)
on board, whose job was to monitor the ship and fix any problems that arose on
her maiden voyage.  They also have passenger classes listed as "1st" or "2nd".

We would like to be able to classify the people (rows) in the dataset as one of the following:

* Genuine First class passenger: "1st".
* Genuine Second class passenger: "2nd".
* Genuine Third class passenger: "3rd".
* Musician: "musician".
* Guarantee group: "guarantee".
* Deck crew: "deck".
* Engineering crew: "engineering".
* Victualling crew or restaurant staff: "catering".

That is, we need a new Series, maybe called `roles`, with one element per row
in the dataset, that has one of these string labels, classifying the person in
the corresponding row. For example, the first five people in the dataset are
genuine Third Class passengers, so the first five elements in `roles` would be
"3rd".

Much of the information we need is in the `class` column of `titanic` - but we
have more work to do, especially for the musicians and the guarantee group.

One way of doing this task is to use a *recoding function*.  You saw one of
these in action in your "stop and search" homework.  In the homework, the
function was applied to a Series (and therefore, a column of a data frame), and,
when used with `apply`, returned a Series.

Here we need to use information from multiple columns in the person's row to
classify them, so we need to take a different approach.   We need to `apply` a
function to the whole data frame, to return our new Series `roles`.

Here is an example of how to do this.  The function below is a *row recoding
function*.  It accepts a *row* as its argument, and returns a value.

In this case, the function returns "adult" for a row where the person's age was
15 or more, and otherwise (for persons with age < 15) returns "female child"
for "female" persons and "male child" otherwise.

In [38]:
def classify_mf_child(row):
    if row.loc['age'] >= 15:
        return 'adult'
    if row.loc['gender'] == 'female':
        return 'female child'
    return 'male child'

To see the function in action, let's classify the first row of `titanic`:

In [39]:
classify_mf_child(titanic.iloc[0])

'adult'

Classify the second row:

In [40]:
classify_mf_child(titanic.iloc[1])

'male child'

Then we can `apply` this function to the whole data frame, to return a classification for each row in the data frame:

In [41]:
mf_child = titanic.apply(classify_mf_child, axis='columns')
mf_child.head()

0         adult
1    male child
2         adult
3         adult
4         adult
dtype: object
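
To make the mechanics concrete, the sketch below applies the same row function to a small made-up data frame (the four rows are invented for illustration). With `axis='columns'`, Pandas passes each row to the function as a Series indexed by column name, and collects the returned values into a new Series with the original row labels.

```python
import numpy as np
import pandas as pd

# Tiny made-up data frame, for illustration only.
people = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'male'],
    'age': [42.0, 9.0, 13.0, np.nan],
})

def classify_mf_child(row):
    # Same logic as the function above.
    if row.loc['age'] >= 15:
        return 'adult'
    if row.loc['gender'] == 'female':
        return 'female child'
    return 'male child'

labels = people.apply(classify_mf_child, axis='columns')
print(labels.tolist())
```

The last made-up row has a missing age. Because `NaN >= 15` is False, this function labels that row 'male child', so it is worth checking for missing ages before relying on a classification like this.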

In [42]:
musician_ticket = titanic['ticketno'] == 250654.0
titanic[musician_ticket]

,name,gender,age,class,embarked,country,ticketno,fare,sibsp,parch,survived
144,"Brailey, Mr. William Theodore Ronald",male,24.0,2nd,Southampton,England,250654.0,,,,no
150,"Bricoux, Mr. Roger Marie",male,20.0,2nd,Southampton,France,250654.0,,,,no
237,"Clarke, Mr. John Frederick Preston",male,28.0,2nd,Southampton,England,250654.0,,,,no
516,"Hartley, Mr. Wallace Henry",male,33.0,2nd,Southampton,England,250654.0,,,,no
576,"Hume, Mr. John Law",male,21.0,2nd,Southampton,,250654.0,,,,no
680,"Krins, Mr. Georges Alexandre",male,23.0,2nd,Southampton,England,250654.0,,,,no
1189,"Taylor, Mr. Percy Cornelius",male,40.0,2nd,Southampton,England,250654.0,,,,no
1304,"Woodward, Mr. John Wesley",male,32.0,2nd,Southampton,England,250654.0,,,,no


In [23]:
def classify_role(row):
    # The eight musicians all travelled on one shared ticket.
    if row.loc['ticketno'] == 250654.0:
        return 'musician'
    # The nine members of the Guarantee Group, identified by name.
    guarantee_names = [
        'Campbell, Mr. William Henry',
        'Cunningham, Mr. Alfred Fleming',
        'Parkes, Mr. Francis',
        'Andrews, Mr. Thomas',
        'Chisholm, Mr. Roderick Robert Crispin',
        'Frost, Mr. Anthony Wood',
        'Knight, Mr. Robert',
        'Parr, Mr. William Henry Marsh',
        'Watson, Mr. Ennis Hastings',
    ]
    if row.loc['name'] in guarantee_names:
        return 'guarantee'
    # Everyone else is classified from the `class` column.
    class_to_role = {
        '3rd': '3rd',
        '2nd': '2nd',
        '1st': '1st',
        'deck crew': 'deck',
        'victualling crew': 'catering',
        'restaurant staff': 'catering',
        'engineering crew': 'engineering',
    }
    # Gives None for any `class` value not listed above.
    return class_to_role.get(row.loc['class'])
        

`apply` your function to the `titanic` data frame to make a new Series, then
use this Series to create a new data frame `role_by_survived_p` that is a
cross-tabulation of the *proportion* of *males* with each role who
survived or were lost. For example, `role_by_survived_p` will have a row
corresponding to "catering", with two values, where one value will be the
proportion of *male* catering staff that survived, and the other will be the
proportion of male catering staff that were lost.

In [26]:
titanic_roles = titanic.apply(classify_role, axis='columns')
titanic_roles

0               3rd
1               3rd
2               3rd
3               3rd
4               3rd
           ...     
2202           deck
2203       catering
2204    engineering
2205       catering
2206       catering
Length: 2207, dtype: object
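
A row that matches none of the cases in `classify_role` ends up with the value `None`. A quick way to check for that is to count missing values in the result. The sketch below uses a hypothetical cut-down function, `classify_simple`, and made-up rows to show the idea.

```python
import pandas as pd

# Made-up rows; 'stowaway' is an invented label not handled below.
people = pd.DataFrame({'class': ['1st', 'deck crew', 'stowaway']})

def classify_simple(row):
    # Hypothetical cut-down version of a row recoding function.
    if row.loc['class'] == '1st':
        return '1st'
    if row.loc['class'] == 'deck crew':
        return 'deck'
    # No return here: unmatched rows get None.

roles = people.apply(classify_simple, axis='columns')

# Count the rows that were left unclassified.
n_unclassified = roles.isna().sum()
print(n_unclassified)
```

For the real data, `titanic_roles.isna().sum()` should be 0 if every row was classified.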

In [27]:
males = titanic['gender'] == 'male'
males_df = titanic[males]

In [28]:
male_roles_df = pd.DataFrame()
male_roles_df['gender'] = males_df['gender']
male_roles_df['class'] = titanic_roles
male_roles_df['survived'] = males_df['survived']
male_roles_df

,gender,class,survived
0,male,3rd,no
1,male,3rd,no
2,male,3rd,no
5,male,3rd,yes
6,male,2nd,no
...,...,...,...
2202,male,deck,yes
2203,male,catering,yes
2204,male,engineering,no
2205,male,catering,no
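
Note that `titanic_roles` has one element for every person, while `male_roles_df` has rows only for men. The assignment still works because Pandas aligns a Series to the data frame's index on assignment, keeping only the matching row labels. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values, for illustration only.
roles = pd.Series(['3rd', '1st', 'deck', '2nd'], index=[0, 1, 2, 3])
gender = pd.Series(['male', 'female', 'male', 'female'], index=[0, 1, 2, 3])

males_only = pd.DataFrame()
# The first column sets the index of the new data frame: rows 0 and 2.
males_only['gender'] = gender[gender == 'male']
# `roles` has four elements; only those labelled 0 and 2 are kept.
males_only['role'] = roles
print(males_only)
```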


In [29]:
role_by_survived_p = pd.crosstab(male_roles_df['class'], male_roles_df['survived'], normalize = 'index')
role_by_survived_p

survived,no,yes
class,,
1st,0.649718,0.350282
2nd,0.853659,0.146341
3rd,0.84787,0.15213
catering,0.838574,0.161426
deck,0.348485,0.651515
engineering,0.780864,0.219136
guarantee,1.0,0.0
musician,1.0,0.0


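You can see that, among men, 1st class passengers had a much higher survival rate (about 35%) than 2nd and 3rd class passengers (about 15% each). Male deck crew had the highest survival rate of all (about 65%), and none of the musicians or the Guarantee Group survived.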