Credits:
This code originated from [Donne Martin](http://donnemartin.com)'s work on the Kaggle ML competition. Much of the source and some of the text come from Donne's submission (under the Apache 2.0 license).

# Analyzing the passenger list from the Titanic

In this notebook, we will analyze some of the features that can be extracted from the public domain Titanic dataset.

The idea of this notebook is simply to explore the dataset to see if we can find some worthwhile trends.

Kaggle hosts a competition that uses ML on this data to predict the likelihood of survival. If you want to see this competition, check out the [competition site](https://www.kaggle.com/c/titanic-gettingStarted).


## Description

![alt text](http://upload.wikimedia.org/wikipedia/commons/6/6e/St%C3%B6wer_Titanic.jpg)

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Data Set

The dataset we'll provide can be found under `/data/titanic/all_passengers.csv`. 
We'll explore the columns below, but here is a short explanation:

| Column name | Meaning |
|-------------|---------|
| survived | Survival (0 = No; 1 = Yes) |
| pclass | Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd) |
| name | Name of passenger |
| sex | Gender ('male' or 'female') |
| age | Age in years; fractional if less than one. Estimated ages are given in the form xx.5 |
| sibsp | Number of siblings/spouses aboard |
| boat | The lifeboat on which the passenger was placed |
| parch | Number of parents/children aboard |
| ticket | Ticket number |
| fare | Passenger fare |
| cabin | Cabin |
| embarked | Port of embarkation |


## Setup Imports and Variables

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set the global default size of matplotlib figures
plt.rc('figure', figsize=(10, 5))

# Size of matplotlib figures that contain subplots
fizsize_with_subplots = (10, 10)

# Size of matplotlib histogram bins
bin_size = 10

## Explore the Data

Read the data:

In [2]:
passengers = pd.read_csv('/data/titanic/all_passengers.csv')

Let's take a look at the first 5 rows to explore the data:

In [5]:
passengers.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


We can also look at the last 5 rows:

In [6]:
passengers.tail()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


Let's also explore the name and type of the columns:

In [None]:
passengers.dtypes

The dtype 'object' is how pandas reports string columns here. Most machine learning algorithms and many analytics algorithms cannot use such columns directly.

It is worth noting which columns these are, as we'll need to convert them to numeric representations if we want to explore them as features.
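
As a rough sketch of how to list those columns programmatically, the snippet below builds a tiny made-up frame (not the real dataset) and asks pandas for every column that is not numeric. The `toy` frame and its values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the passenger data; the values are made up.
toy = pd.DataFrame({
    'pclass': [1, 3, 2],
    'sex': ['female', 'male', 'male'],
    'age': [29.0, None, 41.0],
    'embarked': ['S', 'C', 'S'],
})

# Non-numeric columns are the ones that need encoding before modelling.
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_cols)
```

The same call on `passengers` should give the list of columns to encode or drop.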

Get some basic information on the DataFrame:

In [None]:
passengers.info()

Age, Cabin, and Embarked have missing values. Cabin is missing too many values to be useful as is, whereas we might be able to infer values for Age and Embarked.
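
To put numbers on "too many", one option is to count nulls per column. The sketch below does this on a small made-up frame; the values are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (made-up values).
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, np.nan],
    'cabin': ['B5', None, None, None],
    'embarked': ['S', 'C', None, 'S'],
})

# Number of missing values per column, and the fraction of rows affected.
missing_counts = toy.isnull().sum()
missing_share = toy.isnull().mean()
print(missing_counts)
print(missing_share)
```

Running the same two lines on `passengers` gives a per-column view that is easier to compare than the `info()` output.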

Generate various descriptive statistics on the DataFrame:

In [None]:
passengers.describe()

Now that we have a general idea of the dataset contents, we can dive deeper into each column. We'll do exploratory data analysis and clean the data to set up the 'features' we'll use in our machine learning algorithms.

Plot a few features to get a better idea of each:

In [None]:
# Set up a grid of plots
fig = plt.figure(figsize=fizsize_with_subplots) 
fig_dims = (3, 2)

# Plot death and survival counts
plt.subplot2grid(fig_dims, (0, 0))
passengers['survived'].value_counts().plot(kind='bar', 
                                         title='Death and Survival Counts')

# Plot Pclass counts
plt.subplot2grid(fig_dims, (0, 1))
passengers['pclass'].value_counts().plot(kind='bar', 
                                       title='Passenger Class Counts')

# Plot Sex counts
plt.subplot2grid(fig_dims, (1, 0))
passengers['sex'].value_counts().plot(kind='bar', 
                                    title='Gender Counts')
plt.xticks(rotation=0)

# Plot Embarked counts
plt.subplot2grid(fig_dims, (1, 1))
passengers['embarked'].value_counts().plot(kind='bar', 
                                         title='Ports of Embarkation Counts')

# Plot the Age histogram
plt.subplot2grid(fig_dims, (2, 0))
passengers['age'].hist()
plt.title('Age Histogram')

Next we'll explore various features to view their impact on survival rates.

## Feature: Passenger Classes

From our exploratory data analysis in the previous section, we see there are three passenger classes: First, Second, and Third class. We'll determine what proportion of passengers survived in each passenger class.

Generate a cross tab of Pclass and Survived:

In [None]:
pclass_xt = pd.crosstab(passengers['pclass'], passengers['survived'])
pclass_xt

Plot the cross tab:

In [None]:
# Normalize the cross tab to sum to 1:
pclass_xt_pct = pclass_xt.div(pclass_xt.sum(1).astype(float), axis=0)

pclass_xt_pct.plot(kind='bar', 
                   stacked=True, 
                   title='Survival Rate by Passenger Classes')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')

We can see that passenger class seems to have a significant impact on whether a passenger survived. Those in First Class had the highest chance of survival.
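
The row-wise normalization used for the plot can also be requested from `pd.crosstab` directly through its `normalize` argument. The sketch below compares both routes on made-up data; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Made-up class and survival values, only to illustrate the mechanics.
toy = pd.DataFrame({
    'pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Manual route: divide each row of counts by its row total.
manual = pd.crosstab(toy['pclass'], toy['survived'])
manual = manual.div(manual.sum(axis=1), axis=0)

# Built-in route: normalize='index' makes every row sum to 1.
built_in = pd.crosstab(toy['pclass'], toy['survived'], normalize='index')
print(built_in)
```

Column `1` of the normalized table is the survival rate per class.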

## Feature: Sex

Gender might also have played a role in determining a passenger's survival rate. We'll need to map Sex from a string to a number to prepare it for machine learning algorithms.

Generate a mapping of Sex from a string to a number representation:

In [None]:
sexes = sorted(passengers['sex'].unique())
genders_mapping = dict(zip(sexes, range(len(sexes))))
genders_mapping

Transform Sex from a string to a number representation:

In [None]:
passengers['Sex_Val'] = passengers['sex'].map(genders_mapping).astype(int)
passengers.head()
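
To make the resulting codes explicit, here is the same mapping applied to a four-row made-up column. Because the labels are sorted alphabetically, 'female' becomes 0 and 'male' becomes 1. The `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Made-up column holding the two labels that appear in the dataset.
toy = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# Sort the labels so the codes are stable from run to run.
sexes = sorted(toy['sex'].unique())
genders_mapping = {label: code for code, label in enumerate(sexes)}

toy['Sex_Val'] = toy['sex'].map(genders_mapping).astype(int)
print(genders_mapping)
print(toy)
```

Later filters on `Sex_Val` rely on this ordering, so it is worth checking the mapping once.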

Plot a normalized cross tab for Sex_Val and Survived:

In [None]:
sex_val_xt = pd.crosstab(passengers['Sex_Val'], passengers['survived'])
sex_val_xt_pct = sex_val_xt.div(sex_val_xt.sum(1).astype(float), axis=0)
sex_val_xt_pct.plot(kind='bar', stacked=True, title='Survival Rate by Gender')

The majority of females survived, whereas the majority of males did not.

Next we'll determine whether we can gain any insights on survival rate by looking at both Sex and Pclass.

Count males and females in each Pclass:

In [None]:
# Get the unique values of Pclass:
passenger_classes = sorted(passengers['pclass'].unique())

for p_class in passenger_classes:
    print( 'M: ', p_class, len(passengers[(passengers['sex'] == 'male') & 
                             (passengers['pclass'] == p_class)]))
    print( 'F: ', p_class, len(passengers[(passengers['sex'] == 'female') & 
                             (passengers['pclass'] == p_class)]))

Plot survival rate by Sex and Pclass:

In [None]:
# Plot female survival rate by Pclass
females_df = passengers[passengers['sex'] == 'female']
females_xt = pd.crosstab(females_df['pclass'], females_df['survived'])
females_xt_pct = females_xt.div(females_xt.sum(1).astype(float), axis=0)
females_xt_pct.plot(kind='bar', 
                    stacked=True, 
                    title='Female Survival Rate by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')

# Plot male survival rate by Pclass
males_df = passengers[passengers['sex'] == 'male']
males_xt = pd.crosstab(males_df['pclass'], males_df['survived'])
males_xt_pct = males_xt.div(males_xt.sum(1).astype(float), axis=0)
males_xt_pct.plot(kind='bar', 
                  stacked=True, 
                  title='Male Survival Rate by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')

The vast majority of females in First and Second class survived. Among males, those in First class had the highest chance of survival.
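
Because `survived` is coded 0/1, its mean within a group is that group's survival rate. The sketch below tabulates the rate by sex and class in one step on made-up data; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'female'],
    'pclass':   [1, 1, 3, 1, 1, 3, 3, 3],
    'survived': [1, 1, 0, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group = survival rate; unstack puts classes in columns.
rates = toy.groupby(['sex', 'pclass'])['survived'].mean().unstack('pclass')
print(rates)
```

This gives the numbers behind the two stacked bar charts in a single table.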

## Feature: Age

The Age column seems like an important feature; unfortunately, it is missing many values. We'll need to fill in the missing values before we can use it.

Filter to view missing Age values:

In [None]:
passengers[passengers['age'].isnull()][['sex', 'pclass', 'age']].head()

Determine the Age typical for each passenger class by Sex_Val.  We'll use the median instead of the mean because the Age histogram seems to be right skewed.

(Note that, depending on your pandas version, the code below may generate a FutureWarning. You can ignore that, at least for now.)

In [None]:
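# To keep age intact, make a copy of it called AgeFill
# that we will use to fill in the missing ages:
passengers['AgeFill'] = passengers['age']

# Populate AgeFill with the median age of each (Sex_Val, pclass) group.
# transform() returns a result aligned to the original index, so the
# fill works the same way across pandas versions.
group_median_age = passengers.groupby(['Sex_Val', 'pclass'])['age'].transform('median')
passengers['AgeFill'] = passengers['AgeFill'].fillna(group_median_age)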

Ensure AgeFill does not contain any missing values:

In [None]:
len(passengers[passengers['AgeFill'].isnull()])
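
To see what a group-wise median fill produces, here is a minimal sketch on six made-up rows with two groups. Each missing age should receive the median of its own group. The `toy` frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Two groups, each with one missing age (made-up values).
toy = pd.DataFrame({
    'Sex_Val': [0, 0, 0, 1, 1, 1],
    'pclass':  [1, 1, 1, 3, 3, 3],
    'age':     [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# Median age per (Sex_Val, pclass) group, broadcast back to every row.
group_median = toy.groupby(['Sex_Val', 'pclass'])['age'].transform('median')

# Only the missing entries are replaced; known ages stay as they are.
toy['AgeFill'] = toy['age'].fillna(group_median)
print(toy)
```

The first group's gap is filled with 30.0 and the second group's with 20.0, while the original `age` column is left untouched.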

Plot a histogram of AgeFill segmented by Survived, and a scatter plot of Survived against AgeFill:

In [None]:
# Set up a grid of plots
fig, axes = plt.subplots(2, 1, figsize=fizsize_with_subplots)

# Histogram of AgeFill segmented by Survived
df1 = passengers[passengers['survived'] == 0]['AgeFill']
df2 = passengers[passengers['survived'] == 1]['AgeFill']
max_age = passengers['AgeFill'].max()
axes[0].hist([df1, df2], 
             bins=int(max_age / bin_size), 
             range=(1, max_age), 
             stacked=True)
axes[0].legend(('Died', 'Survived'), loc='best')
axes[0].set_title('Survivors by Age Groups Histogram')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Count')

# Scatter plot Survived and AgeFill
axes[1].scatter(passengers['survived'], passengers['AgeFill'])
axes[1].set_title('Survivors by Age Plot')
axes[1].set_xlabel('Survived')
axes[1].set_ylabel('Age')

Unfortunately, the graphs above do not seem to clearly show any insights.  We'll keep digging further.

Plot AgeFill density by Pclass:

In [None]:
for pclass in passenger_classes:
    passengers.AgeFill[passengers.pclass == pclass].plot(kind='kde')
plt.title('Age Density Plot by Passenger Class')
plt.xlabel('Age')
plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')

When looking at AgeFill density by Pclass, we see that first class passengers were generally older than second class passengers, who in turn were older than third class passengers. We've also determined that first class passengers had a higher survival rate than second class passengers, who in turn had a higher survival rate than third class passengers. Any apparent relationship between age and survival may therefore partly reflect passenger class.

In [None]:
# Set up a grid of plots
fig = plt.figure(figsize=fizsize_with_subplots) 
fig_dims = (3, 1)

# Plot the AgeFill histogram for Survivors
plt.subplot2grid(fig_dims, (0, 0))
survived_df = passengers[passengers['survived'] == 1]
survived_df['AgeFill'].hist(bins=int(max_age / bin_size), range=(1, max_age))

# Plot the AgeFill histogram for female survivors
plt.subplot2grid(fig_dims, (1, 0))
females_df = passengers[(passengers['Sex_Val'] == 0) & (passengers['survived'] == 1)]
females_df['AgeFill'].hist(bins=int(max_age / bin_size), range=(1, max_age))

# Plot the AgeFill histogram for first class survivors
plt.subplot2grid(fig_dims, (2, 0))
class1_df = passengers[(passengers['pclass'] == 1) & (passengers['survived'] == 1)]
class1_df['AgeFill'].hist(bins=int(max_age / bin_size), range=(1, max_age))

In the first graph, we see that most survivors come from the 20s to 30s age ranges, which might be explained by the following two graphs. The second graph shows that most female survivors were in their 20s. The third graph shows that most first class survivors were in their 30s.
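
Counts of survivors per age range depend on how many passengers of that age were aboard, so a survival rate per age band can be a useful complement. The sketch below bins made-up ages into decades with `pd.cut` and averages `survived` within each band. The `toy` frame, the `AgeBand` column name, and the bin edges are assumptions for illustration.

```python
import pandas as pd

# Made-up ages and outcomes, only to show the mechanics.
toy = pd.DataFrame({
    'AgeFill':  [4, 8, 22, 25, 28, 35, 38, 61],
    'survived': [1, 1, 0, 1, 0, 1, 0, 0],
})

bin_size = 10
edges = list(range(0, 80, bin_size))  # 0, 10, ..., 70

# right=False gives left-closed bands such as [20, 30).
toy['AgeBand'] = pd.cut(toy['AgeFill'], bins=edges, right=False)

# Mean of the 0/1 outcome per band = survival rate for that band.
rate_by_band = toy.groupby('AgeBand', observed=True)['survived'].mean()
print(rate_by_band)
```

On the real data, bands with few passengers will give noisy rates, so it helps to read them next to the counts.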

## Feature: Family Size

Feature engineering involves creating new features or modifying existing features in ways that might be advantageous to a machine learning algorithm.

Define a new feature FamilySize that is the sum of Parch (number of parents or children on board) and SibSp (number of siblings or spouses):

In [None]:
passengers['FamilySize'] = passengers['sibsp'] + passengers['parch']
passengers.head()

Plot a histogram of FamilySize:

In [None]:
passengers['FamilySize'].hist()
plt.title('Family Size Histogram')

Plot a histogram of FamilySize segmented by Survived:

In [None]:
# Get the unique values of FamilySize and its maximum
family_sizes = sorted(passengers['FamilySize'].unique())
family_size_max = max(family_sizes)

df1 = passengers[passengers['survived'] == 0]['FamilySize']
df2 = passengers[passengers['survived'] == 1]['FamilySize']
plt.hist([df1, df2], 
         bins=family_size_max + 1, 
         range=(0, family_size_max), 
         stacked=True)
plt.legend(('Died', 'Survived'), loc='best')
plt.title('Survivors by Family Size')

Based on the histograms, it is not immediately obvious what impact FamilySize has on survival.  The machine learning algorithms might benefit from this feature.
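
When a stacked histogram is hard to read, a small table of group size and survival rate per FamilySize value can help. The sketch below computes one on made-up rows; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up sibling/spouse and parent/child counts with outcomes.
toy = pd.DataFrame({
    'sibsp':    [0, 1, 1, 0, 4, 0],
    'parch':    [0, 0, 2, 0, 2, 0],
    'survived': [0, 1, 1, 1, 0, 0],
})

# Same definition as in the notebook: relatives aboard, excluding the passenger.
toy['FamilySize'] = toy['sibsp'] + toy['parch']

# 'count' is the number of passengers per size, 'mean' their survival rate.
summary = toy.groupby('FamilySize')['survived'].agg(['count', 'mean'])
print(summary)
```

Keeping the count next to the rate makes it clear when a rate rests on only a handful of passengers.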

Additional features we might want to engineer could be derived from the Name column. For example, honorary or ordinary titles might give clues and better predictive power for a male's survival.
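
A minimal sketch of such a title feature, assuming names follow the "Surname, Title. Given names" pattern seen in the first rows of the dataset. The regular expression and the `titles` variable are illustrative assumptions and may need adjusting for unusual names.

```python
import pandas as pd

# Names copied from the first rows shown by passengers.head().
names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Allison, Mr. Hudson Joshua Creighton',
    'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```

Applied to `passengers['name']`, the result could be inspected with `value_counts()` before deciding how to group rare titles.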