# Surviving Titanic

## Introduction and initial question

Did a passenger's chance of surviving the shipwreck depend on their social class?
(We will use first, second, and third class tickets as a proxy for social class in the following analysis.)

## The dataset

The data comes from Kaggle's Titanic competition: https://www.kaggle.com/c/titanic/data

### Data Dictionary

|Variable|Definition|Key|
|--------|----------|---|
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex||
|Age|Age in years||
|sibsp|# of siblings / spouses aboard the Titanic||
|parch|# of parents / children aboard the Titanic||
|ticket|Ticket number||
|fare|Passenger fare||
|cabin|Cabin number||
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

### Variable Notes

* pclass: A proxy for socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

* age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

* sibsp: The dataset defines family relations in this way...
  * Sibling = brother, sister, stepbrother, stepsister
  * Spouse = husband, wife (mistresses and fiancés were ignored)

* parch: The dataset defines family relations in this way...
  * Parent = mother, father
  * Child = daughter, son, stepdaughter, stepson
*Some children travelled only with a nanny, therefore parch=0 for them.*

### My own analysis of the variable types

***NEED TO WRITE***

### Start by importing the necessary libraries

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

import matplotlib.pyplot as plt
import seaborn

%matplotlib inline

### Now read in the data

In [None]:
passengers = pd.read_csv('titanic-data.csv')

# Check info about the data we read
passengers.info()

We can see that the Age, Cabin, and Embarked columns have missing values.
We need to either estimate the missing values or drop them. The missing ages could be estimated, but I believe the 714 non-null values are enough for this analysis, so I choose to drop the rows where Age is missing.

In [None]:
passengers = passengers.dropna(subset=['Age'])
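As a cross-check on the counts reported by `info()`, the number of missing values per column can also be computed directly. The sketch below uses a tiny invented frame (`toy`, which is not the Titanic data) so that it runs on its own; on the real data the equivalent call would be `passengers.isnull().sum()`, run before any rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the passenger table (values are invented for illustration).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Survived': [0, 1, 1, 0],
})

# Number of missing values in each column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Dropping rows without an Age keeps only the rows where Age is known.
toy_with_age = toy.dropna(subset=['Age'])
print(len(toy_with_age))
```

Note that `dropna(subset=['Age'])` only looks at the Age column, so rows with a missing Cabin are kept as long as their Age is present.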

Cabin has too many missing values and, according to Wikipedia, is also biased towards first class passengers, so this column will be dropped for this analysis. It would be interesting to investigate whether cabin position influenced survival, but that is out of scope this time.

In [None]:
passengers = passengers.drop('Cabin', axis=1)

Embarked only had three missing values initially. However, it is hard to imagine a scenario where a passenger's port of embarkation would affect their chance of survival, so I choose to drop this column as well.

In [None]:
passengers = passengers.drop('Embarked', axis=1)

Further, some columns contain data that might be interesting (titles in Name, ticket price in Fare, etc.). With more analysis these might yield valuable information, but to keep the analysis simple I will drop them as well.

In [None]:
# Drop the columns we will not use in this analysis
passengers = passengers.drop(['Name', 'PassengerId', 'Fare'], axis=1)

Since SibSp/Parch and the groups formed by shared ticket numbers do not seem to be correlated, we drop these three columns (Ticket, SibSp, and Parch) and replace them with a single combined Relatives column.

In [None]:
passengers = passengers.drop('Ticket', axis=1)

passengers['Relatives'] = passengers.SibSp + passengers.Parch
passengers = passengers.drop('SibSp', axis=1)
passengers = passengers.drop('Parch', axis=1)

Lastly, to make the column names more consistent we rename "Pclass" to just "Class", and to better describe its contents we convert Survived from integers to booleans.

In [None]:
passengers['Survived'] = passengers['Survived'].astype(bool)
passengers = passengers.rename(columns={'Pclass': 'Class'})

This leaves us with the following data to work with.

In [None]:
passengers.head()

In [None]:
survived_by_class = passengers.groupby(['Survived','Class']).size().unstack()
survived_by_class.plot(kind='bar', stacked=True);
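The stacked bars show passenger counts, which makes it hard to read off the survival rate within each class. Because Survived is boolean, its mean within a group is the fraction of that group that survived. The following is a minimal sketch on a small invented frame (`toy`, not the Titanic data); on the real data the same expression would presumably be `passengers.groupby('Class')['Survived'].mean()`.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Class':    [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [True, True, False, True, False, True, False, False, False],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the survival rate within each class.
survival_rate_by_class = toy.groupby('Class')['Survived'].mean()
print(survival_rate_by_class)
```

Comparing rates rather than counts avoids being misled by the classes having different numbers of passengers.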

In [None]:
survived_by_sex = passengers.groupby(['Survived', 'Sex']).size().unstack()
survived_by_sex.plot(kind='bar', stacked=True);
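Class and sex can also be looked at together, to see whether a difference between classes holds for both men and women. The sketch below groups by both columns and unstacks Sex into columns, again on an invented frame (`toy`, not the Titanic data), so the numbers it prints say nothing about the real passengers.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Class':    [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male',
                 'female', 'female', 'male', 'male'],
    'Survived': [True, True, True, False, True, False, False, False],
})

# Survival rate for every (Class, Sex) combination:
# rows are classes, columns are sexes.
rate_by_class_and_sex = (
    toy.groupby(['Class', 'Sex'])['Survived'].mean().unstack()
)
print(rate_by_class_and_sex)
```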