# Surviving Titanic

## Introduction and initial question

Were there any factors that made it more likely to survive the Titanic shipwreck? Did it matter which class you were in, or whether you were a man or a woman?

## The dataset

Taken from Kaggle's competition, https://www.kaggle.com/c/titanic/data

### Data Dictionary

|Variable|Definition|Key|
|--------|----------|---|
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex||
|Age|Age in years||
|sibsp|# of siblings / spouses aboard the Titanic||
|parch|# of parents / children aboard the Titanic||
|ticket|Ticket number||
|fare|Passenger fare||
|cabin|Cabin number||
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

### Variable Notes

* pclass: A proxy for socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

* age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

* sibsp: The dataset defines family relations in this way...
  * Sibling = brother, sister, stepbrother, stepsister
  * Spouse = husband, wife (mistresses and fiancés were ignored)

* parch: The dataset defines family relations in this way...
  * Parent = mother, father
  * Child = daughter, son, stepdaughter, stepson
*Some children travelled only with a nanny, therefore parch=0 for them.*

### My own analysis of the variable types

***NEED TO WRITE***

### Start by Importing necessary libraries

We use NumPy for numerical operations, pandas for loading and manipulating the tabular data, and matplotlib together with seaborn for plotting.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

import matplotlib.pyplot as plt
import seaborn

%matplotlib inline

### Now read in the data

We start by reading in the data using the pandas CSV reader.

In [None]:
passengers = pd.read_csv('titanic-data.csv')

# Check info about the data we read
passengers.info()

We can see that the Age, Cabin, and Embarked columns have missing values.
We either need to estimate the missing values or drop them. We could estimate the missing ages, but for our analysis I believe the 714 non-null values are enough, so we choose to drop the rows where Age is missing.

In [None]:
passengers = passengers.dropna(subset=['Age'])
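
Before dropping anything, it can help to count the missing values per column. The following is a minimal sketch on a small made-up frame (the rows in `toy` are invented for illustration and are not taken from the Titanic data). It shows that `dropna(subset=['Age'])` removes only the rows where Age is missing and leaves missing values in other columns untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT real Titanic rows
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 35.0, np.nan, 4.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, 'G6'],
    'Sex':   ['male', 'female', 'female', 'male', 'female'],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop only the rows where Age is missing; other columns may still contain NaN
toy_with_age = toy.dropna(subset=['Age'])
print(len(toy), '->', len(toy_with_age))
```

Applied to `passengers`, the same `isnull().sum()` call should give the per-column counts that `info()` only shows indirectly.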

Cabin has too many missing values and, according to Wikipedia, is also biased towards first-class passengers, so this column is dropped for this analysis. It would be interesting to look into whether cabin position influenced survival, but that is out of scope this time.

In [None]:
passengers = passengers.drop('Cabin', axis=1)

Embarked only had two missing values initially. However, it is hard to imagine a scenario where the port of embarkation would affect a passenger's survival rate, so I choose to drop this column as well.

In [None]:
passengers = passengers.drop('Embarked', axis=1)

Further, we have some columns containing data that might be interesting (titles in Name, ticket price in Fare, etc.). With more analysis these might yield valuable information, but to keep our analysis simple, I will drop these as well.

In [None]:
passengers = passengers.drop('Name', axis=1)
passengers = passengers.drop('PassengerId', axis=1)
passengers = passengers.drop('Fare', axis=1)

To make some plotting easier, we add a column 'AgeGroup' that splits the passengers' ages into four groups of equal span.  
Also, upon further inspection, SibSp/Parch and the groups formed by ticket numbers do not seem to line up, which makes it hard to pinpoint which passengers are related to each other. Analysing this data would probably be too much work for too little return, so we drop these three columns.

In [None]:
passengers = passengers.drop('Ticket', axis=1)
passengers = passengers.drop('SibSp', axis=1)
passengers = passengers.drop('Parch', axis=1)
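
The cell above does not show how the 'AgeGroup' column is created. One way it could be done is with `pd.cut`, sketched below on made-up ages (the values in `toy` are invented, and the choice of `pd.cut` with `bins=4` is an assumption, not necessarily what the original analysis used).

```python
import pandas as pd

# Hypothetical ages, NOT real Titanic rows
toy = pd.DataFrame({'Age': [0.5, 19.0, 21.0, 40.0, 59.0, 80.0]})

# bins=4 asks pd.cut for four equal-width intervals spanning the
# minimum to the maximum of the Age column
toy['AgeGroup'] = pd.cut(toy['Age'], bins=4)

# Number of passengers that fall into each age group
group_sizes = toy['AgeGroup'].value_counts(sort=False)
print(group_sizes)
```

Because the interval edges are derived from the observed minimum and maximum, they will generally not be round numbers; passing an explicit list of edges to `bins` is an alternative if round boundaries are preferred.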

Lastly, to make the column names more coherent, we rename "Pclass" to just "Class". To better describe the contents, we also convert the Sex and Class columns to categories and the Survived column from integers to booleans.

In [None]:
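# Rename first so that the 'Class' column exists before we convert it
passengers = passengers.rename(columns={'Pclass': 'Class'})

passengers['Sex'] = passengers['Sex'].astype('category')
passengers['Class'] = passengers['Class'].astype('category')

# 0/1 integers become False/True
passengers['Survived'] = passengers['Survived'].astype(bool)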

This gives us something like the below to work with.

In [None]:
passengers.head()

### Descriptive stats

In [None]:
# include='all' also summarizes the categorical and boolean columns, not just Age
passengers.describe(include='all')

We can see here that third-class travelers make up the largest group in our sample and that a typical passenger was around 30 years old. Even after dropping the rows without an age, we still have 714 entries in our sample to work with.

Let us also plot some relations between class, sex, and age for the survivors and non-survivors in our data to see if we can find any interesting anomalies.

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(8, 4))

survived_by_class = passengers.groupby(['Survived','Class']).size().unstack()
survived_by_class.plot(kind='bar', stacked=True, ax=axs[0])
axs[0].set_title('1. Survivors by Class')

survived_by_sex = passengers.groupby(['Survived', 'Sex']).size().unstack()
survived_by_sex.plot(kind='bar', stacked=True, ax=axs[1])
axs[1].set_title('2. Survivors by Sex')

plt.show()

From the above plots we can see some interesting facts. For example, proportionally more women than men survived the accident. This could be partly influenced by the fact that there were many more men than women in third class, which also had a higher casualty rate.
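
The stacked bars show counts, so claims about proportions are easier to check with rates. A minimal sketch, using invented rows rather than the real data (`toy` and its values are hypothetical), of how the survival rate per sex and the share of each sex within each class could be computed:

```python
import pandas as pd

# Hypothetical toy data, NOT real Titanic rows
toy = pd.DataFrame({
    'Survived': [True, False, True, True, False, False, False, False],
    'Sex':      ['female', 'male', 'female', 'male',
                 'male', 'male', 'male', 'female'],
    'Class':    [1, 1, 2, 2, 3, 3, 3, 3],
})

# The mean of a boolean column is the share of True values,
# so this gives the survival rate per sex
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)

# Share of each sex within each class (every row sums to 1)
sex_share_by_class = pd.crosstab(toy['Class'], toy['Sex'], normalize='index')
print(sex_share_by_class)
```

Running the same two statements on `passengers` would put numbers on both observations made from the plots.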

We also tried plotting the survival rate for passengers of the same age in whole years, but too much noise made it hard to read anything from the plot. By binning the ages into intervals of 5 years, we can keep some of the noise down by increasing the number of samples per data point. From the second graph, it looks like there might be a weak relationship between higher age and increased mortality.

In [None]:
# Grouped by age in full years
survival_by_age = passengers.groupby(passengers['Age'].astype(int))['Survived']

# The mean of a boolean column is the share of survivors in each group
survival_rate_by_age = survival_by_age.mean()

plt.scatter(survival_rate_by_age.index, survival_rate_by_age);

In [None]:
# Group by age in intervals of 5 years (0-4, 5-9, ...)
survival_by_age_group = passengers.groupby((passengers['Age'] // 5 * 5).astype(int))['Survived']

survival_by_age_group.count()

In [None]:
# Share of survivors in each 5-year age group
survival_rate_by_age_group = survival_by_age_group.mean()

plt.scatter(survival_rate_by_age_group.index, survival_rate_by_age_group);

### Chi^2 test for relevance
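
This section is not written yet. As a rough sketch of what such a test could look like, the example below runs SciPy's `chi2_contingency` on a contingency table of Sex against Survived. The rows in `toy` are invented for illustration, and the choice of SciPy is an assumption rather than something the analysis above prescribes.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data, NOT real Titanic rows
toy = pd.DataFrame({
    'Sex':      ['female'] * 10 + ['male'] * 10,
    'Survived': [True] * 8 + [False] * 2 + [True] * 2 + [False] * 8,
})

# Contingency table of observed counts: rows are Sex, columns are Survived
observed = pd.crosstab(toy['Sex'], toy['Survived'])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the counts expected if Sex and Survived were independent.
# For a 2x2 table SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
```

A small p-value would suggest that survival and sex are not independent in the sample; the same call works for a Class against Survived table, which has two degrees of freedom.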

### Logistic regression for probabilities