# Analyzing the Titanic Data

## Introductory words

This **project** is part of the **[Udacity data analyst nanodegree](https://www.udacity.com/course/data-analyst-nanodegree--nd002)**. This analysis is done by **[Guillaume Simler](https://github.com/guillaumesimler)** as part of his nanodegree's graduation.

For more information, please have a look at the related **[GitHub repo](https://github.com/guillaumesimler/nanodap1)**.

## Some discussions about the facts

Everybody knows the story of the RMS Titanic, the "unsinkable" ocean liner that sank, and her catastrophic end. If you do not, please have a look at the [Wikipedia page](https://en.wikipedia.org/wiki/RMS_Titanic).

As she was considered unsinkable, there was supposedly no need for lifeboat capacity matching at least the number of passengers and crew. This error proved **fatal**.
Actually, you would need far more capacity, as the sinking of the [Costa Concordia](https://en.wikipedia.org/wiki/Costa_Concordia_disaster) showed.

One last thing about the Titanic:
> *Built by Irishmen. Sunk by Englishmen*

## 1. Loading modules & files 

In [None]:
# Import Modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import Data 

passenger_df = pd.read_csv('titanic-data.csv')

#### Testing the data loading

In [None]:
passenger_df.head(7)

In [None]:
# Number of records in the data set

print(len(passenger_df.index))

## 2. Data wrangling: First Analysis

A first look at the data raises several points to check:
1. Some passengers have no **Cabin** number. In the small sample shown above, this seems to be linked to the class: the third class passengers seem to have no numbered cabin. **This needs to be checked!**
2. Some passengers have no age data. **How frequently this happens needs to be checked!**
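
Both points can be quantified at once by counting the missing values per column. The following is a minimal sketch on a small hand-made frame: the rows are invented for illustration and are not taken from `titanic-data.csv`, and `sample_df` and `missing_counts` are hypothetical names. Only the column names `Pclass`, `Cabin` and `Age` come from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set: invented rows, real column names
sample_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Cabin':  ['C85', np.nan, np.nan, np.nan, 'G6', np.nan],
    'Age':    [38.0, 54.0, np.nan, 22.0, 4.0, np.nan],
})

# Number of missing values per column
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

Applied to `passenger_df`, the same call would list every column with gaps, not only the two spotted by eye.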


#### 2.1. The "unassigned cabin issue": Data Analysis

In [None]:
# Select the data for the passengers with no cabin

passengers_without_cabins = passenger_df.loc[pd.isnull(passenger_df['Cabin'])]

In [None]:
# Check if there is a link between the absence of a cabin and the class

pwoc_nb = passengers_without_cabins.groupby('Pclass')

print('The number of passengers by class without a cabin number is')
print(pwoc_nb.size())


In [None]:
# Check the number of passengers with cabins

pwc_nb = passenger_df.groupby('Pclass').size() - pwoc_nb.size()

print('The number of passengers by class with a cabin number is')
print(pwc_nb)

In [None]:
# Check the proportion of the passengers with cabins

print(pwc_nb / passenger_df.groupby('Pclass').size())
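
The counts and the proportion computed above can also be obtained in a single step, because the mean of a boolean column is the share of `True` values. This is a sketch on invented rows, not the method used in this notebook; `sample_df`, `has_cabin` and `cabin_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set: invented rows, real column names
sample_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Cabin':  ['C85', 'C123', 'E46', np.nan, 'D56', np.nan,
               np.nan, np.nan, np.nan, 'G6'],
})

# True where a cabin number is recorded
has_cabin = sample_df['Cabin'].notnull()

# Share of passengers with a cabin number, per class
cabin_share = has_cabin.groupby(sample_df['Pclass']).mean()
print(cabin_share)
```

This avoids subtracting two grouped counts, which would yield `NaN` for any class that happens to have no passengers without a cabin.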

#### Intermediate Results 

The picture differs completely between the classes:
- a large share of assigned cabins in the first class
- a low and decreasing share of assigned cabins in the other two classes

Consequently we might check the following **hypothesis**: 

the first class passengers without an assigned cabin are spouses or children staying in the cabin of their husband or father

In [None]:
# Check if the passengers without a cabin have siblings/spouses (SibSp)
# or parents/children (Parch) on board, summed by class

pwoc_nb['SibSp'].sum() + pwoc_nb['Parch'].sum()

The 11 relatives counted this way do not account for the 40 first class passengers without cabins. 
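
Note that summing `SibSp` and `Parch` counts relatives, not passengers: a passenger with three relatives on board weighs three times. An alternative check, sketched here on invented rows, counts how many first class passengers without a cabin have at least one relative on board. The names `sample_df`, `first_no_cabin` and `with_relatives` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set: invented rows, real column names
sample_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 3],
    'Cabin':  [np.nan, np.nan, np.nan, 'C85', np.nan],
    'SibSp':  [1, 0, 0, 1, 0],
    'Parch':  [2, 0, 0, 0, 0],
})

# First class passengers without a recorded cabin
first_no_cabin = sample_df[(sample_df['Pclass'] == 1)
                           & sample_df['Cabin'].isnull()]

# True for passengers with at least one relative on board
with_relatives = (first_no_cabin['SibSp'] + first_no_cabin['Parch']) > 0

print(with_relatives.sum(), 'of', len(first_no_cabin))
```

In this toy frame the summed relative count is 3, although only one passenger actually travels with family, which is why the two measures should not be confused.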

**We can't draw any conclusions, but we can make some suppositions**:
- the hypothesis is true, but the data set (a sample of 891 passengers out of 1320 passengers and 892 staff members) is incomplete: the related records may be in the remaining part of the population
- the unassigned passengers shared a cabin but had no family or marital relationship with the assigned guest. This would include household staff and mistresses
- some cabins were not prebooked but assigned only on the ship, and the related records were lost


#### 2.1. The "unassigned cabin issue": Conclusion and consequences

#### The higher proportion of assigned cabins in the first class

This higher proportion of known assignments is not that surprising. The following comments are assumptions, but plausible ones:
- the first class passengers more likely booked their cabins centrally, and a pre-assignment was made by the company. The latter had more interest in keeping a record of these high-value customers for **marketing purposes** (though this term had not been coined at the time)
- the other customers would generally be assigned a cabin on the ship itself. If a record was kept, which is not certain, it lies at the bottom of the Atlantic.

#### Consequences for our inquiries

The unfortunate consequence is that an analysis of survival rates by deck (the first character of the cabin number) is only possible for first class passengers. 
One could also argue that many of these passengers would still have been up, enjoying the luxury life on board, whereas the second and third class passengers were more likely asleep.

**Consequently**, we will drop the analysis of **survival rates by deck**.
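
For reference, the deck letter mentioned above could be derived as follows. This is only a sketch on invented rows; the column name `Deck` and the names `sample_df` and `known_deck` are hypothetical and not part of the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set: invented rows, real column names
sample_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Cabin':  ['C85', 'E46', np.nan, 'G6', np.nan],
})

# First character of the cabin number; missing cabins stay NaN
sample_df['Deck'] = sample_df['Cabin'].str[0]

# Number of passengers with a known deck, per class
known_deck = sample_df.groupby('Pclass')['Deck'].count()
print(known_deck)
```

A per-class count like `known_deck` makes the coverage problem visible: classes with few known decks cannot support a survival analysis by deck.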

#### 2.2. Number of passengers without age data: Data Analysis



In [None]:
print('Example of a passenger without age data')

print(passenger_df.iloc[5])

In [None]:
# Count the passengers without age data

passengers_wo_age = passenger_df.loc[passenger_df['Age'].isnull()]

print(len(passengers_wo_age))

#### Intermediate Results

Having **177** passengers (~20%) without age might water down the conclusions. Yet the results with the remaining data could still be significant enough.

In [None]:
passengers_wo_age.head(10)

#### 2.2. Number of passengers without age data: Conclusions

A first look at the data suggests that the missing age values have **no real correlation with the other variables**. They do not seem to depend on
- the booked class: all classes are present
- the port of embarkation: all ports are present
- the sex of the passenger: both women and men have missing data
- the assignment of cabins
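
A look at the first ten rows is only a weak check. A somewhat firmer one, sketched here on invented rows, is to compute the share of missing ages within each group: similar shares across groups would support the reading above, clearly different ones would contradict it. Whether the real data confirms it is not shown here; `sample_df`, `age_missing` and `missing_share` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set: invented rows, real column names
sample_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3, 3],
    'Sex':    ['male', 'female', 'male', 'female',
               'male', 'female', 'male', 'female'],
    'Age':    [np.nan, 38.0, 27.0, np.nan, np.nan, 22.0, 4.0, np.nan],
})

# True where the age is missing
age_missing = sample_df['Age'].isnull()

# Share of missing ages within each group, for each grouping column
missing_share = {}
for column in ['Pclass', 'Sex']:
    missing_share[column] = age_missing.groupby(sample_df[column]).mean()
    print(missing_share[column])
```

The same loop could be extended to `Embarked` or to a boolean "has cabin" column on `passenger_df`.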

**Consequently**, we have to accept the **watering down of the age data**.