# Titanic Disaster

In this recap, we will explore the famous [Titanic](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Titanic_dataset.csv) dataset listing all passengers with various properties.

❓ Start by loading `matplotlib`, `numpy` and `pandas` the usual way

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

❓ Load the CSV data into a `titanic_df` variable.

The CSV file is available at this URL: https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Titanic_dataset.csv

<details>
    <summary>üí° <strong>Hint</strong> - Click to reveal</summary>
    Try using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html"><code>pandas.DataFrame.read_csv</code></a>
</details>

In [None]:
titanic_df = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Titanic_dataset.csv')

❓ Explore the dataset with the usual methods (`shape`, `dtypes`, `describe()`, `info()`, `isnull().sum()`).

Do not hesitate to add cells by pressing `B`.

In [None]:
titanic_df.head()

In [None]:
titanic_df.isnull().sum()

It seems that the `Cabin` information is missing in 687 rows.
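
A raw count of missing values is easier to judge as a share of all rows. The sketch below illustrates this on a tiny hand-made frame (`sample_df` and its values are invented for illustration and are not taken from the dataset); the same calls should work on `titanic_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row stand-in for `titanic_df`; the values are made up.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, 4.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, "B28"],
})

print(sample_df.shape)       # (number of rows, number of columns)
print(sample_df.dtypes)      # type of each column
sample_df.info()             # dtypes plus non-null counts per column
print(sample_df.describe())  # summary statistics of the numeric columns

# Number of missing values per column
missing_counts = sample_df.isnull().sum()
# Fraction of missing values per column (mean of a boolean mask)
missing_share = sample_df.isnull().mean()
print(missing_share.sort_values(ascending=False))
```

A column whose share of missing values is high is a natural candidate for removal.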

❓ Use the [`pandas.DataFrame.drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function to get rid of the `Cabin` column in `titanic_df`

In [None]:
titanic_df.drop('Cabin', axis=1, inplace=True)
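
With `inplace=True`, running the cell a second time raises a `KeyError` because the column is already gone. As an alternative sketch (on a made-up two-row frame, `sample_df`, used here only for illustration), you can reassign the result instead and pass `errors="ignore"` so the cell can be re-run safely:

```python
import pandas as pd

# Hypothetical stand-in for `titanic_df`; the values are made up.
sample_df = pd.DataFrame({"PassengerId": [1, 2], "Cabin": [None, "C85"]})

# drop() returns a new DataFrame; the original is left untouched
trimmed_df = sample_df.drop(columns="Cabin")

# errors="ignore" skips labels that do not exist, so repeating the call is harmless
trimmed_again_df = trimmed_df.drop(columns="Cabin", errors="ignore")

print(trimmed_again_df.columns.tolist())
```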

## Classes Analysis

Let's have a look at the ticket breakdown.

❓ Using a `groupby()`, create a `pclass_df` dataframe counting the number of tickets sold per class (1, 2 or 3)

In [None]:
pclass_df = titanic_df.groupby("Pclass").count()["PassengerId"].to_frame(name="count")
pclass_df
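
There are several equivalent ways to count rows per group. The sketch below compares three of them on a made-up frame (`sample_df` is hypothetical). Note that `count()` counts non-null values of the selected column, whereas `size()` counts rows; the two agree here because `PassengerId` has no missing values in the sample.

```python
import pandas as pd

# Hypothetical stand-in for `titanic_df`; the values are made up.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Pclass": [3, 1, 3, 2, 3, 1],
})

# Count non-null PassengerId values per class
by_groupby = sample_df.groupby("Pclass")["PassengerId"].count().to_frame(name="count")

# Count rows per class, regardless of missing values
by_size = sample_df.groupby("Pclass").size().to_frame(name="count")

# Shortcut without groupby; sort_index() orders the result by class
by_value_counts = sample_df["Pclass"].value_counts().sort_index()

print(by_groupby)
```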

Raw numbers are not very visual, so let's try to make sense of the data with a plot.

❓ Plot the `pclass_df` dataframe built in the previous question as a bar chart

In [None]:
pclass_df.plot(kind="bar")

Let's now have a look at **survivors**.

❓ Plot a bar chart showing the *survival rate* of each passenger class. A rate of `0` means no one in the class survived; `1` means everyone did.

In [None]:
titanic_df[["Pclass","Survived"]].groupby('Pclass').mean().plot(kind='bar')
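
Taking the `mean()` works because `Survived` only holds `0` or `1`: the mean of such a column is the number of survivors divided by the number of passengers. A minimal check on a made-up frame (`sample_df` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for `titanic_df`; the values are made up.
sample_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

grouped = sample_df.groupby("Pclass")["Survived"]

# Mean of a 0/1 column ...
rate_from_mean = grouped.mean()
# ... equals survivors divided by passengers
rate_from_ratio = grouped.sum() / grouped.count()

print(rate_from_mean)
```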

## Gender Analysis

Let's have a look at the `Sex` column.

❓ Use the [`pandas.Series.unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) function to check the different values used in that column

In [None]:
titanic_df['Sex'].unique()

❓ Plot a bar chart showing the *survival rate* of each gender. Based on the data, which gender had the more favourable outcome?

In [None]:
titanic_df[['Survived', 'Sex']].groupby('Sex').mean().plot(kind='bar')

Let's build a fancier bar chart showing both the total number of passengers and the total number of survivors for each gender.

❓ Build a `survivors_df` DataFrame with two columns: `Total` and `Survived`, and two rows (`male` and `female`). Plot it.

In [None]:
survivors_df = titanic_df[['Survived', 'Sex']].groupby('Sex').sum()
# Select the column before counting so that a Series (not a one-column DataFrame) is assigned
survivors_df['Total'] = titanic_df.groupby('Sex')['Survived'].count()
survivors_df.plot(kind='bar')
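
As an alternative sketch, pandas named aggregation can build both columns in a single `groupby` call. The frame below (`sample_df`) and the name `survivors_sketch_df` are hypothetical and only serve the illustration:

```python
import pandas as pd

# Hypothetical stand-in for `titanic_df`; the values are made up.
sample_df = pd.DataFrame({
    "Sex":      ["male", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# Each keyword becomes an output column: Total counts rows, Survived sums the 0/1 flags
survivors_sketch_df = sample_df.groupby("Sex")["Survived"].agg(Total="count", Survived="sum")

print(survivors_sketch_df)
```

Calling `.plot(kind='bar')` on the result draws one pair of bars per gender, as in the cell above.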

## Children

The previous analysis did not take age into account. We want to differentiate between children and adults and see how *survival rates* are affected.

❓ Use boolean indexing to create a `children_df` containing only the rows of child passengers (aged 17 or under)

In [None]:
children_df = titanic_df[titanic_df['Age'] <= 17]
children_df.head()
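
One thing to keep in mind with this filter: a comparison against a missing value (`NaN`) evaluates to `False`, so passengers whose age is unknown are left out of the children rather than counted as children. A small sketch on made-up data (`sample_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `titanic_df`; the values are made up.
sample_df = pd.DataFrame({
    "Age":      [4.0, 17.0, 18.0, np.nan, 40.0],
    "Survived": [1, 0, 1, 1, 0],
})

# Boolean mask: True where the passenger is 17 or younger, False for NaN ages
is_child = sample_df["Age"] <= 17

# Boolean indexing keeps only the rows where the mask is True
children_sketch_df = sample_df[is_child]

# Passengers with an unknown age are neither kept nor flagged, so count them separately
unknown_age_count = sample_df["Age"].isnull().sum()

print(children_sketch_df)
```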

❓ How many children were there on the Titanic?

In [None]:
children_df.shape[0]

❓ How many children survived?

In [None]:
children_df['Survived'].sum()

❓ Plot a bar chart of survivors / total for each category: `male`, `female`, `children`. Bear in mind that you need to **subtract** the boys from the `male` statistics, and the girls from the `female` statistics.

In [None]:
survivors_df.loc['children'] = [children_df['Survived'].sum(), children_df.shape[0]]
survivors_df

In [None]:
children_gender_df = children_df[['Survived', 'Sex']].groupby('Sex').sum()
# Select the column before counting so that a Series (not a one-column DataFrame) is assigned
children_gender_df['Total'] = children_df.groupby('Sex')['Survived'].count()
children_gender_df.loc['children'] = [ 0, 0 ]
children_gender_df

In [None]:
(survivors_df - children_gender_df).plot(kind='bar')
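
The subtraction works because pandas aligns both frames on their row and column labels. A different way to get the same three categories, sketched here on made-up data, is to label each passenger first and aggregate once. The `Category` column, `sample_df` and `category_df` are hypothetical names introduced for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `titanic_df`; the values are made up.
sample_df = pd.DataFrame({
    "Sex":      ["male", "female", "male", "female", "male", "female"],
    "Age":      [30.0, 25.0, 8.0, 5.0, np.nan, 40.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Label passengers aged 17 or under as "children"; everyone else keeps their gender.
# Passengers with an unknown age fail the comparison and therefore keep their gender too.
sample_df["Category"] = np.where(sample_df["Age"] <= 17, "children", sample_df["Sex"])

# One aggregation gives survivors and totals for all three categories
category_df = sample_df.groupby("Category")["Survived"].agg(Survived="sum", Total="count")

print(category_df)
```

This avoids the manual subtraction, at the cost of adding a helper column to the frame.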

## [Optional] Big families

❓ Find out whether it was harder for bigger families to survive.

To do so, create a new column in your `DataFrame` holding the family size of each passenger.

In [None]:
titanic_df['family_size'] = titanic_df['SibSp'] + titanic_df['Parch']
titanic_df.groupby('family_size')['Survived'].mean().plot(kind='bar');

## [Optional] Distinguished titles

❓ Were passengers with distinguished titles given preference during the evacuation?

With some string manipulation, create a new column holding the title of each passenger

In [None]:
titanic_df['Title'] = titanic_df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
titanic_df.groupby('Title').count()['PassengerId'].sort_values().plot(kind='bar', logy=True)

In [None]:
titanic_df.groupby('Title')['Survived'].mean().sort_values().plot(kind='bar')
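
The `split`-based extraction assumes every name follows a `Surname, Title. Given names` layout and raises an `IndexError` on a name without a comma. As an alternative sketch, the vectorised `Series.str.extract` with a regular expression returns a missing value instead of failing. The names below are invented for illustration and do not come from the dataset:

```python
import pandas as pd

# Hypothetical names following the "Surname, Title. Given names" layout,
# plus one entry that does not follow it.
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Ann Smith)",
    "Poe, the Countess. of Somewhere",
    "Unknown",
])

# Capture everything between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles)
```

Rows that do not match the pattern end up as missing values, which `groupby('Title')` leaves out by default.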