# Titanic Survival Prediction

## Introduction

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. There weren’t enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. This project will consist of an analysis of the different circumstances and characteristics (referred to as _features_) of each passenger, and will try to predict the survival of passengers based on these features. 

This prediction model is created as part of a Kaggle challenge, which can be found [here](https://www.kaggle.com/c/titanic).


## Data

The raw data for this analysis consists of 2 CSV files:
- train.csv 
- test.csv

The data in train.csv will be used to train the prediction model. The data in test.csv will be used to test how well the model predicts the survival of passengers, by submitting its predictions to Kaggle. The train.csv file contains the following columns:

| **Column** | **Description** | **Key** |
| ---| --- | --- |
| PassengerId | The unique identifier of each passenger | |
| Survived | Whether or not the passenger survived | 0 = not survived, 1 = survived|
| Pclass | The class in which the passenger was staying on the Titanic | 1 = 1st class, 2 = 2nd class, 3 = 3rd class|
| Name | Name and title of the passenger | |
| Sex | Gender of the passenger | male, female| 
| Age | Age in years | |
| SibSp | Number of siblings / spouses aboard the Titanic | | 
| Parch | Number of parents / children aboard the Titanic | |
| Ticket | Ticket number of the passenger | |
| Fare | Price of the passenger's ticket | |
| Cabin | Cabin number of the passenger | |
| Embarked | Port from which the passenger embarked the Titanic | C = Cherbourg, Q = Queenstown, S = Southampton|

The test.csv file contains all of the columns listed above, except for the "Survived" column.
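A quick way to confirm this difference is to compare the column names of the two files. The sketch below is only an illustration: it uses two in-memory header lines as stand-ins for train.csv and test.csv (assumed to match the column table above) rather than reading the real files.

```python
import io

import pandas as pd

# Header lines only; stand-ins for train.csv and test.csv
# (assumed to match the column table above)
train_header = "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
test_header = "PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"

train_cols = pd.read_csv(io.StringIO(train_header)).columns
test_cols = pd.read_csv(io.StringIO(test_header)).columns

# Columns present in train but absent from test
only_in_train = train_cols.difference(test_cols).tolist()
print(only_in_train)
```

With the real files, the same comparison could be run on the columns of the two dataframes after loading them with `pd.read_csv`.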

For the data exploration, data preparation and model creation, only the train.csv file will be used. 

## Package imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Exploration

In [3]:
titanic_df = pd.read_csv("../raw_data/train.csv")
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
print(f'There are {titanic_df.shape[0]} rows/passengers and {titanic_df.shape[1]} columns in the dataframe')

There are 891 rows/passengers and 12 columns in the dataframe


In [6]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Missing values

In [7]:
titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Not all columns are filled with values. The following columns have missing values:
- Age: 177 missing values
- Cabin: 687 missing values
- Embarked: 2 missing values
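
To put these counts in proportion, it can help to express them as a percentage of all rows. The helper below is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical name, and the small `demo` frame is made up purely to show the helper running.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny made-up frame, only to demonstrate the helper
demo = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})
result = missing_summary(demo)
print(result)
```

Applied to the 891 rows of `titanic_df`, the counts above correspond to roughly 19.9% missing for Age, 77.1% for Cabin and 0.2% for Embarked.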

Let's look at each of these columns to see why the values could be missing.

#### Age

In [8]:
titanic_df[titanic_df["Age"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


There is no clear reason why the ages are missing for these passengers. We will decide in the data preparation step how best to handle these missing values.

#### Cabin

In [16]:
titanic_df[titanic_df["Cabin"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S


In [18]:
titanic_df["Cabin"].value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: Cabin, Length: 147, dtype: int64

In [13]:
titanic_df[titanic_df["Cabin"].isnull()]["Pclass"].value_counts()

3    479
2    168
1     40
Name: Pclass, dtype: int64

In [15]:
titanic_df[titanic_df["Cabin"].notnull()]["Pclass"].value_counts()

1    176
2     16
3     12
Name: Pclass, dtype: int64

The cabin number is known mostly for 1st class passengers (176 of the 204 recorded values), but it is still difficult to see why the cabin was registered for some passengers and not for others. As [this article from BBC](https://www.bbc.co.uk/bitesize/topics/z8mpfg8/articles/zkg9dxs) states, there were cabins for all 3 classes on the Titanic, so a missing value does not mean the passenger had no cabin. Since 687 of the 891 values are missing, we might drop the whole column in the data preparation step.
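
The two `value_counts()` outputs above can be combined into a share of missing cabin values per class. The sketch below simply re-enters the printed counts by hand, so it runs without the dataframe; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above, indexed by Pclass
cabin_missing = pd.Series({1: 40, 2: 168, 3: 479}, name="missing")
cabin_known = pd.Series({1: 176, 2: 16, 3: 12}, name="known")

# Share of passengers in each class without a recorded cabin
share_missing = (cabin_missing / (cabin_missing + cabin_known)).round(3)
print(share_missing)
```

This gives about 18.5% missing for 1st class, 91.3% for 2nd class and 97.6% for 3rd class. On the full dataframe, the same figures could probably be obtained directly with `titanic_df["Cabin"].isnull().groupby(titanic_df["Pclass"]).mean()`.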

#### Embarked

In [19]:
titanic_df[titanic_df["Embarked"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Only 2 passengers have a missing value in the Embarked column. A quick Google search shows that:
- [Miss. Amelie Icard embarked at Southampton](https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html)
- [Mrs. George Nelson (Martha Evelyn) Stone also embarked at Southampton](https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html)


## Data Preparation

From the Data Exploration, it became clear that the data is not completely clean yet. The data needs to be cleaned before we can investigate the features further.

### Filling missing values

#### Embarked

As we saw in the data exploration, both passengers with a missing Embarked value boarded at Southampton. Therefore, I will fill these missing values with "S".

In [20]:
titanic_df[titanic_df["Embarked"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [22]:
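# Assign the result back: fillna(inplace=True) on a selected column may not update the dataframe in recent pandas versions
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")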

In [26]:
titanic_df.loc[[61,829]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,S


In [27]:
titanic_df["Embarked"].isnull().sum()

0
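
Because the two ports were looked up per passenger, an alternative to a blanket fill is to map each PassengerId to its researched port. The sketch below is one possible way to do this, not the approach used in this notebook. It runs on a three-row stand-in frame (two rows taken from the output above plus passenger 1 for contrast), and `looked_up_ports` is a hypothetical name.

```python
import pandas as pd

# Stand-in frame: PassengerId 62 and 830 are the rows with a missing port
df = pd.DataFrame({
    "PassengerId": [1, 62, 830],
    "Embarked": ["S", None, None],
})

# Ports looked up per passenger (both Southampton, per the sources linked earlier)
looked_up_ports = {62: "S", 830: "S"}

# Fill only the missing entries, using the per-passenger lookup;
# passengers not in the lookup keep their existing value
df["Embarked"] = df["Embarked"].fillna(df["PassengerId"].map(looked_up_ports))
print(df["Embarked"].isnull().sum())
```

This variant keeps the fill explicit: if another row with a missing port appeared later, it would stay missing instead of silently receiving "S".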