# Kaggle Titanic Survival Competition

In [48]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Understanding the dataset

First and foremost, I want to understand what the dataset I am working with looks like. A few general questions I want to answer are:

- What are the different columns? How many are there?
- What does the data look like?
- What data types are in the dataset?
- How many entries are there in the dataset? 

In [49]:
titanic_df = pd.read_csv('./data/train.csv')

print(titanic_df.columns)
print(len(titanic_df.columns))

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
12


In [50]:
print(titanic_df.head(1))

   PassengerId  Survived  Pclass                     Name   Sex   Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris  male  22.0      1   

   Parch     Ticket  Fare Cabin Embarked  
0      0  A/5 21171  7.25   NaN        S  


In [51]:
print(titanic_df.dtypes)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


In [52]:
num_entries = titanic_df.shape[0]
print("Number of entries:", num_entries)

Number of entries: 891


## Data Cleaning

In the cleaning process, I perform some data checks and manipulations that will hopefully improve the usability of the data and the quality of the predictions in later steps.

These include:
- Possibly dropping unnecessary columns
- Handling missing values by either removing the entries, or adding data
- Type checking & possibly converting them to more suitable types
- Removing duplicate data
- Possibly adapting granularity of the data (either making it less or more granular)
- Finding outliers and handling them appropriately
- Possibly adding new features with feature engineering
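One common way to approach the outlier step in the list above is the interquartile range (IQR) rule. The following is only a minimal sketch: the fare values are made up for illustration and are not taken from train.csv, and the 1.5 x IQR threshold is a conventional choice rather than something this notebook prescribes.

```python
import pandas as pd

# Hypothetical toy values standing in for the 'Fare' column (not real data)
fares = pd.Series([7.25, 8.05, 13.0, 26.55, 30.0, 512.33], name="Fare")

# Interquartile range: spread of the middle 50% of the values
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_mask = (fares < lower) | (fares > upper)
outliers = fares[outlier_mask]

print("Bounds:", lower, upper)
print(outliers)
```

Whether a flagged value should be removed, capped, or kept depends on the feature: an unusually high fare may be a legitimate first-class ticket rather than an error.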

In [53]:
# First I create a deep copy to guarantee that the original dataframe stays as it is
t_df = titanic_df.copy(deep=True)

### Dropping the PassengerId column

As each row already has a unique index in the DataFrame, the PassengerId column is redundant and can be dropped.

In [54]:
t_df = t_df.drop('PassengerId', axis=1)

### Handling missing values

One way of dealing with missing values would be to infer the data from similar entries, rather than removing the entry altogether. For example, suppose we have a passenger with a missing value for cabin:

- We could try to find out if they are married to another passenger by checking the name.
- We could check if the class gives us a rough estimate of where the cabin would be.
- We could check the price to infer what type of cabin it might be and where it would be located.

While this may sound sensible, such inferences are guesses rather than hard and fast rules, and wrong guesses could make the model perform worse. It is therefore a good idea to keep two versions of the dataset: one where all the entries with missing values are removed, and one where the missing values are inferred from other features.
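The two-version idea could be sketched as follows. This is an illustration under stated assumptions: toy_df is a hypothetical miniature stand-in for t_df with made-up values, and the imputation uses the median and the most frequent value as a simple substitute for the richer inference from names, class, and fare described above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for t_df; the values are made up
toy_df = pd.DataFrame({
    "Pclass": [1, 3, 3, 2],
    "Age": [38.0, np.nan, 22.0, np.nan],
    "Embarked": ["C", "S", None, "S"],
})

# Count the missing values per column to see where the gaps are
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Version 1: drop every row that has at least one missing value
dropped_df = toy_df.dropna()

# Version 2: fill the gaps with simple estimates
# (median for numeric data, most frequent value for categorical data)
imputed_df = toy_df.copy()
imputed_df["Age"] = imputed_df["Age"].fillna(imputed_df["Age"].median())
imputed_df["Embarked"] = imputed_df["Embarked"].fillna(
    imputed_df["Embarked"].mode()[0]
)

print(len(dropped_df), "rows remain after dropping")
print(imputed_df)
```

Keeping both versions makes it possible to train the same model on each and compare the results, instead of assuming in advance that imputation helps.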

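### Converting data types

Pclass and Parch only hold small integers, so int8 is sufficient for them. Survived is a yes/no flag and is converted to a boolean.

In [55]:
t_df['Pclass'] = t_df['Pclass'].astype('int8')
# Age stays float64: it contains missing values (NaN), which a plain integer dtype cannot hold
# t_df['Age'] = t_df['Age'].astype('int8')
t_df['Parch'] = t_df['Parch'].astype('int8')
t_df['Survived'] = t_df['Survived'].astype('bool')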
print(t_df.dtypes)

Survived       bool
Pclass         int8
Name         object
Sex          object
Age         float64
SibSp         int64
Parch          int8
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object


### Checking for duplicates

First we sort the data, then we check for duplicates. Sorting is not strictly required: the duplicated() method flags a row if an identical row appears anywhere earlier in the DataFrame, not only directly before it. Sorting is still useful, because it places rows with the same values next to each other and makes them easier to inspect manually.

For the sorting values I chose the 'Ticket' and 'Cabin' features.


In [56]:
t_df_sorted = t_df.sort_values(by=['Ticket', 'Cabin'])

duplicates = t_df_sorted.duplicated()
number_duplicate = duplicates.sum()
print("Number of duplicates:",number_duplicate)

Number of duplicates: 0
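A quick check on toy data shows how duplicated() treats identical rows that are not adjacent. The DataFrame below is hypothetical and only serves to illustrate the behaviour; with the default keep='first', every repeat after the first occurrence is marked True.

```python
import pandas as pd

# Hypothetical rows: the first and last row are identical but not adjacent
toy_df = pd.DataFrame({
    "Ticket": ["A1", "B2", "C3", "A1"],
    "Fare": [7.25, 8.05, 13.0, 7.25],
})

# No sorting beforehand: the repeated row is still detected
dup_mask = toy_df.duplicated()

print(dup_mask.tolist())
print("Number of duplicates:", dup_mask.sum())
```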


## Definition of the questions to be answered

The overall question is: What were the main factors that affected a passenger's chance of survival on the Titanic?

### Defining the assumptions to test

To build a successful model that attempts to predict survivability, it is first necessary to find related features. To find possible features that affect survivability, I first define some assumptions that I want to test:

- Age has an effect on survivability (due to the fitness of the individual)
- The cabin has an effect on survivability (due to proximity to exits, etc.)
- Class has an effect on survivability (due to the fact that higher class passengers had priority on the lifeboats)
- Tickets and cabins are a proxy for class and therefore affect survivability (they reflect the class of passenger).

Please note that these assumptions are based purely on intuition. They are just there to give me something to work with and investigate. Even if a feature turns out to be correlated with survival, this does not prove causation, nor does it confirm the reasoning behind my initial assumption.
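A first look at an assumption such as the one about class could compare the survival rate per group. The sketch below uses a hypothetical toy DataFrame with made-up values, so the printed rates say nothing about the real dataset; with the actual data, t_df would take the place of toy_df.

```python
import pandas as pd

# Hypothetical toy data; NOT the real Titanic values
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [True, True, True, False, False, False, True, False],
})

# The mean of a boolean column is the share of True values,
# i.e. the survival rate within each class
survival_by_class = toy_df.groupby("Pclass")["Survived"].mean()

print(survival_by_class)
```

A difference between the groups would only indicate an association; it would still need to be checked against confounding features such as Sex or Age before drawing conclusions.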