# Introduction
First iteration of the Kaggle competition "Titanic".

The goal of the competition is to predict whether each passenger survived, given data from the ship archives.

### Imports
Import libraries and write settings here.

In [17]:
# Data manipulation
import pandas as pd
import numpy as np
## pandas_profiling is a useful tool to do the first EDA before cleaning up the data
## https://towardsdatascience.com/speed-up-your-exploratory-data-analysis-with-pandas-profiling-88b33dc53625
from pandas_profiling import ProfileReport
import string

# Statistics and model training
import statistics as stat

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from IPython import get_ipython
ipython = get_ipython()

# autoreload extension
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

plt.style.use('ggplot')

# Analysis/Modeling


### Data set dictionary

link: https://www.kaggle.com/c/titanic/data

- Survived => categorical, Yes = 1, No = 0
- pclass => ticket class of the passenger (1st = Upper, 2nd = Middle, 3rd = Lower)
- sex
- age => age in years (fractional if less than 1; if the age is estimated, it is in the form xx.5)
- sibsp => number of siblings/spouses on board (absolute count, finite)
- parch => number of parents/children on board (absolute count, finite)
- ticket => ticket id
- fare => passenger fare
- cabin => cabin number
- Embarked => port at which the passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton)

### EDA

In [18]:
# load data
ts = pd.read_csv('titanic_train.csv')

In [19]:
# explore head
ts.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [20]:
# descriptive stats and info
ts.describe()
ts.info()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [21]:
# here we use pandas_profiling to generate an initial eda report
profile = ProfileReport(ts)
profile.to_notebook_iframe()

In [22]:
# we make a copy of the data set to manipulate
ts_o = ts.copy()


While working through the data processing, I realized that the report generated by pandas_profiling is most useful when you are already familiar with the data and the subject. In this case, it is more useful to do the analysis myself, because I need to understand the data in more depth. So, let's start!

#### Manual EDA

First, it is useful to plot each variable against the survival rate. This gives a sense of how strongly survival depends on each variable. At this stage, we also want to check the types of the variables and transform them if needed.
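As a minimal sketch of this idea, the survival rate per category can be computed by grouping on a feature and taking the mean of `Survived`. The helper name `survival_rate_by` and the toy frame are illustrative assumptions and not part of the notebook; on the real data one would pass `ts_o` instead.

```python
import pandas as pd

def survival_rate_by(df: pd.DataFrame, feature: str) -> pd.Series:
    """Return the share of survivors for each value of `feature`."""
    # Survived is coded 0/1, so the group mean is the survival rate
    return df.groupby(feature)['Survived'].mean()

# Toy data for illustration only (not taken from the Titanic data set)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 3, 3],
    'Survived': [0, 1, 1, 0],
})

rates_by_sex = survival_rate_by(toy, 'Sex')
rates_by_class = survival_rate_by(toy, 'Pclass')
print(rates_by_sex)
print(rates_by_class)
```

The resulting Series can be passed directly to a bar plot (for example `rates_by_sex.plot(kind='bar')`) to compare categories visually.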

In [23]:
ts_o.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [24]:
# we have 12 columns and 891 rows
ts_o.shape

(891, 12)

In [25]:
ts_o.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


#### Feature exploration - Age and Name

Starting with Age, 177 of the 891 values (about 20%) are missing. We can try multiple approaches to avoid dropping these rows and losing information. We will create an additional column for each approach and test which results in better model performance.

- Substitute the missing values with a statistic (mean, median) based on the port of embarkation
- Substitute the missing values based on the salutation inferred from the name, where available


In [26]:
# 1) extract the title (salutation) from each name,
# so that we can use it for feature engineering
ts_o['Salutation'] = ts_o.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())

ts_o.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr


In [50]:
# then we want to evaluate the median values for the salutation sub-groups
salutation_labels = list(ts_o['Salutation'].unique())
salutation_median = {}

# obtain the median age value by salutation
for i in salutation_labels:
    salutation_median.update({i : stat.median(ts_o[(ts_o['Salutation']==i) & (ts_o['Age'].notnull())]['Age'])})
    
# add a column in which missing ages are filled with the median age of the passenger's salutation
ts_o['age_filled'] = ts_o['Age'].fillna(ts_o['Salutation'].map(salutation_median))


ts_o.head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation,age_filled
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,35.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Mr,30.0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Mr,54.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Master,2.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Mrs,27.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Mrs,14.0


The median by Salutation gives a finer-grained age estimate than a single statistic per port of embarkation. For comparison, let's also compute the mean age by port of embarkation and store it in an additional column.

In [64]:
# here we evaluate the mean age for each port of embarkation

ports = ['C', 'Q', 'S']
mean={}
for i in ports:
    mean.update({i:ts_o[(ts_o['Embarked']==i) & (ts_o['Age'].notnull())]['Age'].mean()})

ts_o['age_mean'] = ts_o['Age'].fillna(ts_o['Embarked'].map(mean))

ts_o.head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation,age_filled,age_mean
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,22.0,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,38.0,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,26.0,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,35.0,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,35.0,35.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Mr,30.0,28.089286
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Mr,54.0,54.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Master,2.0,2.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Mrs,27.0,27.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Mrs,14.0,14.0
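Both imputations above can also be written without an explicit loop, using `groupby(...).transform(...)`, which broadcasts a per-group statistic back to the original rows. The following is only a sketch under assumptions: the helper `fill_by_group` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def fill_by_group(df: pd.DataFrame, value_col: str, group_col: str,
                  how: str = 'median') -> pd.Series:
    """Fill missing `value_col` entries with a per-group statistic."""
    # transform returns one value per original row, aligned on the index;
    # missing values are skipped when the statistic is computed
    group_stat = df.groupby(group_col)[value_col].transform(how)
    return df[value_col].fillna(group_stat)

# Toy data for illustration only (not taken from the Titanic data set)
toy = pd.DataFrame({
    'Salutation': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss'],
    'Age': [20.0, 40.0, None, 10.0, None],
})
toy['age_filled'] = fill_by_group(toy, 'Age', 'Salutation', how='median')
print(toy)
```

On the notebook's data, the equivalent calls would be `fill_by_group(ts_o, 'Age', 'Salutation', 'median')` and `fill_by_group(ts_o, 'Age', 'Embarked', 'mean')`. A group with no known ages stays missing, so it is worth checking the result for remaining nulls.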


#### Feature Exploration - Ticket and Cabin

Cabin has many missing values: only 204 of the 891 passengers have a cabin recorded (Ticket, by contrast, has no missing values). Since most passengers have no cabin on record, I will generate a label that uses the first letter of the cabin, which identifies the deck, when available, and the placeholder `not_on_deck` otherwise.

In [82]:
# fill missing cabin values with 'not_on_deck'; otherwise keep the first letter of the cabin (the deck)

ts_o['cabin_filled'] = ts_o.Cabin.str[0].fillna('not_on_deck')
ts_o.head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Salutation,age_filled,age_mean,cabin_filled
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,22.0,22.0,not_on_deck
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,38.0,38.0,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,26.0,26.0,not_on_deck
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,35.0,35.0,C
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,35.0,35.0,not_on_deck
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Mr,30.0,28.089286,not_on_deck
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Mr,54.0,54.0,E
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Master,2.0,2.0,not_on_deck
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Mrs,27.0,27.0,not_on_deck
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Mrs,14.0,14.0,not_on_deck


Next, we should use the ticket number and the fare to check whether there are groups that travelled together. Such groups would probably also have booked cabins close to each other.
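One possible starting point, sketched here on toy data, is to count how many passengers share each ticket id. The column names `ticket_group_size`, `travels_in_group` and `fare_per_person` are hypothetical, and dividing the fare by the group size assumes that `Fare` is the total price of the ticket, which would need to be verified on the real data.

```python
import pandas as pd

# Toy data for illustration only (not taken from the Titanic data set)
toy = pd.DataFrame({
    'Ticket': ['A1', 'A1', 'B2', 'C3', 'C3', 'C3'],
    'Fare': [20.0, 20.0, 7.5, 30.0, 30.0, 30.0],
})

# Number of passengers sharing each ticket id
toy['ticket_group_size'] = toy.groupby('Ticket')['Ticket'].transform('count')

# Flag passengers who share a ticket with at least one other passenger
toy['travels_in_group'] = toy['ticket_group_size'] > 1

# Fare per person, ASSUMING that Fare is the total paid for the whole ticket
toy['fare_per_person'] = toy['Fare'] / toy['ticket_group_size']

print(toy)
```

Comparing `cabin_filled` within each ticket group would then show whether passengers on the same ticket were also on the same deck.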

# Results
Show graphs and stats here

# Conclusions and Next Steps
Summarize findings here