# Titanic: Machine Learning from Disaster

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

CURR_PATH = os.path.curdir
sys.path.append(os.path.join(CURR_PATH, '../titanic-classes/'))

In [2]:
# Read data from file and load into pandas dataframe
data_path = '../data/'
df_train = pd.read_csv(os.path.join(data_path, 'train.csv'))
df_test = pd.read_csv(os.path.join(data_path, 'test.csv'))

## Raw data analysis

In [3]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
display(df_train.describe())

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [5]:
# Check column dtypes and non-null counts to decide which fields
# to drop (e.g. Name, PassengerId) and which to impute
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [6]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


**Inferences from the raw data above**

- More than 75% of the training rows have no Cabin value (only 204 of 891 are non-null), so we remove Cabin from the list of features.
- We need to replace the null values in Age and Embarked. The test set is also missing one Fare value.
- We do not need Name as a feature; however, we can use each person's title to gain insight.
- We can safely drop PassengerId and Ticket, as they are not relevant to the prediction.
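
The missing-value share behind the Cabin decision can be computed directly rather than read off `info()`. For the training set, 204 of 891 Cabin values are present, so about 77% are missing. The snippet below is a minimal sketch on a toy frame (the `sample` data is illustrative, not taken from `train.csv`); the same expression should work on `df_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are illustrative only.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isna() marks missing cells; the column mean of those booleans is the
# fraction of missing values per column. Sort so the worst column is first.
missing_fraction = sample.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Columns whose missing fraction is very high are candidates for removal; columns with only a few gaps are better candidates for imputation.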

## Data cleaning

In [7]:
from data_cleaner import TitanicCleaner

We import the **TitanicCleaner** class. It has the following methods:

- remove_irrelavant_features()
- extract_titles()
- replace_null_embarked()
- replace_null_age()
- display_head(number_of_rows: int)

See _src/titanic-classes/data_cleaner.py_ for more info.

In [8]:
# Initialize a TitanicCleaner object with the training dataframe
df_tit_train = df_train
titanic_train = TitanicCleaner(df_tit_train)

We get rid of the irrelevant features:
- PassengerId and Ticket are either arbitrary identifiers or of no significance for predicting the final outcome.
- Since more than 75% of the Cabin values are missing, we remove this feature as well.
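
The internals of `remove_irrelavant_features()` are not shown here. As a rough sketch, and assuming it does something equivalent to a plain pandas column drop, the step could look like this on a toy frame (`sample` and `trimmed` are illustrative names):

```python
import pandas as pd

# Toy frame with the same kinds of columns as the training data.
sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.2833],
})

# drop() returns a new frame by default; the original is left untouched.
trimmed = sample.drop(columns=["PassengerId", "Ticket", "Cabin"])
print(list(trimmed.columns))
```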

In [9]:
# Removes the features PassengerId, Ticket number, Cabin number
titanic_train.remove_irrelavant_features()
titanic_train.display_head()

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.05 | S |


---
Name itself might not be a factor in the prediction. However, we can extract the title from each person's name, which might give us insight into their age and status.

In [10]:
# Create a feature called Title from each person's name, then drop the feature 'Name'
titanic_train.extract_titles()
titanic_train.display_head()

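| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Mr |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Miss |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Mrs |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Mr |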
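
How `extract_titles()` parses the name is not shown in this notebook. One common approach, sketched here as an assumption rather than as the class's actual implementation, relies on the "Surname, Title. Given names" pattern visible in the Name column and captures the text between the comma and the first period:

```python
import pandas as pd

# Names copied from the head of the training data shown earlier.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture everything after ", " up to (not including) the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Names that do not follow this pattern would yield a null title, so a real implementation would probably want a fallback value.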


***
The features Age and Embarked have some null values.

- We estimate each missing age from the Title: we compute the median age of each Title group and assign that value.
- We replace null values in Embarked with the mode (the most frequent value).
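
The two imputation rules above can be sketched in plain pandas. This is an illustration on a toy frame under the assumption that the cleaner behaves like a group-wise median fill and a mode fill; the actual `replace_null_age()` and `replace_null_embarked()` code may differ.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative only.
sample = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [22.0, 35.0, np.nan, 26.0, 4.0, np.nan],
    "Embarked": ["S", "S", "C", None, "S", "Q"],
})

# Median age within each Title group, broadcast back to every row,
# then used only where Age is missing.
median_by_title = sample.groupby("Title")["Age"].transform("median")
sample["Age"] = sample["Age"].fillna(median_by_title)

# mode() returns a Series because ties are possible; take the first entry.
most_common_port = sample["Embarked"].mode().iloc[0]
sample["Embarked"] = sample["Embarked"].fillna(most_common_port)

print(sample)
```

Using `transform` keeps the result aligned with the original index, which avoids a separate merge step.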

In [11]:
# We replace null values in Embarked
titanic_train.replace_null_embarked()
print("")




In [12]:
# We replace null values in Age
titanic_train.replace_null_age().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
Title       891 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 62.7+ KB
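
Note that the `info()` output above still reports 889 non-null values for Embarked, so the two missing entries do not appear to have been filled in the frame being inspected. One possible cause, stated here only as an assumption since the class source is not shown, is that `replace_null_embarked()` returns a new frame instead of modifying the stored one. A small helper such as the hypothetical `remaining_nulls` below makes this kind of check explicit after each cleaning step:

```python
import pandas as pd


def remaining_nulls(frame: pd.DataFrame) -> pd.Series:
    """Return the columns that still contain nulls, with their null counts."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame; in the notebook this would be the cleaned training frame.
sample = pd.DataFrame({"Age": [22.0, 38.0], "Embarked": ["S", None]})
print(remaining_nulls(sample))
```

An empty result means every column is fully populated.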
