Titanic Survivor Analysis

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # library used for data visualization
import matplotlib.pyplot as plt # library used for data visualization
import re # library to work with Regular Expressions
from sklearn.preprocessing import LabelEncoder # used to encode categorical features to numerical ones

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Suppress warnings
import warnings  
warnings.filterwarnings('ignore')

In [None]:
# Read train and test data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# Print shape of train and test data
print("Train shape:{}".format(train.shape))
print("Test shape:{}".format(test.shape))

In [None]:
# Show first rows from train
train.head()

Feature description:
* PassengerId: ID of a passenger
* Survived: whether the passenger survived the sinking of the Titanic (1 = survived, 0 = did not survive)
* Pclass: passenger class (1, 2 or 3)
* Name: full name of the passenger
* Sex: passenger sex (male or female)
* Age: passenger age in years
* SibSp: number of siblings or spouses aboard the Titanic
* Parch: number of parents or children aboard the Titanic
* Ticket: ticket number
* Fare: passenger fare
* Cabin: passenger cabin number
* Embarked: port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown)

In [None]:
# Describe train numeric features
train.describe()

In [None]:
# Check which data is missing in train dataset
train.isnull().sum()

Only three features have missing data in the train set: Age, Cabin and Embarked.

In [None]:
# Check which data is missing in test dataset
test.isnull().sum()

In [None]:
# How many passengers survived?
train['Survived'].value_counts().plot(kind='bar')
train['Survived'].value_counts()

# Only around 38% of passengers in the train set survived

In [None]:
# How many passengers were in each class?
train['Pclass'].value_counts().sort_index().plot(kind='bar')
train['Pclass'].value_counts().sort_index()

# As we can see, most passengers traveled in 3rd class (low socioeconomic status)

In [None]:
# How many people survived in each class?
sns.countplot(x='Pclass', hue='Survived', data=train)
pd.crosstab(train['Pclass'], train['Survived'])

# Only in 1st class did more passengers survive than die. In 2nd class the split was close to even.

In [None]:
# How important was age to survival?
sns.violinplot(x='Survived', y='Age', data=train)

# As we can see, age mattered mainly for young children, who were more likely to survive.

In [None]:
# What are the mean, median, max and min age for each passenger class and sex?
train.groupby(['Pclass','Sex'])['Age'].aggregate(['mean','median','max','min'])

In [None]:
# Let's fill in the missing ages with the median age of passengers of the same class and sex.
train.loc[train['Age'].isnull(), 'Age'] = train.groupby(['Pclass','Sex'])['Age'].transform('median')
test.loc[test['Age'].isnull(), 'Age'] = test.groupby(['Pclass','Sex'])['Age'].transform('median')
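
To make the group-median fill easier to follow, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the Titanic files. `transform('median')` returns one value per row (the median of that row's group), so it lines up with the original index and can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from train.csv)
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age':    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age of each (Pclass, Sex) group, broadcast back to every row.
# Missing ages are skipped when the median is computed.
group_median = toy.groupby(['Pclass', 'Sex'])['Age'].transform('median')

# Replace only the missing ages; known ages stay untouched
toy['Age'] = toy['Age'].fillna(group_median)

print(toy)
```

Note that a group with no known ages at all would still end up with NaN, so it is worth re-checking `isnull().sum()` after the fill.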

In [None]:
# Bin age
bins = [0,10,20,30,40,50,60,70,80]
train['AgeBin'] = pd.cut(train['Age'], bins)
test['AgeBin'] = pd.cut(test['Age'], bins)

In [None]:
# Display age bins of survived passengers (a closer look at what can already be observed in the violin plot)
train[train['Survived'] == 1]['AgeBin'].value_counts().sort_index().plot(kind='bar')

In [None]:
# So, what share of passengers below the age of 18 survived?
train[train['Age'] < 18]['Survived'].value_counts(normalize=True).plot(kind='pie')

# As we can see, slightly more than half of the children (age < 18) survived.

In [None]:
# How important was sex for survival?
sns.catplot(hue='Sex',x='Survived',data=train, kind='count')
train.groupby('Sex')['Survived'].value_counts()
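
Raw counts are hard to compare when the groups differ in size. Since Survived is coded as 0/1, the mean of the column within a group is that group's survival rate. A minimal sketch on toy data (made-up rows, not the Titanic files):

```python
import pandas as pd

# Toy data (hypothetical values, not from train.csv)
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 column = fraction of ones = survival rate per group
rate = toy.groupby('Sex')['Survived'].mean()

print(rate)
```

The same one-liner should work for any of the grouping features used in this notebook (Pclass, Embarked, Parch, SibSp and so on).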

In [None]:
# How important was sex, within each passenger class, for survival?
sns.catplot(x='Pclass', y='Survived', hue='Sex', data=train, kind='violin', split=True)

In [None]:
# Before plotting data based on feature Embarked, we need to fill in missing values.
print(train['Embarked'].value_counts())

# We will fill in missing values with the most common one. In this case it is 'S'.
train.loc[train['Embarked'].isnull(), 'Embarked'] = 'S'

In [None]:
# Is the port where a passenger boarded the Titanic important for survival?
sns.violinplot(x='Embarked', y='Survived', data=train)
train.groupby('Embarked')['Survived'].value_counts()

# Passengers who boarded the Titanic in C were the most likely to survive: roughly half of them did.
# For passengers who boarded in S or Q the survival rate was lower.
# Conclusion: the port on its own is probably not a very important feature.

In [None]:
# Was having parents or children aboard important for survival?
sns.countplot(x='Parch', hue='Survived', data=train)
train.groupby('Parch')['Survived'].value_counts()

# Having more children or parents aboard slightly decreases the chances of survival.

In [None]:
# Was having siblings or a spouse aboard important for survival?
sns.countplot(x='SibSp', hue='Survived', data=train)
train.groupby('SibSp')['Survived'].value_counts()

# Having more than 2 siblings/spouses aboard decreases the chances of survival.

In [None]:
# Create new feature FamilySize = SibSp + Parch
train['FamilySize'] = train['SibSp'] + train['Parch']
test['FamilySize'] = test['SibSp'] + test['Parch']

In [None]:
# How important was family size for survival?
# (sns.factorplot was renamed to sns.catplot in newer seaborn versions)
sns.catplot(x='FamilySize', y='Survived', data=train, kind='bar')
pd.crosstab(train['FamilySize'], train['Survived'])

# Passengers that had 1-3 family members aboard had a better chance of survival,
# especially those with family size 3. Big families weren't so lucky.

In [None]:
# How much of the cabin data is missing?
print("Missing data in Cabin = {:.2%}".format(train['Cabin'].isnull().mean()))

# Although a lot of values are missing (so many that this feature should be excluded
# before training the model), let's try to get some insights from the data that is available.

In [None]:
# Create new feature (deck) based on first letter from cabin.
# Missing values will be assigned to deck 'n'.
train['Deck'] = train['Cabin'].astype(str).apply(lambda x: x[0])
test['Deck'] = test['Cabin'].astype(str).apply(lambda x: x[0])
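
The cell above relies on `astype(str)` turning a missing value into the text 'nan', whose first letter is 'n'. A slightly more explicit way to get the same deck letters is sketched below on toy cabin values (made up for illustration): the placeholder is filled in first and then the first character is taken. This is an alternative sketch, not what the notebook actually runs.

```python
import numpy as np
import pandas as pd

# Toy cabin values in the same format as the Cabin column (hypothetical)
cabins = pd.Series(['C85', np.nan, 'B57 B59', 'G6'])

# Fill missing cabins with the placeholder 'n', then keep the first character
deck = cabins.fillna('n').str[0]

print(deck.tolist())
```

Making the placeholder explicit avoids depending on how a missing value happens to be rendered as text.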

In [None]:
# How many passengers survived, depending on the deck where their cabin was located?
sns.countplot(x='Deck', hue='Survived', data=train, order=np.sort(train['Deck'].unique()))

# On decks B, C, D, E and F more passengers survived than died.

In [None]:
# How is the deck connected with fare and survival rate?
sns.boxplot(x='Deck', y='Fare', hue='Survived', data=train, order=np.sort(train['Deck'].unique()))
train.groupby('Deck')['Fare'].aggregate(['min','max','mean','median','count'])

# Decks that were more expensive had a higher survival rate.

In [None]:
# Is passenger class reflected in the fare?
g = sns.FacetGrid(train, hue="Pclass", height=4, aspect=2)
g = g.map(sns.distplot, "Fare", bins=5)

train.groupby('Pclass')['Fare'].aggregate(['min','max','mean','median','count'])

# As we can see, a higher class means a higher fare (blue=1st, orange=2nd, green=3rd).
# We could try to guess the passenger's deck based on fare and class:
# 3rd -> G,F,T
# 2nd -> A,E,D
# 1st -> C,B
# But we will not do it, because too much data is missing
# and we could fit the model on wrong features.

In [None]:
# Let's try to get some insights from Name
# Get title from passenger name: the word that directly precedes a dot.
# ([A-z] would also match the characters [ \ ] ^ _ `, so [A-Za-z] is used instead.)
train['Title'] = train['Name'].apply(lambda x: re.search(r' ([A-Za-z]+)\.', x).group(1))
test['Title'] = test['Name'].apply(lambda x: re.search(r' ([A-Za-z]+)\.', x).group(1))

# Get count of all titles
sns.countplot(y='Title', hue='Survived', data=train)
train['Title'].value_counts()
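
To show what the title regex picks up, here is a small standalone sketch. The names are illustrative strings written in the "Surname, Title. First names" format of the Name column, and `extract_title` is a hypothetical helper, not part of the notebook. Unlike the cell above, it returns None instead of raising an error when a name has no title.

```python
import re

# A space, then letters only, then a literal dot: " Mr.", " Miss.", " Countess."
TITLE_RE = re.compile(r' ([A-Za-z]+)\.')

def extract_title(name):
    """Return the title found in a passenger name, or None if there is none."""
    match = TITLE_RE.search(name)
    return match.group(1) if match else None

# Illustrative names in the dataset's format
names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'No title here',
]

titles = [extract_title(name) for name in names]
print(titles)
```

In the third name the word "the" is skipped because it is followed by a space rather than a dot, so the match lands on "Countess".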

In [None]:
# There are a lot of titles with a small count. Let's group them in some way.
title_dict = {
    'Mr': 'Mr', # adult man (regardless of marital status)
    'Mrs': 'Mrs', # adult woman (married women, widows, and divorcées)
    'Mme': 'Mrs', # French title (Madame), equivalent to English Mrs
    'Ms': 'Mrs', # adult woman (regardless of marital status)
    'Miss': 'Miss', # female children and unmarried women
    'Mlle': 'Miss', # french title given to an unmarried woman (equivalent to english Miss)
    'Master':  'Master', # male children (young boys)
    'Major': 'Officer', # military rank
    'Col': 'Officer', # military rank
    'Capt': 'Officer', # military rank
    'Lady': 'Royalty',
    'Sir': 'Royalty',
    'Don': 'Royalty',
    'Dona': 'Royalty',
    'Countess': 'Royalty', 
    'Jonkheer': 'Royalty', # lowest rank within the nobility
    'Rev': 'Rev', # the Reverend
    'Dr': 'Dr' # academic title
}

train['Title'] = train['Title'].map(title_dict)
test['Title'] = test['Title'].map(title_dict)

In [None]:
# How important was the (grouped) title for survival?
sns.countplot(x='Title', hue='Survived', data=train)
pd.crosstab(train['Title'], train['Survived'])

# This confirms what we already know: women and children were saved first.
# One important thing we can see is that young boys (Master) were saved less often than women and girls.
# This is what the plot below suggests.

In [None]:
# Since the young boys were not so eagerly saved, it's worth taking a closer look at them.
print(train[train['Title'] == 'Master']['Age'].aggregate(['min','max','mean','median']))

survived_boys = train[(train['Title'] == 'Master') &
                    (train['Survived'] == 1)]
dead_boys = train[(train['Title'] == 'Master') &
                    (train['Survived'] == 0)]

sns.distplot(survived_boys['Age'], bins=5, color='orange')
sns.distplot(dead_boys['Age'], bins=5, color='blue')

# There is no pattern that would let us predict which boys survived and which didn't.

In [None]:
# How were the class and age of boys connected with their survival?
boys = train[train['Title'] == 'Master']
sns.catplot(x='Pclass', y='Age', hue='Survived', data=boys, kind='violin', split=True)
pd.crosstab(boys['Pclass'], boys['Survived'])

# Young boys were saved first only if
# they traveled in 1st or 2nd class (every one of them survived).

In [None]:
# Before training the model, it is better to change Fare from numerical values to categories.
# Let's assume the categories below:
# (79, 700]  = very_high
# (44, 79]   = high
# (19, 44]   = above_average
# (9, 19]    = normal
# (-1, 9]    = cheap
# NaN        = unknown (filled with -10, which falls into (-20, -1])

# Create bins and corresponding categories
fare_bins = [-20, -1, 9, 19, 44, 79, 700]
fare_bins_cat = ['unknown','cheap','normal','above_average','high','very_high']

# Fill in missing values with value -10
train['Fare'] = train['Fare'].fillna(-10)
test['Fare'] = test['Fare'].fillna(-10)

# Create new feature FareBin
train['FareBin'] = pd.cut(train['Fare'], fare_bins, labels=fare_bins_cat)
test['FareBin'] = pd.cut(test['Fare'], fare_bins, labels=fare_bins_cat)
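
A quick sketch of how the sentinel value and the bin edges interact, using the same bins and labels as above but a handful of made-up fares. `pd.cut` uses right-closed intervals by default, so a fare of exactly 9 still counts as 'cheap', and the sentinel -10 lands in the first bin, 'unknown'.

```python
import numpy as np
import pandas as pd

fare_bins = [-20, -1, 9, 19, 44, 79, 700]
fare_bins_cat = ['unknown', 'cheap', 'normal', 'above_average', 'high', 'very_high']

# Toy fares (hypothetical values): one missing, one free ticket, one on a bin edge
fares = pd.Series([np.nan, 0.0, 7.25, 9.0, 71.28, 512.33])

# Missing fares get the sentinel -10, which falls into (-20, -1] = 'unknown'
binned = pd.cut(fares.fillna(-10), fare_bins, labels=fare_bins_cat)

print(binned.astype(str).tolist())
```

Any fare above 700 would fall outside the last bin and become NaN, so the upper edge has to stay above the largest fare in the data.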

In [None]:
# Is fare range connected with survival?
sns.countplot(x='FareBin', hue='Survived', data=train)

# This confirms what we already know: the higher the price, the higher the survival rate

In [None]:
# Get Ticket prefix
#train['TicketPrefix'] = train['Ticket'].apply(lambda x: re.search('([A-z]+.*)( )', x).group(1)
#                                              .replace('.','').replace(' ','') 
#                                              if re.search('[A-z]+.* ', x) != None 
#                                              else 'NUMBER')

Feature preparation

In [None]:
# We need to save passengerId from test set, because it will be needed in creating the output file.
passengerId = test['PassengerId']

# First, let's remove all unnecessary features
features_to_remove = ['PassengerId','Ticket','Cabin','Name','Age','Fare']
train.drop(features_to_remove, axis=1, inplace=True)
test.drop(features_to_remove, axis=1, inplace=True)

In [None]:
# Second, we need to convert categorical features (objects) to numerical ones before fitting the model.
# There are two ways to achieve this: use pandas get_dummies or use LabelEncoder.
# I will use LabelEncoder to keep the number of features low.
# To keep the code clean I will create a simple function that encodes the selected features.
def encode_feature(train, test, features):
    data_combined = pd.concat([train[features], test[features]])
    for feature in features:
        label_encoder = LabelEncoder().fit(data_combined[feature])
        train[feature] = label_encoder.transform(train[feature])
        test[feature] = label_encoder.transform(test[feature])
    return train, test

train, test = encode_feature(train, test, ['Sex','Embarked','AgeBin','Deck','Title','FareBin'])
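
For comparison, here is a minimal sketch of the two encoding options mentioned above, on a toy column. To keep it pandas-only, `cat.codes` is used as a stand-in for LabelEncoder; both assign integers to the sorted unique values. This is an illustration, not the code the notebook runs.

```python
import pandas as pd

# Toy column (hypothetical values)
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Label-style encoding: one integer column, categories sorted alphabetically (C=0, Q=1, S=2)
codes = toy['Embarked'].astype('category').cat.codes

# One-hot encoding: one indicator column per category
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked')

print(codes.tolist())
print(dummies)
```

Label encoding keeps a single column but implies an order (C < Q < S) that has no real meaning for ports. One-hot encoding avoids that at the cost of extra columns.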

In [None]:
# Let's check our data after encoding features
train.head()

In [None]:
# Plot correlation of data
plt.figure(figsize=(16,8))
sns.heatmap(train.corr(), linewidths=.5, annot=True)

Model

In [None]:
# Model imports
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

In [None]:
# Get training features (X) and output label (y)
X = train.drop(['Survived'], axis=1)
y = train['Survived']

# Split X data to train and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Simple function that evaluates a model with 5-fold stratified cross-validation
def score(model, X, y):
    cv = StratifiedKFold(n_splits=5, shuffle=True)
    scores = cross_val_score(model, X, y, cv=cv)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
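
Because the folds above are shuffled without a seed, the reported accuracy can change a little from run to run, which makes small differences between models hard to judge. Below is a minimal sketch of a seeded variant. `score_seeded` and the seed value are hypothetical, and the data is synthetic (generated with `make_classification`), not the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

def score_seeded(model, X, y, seed=7):
    """Hypothetical variant of score(): the same folds are used on every call."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv)

first = score_seeded(LogisticRegression(max_iter=1000), X_demo, y_demo)
second = score_seeded(LogisticRegression(max_iter=1000), X_demo, y_demo)

print("Accuracy: %0.2f (+/- %0.2f)" % (first.mean(), first.std()))
print("Same scores on both runs:", (first == second).all())
```

With fixed folds, all models are evaluated on exactly the same splits, so their scores are directly comparable.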

In [None]:
rfc = RandomForestClassifier(random_state=7)
score(rfc, X, y)

In [None]:
lr = LogisticRegression(random_state=7)
score(lr, X, y)

In [None]:
xgb = XGBClassifier(random_state=7)
score(xgb, X, y)

Submission

In [None]:
# We will submit predictions from XGBClassifier
# Fit model with all data
xgb.fit(X,y)
# Make predictions for test data
predictions = xgb.predict(test)

# Create submission file
submission = pd.DataFrame({
        'PassengerId': passengerId,
        'Survived': predictions
    })
submission.to_csv('submission.csv', index=False)

NOTES:
* AgeSentinel - fill missing age data with an unusual value (for example -100), so the model knows that this data should be treated differently.
* Select parameters for the models to boost accuracy.
* Change the order of categorical features before passing them to LabelEncoder (order is important!)
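
The last note can be sketched as follows. An encoder that sorts the category names alphabetically loses the natural order of the fare categories, while an explicitly ordered categorical keeps it. The example uses pandas only (`cat.codes` as a stand-in for LabelEncoder, which also sorts its classes) and a few made-up values. It is one possible approach, not something this notebook implements.

```python
import pandas as pd

# Natural order of the fare categories, from lowest to highest
fare_order = ['unknown', 'cheap', 'normal', 'above_average', 'high', 'very_high']

# Toy values (hypothetical)
fare_bin = pd.Series(['high', 'cheap', 'unknown', 'very_high', 'normal'])

# Alphabetical codes: cheap=0, high=1, normal=2, unknown=3, very_high=4
alphabetical = fare_bin.astype('category').cat.codes.tolist()

# Codes that follow fare_order: unknown=0, cheap=1, normal=2, ..., very_high=5
ordered = pd.Categorical(fare_bin, categories=fare_order, ordered=True).codes.tolist()

print(alphabetical)
print(ordered)
```

With the ordered codes a larger number always means a more expensive ticket, which is easier for tree-based and linear models to exploit.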

References:
* [titanic-eda-keras-nn-pipelines](https://www.kaggle.com/kabure/titanic-eda-keras-nn-pipelines)
* [a-comprehensive-ml-workflow-with-python](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python)