# Overview
The Titanic competition is a classification problem where we are tasked with predicting whether or not a passenger survived the ship's sinking.

**Goal**: place in the top 10% of the rolling leaderboard.

In [None]:
# initial imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# set the random seed state for reproducibility
random_state = 42

In [None]:
# import train and test set
train_df = pd.read_csv('../input/titanic/train.csv')
test_df = pd.read_csv('../input/titanic/test.csv')

In [None]:
# confirm train_df
train_df.head()

In [None]:
# confirm test_df
test_df.head()

---
# EDA

In [None]:
# get basic info on the train_df
train_df.info()

In [None]:
# get basic info on the test_df
test_df.info()

**Dimensions:**

The training data has 891 rows and 12 columns. Survived is the dependent variable.

Our test data has 418 rows and 11 columns. ~32% of our overall data is in the test set.

**Data Dictionary:**

* PassengerId - int. Key.
* Survived - int. Survival. 0 = No, 1 = Yes. Boolean.
* Pclass - int. Ticket class.
* Name - str. Name of passenger.
* Sex - str. Sex of passenger.
* Age - float. Age of passenger. Estimated ages take the form xx.5. Has nulls.
* SibSp - int. # of siblings/spouses on board.
* Parch - int. # of parents/children aboard the Titanic.
* Ticket - str. Ticket number.
* Fare - float. Fare paid by passenger. Has nulls.
* Cabin - str. Cabin of passenger. Has nulls.
* Embarked - str. Port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton. Has nulls.

**Initial Hypotheses:**

* Young children will survive.
* Young women will survive.
* Rich people will survive.
* Location on ship matters.
* Titles will or will not survive (depending on title).
* Larger families will not survive.

### Survived EDA

In [None]:
# Survived
sns.catplot(x='Survived', data=train_df, kind='count').set(title='Survived')
plt.show()

In [None]:
survived_perc = round((train_df['Survived'].sum()) / len(train_df.index) * 100,2)
print(f'Percentage who survived: {survived_perc}%')

## Categorical EDA

### Pclass EDA

In [None]:
# PClass
sns.catplot(x='Pclass', data=train_df, kind='count').set(title='Pclass')
plt.show()

In [None]:
# PClass and Survived
sns.catplot(x='Pclass', hue='Survived', data=train_df, kind='count').set(title='Pclass and Survived')
plt.show()

As expected, Pclass appears to be correlated with survival.
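
The count plot can be backed up with a number. Below is a minimal sketch of computing the survival rate per class with a groupby. The `toy_df` frame and its values are made up for illustration; on the real data the same call would be made on `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of survivors in each group
survival_by_class = toy_df.groupby('Pclass')['Survived'].mean()
print(survival_by_class)
```

A rate per group is easier to compare than raw counts when the groups differ a lot in size.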

### Sex EDA

In [None]:
# Sex
sns.catplot(x='Sex', data=train_df, kind='count').set(title='Sex')
plt.show()

In [None]:
# Sex and Survived
sns.catplot(x='Sex', hue='Survived', data=train_df, kind='count').set(title='Sex and Survived')
plt.show()

Sex appears to be correlated with survival.

### Embarked EDA

In [None]:
# Embarked
sns.catplot(x='Embarked', data=train_df, kind='count').set(title='Embarked')
plt.show()

In [None]:
# Embarked and Survived
sns.catplot(x='Embarked', hue='Survived', data=train_df, kind='count').set(title='Embarked and Survived')
plt.show()

Embarked may be correlated with Pclass.

In [None]:
# Embarked and Pclass
sns.catplot(x='Embarked', hue='Pclass', data=train_df, kind='count').set(title='Embarked and Pclass')
plt.show()

S is disproportionately 3rd class. I would consider this a spurious correlation as Pclass is likely the underlying structure driving survival here.
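
One way to check how class mix differs by port is a row-normalized crosstab. This is a sketch on made-up data (`toy_df` is hypothetical); with the real data the two columns would come from `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Pclass':   [3, 3, 3, 1, 1, 1, 3, 2],
})

# normalize='index' makes each row (port) sum to 1,
# so every cell is the share of that port's passengers in a class
class_share = pd.crosstab(toy_df['Embarked'], toy_df['Pclass'], normalize='index')
print(class_share)
```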

## Quantitative EDA

In [None]:
# basic distributions of train_df
train_df.describe()

In [None]:
# basic distributions of test_df
test_df.describe()

### Age EDA

In [None]:
# Survived and Age
sns.boxplot(x='Survived', y='Age', data=train_df).set(title='Survived and Age')
plt.show()

In [None]:
# Pclass and Age
sns.boxplot(x='Pclass', y='Age', data=train_df).set(title='Pclass and Age')
plt.show()

1st class is older than second class which is older than 3rd class on average. Second class has fewer children.

In [None]:
# Sex and Age
sns.boxplot(x='Sex', y='Age', data=train_df).set(title='Sex and Age')
plt.show()

Men were slightly older in general and had all the elderly individuals.

---
# Modeling

In [None]:
train_df.info()

In [None]:
test_df.info()

## Null Imputation
The following columns have nulls in the train_df:

* Age
* Cabin
* Embarked

The following columns have nulls in the test_df:

* Age
* Fare
* Cabin

In [None]:
# make copies of original data
e_train_df = train_df.copy()
e_test_df = test_df.copy()

In [None]:
# categorical impute
e_train_df['Cabin'] = e_train_df['Cabin'].fillna('Missing') 
e_train_df['Embarked'] = e_train_df['Embarked'].fillna(e_train_df['Embarked'].mode()[0])

e_test_df['Cabin'] = e_test_df['Cabin'].fillna('Missing')
e_test_df['Embarked'] = e_test_df['Embarked'].fillna(e_test_df['Embarked'].mode()[0])

In [None]:
# checking Age for distribution and outliers
sns.boxplot(x='Age', data=train_df).set(title='Age Distribution')
plt.show()

In [None]:
# checking Fare for distribution and outliers
sns.boxplot(x='Fare', data=train_df).set(title='Fare Distribution')
plt.show()

It appears that Age is relatively symmetrical and that Fare is right skewed. I will use mean imputation for Age and median imputation for Fare.
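
The boxplot reading can also be checked numerically. The sketch below uses made-up values (`toy_df` is hypothetical) to show how skewness and the gap between mean and median separate a symmetric column from a right-skewed one.

```python
import pandas as pd

# Toy values (hypothetical): Age is symmetric, Fare has one very large value
toy_df = pd.DataFrame({
    'Age':  [22, 25, 28, 30, 32, 35, 38],
    'Fare': [7, 8, 8, 10, 13, 26, 250],
})

# Skewness near 0 suggests symmetry; a large positive value means a long right tail
skewness = toy_df.skew()
print(skewness)

# In a right-skewed column the mean is pulled above the median
print(toy_df['Fare'].mean(), toy_df['Fare'].median())
```

When the mean sits well above the median, the median is usually the safer fill value because a few extreme fares do not move it.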

In [None]:
# quantitative impute
e_train_df['Age'] = e_train_df['Age'].fillna(e_train_df['Age'].mean())
e_train_df['Fare'] = e_train_df['Fare'].fillna(e_train_df['Fare'].median())

e_test_df['Age'] = e_test_df['Age'].fillna(e_test_df['Age'].mean())
e_test_df['Fare'] = e_test_df['Fare'].fillna(e_test_df['Fare'].median())
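
The cell above fills each set with its own statistics. A common alternative, shown here only as a sketch on made-up data, is to compute the fill values on the training set and reuse them for the test set so both sets are treated identically. `toy_train` and `toy_test` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames (hypothetical values)
toy_train = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0],
                          'Fare': [7.0, 8.0, 100.0, 9.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 50.0],
                         'Fare': [np.nan, 12.0]})

# Fill values are learned from the training data only
age_fill = toy_train['Age'].mean()      # mean of 20, 30, 40
fare_fill = toy_train['Fare'].median()  # median of 7, 8, 9, 100

# The same values are applied to both frames
fill_values = {'Age': age_fill, 'Fare': fare_fill}
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
print(toy_test)
```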

In [None]:
# confirm the imputation worked
e_train_df.info()

In [None]:
# confirm the imputation worked
e_test_df.info()

## Feature Engineering

### Age Bucket
Bucket age into 'Child', 'Adult', 'Elderly'.

In [None]:
# age_bucket
age_bins = [0, 18, 65, 100]
age_labels = ['child','adult', 'elderly']

e_train_df['age_bucket'] = pd.cut(x=e_train_df['Age'], bins=age_bins,
                    labels=age_labels).astype('object')

e_test_df['age_bucket'] = pd.cut(x=e_test_df['Age'], bins=age_bins,
                    labels=age_labels).astype('object')

In [None]:
# age_bucket and survived
sns.catplot(x='age_bucket', hue='Survived', data=e_train_df, kind='count').set(title='age_bucket and Survived')
plt.show()

Being either a child or elderly increased survival rate.

### Role
I noticed in examining the data that certain names have titles (Mr, Ms, Don, Capt, etc.) and I want to bucket these into roles.

In [None]:
# extract titles
e_train_df[['last_name','intermediate']] = e_train_df['Name'].str.split(', ', expand=True)
e_train_df[['title','first_name']] = e_train_df['intermediate'].str.split('.', n=1, expand=True)
e_train_df = e_train_df.drop(columns=['last_name', 'intermediate', 'first_name'])

e_test_df[['last_name','intermediate']] = e_test_df['Name'].str.split(', ', expand=True)
e_test_df[['title','first_name']] = e_test_df['intermediate'].str.split('.', n=1, expand=True)
e_test_df = e_test_df.drop(columns=['last_name', 'intermediate', 'first_name'])
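
The same titles can be pulled out in one step with a regular expression, which avoids the temporary columns. This is a sketch that assumes every name follows the "Last, Title. First" pattern; the sample names are only there to show the pattern.

```python
import pandas as pd

# Sample names in the "Last, Title. First" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the first comma and the next period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Mrs', 'the Countess']
```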

In [None]:
# unique titles
train_title_set = set(e_train_df['title'].tolist())
test_title_set = set(e_test_df['title'].tolist())
title_sorted = sorted(train_title_set.union(test_title_set))

print(title_sorted)

In [None]:
# assign titles to roles
def assign_role(row):
    if row['title'] in ['Capt', 'Col', 'Major']:
        return 'officer'
    elif row['title'] in ['Don', 'Dona', 'Dr', 'Jonkheer', 'Lady', 'Master', 'Rev', 'Sir', 'the Countess']:
        return 'important'
    elif row['title'] in ['Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms']:
        return 'average'
    
e_train_df['role'] = e_train_df.apply(assign_role, axis=1)
e_test_df['role'] = e_test_df.apply(assign_role, axis=1)
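
`assign_role` returns `None` for any title that is not in one of its three lists, so an unexpected title would silently become a null role. A dictionary lookup is one way to do the same assignment and surface unmapped titles. This is a sketch: `role_by_title` is a hypothetical helper built from the same three groups, and 'Unknown' is a made-up title used to show the check.

```python
import pandas as pd

# Same three groups as assign_role, flattened into a title -> role lookup
role_groups = {
    'officer': ['Capt', 'Col', 'Major'],
    'important': ['Don', 'Dona', 'Dr', 'Jonkheer', 'Lady', 'Master',
                  'Rev', 'Sir', 'the Countess'],
    'average': ['Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms'],
}
role_by_title = {title: role
                 for role, titles in role_groups.items()
                 for title in titles}

# 'Unknown' is a made-up title to show what an unmapped value looks like
titles = pd.Series(['Mr', 'Col', 'Lady', 'Unknown'])
roles = titles.map(role_by_title)

# Titles with no role come back as NaN and can be listed explicitly
unmapped = titles[roles.isna()].tolist()
print(roles.tolist())
print(unmapped)  # ['Unknown']
```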

In [None]:
# role and survived
sns.catplot(x='role', hue='Survived', data=e_train_df, kind='count').set(title='Role and Survived')
plt.show()

Both "important" and "officer" roles appear to improve survival.


### Ship Location
Ship location matters due to proximity to lifeboats and upper deck access.


In [None]:
# see if there is a connection between Ticket and Cabin
e_train_df.loc[e_train_df['Cabin'].notnull(), ['Ticket', 'Cabin']].head(10)

Cabin itself is a high cardinality feature and ticket does not appear useful in any obvious way.

We will try and extract ship_location using the Cabin's first letter.

In [None]:
# unique cabins from each data set
train_set = set(e_train_df['Cabin'].tolist())
test_set = set(e_test_df['Cabin'].tolist())
sorted_set = sorted(train_set.union(test_set))

print(sorted_set)

Some passengers have multiple cabins, but they all appear to be in the same general ship_location.

In [None]:
# ship_location
e_train_df['ship_location'] = e_train_df['Cabin'].astype(str).str[0]
e_test_df['ship_location'] = e_test_df['Cabin'].astype(str).str[0]

I want to see how ship location relates to survival and Pclass.

In [None]:
# ship_location and survived
sns.catplot(x='ship_location', hue='Survived', data=e_train_df, kind='count').set(title='ship_location and Survived')
plt.show()

Some locations appear to correlate with higher chances of survival.

This also makes me think that passenger class correlates with having been assigned a cabin at all.

In [None]:
# ship_location and Pclass
sns.catplot(x='ship_location', hue='Pclass', data=e_train_df, kind='count').set(title='ship_location and Pclass')
plt.show()

Pclass correlates with having been assigned a cabin.
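
To put a number on this, one option is a 0/1 flag for whether a cabin was recorded, averaged by class. The sketch below uses made-up rows (`toy_df` and `has_cabin` are hypothetical names) and assumes missing cabins were filled with 'Missing' as in the imputation step.

```python
import pandas as pd

# Toy stand-in for e_train_df after Cabin nulls were filled with 'Missing'
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Cabin':  ['C85', 'Missing', 'Missing', 'Missing', 'Missing', 'G6'],
})

# 1 if a cabin was recorded, 0 otherwise
toy_df['has_cabin'] = (toy_df['Cabin'] != 'Missing').astype(int)

# Share of passengers with a recorded cabin in each class
cabin_rate = toy_df.groupby('Pclass')['has_cabin'].mean()
print(cabin_rate)
```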

### Family Size
Family size is the total number of siblings, spouses, parents and children aboard.

In [None]:
# family_size
e_train_df['family_size'] = e_train_df['SibSp'] + e_train_df['Parch']
e_test_df['family_size'] = e_test_df['SibSp'] + e_test_df['Parch']

I want to see how family_size relates to survival.

In [None]:
# survival and family_size
sns.boxplot(x='Survived', y='family_size', data=e_train_df).set(title='Survived and family_size')
plt.show()

It appears larger overall families survive more. This surprises me. Let's examine it further by treating it like a categorical variable.

In [None]:
# family_size and survived
sns.catplot(x='family_size', hue='Survived', data=e_train_df, kind='count').set(title='Survived and family_size')
plt.show()

Solo travelers and those with small families (3 or fewer) do better by far.



In [None]:
# travel_solo
e_train_df['travel_solo'] = e_train_df.apply(lambda row: 1 if row['family_size']==0 else 0, axis=1)
e_test_df['travel_solo'] = e_test_df.apply(lambda row: 1 if row['family_size']==0 else 0, axis=1)
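
The same flag can be built without a row-wise `apply` by comparing the whole column at once, which is shorter and usually faster. A minimal sketch on made-up rows (`toy_df` is hypothetical):

```python
import pandas as pd

# Toy stand-in for e_train_df (hypothetical values)
toy_df = pd.DataFrame({'SibSp': [0, 1, 0, 3], 'Parch': [0, 0, 2, 2]})

toy_df['family_size'] = toy_df['SibSp'] + toy_df['Parch']

# The comparison yields a boolean column; astype(int) turns it into 0/1
toy_df['travel_solo'] = (toy_df['family_size'] == 0).astype(int)
print(toy_df)
```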

## Model Preparation

In [None]:
e_train_df.info()

In [None]:
e_test_df.info()

In [None]:
# make a copy of enhanced dataframes for modeling
m_train_df = e_train_df.copy()
m_test_df = e_test_df.copy()

# drop cols
drop_list = ['Name', 'Ticket', 'Cabin', 'title', 'family_size']
m_train_df.drop(columns=drop_list, inplace=True)
m_test_df.drop(columns=drop_list, inplace=True)

# dummy variables
dummy_list = ['Pclass', 'Sex', 'Embarked', 'age_bucket', 'role', 'ship_location']
m_train_df = pd.get_dummies(m_train_df, columns=dummy_list)
m_test_df = pd.get_dummies(m_test_df, columns=dummy_list)

In [None]:
m_train_df.info()

## Column Confirmation
We need to confirm that our DataFrames have the same columns at this point.

In [None]:
train_set = set(m_train_df.columns)
test_set = set(m_test_df.columns)

test_train_diff_set = test_set - train_set
train_test_diff_set = train_set - test_set
print(f'Columns in test but not in train: {test_train_diff_set}')
print(f'Columns in train but not in test: {train_test_diff_set}')

It appears `pd.get_dummies` did not create 'ship_location_T' on the test set because no test passenger has a cabin in that location. ('Survived' is expected to be absent from the test set.)

In [None]:
m_test_df['ship_location_T'] = 0
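
Adding the column by hand works here, but the model is later fed `.values`, so the column order of the two frames also has to match. A more general approach, sketched below on made-up frames, is to reindex the test set to the training feature columns. The `toy_train`, `toy_test` and `loc_*` names are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical): test lacks 'loc_T', has an extra 'loc_B',
# and lists its columns in a different order
toy_train = pd.DataFrame({'Survived': [0, 1], 'Fare': [7.0, 80.0],
                          'loc_A': [0, 1], 'loc_T': [1, 0]})
toy_test = pd.DataFrame({'Fare': [8.0], 'loc_B': [1], 'loc_A': [0]})

# Every training column except the target
feature_cols = [col for col in toy_train.columns if col != 'Survived']

# reindex adds missing columns (filled with 0), drops columns the
# training set never saw, and puts everything in the training order
toy_test = toy_test.reindex(columns=feature_cols, fill_value=0)
print(toy_test)
```

Note that this drops any dummy column that exists only in the test set, which is what a model trained without that column needs.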

In [None]:
train_set = set(m_train_df.columns)
test_set = set(m_test_df.columns)

test_train_diff_set = test_set - train_set
train_test_diff_set = train_set - test_set
print(f'Columns in test but not in train: {test_train_diff_set}')
print(f'Columns in train but not in test: {train_test_diff_set}')

## Super Feature
Combine the most important variables into a single predictor.

Note: I did lots of experimentation here to find the right combination. It came down to treating Fare as a proxy for social class (to bolster Pclass) plus the strongest binary flags.

In [None]:
m_train_df['fare_pclass_solo_female'] = m_train_df['Fare'] * (m_train_df['Pclass_1'] + m_train_df['travel_solo'] + m_train_df['Sex_female'] + m_train_df['age_bucket_child'] + m_train_df['age_bucket_elderly'])
m_test_df['fare_pclass_solo_female'] = m_test_df['Fare'] * (m_test_df['Pclass_1'] + m_test_df['travel_solo'] + m_test_df['Sex_female'] + m_test_df['age_bucket_child'] + m_test_df['age_bucket_elderly'])

## Scaling

In [None]:
# standardize
from sklearn.preprocessing import StandardScaler

standardize_list = ['Age', 'SibSp', 'Parch', 'Fare', 'fare_pclass_solo_female']

train_features = m_train_df[standardize_list]
train_scaler = StandardScaler().fit(train_features.values)
train_features = train_scaler.transform(train_features.values)

m_train_df[standardize_list] = train_features

test_features = m_test_df[standardize_list]
test_scaler = StandardScaler().fit(test_features.values)
test_features = test_scaler.transform(test_features.values)

m_test_df[standardize_list] = test_features
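
The cell above fits one scaler per data set, so the same raw value can map to different scaled values in train and test. The more usual pattern, sketched here on made-up numbers, is to fit the scaler on the training data only and reuse it for the test data. This is an alternative for comparison, not what produced the score reported in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature arrays (hypothetical values)
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[20.0], [40.0]])

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(toy_train)

# Apply the same transformation to both sets
train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)

# A raw value of 20 maps to 0 in both sets because the training mean is 20
print(train_scaled.ravel())
print(test_scaled.ravel())
```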

In [None]:
m_train_df.info()

In [None]:
m_test_df.info()

## Model Implementation

In [None]:
# split into x and y
dependent_variable = 'Survived'
m_train_df.drop(columns=['PassengerId'], inplace=True)

y_train = m_train_df[dependent_variable].copy()
x_train = m_train_df.drop(columns=[dependent_variable]).copy()

# assure dependent variable is gone
x_train.head()

In [None]:
# use RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    criterion='gini',
    n_estimators=1000,
    min_samples_split=10,
    min_samples_leaf=1,
    max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    oob_score=True,
    random_state=random_state,
    n_jobs=-1
)

fitted_model = model.fit(x_train.values, y_train.values)
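
Because the forest is built with `oob_score=True`, it exposes an out-of-bag accuracy estimate that can be read before submitting. The sketch below shows this on synthetic data; `X` and `y` are made up, and on the notebook's data the same attributes would be read from `fitted_model`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): only the first two features carry signal
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = RandomForestClassifier(
    n_estimators=200,
    min_samples_split=10,
    oob_score=True,
    random_state=42,
)
model.fit(X, y)

# Accuracy on the samples each tree did not see during training
print(f'OOB accuracy: {model.oob_score_:.3f}')

# Relative importance of each feature; the values sum to 1
importances = model.feature_importances_
print(importances)
```

The out-of-bag score is a rough stand-in for a validation score and costs no extra data, though it can differ from the leaderboard score.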

---
# Submission

In [None]:
# construct submission
submission_df = pd.DataFrame()
submission_df['PassengerId'] = m_test_df['PassengerId'].copy()

m_test_df.drop(columns=['PassengerId'], inplace=True)

submission_df['Survived'] = fitted_model.predict(m_test_df.values)

submission_df.to_csv('submission.csv', index=False)

## Result
My public score for this notebook was 0.80622 at the time of submission.

This put me at 320/13413 (top 3%) on the leaderboard.