This document is a thorough overview of my process for building a predictive model for Kaggle's Titanic competition. I will walk through all the essential steps in building this model, as well as the reasoning behind each decision I made. This model achieves a score of 82.78%, which is in the top 3% of all submissions at the time of this writing. This is a great introductory modeling exercise due to the simple nature of the data, yet there is still a lot to be gleaned from following a process that ultimately yields a high score.

You can get my original code on my GitHub: https://github.com/zlatankr/Projects/tree/master/Titanic  
You can also read my write-up on my blog: https://zlatankr.github.io/posts/2017/01/30/kaggle-titanic

### The Problem

We are given information about a subset of the Titanic population and asked to build a predictive model that tells us whether or not a given passenger survived the shipwreck. We are given 10 basic explanatory variables, including passenger gender, age, and price of fare, among others. More details about the competition can be found on the Kaggle site, [here](https://www.kaggle.com/c/titanic). This is a classic binary classification problem, and we will be implementing a random forest classifier.

### Exploratory Data Analysis

The goal of this section is to gain an understanding of our data in order to inform what we do in the feature engineering section.  

We begin our exploratory data analysis by loading our standard modules.

In [None]:
import os
from typing import Tuple, List

import pandas as pd
import numpy as np

We then load the data, which we have downloaded from the Kaggle website ([here](https://www.kaggle.com/c/titanic/data) is a link to the data if you need it).

In [None]:
train = pd.read_csv(os.path.join('../input', 'train.csv'))
test = pd.read_csv(os.path.join('../input', 'test.csv'))

First, let's take a look at the summary of all the data. Immediately, we note that `Age`, `Cabin`, and `Embarked` have nulls that we'll have to deal with. 

In [None]:
train.info()
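
If you prefer an explicit count of nulls per column over reading the `info()` summary, something like the following sketch can help. It runs on a tiny made-up frame (the values are illustrative, not taken from the dataset); on the real data you would call the same methods on `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; values are made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "S"],
})

# Count nulls per column, then keep only the columns that have any
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(columns_with_nulls)
```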

It appears that we can drop the `PassengerId` column, since it is merely an index. Some people have reportedly improved their score by keeping the `PassengerId` column, but my cursory attempt to do so did not yield positive results. Moreover, I would like to mimic a real-life scenario, where the index of a dataset generally has no correlation with the target variable.

In [None]:
train.head()

### Feature Engineering

Having done our cursory exploration of the variables, we now have a pretty good idea of how we want to transform our variables in preparation for our final dataset. We will perform our feature engineering through a series of helper functions that each serve a specific purpose. 

This first function creates two separate columns: a numeric column indicating the length of a passenger's `Name` field, and a categorical column that extracts the passenger's title.

In [None]:
def process_name(df: pd.DataFrame) -> pd.DataFrame:
    """
    Creates two separate columns: a numeric column indicating the length of a
    passenger's Name field, and a categorical column that extracts the
    passenger's title.

    Args:
        df: The train or test set
    
    """
    df_new = df.copy()
    df_new['Name_Len'] = df_new['Name'].apply(len)
    df_new['Name_Title'] = df_new['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
    return df_new
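
To see what the title extraction does to individual names, here is a small standalone sketch that applies the same string operations to two names formatted like those in the dataset. The helper name `extract_title` is my own, used only for illustration, and is not part of the pipeline.

```python
def extract_title(name: str) -> str:
    """Return the first token after the comma, mirroring the Name_Title logic."""
    after_comma = name.split(',')[1]   # e.g. ' Mr. Owen Harris'
    return after_comma.split()[0]      # first whitespace-separated token


names = [
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [extract_title(n) for n in names]
name_lengths = [len(n) for n in names]
print(titles)  # ['Mr.', 'the']
```

Note that a multi-word title is cut down to its first token, so the second name ends up with the title `the`. Since the column is only used as a categorical feature, this is harmless, but it is worth knowing when you read the resulting dummy column names.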

Next, we impute the null values of the `Age` column by filling in the mean value of the passenger's corresponding title and class. This more granular approach to imputation should be more accurate than merely taking the mean age of the population.

In [None]:
def _get_mean_age_if_exist(name_title: str, p_class: int, age_mean: np.float64, grouped_age_means: pd.Series):
    """
    Given a title and class returns train set mean age if known, else global train set mean age.

    Args:
        name_title: Passenger title
        p_class: Passenger class
        age_mean: Global train set mean age
        grouped_age_means: Train set mean age given a name title and p class
    
    """
    if (name_title, p_class) in grouped_age_means.index:
        return grouped_age_means[name_title, p_class]
    else:
        return age_mean


def impute_age(df: pd.DataFrame, age_mean: np.float64, grouped_age_means: pd.Series) -> pd.DataFrame:
    """
    Imputes the null values of the Age column by filling in the mean value of
    the passenger's corresponding title and class.

    Args:
        df: The train or test set
        age_mean: Global train set mean age
        grouped_age_means: Train set mean age given a name title and p class
    
    """
    df_new = df.copy()
    df_new['Age_Null_Flag'] = df_new['Age'].isnull().apply(int)
    df_new.loc[df_new['Age'].isnull(), "Age"] = df_new.loc[df_new['Age'].isnull()].apply(
        lambda x: _get_mean_age_if_exist(x["Name_Title"], x["Pclass"], age_mean, grouped_age_means),
        axis=1
    ).copy()
    return df_new
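
The fallback logic is easier to follow on a toy example. The sketch below uses made-up data and a plain dictionary lookup instead of the helper functions above, so treat it as an illustration of the idea rather than the exact pipeline code: group means are computed on the training rows only, and a (title, class) pair that never appears in training falls back to the global training mean.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy_train = pd.DataFrame({
    "Name_Title": ["Mr.", "Mr.", "Miss.", "Miss."],
    "Pclass": [3, 3, 1, 1],
    "Age": [20.0, 30.0, 40.0, np.nan],
})
toy_test = pd.DataFrame({
    "Name_Title": ["Mr.", "Dr."],
    "Pclass": [3, 1],
    "Age": [np.nan, np.nan],
})

# Statistics are fit on the training data only
age_mean = toy_train["Age"].mean()
grouped = toy_train.groupby(["Name_Title", "Pclass"])["Age"].mean()
lookup = grouped.to_dict()  # keys are (title, class) tuples


def impute(row: pd.Series) -> float:
    """Use the group mean when the (title, class) pair is known, else the global mean."""
    if pd.isnull(row["Age"]):
        return lookup.get((row["Name_Title"], row["Pclass"]), age_mean)
    return row["Age"]


toy_test["Age"] = toy_test.apply(impute, axis=1)
print(toy_test)
```

Here the third-class "Mr." receives the group mean of 25, while the "Dr." in first class, a combination absent from the toy training set, receives the global mean of 30.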

We combine the `SibSp` and `Parch` columns into a new variable that indicates family size, and group the family size variable into three categories.

In [None]:
def size_family(df: pd.DataFrame) -> pd.DataFrame:
    """
    Combines the SibSp and Parch columns into a new variable that indicates
    family size, and group the family size variable into three categories.

    Args:
        df: The train or test set
    
    """
    df_new = df.copy()
    df_new['Family_Size'] = np.where(
        (df_new['SibSp'] + df_new['Parch']) == 0, 'Solo',
        np.where(
            (df_new['SibSp'] + df_new['Parch']) <= 3, 'Nuclear',
            'Big'
        )
    )
    return df_new
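
As an aside, the same three-way grouping can be written with `pd.cut`, which some readers find easier to scan than nested `np.where` calls. This is an alternative sketch on made-up data, not the code used for the submission.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({"SibSp": [0, 1, 1, 3, 5], "Parch": [0, 0, 2, 1, 2]})
relatives = toy["SibSp"] + toy["Parch"]  # 0, 1, 3, 4, 7

# Bins are right-inclusive: (-1, 0] -> Solo, (0, 3] -> Nuclear, above 3 -> Big
toy["Family_Size"] = pd.cut(
    relatives,
    bins=[-1, 0, 3, np.inf],
    labels=["Solo", "Nuclear", "Big"],
).astype(str)
print(toy)
```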

We also fill in the missing values of `Fare` in our test set with the mean value of `Fare` from the training set (transformations of test set data must always be fit using training data).

In [None]:

def fill_fare_na(df: pd.DataFrame, fare_mean: float) -> pd.DataFrame:
    """
    Fills NA Fares values with fitted mean value.

    Args:
        df: The train or test set
        fare_mean: Train set mean fare
    
    """
    df_new = df.copy()
    df_new['Fare'] = df_new['Fare'].fillna(fare_mean)
    return df_new

The `Ticket` column is used to create three new columns: `Ticket_Letter`, which indicates the first character of each ticket; `Ticket_Category`, which keeps the most common first characters as their own categories and groups the smaller-n values based on survival rate; and `Ticket_Length`, which indicates the length of the `Ticket` field.

In [None]:
def group_ticket(df: pd.DataFrame) -> pd.DataFrame:
    """
    The Ticket column is used to create three new columns: Ticket_Letter, which
    indicates the first letter of each ticket (with the smaller-n values being
    grouped based on survival rate); Ticket_Category, which indicates the category
    of the ticket, and Ticket_Length, which indicates the length of the Ticket field.

    Args:
        df: The train or test set
    
    """
    letter_group_1 = ['1', '2', '3', 'S', 'P', 'C', 'A']
    letter_group_2 = ['W', '4', '7', '6', 'L', '5', '8']

    df_new = df.copy()
    df_new['Ticket_Letter'] = df_new['Ticket'].apply(lambda x: str(x)[0])
    df_new['Ticket_Letter'] = df_new['Ticket_Letter'].apply(str)
    df_new['Ticket_Category'] = np.where(
                                (df_new['Ticket_Letter']).isin(letter_group_1), df_new['Ticket_Letter'],
                                np.where(
                                    (df_new['Ticket_Letter']).isin(letter_group_2), 'Low_ticket',
                                    'Other_ticket'
                                )
                            )
    df_new['Ticket_Length'] = df_new['Ticket'].apply(len)
    return df_new
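
To make the grouping concrete, here is a plain-Python sketch of the same categorization applied to a few ticket strings in the style of the dataset. The function name `ticket_category` is hypothetical and exists only for this illustration.

```python
LETTER_GROUP_1 = {'1', '2', '3', 'S', 'P', 'C', 'A'}
LETTER_GROUP_2 = {'W', '4', '7', '6', 'L', '5', '8'}


def ticket_category(ticket: str) -> str:
    """Map a ticket to its first character or to one of two catch-all groups."""
    first = str(ticket)[0]
    if first in LETTER_GROUP_1:
        return first              # common first characters keep their own category
    if first in LETTER_GROUP_2:
        return 'Low_ticket'
    return 'Other_ticket'         # anything not listed in either group


examples = ["A/5 21171", "347082", "W./C. 6608", "F.C.C. 13529"]
categories = [ticket_category(t) for t in examples]
print(categories)  # ['A', '3', 'Low_ticket', 'Other_ticket']
```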

The following two functions extract the first letter of the `Cabin` column and its number, respectively. 

In [None]:
def get_cabin_first_letter(df: pd.DataFrame) -> pd.DataFrame:
    """
    Extracts the first letter of the Cabin column.

    Args:
        df: The train or test set
    
    """
    df_new = df.copy()
    df_new['Cabin_Letter'] = df_new['Cabin'].apply(lambda x: str(x)[0])
    return df_new

In [None]:
def get_cabin_number(df: pd.DataFrame) -> pd.DataFrame:
    """
    Extracts the number of the Cabin column.

    Args:
        df: The train or test set
    
    """
    df_new = df.copy()
    df_new['Cabin_Number'] = df_new['Cabin'].apply(lambda x: str(x).split(' ')[-1][1:])
    # str(NaN) is 'nan', so stripping the first character leaves 'an' for missing cabins
    df_new['Cabin_Number'] = df_new['Cabin_Number'].replace('an', np.nan)
    df_new['Cabin_Number'] = df_new['Cabin_Number'].apply(
        lambda x: int(x) if not pd.isnull(x) and x != '' else np.nan
    )
    return df_new


def get_dummy_cabin_number(df: pd.DataFrame, cabin_number_bins: np.ndarray) -> pd.DataFrame:
    """
    Get the category from cabin number and dummify it.

    Args:
        df: The train or test set
        cabin_number_bins: Cabin number bin edges
    
    """
    df_new = df.copy()
    dummies_cols = [f"Cabin_Number_{i}" for i in range(3)]

    df_new.loc[:, dummies_cols] = 0

    concerned_index = df_new['Cabin_Number'].dropna().index

    if df_new.loc[~df_new["Cabin_Number"].isna()].shape[0] > 0:
        categories = pd.cut(
            df_new.loc[concerned_index, 'Cabin_Number'],
            bins=cabin_number_bins,
            labels=False,
            include_lowest=True
        )
        df_new.loc[concerned_index, dummies_cols] = pd.get_dummies(categories, prefix="Cabin_Number")

    return df_new
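
The cabin number categories follow a fit-on-train, apply-everywhere pattern: the bin edges are computed once from the training data with `pd.qcut`, and those same edges are then passed to `pd.cut`. The sketch below shows the pattern on made-up numbers; the variable names are mine and the values are not from the dataset.

```python
import pandas as pd

# Made-up cabin numbers for illustration
train_numbers = pd.Series([2, 10, 20, 35, 50, 80, 100, 120, 148], dtype=float)
test_numbers = pd.Series([5, 40, 130, 200], dtype=float)

# Fit: tercile edges come from the training values only
_, bins = pd.qcut(train_numbers, 3, retbins=True, labels=False)

# Apply: reuse the same edges for any other data
test_codes = pd.cut(test_numbers, bins=bins, labels=False, include_lowest=True)
print(bins)
print(test_codes.tolist())
```

The last test value (200) lies above the largest training value, so it falls outside every bin and comes back as `NaN`. That is something to keep in mind whenever bins are fit on one dataset and applied to another.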

We fill the null values in the `Embarked` column with the most commonly occurring value, which is `'S'`.

In [None]:

def impute_embarked(df: pd.DataFrame) -> pd.DataFrame:
    """
    Fills the null values in the Embarked column with the most commonly
    occurring value, which is 'S'.

    Args:
        df: The train or test set
    
    """
    df_new = df.copy()
    df_new['Embarked'] = df_new['Embarked'].fillna('S')
    return df_new
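
The function above hardcodes `'S'`. If you would rather derive the fill value from the training data, in keeping with the rule that transformations are fit on the training set, a sketch along these lines could work. The data here is made up and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
toy_train = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["C", np.nan]})

# mode() ignores nulls by default and returns a Series; take its first entry
most_common_port = toy_train["Embarked"].mode().iloc[0]

toy_train["Embarked"] = toy_train["Embarked"].fillna(most_common_port)
toy_test["Embarked"] = toy_test["Embarked"].fillna(most_common_port)
print(most_common_port)
```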

Next, because we are using scikit-learn, we must convert our categorical columns into dummy variables. The following function does this and makes sure that every category seen in the training set is present as a column in both the training and test datasets. The original categorical columns are dropped afterwards by our last helper function.

In [None]:
def dummy_cols(df: pd.DataFrame, dummy_columns: List[str], dummy_columns_values: List[str]) -> pd.DataFrame:
    """
    Converts our categorical columns into dummy variables, and then drops the
    original categorical columns. It also makes sure that each category is
    present in both the training and test datasets.

    Args:
        df: The train or test set
        dummy_columns: The columns to be dummified.
        dummy_columns_values: The dummified columns values
            Each different value from the train set dummy columns correspond to one feature.
    
    """
    df_new = df.copy()
    df_new.loc[:, dummy_columns_values] = 0
    df_new.loc[:, dummy_columns] = df_new.loc[:, dummy_columns].applymap(str)

    dummies = pd.get_dummies(
        df_new[dummy_columns], columns=dummy_columns, prefix=dummy_columns
    )

    dummies = dummies.drop([col for col in dummies.columns if col not in dummy_columns_values], axis=1)

    df_new.loc[:, dummies.columns] = dummies
    return df_new
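
The key idea is that the set of dummy columns is fixed by the training data. A compact way to illustrate this, different from the function above but with the same outcome on this toy input, is to dummify each set separately and then reindex the test columns to match the training columns.

```python
import pandas as pd

# Made-up data for illustration
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["S", "X", "C"]})  # 'X' never appears in train

train_dummies = pd.get_dummies(toy_train, columns=["Embarked"], dtype=int)
test_dummies = pd.get_dummies(toy_test, columns=["Embarked"], dtype=int)

# Keep exactly the training columns: missing ones are filled with 0,
# and columns for categories unseen in training are discarded.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

After alignment the test frame has an all-zero `Embarked_Q` column and no `Embarked_X` column, so a model trained on the training columns can score it without a shape mismatch.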

Our last helper function drops the columns we no longer need: the raw text columns, the original categorical columns that have been replaced by dummy variables, and the `PassengerId` column, which we have decided is not useful for our problem (by the way, I've confirmed this with a separate test). Note that dropping the `PassengerId` column here means that we'll have to load it later when creating our submission file.

In [None]:
def drop_cols(df: pd.DataFrame, drop_columns: List[str]) -> pd.DataFrame:
    """
    Drops columns in the given list.

    Args:
        df: The train or test set
        drop_columns: The columns to be dropped.
    
    """
    df_new = df.copy()
    df_new = df_new.drop(drop_columns, axis=1)
    return df_new
        

In [None]:
def process_data(train: pd.DataFrame, test: pd.DataFrame, dummy_columns: list, drop_columns: list) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Apply all necessary transformations to clean train and test data

    Args:
        train: The train set
        test: The test set
        drop_columns: The columns to be dropped.
        dummy_columns: The columns to be dummified.
            
    Returns:
        new_train
        new_test
        
    """
    train_processed = train.copy()
    test_processed = test.copy()

    train_processed = process_name(train_processed)
    test_processed = process_name(test_processed)

    # Compute mean values from train
    age_mean = train_processed["Age"].mean()
    grouped_age_means = train_processed.groupby(['Name_Title', 'Pclass'])['Age'].mean()

    fare_mean = train_processed['Fare'].mean()

    train_processed = impute_age(train_processed, age_mean, grouped_age_means)
    test_processed = impute_age(test_processed, age_mean, grouped_age_means)

    train_processed = fill_fare_na(train_processed, fare_mean)
    test_processed = fill_fare_na(test_processed, fare_mean)

    train_processed = impute_embarked(train_processed)
    test_processed = impute_embarked(test_processed)

    train_processed = size_family(train_processed)
    test_processed = size_family(test_processed)

    train_processed = group_ticket(train_processed)
    test_processed = group_ticket(test_processed)

    train_processed = get_cabin_number(train_processed)
    test_processed = get_cabin_number(test_processed)

    # Compute cabin number bins from train
    _, cabin_number_bins = pd.qcut(train_processed['Cabin_Number'], 3, retbins=True, labels=False)

    train_processed = get_dummy_cabin_number(train_processed, cabin_number_bins)
    test_processed = get_dummy_cabin_number(test_processed, cabin_number_bins)

    train_processed = get_cabin_first_letter(train_processed)
    test_processed = get_cabin_first_letter(test_processed)

    # Compute dummy column values from train
    dummy_columns_values = pd.get_dummies(
        train_processed.loc[:, dummy_columns].applymap(str), prefix=dummy_columns
    ).columns

    train_processed = dummy_cols(train_processed, dummy_columns, dummy_columns_values)
    test_processed = dummy_cols(test_processed, dummy_columns, dummy_columns_values)

    train_processed = drop_cols(train_processed, drop_columns)
    test_processed = drop_cols(test_processed, drop_columns)

    return train_processed, test_processed

Having built our helper functions, we can now execute them in order to build our dataset that will be used in the model:

In [None]:
drop_columns = ['Name', 'SibSp', 'Parch', 'Cabin', 'Ticket', 'Ticket_Letter', 'Pclass', 'Sex', 'Embarked',
             'Ticket_Category', 'Cabin_Letter', 'Name_Title', 'Family_Size', 'PassengerId', 'Cabin_Number']
dummy_columns = ['Pclass', 'Sex', 'Embarked', 'Ticket_Category', 'Cabin_Letter', 'Name_Title', 'Family_Size']

train_processed, test_processed = process_data(train, test, dummy_columns, drop_columns) 

We can see that our final dataset has 55 columns, composed of our target column and 54 predictor variables. Although high-dimensional datasets can result in high variance, I think we should be fine here.

In [None]:
print(len(train_processed.columns))

### Hyperparameter Tuning

We will use grid search to identify the optimal parameters of our random forest model. Because our training dataset is quite small, we can get away with testing a wider range of hyperparameter values. When I ran this on my 8 GB Windows machine, the process took less than ten minutes. I will not run it here for the sake of saving myself time, but I will discuss the results of this grid search.

from sklearn.model_selection import GridSearchCV  
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_features='sqrt',
                                oob_score=True,
                                random_state=1,
                                n_jobs=-1)

param_grid = { "criterion"   : ["gini", "entropy"],
             "min_samples_leaf" : [1, 5, 10],
             "min_samples_split" : [2, 4, 10, 12, 16],
             "n_estimators": [50, 100, 400, 700, 1000]}

gs = GridSearchCV(estimator=rf,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=3,
                  n_jobs=-1)

gs = gs.fit(train_processed.iloc[:, 1:], train_processed.iloc[:, 0])

print(gs.best_score_)   
print(gs.best_params_)  
print(gs.cv_results_)

Looking at the results of the grid search:  

0.838383838384  
{'min_samples_split': 10, 'n_estimators': 700, 'criterion': 'gini', 'min_samples_leaf': 1}  

...we can see that the optimal values of `min_samples_split` and `n_estimators` are not at the endpoints of the ranges we provided (and `min_samples_leaf` is already at its minimum of 1), meaning that we do not have to test more values. What else can we say about our optimal values? The `min_samples_split` parameter is at 10, which should help mitigate overfitting to a certain degree. The relatively large number of estimators (700) mostly stabilizes the ensemble's predictions at the cost of longer training time; adding more trees to a random forest does not by itself make it overfit.
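
For a sense of how much work this grid search involves, the following sketch counts the hyperparameter combinations in the grid and the number of model fits implied by 3-fold cross-validation. It uses only the standard library and the grid values listed in this section.

```python
from itertools import product

param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 4, 10, 12, 16],
    "n_estimators": [50, 100, 400, 700, 1000],
}
cv_folds = 3

# Every combination of values is evaluated once per fold
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)  # 150 450
```

With scikit-learn's default `refit=True`, one more fit on the full training set follows for the best combination.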

### Model Estimation and Evaluation

We are now ready to fit our model using the optimal hyperparameters. The out-of-bag score can give us an unbiased estimate of the model accuracy, and we can see that the score is 83.73%, which is only a little higher than our final leaderboard score.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(criterion='gini', 
                             n_estimators=700,
                             min_samples_split=10,
                             min_samples_leaf=1,
                             max_features='sqrt',
                             oob_score=True,
                             random_state=1,
                             n_jobs=-1)
rf.fit(train_processed.iloc[:, 1:], train_processed.iloc[:, 0])
print(f'{rf.oob_score_:.4f}')

Let's take a brief look at our variable importance according to our random forest model. We can see that some of the original columns we predicted would be important in fact were, including gender, fare, and age. But we also see title, name length, and ticket length feature prominently, so we can pat ourselves on the back for creating such useful variables.

In [None]:
pd.concat((pd.DataFrame(train_processed.iloc[:, 1:].columns, columns=['variable']), 
           pd.DataFrame(rf.feature_importances_, columns=['importance'])), 
          axis=1).sort_values(by='importance', ascending=False)[:20]
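
A slightly more compact way to build the same ranking is to wrap the importances in a `pd.Series` indexed by feature name. The sketch below uses a hypothetical subset of feature names and made-up importance values purely to show the shape of the code; with the fitted model you would pass `rf.feature_importances_` and the predictor column names instead.

```python
import numpy as np
import pandas as pd

feature_names = ["Age", "Fare", "Name_Len", "Sex_male"]  # hypothetical subset
importances = np.array([0.20, 0.25, 0.15, 0.40])         # made-up values

# Label each importance with its feature, then sort from most to least important
ranking = (
    pd.Series(importances, index=feature_names, name="importance")
    .sort_values(ascending=False)
    .head(20)
)
print(ranking)
```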

Our last step is to predict the target variable for our test data and generate an output file that will be submitted to Kaggle. 

In [None]:
predictions = rf.predict(test_processed)
predictions = pd.DataFrame(predictions, columns=['Survived'])
test = pd.read_csv(os.path.join('../input', 'test.csv'))
predictions = pd.concat((test.iloc[:, 0], predictions), axis=1)
predictions.to_csv('y_test_predictions.csv', sep=',', index=False)
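
Before uploading, it can be worth running a few sanity checks on the submission frame. The sketch below assembles a frame from toy passenger IDs and toy predictions (both made up here) and checks the two-column layout and the 0/1 values. It is an optional illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

passenger_ids = pd.Series([892, 893, 894], name="PassengerId")  # toy IDs
toy_predictions = np.array([0, 1, 0])                            # toy model output

# Build both columns in one step so the rows stay aligned by position
submission = pd.DataFrame({
    "PassengerId": passenger_ids.to_numpy(),
    "Survived": toy_predictions.astype(int),
})

# Basic checks on layout and content
assert list(submission.columns) == ["PassengerId", "Survived"]
assert len(submission) == len(passenger_ids)
assert submission["Survived"].isin([0, 1]).all()
print(submission)
```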

### Conclusion

This exercise is a good example of how far basic feature engineering can take you. It is worth mentioning that I did try various other models before arriving at this one. Some of the other variations I tried were different groupings for the categorical variables (plenty more combinations remain), linear discriminant analysis on a couple of numeric columns, and eliminating more variables, among other things. This is a competition with a generous allotment of submission attempts, and as a result, it's quite possible that even the leaderboard score is an overestimate of the true quality of the model, since the leaderboard can act as more of a validation score than a true test score.

I welcome any comments and suggestions.