# Data preparation pipelines and model optimization for the [Kaggle Titanic dataset](https://www.kaggle.com/c/titanic)

This kernel illustrates the use of [scikit-learn Pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for data preparation and easily reproducible transformation.

First import the required libraries:

In [1]:
import pandas as pd, sklearn, numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams["font.size"] = "16"

Load the datasets. Note that for simplicity, we keep the label field `Survived` in the `train` dataframe until actually starting transformations and predictions; the `test` set has no labels.

In [2]:
train = pd.read_csv('./kaggle_titanic_dataset/train.csv')
test = pd.read_csv('./kaggle_titanic_dataset/test.csv')
n_train, m_train = train.shape

Take a peek at the columns:

In [3]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


`train.info()` shows that many values in the `Age` and `Cabin` columns are missing:

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


Check statistical summary for numerical attributes:

In [5]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Important things to note here:
- The classes are somewhat imbalanced, as only 38% of the passengers survived. This could be important to take into account in cross-validation.
- Some values in the `Fare` column are zero.
- As noted above, many `Age` values are missing.
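
One way to take the class imbalance into account is to use stratified folds, which keep the share of survivors roughly equal in every fold. The following is a minimal sketch on toy labels (the array `y` below is an assumption that only mimics the 38% survival rate, not the real data). For classifiers, passing an integer `cv` to scikit-learn's `GridSearchCV` already uses stratified folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with roughly the same class balance as the training set (38% positive)
y = np.array([1] * 38 + [0] * 62)
X = np.zeros((len(y), 1))  # the features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positive labels in the held-out part of each fold
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_rates)
```

Each entry of `fold_rates` should stay close to the overall rate of 0.38.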

As there are lots of data exploration kernels out there, I will skip any further data exploration here and focus on the actual data transformation pipelines. Data exploration shows that `Age` is an important indicator for survival and the missing values should be imputed. The data processing steps carried out below are:
1. Split the dataset into features and labels. To simplify the kernel, we drop `Ticket`, `Cabin`, `Embarked`, and of course `PassengerId` from the feature set.
2. Extract the title from the `Name` field and use it as a feature to impute the missing ages. For example, the title `Mr.` says that the person in question was not a child.
3. One-hot encode all categorical attributes.
4. Convert the dataframe to a NumPy matrix.

First we split the dataset into features and labels and give them the common names `X_train` and `y_train`. Note that the features are still a dataframe; the conversion to NumPy is done just before the prediction algorithms.

In [6]:
def drop_unused_columns(df):
    return df.drop(['PassengerId', 'Cabin', 'Ticket', 'Embarked'], axis=1)

def to_features_and_labels(df):
    y = df['Survived'].values
    X = drop_unused_columns(df)
    X = X.drop('Survived', axis=1)
    return X, y

# Assume that PassengerId, Cabin, Ticket and Embarked do not matter (Name is kept for title extraction)
X_train, y_train = to_features_and_labels(train)
X_test = drop_unused_columns(test)
X_train.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.05 |


First write a [scikit-learn transformer](http://scikit-learn.org/stable/data_transforms.html) for converting the name to a title. You may think that there's a lot of overhead in the classes (and that's true), but the transformer classes are highly reusable and may therefore save a lot of time in the future.

In [7]:
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameColumnMapper(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, mapping_func, new_column_name=None):
        self.column_name = column_name
        self.mapping_func = mapping_func
        self.new_column_name = new_column_name if new_column_name is not None else self.column_name
    def fit(self, X, y=None):
        # Nothing to do here
        return self
    def transform(self, X):
        transformed_column = X.transform({self.column_name: self.mapping_func})
        Y = X.copy()
        Y = Y.assign(**{self.new_column_name: transformed_column})
        if self.column_name != self.new_column_name:
            Y = Y.drop(self.column_name, axis=1)
        return Y

# Return a lambda function that extracts the title from the full name; this allows compiling the pattern only once
def extract_title():
    import re
    # Raw string so that \w is not treated as a string escape sequence
    pattern = re.compile(r', (\w*)')
    return lambda name: pattern.search(name).group(1)

# Example usage and output 
df = DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title').fit_transform(X_train)
df.head()

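| | Pclass | Sex | Age | SibSp | Parch | Fare | Title |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | Mr |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | Mrs |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | Miss |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | Mrs |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | Mr |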
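
To see what the pattern in `extract_title` actually captures, here is a small standalone sketch. The first two names come from the training data; the third is a hypothetical name, added only to show that the pattern keeps just the first word after the comma, so a multi-word title is cut short.

```python
import re

# Same pattern as in extract_title: the first word after "<surname>, "
pattern = re.compile(r', (\w*)')

names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Smith, the Countess. of Example',  # hypothetical name with a multi-word title
]
titles = [pattern.search(name).group(1) for name in names]
print(titles)  # ['Mr', 'Miss', 'the']
```

Rare or oddly parsed titles such as `the` are harmless as long as infrequent values are grouped together before they are used as a feature.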


Let's take a look at the extracted titles. The slice `[1:10]` skips the most frequent title, `Mr`:

In [8]:
df['Title'].value_counts()[1:10]

Miss      182
Mrs       125
Master     40
Dr          7
Rev         6
Major       2
Mlle        2
Col         2
Ms          1
Name: Title, dtype: int64

It seems that only the few most frequent titles (say, five) would be useful for imputing ages, so let us write a transformer that replaces the less frequent values of a categorical attribute with "Other".

In [9]:
class CategoricalTruncator(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, n_values_to_keep=5):
        self.column_name = column_name
        self.n_values_to_keep = n_values_to_keep
        self.values = None
    def fit(self, X, y=None):
        # Here we must ensure that the test set is transformed similarly in the later phase and that the same values are kept
        self.values = list(X[self.column_name].value_counts()[:self.n_values_to_keep].keys())
        return self
    def transform(self, X):
        transform = lambda x: x if x in self.values else 'Other'
        y = X.transform({self.column_name: transform})
        return X.assign(**{self.column_name: y})

CategoricalTruncator('Title', n_values_to_keep=3).fit_transform(df)['Title'].value_counts()

Mr       517
Miss     182
Mrs      125
Other     67
Name: Title, dtype: int64
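
The reason for learning the kept values in `fit` is that the test set may contain titles that never occur in the training set. A minimal pandas-only sketch of the same idea, on made-up toy series (`train_titles` and `test_titles` are assumptions for illustration):

```python
import pandas as pd

train_titles = pd.Series(['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Mrs', 'Dr'])
test_titles = pd.Series(['Mr', 'Dona', 'Mrs', 'Miss'])

# "fit": remember the two most frequent values seen in the training data
kept = list(train_titles.value_counts().index[:2])

# "transform": anything outside the remembered values becomes 'Other',
# including values that only appear in the test data
truncated_test = test_titles.where(test_titles.isin(kept), 'Other')
print(kept)                     # ['Mr', 'Miss']
print(truncated_test.tolist())  # ['Mr', 'Other', 'Other', 'Miss']
```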

Let us see what we have done so far by putting the transformers together in a pipeline:

In [10]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('name_to_title', DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title')),
    ('truncate_titles', CategoricalTruncator('Title', n_values_to_keep=3))
])

df = pipeline.fit_transform(X_train)
df['Title'].value_counts()

Mr       517
Miss     182
Mrs      125
Other     67
Name: Title, dtype: int64

Now write a generic imputer that groups the rows by a reference column and fills the missing values of another column with the median (or mean) of the corresponding group.

In [88]:
class ImputerByReference(BaseEstimator, TransformerMixin):
    def __init__(self, column_to_impute, column_ref, impute_type='median'):
        self.column_to_impute = column_to_impute
        self.column_ref = column_ref
        # Store every constructor argument so that get_params() and clone() work
        self.impute_type = impute_type
    def fit(self, X, y=None):
        # Pick columns of interest
        df = X.loc[:, [self.column_to_impute, self.column_ref]]
        # Dictionary containing the median (or mean) per group
        how = 'mean' if self.impute_type == 'mean' else 'median'
        self.value_per_group = df.groupby(self.column_ref)[self.column_to_impute].agg(how).to_dict()
        return self
    def transform(self, X):
        def transform(row):
            row_copy = row.copy()
            if pd.isnull(row_copy.at[self.column_to_impute]):
                row_copy.at[self.column_to_impute] = self.value_per_group[row_copy.at[self.column_ref]]
            return row_copy
        return X.apply(transform, axis=1)

ImputerByReference('Age', 'Title').fit_transform(df).head(10)

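| | Pclass | Sex | Age | SibSp | Parch | Fare | Title |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | Mr |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | Mrs |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | Miss |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | Mrs |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | Mr |
| 5 | 3 | male | 32.36809 | 0 | 0 | 8.4583 | Mr |
| 6 | 1 | male | 54.0 | 0 | 0 | 51.8625 | Mr |
| 7 | 3 | male | 2.0 | 3 | 1 | 21.075 | Other |
| 8 | 3 | female | 27.0 | 0 | 2 | 11.1333 | Mrs |
| 9 | 2 | female | 14.0 | 1 | 0 | 30.0708 | Mrs |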
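
The row-wise `apply` above is easy to follow but slow on large frames. As a rough sketch of a vectorized alternative (not the transformer used in this kernel), the group statistics can be mapped onto the reference column and passed to `fillna`. The frame `toy` below is made-up data for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age': [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# Median age per title; missing values are ignored by default
group_medians = toy.groupby('Title')['Age'].median()

# Look up the group median for every row and use it only where Age is missing
imputed_age = toy['Age'].fillna(toy['Title'].map(group_medians))
print(imputed_age.tolist())  # [20.0, 40.0, 30.0, 10.0, 20.0, 30.0]
```

A group that was not seen when computing `group_medians` would stay missing here instead of raising a `KeyError`.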


The full pipeline so far is below. Let us use `.info()` to check that no values are missing any more:

In [89]:
pipeline = Pipeline([
    ('name_to_title', DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title')),
    ('truncate_titles', CategoricalTruncator('Title', n_values_to_keep=3)),
    ('impute_ages_by_title', ImputerByReference('Age', 'Title'))
])

df = pipeline.fit_transform(X_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Pclass    891 non-null int64
Sex       891 non-null object
Age       891 non-null float64
SibSp     891 non-null int64
Parch     891 non-null int64
Fare      891 non-null float64
Title     891 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 48.8+ KB


Now we one-hot encode all categorical attributes, again using a generic transformer. Note that the encoding gets huge if there are columns with many different values (like `Ticket`). The first step is to convert all categorical attributes to numerical ones by factorizing them to integers (for example, `Mr` becomes 0 and `Mrs` becomes 1):

In [90]:
class CategoricalToOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    def fit(self, X, y=None):
        # Categorical attributes: by default, all non-numeric columns (stored as a list of column names)
        if self.columns is None:
            self.columns = list(X.select_dtypes(exclude='number').columns)
        
        # Keep track of which categorical attributes are assigned to which integer. This is important 
        # when transforming the test set.
        mappings = {}
        
        for col in self.columns:
            labels, uniques = X.loc[:, col].factorize()
            int_and_cat = list(enumerate(uniques))
            cat_and_int = [(x[1], x[0]) for x in int_and_cat]
            mappings[col] = {'int_to_cat': dict(int_and_cat), 'cat_to_int': dict(cat_and_int)}
    
        self.mappings = mappings
        return self

    def transform(self, X):
        Y = X.copy()
        for col in self.columns:
            transformed_col = Y.loc[:, col].transform(lambda x: self.mappings[col]['cat_to_int'][x])
            for key, val in self.mappings[col]['cat_to_int'].items():
                one_hot = (transformed_col == val) + 0
                Y = Y.assign(**{'{}_{}'.format(col, key): one_hot})
            Y = Y.drop(col, axis=1)
        return Y
    
CategoricalToOneHotEncoder().fit_transform(df).head()   

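| | Pclass | Age | SibSp | Parch | Fare | Sex_male | Sex_female | Title_Miss | Title_Mr | Title_Mrs | Title_Other |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 0 | 0 | 1 | 0 | 0 |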


Finally define a transformer for converting the DataFrame to a NumPy matrix and build the data preparation pipeline. Here one could also include, e.g., `StandardScaler` for algorithms that are sensitive to differences in scale.

In [136]:
from sklearn.preprocessing import MinMaxScaler
# SimpleImputer replaces the Imputer class that was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer

class DataFrameToValuesTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # Remember the order of attributes before converting to NumPy
        self.attribute_order = list(X)
        return self
    def transform(self, X):
        return X.loc[:, self.attribute_order].values

def build_preprocessing_pipeline():
    return Pipeline([
        ('name_to_title', DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title')),
        ('truncate_titles', CategoricalTruncator(column_name='Title', n_values_to_keep=3)),
        ('impute_ages_by_title', ImputerByReference(column_to_impute='Age', column_ref='Title')),
        ('encode_categorical_onehot', CategoricalToOneHotEncoder()),
        ('encode_pclass_onehot', CategoricalToOneHotEncoder(columns=['Pclass'])),
        ('to_numpy', DataFrameToValuesTransformer()),
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', MinMaxScaler())
    ])

X_train_prepared = build_preprocessing_pipeline().fit_transform(X_train)
print('Prepared training data: {} samples, {} features'.format(*X_train_prepared.shape))

Prepared training data: 891 samples, 13 features
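
The last two steps of the pipeline learn their statistics (medians, minima and maxima) during `fit` and reuse them in `transform`, so new data is imputed and scaled with values from the training data only. A small sketch on made-up arrays, assuming a scikit-learn version that provides `SimpleImputer`:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

numeric_steps = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler()),
])

X_fit = np.array([[0.0], [10.0], [np.nan], [20.0]])  # median 10, range 0..20
X_new = np.array([[np.nan], [30.0]])

numeric_steps.fit(X_fit)
scaled_new = numeric_steps.transform(X_new)
print(scaled_new.ravel())  # [0.5 1.5]
```

The missing value is replaced by the training median (10) and scaled to 0.5, while 30 lies outside the training range and therefore maps to 1.5.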


In [137]:
def build_pipeline(classifier=None):
    preprocessing_pipeline = build_preprocessing_pipeline()
    return Pipeline([
        ('preprocessing', preprocessing_pipeline),
        ('classifier', classifier) # Expected to be filled by grid search
    ])

from sklearn.metrics import accuracy_score, precision_score, make_scorer
from sklearn.model_selection import GridSearchCV

def build_grid_search(pipeline, param_grid):
    return GridSearchCV(pipeline, param_grid, cv=5, return_train_score=True, refit='accuracy',
                        scoring={ 'accuracy': make_scorer(accuracy_score),
                                   'precision': make_scorer(precision_score)
                                },
                        verbose=1)

def pretty_cv_results(cv_results, 
                      sort_by='rank_test_accuracy',
                      sort_ascending=True,
                      n_rows=5):
    df = pd.DataFrame(cv_results)
    cols_of_interest = [key for key in df.keys() if key.startswith('param_') 
                        or key.startswith('mean_train') 
                        or key.startswith('mean_test_')
                        or key.startswith('rank')]
    return df.loc[:, cols_of_interest].sort_values(by=sort_by, ascending=sort_ascending).head(n_rows)

def run_grid_search(grid_search):
    grid_search.fit(X_train, y_train)
    print('Best test score accuracy is:', grid_search.best_score_)
    return pretty_cv_results(grid_search.cv_results_)

Trying different algorithms is now straightforward. Choose the parameters to vary and run the grid search with cross-validation to find the best combination of preprocessing parameters and classifier.
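
The keys of the parameter grids follow scikit-learn's `<step>__<parameter>` naming convention, which also works across nested pipelines. A minimal sketch with stand-in steps (the `MinMaxScaler` and `LogisticRegression` here are only placeholders, not the pipeline built above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

inner = Pipeline([('scaler', MinMaxScaler())])
outer = Pipeline([('preprocessing', inner), ('classifier', LogisticRegression())])

# Double underscores walk down the nesting: step name, then parameter name
outer.set_params(preprocessing__scaler__feature_range=(-1, 1), classifier__C=0.5)

params = outer.get_params()
print(params['preprocessing__scaler__feature_range'], params['classifier__C'])
```

A whole step can be replaced in the same way, which is how the grids below fill in the `classifier` step.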

## [Logistic classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)

In [138]:
from sklearn.linear_model import SGDClassifier

param_grid = [
    { 'preprocessing__truncate_titles__n_values_to_keep': [3, 4, 5],
      'classifier': [SGDClassifier(loss='log', tol=None, random_state=42)],
      'classifier__alpha': np.logspace(-5, -3, 3),
      'classifier__penalty': ['l2'],
      'classifier__max_iter': [20],
    }
]
grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
linear_cv = run_grid_search(grid_search=grid_search)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best test score accuracy is: 0.8249158249158249


[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   23.6s finished


In [139]:
linear_cv

| | `mean_test_accuracy` | `mean_test_precision` | `mean_train_accuracy` | `mean_train_precision` | `param_classifier` | `param_classifier__alpha` | `param_classifier__max_iter` | `param_classifier__penalty` | `param_preprocessing__truncate_titles__n_values_to_keep` | `rank_test_accuracy` | `rank_test_precision` |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.824916 | 0.788958 | 0.833612 | 0.804796 | SGDClassifier(alpha=0.001, average=False, clas... | 0.001 | 20 | l2 | 4 | 1 | 1 |
| 8 | 0.824916 | 0.788958 | 0.833051 | 0.803383 | SGDClassifier(alpha=0.001, average=False, clas... | 0.001 | 20 | l2 | 5 | 1 | 1 |
| 6 | 0.810325 | 0.771262 | 0.817059 | 0.780977 | SGDClassifier(alpha=0.001, average=False, clas... | 0.001 | 20 | l2 | 3 | 3 | 4 |
| 3 | 0.774411 | 0.771389 | 0.793765 | 0.792953 | SGDClassifier(alpha=0.001, average=False, clas... | 0.0001 | 20 | l2 | 3 | 4 | 3 |
| 4 | 0.771044 | 0.734951 | 0.803294 | 0.784224 | SGDClassifier(alpha=0.001, average=False, clas... | 0.0001 | 20 | l2 | 4 | 5 | 6 |


## [Random forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [106]:
from sklearn.ensemble import RandomForestClassifier

param_grid = [
    { 'preprocessing__truncate_titles__n_values_to_keep': [5],
      'classifier': [RandomForestClassifier(random_state=42)],
      'classifier__n_estimators': [10, 30, 100],
      'classifier__max_features': range(4, 14)
    }
]
rf_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
rf_cv_results = run_grid_search(grid_search=rf_grid_search)
rf_cv_results

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best test score accuracy is: 0.8215488215488216


[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed:  1.6min finished


## [SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [107]:
from sklearn.svm import SVC

param_grid = [
    { 
        'preprocessing__truncate_titles__n_values_to_keep': [5],
        'classifier': [ SVC(random_state=42) ],
        'classifier__C': np.logspace(-1, 1, 3),
        'classifier__kernel': ['linear', 'poly', 'rbf']
    }
]


svm_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
svm_cv_results = run_grid_search(grid_search=svm_grid_search)
svm_cv_results

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   25.8s finished


Best test score accuracy is: 0.8271604938271605


| | `mean_test_accuracy` | `mean_test_precision` | `mean_train_accuracy` | `mean_train_precision` | `param_classifier` | `param_classifier__C` | `param_classifier__kernel` | `param_preprocessing__truncate_titles__n_values_to_keep` | `rank_test_accuracy` | `rank_test_precision` |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 0.82716 | 0.801655 | 0.833332 | 0.813861 | SVC(C=10.0, cache_size=200, class_weight=None,... | 10.0 | rbf | 5 | 1 | 2 |
| 6 | 0.826038 | 0.79139 | 0.828282 | 0.797534 | SVC(C=10.0, cache_size=200, class_weight=None,... | 10.0 | linear | 5 | 2 | 3 |
| 3 | 0.815937 | 0.768258 | 0.821264 | 0.779457 | SVC(C=10.0, cache_size=200, class_weight=None,... | 1.0 | linear | 5 | 3 | 4 |
| 7 | 0.787879 | 0.74644 | 0.812572 | 0.79834 | SVC(C=10.0, cache_size=200, class_weight=None,... | 10.0 | poly | 5 | 4 | 5 |
| 2 | 0.786756 | 0.740055 | 0.786754 | 0.741903 | SVC(C=10.0, cache_size=200, class_weight=None,... | 0.1 | rbf | 5 | 5 | 6 |


## [Gradient boosting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

In [140]:
from sklearn.ensemble import GradientBoostingClassifier

param_grid = [
    { 
        'preprocessing__truncate_titles__n_values_to_keep': [5],
        'classifier': [ GradientBoostingClassifier(random_state=42) ],
        'classifier__loss': ['deviance'],
        'classifier__n_estimators': [50, 100],
        'classifier__max_features': [7, 13],
        'classifier__max_depth': [3, 5],
        'classifier__min_samples_leaf': [1],
        'classifier__min_samples_split': [2]
    }
]


gb_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
gb_cv_results = run_grid_search(grid_search=gb_grid_search)
gb_cv_results

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:   25.3s finished


Best test score accuracy is: 0.8282828282828283


| | `mean_test_accuracy` | `mean_test_precision` | `mean_train_accuracy` | `mean_train_precision` | `param_classifier` | `param_classifier__loss` | `param_classifier__max_depth` | `param_classifier__max_features` | `param_classifier__min_samples_leaf` | `param_classifier__min_samples_split` | `param_classifier__n_estimators` | `param_preprocessing__truncate_titles__n_values_to_keep` | `rank_test_accuracy` | `rank_test_precision` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.828283 | 0.804512 | 0.960997 | 0.97836 | ([DecisionTreeRegressor(criterion='friedman_ms... | deviance | 5 | 13 | 1 | 2 | 100 | 5 | 1 | 3 |
| 5 | 0.826038 | 0.801994 | 0.951458 | 0.968739 | ([DecisionTreeRegressor(criterion='friedman_ms... | deviance | 5 | 7 | 1 | 2 | 100 | 5 | 2 | 5 |
| 1 | 0.824916 | 0.807302 | 0.894505 | 0.902404 | ([DecisionTreeRegressor(criterion='friedman_ms... | deviance | 3 | 7 | 1 | 2 | 100 | 5 | 3 | 1 |
| 3 | 0.823793 | 0.799224 | 0.903761 | 0.911311 | ([DecisionTreeRegressor(criterion='friedman_ms... | deviance | 3 | 13 | 1 | 2 | 100 | 5 | 4 | 7 |
| 4 | 0.823793 | 0.802323 | 0.92733 | 0.947028 | ([DecisionTreeRegressor(criterion='friedman_ms... | deviance | 5 | 7 | 1 | 2 | 50 | 5 | 4 | 4 |


## Prepare the submission

In [150]:
def get_predictions(estimator):
    predictions = estimator.predict(X_test)
    indices = test.loc[:, 'PassengerId']
    as_dict = [{'PassengerId': index, 'Survived': prediction} for index, prediction in zip(indices, predictions)]
    return pd.DataFrame.from_dict(as_dict)

predictions = get_predictions(gb_grid_search.best_estimator_)

In [156]:
import os
# Create the output directory if it does not exist yet
os.makedirs('submissions', exist_ok=True)
dest_file = os.path.join('submissions', 'submission.csv')
predictions.to_csv(dest_file, index=False)