# Reusable data transformations using scikit-learn pipelines and hyperparameter optimization

This kernel illustrates the use of [scikit-learn Pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for data transformation. The custom transformers written here are easy to reuse in other projects, and because they are ordinary pipeline steps, their parameters can be tuned in the hyperparameter optimization together with the classifier's.

First import the required libraries:

In [29]:
import pandas as pd, sklearn, numpy as np, os
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, precision_score, make_scorer

Load the datasets. Note that for simplicity, we'll keep the label field `Survived` in the `train` dataframe until actually starting transformations and predictions; the `test` dataframe contains no labels.

In [9]:
data_folder = './kaggle_titanic_dataset'
train = pd.read_csv(os.path.join(data_folder, 'train.csv'))
test = pd.read_csv(os.path.join(data_folder, 'test.csv'))
n_train, m_train = train.shape

Take a peek at the columns:

In [10]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


`train.info()` shows that many values in `Age` and `Cabin` columns are missing:

In [11]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


To get more insight, check the statistical summary of the numerical attributes using `train.describe()`:

In [12]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Important things to note here:
- The dataset is somewhat imbalanced, as only 38% of the passengers survived; this may be worth taking into account in cross-validation.
- Some values in the `Fare` column are zero.
- As noted above, many `Age` values are missing.
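
One way to take the class imbalance into account is a stratified split, which keeps the share of survivors roughly equal in every fold. The sketch below uses synthetic labels with about 38% positives (an assumption made for illustration, not the Titanic data itself) and scikit-learn's `StratifiedKFold`:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with about 38% positives, mimicking the class balance above
y = np.array([1] * 38 + [0] * 62)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_positive_rates = [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]

# Each validation fold keeps a positive rate close to the overall 0.38
print(fold_positive_rates)
```

A splitter like this can be passed as the `cv` argument of `GridSearchCV` or `cross_val_score` instead of a plain integer.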

As there are lots of data exploration kernels out there, I will skip any further data exploration here and focus on the actual data transformation pipelines. Data exploration shows that `Age` is an important indicator for survival and the missing values should be imputed. The data processing steps carried out below are:
1. Split the dataset into features and labels. To simplify the kernel, we drop `Ticket`, `Cabin`, `Embarked`, and of course `PassengerId` from the feature set.
2. Extract the title from the `Name` field and use it to impute the missing ages. For example, the title `Mr.` tells us that the person in question was not a child.
3. One-hot encode all categorical attributes.
4. Convert the dataframe to a NumPy array.
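
As a small illustration of step 2, the sketch below applies a regular expression that takes the first word after the comma as the title. The helper name `extract` and the third sample name are chosen here for illustration; the second pattern is a hypothetical alternative that keeps multi-word titles and is not used elsewhere in this kernel.

```python
import re

title_pattern = re.compile(r', (\w*)')           # first word after the comma
multiword_pattern = re.compile(r', ([^.]+)\.')   # hypothetical alternative: everything up to the first period

def extract(name, pattern=title_pattern):
    """Return the captured title, or None if the name has no comma-separated title."""
    match = pattern.search(name)
    return match.group(1) if match else None

names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',  # multi-word title
]

titles = [extract(n) for n in names]
multiword_titles = [extract(n, multiword_pattern) for n in names]
print(titles)            # ['Mr', 'Miss', 'the']
print(multiword_titles)  # ['Mr', 'Miss', 'the Countess']
```

The single-word pattern is simpler but truncates rare multi-word titles; since rare titles end up in a catch-all group anyway, that loss is probably harmless here.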

First we split the dataset into features and labels and give them the common names `X_train` and `y_train`. Note that these are still dataframes; the conversion to NumPy is done just before feeding the inputs to the prediction algorithms. Also note that we split the given training set into training and validation sets. The validation set works as a hold-out set for estimating the generalization error after hyperparameter optimization. The model used in the final submission is, of course, trained on all the training data available.

In [13]:
from sklearn.model_selection import train_test_split

def drop_unused_columns(df):
    return df.drop(['PassengerId', 'Cabin', 'Ticket', 'Embarked'], axis=1)

def to_features_and_labels(df):
    y = df['Survived'].values
    X = drop_unused_columns(df)
    X = X.drop('Survived', axis=1)
    return X, y

X_train_val, y_train_val = to_features_and_labels(train) # All data with labels, to be split into train and val
X_test = drop_unused_columns(test)

# Split the available training data into training set (used for choosing the best model) 
# and validation set (used for estimating the generalization error, could also be called "hold-out" set)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.20, random_state=42)
X_train.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare
331,1,"Partner, Mr. Austen",male,45.5,0,0,28.5
733,2,"Berriman, Mr. William John",male,23.0,0,0,13.0
382,3,"Tikkanen, Mr. Juho",male,32.0,0,0,7.925
704,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,7.8542
813,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,31.275


First write a [scikit-learn transformer](http://scikit-learn.org/stable/data_transforms.html) for converting name to title. You may think that there's a lot of overhead involved in writing such classes (and you're right), but the transformer classes are highly reusable and therefore save a lot of time in future projects.

In [14]:
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameColumnMapper(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, mapping_func, new_column_name=None):
        self.column_name = column_name
        self.mapping_func = mapping_func
        self.new_column_name = new_column_name if new_column_name is not None else self.column_name
    def fit(self, X, y=None):
        # Nothing to do here
        return self
    def transform(self, X):
        Y = X.copy()
        # Map the source column element-wise; Series.map keeps the original index
        Y[self.new_column_name] = X[self.column_name].map(self.mapping_func)
        if self.column_name != self.new_column_name:
            Y = Y.drop(self.column_name, axis=1)
        return Y

# Return a function that extracts the title from a full name; the pattern is compiled only once, when extract_title() is called
def extract_title():
    import re
    pattern = re.compile(r', (\w*)')  # raw string avoids an invalid-escape warning for \w
    return lambda name: pattern.search(name).group(1)

# Example usage and output 
df = DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title').fit_transform(X_train)
df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Title
331,1,male,45.5,0,0,28.5,Mr
733,2,male,23.0,0,0,13.0,Mr
382,3,male,32.0,0,0,7.925,Mr
704,3,male,26.0,1,0,7.8542,Mr
813,3,female,6.0,4,2,31.275,Miss


Let's take a look at the extracted titles. Note that the slice `[1:10]` skips the most frequent title, `Mr`:

In [15]:
df['Title'].value_counts()[1:10]

Miss      143
Mrs        96
Master     33
Dr          5
Rev         5
Major       2
Col         2
Mlle        2
Ms          1
Name: Title, dtype: int64

It seems that only the most frequent titles (say, the top five) would be useful for imputing ages, so let us write a transformer that replaces the less frequent values of a categorical attribute with "Other". The number of values to keep is a constructor argument, so it can be optimized using cross-validation.

In [18]:
class CategoricalTruncator(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, n_values_to_keep=5):
        self.column_name = column_name
        self.n_values_to_keep = n_values_to_keep
        self.values = None
    def fit(self, X, y=None):
        # Here we must ensure that the test set is transformed similarly in the later phase and that the same values are kept
        self.values = list(X[self.column_name].value_counts()[:self.n_values_to_keep].keys())
        return self
    def transform(self, X):
        transform = lambda x: x if x in self.values else 'Other'
        y = X.transform({self.column_name: transform})
        return X.assign(**{self.column_name: y})

# Print title counts
title_counts = CategoricalTruncator('Title', n_values_to_keep=3).fit_transform(df)['Title'].value_counts()
title_counts

Mr       419
Miss     143
Mrs       96
Other     54
Name: Title, dtype: int64
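
The comment in `fit` above points at the key design choice: the values to keep are learned from the training data only and then reused for any other dataset. A condensed sketch of that idea on plain Series follows; the data and the helper name `truncate` are made up for illustration.

```python
import pandas as pd

train_titles = pd.Series(['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Mrs', 'Dr'])
test_titles = pd.Series(['Mr', 'Dona', 'Miss', 'Dr'])  # 'Dona' never occurs in training

# "fit": remember the two most frequent values of the training data only
kept = list(train_titles.value_counts().index[:2])

def truncate(series):
    # "transform": keep remembered values, replace everything else with 'Other'
    return series.where(series.isin(kept), 'Other')

truncated_train = truncate(train_titles).tolist()
truncated_test = truncate(test_titles).tolist()
print(kept)            # ['Mr', 'Miss']
print(truncated_test)  # ['Mr', 'Other', 'Miss', 'Other']
```

Values that are rare in training, or that appear only in the test set, both end up in `Other`, so later steps see the same set of categories in both datasets.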

Let us see what we have done so far by putting the transformers together in a pipeline:

In [21]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('name_to_title', DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title')),
    ('truncate_titles', CategoricalTruncator('Title', n_values_to_keep=3))
])

df = pipeline.fit_transform(X_train)
df.head(10)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Title
331,1,male,45.5,0,0,28.5,Mr
733,2,male,23.0,0,0,13.0,Mr
382,3,male,32.0,0,0,7.925,Mr
704,3,male,26.0,1,0,7.8542,Mr
813,3,female,6.0,4,2,31.275,Miss
118,1,male,24.0,0,1,247.5208,Mr
536,1,male,45.0,0,0,26.55,Other
361,2,male,29.0,1,0,27.7208,Mr
29,3,male,,0,0,7.8958,Mr
55,1,male,,0,0,35.5,Mr


Now write a generic imputer that uses values in a given column ("Title" in our case) to impute missing values for a numeric column ("Age") with the median value for the group in question.

In [22]:
class ImputerByReference(BaseEstimator, TransformerMixin):
    def __init__(self, column_to_impute, column_ref):
        self.column_to_impute = column_to_impute
        self.column_ref = column_ref
        # TODO Allow specifying the aggregation function
        # self.impute_func = np.median if impute_type == 'median' or impute_type is None else np.mean
    def fit(self, X, y=None):
        # Pick columns of interest
        df = X.loc[:, [self.column_to_impute, self.column_ref]]
        # Dictionary containing the median per group
        self.value_per_group = df.groupby(self.column_ref).median().to_dict()[self.column_to_impute]
        return self
    def transform(self, X):
        def transform(row):
            row_copy = row.copy()
            if pd.isnull(row_copy.at[self.column_to_impute]):
                row_copy.at[self.column_to_impute] = self.value_per_group[row_copy.at[self.column_ref]]
            return row_copy
        return X.apply(transform, axis=1)

# Example output
ImputerByReference('Age', 'Title').fit_transform(df).head(10)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Title
331,1,male,45.5,0,0,28.5,Mr
733,2,male,23.0,0,0,13.0,Mr
382,3,male,32.0,0,0,7.925,Mr
704,3,male,26.0,1,0,7.8542,Mr
813,3,female,6.0,4,2,31.275,Miss
118,1,male,24.0,0,1,247.5208,Mr
536,1,male,45.0,0,0,26.55,Other
361,2,male,29.0,1,0,27.7208,Mr
29,3,male,30.0,0,0,7.8958,Mr
55,1,male,30.0,0,0,35.5,Mr
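
The row-wise `apply` in `transform` is easy to read but can be slow on larger datasets. A vectorized variant of the same idea could look roughly like the sketch below; it runs on a small made-up frame, and in a transformer the medians would be computed in `fit` and reused in `transform`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age':   [30.0, 40.0, np.nan, 4.0, 8.0, np.nan],
})

# Median age per title (missing ages are ignored): Mr -> 35.0, Miss -> 6.0
medians = df_demo.groupby('Title')['Age'].median()

# Look up each row's group median and use it only where Age is missing
imputed = df_demo['Age'].fillna(df_demo['Title'].map(medians))
print(imputed.tolist())  # [30.0, 40.0, 35.0, 4.0, 8.0, 6.0]
```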


The full pipeline so far is below. Let us use `.info()` to check that no values are missing from the transformed data:

In [23]:
pipeline = Pipeline([
    ('name_to_title', DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title')),
    ('truncate_titles', CategoricalTruncator('Title', n_values_to_keep=3)),
    ('impute_ages_by_title', ImputerByReference('Age', 'Title'))
])

df = pipeline.fit_transform(X_train)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 331 to 102
Data columns (total 7 columns):
Pclass    712 non-null int64
Sex       712 non-null object
Age       712 non-null float64
SibSp     712 non-null int64
Parch     712 non-null int64
Fare      712 non-null float64
Title     712 non-null object
dtypes: float64(2), int64(3), object(2)
memory usage: 44.5+ KB


Looks good! Now we one-hot encode all categorical attributes, again using a generic transformer. Note that one-hot encoding would get us into trouble if we were encoding columns with many distinct values (like the `Ticket` column), but we do not worry about that here. The first step is to convert each categorical attribute to a numerical one by factorizing it to integers (for example, `Mrs` becomes 0 and `Mr` becomes 1):

In [26]:
class CategoricalToOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    def fit(self, X, y=None):
        # Pick all categorical attributes if no columns to transform were specified
        if self.columns is None:
            self.columns = list(X.select_dtypes(exclude='number').columns)  # store column names, not the data
        
        # Keep track of which categorical attributes are assigned to which integer. This is important 
        # when transforming the test set.
        mappings = {}
        
        for col in self.columns:
            labels, uniques = X.loc[:, col].factorize() # Assigns unique integers for all categories
            int_and_cat = list(enumerate(uniques))
            cat_and_int = [(x[1], x[0]) for x in int_and_cat]
            mappings[col] = {'int_to_cat': dict(int_and_cat), 'cat_to_int': dict(cat_and_int)}
    
        self.mappings = mappings
        return self

    def transform(self, X):
        Y = X.copy()
        for col in self.columns:
            transformed_col = Y.loc[:, col].transform(lambda x: self.mappings[col]['cat_to_int'][x])
            for key, val in self.mappings[col]['cat_to_int'].items():
                one_hot = (transformed_col == val) + 0 # Cast boolean to int by adding zero
                Y = Y.assign(**{'{}_{}'.format(col, key): one_hot})
            Y = Y.drop(col, axis=1)
        return Y
    
# Example output    
CategoricalToOneHotEncoder().fit_transform(df).head()   

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Sex_female,Title_Mr,Title_Miss,Title_Other,Title_Mrs
331,1,45.5,0,0,28.5,1,0,1,0,0,0
733,2,23.0,0,0,13.0,1,0,1,0,0,0
382,3,32.0,0,0,7.925,1,0,1,0,0,0
704,3,26.0,1,0,7.8542,1,0,1,0,0,0
813,3,6.0,4,2,31.275,0,1,0,1,0,0
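
One caveat of the dictionary lookup in `transform` above is that a category that was not present during `fit` raises a `KeyError`. For comparison, the sketch below shows how scikit-learn's built-in `OneHotEncoder` can handle that case with `handle_unknown='ignore'`. It is an alternative shown on made-up data, not the encoder used in this kernel.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_demo = pd.DataFrame({'Title': ['Mr', 'Miss', 'Mrs', 'Mr']})
test_demo = pd.DataFrame({'Title': ['Miss', 'Dona']})  # 'Dona' was never seen in training

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_demo)

categories = encoder.categories_[0].tolist()
encoded_test = encoder.transform(test_demo).toarray()  # the default output is sparse

print(categories)             # ['Miss', 'Mr', 'Mrs']
print(encoded_test.tolist())  # [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

The unseen category becomes an all-zero row instead of an error. The built-in encoder returns an array rather than a DataFrame, so it would sit after the DataFrame-based steps in a pipeline.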


Note that we could drop either `Sex_male` or `Sex_female` without losing any information, but we'll leave that for now. Now that all ages are imputed and all columns are numerical, we finally define a transformer for converting the DataFrame to a NumPy array and build the full data preparation pipeline. We include `MinMaxScaler` as the last preprocessing step because some algorithms are sensitive to differences in scale. We also add a simple imputer, as the test set has one missing `Fare` value.

In [28]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

class DataFrameToValuesTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # Remember the order of attributes before converting to NumPy
        self.attribute_order = list(X)
        return self
    def transform(self, X):
        return X.loc[:, self.attribute_order].values

def build_preprocessing_pipeline():
    return Pipeline([
        ('name_to_title', DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title')),
        ('truncate_titles', CategoricalTruncator(column_name='Title', n_values_to_keep=3)),
        ('impute_ages_by_title', ImputerByReference(column_to_impute='Age', column_ref='Title')),
        ('encode_categorical_onehot', CategoricalToOneHotEncoder()),
        ('encode_pclass_onehot', CategoricalToOneHotEncoder(columns=['Pclass'])),
        ('to_numpy', DataFrameToValuesTransformer()),
        ('imputer', SimpleImputer(strategy='median')), # Test set has one missing fare
        ('scaler', MinMaxScaler())
    ])

X_train_prepared = build_preprocessing_pipeline().fit_transform(X_train)
print('Prepared training data: {} samples, {} features'.format(*X_train_prepared.shape))

Prepared training data: 712 samples, 13 features
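
To make the last two steps concrete, the sketch below runs a median imputer followed by `MinMaxScaler` on a tiny made-up fare column. Both steps learn their statistics from the training data, so a test value above the training maximum is scaled to a value greater than 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

train_fares = np.array([[1.0], [3.0], [5.0]])
test_fares = np.array([[np.nan], [7.0]])  # one missing value, one value above the training range

numeric_steps = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # learns median = 3.0 from the training data
    ('scaler', MinMaxScaler()),                     # learns min = 1.0, max = 5.0
])
numeric_steps.fit(train_fares)

scaled_test = numeric_steps.transform(test_fares)
print(scaled_test.tolist())  # [[0.5], [1.5]]
```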


Before moving to trying different algorithms and optimizing hyperparameters, we define a few helper functions that are hopefully self-explanatory.

In [30]:
def build_pipeline(classifier=None):
    preprocessing_pipeline = build_preprocessing_pipeline()
    return Pipeline([
        ('preprocessing', preprocessing_pipeline),
        ('classifier', classifier) # Expected to be filled by grid search
    ])


def build_grid_search(pipeline, param_grid):
    return GridSearchCV(pipeline, param_grid, cv=5, return_train_score=True, refit='accuracy',
                        scoring={ 'accuracy': make_scorer(accuracy_score),
                                  'precision': make_scorer(precision_score)
                                },
                        verbose=1)

def pretty_cv_results(cv_results, 
                      sort_by='rank_test_accuracy',
                      sort_ascending=True,
                      n_rows=5):
    df = pd.DataFrame(cv_results)
    cols_of_interest = [key for key in df.keys() if key.startswith('param_') 
                        or key.startswith('mean_train') 
                        or key.startswith('mean_test_')
                        or key.startswith('rank')]
    return df.loc[:, cols_of_interest].sort_values(by=sort_by, ascending=sort_ascending).head(n_rows)

def run_grid_search(grid_search):
    grid_search.fit(X_train, y_train)
    print('Best test score accuracy is:', grid_search.best_score_)
    return pretty_cv_results(grid_search.cv_results_)

Trying different algorithms is now straightforward. Choose the parameters to vary and run the grid search with cross validation to find both the best preprocessing pipeline and classifier.
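
The keys of the parameter grids below follow scikit-learn's `step__parameter` naming, which can be nested for pipelines inside pipelines, and a whole step can be replaced by using the step name alone as the key. A minimal sketch of that mechanism, using built-in estimators as stand-ins for the custom transformers:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

inner = Pipeline([('scaler', MinMaxScaler())])
outer = Pipeline([('preprocessing', inner), ('classifier', LogisticRegression())])

# Double underscores walk down the nesting: outer step -> inner step -> parameter
outer.set_params(preprocessing__scaler__feature_range=(-1, 1), classifier__C=0.5)
params = outer.get_params()
print(params['preprocessing__scaler__feature_range'])  # (-1, 1)
print(params['classifier__C'])                         # 0.5

# Using the step name alone swaps the whole estimator, as the 'classifier' grid key does
outer.set_params(classifier=GaussianNB())
print(type(outer.named_steps['classifier']).__name__)  # GaussianNB
```

`GridSearchCV` applies the same `set_params` mechanism to each candidate in the grid, which is why preprocessing parameters and the classifier can be searched together.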

## [Logistic classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)

In [31]:
param_grid = [
    { 'preprocessing__truncate_titles__n_values_to_keep': [3, 4, 5],
      'classifier': [SGDClassifier(loss='log', tol=None, random_state=42)], # scikit-learn >= 1.1 names this loss 'log_loss'
      'classifier__alpha': np.logspace(-5, -3, 3),
      'classifier__penalty': ['l2'],
      'classifier__max_iter': [20],
    }
]
log_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
linear_cv = run_grid_search(grid_search=log_grid_search)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best test score accuracy is: 0.8216292134831461


[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   18.5s finished


In [32]:
linear_cv

Unnamed: 0,param_classifier,param_classifier__alpha,param_classifier__max_iter,param_classifier__penalty,param_preprocessing__truncate_titles__n_values_to_keep,mean_test_accuracy,rank_test_accuracy,mean_train_accuracy,mean_test_precision,rank_test_precision,mean_train_precision
8,"SGDClassifier(alpha=0.001, average=False, clas...",0.001,20,l2,5,0.821629,1,0.830413,0.783425,7,0.796768
7,"SGDClassifier(alpha=0.001, average=False, clas...",0.001,20,l2,4,0.820225,2,0.828658,0.781275,8,0.796125
4,"SGDClassifier(alpha=0.001, average=False, clas...",0.0001,20,l2,4,0.813202,3,0.811111,0.888435,1,0.876203
5,"SGDClassifier(alpha=0.001, average=False, clas...",0.0001,20,l2,5,0.813202,3,0.812164,0.888435,1,0.876639
6,"SGDClassifier(alpha=0.001, average=False, clas...",0.001,20,l2,3,0.808989,5,0.81777,0.75257,9,0.773431


## [Random forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [33]:
param_grid = [
    { 'preprocessing__truncate_titles__n_values_to_keep': [5],
      'classifier': [RandomForestClassifier(random_state=42)],
      'classifier__n_estimators': [10, 30, 100],
      'classifier__max_features': range(4, 14, 3)
    }
]
rf_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
rf_cv_results = run_grid_search(grid_search=rf_grid_search)
rf_cv_results

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best test score accuracy is: 0.8216292134831461


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   27.5s finished


Unnamed: 0,param_classifier,param_classifier__max_features,param_classifier__n_estimators,param_preprocessing__truncate_titles__n_values_to_keep,mean_test_accuracy,rank_test_accuracy,mean_train_accuracy,mean_test_precision,rank_test_precision,mean_train_precision
7,"(DecisionTreeClassifier(class_weight=None, cri...",10,30,5,0.821629,1,0.981391,0.77418,3,0.984783
4,"(DecisionTreeClassifier(class_weight=None, cri...",7,30,5,0.816011,2,0.981391,0.776304,1,0.984787
0,"(DecisionTreeClassifier(class_weight=None, cri...",4,10,5,0.813202,3,0.963833,0.774793,2,0.976417
8,"(DecisionTreeClassifier(class_weight=None, cri...",10,100,5,0.813202,3,0.982795,0.765945,6,0.985743
5,"(DecisionTreeClassifier(class_weight=None, cri...",7,100,5,0.811798,5,0.982795,0.763689,7,0.983893


## [SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [34]:
param_grid = [
    { 
        'preprocessing__truncate_titles__n_values_to_keep': [5],
        'classifier': [ SVC(random_state=42, probability=True) ], # Probability to use in voting later
        'classifier__C': np.logspace(-1, 1, 3),
        'classifier__kernel': ['linear', 'poly', 'rbf']
    }
]


svm_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
svm_cv_results = run_grid_search(grid_search=svm_grid_search)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best test score accuracy is: 0.8300561797752809


[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   21.2s finished


In [35]:
svm_cv_results

Unnamed: 0,param_classifier,param_classifier__C,param_classifier__kernel,param_preprocessing__truncate_titles__n_values_to_keep,mean_test_accuracy,rank_test_accuracy,mean_train_accuracy,mean_test_precision,rank_test_precision,mean_train_precision
6,"SVC(C=10.0, cache_size=200, class_weight=None,...",10,linear,5,0.830056,1,0.830758,0.798307,3,0.796815
8,"SVC(C=10.0, cache_size=200, class_weight=None,...",10,rbf,5,0.81882,2,0.833918,0.842416,1,0.85953
3,"SVC(C=10.0, cache_size=200, class_weight=None,...",1,linear,5,0.816011,3,0.824437,0.768159,4,0.783621
7,"SVC(C=10.0, cache_size=200, class_weight=None,...",10,poly,5,0.797753,4,0.814956,0.831265,2,0.854193
5,"SVC(C=10.0, cache_size=200, class_weight=None,...",1,rbf,5,0.79073,5,0.799859,0.744253,5,0.754906


## [Gaussian process classifier](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html#sklearn.gaussian_process.GaussianProcessClassifier)

In [39]:
from sklearn.gaussian_process.kernels import RBF, Matern

param_grid = [
    { 
        'preprocessing__truncate_titles__n_values_to_keep': [5],
        'classifier': [ GaussianProcessClassifier() ], 
        'classifier__kernel': [1.0*RBF(1.0), 1.0*Matern(1.0)]
    }
]

gp_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
gp_cv_results = run_grid_search(grid_search=gp_grid_search)


Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   13.6s finished


Best test score accuracy is: 0.8328651685393258


In [41]:
gp_cv_results

Unnamed: 0,param_classifier,param_classifier__kernel,param_preprocessing__truncate_titles__n_values_to_keep,mean_test_accuracy,rank_test_accuracy,mean_train_accuracy,mean_test_precision,rank_test_precision,mean_train_precision
0,"GaussianProcessClassifier(copy_X_train=True,\n...",1**2 * RBF(length_scale=1),5,0.832865,1,0.844801,0.842778,1,0.848948
1,"GaussianProcessClassifier(copy_X_train=True,\n...","1**2 * Matern(length_scale=1, nu=1.5)",5,0.831461,2,0.844449,0.839115,2,0.848776


## [AdaBoost](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)

In [38]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

param_grid = [
    { 
        'preprocessing__truncate_titles__n_values_to_keep': [5],
        'classifier': [ AdaBoostClassifier(random_state=42) ],
        'classifier__n_estimators': [50, 100],
        'classifier__learning_rate': np.logspace(-1, 1, 3),
        'classifier__base_estimator': [ # scikit-learn >= 1.2 names this parameter 'estimator'
            DecisionTreeClassifier(max_depth=1),
            DecisionTreeClassifier(max_depth=2)
        ],
        # 'classifier__base_estimator__max_depth': [1, 2]
    }
]

ada_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
ada_cv_results = run_grid_search(grid_search=ada_grid_search)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   31.2s finished


Best test score accuracy is: 0.8230337078651685


In [40]:
ada_cv_results

Unnamed: 0,param_classifier,param_classifier__base_estimator,param_classifier__learning_rate,param_classifier__n_estimators,param_preprocessing__truncate_titles__n_values_to_keep,mean_test_accuracy,rank_test_accuracy,mean_train_accuracy,mean_test_precision,rank_test_precision,mean_train_precision
7,"(DecisionTreeClassifier(class_weight=None, cri...","DecisionTreeClassifier(class_weight=None, crit...",0.1,100,5,0.823034,1,0.884829,0.794221,2,0.916694
1,"(DecisionTreeClassifier(class_weight=None, cri...","DecisionTreeClassifier(class_weight=None, crit...",0.1,100,5,0.820225,2,0.831459,0.777804,3,0.793459
6,"(DecisionTreeClassifier(class_weight=None, cri...","DecisionTreeClassifier(class_weight=None, crit...",0.1,50,5,0.820225,2,0.863059,0.79766,1,0.883414
2,"(DecisionTreeClassifier(class_weight=None, cri...","DecisionTreeClassifier(class_weight=None, crit...",1.0,50,5,0.81882,4,0.855689,0.776974,4,0.834163
3,"(DecisionTreeClassifier(class_weight=None, cri...","DecisionTreeClassifier(class_weight=None, crit...",1.0,100,5,0.81882,4,0.863414,0.775875,5,0.8493


## [Gradient boosting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

In [42]:
param_grid = [
    { 
        'preprocessing__truncate_titles__n_values_to_keep': [5],
        'classifier': [ GradientBoostingClassifier(random_state=42) ],
        'classifier__loss': ['deviance'], # scikit-learn >= 1.1 names this loss 'log_loss'
        'classifier__n_estimators': [50, 100],
        'classifier__max_features': [7, 13],
        'classifier__max_depth': [3, 5],
        'classifier__min_samples_leaf': [1],
        'classifier__min_samples_split': [2]
    }
]

gb_grid_search = build_grid_search(pipeline=build_pipeline(), param_grid=param_grid)
gb_cv_results = run_grid_search(grid_search=gb_grid_search)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:   18.7s finished


Best test score accuracy is: 0.8370786516853933


In [43]:
gb_cv_results

Unnamed: 0,param_classifier,param_classifier__loss,param_classifier__max_depth,param_classifier__max_features,param_classifier__min_samples_leaf,param_classifier__min_samples_split,param_classifier__n_estimators,param_preprocessing__truncate_titles__n_values_to_keep,mean_test_accuracy,rank_test_accuracy,mean_train_accuracy,mean_test_precision,rank_test_precision,mean_train_precision
5,([DecisionTreeRegressor(criterion='friedman_ms...,deviance,5,7,1,2,100,5,0.837079,1,0.955759,0.813802,5,0.985693
1,([DecisionTreeRegressor(criterion='friedman_ms...,deviance,3,7,1,2,100,5,0.83427,2,0.899226,0.828948,2,0.931899
4,([DecisionTreeRegressor(criterion='friedman_ms...,deviance,5,7,1,2,50,5,0.830056,3,0.923809,0.812951,6,0.961477
0,([DecisionTreeRegressor(criterion='friedman_ms...,deviance,3,7,1,2,50,5,0.828652,4,0.873598,0.821827,3,0.893343
6,([DecisionTreeRegressor(criterion='friedman_ms...,deviance,5,13,1,2,50,5,0.828652,4,0.936095,0.809786,7,0.969542


In [23]:
gb_grid_search.best_estimator_.score(X_val, y_val)

0.8212290502793296

## Voting classifier
Create a voting classifier from the best estimators and check the generalization accuracy on the held-out data `X_val`:

In [45]:
voting_estimators = [
    # ('logistic', log_grid_search),
    # ('rf', rf_grid_search),
    ('svc', svm_grid_search),
    ('gp', gp_grid_search),
    # ('ada', ada_grid_search),
    ('gb', gb_grid_search),
]

estimators_with_names = [(name, grid_search.best_estimator_) for name, grid_search in voting_estimators]

voting_classifier = VotingClassifier(estimators=estimators_with_names,
                                     voting='soft')

voting_classifier.fit(X_train, y_train)
voting_classifier.score(X_val, y_val)
# cross_val_score(voting_classifier, X_train_val, y_train_val, cv=5)

0.8212290502793296
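
With `voting='soft'` and no weights, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below checks this by hand on synthetic data; the three simple classifiers are stand-ins chosen for illustration, not the tuned pipelines above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

voter = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('nb', GaussianNB()),
        ('tree', DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting='soft',
)
voter.fit(X_demo, y_demo)

# Reproduce the soft vote manually: average the members' probabilities, then take the argmax
mean_proba = np.mean([est.predict_proba(X_demo) for est in voter.estimators_], axis=0)
manual_pred = voter.classes_[np.argmax(mean_proba, axis=1)]

print(np.array_equal(manual_pred, voter.predict(X_demo)))
```

This is also why the SVM above is created with `probability=True`: soft voting needs `predict_proba` from every member.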

## Train voting classifier with all data available

In [46]:
voting_classifier.fit(X_train_val, y_train_val)

VotingClassifier(estimators=[('svc', Pipeline(memory=None,
     steps=[('preprocessing', Pipeline(memory=None,
     steps=[('name_to_title', DataFrameColumnMapper(column_name='Name',
           mapping_func=<function extract_title.<locals>.<lambda> at 0x11a72bae8>,
           new_column_name='Title')), ('truncate_ti... subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False))]))],
         flatten_transform=None, n_jobs=None, voting='soft',
         weights=[0.8300561797752809, 0.8328651685393258, 0.8370786516853933])

## Prepare the submission

In [47]:
def get_predictions(estimator):
    # Pair each test passenger with its predicted label, in the submission format
    predictions = estimator.predict(X_test)
    return pd.DataFrame({'PassengerId': test['PassengerId'].values, 'Survived': predictions})

predictions = get_predictions(voting_classifier)

In [48]:
submission_folder = '.'
dest_file = os.path.join(submission_folder, 'submission.csv')
predictions.to_csv(dest_file, index=False)