# Classification and data munging examples using the [Titanic dataset](https://www.kaggle.com/c/titanic) on Kaggle

This kernel illustrates the use of [scikit-learn Pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for data preparation and easily reproducible transformations.

First import the required libraries:

In [4]:
import pandas as pd, sklearn, numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams["font.size"] = "16"

Load the datasets. Note that for simplicity, we'll keep the label field `Survived` in the `train` and `test` dataframes until actually starting transformations and predictions.

In [6]:
train = pd.read_csv('./kaggle_titanic_dataset/train.csv')
test = pd.read_csv('./kaggle_titanic_dataset/test.csv')
n_train, m_train = train.shape

Take a peek at the columns:

In [8]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


`train.info()` shows that many values in the `Age` and `Cabin` columns are missing (177 and 687 of 891, respectively), along with two in `Embarked`:

In [12]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


Check the statistical summary of the numerical attributes:

In [13]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Important things to note here:
- The classes are moderately imbalanced: only 38% of the passengers survived, which is worth taking into account in cross-validation.
- Some values in the `Fare` column are zero.
- As noted above, many `Age` values are missing.
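
The point about class imbalance and cross-validation can be illustrated with stratified folds. The following is a minimal sketch on toy labels (the arrays `X_toy` and `y_toy` are made up for illustration and are not part of the kernel). It shows that `StratifiedKFold` keeps the share of positives in every fold close to the overall 38%.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with roughly the same imbalance as the training set (38% positives)
y_toy = np.array([1] * 38 + [0] * 62)
X_toy = np.zeros((len(y_toy), 1))  # the features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positives in each held-out fold
fold_rates = [y_toy[test_idx].mean() for _, test_idx in skf.split(X_toy, y_toy)]
print(fold_rates)
```

When `GridSearchCV` is given an integer `cv` and a classifier, scikit-learn already uses stratified folds, so an explicit splitter is mainly useful when shuffling or a fixed random seed is wanted.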

As there are lots of data exploration kernels out there, I will skip further data exploration here and focus on the actual data transformation pipelines. Data exploration shows that `Age` is an important indicator of survival, so the missing values should be imputed. The data processing steps carried out below are:
1. Split the dataset into features and labels. To simplify the kernel, we drop `Ticket`, `Cabin`, and of course `PassengerId` from the feature set.
2. Extract the title from the `Name` field and use it as a feature for imputing the missing ages. For example, the title `Mr.` indicates that the person in question was not a child.
3. One-hot encode all categorical attributes.
4. Convert the dataframe to a NumPy array.
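
The idea behind step 2 can be sketched on a toy dataframe: fill each missing age with the median age of passengers sharing the same title. This is only an illustration under assumptions. The frame `toy` and the fallback to the overall median are made up here and are not taken from the kernel.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Miss"],
    "Age": [30.0, 40.0, np.nan, 4.0, np.nan, np.nan],
})

# Median age per title, broadcast back to every row of that title
title_median = toy.groupby("Title")["Age"].transform("median")

# Use the per-title median first; fall back to the overall median
# for titles that have no known age at all ("Miss" in this toy data)
imputed = toy["Age"].fillna(title_median).fillna(toy["Age"].median())
print(imputed.tolist())
```

In a real pipeline the medians would be learned in `fit` on the training data only and then reused in `transform`, so that the test set is imputed with the training statistics.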

First we split the dataset into features and labels and give them the common names `X_train` and `y_train`. Note that `X_train` is still a dataframe; conversion to NumPy is done just before the prediction algorithms.

In [60]:
def to_features_and_labels(df):
    y = df['Survived'].values
    X = df.drop(['PassengerId', 'Survived', 'Cabin', 'Ticket'], axis=1) # drop returns a copy; df itself is unchanged
    return X, y

# Assume that PassengerId, Cabin and Ticket do not matter; Name is kept for title extraction
X_train, y_train = to_features_and_labels(train)
X_train.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S


First write a [scikit-learn transformer](http://scikit-learn.org/stable/data_transforms.html) for converting a name to a title. You may think that the classes carry a lot of overhead (and that's true), but transformer classes are highly reusable and may therefore save a lot of time in the future.

In [87]:
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameColumnMapper(BaseEstimator, TransformerMixin):
    def __init__(self, column_name, mapping_func, new_column_name=None):
        self.column_name = column_name
        self.mapping_func = mapping_func
        self.new_column_name = new_column_name if new_column_name is not None else self.column_name
    def fit(self, X, y=None):
        # Nothing to do here
        return self
    def transform(self, X):
        # Map the column element-wise; Series.map returns a Series aligned with X
        transformed_column = X[self.column_name].map(self.mapping_func)
        Y = X.copy()
        Y = Y.assign(**{self.new_column_name: transformed_column})
        if self.column_name != self.new_column_name:
            Y = Y.drop(self.column_name, axis=1)
        return Y

# Return a function that extracts the title from the full name; this compiles the pattern only once
def extract_title():
    import re
    pattern = re.compile(r', (\w*)')  # raw string: the word following "<surname>, "
    return lambda name: pattern.search(name).group(1)

# Example usage and output 
df = DataFrameColumnMapper(column_name='Name', mapping_func=extract_title(), new_column_name='Title').fit_transform(X_train)
df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,3,male,22.0,1,0,7.25,S,Mr
1,1,female,38.0,1,0,71.2833,C,Mrs
2,3,female,26.0,0,0,7.925,S,Miss
3,1,female,35.0,1,0,53.1,S,Mrs
4,3,male,35.0,0,0,8.05,S,Mr


Let's take a look at the extracted titles (the slice `[1:10]` skips the single most frequent one):

In [89]:
df['Title'].value_counts()[1:10]

Miss      182
Mrs       125
Master     40
Dr          7
Rev         6
Col         2
Mlle        2
Major       2
Sir         1
Name: Title, dtype: int64

It seems that only the, say, five most frequent titles would be useful for imputing ages, so a natural next step is a transformer that maps the less frequent values of a categorical attribute to "Other".
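
A minimal sketch of such a transformer is given here. The class name `RareCategoryGrouper` and its parameters `top_n` and `other_label` are hypothetical and chosen for illustration; they do not come from the original kernel. The frequent values are learned in `fit`, so the same mapping can later be applied to the test set.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RareCategoryGrouper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep the top_n most frequent values of one column
    and replace every other value (including missing ones) with other_label."""
    def __init__(self, column_name, top_n=5, other_label="Other"):
        self.column_name = column_name
        self.top_n = top_n
        self.other_label = other_label

    def fit(self, X, y=None):
        # value_counts sorts by frequency, most frequent first
        counts = X[self.column_name].value_counts()
        self.frequent_ = list(counts.index[:self.top_n])
        return self

    def transform(self, X):
        Y = X.copy()
        col = Y[self.column_name]
        # Keep values seen as frequent during fit, replace the rest
        Y[self.column_name] = col.where(col.isin(self.frequent_), self.other_label)
        return Y

# Toy usage: with top_n=3, "Dr" and "Rev" are mapped to "Other"
toy = pd.DataFrame({"Title": ["Mr"] * 4 + ["Miss"] * 3 + ["Mrs"] * 2 + ["Dr", "Rev"]})
grouped = RareCategoryGrouper("Title", top_n=3).fit_transform(toy)
print(grouped["Title"].value_counts())
```

Applied after `DataFrameColumnMapper`, a step like this would reduce the `Title` column to a handful of categories before one-hot encoding.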

In [5]:
print('--Survived--')
print(train["Survived"].value_counts())
print('---Embarked---')
print(X_train["Embarked"].value_counts())
print('---Tickets---')
# Ticket was dropped from X_train above, so read it from the original dataframe
print(train["Ticket"].value_counts()[0:10])

--Survived--
0    549
1    342
Name: Survived, dtype: int64
---Embarked---
S    644
C    168
Q     77
Name: Embarked, dtype: int64
---Tickets---
CA. 2343        7
347082          7
1601            7
347088          6
CA 2144         6
3101295         6
382652          5
S.O.C. 14879    5
113781          4
PC 17757        4
Name: Ticket, dtype: int64


In [6]:
# Pick numerical attributes
num_attribs = list(X_train.select_dtypes(include=['number']))
cat_attribs = list(X_train.select_dtypes(include=['object']))

print('Numerical attributes:', num_attribs)
print('Categorical attributes:', cat_attribs)

Numerical attributes: ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Categorical attributes: ['Sex', 'Ticket', 'Cabin', 'Embarked']


### Build pipeline for processing numerical attributes

In [7]:
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameAttributesSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names=None):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        columns = list(X) if self.attribute_names is None else self.attribute_names
        return X[columns]
    
class DataFrameToValuesTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        self.attribute_names = list(X)
        return X.values


In [8]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('selector', DataFrameAttributesSelector(attribute_names=num_attribs)),
    ('to_numpy', DataFrameToValuesTransformer()),
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])


### Pipeline for processing categorical attributes

In [9]:
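class CategoricalToIntegerFactorizer(BaseEstimator, TransformerMixin):
    """Integer-encode categorical columns that have at most max_categories unique values."""
    def __init__(self, max_categories):
        self.max_categories = max_categories
        self.categories = []
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        self.categories = []  # reset so that repeated calls do not accumulate names
        cols = list(X)
        Y = pd.DataFrame()
        for col in cols:
            unique_values = len(X[col].unique())  # NaN counts as a value here
            if (unique_values > self.max_categories):
                continue  # skip high-cardinality columns
            # factorize marks missing values with -1 (the na_sentinel argument
            # used originally was removed in pandas 2.0)
            factorized, categories = X[col].factorize()
            self.categories.extend(['%s_%s' % (col, cat) for cat in categories])
            if (factorized == -1).any():
                # Give missing values their own code after the real categories
                factorized[factorized == -1] = len(categories)
                self.categories.append('%s_nan' % col)
            Y[col] = factorized
        return Y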

In [10]:
from sklearn.preprocessing import OneHotEncoder
cat_pipeline = Pipeline([
    ('selector', DataFrameAttributesSelector(attribute_names=cat_attribs)),
    ('cat_to_int_encoder', CategoricalToIntegerFactorizer(max_categories=5)),
    ('one_hot_encoder', OneHotEncoder())
])

# cat_pipeline.fit_transform(X_train).toarray()

In [12]:
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

X_train_prepared = full_pipeline.fit_transform(X_train)
print('Size of prepared X:', X_train_prepared.shape)

assert X_train_prepared.shape[0] == len(y_train)

Size of prepared X: (891, 11)
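
As a point of comparison, newer scikit-learn versions offer `ColumnTransformer`, which selects columns by name and removes the need for custom selector classes. The sketch below is an alternative to the `FeatureUnion` approach, not the method used in this kernel, and it runs on a made-up toy frame. It also differs in one respect: missing categorical values are imputed with the most frequent value instead of getting their own one-hot column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing numerical and one missing categorical value
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Each entry is (name, transformer, list of columns it applies to)
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", categorical, ["Sex", "Embarked"]),
])

prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

The resulting matrix has two scaled numerical columns followed by the one-hot columns for `Sex` and `Embarked`.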


In [13]:
X_train_prepared

<891x11 sparse matrix of type '<class 'numpy.float64'>'
	with 6237 stored elements in Compressed Sparse Row format>

### Try RandomForestClassifier and GridSearchCV with the prepared data

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, make_scorer

param_grid = [
    { 'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8] },
    { 'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]

forest_clf = RandomForestClassifier()
grid_search = GridSearchCV(forest_clf, param_grid, cv=5, return_train_score=True, refit='accuracy',
                           scoring={ 'accuracy': make_scorer(accuracy_score),
                                     'precision': make_scorer(precision_score)
                                   })

cv = grid_search.fit(X_train_prepared, y_train)
cv_results = pd.DataFrame(grid_search.cv_results_)
print('Best test score accuracy is:', grid_search.best_score_)

Best test score accuracy is: 0.813692480359147


In [None]:
cols = list(grid_search.cv_results_.keys())
cols_of_interest = [key for key in cols if key.startswith('param_') 
                    or key.startswith('mean_train') 
                    or key.startswith('mean_test_')
                    or key.startswith('rank')]
cv_results[cols_of_interest]

In [None]:
num_attribs_prepared = num_pipeline.named_steps["to_numpy"].attribute_names
cat_attribs_prepared = cat_pipeline.named_steps["cat_to_int_encoder"].categories
attributes = num_attribs_prepared + cat_attribs_prepared

feature_importances = grid_search.best_estimator_.feature_importances_

# Pair each importance with its attribute name, most important first
sorted(zip(feature_importances, attributes), reverse=True)

In [25]:
from sklearn.svm import SVC
param_grid = [
    { 'clf': [ RandomForestClassifier() ],
      'clf__n_estimators': [3, 10, 30] 
    },
    { 'clf': [ SVC() ] }
]

clf = Pipeline([
    ('clf', RandomForestClassifier())
])
grid_search = GridSearchCV(clf, param_grid, cv=5, return_train_score=True, refit='accuracy',
                           scoring={ 'accuracy': make_scorer(accuracy_score),
                                     'precision': make_scorer(precision_score)
                                   })

cv = grid_search.fit(X_train_prepared, y_train)
cv_results = pd.DataFrame(grid_search.cv_results_)
print('Best test score accuracy is:', grid_search.best_score_)

cv_results

Best test score accuracy is: 0.8271604938271605


Unnamed: 0,mean_fit_time,mean_score_time,mean_test_accuracy,mean_test_precision,mean_train_accuracy,mean_train_precision,param_clf,param_clf__n_estimators,params,rank_test_accuracy,...,split4_test_accuracy,split4_test_precision,split4_train_accuracy,split4_train_precision,std_fit_time,std_score_time,std_test_accuracy,std_test_precision,std_train_accuracy,std_train_precision
0,0.009039,0.002184,0.784512,0.714859,0.942764,0.937475,"(DecisionTreeClassifier(class_weight=None, cri...",3.0,{'clf': (DecisionTreeClassifier(class_weight=N...,4,...,0.79096,0.696203,0.936975,0.922509,0.000773,0.000339,0.030941,0.027747,0.005532,0.017835
1,0.024176,0.00302,0.808081,0.769735,0.971384,0.976022,"(DecisionTreeClassifier(class_weight=None, cri...",10.0,{'clf': (DecisionTreeClassifier(class_weight=N...,2,...,0.819209,0.764706,0.963585,0.962687,0.000385,8.9e-05,0.026221,0.03454,0.004549,0.00945
2,0.069361,0.006392,0.805836,0.760939,0.980642,0.980144,"(DecisionTreeClassifier(class_weight=None, cri...",30.0,{'clf': (DecisionTreeClassifier(class_weight=N...,3,...,0.836158,0.782609,0.97479,0.967153,0.001613,0.00016,0.029805,0.029927,0.003686,0.008697
3,0.012497,0.004123,0.82716,0.81299,0.832496,0.8158,"SVC(C=1.0, cache_size=200, class_weight=None, ...",,"{'clf': SVC(C=1.0, cache_size=200, class_weigh...",1,...,0.847458,0.847458,0.820728,0.801653,0.000654,0.00017,0.012262,0.026726,0.006058,0.008857
