# Machine Learning 101 Hands On Workshop - Session #2

### Quick Recap:

In the last workshop we:
* Performed initial exploratory data analysis of the Titanic dataset.
* Built some simple models (Random Forest, SVM).
* Used GridSearchCV to optimize hyperparameters (quick intro).

Takeaways:
* Overall, top performance ~ **83% accuracy** (on holdout set), **77.03% accuracy** on Kaggle public holdout using SVC and GridSearch.
* Most of the time and discussion revolved around data analysis and preprocessing.

### Agenda for this Session:  <a class="anchor" id="agenda"></a>
* [Utility functions](#utility-functions)
* [Preparation of data](#prep-of-data)
* [Using Scikit-Learn Pipelines to simplify and modularize the data pipeline](#sklearn-pipelines)
  - [Exercise 1](#exercise-1)
  - [Simple Pipeline](#simple-pipeline)
  - [More Complex Pipeline](#complex-pipeline)
* [Using Grid Search / Randomized Search to optimize hyperparameters](#search-for-hyperparameters)
  - [Exercise 2](#exercise-2)
* [Understanding which results are overfitting / underfitting](#overfitting-underfitting)
* [Write Kaggle Submission file based on predicted outputs](#kaggle-submission)
  - [Exercise 3 - Homework](#exercise-3)

In [1]:
# Get the data:
#  This will pull the required data (train.csv and test.csv files) from dropbox.
#  The cool thing is that this is leveraging wget on the docker container to pull data to your local disk,
#  consistently for all operating systems.
! rm -f train.csv test.csv
! wget "https://www.dropbox.com/s/f7fb3gon8byyyz6/train.csv?dl=1" -O train.csv -q
! wget "https://www.dropbox.com/s/zcd2751x6waex9f/test.csv?dl=1" -O test.csv -q
! ls -l train.csv test.csv
! wc -l train.csv test.csv

-rw-r--r-- 1 root root 28629 Mar  2 21:23 test.csv
-rw-r--r-- 1 root root 61194 Mar  2 21:23 train.csv
  892 train.csv
  419 test.csv
 1311 total


In [2]:
# Import the main Python libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import logging
import imp
import collections
logging.basicConfig(level=logging.DEBUG)

from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.preprocessing import Imputer, MinMaxScaler, Normalizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, ElasticNet
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans, DBSCAN
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, matthews_corrcoef
from sklearn.base import TransformerMixin, BaseEstimator

### Utility functions <a class="anchor" id="utility-functions"></a>
(Back to [agenda/toc](#agenda))

* Functions to use in pipelines - from Session #1
  - model_eval
  - fare_imputer
  - embarked_imputer
  - transformer_dummies
  - select_numeric_cols
  - columns_subset
  - custom_transformer

In [3]:
def model_eval(y_actual, y_pred, label="Model"):
    """
    Provides classification evaluation metrics based on actual and predicted values; namely:
    * Accuracy
    * F1 Score
    * Matthews Correlation Coefficient
    * Confusion Matrix
    """
    logging.info("{} - Accuracy: {}".format(label, accuracy_score(y_actual, y_pred)))
    logging.info("{} - F1 Score: {}".format(label, f1_score(y_actual, y_pred)))
    logging.info("{} - Matthew's Corr Coef: {}".format(label, matthews_corrcoef(y_actual, y_pred)))
    return pd.DataFrame(confusion_matrix(y_actual, y_pred),
                        columns=['Pred_Died','Pred_Survived'],
                        index=['Actual_Died','Actual_Survived'])

def fare_imputer(X_):
    """
    Designed to be passed to a FunctionTransformer with a pd.DataFrame as input, with validate=False.
    * Based on Session #1 logic:
      - Impute Fare values for records where Fare is null or 0.
      - Impute values by calculating the median Fare by Pclass and then using this median value for imputation.
    """
    X = X_.copy()
    # Fare Imputer:
    pclass_to_fare_lookup = X[~((X.Fare == 0) | (X.Fare.isnull())) ].groupby(['Pclass']).Fare.median()
    X.loc[((X.Fare == 0) | (X.Fare.isnull())), 'Fare'] = X['Pclass'].map(dict(pclass_to_fare_lookup))
    return X

def embarked_imputer(X_):
    """
    Designed to be passed to a FunctionTransformer with a pd.DataFrame as input, with validate=False.
    * Based on Session #1 logic:
      - Based on Fare level of missing values, set all nulls to 'S'.
    """
    X = X_.copy()
    #Embarked Imputer:
    # Assign the result back instead of relying on an in-place fill through attribute access:
    X['Embarked'] = X['Embarked'].fillna('S')
    return X

def transformer_dummies(X_, columns=None, transform_to_cat=True):
    """
    Designed to be passed to a FunctionTransformer with a pd.DataFrame as input, with validate=False.
    * Leverages Pandas get_dummies to one-hot encode categorical columns.
    * Key here is that columns to be encoded should ideally be "Category" columns so that get_dummies generates
      the same number of columns in the same order for different cuts of the data.
    """
    X = X_.copy()
    # Guard against the default columns=None, which cannot be iterated over:
    if transform_to_cat and columns is not None:
        for c in columns:
            X[c] = pd.Categorical(X[c])
    
    return pd.get_dummies(X, drop_first=True, dummy_na=True, columns=columns)

def select_numeric_cols(X):
    """
    Designed to be passed to a FunctionTransformer with a pd.DataFrame as input, with validate=False.
    This function will subset the selection to only numeric columns.
    ColumnExtractor class provides a better, more flexible implementation of this logic. 
    """
    return X.select_dtypes([np.int64,np.float64,np.uint8])

def columns_subset(X, columns):
    """
    Designed to be passed to a FunctionTransformer with a pd.DataFrame as input, with validate=False.
    Need to pass columns in as a kw_args parameter; e.g.
    FunctionTransformer( columns_subset, validate=False, kw_args = { 'columns' : ['A', 'B', ... ] } )
    """
    return X[columns]
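The `kw_args` mechanism mentioned in the `columns_subset` docstring can be sketched as follows. This is a minimal illustration only: the toy DataFrame and its column names are made up, and the function is repeated so the snippet runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def columns_subset(X, columns):
    # Same one-liner as in the cell above, repeated to keep this sketch self-contained.
    return X[columns]

# Hypothetical toy data (not the Titanic set).
toy = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': ['x', 'y']})

# kw_args is forwarded to the wrapped function each time the transformer is applied.
subset_tf = FunctionTransformer(columns_subset, validate=False,
                                kw_args={'columns': ['A', 'B']})
subset = subset_tf.fit_transform(toy)
print(list(subset.columns))
```

Because the column list lives in `kw_args`, the same function can be reused in several pipeline steps with different columns.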

In [4]:
def custom_transformer(X_):
    """
    Designed to be passed to a FunctionTransformer with a pd.DataFrame as input, with validate=False.
    Performs feature creation steps as discussed in Session #1.
    """
    X = X_.copy()
        
    X['fare_log'] = np.log(X.Fare)

    # Age:
    X['age_16_to_34'] = ((X.Age >= 16) & (X.Age <= 34)).astype(int)
    X['age_35_to_47'] = ((X.Age >= 35) & (X.Age <= 47)).astype(int)
    X['age_missing'] = X.Age.isnull().astype(int)
    X.drop(['Age'], axis=1, inplace=True)
    
    # People per ticket:
    X = X.merge(pd.DataFrame(X.groupby('Ticket').size(),
                             columns=['people_per_ticket']),
                how='left', left_on='Ticket', right_index=True)
    X.drop(['Ticket'], axis=1, inplace=True)
    
    # People per lastname:
    X['Lastname'] = X.Name.apply(lambda x: x.split(',')[0])
    X = X.merge(pd.DataFrame(X.groupby('Lastname').size(),
                             columns=['people_per_lastname']),
                how='left', left_on='Lastname', right_index=True)
    X.drop(['Name', 'Lastname'], axis=1, inplace=True)
    
    # People in family:
    X['people_in_family'] = X.Parch + X.SibSp + 1
    X.drop(['Parch','SibSp'],axis=1,inplace=True)
    
    # Cabin information extraction:
    X['Cabin'] = X['Cabin'].fillna('O') # O = other
    #X['cabin_level'] = X.Cabin.apply(lambda x: x[0])
    X['cabin_level'] = pd.Categorical(X.Cabin.apply(lambda x: x[0]),
                                      categories=['O', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'])
    X.drop(['Cabin'], axis=1, inplace=True)

    return X

In [5]:
# Credit zacstewart.com (http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html)
class ModelTransformer(BaseEstimator, TransformerMixin):
    """
    Custom estimator which fits the passed model on train dataset; transform returns model predictions.
    Convenience function to add model predictions as features in a pipeline.
    Credit: zacstewart.com
    """
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.model.predict(X))


class ColumnExtractor(BaseEstimator, TransformerMixin):
    """
    Custom estimator which extracts a subset of columns from a pandas DataFrame based on:
    * Columns list, or
    * Dtypes list
    Supports returning data as a DataFrame or a Numpy Array; based on 'np_return' parameter.
    """
    def __init__(self, columns=None, dtypes=None, np_return=True):
        if columns is None and dtypes is None:
            raise ValueError("Need to specify 'columns' or 'dtypes' to extract from DataFrame")
        else:
            self.columns = columns
            self.dtypes = dtypes
            self.np_return = np_return

    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, **transform_params):
        if self.columns:
            cols = self.columns
        elif self.dtypes:
            cols = X.select_dtypes(self.dtypes).columns.values
        else:
            raise ValueError("Can't transform with no columns or dtypes specified")
        
        if self.np_return:
            return X[cols].values
        else:
            return X[cols]

class FixedLabelBinarizer(BaseEstimator, TransformerMixin):
    """
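    LabelBinarizer used to support being called with a dummy 'y', but this changed in Scikit Learn 0.19.
    CategoricalEncoder was the planned replacement for 0.20; in the released 0.20 that functionality
    was folded into OneHotEncoder (which accepts string categories) and the new OrdinalEncoder.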
    Simply wrapped the LabelBinarizer to allow it to support the standard pipeline function stub.
    Credit: Stackoverflow
    """
    def __init__(self):
        self.encoder = LabelBinarizer()
    
    def fit(self, X, y=None):
        self.encoder.fit(X)
        return self

    def transform(self, X, y=None):
        return self.encoder.transform(X)

class PassThroughTransformer(BaseEstimator, TransformerMixin):
    """
    Transform function returns exactly the columns / features that are passed to it.
    """
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X

#### Quick Discussion of Peek functionality:
* Allows looking into pipeline data.
* Using a closure (factory pattern) to pass arguments to a function within the pipeline.
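The closure idea can be shown in isolation with plain Python. In the sketch below, `make_tagger` and `tag` are hypothetical names chosen for illustration; the point is that the pipeline only ever passes the data to a step, so any extra configuration has to be captured when the function is created.

```python
def make_tagger(label):
    """Factory: returns a one-argument function that remembers `label`."""
    def tag(X):
        # `label` comes from the enclosing scope; that captured value is the closure.
        print('{} - received {} rows'.format(label, len(X)))
        return X  # hand the data back unchanged, as a peek step should
    return tag

before_model = make_tagger('Before model')
rows = [[1, 2], [3, 4], [5, 6]]
result = before_model(rows)
```

Each call to the factory produces an independent function, so several differently labelled peeks can sit in the same pipeline.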

In [6]:
def peek_factory(label, head=0, output_dtypes=True):
    """
    Closure returning `peek` function that can be used to peek into the data in a pipeline
    without affecting the pipeline functionality.
    
    Parameters
    ----------
    label : str describing where in the pipeline we're peeking so that the log messages can be labelled
    head : number of rows of data from the pipeline to output into the log
    output_dtypes : for DataFrame only, whether or not to write dtypes into log
    
    Example usage:
      FunctionTransformer( peek_factory("Look at XYZ", head=1) )
      
    Note: supports looking at pandas DataFrames or Numpy Arrays.
    """
    def peek(X, label=label, head=head, output_dtypes=output_dtypes):
        # Log information we want to know about the data passing through the pipeline:
        logging.debug('{} - Type: {}'.format(label, type(X)))
        logging.debug('{} - Shape: {}'.format(label, X.shape))
 
        if isinstance(X, pd.DataFrame):
            if output_dtypes:
                logging.debug('{} - Dtypes: {}'.format(label, X.dtypes))
            if head > 0:
                logging.debug('{} - Head:'.format(label))
                tmp = X.head(head)
                for (i,r) in enumerate(tmp.itertuples(index=False)):
                    logging.debug('{} - Row {}: {}'.format(label, i, r))
        else:
            if head > 0:
                logging.debug('{} - Head:'.format(label))
                tmp = X[:head]
                for (i,r) in enumerate(tmp):
                    logging.debug('{} - Row {}: {}'.format(label, i, ",".join(map(str, r))))
        return X
    return peek

### Preparation of data: <a class="anchor" id="prep-of-data"></a>
(Back to [agenda/toc](#agenda))

In [7]:
# Take a quick look at the top of the files to check what data we're getting as input:
! head -3 train.csv test.csv

==> train.csv <==
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C

==> test.csv <==
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S


In [8]:
# Read data in from input CSV files:
df = pd.read_csv('train.csv', index_col='PassengerId')
df_kaggle_holdout = pd.read_csv('test.csv', index_col='PassengerId')

In [9]:
df.head(3).T

PassengerId,1,2,3
Survived,0,1,1
Pclass,3,1,3
Name,"Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Th...","Heikkinen, Miss. Laina"
Sex,male,female,female
Age,22,38,26
SibSp,1,1,0
Parch,0,0,0
Ticket,A/5 21171,PC 17599,STON/O2. 3101282
Fare,7.25,71.2833,7.925
Cabin,,C85,


In [10]:
# Train / Test split:
X = df.drop('Survived',axis=1)
y = df.Survived.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

## Using Scikit-Learn Pipelines to simplify and modularize the data pipeline <a class="anchor" id="sklearn-pipelines"></a>
(Back to [agenda/toc](#agenda))

Pipelines provide a convenient construct for automating ML processes. They can be used to chain multiple estimators into a single estimator.
<img src="images_2/pipeline.png" style="height: 80px;" align="center"/>

As described in Scikit Learn documentation:

> Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

Advantages of pipelines:
* Modularize pipeline.
* Simple to switch functions in / out of pipeline.
* Very convenient for experimentation and searching for optimal ML approach.
* Overcome common problems like data leakage in your test harness.
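To illustrate the data leakage point, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the step names `scale` / `model` are assumptions for illustration, not part of the workshop data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

leak_free = Pipeline([
    ('scale', StandardScaler()),       # statistics are learned during fit only
    ('model', LogisticRegression()),
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler never sees the rows of the fold being scored.
scores = cross_val_score(leak_free, X_toy, y_toy, cv=5)
print(scores.shape, scores.mean())
```

Scaling the full dataset before cross validation would instead let information from the validation rows leak into the preprocessing.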

#### Important Classes to know about:

1. [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
> Pipeline(\[ ('step1', Transformer1), ('step2', Transformer2), (...), ('model', Model) \])
<img src="images_2/pipeline.png" style="height: 80px;" align="center"/>
1. [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)
> FeatureUnion(\[ ('transformer1', Transformer1), ('transformer2', Transformer2), (...) \])
<img src="images_2/featureunion.png" style="height: 210px;" align="center"/>
1. [FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)
> FunctionTransformer(func, validate=False)
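The three classes can be combined as in the rough sketch below, which uses a small made-up Numpy array rather than the Titanic data. The step names and the `add_one` helper are hypothetical.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_one(X):
    # Toy stateless transformation wrapped by FunctionTransformer.
    return X + 1.0

X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# FeatureUnion runs its transformers side by side and concatenates the columns.
union = FeatureUnion([
    ('scaled', StandardScaler()),
    ('logged', FunctionTransformer(np.log)),
])

# Pipeline runs its steps one after the other.
pipe = Pipeline([
    ('shift', FunctionTransformer(add_one)),
    ('features', union),
])

out = pipe.fit_transform(X_toy)
print(out.shape)  # (3, 4): two scaled columns followed by two logged columns
```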


### Exercise #1: <a class="anchor" id="exercise-1"></a>
(Back to [agenda/toc](#agenda))<br><br>
We're going to spend some time looking at Pipelines built from various Transformers and Estimators. Let's start by getting a feel for what these transformers do.<br><br>
<span style="color:blue">
Use FunctionTransformer to:<br>
    (1) Impute Fare values based on `fare_imputer` function, and<br>
    (2) Impute Embarked values based on `embarked_imputer` function, and<br>
    (3) Perform transformation using `custom_transformer` function.<br><br>
</span>
<div class="panel-group" id="accordion-2">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-2" href="#collapse1-2">Hints</a>
    </h4>
    </div>
    <div id="collapse1-2" class="panel-collapse collapse">
      <div class="panel-body">
For (1):<br>
Pass the `fare_imputer` function as an argument to `FunctionTransformer()`, and set `validate=False` since we'll be passing a DataFrame into this transformer.<br><br>
Call `fit_transform(X,y)` to perform transformation of the input feature matrix X.
<br><br>
Similar steps to solve (2) and (3), just with different functions: `embarked_imputer` and `custom_transformer`.
    </div>
    </div>
</div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-2" href="#collapse2-2">Solution</a>
    </h4>
    </div>
    <div id="collapse2-2" class="panel-collapse collapse">
      <div class="panel-body">
One possible solution is:
<br>
`t = FunctionTransformer(fare_imputer, validate=False)
print("Fare issues prior to transformation: {}".format((df.Fare == 0).sum()))
print("Fare issues after transformation: {}".format((t.fit_transform(X,y).Fare == 0).sum()))`
<br><br>
`t = FunctionTransformer(embarked_imputer, validate=False)
print("Embarked issues prior to transformation: {}".format((df.Embarked.isnull()).sum()))
print("Embarked issues after transformation: {}".format((t.fit_transform(X,y).Embarked.isnull()).sum()))`
<br><br>
`t = FunctionTransformer(custom_transformer, validate=False)
t.fit_transform(X,y).head().T`
<br>
      </div>
    </div>
</div>
</div> 

In [11]:
# Exercise #1 solution:

t = FunctionTransformer(fare_imputer, validate=False)
print("Fare issues prior to transformation: {}".format((df.Fare == 0).sum()))
print("Fare issues after transformation: {}".format((t.fit_transform(X,y).Fare == 0).sum())) 

t = FunctionTransformer(embarked_imputer, validate=False)
print("Embarked issues prior to transformation: {}".format((df.Embarked.isnull()).sum()))
print("Embarked issues after transformation: {}".format((t.fit_transform(X,y).Embarked.isnull()).sum())) 

t = FunctionTransformer(custom_transformer, validate=False)
t.fit_transform(X,y).head().T 

Fare issues prior to transformation: 15
Fare issues after transformation: 0
Embarked issues prior to transformation: 2
Embarked issues after transformation: 0


  


PassengerId,1,2,3,4,5
Pclass,3,1,3,1,3
Sex,male,female,female,female,male
Fare,7.25,71.2833,7.925,53.1,8.05
Embarked,S,C,S,S,S
fare_log,1.981,4.26666,2.07002,3.97218,2.08567
age_16_to_34,1,0,1,0,0
age_35_to_47,0,1,0,1,1
age_missing,0,0,0,0,0
people_per_ticket,1,1,1,2,1
people_per_lastname,2,1,1,2,2


### Simple Pipeline: <a class="anchor" id="simple-pipeline"></a>
(Back to [agenda/toc](#agenda))<br>
* For illustration of how a pipeline works, let's create a "simple" pipeline which:
  - Imputes missing values
  - Peeks at the data to see its format
  - Select just numeric columns -- otherwise Scikit Learn will error out
  - Peek at the data again to see what's going into our model
  - Apply a random forest model on the output
  - Look at some basic model evaluation metrics

In [12]:
basic_rf_pipe = Pipeline(
    [
        ('impute_fare', FunctionTransformer(fare_imputer, validate=False)),
        #('peek_before', FunctionTransformer(peek_factory('Prior to selection', head=1),validate=False)),
        ('numerics_selection', FunctionTransformer(select_numeric_cols, validate=False)),
        #('peek_after', FunctionTransformer(peek_factory('After selection', head=1),validate=False)),
        ('pass_through', PassThroughTransformer()),
        ('model', RandomForestClassifier(random_state=42))
    ]
)

The Pipeline object responds to the standard Scikit Learn methods:
* fit
* fit_transform (when the final step is a transformer)
* predict (when the final step is a predictor)

In [13]:
# Note: dropping Age column as it includes NULLs causing Scikit-Learn to error out
basic_rf_pipe.fit(X_train.drop('Age',axis=1),y_train)

model_eval(y_test, basic_rf_pipe.predict(X_test.drop('Age',axis=1)))

INFO:root:Model - Accuracy: 0.6815642458100558
INFO:root:Model - F1 Score: 0.5648854961832062
INFO:root:Model - Matthew's Corr Coef: 0.32718036166210035


Unnamed: 0,Pred_Died,Pred_Survived
Actual_Died,85,20
Actual_Survived,37,37


**A couple of things to note:**
* After pipeline setup, using the pipeline is just like using a normal Scikit Learn estimator: call `fit` and `predict`.
* One can access steps in the pipeline using `pipe.named_steps`. This allows for detailed introspection of parameters / results from estimators within the pipeline.

In [14]:
basic_rf_pipe.named_steps.model.feature_importances_

array([0.09885934, 0.08469302, 0.08979226, 0.72665538])

### More Complex Pipeline: <a class="anchor" id="complex-pipeline"></a>
(Back to [agenda/toc](#agenda))<br><br>
Planned pipeline:
* Treating categorical attributes separately.
* Using One-Hot encoding for categorical attributes.
<img src="images_2/complex_pipeline.png" style="height: 450px;" align="center"/>

In [15]:
p_pclass = Pipeline(
    [
        ('selector_pclass', ColumnExtractor(columns = ['Pclass'])),
        ('one_hot_pclass', FixedLabelBinarizer()),
    ]
)
p_sex = Pipeline(
    [
        ('selector_sex', ColumnExtractor(columns = ['Sex'])),
        ('one_hot_sex', FixedLabelBinarizer()),
    ]
)
p_embarked = Pipeline(
    [
        ('selector_embarked', ColumnExtractor(columns = ['Embarked'])),
        ('one_hot_embarked', FixedLabelBinarizer()),
    ]
)
p_numeric = Pipeline(
    [
        ('selector_numeric', ColumnExtractor(dtypes = [np.int64, np.float64])),
        ('scale_numeric', Normalizer()),
    ]
)

p_full = Pipeline(
    [
        ('embarked_imputer', FunctionTransformer(embarked_imputer, validate=False)),
        ('fare_imputer', FunctionTransformer(fare_imputer, validate=False)),
        ('custom_transform', FunctionTransformer(custom_transformer, validate=False)),
        ('features', FeatureUnion(
            [
                ('pclass', p_pclass),
                ('sex', p_sex),
                ('embarked', p_embarked),
                ('numeric', p_numeric),
            ]
        )),
        ('estimators', FeatureUnion([
            ('pass_through', PassThroughTransformer()),
            ('m1', ModelTransformer(KNeighborsClassifier(n_neighbors=15))),
            ('m2', ModelTransformer(GradientBoostingClassifier(random_state=42))),
            ('m3', ModelTransformer(ElasticNet(random_state=42))),
            ('m4', ModelTransformer(KMeans(n_clusters=3, random_state=42)))
        ])),
        #('final_peek', FunctionTransformer(peek_factory('Input to SVC', head=1))),
        ('predictor', SVC())
    ]
)

In [16]:
# Simple fit, predict with pipeline:
#   Large difference (~10%) between accuracy on Training set VS Test set ==> Overfitting
p_model = p_full.fit(X_train,y_train)
model_eval(y_train, p_model.predict(X_train), 'Training set evaluation')
model_eval(y_test, p_model.predict(X_test), 'Test set evaluation')

INFO:root:Training set evaluation - Accuracy: 0.9129213483146067
INFO:root:Training set evaluation - F1 Score: 0.8755020080321284
INFO:root:Training set evaluation - Matthew's Corr Coef: 0.8147383472327271
INFO:root:Test set evaluation - Accuracy: 0.8156424581005587
INFO:root:Test set evaluation - F1 Score: 0.7659574468085106
INFO:root:Test set evaluation - Matthew's Corr Coef: 0.616565942969521


Unnamed: 0,Pred_Died,Pred_Survived
Actual_Died,92,13
Actual_Survived,20,54


In [17]:
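# Cross validation for quick CV assessment of model:
#   Note: random_state only takes effect when shuffle=True (shuffle defaults to False);
#   newer Scikit Learn releases raise an error when random_state is set without shuffle.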
cv = StratifiedKFold(n_splits=10, random_state=42)
cross_val_score(p_full, X_train, y_train, cv=cv).mean()

0.7935300693047171

#### Some take-aways from developing this section:
Getting this "more complex" pipeline to work had some gotchas.

* The main challenge encountered was in trying to one-hot encode categorical variables as part of the Pipeline:
  - Issue 1: In version 0.18, the `LabelBinarizer()` class could take a Numpy array of text-based categorical variables and convert it to one-hot encoded columns inside a Pipeline. In version 0.19 the interface changed to no longer include the optional `y` in the `transform()` function, thus deviating from the Pipeline interface.
  - Issue 2: The output of `pd.get_dummies` depends on the values present in the slice of data it is given, and is also order dependent, so different slices can produce different columns.
  
* Solutions:
  - For this tutorial, I patched `LabelBinarizer()` class to match the Pipeline interface using `FixedLabelBinarizer`.
  - To use `pd.get_dummies`, first convert the DataFrame column to a Categorical type with `pd.Categorical()`; the Categorical definition ensures that the one-hot encoding of the column is always consistent.
  - A new `CategoricalEncoder` class was planned for Scikit Learn version 0.20 to solve this problem; in the released 0.20 that functionality was folded into `OneHotEncoder` (which accepts string categories) and the new `OrdinalEncoder`.
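The `pd.Categorical()` approach from the solutions above can be sketched as follows. The `encode` helper and the two tiny slices are made up for illustration; the category list is the set of Embarked codes seen in the Titanic data.

```python
import pandas as pd

PORTS = ['C', 'Q', 'S']  # fixed, ordered list of Embarked categories

def encode(values):
    # Declaring the categories up front fixes both the set and the order of
    # the dummy columns, regardless of which values appear in this slice.
    col = pd.Categorical(values, categories=PORTS)
    return pd.get_dummies(pd.DataFrame({'Embarked': col}), columns=['Embarked'])

slice_a = encode(['S', 'C'])   # 'Q' never appears in this slice
slice_b = encode(['Q'])        # only 'Q' appears in this slice

print(list(slice_a.columns))
print(list(slice_b.columns))
```

Both slices yield the same three columns, which is what a downstream model fitted on one slice and applied to another requires.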

## Using Grid Search / Randomized Search to optimize hyperparameters <a class="anchor" id="search-for-hyperparameters"></a>
(Back to [agenda/toc](#agenda))

These classes let us search a parameter space for the model parameters that give the best model performance.

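**Higher CPU usage ==> (potentially) higher model performance**: searching more candidates costs more compute and can find better parameters, but an improvement is not guaranteed.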

1. [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
> GridSearchCV(estimator, param_grid, cv=None)

1. [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
> RandomizedSearchCV(estimator, param_distributions, n_iter=10, cv=None)
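As a minimal sketch of the randomized variant, the example below samples a fixed number of candidates instead of trying every combination. The synthetic data and the chosen ranges are assumptions for illustration only.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=0)

# Lists are sampled uniformly; scipy distributions are sampled via their rvs method.
param_distributions = {
    'n_estimators': [10, 20, 30],
    'max_depth': randint(2, 6),   # integers 2, 3, 4 or 5
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions,
                            n_iter=5,          # number of sampled candidates
                            cv=3,
                            random_state=42)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Here `n_iter` caps the cost of the search directly, whereas the cost of a grid search grows with the product of the list lengths.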

We'll start by looking at GridSearch on the [Simple Pipeline](#simple-pipeline) we created above.

In [18]:
# First example; hyper-parameter search on "Simple Pipeline":
#   Specify parameters as: NAMED_STEP + "__" + PARAMETER_NAME
simple_param_grid_1 = {
    'model__n_estimators' : [100, 150, 200],
    'model__max_depth' : [2, 3, 5],
    'model__min_samples_split' : [3, 9, 15],
}

cv = StratifiedKFold(n_splits=3, random_state=42)

gs_simple1 = GridSearchCV(basic_rf_pipe, simple_param_grid_1, cv=cv)
gs_simple1.fit(X_train.drop('Age',axis=1), y_train)

print(gs_simple1.best_params_)
model_eval(y_train, gs_simple1.best_estimator_.predict(X_train.drop('Age',axis=1)), 'Training set evaluation')
model_eval(y_test, gs_simple1.best_estimator_.predict(X_test.drop('Age',axis=1)), 'Test set evaluation')

INFO:root:Training set evaluation - Accuracy: 0.7528089887640449
INFO:root:Training set evaluation - F1 Score: 0.6053811659192825
INFO:root:Training set evaluation - Matthew's Corr Coef: 0.4552494131215883
INFO:root:Test set evaluation - Accuracy: 0.7486033519553073
INFO:root:Test set evaluation - F1 Score: 0.6280991735537189
INFO:root:Test set evaluation - Matthew's Corr Coef: 0.47875642038386695


{'model__max_depth': 5, 'model__min_samples_split': 9, 'model__n_estimators': 100}


Unnamed: 0,Pred_Died,Pred_Survived
Actual_Died,96,9
Actual_Survived,36,38


### Exercise #2:  <a class="anchor" id="exercise-2"></a>
(Back to [agenda/toc](#agenda))<br><br>
It is possible to switch the entire model (or in fact any of the estimators) used in the Pipeline by providing the named step as a parameter directly. If the different models have different parameters, the parameter search can cover multiple sets of parameters as follows:

`param_grid = [
    { grid 1 },
    { grid 2 }, 
    ...
]`<br><br>
<span style="color:blue">
Let's modify the example above to:<br>
* Search over a primary grid to switch to a `KNeighborsClassifier()` model and search over various values of the `n_neighbors` parameter.
* Also, define a secondary grid to switch to an `SVC()` model and search over various values of the `C` parameter.
</span>
<div class="panel-group" id="accordion-3">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-3" href="#collapse1-3">Hints</a>
    </h4>
    </div>
    <div id="collapse1-3" class="panel-collapse collapse">
      <div class="panel-body">
The model can be switched with the following parameter:<br>
`'model' : [ MODEL_CLASS_INSTANTIATION ], `
    </div>
    </div>
</div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-3" href="#collapse2-3">Solution</a>
    </h4>
    </div>
    <div id="collapse2-3" class="panel-collapse collapse">
      <div class="panel-body">
One possible solution is:
<br>
`simple_param_grid_ex2 = [
    {
         'model' : [ KNeighborsClassifier() ],
         'model__n_neighbors' : [3, 7, 11, 15],
    },
    {
        'model' : [ SVC(random_state=42) ],
        'model__C' : [0.1, 0.3, 0.6, 0.9, 1.2],
    }
]`
      </div>
    </div>
</div>
</div>

In [19]:
# Exercise 2:

simple_param_grid_ex2 = [
  {
       'model' : [ KNeighborsClassifier() ],
       'model__n_neighbors' : [3, 7, 11, 15],
  },
  {
      'model' : [ SVC(random_state=42) ],
      'model__C' : [0.1, 0.3, 0.6, 0.9, 1.2],
  }
]

cv = StratifiedKFold(n_splits=3, random_state=42)

gs_simple_ex2 = GridSearchCV(basic_rf_pipe, simple_param_grid_ex2, cv=cv)
gs_simple_ex2.fit(X_train.drop('Age',axis=1), y_train)

print(gs_simple_ex2.best_params_)
model_eval(y_train, gs_simple_ex2.best_estimator_.predict(X_train.drop('Age',axis=1)), 'Training set evaluation')
model_eval(y_test, gs_simple_ex2.best_estimator_.predict(X_test.drop('Age',axis=1)), 'Test set evaluation')

INFO:root:Training set evaluation - Accuracy: 0.7752808988764045
INFO:root:Training set evaluation - F1 Score: 0.641255605381166
INFO:root:Training set evaluation - Matthew's Corr Coef: 0.5088081676064811
INFO:root:Test set evaluation - Accuracy: 0.7374301675977654
INFO:root:Test set evaluation - F1 Score: 0.6050420168067226
INFO:root:Test set evaluation - Matthew's Corr Coef: 0.4549350970065187


{'model': SVC(C=1.2, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False), 'model__C': 1.2}


Unnamed: 0,Pred_Died,Pred_Survived
Actual_Died,96,9
Actual_Survived,38,36


##### What Parameters can be Searched Over?

We can use: `estimator.get_params().keys()` to understand the parameters that can be set on a specific estimator / pipeline.

In [20]:
# For example: 
gs_simple1.get_params().keys()

dict_keys(['cv', 'error_score', 'estimator__memory', 'estimator__steps', 'estimator__impute_fare', 'estimator__numerics_selection', 'estimator__pass_through', 'estimator__model', 'estimator__impute_fare__accept_sparse', 'estimator__impute_fare__func', 'estimator__impute_fare__inv_kw_args', 'estimator__impute_fare__inverse_func', 'estimator__impute_fare__kw_args', 'estimator__impute_fare__pass_y', 'estimator__impute_fare__validate', 'estimator__numerics_selection__accept_sparse', 'estimator__numerics_selection__func', 'estimator__numerics_selection__inv_kw_args', 'estimator__numerics_selection__inverse_func', 'estimator__numerics_selection__kw_args', 'estimator__numerics_selection__pass_y', 'estimator__numerics_selection__validate', 'estimator__model__bootstrap', 'estimator__model__class_weight', 'estimator__model__criterion', 'estimator__model__max_depth', 'estimator__model__max_features', 'estimator__model__max_leaf_nodes', 'estimator__model__min_impurity_decrease', 'estimator__model_
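The `STEP__PARAMETER` naming can be explored on a smaller pipeline, as in the sketch below. The toy pipeline and its step names are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42)),
])

# Every key that starts with 'model__' is a tunable parameter of that step.
model_keys = sorted(k for k in toy_pipe.get_params() if k.startswith('model__'))
print(model_keys[:3])

# The same names work with set_params, which is what the search classes use internally.
toy_pipe.set_params(model__n_estimators=50)
print(toy_pipe.named_steps['model'].n_estimators)  # 50
```

When `get_params()` is called on a GridSearchCV object, as in the cell above, the same keys appear with an extra `estimator__` prefix because the pipeline is itself a parameter of the search; the names used in `param_grid` omit that prefix.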

Next, we perform GridSearchCV / RandomizedSearchCV on the [Complex Pipeline](#complex-pipeline) we developed above.

In [21]:
# Find out what params can be tuned in pipeline:
[i for i in p_full.get_params().keys() if 'm4' in i]
#p_full.get_params().keys()

['estimators__m4',
 'estimators__m4__model__algorithm',
 'estimators__m4__model__copy_x',
 'estimators__m4__model__init',
 'estimators__m4__model__max_iter',
 'estimators__m4__model__n_clusters',
 'estimators__m4__model__n_init',
 'estimators__m4__model__n_jobs',
 'estimators__m4__model__precompute_distances',
 'estimators__m4__model__random_state',
 'estimators__m4__model__tol',
 'estimators__m4__model__verbose',
 'estimators__m4__model']

In [22]:
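# Hyperparameter search:
# Note: random_state only takes effect when shuffle=True. With the default
# shuffle=False it is ignored here, and recent scikit-learn releases raise a
# ValueError for this combination.
cv = StratifiedKFold(n_splits=3, random_state=42)
p1_grid = [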
    {
        'predictor__C' : [0.1, 0.3, 0.5, 1],
        'estimators__m1__model__n_neighbors' : [3,7,11,15],
        'estimators__m2__model__n_estimators' : [100, 150, 200],
    },
    {
        'predictor' : [ RandomForestClassifier(random_state=42) ],
        'predictor__n_estimators' : [10, 25],
        'predictor__max_depth' : [3, 5, 7],
        'estimators__m1' : [None, ModelTransformer(KNeighborsClassifier(n_neighbors=15))],
        'estimators__m2' : [None, ModelTransformer(RandomForestClassifier(random_state=42))],
        'estimators__m3' : [None, ModelTransformer(GradientBoostingClassifier(random_state=42))],
        'estimators__m4' : [None, ModelTransformer(KMeans(n_clusters=7, random_state=42))],
    },
]

gs1 = GridSearchCV(p_full, p1_grid, cv=cv, n_jobs=4)
gs1.fit(X_train,y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=False),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('embarked_imputer', FunctionTransformer(accept_sparse=False,
          func=<function embarked_imputer at 0x7f17f7159bf8>,
          inv_kw_args=None, inverse_func=None, kw_args=None,
          pass_y='deprecated', validate=False)), ('fare_imputer', FunctionTransformer(accept_sparse=False,
 ...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=4,
       param_grid=[{'predictor__C': [0.1, 0.3, 0.5, 1], 'estimators__m1__model__n_neighbors': [3, 7, 11, 15], 'estimators__m2__model__n_estimators': [100, 150, 200]}, {'predictor': [RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max...7, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=42, tol=0.0001, verbo
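
When `param_grid` is a list of dictionaries, each dictionary is expanded into its own grid, and the search evaluates the union of those grids. The following back-of-the-envelope sketch counts the candidates for a grid shaped like `p1_grid`; the strings are placeholders standing in for the estimator objects, and `count_candidates` is a hypothetical helper.

```python
from math import prod

# Same shape as p1_grid; strings are placeholders for estimator objects.
grid = [
    {
        "predictor__C": [0.1, 0.3, 0.5, 1],
        "estimators__m1__model__n_neighbors": [3, 7, 11, 15],
        "estimators__m2__model__n_estimators": [100, 150, 200],
    },
    {
        "predictor": ["RandomForestClassifier"],
        "predictor__n_estimators": [10, 25],
        "predictor__max_depth": [3, 5, 7],
        "estimators__m1": [None, "knn"],
        "estimators__m2": [None, "random_forest"],
        "estimators__m3": [None, "gradient_boosting"],
        "estimators__m4": [None, "kmeans"],
    },
]


def count_candidates(param_grid):
    """Number of parameter combinations in each sub-grid."""
    return [prod(len(values) for values in sub.values()) for sub in param_grid]


per_dict = count_candidates(grid)   # one count per dictionary
total_candidates = sum(per_dict)    # sub-grids are searched independently
n_splits = 3                        # matches StratifiedKFold(n_splits=3)
total_fits = total_candidates * n_splits

print(per_dict, total_candidates, total_fits)
```

With 144 candidates and 3 folds, the search fits the full pipeline 432 times before refitting the best candidate, which is why adding even one more parameter list to a grid can noticeably increase run time.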

In [23]:
print(gs1.best_params_)
model_eval(y_train,gs1.best_estimator_.predict(X_train))
model_eval(y_test,gs1.best_estimator_.predict(X_test))

INFO:root:Model - Accuracy: 0.9129213483146067
INFO:root:Model - F1 Score: 0.8755020080321284
INFO:root:Model - Matthew's Corr Coef: 0.8147383472327271
INFO:root:Model - Accuracy: 0.8156424581005587
INFO:root:Model - F1 Score: 0.7659574468085106
INFO:root:Model - Matthew's Corr Coef: 0.616565942969521


{'estimators__m1': None, 'estimators__m2': None, 'estimators__m3': ModelTransformer(model=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=42, subsample=1.0, verbose=0,
              warm_start=False)), 'estimators__m4': ModelTransformer(model=KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=7, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=42, tol=0.0001, verbose=0)), 'predictor': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=

| | Pred_Died | Pred_Survived |
|---|---|---|
| Actual_Died | 92 | 13 |
| Actual_Survived | 20 | 54 |
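
The search above uses `GridSearchCV`, which evaluates every combination in the grid. `RandomizedSearchCV` instead samples a fixed number of candidates (`n_iter`) from lists or distributions, which keeps the cost predictable as the search space grows. The sketch below is a minimal illustration on synthetic data with a small stand-in pipeline; the data, step names, and parameter ranges are assumptions, not the notebook's `p_full` setup.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, random_state=42
)

pipe = Pipeline([("scale", MinMaxScaler()), ("model", SVC())])

# Distributions are sampled; plain lists are sampled uniformly.
param_distributions = {
    "model__C": loguniform(1e-2, 1e1),
    "model__kernel": ["rbf", "linear"],
}

cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
rs = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=8,          # number of sampled candidates
    cv=cv_demo,
    random_state=42,   # makes the sampling reproducible
)
rs.fit(X_demo, y_demo)

print(rs.best_params_)
print(len(rs.cv_results_["params"]), "candidates evaluated")
```

Here only 8 candidates are evaluated regardless of how wide the `C` range is, whereas a grid over the same space grows with every value added.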


## Understanding which models are overfitting / underfitting <a class="anchor" id="overfitting-underfitting"></a>
(Back to [agenda/toc](#agenda))<br><br>

<img src="images_2/overfit_underfit.png" style="height: 180px;" align="center"/>

The standard approach for understanding whether a model is overfitting or underfitting is to compare the prediction error on the training data with the prediction error on the evaluation (test) data.

**Underfitting:** Poor performance on training data ==> model is too simple, increase model flexibility:
* Add new domain-specific features and more feature Cartesian products, and change the types of feature processing used
* Decrease the amount of regularization used

**Overfitting:** Good performance on training data, but not generalizable to evaluation (test) data ==> reduce model flexibility:
* Feature selection: consider using fewer feature combinations, and decrease the number of numeric attribute bins
* Increase the amount of regularization used

Accuracy on training and test data could be poor because the learning algorithm did not have enough data to learn from. You could improve performance by doing the following:
* Increase the amount of training data examples
* Increase the number of passes on the existing training data

*Source: Amazon AWS Documentation* 
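
As a rough illustration of the train-versus-test comparison described above, the sketch below applies a simple rule of thumb to the accuracies logged for `gs1.best_estimator_`. The helper `diagnose_fit` and its thresholds (`min_train`, `max_gap`) are hypothetical and chosen only for demonstration; sensible values depend on the problem.

```python
def diagnose_fit(train_score, test_score, min_train=0.80, max_gap=0.05):
    """Rule-of-thumb label based on training and test scores.

    Assumptions: higher scores are better, and the two thresholds are
    illustrative defaults rather than established cut-offs.
    """
    gap = train_score - test_score
    if train_score < min_train:
        return "underfitting"      # poor even on the data it was trained on
    if gap > max_gap:
        return "overfitting"       # good on training data, worse on test data
    return "reasonable fit"


# Accuracies logged above for the best estimator found by the grid search.
train_acc = 0.9129213483146067
test_acc = 0.8156424581005587

print(f"gap = {train_acc - test_acc:.3f} -> {diagnose_fit(train_acc, test_acc)}")
```

Under these illustrative thresholds the gap of roughly 0.10 between training and test accuracy points toward some overfitting, which suggests trying stronger regularization or fewer features before adding model complexity.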

### Write Kaggle Submission file based on predicted outputs: <a class="anchor" id="kaggle-submission"></a>
(Back to [agenda/toc](#agenda))<br><br>

In [24]:
# If you're playing around with this notebook and want to submit a solution on Kaggle,
#  you can use the below to create the submission file.
pd.DataFrame(
    data=gs1.best_estimator_.predict(df_kaggle_holdout),
    index=df_kaggle_holdout.index,
    columns=['Survived']).reset_index().to_csv('nova_submission_session2.csv', index=False)
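
Before uploading, it can help to sanity-check the file layout. The sketch below rebuilds a submission frame in the same way on toy values and verifies it. It assumes the holdout index is named `PassengerId` (so that `reset_index()` produces that column); the toy ids, predictions, and the helper `check_submission` are hypothetical.

```python
import io

import pandas as pd

# Toy stand-ins (hypothetical): three passenger ids and three predictions.
holdout_index = pd.Index([892, 893, 894], name="PassengerId")
predictions = [0, 1, 0]

# Same construction as the cell above, applied to the toy values.
submission = pd.DataFrame(
    data=predictions, index=holdout_index, columns=["Survived"]
).reset_index()


def check_submission(frame, expected_rows):
    """Basic layout checks for a Titanic submission frame."""
    assert list(frame.columns) == ["PassengerId", "Survived"]
    assert len(frame) == expected_rows
    assert frame["PassengerId"].is_unique
    assert set(frame["Survived"].unique()) <= {0, 1}
    return True


# Round-trip through CSV text, as the uploaded file would be read back.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
ok = check_submission(pd.read_csv(io.StringIO(csv_text)), expected_rows=3)
print(csv_text)
```

For the real file, `expected_rows` would be the number of rows in `df_kaggle_holdout`.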

### Exercise #3 - Homework: <a class="anchor" id="exercise-3"></a>
(Back to [agenda/toc](#agenda))<br><br>
<span style="color:blue">
Use this notebook as a starting point, and see if you can get the accuracy of the model higher through:
* Adding new features
* Modifying the pipeline

We'd love to hear from you on [Slack](http://www.slack.com) ("NovaDataScience" group) with discussion of any progress you've made, issues you've encountered, or stories of success or failure digging into this problem further.
</span>
<div class="panel-group" id="accordion-4">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-4" href="#collapse1-4">Hints</a>
    </h4>
    </div>
    <div id="collapse1-4" class="panel-collapse collapse">
      <div class="panel-body">
Some ideas for features to add:<br>
<ul>
  <li>Extract the title from the Name attribute.</li>
  <li>Combined features related to position in family.</li>
  <li>Other thoughts?</li>
</ul>
Some ideas for possible pipeline modifications:<br>
<ul>
  <li>Addition of <code>PCA()</code> components in a separate pipeline.</li>
  <li>Selection of features to use in model using <code>SelectKBest()</code>.</li>
  <li>Other thoughts?</li>
</ul>
    </div>
    </div>
</div>
</div>
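
As a starting point for the first hint, the sketch below pulls the title out of names formatted as `Last, Title. First ...`, which is how names appear in the Titanic data. The helper `add_title` and the output column name `title` are hypothetical; the three names are used only as sample input.

```python
import pandas as pd


def add_title(frame):
    """Return a copy of `frame` with a `title` column parsed from `Name`.

    Assumes names look like "Last, Title. First ..."; rows that do not
    match the pattern get a missing value.
    """
    out = frame.copy()
    out["title"] = (
        out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    )
    return out


passengers = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
})

with_titles = add_title(passengers)
print(with_titles["title"].tolist())
```

Because the function takes a DataFrame and returns a DataFrame, it could be wrapped with `FunctionTransformer(add_title, validate=False)` in the same way as the other custom steps in this notebook.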

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

### Misc Calculations:

In [None]:
# Quick check: dummy-encode the categorical columns and inspect the first rows (transposed).
t = FunctionTransformer(transformer_dummies, kw_args={'columns': ['Sex', 'Embarked', 'Pclass']}, validate=False)
t.transform(df[['Pclass', 'Sex', 'Embarked']]).head().T

In [None]:
# Scratch pipeline: impute, engineer features, then union three feature branches into an SVC.
p = Pipeline([
    ('impute_fare', FunctionTransformer(fare_imputer, validate=False)),
    ('impute_embarked', FunctionTransformer(embarked_imputer, validate=False)),
    ('custom', FunctionTransformer(custom_transformer, validate=False)),
    ('fUnion', FeatureUnion([
        # Branch 1: select the categorical columns and dummy-encode them.
        ('categoricalPipeline', Pipeline([
            ('categoricalSelector', FunctionTransformer(
                columns_subset,
                kw_args={'columns': ['Sex', 'Pclass', 'Embarked', 'cabin_level']},
                validate=False)),
            ('encoder', FunctionTransformer(
                transformer_dummies,
                kw_args={'columns': ['Sex', 'Pclass', 'Embarked', 'cabin_level']},
                validate=False)),
        ])),
        # Branch 2: scale Fare to the [0, 1] range.
        ('farePipeline', Pipeline([
            ('fareSelector', FunctionTransformer(
                columns_subset,
                kw_args={'columns': ['Fare']},
                validate=False)),
            ('fareNorm', MinMaxScaler()),
        ])),
        # Branch 3: pass the engineered columns through unchanged.
        ('colSelector', FunctionTransformer(
            columns_subset,
            kw_args={'columns': ['fare_log', 'age_16_to_34',
                                 'age_35_to_47', 'age_missing',
                                 'people_per_ticket', 'people_per_lastname',
                                 'people_in_family']},
            validate=False)),
    ])),
    ('model', SVC()),
])

# Fit on the full training frame and predict on the Kaggle holdout set.
X = df.drop(['Survived'], axis=1)
y = df.Survived.values
m = p.fit(X, y)
m.predict(df_kaggle_holdout)