Pipeline Test
========

This loosely follows what Chris Albon did [here](https://chrisalbon.com/machine-learning/pipelines_with_parameter_optimization.html), combined with what I did in the other notebook. Since I've never used scikit-learn's `Pipeline` object before, this is a bit of a learning process for me.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer

In [2]:
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

In [3]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# create necessary objects
scaler = StandardScaler()
pca = PCA()
svc = SVC()
logistic = LogisticRegression(solver='liblinear')  # liblinear supports both the 'l1' and 'l2' penalties searched below

pipe_mm = Pipeline(steps=[('sc', scaler), ('svc', svc)])
pipe_ca = Pipeline(steps=[('sc', scaler), ('pca', pca), ('logistic', logistic)])
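
The step names given to `Pipeline` matter later on: each step's parameters are exposed under `<step name>__<parameter>`, which is how a grid search addresses them. A minimal sketch of that naming, rebuilding the two-step pipeline with fresh objects (the value `0.01` is an arbitrary illustration, not a recommended setting):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline(steps=[('sc', StandardScaler()), ('svc', SVC())])

# Every nested parameter is exposed as '<step name>__<parameter name>'.
svc_params = sorted(name for name in pipe.get_params() if name.startswith('svc__'))
print(svc_params)

# set_params accepts the same names and forwards the value to the named step.
pipe.set_params(svc__gamma=0.01)
print(pipe.named_steps['svc'].gamma)
```

Listing `pipe.get_params()` is a quick way to check a key before putting it into a parameter grid, since a misspelled key only fails once the search starts fitting.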

We have our two pipelines: my two-step scale-and-fit, and Chris' three-step scale-transform-fit. A pipeline can be passed to `GridSearchCV` like any other estimator, but we need to say which parameters to vary and over what ranges. Parameters of a step are addressed as `<step name>__<parameter>`, e.g. `svc__gamma`. I'll set up the parameter grids and run a grid search over each to find the best model.

In [12]:
from sklearn.model_selection import GridSearchCV

gamma = np.logspace(-6, 1, 50)
n_components = np.arange(1, X.shape[1] + 1)
reg_c = np.logspace(-4, 4, 50)
penalty = ['l1', 'l2']

parameters_mm = dict(svc__gamma=gamma)
parameters_ca = dict(pca__n_components=n_components,
                     logistic__C=reg_c,
                     logistic__penalty=penalty)

grid_mm = GridSearchCV(pipe_mm, parameters_mm)
grid_ca = GridSearchCV(pipe_ca, parameters_ca)

grid_mm.fit(X, y)
grid_ca.fit(X, y)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logistic', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_inte...y='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'pca__n_components': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), 'logistic__C': array([  1.00000e-04,   1.45635e-04,   2.12095e-04,   3.08884e-04,
         4.49843e-04,   6.55129e-04,   9.54095e-0....23746e+03,   4.71487e+03,
         6.86649e+03,   1.00000e+04]), 'logistic__penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, v
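
The second search takes noticeably longer than the first, and the grid size explains why. A rough sketch of the bookkeeping, assuming the same three parameter ranges as above and the 3-fold default of the scikit-learn version used in this notebook (`n_features` and `n_folds` are written out by hand here rather than read from the fitted objects):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

n_features = 30  # number of columns in the breast cancer data (X.shape[1])

param_grid = dict(pca__n_components=np.arange(1, n_features + 1),
                  logistic__C=np.logspace(-4, 4, 50),
                  logistic__penalty=['l1', 'l2'])

# ParameterGrid enumerates the same combinations that GridSearchCV tries.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 3  # assumed: the cross-validation default in the version used here
print(n_candidates, 'candidates,', n_candidates * n_folds, 'fits')
```

That is 30 x 50 x 2 = 3000 parameter combinations for the three-step pipeline, against only 50 for the SVC pipeline, each fitted once per fold.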

Now, let's view what worked the best for our two pipelines...

In [13]:
# View the best parameters
print('Best gamma:', grid_mm.best_params_['svc__gamma'])

print('Best Number Of Components:', grid_ca.best_params_['pca__n_components'])
print('Best C:', grid_ca.best_params_['logistic__C'])
print('Best Penalty:', grid_ca.best_params_['logistic__penalty'])

Best gamma: 0.0193069772888
Best Number Of Components: 9
Best C: 0.0409491506238
Best Penalty: l2


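Our gamma roughly matches what I found in my exploratory analysis, so the good news is that I probably didn't mess anything up there! Now, let's finally score our models using 3-fold cross-validation. One thing to keep in mind: `GridSearchCV` tunes clones of the pipelines, so `pipe_mm` and `pipe_ca` still carry the parameters they were created with. The tuned models live in `grid_mm.best_estimator_` and `grid_ca.best_estimator_`.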

In [14]:
from sklearn.model_selection import cross_val_score

cross_val_score(pipe_mm, X, y, cv=3)  # cv=3 was the default in the scikit-learn version used here

array([ 0.96842105,  0.98421053,  0.97354497])

In [15]:
cross_val_score(pipe_ca, X, y, cv=3)  # cv=3 was the default in the scikit-learn version used here

array([ 0.97894737,  0.97368421,  0.97354497])

OK, so somewhat similar... It's hard to tell which model may be better, so let's increase the number of folds and save the results.

In [17]:
scores_mm = cross_val_score(pipe_mm, X, y, cv=10)
scores_ca = cross_val_score(pipe_ca, X, y, cv=10)

print(f'MM: {scores_mm.mean():.3f} +/- {scores_mm.std(ddof=0):.3f}')
print(f'CA: {scores_ca.mean():.3f} +/- {scores_ca.std(ddof=0):.3f}')

MM: 0.975 +/- 0.025
CA: 0.982 +/- 0.014
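
Since both pipelines are scored on the same folds, the per-fold differences can say more than two separate means. The following is a sketch only, not part of the original analysis: it rebuilds both pipelines with fresh objects, makes the shared folds explicit with a `StratifiedKFold` splitter, and looks at the fold-by-fold gap. The `random_state=0` seed and `max_iter=1000` are my own assumptions, so the numbers will not match the output above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe_mm = Pipeline(steps=[('sc', StandardScaler()), ('svc', SVC())])
pipe_ca = Pipeline(steps=[('sc', StandardScaler()), ('pca', PCA()),
                          ('logistic', LogisticRegression(max_iter=1000))])

# One splitter object, so both pipelines see identical train/test splits.
# shuffle=True with random_state=0 is an arbitrary, reproducible choice.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores_mm = cross_val_score(pipe_mm, X, y, cv=folds)
scores_ca = cross_val_score(pipe_ca, X, y, cv=folds)

# Paired, per-fold difference between the two pipelines.
diff = scores_ca - scores_mm
print(f'mean difference (CA - MM): {diff.mean():.3f} +/- {diff.std(ddof=0):.3f}')
```

If the mean difference is small compared with its spread across folds, the two pipelines are hard to separate on this dataset.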


Looks like Chris Albon's scale-transform-fit pipeline works slightly better than my simplified version, although the gap (0.982 vs. 0.975) is smaller than the fold-to-fold spread of either model. The pipeline he set up seems like a better idea for this problem, and may well be the better approach in general, so I am fine with that result.
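
One caveat on the comparison: `cross_val_score(pipe_mm, X, y)` scores the pipeline object as it was created, because `GridSearchCV` fits clones and leaves the original untouched. Here is a hedged sketch of scoring the tuned pipeline instead, using a deliberately small gamma grid (my assumption, to keep it quick) rather than the 50-point grid above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline(steps=[('sc', StandardScaler()), ('svc', SVC())])

# Small illustrative grid; not the range used in the notebook.
param_grid = {'svc__gamma': np.logspace(-4, 0, 5)}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

# The search works on clones, so `pipe` still has its default gamma.
default_gamma = pipe.get_params()['svc__gamma']
tuned_gamma = grid.best_estimator_.get_params()['svc__gamma']
print('default:', default_gamma, '| tuned:', tuned_gamma)

# Score the tuned pipeline rather than the original object.
tuned_scores = cross_val_score(grid.best_estimator_, X, y, cv=10)
print(f'tuned: {tuned_scores.mean():.3f} +/- {tuned_scores.std(ddof=0):.3f}')
```

Scores obtained this way are likely a little optimistic, since the same data chose the parameters. Passing the `GridSearchCV` object itself to `cross_val_score` (nested cross-validation) is one way around that, at the cost of many more fits.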