# Scikit-learn Pipeline Application

We will build three pipelines, each with a different estimator (classification algorithm) and default hyperparameters:

- Logistic Regression
- Support Vector Machine
- Decision Tree

Each pipeline applies the same two *transforms*:

- Feature scaling
- Dimensionality reduction, using PCA to project the data onto a 2-dimensional space

The final step of each pipeline fits its estimator.
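To make the two transforms concrete, here is a minimal sketch that applies them by hand, outside of any pipeline, to the full iris data. The variable names `X_scaled` and `X_2d` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # 150 samples, 4 features

# Transform 1: standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Transform 2: project the scaled data onto its first 2 principal components
X_2d = PCA(n_components=2).fit_transform(X_scaled)

print(X.shape, '->', X_2d.shape)
```

A pipeline chains these same calls, so the estimator at the end only ever sees the 2-dimensional data.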

Afterward, we will:

- Score each pipeline on the test data
- Compare the pipelines' accuracies
- Identify the "best" model, meaning the one with the highest accuracy on our test data
- Save the entire pipeline of the "best" model

Granted, because we use default hyperparameters, this will likely not produce the most accurate models, but it gives a sense of how to use simple pipelines.

## Import

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree

### Load data

In [2]:
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

### Pipeline

In [3]:
# Each pipeline: scale features, reduce to 2 principal components, then classify
pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=42))])
pipe_svm = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', svm.SVC(random_state=42))])
pipe_dt = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', tree.DecisionTreeClassifier(random_state=42))])
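Each step in a pipeline is a `(name, estimator)` pair, and those names can be used to look steps up later. The sketch below rebuilds one pipeline with the same step names as above (`scl`, `pca`, `clf`) and shows two ways to reach into it. The variable names `pipe`, `step_names`, and `pca_step` are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=42))])

# Step names, in the order the steps are applied
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Look up a single step by name
pca_step = pipe.named_steps['pca']
print(pca_step.n_components)

# Step parameters are addressed as '<step name>__<parameter name>'
print(pipe.get_params()['pca__n_components'])
```

The double-underscore names are the same ones `set_params` accepts, which is how a step's hyperparameters could be changed without rebuilding the pipeline.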


In [6]:
# List of pipelines so we can iterate over them
pipelines = [pipe_lr, pipe_svm, pipe_dt]

# Map each pipeline's list index to its classifier name for ease of reference
pipe_dict = {0: 'Logistic Regression',
             1: 'Support Vector Machine',
             2: 'Decision Tree'}

### Fitting the pipelines

In [8]:
for pipe in pipelines:
    pipe.fit(X_train, y_train)
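Calling `fit` on a pipeline fits and applies each transform in order, then fits the final estimator on the transformed data. As a small sketch of what this leaves behind, the example below fits one pipeline on the same train/test split and inspects the fitted PCA step. The name `ratios` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=42))])
pipe.fit(X_train, y_train)

# Fraction of the (scaled) training variance kept by each of the 2 components
ratios = pipe.named_steps['pca'].explained_variance_ratio_
print(ratios, ratios.sum())
```

The scaler and PCA are fitted on the training data only, so no information from the test set leaks into the transforms.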



### Comparing accuracies

In [9]:
for idx, val in enumerate(pipelines):
    print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx], val.score(X_test, y_test)))

Logistic Regression pipeline test accuracy: 0.933
Support Vector Machine pipeline test accuracy: 0.900
Decision Tree pipeline test accuracy: 0.867
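For a classifier pipeline, `score` transforms the test data with the already fitted steps and reports the final estimator's mean accuracy. The sketch below checks this against an explicit `predict` followed by `accuracy_score`, using one pipeline on the same split. The names `acc_from_score` and `acc_from_predict` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=42))])
pipe.fit(X_train, y_train)

# Two routes to the same number: the built-in score, and predict + metric
acc_from_score = pipe.score(X_test, y_test)
acc_from_predict = accuracy_score(y_test, pipe.predict(X_test))
print(acc_from_score, acc_from_predict)
```

With 30 test samples, each misclassified flower changes the accuracy by about 0.033, so small differences between models here rest on one or two samples.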


In [13]:
# Identify the most accurate model on test data
best_acc = 0.0
best_clf = 0
best_pipe = None
for idx, val in enumerate(pipelines):
    acc = val.score(X_test, y_test)  # score each pipeline only once
    if acc > best_acc:
        best_acc = acc
        best_pipe = val
        best_clf = idx

print('Classifier with best accuracy: %s' % pipe_dict[best_clf])

Classifier with best accuracy: Logistic Regression


### Save pipeline to file

In [15]:
joblib.dump(best_pipe, 'best_pipeline.pkl', compress=1)
print('Saved %s pipeline to file' % pipe_dict[best_clf])

Saved Logistic Regression pipeline to file
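Because the whole pipeline is saved, not just the classifier, it can be loaded later and applied directly to raw, unscaled features. Here is a minimal round-trip sketch. It writes to a temporary directory rather than the working directory, and the names `path`, `loaded`, and `same` are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=42))])
pipe.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'best_pipeline.pkl')
    joblib.dump(pipe, path, compress=1)
    loaded = joblib.load(path)  # scaler, PCA, and classifier all come back fitted

# The loaded pipeline reproduces the original predictions on raw features
same = (loaded.predict(X_test) == pipe.predict(X_test)).all()
print(same)
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that saved them, and only from sources you trust.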
