# Sklearn Pipelines

<p>A pipeline sequentially applies a list of transforms followed by a final estimator.</p>

<p>Every intermediate step of a pipeline must implement the fit and transform methods; the final estimator only needs to implement fit.</p>
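<p>As a rough illustration of that contract, the sketch below uses a hypothetical transformer, ColumnSelector (not part of scikit-learn and not from this notebook), as an intermediate step. It only needs fit and transform; the final LogisticRegression only needs fit for the pipeline to be trainable. The chosen columns and max_iter value are assumptions made for the example.</p>

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that keeps only the given column indices."""

    def __init__(self, columns=(0, 1)):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X):
        # Keep the selected columns and drop the rest.
        return np.asarray(X)[:, list(self.columns)]


X, y = load_iris(return_X_y=True)

# Intermediate step: fit + transform. Final step: fit (plus predict/score).
pipe = Pipeline([('select', ColumnSelector(columns=(2, 3))),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

selected = pipe.named_steps['select'].transform(X)
predictions = pipe.predict(X)
```

<p>Inheriting from TransformerMixin also supplies fit_transform, which the pipeline calls on intermediate steps during fit.</p>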

<p>The Pipeline class chains multiple processing steps into a single scikit-learn estimator. It exposes fit, predict and score just like any other estimator.</p>

<p>The steps parameter of Pipeline must be a list of tuples, each consisting of a name and an instance of a transformer or estimator.</p>
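<p>A minimal sketch of why the names matter: each name becomes a handle for looking up a step through named_steps and for setting its parameters with the name__parameter syntax. The step names below mirror the ones used later in this notebook; the change of n_components to 3 is only an illustration.</p>

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator instance) tuple.
pipe = Pipeline(steps=[('scl', StandardScaler()),
                       ('pca', PCA(n_components=2)),
                       ('clf', LogisticRegression())])

# The names are kept in order and can be used to look up a step.
step_names = [name for name, _ in pipe.steps]
pca_step = pipe.named_steps['pca']

# Parameters of a step are addressed as <step name>__<parameter name>.
pipe.set_params(pca__n_components=3)
```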

In [13]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import tree

In [14]:
iris=load_iris()

In [15]:
X_train,X_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.2,random_state=42)

In [19]:
pipe_lr=Pipeline([('scl',StandardScaler()),
                 ('pca',PCA(n_components=2)),
                 ('clf',LogisticRegression(random_state=42))])



pipe_svm=Pipeline([('scl',StandardScaler()),
                  ('pca',PCA(n_components=2)),
                  ('clf',svm.SVC(random_state=42))])




pipe_dt=Pipeline([('scl',StandardScaler()),
                 ('pca',PCA(n_components=2)),
                 ('clf',tree.DecisionTreeClassifier(random_state=42))])

pipe_adaboost=Pipeline([('scl',StandardScaler()),
                       ('pca',PCA(n_components=2)),
                       ('clf',AdaBoostClassifier())])

pipe_gradientboosting=Pipeline([('scl',StandardScaler()),
                       ('pca',PCA(n_components=2)),
                       ('clf',GradientBoostingClassifier())])

pipe_knn=Pipeline([('scl',StandardScaler()),
                  ('pca',PCA(n_components=2)),
                  ('clf',KNeighborsClassifier(n_neighbors=3))])
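
<p>Since every pipeline above shares the same scaling and PCA steps, the repetition could be factored out. The sketch below assumes a hypothetical helper, make_clf_pipeline, and shows only three of the classifiers; it is one possible arrangement, not the approach this notebook takes.</p>

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def make_clf_pipeline(clf):
    """Hypothetical helper: scale, project to 2 components, then classify."""
    return Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', clf)])


# Map a readable name to each classifier.
classifiers = {
    'LogisticRegression': LogisticRegression(random_state=42),
    'Decision tree': DecisionTreeClassifier(random_state=42),
    'KNearestNeighbors': KNeighborsClassifier(n_neighbors=3),
}

# One pipeline per classifier, keyed by the same name.
named_pipelines = {name: make_clf_pipeline(clf)
                   for name, clf in classifiers.items()}
```

<p>Keying the pipelines by name keeps each label attached to its pipeline, so no separate index-to-name lookup is needed.</p>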

In [20]:
#List of pipelines for ease of iteration
pipelines=[pipe_lr,pipe_svm,pipe_dt,pipe_adaboost,pipe_gradientboosting,pipe_knn]

In [21]:
#Dictionary of pipelines and classifier types for ease of reference
pipe_dict={0:'LogisticRegression',1:'Support Vector Machine',2:'Decision tree',3:'AdaBoostClassifier',4:'GradientBoosting',5:'KNearestNeighbors'}

In [22]:
#fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train,y_train)



In [23]:
#compare accuracies
for idx,val in enumerate(pipelines):
    print('%s pipeline test accuracy: %.3f' %(pipe_dict[idx],val.score(X_test,y_test)))

LogisticRegression pipeline test accuracy: 0.933
Support Vector Machine pipeline test accuracy: 0.900
Decision tree pipeline test accuracy: 0.867
AdaBoostClassifier pipeline test accuracy: 0.600
GradientBoosting pipeline test accuracy: 0.867
KNearestNeighbors pipeline test accuracy: 0.933
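
<p>With test_size=0.2 the iris test set holds only 30 samples, so a single misclassification moves the accuracy by about 0.033. As a hedged sketch, cross-validation on the training split may give a steadier comparison. The example uses cross_val_score with cv=5 on one pipeline; the fold count is an assumption, not something this notebook specifies.</p>

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=42))])

# The whole pipeline is refit inside every fold, so the scaler and PCA
# never see the held-out part of that fold.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))
```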


In [24]:
#identify the most accurate model on test data
best_acc=0.0
best_clf=0
best_pipe=None

for idx,val in enumerate(pipelines):
    acc=val.score(X_test,y_test)  #score each pipeline only once
    if acc>best_acc:  #strict > keeps the earliest pipeline on ties
        best_acc=acc
        best_pipe=val
        best_clf=idx
print('Classifier with best accuracy:%s'%pipe_dict[best_clf])

#save pipeline to file
joblib.dump(best_pipe,'best_pipeline.pkl',compress=1)
print('Saved %s pipeline to file'% pipe_dict[best_clf])

Classifier with best accuracy:LogisticRegression
Saved LogisticRegression pipeline to file
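
<p>A saved pipeline can later be restored with joblib.load and used for prediction without refitting. The sketch below is self-contained: it fits a small stand-in pipeline and writes it to a temporary directory. The file name demo_pipeline.pkl and the max_iter value are assumptions for the example, not the best_pipeline.pkl file produced above.</p>

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Stand-in for the fitted pipeline that would normally be saved.
pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.pkl')  # hypothetical file name
    joblib.dump(pipe, path, compress=1)
    reloaded = joblib.load(path)  # restores the fitted scaler and classifier

# The reloaded pipeline predicts exactly like the original.
same_predictions = bool((reloaded.predict(X) == pipe.predict(X)).all())
```

<p>Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, and only from trusted sources.</p>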
