# scikit-learn Pipelines



# Why Pipelines?

* Modularize code
    * More maintainable
    * Easier to put into production
* Prevent information leakage
    * If you preprocess the entire dataset (including the test data) in one pass up front, information from the test set leaks into your model. For example, if you fit a `StandardScaler` on the full dataset to rescale a feature to zero mean and unit variance, the mean and standard deviation it learns reflect the test rows as well, so that information has leaked into your modeling procedure.
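The leakage described above can be made concrete with a small sketch. The data here is synthetic and exists only for illustration (an assumption, not part of the lesson); the point is that the scaler's learned statistics differ depending on whether the test rows were visible when it was fit.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only (not the lesson's dataset)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# Leaky: the mean and standard deviation are computed from every row,
# including the rows that later serve as the test set
leaky_scaler = StandardScaler().fit(X)

# Safe: the statistics come from the training rows only,
# and the same fitted scaler is then applied to the test rows
safe_scaler = StandardScaler().fit(X_train)
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)

print("Mean learned from all rows:     ", leaky_scaler.mean_)
print("Mean learned from training rows:", safe_scaler.mean_)
```

A pipeline applies the "safe" pattern automatically: calling `fit` on it fits each preprocessing step on whatever data is passed in, and `predict` or `score` only ever call `transform` on new data.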


# Adapting a Previous Example

Take a look at the modeling process shown below, which was taken from this [lesson on PCA](https://github.com/learn-co-curriculum/dsc-pca-and-digital-image-processing). Then, fold the various steps outlined in the code into a single pipeline object and then fit said pipeline to the dataset.

In [6]:
from sklearn.datasets import fetch_olivetti_faces
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# Load the Olivetti faces dataset (downloaded on first use)
data = fetch_olivetti_faces()

X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# Fit PCA on the training data only, then apply it to both sets
pca = PCA(n_components=100, whiten=True)
X_pca_train = pca.fit_transform(X_train)
X_pca_test = pca.transform(X_test)

# Fit the classifier on the PCA-transformed training data
clf = svm.SVC(C=5, gamma=0.05)
clf.fit(X_pca_train, y_train)

# Evaluate on both sets
train_acc = clf.score(X_pca_train, y_train)
test_acc = clf.score(X_pca_test, y_test)
print('Training Accuracy: {}\tTesting Accuracy: {}'.format(train_acc, test_acc))

Training Accuracy: 1.0	Testing Accuracy: 0.37


Now, adapt the code above to create a single pipeline object which can be applied succinctly. Print out the train and test accuracy for your pipeline model.

In [1]:
from sklearn.pipeline import Pipeline
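One way the PCA and SVC steps might be folded into a single estimator is sketched here. To keep the block self-contained and runnable offline, it uses scikit-learn's bundled digits dataset as a stand-in for the Olivetti faces; that substitution, the helper name `make_pca_svc_pipeline`, and the step names `"pca"` and `"svc"` are all assumptions made for illustration. The helper's defaults mirror the values from the lesson's code.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def make_pca_svc_pipeline(n_components=100, C=5, gamma=0.05):
    """Hypothetical helper: chain whitened PCA and an SVC into one estimator."""
    return Pipeline([
        ("pca", PCA(n_components=n_components, whiten=True)),
        ("svc", SVC(C=C, gamma=gamma)),
    ])


# Stand-in data (assumption): the bundled digits set, not the Olivetti faces
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# digits has only 64 features, so n_components must be below the lesson's 100
pipe = make_pca_svc_pipeline(n_components=30)

# fit() runs pca.fit_transform on X_train, then fits the SVC on the result
pipe.fit(X_train, y_train)

# score() runs pca.transform on the input, then scores the SVC
train_acc = pipe.score(X_train, y_train)
test_acc = pipe.score(X_test, y_test)
print('Training Accuracy: {}\tTesting Accuracy: {}'.format(train_acc, test_acc))
```

With the Olivetti split from the earlier cell, the same pattern would be `make_pca_svc_pipeline().fit(X_train, y_train)`, with no separate `X_pca_train` or `X_pca_test` variables to manage.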

Great! However, as you may have noticed, the testing accuracy of this model is poor. Update your pipeline to include a grid search for optimal parameters, adapting the code shown below, which searches over the SVC alone on the PCA-transformed data.

> **Hint**: The pipeline itself is passed into `GridSearchCV` as the estimator to tune.

In [14]:
import numpy as np
from sklearn.model_selection import GridSearchCV

# Search over the SVC hyperparameters with 5-fold cross-validation
clf = svm.SVC()
param_grid = {"C": np.linspace(.1, 10, num=11),
              "gamma": np.linspace(10**-3, 5, num=11)}
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_pca_train, y_train)

# Score the refit best estimator on the PCA-transformed data
train_acc = grid_search.best_estimator_.score(X_pca_train, y_train)
test_acc = grid_search.best_estimator_.score(X_pca_test, y_test)

print("Best parameters found: ", grid_search.best_params_)
print('Training Accuracy: {}\tTesting Accuracy: {}'.format(train_acc, test_acc))

Best parameters found:  {'C': 6.039999999999999, 'gamma': 0.001}
Training Accuracy: 1.0	Testing Accuracy: 0.96
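Following the hint, a pipeline can be handed to `GridSearchCV` in place of the bare classifier. The sketch below is one possible shape for that, under several assumptions: it uses the bundled digits dataset as an offline stand-in for the Olivetti faces, the step names `"pca"` and `"svc"` are chosen for illustration, `n_components` is lowered to 30 because digits has only 64 features, and the grid is cut to 3 values per parameter so the example runs quickly. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data (assumption): the bundled digits set, not the Olivetti faces
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

pipe = Pipeline([
    ("pca", PCA(n_components=30, whiten=True)),
    ("svc", SVC()),
])

# Keys follow the "<step name>__<parameter>" convention.
# The grid is reduced to 3 x 3 here purely to keep the sketch fast.
param_grid = {
    "svc__C": np.linspace(0.1, 10, num=3),
    "svc__gamma": np.linspace(10**-3, 5, num=3),
}

# The whole pipeline is the estimator, so PCA is refit inside every CV fold
grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# With the default refit=True, score() uses the best pipeline refit on all of X_train
train_acc = grid_search.score(X_train, y_train)
test_acc = grid_search.score(X_test, y_test)

print("Best parameters found: ", grid_search.best_params_)
print('Training Accuracy: {}\tTesting Accuracy: {}'.format(train_acc, test_acc))
```

Searching over the pipeline rather than over pre-transformed arrays means each validation fold is transformed by a PCA that never saw it, which keeps the cross-validation scores free of the leakage discussed at the start of this lesson.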
