# General Pipeline Interface

The Pipeline class is not restricted to preprocessing and classification. It can chain any sequence of estimators, for example a pipeline made up only of preprocessing or scaling steps.

A pipeline has the following requirements:

* Every step except the last must have a __transform__ method, so that it can produce a new representation of the data for the next step. The last step only needs a __fit__ method.
* Internally, during __fit__, each step except the last is fitted and then used to transform the data, and the transformed output of one step becomes the input of the next. For the last step, only __fit__ is called.
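
The fitting procedure described in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not scikit-learn's actual implementation: `fit_pipeline` and `predict_pipeline` are hypothetical helper names introduced only for this example, and the breast cancer dataset is used just to have some data to fit.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def fit_pipeline(steps, X, y):
    """Hypothetical helper: fit a list of (name, estimator) steps in order."""
    X_transformed = X
    # every step except the last is fitted, then used to transform the data
    for name, estimator in steps[:-1]:
        X_transformed = estimator.fit(X_transformed, y).transform(X_transformed)
    # the last step is only fitted, on the output of the previous step
    steps[-1][1].fit(X_transformed, y)
    return steps


def predict_pipeline(steps, X):
    """Hypothetical helper: transform with all but the last step, then predict."""
    X_transformed = X
    for name, estimator in steps[:-1]:
        X_transformed = estimator.transform(X_transformed)
    return steps[-1][1].predict(X_transformed)


X, y = load_breast_cancer(return_X_y=True)
manual_steps = fit_pipeline([("scaler", MinMaxScaler()), ("svc", SVC())], X, y)
manual_pred = predict_pipeline(manual_steps, X)

# compare against the real Pipeline built from the same steps
pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())]).fit(X, y)
same = (manual_pred == pipe.predict(X)).all()
print("manual sketch matches Pipeline:", same)
```

The sketch also shows why the last step does not need a transform method: its output is never passed on to another step.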

## Convenient Pipeline Creation using make_pipeline

In [3]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# standard pipeline
pipe_long = Pipeline([('scaler', MinMaxScaler()), ('svc', SVC())])
# short pipeline: make_pipeline names each step automatically,
# using the lowercased class name
pipe_short = make_pipeline(MinMaxScaler(), SVC())
# if several steps share the same class, a number is appended to each name
pipe_short_1 = make_pipeline(MinMaxScaler(), MinMaxScaler(), SVC())

print("pipe long steps\n {}".format(pipe_long.steps))
print("pipe short pipeline\n {}".format(pipe_short.steps))
print("pipe short 1 pipeline\n {}".format(pipe_short_1.steps))

pipe long steps
 [('scaler', MinMaxScaler()), ('svc', SVC())]
pipe short pipeline
 [('minmaxscaler', MinMaxScaler()), ('svc', SVC())]
pipe short 1 pipeline
 [('minmaxscaler-1', MinMaxScaler()), ('minmaxscaler-2', MinMaxScaler()), ('svc', SVC())]


## Accessing Steps Attributes

Say you want to extract the principal components from a PCA step. You can reach the fitted step, and therefore its attributes, through the pipeline's __named\_steps__ attribute.

In [8]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
 
# create pipe with pca
pipe = Pipeline([('standard-scaler1', StandardScaler()),('pca', PCA(n_components=2))])

pipe.fit(cancer.data)
components = pipe.named_steps["pca"].components_
print("PCA components {}".format(components.shape))

PCA components (2, 30)
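
Besides `named_steps`, a fitted step can be reached in a few other ways. The sketch below assumes a reasonably recent scikit-learn release in which `Pipeline` supports indexing by step name or position. The `steps` list works regardless. The variable names are chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
pipe = Pipeline([("standard-scaler1", StandardScaler()), ("pca", PCA(n_components=2))])
pipe.fit(cancer.data)

# all of these refer to the same fitted PCA object
pca_by_name = pipe.named_steps["pca"]   # dictionary-style access by step name
pca_by_key = pipe["pca"]                # indexing the pipeline by step name
pca_by_position = pipe[-1]              # indexing the pipeline by position
pca_from_steps = pipe.steps[-1][1]      # steps is a list of (name, estimator) tuples

print(pca_by_name is pca_by_key is pca_by_position is pca_from_steps)
print(pca_by_name.components_.shape)
```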


## Accessing Steps Attributes in GridSearchCV

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), LogisticRegression())

param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100]}

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("print best estimator\n {}".format(grid.best_estimator_))
print("LogisticRegression step\n {}".format(grid.best_estimator_.named_steps['logisticregression']))
# get the coefficients
print("Logreg coefs\n {}".format(grid.best_estimator_.named_steps['logisticregression'].coef_))


print best estimator
 Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(C=1))])
LogisticRegression step
 LogisticRegression(C=1)
Logreg coefs
 [[-0.29792942 -0.58056355 -0.3109406  -0.377129   -0.11984232  0.42855478
  -0.71131106 -0.85371164 -0.46688191  0.11762548 -1.38262136  0.0899184
  -0.94778563 -0.94686238  0.18575731  0.99305313  0.11090349 -0.3458275
   0.20290919  0.80470317 -0.91626377 -0.91726667 -0.8159834  -0.86539197
  -0.45539191  0.10347391 -0.83009341 -0.98445173 -0.5920036  -0.61086989]]
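
A raw coefficient array is hard to read on its own. As a possible follow-up, the sketch below repeats the grid search and pairs each coefficient with its feature name from `cancer.feature_names`. It also prints `best_params_`, whose keys follow the same `<step name>__<parameter>` pattern as `param_grid`. Showing only the five largest coefficients is an arbitrary choice made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

# best_params_ is keyed by <step name>__<parameter>, like param_grid
print(grid.best_params_)

# flatten the (1, n_features) coefficient array of the fitted logistic regression
coefs = grid.best_estimator_.named_steps["logisticregression"].coef_.ravel()

# sort feature indices by absolute coefficient size, largest first
order = np.argsort(np.abs(coefs))[::-1]
for idx in order[:5]:
    print("{:<25} {: .3f}".format(cancer.feature_names[idx], coefs[idx]))
```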
