### Pipeline
- Usually used to chain a final Estimator (Imputer, Scaler, Regressor, Classifier) with preceding Transformers. The result is called a Composite Estimator
- Takes a list of ('name', 'estimator') tuples
- Exposes the same methods as the final Estimator
- All but the last Estimator must be Transformers; the last Estimator can be any Estimator, including a Transformer
- The Pipeline's .fit() method calls .fit_transform() on every Estimator except the last and .fit() on the last one
- The Pipeline's .fit_transform() method calls .fit_transform() on all Estimators, so the last Estimator must also be a Transformer
- Calling .predict() on a Pipeline calls .transform() on every Estimator except the last and .predict() on the last one. The same goes for .score()
- The Hyperparameters of all Estimators contained in a Pipeline can be optimized together by GridSearchCV
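
As a minimal sketch of the last point: the snippet below tunes a scaler and a classifier together. The synthetic data, the step names ("scaler", "classifier") and the grid values are assumptions chosen only for illustration. Each parameter is addressed as `<step name>__<parameter name>`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# Keys follow the pattern "<step name>__<parameter name>".
param_grid = {
    "scaler__with_mean": [True, False],
    "classifier__C": [0.1, 1.0, 10.0],
}

# Every combination is fitted on the whole pipeline, so the
# preprocessing is re-fitted inside each cross-validation split.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

Because the search wraps the whole Pipeline, the fitted `search` object can be used like the Pipeline itself, e.g. `search.predict(X)`.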

### ColumnTransformer
- Combines (only) Transformers and applies each one to the given columns
- Takes a list of ('name', 'transformer', 'list of column names') tuples
- On .transform() or .fit_transform(), it applies the transformers to the given columns and concatenates the results side by side. Columns not assigned to any transformer are dropped by default
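
A small sketch of this behaviour, assuming a made-up three-row DataFrame (the column names and values are illustrative only). One numeric column is scaled, one categorical column is one-hot encoded, and the unlisted column does not appear in the output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, used only for illustration.
df = pd.DataFrame(
    {
        "age": [20.0, 30.0, 40.0],
        "sex": ["male", "female", "female"],
        "ticket": ["A1", "B2", "C3"],  # not listed below, so it is dropped
    }
)

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),  # yields 1 column
        ("cat", OneHotEncoder(), ["sex"]),   # yields 2 columns (female, male)
    ]
)

out = ct.fit_transform(df)

# 1 scaled column + 2 one-hot columns, placed side by side.
print(out.shape)  # (3, 3)
```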

### Custom example

In [1]:
from sklearn.pipeline import Pipeline

In [2]:
class MyEstimator():
    def fit(self, features, labels):
        print("Fit method of MyEstimator")
        return self  # scikit-learn convention: fit() returns the estimator


class MyTransformer():
    def fit(self, features, labels):
        print("Fit method of MyTransformer")
        return self

    def transform(self, features):
        print("Transform method of MyTransformer")
        return features  # pass the data through unchanged

    def fit_transform(self, features, labels):
        print("Fit_Transform method of MyTransformer")
        self.fit(features, labels)
        # Return the transformed data so the next step receives it
        return self.transform(features)


class MyPredictor():
    def fit(self, features, labels):
        print("Fit method of MyPredictor")
        return self

    def predict(self, features):
        print("Predict method of MyPredictor")

    def score(self, features, labels):
        print("Score method of MyPredictor")

In [3]:
te_pipe = Pipeline(
    steps=[
        ("transformer", MyTransformer()), 
        ("estimator", MyEstimator())
    ]
)

te_pipe.fit(X="Features")

Fit_Transform method of MyTransformer
Fit method of MyTransformer
Transform method of MyTransformer
Fit method of MyEstimator


In [4]:
tt_pipe = Pipeline(
    steps=[
        ("transformer_1", MyTransformer()), 
        ("transformer_2", MyTransformer())
    ]
)

tt_pipe.fit(X="Features")

Fit_Transform method of MyTransformer
Fit method of MyTransformer
Transform method of MyTransformer
Fit method of MyTransformer


In [5]:
tp_pipe = Pipeline(
    steps=[
        ("transformer", MyTransformer()), 
        ("predictor", MyPredictor())
    ]
)

print("=== FIT ===")
tp_pipe.fit(X="Features")
print("=== PREDICT ===")
tp_pipe.predict(X="Features")
print("=== SCORE ===")
tp_pipe.score("Features", "Labels")

=== FIT ===
Fit_Transform method of MyTransformer
Fit method of MyTransformer
Transform method of MyTransformer
Fit method of MyPredictor
=== PREDICT ===
Transform method of MyTransformer
Predict method of MyPredictor
=== SCORE ===
Transform method of MyTransformer
Score method of MyPredictor
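
The classes above implement every method by hand to make the call order visible. As a hedged alternative sketch, a custom Transformer can instead derive from scikit-learn's `BaseEstimator` and `TransformerMixin`; the mixin then supplies `.fit_transform()` from `.fit()` and `.transform()`. The `CenteringTransformer` class and the toy data below are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class CenteringTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self  # returning self allows fit(...).transform(...)

    def transform(self, X):
        return np.asarray(X) - self.mean_


# Toy data with the exact relationship y = 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

pipe = Pipeline(
    steps=[
        ("center", CenteringTransformer()),
        ("regressor", LinearRegression()),
    ]
)

# fit(): fit_transform() on "center" (provided by TransformerMixin),
# then fit() on "regressor".
pipe.fit(X, y)

# predict(): transform() on "center", then predict() on "regressor".
prediction = pipe.predict(np.array([[5.0]]))
print(prediction)
```

Since the data follows y = 2 * x exactly, the prediction for x = 5 should be close to 10.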


### Tutorial example

In [6]:
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

In [7]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

In [8]:
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")), 
        ("scaler", StandardScaler())
    ]
)

In [9]:
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

In [10]:
numeric_features = ["age", "fare"]
categorical_features = ["embarked", "sex", "pclass"]
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

In [11]:
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [13]:
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.790


