# 1. Pipeline: Seamlessly combine preprocessing steps

Pipelines can be composed of two different things:

- **Transformer:** any object with the `fit()` and `transform()` methods. You can think of a transformer as an object that’s used for processing your data, and you will commonly have multiple transformers in your data preparation workflow. E.g., you might use one transformer to impute missing values, and another one to scale features or one-hot encode your categorical variables. `MinMaxScaler()`, `SimpleImputer()` and `OneHotEncoder()` are all examples of transformers.
- **Estimator:** In scikit-learn lingo, an “estimator” usually means a machine learning model; i.e. an object with the `fit()` and `predict()` methods. `LinearRegression()` and `RandomForestClassifier()` are examples of estimators.
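
To make the distinction concrete, here is a minimal sketch on made-up toy data (the arrays `X_toy` and `y_toy` are assumptions for illustration, not part of the diabetes dataset used below). It shows a transformer going through `fit()`/`transform()` and an estimator going through `fit()`/`predict()`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration): one feature, target y = 2x
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])

# Transformer: fit() learns the column min/max, transform() rescales the data
scaler = MinMaxScaler().fit(X_toy)
X_scaled = scaler.transform(X_toy)

# Estimator: fit() learns the coefficients, predict() generates predictions
model = LinearRegression().fit(X_scaled, y_toy)
predictions = model.predict(X_scaled)

print(X_scaled.ravel())
print(predictions)
```

Both objects share `fit()`; what differs is the second method, which is why a `Pipeline` can chain any number of transformers and optionally end with an estimator.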

In [1]:
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

In [2]:
# Load diabetes dataset into pandas DataFrames
X, y = load_diabetes(scaled=False, return_X_y=True, as_frame=True)

In [3]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [4]:
X_train.head()

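|     | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 322 | 55.0 | 2.0 | 32.1 | 112.67 | 207.0 | 92.4 | 25.0 | 8.28 | 6.1048 | 111.0 |
| 159 | 47.0 | 1.0 | 30.4 | 120.0 | 199.0 | 120.0 | 46.0 | 4.0 | 5.1059 | 87.0 |
| 318 | 73.0 | 1.0 | 27.0 | 102.0 | 211.0 | 121.0 | 67.0 | 3.0 | 4.7449 | 99.0 |
| 162 | 34.0 | 1.0 | 29.2 | 73.0 | 172.0 | 108.2 | 49.0 | 4.0 | 4.3041 | 91.0 |
| 115 | 40.0 | 2.0 | 26.5 | 93.0 | 236.0 | 147.0 | 37.0 | 7.0 | 5.5607 | 92.0 |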


In [5]:
y_train.head()

322    242.0
159    195.0
318    109.0
162    172.0
115    229.0
Name: target, dtype: float64

Next, we define our `Pipeline`. For now, I’ll just define a simple preprocessing `Pipeline` that includes two steps — impute missing values with the mean, and rescale all features — and I won’t include an estimator/model. The principles, however, are the same regardless of whether or not you include an estimator.

In [6]:
from sklearn import set_config
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

In [7]:
# Return pandas DataFrames instead of numpy arrays
set_config(transform_output="pandas")

In [8]:
# Build pipeline
pipe = Pipeline(
    steps=[("impute_mean", SimpleImputer(strategy="mean")), ("rescale", MinMaxScaler())]
)

Once we’ve defined our `Pipeline`, we “fit” it to our training dataset, and use it to transform both the training and testing datasets:

In [9]:
# Fit the pipeline to the training data
pipe.fit(X_train)

In [10]:
# Transform data using the fitted pipeline
X_train_transformed = pipe.transform(X_train)
X_test_transformed = pipe.transform(X_test)

In [11]:
X_train_transformed.head()

|     | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 322 | 0.6 | 1.0 | 0.582645 | 0.713662 | 0.539216 | 0.252988 | 0.038961 | 0.885755 | 0.999228 | 0.775862 |
| 159 | 0.466667 | 0.0 | 0.512397 | 0.816901 | 0.5 | 0.390438 | 0.311688 | 0.282087 | 0.648601 | 0.362069 |
| 318 | 0.9 | 0.0 | 0.371901 | 0.56338 | 0.558824 | 0.395418 | 0.584416 | 0.141044 | 0.521886 | 0.568966 |
| 162 | 0.25 | 0.0 | 0.46281 | 0.15493 | 0.367647 | 0.331673 | 0.350649 | 0.282087 | 0.367159 | 0.431034 |
| 115 | 0.35 | 1.0 | 0.35124 | 0.43662 | 0.681373 | 0.5249 | 0.194805 | 0.705219 | 0.808242 | 0.448276 |
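
As a rough illustration of what the `Pipeline` does under the hood, the sketch below compares it with fitting and applying each step by hand. The toy array `X_toy` and the name `toy_pipe` are assumptions for illustration only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy training data with one missing value (made up for illustration)
X_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

# Manual approach: fit and apply each step in turn
imputer = SimpleImputer(strategy="mean")
scaler = MinMaxScaler()
X_manual = scaler.fit_transform(imputer.fit_transform(X_toy))

# Pipeline approach: one object, one fit, one transform
toy_pipe = Pipeline(
    steps=[("impute_mean", SimpleImputer(strategy="mean")), ("rescale", MinMaxScaler())]
)
X_piped = toy_pipe.fit(X_toy).transform(X_toy)

# Fitted steps stay accessible by name, e.g. the column means learned by the imputer
learned_means = toy_pipe.named_steps["impute_mean"].statistics_

print(np.allclose(X_manual, X_piped))
print(learned_means)
```

The two approaches should give the same result; the pipeline simply keeps the steps, and everything they learned during `fit()`, together in one object.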


The advantage of using a Pipeline to handle these preprocessing steps is twofold:

- **Protect against leakage:** Because the preprocessor is fitted only to the training dataset `X_train`, no information about the test set is “leaked” when imputing missing values or rescaling features.
- **Avoid duplication:** If we didn’t use a `Pipeline` to handle these preprocessing steps, we’d end up transforming the `X_train` and `X_test` datasets multiple times (once for every preprocessing step). At this small scale, the repetition might not seem too bad. But in complex ML workflows you can easily grow to 5, 10, or even 20 preprocessing steps. Using a `Pipeline` makes this easy because we can add in as many steps as we like and still only have to transform `X_train` and `X_test` once.
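
The leakage point can be checked with a small sketch. The toy arrays below are made up for illustration; the idea is that a pipeline fitted on the training data reuses the training statistics when it transforms the test data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (made up for illustration)
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[np.nan], [20.0]])

toy_pipe = Pipeline(
    steps=[("impute_mean", SimpleImputer(strategy="mean")), ("rescale", MinMaxScaler())]
)

# fit() only sees the training data: mean=5, min=0, max=10
toy_pipe.fit(X_train_toy)

# The missing test value is filled with the training mean (5, scaled to 0.5),
# and 20 is scaled with the training range, so it lands at 2.0, outside [0, 1]
X_test_scaled = toy_pipe.transform(X_test_toy)
print(X_test_scaled.ravel())
```

A scaled test value outside the 0 to 1 range is expected here: it shows that the test set had no influence on the statistics used for imputing and rescaling.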

# 2. ColumnTransformer: Apply separate transformers to different feature subsets

A `ColumnTransformer` allows you to apply different transformers to different columns of an array or pandas DataFrame.

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder

In [13]:
# Categorical columns transformer - (a) impute NAs with the mode, and (b) one-hot encode
categorical_features = ["sex"]
categorical_transformer = Pipeline(
    steps=[
        ("impute_mode", SimpleImputer(strategy="most_frequent")),
        # handle_unknown="ignore" ensures that any values not encountered in the
        # training dataset are ignored (i.e. all ohe columns will be set to zero)
        (
            "ohe",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first"),
        ),
    ]
)

In [14]:
# Numerical columns transformer - (a) impute NAs with the mean, and (b) rescale
numerical_features = [
    "bp",
    "bmi",
    "s1",
    "s2",
    "s3",
    "s4",
    "s5",
    "s6",
]  # All except 'age' and 'sex'
numerical_transformer = Pipeline(
    steps=[("impute_mean", SimpleImputer(strategy="mean")), ("rescale", MinMaxScaler())]
)

In [15]:
# Combine the individual transformers into a single ColumnTransformer
preprocessor = ColumnTransformer(
    # Chain together the individual transformers
    transformers=[
        ("categorical_transformer", categorical_transformer, categorical_features),
        ("numerical_transformer", numerical_transformer, numerical_features),
    ],
    # By default, columns which are not transformed by the ColumnTransformer
    # will be dropped. By setting remainder='passthrough', we ensure that
    # these columns are retained, in their original form.
    remainder="passthrough",
    # Prefix feature names with the name of the transformer that generated them (optional)
    verbose_feature_names_out=True,
)

In [16]:
# Fit the preprocessor to the training data
preprocessor.fit(X_train)

To apply the `ColumnTransformer` to our data, we use the same code as we did to apply our first `Pipeline`:

In [17]:
# Transform data using the fitted preprocessor
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

In [18]:
X_train_transformed.head()

|     | categorical_transformer__sex_2.0 | numerical_transformer__bp | numerical_transformer__bmi | numerical_transformer__s1 | numerical_transformer__s2 | numerical_transformer__s3 | numerical_transformer__s4 | numerical_transformer__s5 | numerical_transformer__s6 | remainder__age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 322 | 1.0 | 0.713662 | 0.582645 | 0.539216 | 0.252988 | 0.038961 | 0.885755 | 0.999228 | 0.775862 | 55.0 |
| 159 | 0.0 | 0.816901 | 0.512397 | 0.5 | 0.390438 | 0.311688 | 0.282087 | 0.648601 | 0.362069 | 47.0 |
| 318 | 0.0 | 0.56338 | 0.371901 | 0.558824 | 0.395418 | 0.584416 | 0.141044 | 0.521886 | 0.568966 | 73.0 |
| 162 | 0.0 | 0.15493 | 0.46281 | 0.367647 | 0.331673 | 0.350649 | 0.282087 | 0.367159 | 0.431034 | 34.0 |
| 115 | 1.0 | 0.43662 | 0.35124 | 0.681373 | 0.5249 | 0.194805 | 0.705219 | 0.808242 | 0.448276 | 40.0 |


# 3. FeatureUnion: Apply multiple transformers in parallel

`Pipeline` and `ColumnTransformer` are awesome tools, but the pipelines we have built so far apply their steps *sequentially*: each transformer receives the output of the previous one.

If we want to apply multiple transformations to the same underlying features in parallel, we need to use another tool: `FeatureUnion`.

We can think of `FeatureUnion` as a tool that creates a “copy” of your underlying data, applies transformers to those copies in parallel, and then stitches the results together. Each transformer is passed the raw, underlying data, so we don’t experience the problem of sequential transformation.
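
The "copy, transform, stitch" idea can be sketched on made-up data as follows. The random array `X_toy` and the fixed `random_state` are assumptions added so that the comparison is reproducible:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import FeatureUnion

# Toy data (made up for illustration): 20 rows, 4 features
rng = np.random.default_rng(0)
X_toy = rng.random((20, 4))

toy_union = FeatureUnion(
    transformer_list=[
        ("pca", PCA(n_components=1)),
        ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ]
)
X_union = toy_union.fit_transform(X_toy)

# Manual equivalent: run each transformer on the same input, then stack the
# results side by side
X_pca = PCA(n_components=1).fit_transform(X_toy)
X_svd = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_toy)
X_manual = np.hstack([X_pca, X_svd])

print(X_union.shape)
print(np.allclose(X_union, X_manual))
```

The union output has one column from `PCA` and two from `TruncatedSVD`, and neither transformer ever sees the other's output.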

In [19]:
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import FeatureUnion

In [20]:
# Define a feature_union object which will create reduced-dimensionality features
union = FeatureUnion(
    transformer_list=[
        ("pca", PCA(n_components=1)),
        ("svd", TruncatedSVD(n_components=2)),
    ]
)

In [21]:
# Adapt the numerical transformer so that it includes the FeatureUnion
numerical_features = [
    "bp",
    "bmi",
    "s1",
    "s2",
    "s3",
    "s4",
    "s5",
    "s6",
]  # All except 'age' and 'sex'
numerical_transformer = Pipeline(
    steps=[
        ("impute_mean", SimpleImputer(strategy="mean")),
        ("rescale", MinMaxScaler()),
        ("reduce_dimensionality", union),
    ]
)

In [22]:
# Categorical columns transformer - same as above
categorical_features = ["sex"]
categorical_transformer = Pipeline(
    steps=[
        ("impute_mode", SimpleImputer(strategy="most_frequent")),
        (
            "ohe",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first"),
        ),  # handle_unknown='ignore' ensures that any values not encountered in the training dataset are ignored (i.e. all ohe columns will be set to zero)
    ]
)

In [23]:
# Build the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("categorical_transformer", categorical_transformer, categorical_features),
        ("numerical_transformer", numerical_transformer, numerical_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=True,
)

In [24]:
# Fit the preprocessor to the training data
preprocessor.fit(X_train)

In the diagram that scikit-learn displays for the fitted `preprocessor` (the output of the cell above when run in a notebook), we can see that the `FeatureUnion` steps are applied in parallel, rather than sequentially. Just like before, we fit the `preprocessor` to our training data and then use it to transform any dataset we want to use for modelling/prediction.

In [25]:
# Transform data using the fitted preprocessor
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

In [26]:
X_train_transformed.head()

|     | categorical_transformer__sex_2.0 | numerical_transformer__pca__pca0 | numerical_transformer__svd__truncatedsvd0 | numerical_transformer__svd__truncatedsvd1 | remainder__age |
| --- | --- | --- | --- | --- | --- |
| 322 | 1.0 | 0.855223 | 1.737902 | -0.635456 | 55.0 |
| 159 | 0.0 | 0.222184 | 1.38699 | -0.000252 | 47.0 |
| 318 | 0.0 | 0.007594 | 1.333332 | 0.300462 | 73.0 |
| 162 | 0.0 | -0.144528 | 0.957259 | 0.008773 | 34.0 |
| 115 | 1.0 | 0.48677 | 1.487209 | -0.337433 | 40.0 |


# 4. FunctionTransformer: Seamlessly integrate feature engineering

All of the transformers and tools discussed above use pre-built classes in scikit-learn to apply standard transformations to your data (e.g., scaling, one-hot encoding, imputing, etc.).

If you want to apply a custom function — for example during feature engineering — then you’ll want to use `FunctionTransformer`. Personally, I love this class - it makes it super easy to integrate custom functions into your `Pipeline` without having to write new transformer classes from scratch.

Creating a `FunctionTransformer` is really simple. You write your transformations as ordinary Python functions, wrap each one in a `FunctionTransformer`, and add them to a pipeline. Here, I define two simple functions: one that adds together two columns, and another that subtracts two columns.

In [27]:
from sklearn.preprocessing import FunctionTransformer

In [28]:
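def add_features(X):
    X = X.copy()  # Work on a copy so the input DataFrame is not modified
    X["feature_1_2"] = X["feature_1"] + X["feature_2"]
    return X


def subtract_features(X):
    X = X.copy()  # Work on a copy so the input DataFrame is not modified
    X["feature_3_4"] = X["feature_3"] - X["feature_4"]
    return X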

In [29]:
# Put into a pipeline
feature_engineering = Pipeline(
    steps=[
        ("add_features", FunctionTransformer(add_features)),
        ("subtract_features", FunctionTransformer(subtract_features)),
    ]
)
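
The column names `feature_1` to `feature_4` are placeholders and do not exist in the diabetes data, so here is a self-contained sketch that runs the same idea on a made-up DataFrame. The `toy` data and the name `toy_feature_engineering` are assumptions for illustration, and the two functions are repeated so the snippet runs on its own:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_features(X):
    X = X.copy()  # Work on a copy so the caller's DataFrame is not modified
    X["feature_1_2"] = X["feature_1"] + X["feature_2"]
    return X


def subtract_features(X):
    X = X.copy()  # Work on a copy so the caller's DataFrame is not modified
    X["feature_3_4"] = X["feature_3"] - X["feature_4"]
    return X


# Toy data (made up for illustration)
toy = pd.DataFrame(
    {
        "feature_1": [1.0, 2.0],
        "feature_2": [10.0, 20.0],
        "feature_3": [5.0, 6.0],
        "feature_4": [1.0, 3.0],
    }
)

toy_feature_engineering = Pipeline(
    steps=[
        ("add_features", FunctionTransformer(add_features)),
        ("subtract_features", FunctionTransformer(subtract_features)),
    ]
)
toy_out = toy_feature_engineering.fit_transform(toy)
print(toy_out)
```

Copying `X` inside each function is a design choice rather than a requirement; it keeps the original DataFrame unchanged, which makes the pipeline safer to rerun.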

To simplify things even further, you could include multiple transformations within the same function:

In [30]:
def add_subtract_features(X):
    X = X.copy()  # Work on a copy so the input DataFrame is not modified
    # Columns are selected by position, so this works with whatever column
    # names the preprocessor generates
    X["feature_1_2"] = X[X.columns[0]] + X[X.columns[1]]  # Add features
    X["feature_3_4"] = X[X.columns[2]] - X[X.columns[3]]  # Subtract features
    return X


# Put into a pipeline
feature_engineering = Pipeline(
    steps=[
        ("add_subtract_features", FunctionTransformer(add_subtract_features)),
    ]
)

Finally, combine the `feature_engineering` pipeline with the `preprocessor` we defined earlier:

In [31]:
# Combine preprocessing and feature engineering in a single pipeline
pipe = Pipeline(
    [
        ("preprocessing", preprocessor),
        ("feature_engineering", feature_engineering),
    ]
)

And use this new pipeline to apply the same preprocessing/feature engineering steps to all your datasets:

In [32]:
# Fit the combined pipeline to the training data
pipe.fit(X_train)

In [33]:
# Transform data using the fitted pipeline
X_train_transformed = pipe.transform(X_train)
X_test_transformed = pipe.transform(X_test)

In [34]:
X_train_transformed.head()

|     | categorical_transformer__sex_2.0 | numerical_transformer__pca__pca0 | numerical_transformer__svd__truncatedsvd0 | numerical_transformer__svd__truncatedsvd1 | remainder__age | feature_1_2 | feature_3_4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 322 | 1.0 | 0.855223 | 1.737902 | -0.635456 | 55.0 | 1.855223 | 2.373358 |
| 159 | 0.0 | 0.222184 | 1.38699 | -0.000252 | 47.0 | 0.222184 | 1.387242 |
| 318 | 0.0 | 0.007594 | 1.333332 | 0.300462 | 73.0 | 0.007594 | 1.032869 |
| 162 | 0.0 | -0.144528 | 0.957259 | 0.008773 | 34.0 | -0.144528 | 0.948486 |
| 115 | 1.0 | 0.48677 | 1.487209 | -0.337433 | 40.0 | 1.48677 | 1.824642 |


# Bonus: Save your pipelines for truly reproducible workflows

In enterprise applications of machine learning, it’s very rare to only use a model or preprocessing workflow once. More often, you’ll be required to regularly rerun your model each week/month and generate new predictions for new data.

In these situations, rather than writing a new preprocessing pipeline from scratch each time, you can reuse the same one. Once you’ve developed your pipeline, use the `joblib` library to save it so that you can rerun the exact same transformations on future datasets:

In [35]:
import joblib

In [36]:
# Save pipeline
joblib.dump(pipe, "pipe.pkl")

['pipe.pkl']

In [37]:
# Assume that the below steps are applied in another notebook/script

# Load pipeline
pretrained_pipe = joblib.load("pipe.pkl")

In [38]:
X_test_new = X_test.copy()  # Stand-in for a new, unseen dataset

In [39]:
# Apply pipeline to a new dataset, X_test_new
X_test_new_transformed = pretrained_pipe.transform(X_test_new)

In [40]:
X_test_new_transformed.head()

|     | categorical_transformer__sex_2.0 | numerical_transformer__pca__pca0 | numerical_transformer__svd__truncatedsvd0 | numerical_transformer__svd__truncatedsvd1 | remainder__age | feature_1_2 | feature_3_4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 353 | 0.0 | 0.252193 | 1.406919 | -0.037561 | 34.0 | 0.252193 | 1.44448 |
| 48 | 1.0 | -0.282313 | 0.989898 | 0.321357 | 67.0 | 0.717687 | 0.668541 |
| 77 | 0.0 | -0.417296 | 0.748638 | 0.199915 | 22.0 | -0.417296 | 0.548723 |
| 274 | 0.0 | 0.051338 | 1.193382 | 0.021853 | 53.0 | 0.051338 | 1.171528 |
| 365 | 0.0 | 0.155508 | 1.298551 | -0.004799 | 58.0 | 0.155508 | 1.30335 |
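
A quick way to gain confidence in a saved pipeline is a round-trip check: save it, load it, and confirm that both copies produce the same output. The sketch below does this on made-up data, writing to a temporary directory; the names `toy_pipe` and `toy_pipe.pkl` are assumptions for illustration:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration)
X_toy = np.array([[1.0], [np.nan], [3.0]])

toy_pipe = Pipeline(
    steps=[("impute_mean", SimpleImputer(strategy="mean")), ("rescale", MinMaxScaler())]
).fit(X_toy)

# Save to a temporary directory and load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipe.pkl")
    joblib.dump(toy_pipe, path)
    reloaded_pipe = joblib.load(path)

# The reloaded pipeline should reproduce the original transformation
same_output = np.allclose(toy_pipe.transform(X_toy), reloaded_pipe.transform(X_toy))
print(same_output)
```

One caveat worth knowing: pickling stores custom Python functions by reference rather than by value, so a pipeline that contains a `FunctionTransformer` wrapping your own function (such as `add_subtract_features`) can only be loaded in a script where that function is defined or importable.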


# Conclusion

- `Pipeline` provides a quick way to sequentially apply different preprocessing transformers to your data
- Using a `ColumnTransformer` is a fantastic way to apply separate preprocessing steps to different feature subsets
- `FeatureUnion` enables you to apply different preprocessing transformations in parallel
- `FunctionTransformer` provides a super-simple way to write custom feature engineering functions and integrate them within your pipelines