## Machine Learning Pipelines
Building machine learning models is not only about choosing the right algorithm and tuning its hyperparameters. A significant amount of time is spent wrangling data and engineering features before model experimentation begins. These preprocessing steps can easily overwhelm your workflow and become hard to track. Shifting the focus from the machine learning (ML) model to the ML pipeline, and treating the preprocessing steps as an integral part of building a model, can help keep your workflow more organised. In this post, we will first look at a wrong way to preprocess data for a model, then learn a correct approach, followed by two ways to build an ML pipeline.

The term ML pipeline has many definitions depending on the context. In this post, an ML pipeline is defined as a collection of preprocessing steps and a model. This means that when raw data is passed to the ML pipeline, it preprocesses the data into the right format, scores the data using the model and outputs a prediction score.

### Setup
Let’s import the libraries and load a sample dataset: a subset of the Titanic dataset (the data is available through Seaborn under the BSD-3 licence).

In [1]:
# Data manipulation
from seaborn import load_dataset
import numpy as np
import pandas as pd
pd.options.display.precision = 4
pd.options.mode.chained_assignment = None  

# Machine learning pipeline
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn import set_config
set_config(display="diagram")

# Load data
columns = ['alive', 'class', 'embarked', 'who', 'alone', 'adult_male']
df = load_dataset('titanic').drop(columns=columns)
df['deck'] = df['deck'].astype('object')
print(df.shape)
df.head()

(891, 9)


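| | survived | pclass | sex | age | sibsp | parch | fare | deck | embark_town |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | Southampton |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cherbourg |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | Southampton |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C | Southampton |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | Southampton |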


We will now define commonly used variables for easy reference later on:

In [2]:
SEED = 42
TARGET = 'survived'
FEATURES = df.columns.drop(TARGET)

NUMERICAL = df[FEATURES].select_dtypes('number').columns
print(f"Numerical features: {', '.join(NUMERICAL)}")

CATEGORICAL = pd.Index(np.setdiff1d(FEATURES, NUMERICAL))
print(f"Categorical features: {', '.join(CATEGORICAL)}")

Numerical features: pclass, age, sibsp, parch, fare
Categorical features: deck, embark_town, sex


### Wrong Approach
It’s not uncommon to use pandas methods like this when preprocessing:

In [3]:
# Impute numerical variables with mean
df_num_imputed = df[NUMERICAL].fillna(df[NUMERICAL].mean())
# Normalise numerical variables
df_num_scaled = df_num_imputed.subtract(df_num_imputed.min(), axis=1)\
                              .divide(df_num_imputed.max()-df_num_imputed.min(), axis=1)

# Impute categorical variables with a constant
df_cat_imputed = df[CATEGORICAL].fillna('missing')
# One-hot-encode categorical variables
df_cat_encoded = pd.get_dummies(df_cat_imputed, drop_first=True)

# Merge data
df_preprocessed = df_num_scaled.join(df_cat_encoded)
df_preprocessed.head()

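| | pclass | age | sibsp | parch | fare | deck_B | deck_C | deck_D | deck_E | deck_F | deck_G | deck_missing | embark_town_Queenstown | embark_town_Southampton | embark_town_missing | sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.2712 | 0.125 | 0.0 | 0.0142 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0.0 | 0.4722 | 0.125 | 0.0 | 0.1391 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1.0 | 0.3214 | 0.0 | 0.0 | 0.0155 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0.0 | 0.4345 | 0.125 | 0.0 | 0.1036 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1.0 | 0.4345 | 0.0 | 0.0 | 0.0157 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |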


We imputed missing values, scaled numerical variables between 0 and 1 and one-hot-encoded categorical variables. After preprocessing, the data is partitioned and a model is fitted:

In [4]:
# Partition data
X_train, X_test, y_train, y_test = train_test_split(df_preprocessed, df[TARGET], 
                                                    test_size=.2, random_state=SEED, 
                                                    stratify=df[TARGET])

# Train a model
model = LogisticRegression()
model.fit(X_train, y_train)

Okay, let’s analyse what was wrong with this approach:

◼️ Imputation: Numerical variables should be imputed with a mean from the training data instead of the entire data.

◼️ Scaling: Min and max should be calculated from the training data.

◼️ Encoding: Categories should be inferred from the training data. In addition, even if the data is partitioned prior to preprocessing, one-hot-encoding with pd.get_dummies(X_train) and pd.get_dummies(X_test) can result in inconsistent training and test data (i.e. the columns may differ depending on which categories are present in each dataset). Therefore, pd.get_dummies() should not be used for one-hot-encoding when preparing data for a model.

💡 Test data should be set aside prior to preprocessing. Any statistics such as mean, min and max used for preprocessing should be derived from the training data. Otherwise, there will be a data leakage problem.
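
To make the leakage point concrete, here is a small sketch on hypothetical toy data (a single made-up fare column, not the actual dataset). It compares a mean computed on all rows with the mean learned from the training split, and shows how the transformers fitted on the training split are reused on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the test split holds a missing fare and an extreme one
train = pd.DataFrame({'fare': [10.0, 20.0, np.nan, 30.0]})
test = pd.DataFrame({'fare': [np.nan, 100.0]})

# Leaky: the statistic is computed on train and test combined
leaky_mean = pd.concat([train, test])['fare'].mean()

# Correct: fit on the training split only, then reuse the fitted transformers
imputer = SimpleImputer(strategy='mean').fit(train)
scaler = MinMaxScaler().fit(imputer.transform(train))
test_scaled = scaler.transform(imputer.transform(test))

print(f"Mean from all data: {leaky_mean:.1f}")
print(f"Mean from training data: {imputer.statistics_[0]:.1f}")
print(test_scaled.ravel())
```

In this toy case the leaky mean is 40.0 while the training mean is 20.0. The scaled test values are 0.5 and 4.5: a test value can fall outside the 0 to 1 range because the minimum and maximum come from the training data, which is the expected behaviour rather than a bug.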

Now, let’s assess the model using ROC-AUC. We will create a function that calculates ROC-AUC, since it will be useful for evaluating the subsequent approaches:

In [5]:
def calculate_roc_auc(model_pipe, X, y):
    """Calculate roc auc score. 
    
    Parameters:
    ===========
    model_pipe: sklearn model or pipeline
    X: features
    y: true target
    """
    y_proba = model_pipe.predict_proba(X)[:,1]
    return roc_auc_score(y, y_proba)
  
print(f"Train ROC-AUC: {calculate_roc_auc(model, X_train, y_train):.4f}")
print(f"Test ROC-AUC: {calculate_roc_auc(model, X_test, y_test):.4f}")

Train ROC-AUC: 0.8669
Test ROC-AUC: 0.8329
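
As a rough intuition for the metric used above, ROC-AUC can be read as the share of (positive, negative) pairs in which the positive record receives the higher score. The sketch below checks this on four hypothetical labels and probabilities; the numbers are made up for illustration and have nothing to do with the model above.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for four records
y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.4, 0.35, 0.8]

# Share of (positive, negative) pairs ranked correctly, counting ties as half
pos = [p for p, y in zip(y_proba, y_true) if y == 1]
neg = [p for p, y in zip(y_proba, y_true) if y == 0]
correct = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
manual_auc = correct / (len(pos) * len(neg))

print(manual_auc)
print(roc_auc_score(y_true, y_proba))
```

Three of the four pairs are ranked correctly here, so both lines print 0.75.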


### Correct Approach but …
We will partition the data first, then preprocess it with Scikit-learn’s transformers. Fitting the transformers on the training data only prevents data leakage:

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=TARGET), df[TARGET], 
                                                    test_size=.2, random_state=SEED, 
                                                    stratify=df[TARGET])
num_imputer = SimpleImputer(strategy='mean')
train_num_imputed = num_imputer.fit_transform(X_train[NUMERICAL])

scaler = MinMaxScaler()
train_num_scaled = scaler.fit_transform(train_num_imputed)

cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
train_cat_imputed = cat_imputer.fit_transform(X_train[CATEGORICAL])

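# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
encoder = OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False)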
train_cat_encoded = encoder.fit_transform(train_cat_imputed)

train_preprocessed = np.concatenate((train_num_scaled, train_cat_encoded), axis=1)

columns = np.append(NUMERICAL, encoder.get_feature_names_out(CATEGORICAL))
pd.DataFrame(train_preprocessed, columns=columns, index=X_train.index).head()

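| | pclass | age | sibsp | parch | fare | deck_B | deck_C | deck_D | deck_E | deck_F | deck_G | deck_missing | embark_town_Queenstown | embark_town_Southampton | embark_town_missing | sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 692 | 1.0 | 0.3693 | 0.0 | 0.0 | 0.1103 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 481 | 0.5 | 0.3693 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 527 | 0.0 | 0.3693 | 0.0 | 0.0 | 0.4329 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 855 | 1.0 | 0.2209 | 0.0 | 0.1667 | 0.0182 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 801 | 0.5 | 0.3843 | 0.125 | 0.1667 | 0.0512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |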


Lovely, we can fit the model now:

In [7]:
model = LogisticRegression()
model.fit(train_preprocessed, y_train)

We need to preprocess the test dataset in the same way before evaluating:

In [8]:
test_num_imputed = num_imputer.transform(X_test[NUMERICAL])
test_num_scaled = scaler.transform(test_num_imputed)
test_cat_imputed = cat_imputer.transform(X_test[CATEGORICAL])
test_cat_encoded = encoder.transform(test_cat_imputed)
test_preprocessed = np.concatenate((test_num_scaled, test_cat_encoded), axis=1)

print(f"Train ROC-AUC: {calculate_roc_auc(model, train_preprocessed, y_train):.4f}")
print(f"Test ROC-AUC: {calculate_roc_auc(model, test_preprocessed, y_test):.4f}")

Train ROC-AUC: 0.8670
Test ROC-AUC: 0.8332


Awesome, this time the approach was correct. But writing good code doesn’t stop at being correct. For each preprocessing step, we stored interim outputs for both the training and test datasets. As the number of preprocessing steps increases, this soon becomes tedious to maintain and therefore prone to errors, such as missing a step when preprocessing the test data. This code can be made more organised, streamlined and readable. That’s what we will do in the next sections.

### Elegant Approach #1
Let’s streamline the previous code using Scikit-learn’s Pipeline and ColumnTransformer. If you aren’t familiar with them, this post explains them concisely.

In [9]:
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())
])

categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
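    # `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))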
])

preprocessors = ColumnTransformer(transformers=[
    ('num', numerical_pipe, NUMERICAL),
    ('cat', categorical_pipe, CATEGORICAL)
])

pipe = Pipeline([
    ('preprocessors', preprocessors),
    ('model', LogisticRegression())
])

pipe.fit(X_train, y_train)

The pipeline:

  - Splits input data into numerical and categorical groups
  - Preprocesses both groups in parallel
  - Concatenates the preprocessed data from both groups
  - Passes the preprocessed data into the model

When raw data is passed to the trained pipeline, it preprocesses the data and makes a prediction. This means we no longer have to store interim results for the training and test datasets. Scoring unseen data is as simple as calling pipe.predict(). That’s very elegant, isn’t it? Now, let’s evaluate the performance of the model:

In [10]:
print(f"Train ROC-AUC: {calculate_roc_auc(pipe, X_train, y_train):.4f}")
print(f"Test ROC-AUC: {calculate_roc_auc(pipe, X_test, y_test):.4f}")

Train ROC-AUC: 0.8670
Test ROC-AUC: 0.8332


It is great to see that this matches the performance of the previous approach, since the transformation is exactly the same, just written in a more elegant way. For our small example, this is the best of the four approaches shown in this post.
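
To show what scoring unseen raw data might look like, here is a self-contained sketch. It uses a hypothetical toy dataset and a simplified pipeline (toy_X, toy_y, toy_pipe and new_X are made-up names, and the data is not the Titanic dataset), so the numbers it prints say nothing about the model in this post.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy training data
toy_X = pd.DataFrame({'fare': [7.0, 70.0, 8.0, 50.0, np.nan, 60.0],
                      'sex': ['male', 'female', 'male', 'female', 'male', 'female']})
toy_y = pd.Series([0, 1, 0, 1, 0, 1])

toy_pipe = Pipeline([
    ('preprocessors', ColumnTransformer([
        ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                          ('scaler', MinMaxScaler())]), ['fare']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
    ])),
    ('model', LogisticRegression()),
])
toy_pipe.fit(toy_X, toy_y)

# Unseen raw records, including a missing fare: no manual preprocessing needed
new_X = pd.DataFrame({'fare': [np.nan, 65.0], 'sex': ['female', 'male']})
new_proba = toy_pipe.predict_proba(new_X)[:, 1]
new_pred = toy_pipe.predict(new_X)
print(new_proba)
print(new_pred)
```

The missing fare in the new data is filled with the mean learned during fitting, so the raw records can be passed straight to predict() or predict_proba().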

Scikit-learn’s out-of-the-box transformers such as OneHotEncoder and SimpleImputer are fast and efficient. However, these prebuilt transformers may not always fulfill our unique preprocessing needs. In those cases, being familiar with the next approach gives us more control over bespoke ways of preprocessing.