## Pipelining
### Introduction
In applied machine learning, there are well-established workflows designed to tackle common issues—one of the most critical being data leakage. Data leakage happens when information from the test set influences the training process, leading to overly optimistic and misleading evaluation results.

One of the most common ways this occurs is during data preparation. For instance, if you apply scaling or normalisation to the entire dataset before splitting into train and test sets, the model unintentionally gains information about the test data. To prevent this, it’s essential to build a robust evaluation setup where training and test data remain strictly separated, including during preprocessing.
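
The difference between leaky and clean scaling can be seen in a minimal sketch. The data here is synthetic and purely hypothetical (it is not the dataset used later), and the variable names are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data: 100 rows, 1 feature
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 1))

X_tr, X_te = train_test_split(X, test_size=0.3, random_state=0)

# Leaky: the scaler's mean and standard deviation include the test rows
leaky_scaler = StandardScaler().fit(X)

# Clean: statistics come from the training rows only, and the same
# fitted scaler is then applied (not refitted) to the test rows
clean_scaler = StandardScaler().fit(X_tr)
X_te_scaled = clean_scaler.transform(X_te)

print("Mean learned from all rows:     ", leaky_scaler.mean_[0])
print("Mean learned from training rows:", clean_scaler.mean_[0])
```

The two means differ because the leaky scaler has seen the test rows. On a single feature the effect is small, but the same mistake repeated across imputation, encoding, and feature selection can noticeably inflate evaluation scores.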

Scikit-learn’s Pipeline utility helps automate and streamline this process by chaining together preprocessing steps (like imputation, encoding, and scaling) with the modelling stage. Each step is applied in the correct order and only within the appropriate training folds, ensuring a clean and fair evaluation.

### Install Python libraries

In [None]:
!pip install pandas matplotlib seaborn scikit-learn

### Adult Income dataset

The Adult Income dataset (also known as the *Census Income* dataset) was collected from the *1994 U.S. Census*. It is commonly used in machine learning for *binary classification* tasks, where the goal is to predict whether a person earns more than $50,000 per year based on demographic and employment-related attributes. This dataset is widely used for:
- Benchmarking classification algorithms
- Learning preprocessing pipelines (handling categorical & numerical data)
- Studying model bias and fairness (e.g., income by gender or race)
- Feature engineering practice

The original purpose of this dataset was to analyse demographic and employment-related factors and predict whether an individual's income exceeds $50,000 per year—a threshold relevant for economic and social research.  This kind of analysis is useful for:

- Economic planning and policy
- Employment and wage studies
- Social inequality research
- Targeted surveys

Each row represents one adult individual, with 14 input features and one target label (`income`):

- <=50K: Income less than or equal to $50,000 a year

- \>50K: Income greater than $50,000 a year

The training and test datasets are provided separately. The test set contains individuals who do not appear in the training set, which makes it suitable for evaluating a model on unseen data. Here is an overview of the features:

| Feature           | Description                                                     |
|-------------------|-----------------------------------------------------------------|
| `age`             | Age of the individual                                           |
| `workclass`       | Type of employment (e.g., Private, Self-emp, Government)        |
| `fnlwgt`          | Final weight – a sampling weight used by the Census             |
| `education`       | Highest level of education completed (e.g., Bachelors, HS-grad) |
| `education_num`   | Numeric encoding of education level                             |
| `marital_status`  | Marital status (e.g., Married, Single)                          |
| `occupation`      | Job type (e.g., Tech-support, Sales, Craft-repair)              |
| `relationship`    | Family role (e.g., Husband, Not-in-family)                      |
| `race`            | Race of the individual                                          |
| `sex`             | Gender of the individual                                        |
| `capital_gain`    | Income from capital gains                                       |
| `capital_loss`    | Losses from capital                                             |
| `hours_per_week`  | Number of hours worked per week                                 |
| `native_country`  | Country of origin                                               |
| `income`          | *Target variable* – `<=50K` or `>50K`                         |

### Loading the data

The UCI repository provides the dataset as separate training and test files. You can load both with `pandas` and then separate the features from the target for modelling and evaluation.


In [None]:
import pandas as pd

# Column names from the dataset documentation
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
    'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
    'hours_per_week', 'native_country', 'income'
]

# Load the train and test data.
# skipinitialspace=True strips the space after each comma, so missing
# values arrive as '?' (not ' ?') and na_values must match that form.
url_train = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

df_train = pd.read_csv(url_train, header=None, names=column_names, na_values='?', skipinitialspace=True)

url_test = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'

# The first line of adult.test is not a data record, so skip it
df_test = pd.read_csv(url_test, skiprows=1, header=None, names=column_names, na_values='?', skipinitialspace=True)

print(df_train.shape)
print(df_test.shape)

# Remove trailing period in target
df_test['income'] = df_test['income'].str.replace('.', '', regex=False)

# Separate features and target for training data
X_train = df_train.drop('income', axis=1)
Y_train = df_train['income']

X_test = df_test.drop('income', axis=1)
Y_test = df_test['income']


Because the UCI repository provides the training and test sets as separate files, there is no need to call `train_test_split`: `df_train` and `df_test` are already distinct datasets.

Some datasets come pre-split like this, with ample training and test data. If you prefer, you can instead apply the usual train-test split to the training file alone, which is large enough (32K+ rows) to support it. More data generally helps, particularly with more sophisticated models.
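
If you did want to split a single file yourself, a minimal sketch might look like the following. The small `df_demo` frame is a hypothetical stand-in for `df_train`, and the 30% hold-out size is an arbitrary choice for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_train: 10 rows with a binary target
df_demo = pd.DataFrame({
    'age': [25, 38, 28, 44, 18, 34, 29, 63, 24, 55],
    'hours_per_week': [40, 50, 40, 40, 30, 30, 40, 32, 40, 10],
    'income': ['<=50K', '<=50K', '>50K', '>50K', '<=50K',
               '<=50K', '<=50K', '>50K', '<=50K', '>50K'],
})

X_demo = df_demo.drop('income', axis=1)
y_demo = df_demo['income']

# Hold out 30% of the rows; stratify keeps the class balance similar in both parts
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7, stratify=y_demo
)

print(X_fit.shape, X_holdout.shape)
```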

This is a big dataset, so we will work with a smaller sample for training and testing. You can increase the sample sizes to suit your hardware and the time you have available:

In [None]:
# Define sample size
train_sample_size = 500
test_sample_size = 200

seed = 7

# Sample from training and test sets (set random_state for reproducibility)
X_train = X_train.sample(n=train_sample_size, random_state=seed)
Y_train = Y_train.loc[X_train.index]  # Ensure labels match sampled features

X_test = X_test.sample(n=test_sample_size, random_state=seed)
Y_test = Y_test.loc[X_test.index]     # Match test labels to test features

In the above, we randomly select a subset of rows from both training and test sets. We also ensure the corresponding labels (`Y_train`, `Y_test`) are aligned correctly using `.loc[]` with the sampled indices. Let's review our sample:

In [None]:
X_train.head()
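
The index alignment used when sampling can be checked on a tiny example. The frame below is hypothetical and only illustrates why `.loc[]` with the sampled index keeps features and labels paired:

```python
import pandas as pd

# Hypothetical miniature dataset with a non-trivial index
X_demo = pd.DataFrame({'age': [25, 38, 28, 44, 18]}, index=[10, 11, 12, 13, 14])
y_demo = pd.Series(['<=50K', '<=50K', '>50K', '>50K', '<=50K'], index=X_demo.index)

# Sample rows from the features, then pick the labels with the same index values
X_sample = X_demo.sample(n=3, random_state=7)
y_sample = y_demo.loc[X_sample.index]

# The two indices match, so row i of X_sample pairs with row i of y_sample
print(X_sample.index.equals(y_sample.index))
```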

Let's also group our columns based on whether they are numeric (and will require scaling) or categorical text (and will require encoding):

In [None]:
# Separate feature types
numerical_features = [
    'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week'
]

categorical_features = [
    'workclass', 'education', 'marital_status', 'occupation',
    'relationship', 'race', 'sex', 'native_country'
]
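
As an alternative to typing the lists by hand, the column groups can often be derived from the dtypes. This is a sketch on a hypothetical miniature frame; `numeric_cols` and `text_cols` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as X_train
X_demo = pd.DataFrame({
    'age': [25, 38, 28],
    'hours_per_week': [40, 50, 40],
    'workclass': ['Private', 'Self-emp', 'Private'],
    'sex': ['Male', 'Female', 'Female'],
})

# Numeric dtypes go in one list, everything else in the other
numeric_cols = X_demo.select_dtypes(include='number').columns.tolist()
text_cols = X_demo.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```

Explicit lists remain the safer option when a numeric column is really a category code, since dtype inspection would place it with the numeric features.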

### Working with pipelines - data preparation
Pipelines help you prevent data leakage in your test harness by ensuring that data preparation, such as standardisation, is fitted separately within each fold of your cross-validation procedure. To start simply, we will use only the numerical columns and create a pipeline that standardises the data and then fits a model. The pipeline is defined with two steps:

1. Standardise the data.
2. Create the model - a Linear Discriminant Analysis model.

The pipeline is then evaluated using 10-fold cross-validation:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Select only numerical columns from the training and test sets
X_train_num = X_train[numerical_features]
X_test_num = X_test[numerical_features]

# Create pipeline with standardisation and LDA model
estimators = [
    ('standardise', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis())
]

# The estimators are passed to the pipeline
model = Pipeline(estimators)

seed = 7

# Set up 10-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

# Evaluate using cross-validation
results = cross_val_score(model, X_train_num, Y_train, cv=kfold)

# Evaluate final model on test set
model.fit(X_train_num, Y_train)

test_accuracy = model.score(X_test_num, Y_test)

# Show results
print("Cross-validated Training Accuracy: {:.4f}".format(results.mean()))
print("Test Accuracy: {:.4f}".format(test_accuracy))
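
To make "constrained to each fold" concrete, the sketch below writes out by hand roughly what `cross_val_score` does with a pipeline. It runs on synthetic data from `make_classification` (a hypothetical stand-in for the numerical columns), and names such as `pipe` and `kfold_demo` are illustrative:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data standing in for the numerical columns
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=7
)

pipe = Pipeline([
    ('standardise', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis())
])
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=7)

manual_scores = []
for train_idx, val_idx in kfold_demo.split(X_demo):
    fold_model = clone(pipe)  # fresh, unfitted copy for this fold
    # Scaler statistics and LDA parameters come from the training fold only
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # The fitted scaler is applied, not refitted, to the validation fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

auto_scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold_demo)

print(np.allclose(manual_scores, auto_scores))
```

Because the scaler lives inside the pipeline, it is refitted on every training fold and never sees the corresponding validation rows.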

This illustrates the basic pipeline framework used in machine learning workflows. However, for our dataset, we need to extend the pipeline to handle real-world challenges in the data more effectively. The dataset contains a mix of *numerical* and *categorical* features, and it may also include *missing values*.

These issues are common in applied machine learning tasks, and they need to be addressed before training a model. If left untreated, they can lead to poor performance or even errors during training.  To prepare the dataset properly, we need to perform several key preprocessing steps:

- *Imputing missing values*: Filling in missing or null entries so the model can be trained without errors. This might involve replacing missing values with the mean (for numbers) or the most frequent value (for categories).
- *Encoding categorical variables*: Converting non-numeric columns (like strings or labels) into numeric form, so they can be understood by algorithms. Common techniques include one-hot encoding or ordinal encoding.
- *Scaling numerical features*: Adjusting the scale of numeric values so that features with large ranges (like `fnlwgt` or `capital_gain`) don’t dominate features with smaller ranges (like `education_num`). This is especially important for algorithms sensitive to feature magnitude, such as KNN or regularised logistic regression.
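
The three steps above can be sketched one at a time on a tiny example. The values are hypothetical, and each transformer is used on its own here only to show its effect; in practice they would be combined in a pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature data with one missing value in each column
ages = pd.DataFrame({'age': [20.0, 30.0, np.nan, 40.0]})
jobs = pd.DataFrame({'workclass': ['Private', np.nan, 'Private', 'Self-emp']}, dtype=object)

# 1) Imputation: median for numbers, most frequent value for categories
ages_filled = SimpleImputer(strategy='median').fit_transform(ages)
jobs_filled = SimpleImputer(strategy='most_frequent').fit_transform(jobs)

# 2) Encoding: one binary column per category
jobs_encoded = OneHotEncoder().fit_transform(jobs_filled).toarray()

# 3) Scaling: zero mean and unit standard deviation
ages_scaled = StandardScaler().fit_transform(ages_filled)

print(ages_filled.ravel())
print(jobs_filled.ravel())
print(jobs_encoded)
print(ages_scaled.ravel())
```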

To organise all of these steps clearly and efficiently, we can use a `ColumnTransformer` within a `Pipeline`. The `ColumnTransformer` allows us to apply different preprocessing strategies to different subsets of columns — for example, scaling only the numeric ones and encoding only the categorical ones. This ensures that each type of data is treated appropriately in a single, unified pipeline:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Define preprocessing pipeline for numerical features:
# 1) Impute missing values using the median
# 2) Scale features to have mean 0 and standard deviation 1
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Define preprocessing pipeline for categorical features:
# 1) Impute missing values using the most frequent category
# 2) Convert categories into one-hot encoded binary columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the two transformers using a ColumnTransformer:
# This applies the appropriate pipeline to each column type
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),     # Apply to numeric columns
    ('cat', categorical_transformer, categorical_features)  # Apply to categorical columns
])

# Create a full pipeline:
# 1) Preprocess the data
# 2) Fit a logistic regression model on the transformed data
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))  # Increased max_iter to ensure convergence
])

# Set up 10-fold cross-validation to evaluate training performance.
# cross_val_score fits a fresh clone of the pipeline inside each fold.
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(model, X_train, Y_train, cv=kfold)

# Train the pipeline on the full training sample, then score it on the test sample
model.fit(X_train, Y_train)
test_accuracy = model.score(X_test, Y_test)

# Show cross-validated and test accuracy
print("Cross-validated Training Accuracy: {:.4f}".format(results.mean()))
print("Test Accuracy: {:.4f}".format(test_accuracy))
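
The `handle_unknown='ignore'` setting matters because a validation fold or the test set may contain a category that was absent when the encoder was fitted. A minimal sketch with hypothetical values:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test values for a single categorical column
train_values = [['Private'], ['Self-emp'], ['Private']]
test_values = [['Private'], ['Never-worked']]  # 'Never-worked' was not seen during fit

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_values)

# Known categories get a single 1; the unseen category becomes a row of zeros
encoded_test = encoder.transform(test_values).toarray()
print(encoder.categories_)
print(encoded_test)
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error, which can interrupt cross-validation on small samples where rare categories fall into only one fold.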


### Working with pipelines – feature extraction

*Feature extraction* is the process of transforming raw input data into new features that can improve model performance. Like other data preparation steps, feature extraction must be done *carefully* to avoid *data leakage*. Any extractor should be fitted only on training data (within cross-validation, only on the training portion of each fold) and then applied unchanged to the held-out data. Otherwise, information from the held-out data can "leak" into training, leading to overly optimistic results.

To help with this, scikit-learn offers a powerful tool called `FeatureUnion`. This allows us to combine multiple feature extraction steps into a single, unified set of features, which are then passed to the model. Importantly, when used within a pipeline, all transformations and the union of features occur within each fold of cross-validation, ensuring no leakage.

In the example below, we build a pipeline that includes two parallel feature extraction techniques and a classifier, combined in four main steps:

1. *Feature Extraction with Principal Component Analysis (PCA)*: Reduces the dimensionality of the data to three components that capture the most variance.
2. *Feature Extraction with Statistical Selection*: Selects the six most relevant features using a univariate statistical test (by default `SelectKBest` uses `f_classif`, the ANOVA F-test).
3. *Feature Union*: Merges the outputs from the two feature extraction techniques into a single feature set.
4. *Logistic Regression Model*: Trains a logistic regression classifier on the combined features.

This entire process is wrapped inside a pipeline and evaluated using 10-fold cross-validation. A key takeaway is that `FeatureUnion` itself acts like a *mini-pipeline*, and becomes a *single step* in the final pipeline that leads to model training.

This shows how you can nest pipelines inside pipelines — a flexible and clean way to manage complex preprocessing and modelling workflows:

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Define preprocessing for numerical features:
# 1) Impute missing values using the median
# 2) Standardise features to mean 0 and standard deviation 1
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Define preprocessing for categorical features:
# 1) Impute missing values with the most frequent value
# 2) One-hot encode categories (convert to binary columns)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

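# Combine both transformers into a single preprocessor:
# This applies the right transformation to each feature type.
# sparse_threshold=0 forces a dense output, because the one-hot columns
# would otherwise make it sparse and PCA (used below) rejects sparse
# input in many scikit-learn versions.
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
], sparse_threshold=0)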

# Define feature extraction steps:
# 1) PCA reduces the feature space to 3 dimensions (captures most variance)
# 2) SelectKBest selects the 6 most statistically relevant features
# Both are applied in parallel and their results combined
feature_union = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6))
])

# Define full pipeline:
# 1) Preprocess the data
# 2) Extract features using both PCA and SelectKBest
# 3) Train a logistic regression model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('features', feature_union),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Set up k-fold cross-validation (10 folds, shuffled, with fixed seed)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

# Evaluate model performance using cross-validation on training data
results = cross_val_score(model, X_train, Y_train, cv=kfold)

# Print cross-validated accuracy on the training sample
print("Cross-validated Training Accuracy: {:.4f}".format(results.mean()))
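
To see what the union step produces, the sketch below applies the same `FeatureUnion` to synthetic data (a hypothetical stand-in for the preprocessed features) and compares the column counts before and after:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic, hypothetical data: 100 rows, 10 numeric features
X_demo, y_demo = make_classification(
    n_samples=100, n_features=10, n_informative=4, n_redundant=0, random_state=7
)

union = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6))
])

# Each transformer is fitted on the same input; their outputs are stacked column-wise
X_combined = union.fit_transform(X_demo, y_demo)

print(X_demo.shape)
print(X_combined.shape)
```

The combined matrix has 3 + 6 = 9 columns: the three principal components followed by the six selected original features.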

Using a pipeline helps prevent *data leakage*, which is a common pitfall in machine learning. By wrapping both the *preprocessing steps* and the *model* inside a single pipeline, we ensure that any transformations (like scaling or encoding) are fitted **only** on the training portion of each cross-validation fold and then applied unchanged to the held-out portion. This means the model doesn’t accidentally learn patterns from data it is later evaluated on, leading to more honest and reliable results.

Pipelines also result in cleaner and more reusable code. They make your workflow modular — each step is clearly defined and easy to swap, tune, or extend. This is especially helpful in larger projects or when collaborating with others.
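
As one illustration of tuning a step, a pipeline can be passed to `GridSearchCV`, with parameters addressed as `<step name>__<parameter name>`. This is a sketch on synthetic data; the grid of `C` values is an arbitrary assumption chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=7
)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# 'classifier__C' targets the C parameter of the step named 'classifier'
param_grid = {'classifier__C': [0.01, 0.1, 1.0, 10.0]}

# The whole pipeline is refitted inside each fold for each candidate value
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```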

Another benefit is *fair* model evaluation. Because pipelines ensure that every model is trained and validated using the same cross-validation splits — with consistent preprocessing inside each fold — the results are more directly comparable. This helps you make more accurate decisions when choosing between models.
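
A like-for-like comparison might be sketched as follows: two candidate models share the same preprocessing step and the same `KFold` object, so they are scored on identical splits. The data is synthetic and the candidate names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=7
)

# One KFold object with a fixed seed gives every candidate the same splits
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)

candidates = {
    'lda': LinearDiscriminantAnalysis(),
    'logistic_regression': LogisticRegression(max_iter=1000),
}

mean_scores = {}
for name, estimator in candidates.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('classifier', estimator)])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold_demo)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f}")
```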

### What have we learnt?
Pipelines allow you to chain together data preprocessing and modelling steps into a single, unified workflow. This ensures consistency, helps avoid common mistakes like data leakage, and simplifies code management—especially when working with complex datasets like the Adult Income dataset, which contains both numerical and categorical features.

We also learned how to extend pipelines with feature engineering, enabling us to reduce dimensionality and select the most informative features. When we integrate steps like imputation, encoding, scaling, and classification into one pipeline, we ensure that cross-validation and testing are fair and reproducible. Ultimately, pipelines provide a powerful, modular way to build, evaluate, and tune models in a clean and scalable manner—making them essential for any real-world machine learning project.