# ColumnTransformer Pipelines

Feature-Type–Aware, Leakage-Safe ML Architectures
    
## Objective

This notebook focuses on building robust, maintainable pipelines using ColumnTransformer, covering:

- Explicit numeric vs categorical preprocessing

- Mixed data types in real-world datasets

- Schema-aware transformations

- Safe extensibility for feature engineering

It answers:

    How do we correctly preprocess heterogeneous tabular data without manual feature handling or leakage?

## Why ColumnTransformer Is Essential

Real datasets contain:

- Numeric variables (scale-sensitive)

- Categorical variables (encoding required)

- Ordinal variables (order matters)

- Mixed missingness patterns

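Without ColumnTransformer, preprocessing is usually done by hand, which tends to cause:

- Feature leakage, when statistics are computed before the split

- Models that break when column order changes

- Brittle pipelines

- Failures at production inference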

    ColumnTransformer is not optional for tabular ML.

## Imports and dataset

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
    OrdinalEncoder
)

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


In [5]:

DATA_PATH = "D:/GitHub/Data-Science-Techniques/datasets/synthetic_customer_churn_classification_complete.csv"
df = pd.read_csv(DATA_PATH)

X = df.drop(columns=["churn", "customer_id"])
y = df["churn"]


In [42]:
df.satisfaction_level.unique()

array([nan, 'Very High', 'Medium', 'High', 'Very Low', 'Low'],
      dtype=object)

# Train/Test Split (Always First)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,
    random_state=2010
)
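
To illustrate what stratify=y does in the split above, here is a minimal sketch on synthetic labels (the 80/20 class balance and the array names are assumptions for illustration, not the churn data). It checks that the positive rate is the same in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.25,
    stratify=toy_y,      # keep the class ratio identical in both partitions
    random_state=2010,
)

print(toy_y_train.mean(), toy_y_test.mean())  # 0.2 0.2
```

Splitting before any fitting means that no imputer, scaler, or encoder ever sees the test rows.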


# Explicit Feature Grouping
## Numeric Features

In [11]:
numeric_features = [
    "age",
    "income",
    "tenure_years",
    "avg_monthly_usage",
    "support_tickets_last_year"
]


## Ordinal Features

In [44]:
ordinal_features = ["satisfaction_level"]

ordinal_categories = [
    ["Very Low", "Low", "Medium", "High", "Very High"]
]


## Nominal Categorical Features

In [47]:
categorical_features = [
    "customer_segment",
    "region"
]


Explicit grouping improves:

- Readability

- Auditability

- Long-term maintenance
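
Explicit groups can also be validated before any fitting. The sketch below uses a hypothetical helper, check_feature_groups, and a toy frame standing in for X_train. It reports columns that are declared but missing, and columns that appear in more than one group.

```python
import pandas as pd


def check_feature_groups(frame, groups):
    """Return (missing, duplicated) column names for the declared feature groups."""
    declared = [col for cols in groups.values() for col in cols]
    missing = sorted(set(declared) - set(frame.columns))
    duplicated = sorted({col for col in declared if declared.count(col) > 1})
    return missing, duplicated


# Toy frame with hypothetical values; it deliberately lacks customer_segment
toy = pd.DataFrame({
    "age": [34, 51],
    "income": [42_000.0, 67_500.0],
    "region": ["north", "south"],
    "satisfaction_level": ["Low", "High"],
})

groups = {
    "numeric": ["age", "income"],
    "ordinal": ["satisfaction_level"],
    "categorical": ["region", "customer_segment"],
}

missing, duplicated = check_feature_groups(toy, groups)
print(missing)     # ['customer_segment']
print(duplicated)  # []
```

Running a check like this at the top of a training script turns a silent schema drift into an immediate, readable failure.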

# Pipeline
## Numeric Pipeline

In [50]:
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])


✔ Imputes missing values with the training-set median

✔ Standardizes features so coefficient magnitudes are comparable
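
A minimal sketch of what this pipeline learns, using a tiny hypothetical column rather than the churn data. The median and the scaling mean come from the training rows only, and the same values are reused when transforming new rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric column with missing values
train = np.array([[1.0], [2.0], [np.nan], [3.0]])
test = np.array([[np.nan], [10.0]])

numeric_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
numeric_demo.fit(train)

# Statistics learned from the training rows only
median = numeric_demo.named_steps["imputer"].statistics_[0]
mean = numeric_demo.named_steps["scaler"].mean_[0]
print(median, mean)  # 2.0 2.0

# The missing test value is filled with the training median (2.0),
# which equals the training mean here, so it scales to 0.0
scaled_test = numeric_demo.transform(test).ravel()
print(scaled_test[0])
```

The test value 10.0 never influences the median or the mean, which is the property that keeps the preprocessing leakage-safe.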

## Ordinal Pipeline

In [53]:
ordinal_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=ordinal_categories))
])


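✔ Preserves order: Very Low maps to 0 and Very High to 4, with missing values first imputed as the most frequent level

✔ Avoids one-hot explosion: one output column instead of one per level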
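
The mapping can be checked on a small hypothetical column (the five values below are made up for illustration). The category list is the same one used in ordinal_categories.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

levels = [["Very Low", "Low", "Medium", "High", "Very High"]]

# Hypothetical sample; dtype=object keeps the missing entry as np.nan
toy = pd.DataFrame(
    {"satisfaction_level": ["High", np.nan, "Very Low", "High", "Very High"]},
    dtype=object,
)

ordinal_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=levels)),
])

# "High" is the most frequent level, so the missing entry becomes "High" (code 3)
encoded = ordinal_demo.fit_transform(toy).ravel()
print(encoded)  # [3. 3. 0. 3. 4.]
```

Without the explicit category list, OrdinalEncoder would sort the levels alphabetically, and the integer codes would no longer reflect the intended order.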

## Categorical Pipeline

In [56]:
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(
        handle_unknown="ignore",
        drop="first"
    ))
])


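✔ Handles unseen categories without raising an error: they are encoded as all zeros. With drop="first", that row is indistinguishable from the dropped reference category, and combining the two options requires scikit-learn 1.0 or later

✔ Avoids the dummy-variable trap by dropping one level per feature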

## ColumnTransformer Assembly

In [59]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("ord", ordinal_pipeline, ordinal_features),
        ("cat", categorical_pipeline, categorical_features)
    ],
    remainder="drop"
)


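Guarantees:

- Correct routing: columns are selected by name, not by position

- Feature-type isolation: each group passes only through its own pipeline

- No manual joins: transformer outputs are concatenated in the order listed

Columns that belong to no group are discarded because remainder="drop".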

# Full Modeling Pipeline

In [62]:
pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])


✔ Single deployable object

✔ Leakage-safe preprocessing: every imputer, scaler, and encoder learns its parameters only from the data passed to fit

## Train Pipeline

In [65]:
pipeline.fit(X_train, y_train)


All imputers, encoders, and scalers are fit only on training data.

## Evaluate Pipeline

In [68]:
y_test_prob = pipeline.predict_proba(X_test)[:, 1]

roc_auc_score(y_test, y_test_prob)


np.float64(0.8382716716977812)

## Inspect Transformed Feature Space

In [71]:
pipeline.named_steps["preprocessing"]


Displaying the fitted preprocessor shows which columns each transformer receives. This is critical for:

- Debugging

- Feature audits

- Model documentation

# Why This Design Scales


| Requirement        | Satisfied |
| ------------------ | --------- |
| Mixed data types   | ✔         |
| Leakage prevention | ✔         |
| Schema evolution   | ✔         |
| New features       | ✔         |
| Deployment safety  | ✔         |


# Common Mistakes (Avoided)

- ❌ Using OneHotEncoder for ordinals
- ❌ Fitting encoders or scalers before the split
- ❌ Mixing preprocessing logic for different feature types
- ❌ Hard-coding column indices
- ❌ Manual feature concatenation

# When ColumnTransformer Is Mandatory

- ✔ Tabular ML
- ✔ Regulatory models
- ✔ Production pipelines
- ✔ Feature-rich datasets
- ✔ Team-based projects

# Key Takeaways

- ColumnTransformer is the backbone of tabular ML

- Feature-type awareness improves correctness

- Explicit schemas improve governance

- Pipelines should be readable, not clever

- This pattern is a sound default for production

# Related Notebooks

[09_Pipelines_and_Workflows/]()

├── [01_basic_pipeline.ipynb](01_basic_pipeline.ipynb)

├── [02_column_transformer_pipeline.ipynb](02_column_transformer_pipeline.ipynb) **← YOU ARE HERE**

├── [03_pipeline_with_feature_engineering.ipynb](03_pipeline_with_feature_engineering.ipynb)

├── [04_leakage_safe_cross_validation.ipynb](04_leakage_safe_cross_validation.ipynb)

├── [05_pipeline_with_model_tuning.ipynb](05_pipeline_with_model_tuning.ipynb)

├── [06_pipeline_serialization_and_inference.ipynb](06_pipeline_serialization_and_inference.ipynb)

├── [07_pipeline_monitoring_and_reusability.ipynb](07_pipeline_monitoring_and_reusability.ipynb)