# Pipeline with Feature Engineering
Embedding Business and Statistical Features Safely into ML Pipelines

## Objective

This notebook demonstrates how to:

- Add domain-driven and interaction features inside pipelines

- Use FunctionTransformer and custom transformers

- Keep feature engineering leakage-safe

- Maintain clean separation between raw data and learned features

It answers:

    How do we engineer features without breaking reproducibility or leaking information?

## Why Feature Engineering Must Live Inside Pipelines

If feature engineering happens:

- Outside pipelines → ❌ leakage risk

- In notebooks only → ❌ not deployable

- Differently in training vs inference → ❌ silent bugs

📌 Any feature not inside the pipeline does not exist in production.
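A minimal sketch of the leakage risk, using synthetic numbers rather than the churn dataset: a scaler fitted on all rows learns statistics that include the held-out rows, while one fitted on the training rows does not. The data and variable names below are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
train_values, test_values = train_test_split(
    values, test_size=0.25, random_state=0
)

# Leaky: the statistics are learned from every row, including the test rows.
leaky_scaler = StandardScaler().fit(values)

# Safe: the statistics are learned from the training rows only.
safe_scaler = StandardScaler().fit(train_values)

print("mean learned from all rows:  ", leaky_scaler.mean_[0])
print("mean learned from train rows:", safe_scaler.mean_[0])
```

The two means differ, which shows that the first scaler has seen the test rows. Inside a pipeline, `fit` only ever receives the training data, so the second behaviour is what you get by construction.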

# Imports and dataset

In [2]:
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
    OrdinalEncoder,
    FunctionTransformer
)

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


In [5]:

DATA_PATH = "D:/GitHub/Data-Science-Techniques/datasets/synthetic_customer_churn_classification_complete.csv"
df = pd.read_csv(DATA_PATH)

X = df.drop(columns=["churn", "customer_id"])
y = df["churn"]


# Train/Test Split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,
    random_state=2010
)


# Custom Feature Engineering Transformer
Example: Usage Intensity & Support Load

In [13]:
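class UsageSupportFeatures(BaseEstimator, TransformerMixin):
    """Add ratio features that normalise usage and support load by tenure."""

    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data, so fit is a no-op.
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is never mutated.
        X = X.copy()
        # The +1 avoids division by zero when tenure_years is 0.
        X["usage_per_tenure"] = X["avg_monthly_usage"] / (X["tenure_years"] + 1)
        X["tickets_per_year"] = X["support_tickets_last_year"] / (X["tenure_years"] + 1)
        return X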


✔ Deterministic

✔ No target leakage

✔ Reusable
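The objective also mentions `FunctionTransformer`. Because these ratio features are stateless, the same logic could plausibly be expressed as a plain function wrapped in a `FunctionTransformer`. This is a sketch on made-up rows; the function name `add_usage_support_features` and the toy values are assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_usage_support_features(X):
    """Return a copy of X with the two tenure-normalised ratios added."""
    X = X.copy()
    X["usage_per_tenure"] = X["avg_monthly_usage"] / (X["tenure_years"] + 1)
    X["tickets_per_year"] = X["support_tickets_last_year"] / (X["tenure_years"] + 1)
    return X


# validate=False (the default) lets the DataFrame pass through unchanged.
fe_step = FunctionTransformer(add_usage_support_features)

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "avg_monthly_usage": [100.0, 60.0],
    "tenure_years": [1, 0],
    "support_tickets_last_year": [4, 3],
})

out = fe_step.fit_transform(toy)
print(out[["usage_per_tenure", "tickets_per_year"]])
```

A custom class is the better fit once the step needs parameters or has to learn something in `fit`; for a pure function of the input row, `FunctionTransformer` keeps the code shorter.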

## Apply Feature Engineering Early in Pipeline

In [16]:
feature_engineering = Pipeline(steps=[
    ("fe", UsageSupportFeatures())
])


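The engineering step runs before preprocessing, so the ColumnTransformer can select the new columns by name and impute and scale them like any other numeric feature.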
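The ordering matters because the preprocessor selects columns by name. The sketch below, on made-up rows, uses a hypothetical one-feature stand-in called `RatioFeature` to show that selection succeeds after engineering and fails without it.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioFeature(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for UsageSupportFeatures with a single ratio."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["usage_per_tenure"] = X["avg_monthly_usage"] / (X["tenure_years"] + 1)
        return X


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "avg_monthly_usage": [10.0, 40.0, 90.0],
    "tenure_years": [0, 1, 2],
})

select_engineered = ColumnTransformer(
    [("num", StandardScaler(), ["usage_per_tenure"])]
)

# Engineering first: the column exists by the time it is selected.
ordered = Pipeline([("fe", RatioFeature()), ("prep", select_engineered)])
scaled = ordered.fit_transform(toy)
print(scaled.shape)

# Without the engineering step, selecting the column by name fails.
selection_failed = False
try:
    clone(select_engineered).fit(toy)
except (ValueError, KeyError):
    selection_failed = True
print("selecting before engineering failed:", selection_failed)
```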

## Feature Grouping (After Engineering)

In [45]:
numeric_features = [
    "age",
    "income",
    "tenure_years",
    "avg_monthly_usage",
    "support_tickets_last_year",
    "usage_per_tenure",
    "tickets_per_year"
]

ordinal_features = ["satisfaction_level"]
ordinal_categories = [["Very Low", "Low", "Medium", "High", "Very High"]]


categorical_features = ["customer_segment", "region"]
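The explicit `ordinal_categories` list is what gives the encoded integers their business meaning. As a small illustration on made-up values, compare the codes produced with and without the explicit order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ordinal_categories = [["Very Low", "Low", "Medium", "High", "Very High"]]

# Made-up values, for illustration only.
toy = pd.DataFrame(
    {"satisfaction_level": ["High", "Very Low", "Medium", "Very High"]}
)

# Codes follow the position in the supplied list.
explicit = OrdinalEncoder(categories=ordinal_categories).fit_transform(toy)

# Without the list, categories are sorted alphabetically.
alphabetical = OrdinalEncoder().fit_transform(toy)

print(explicit.ravel())      # [3. 0. 2. 4.]
print(alphabetical.ravel())  # [0. 3. 1. 2.]
```

With the default alphabetical order, "Very Low" would be encoded higher than "Very High", which would mislead any model that treats the code as a magnitude.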


# Preprocessing Pipelines

In [48]:
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])


### Ordinal

In [51]:
ordinal_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=ordinal_categories))
])


### Categorical

In [54]:
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
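    # Note: with drop="first", an unseen category is encoded as all zeros,
    # which is the same encoding as the dropped reference level.
    ("encoder", OneHotEncoder(handle_unknown="ignore", drop="first"))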
])


## ColumnTransformer

In [57]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("ord", ordinal_pipeline, ordinal_features),
        ("cat", categorical_pipeline, categorical_features)
    ]
)


# Full Pipeline Assembly

In [60]:
pipeline = Pipeline(steps=[
    ("feature_engineering", feature_engineering),
    ("preprocessing", preprocessor),
    ("model", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])


The entire workflow, from feature engineering through preprocessing to the model, is now one object.
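One practical consequence of having a single object is that it can be handed to utilities such as `cross_val_score`, which clone and refit every step on each training fold. The sketch below uses a simplified stand-in pipeline and synthetic data from `make_classification`, not the churn pipeline above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold refits the scaler and the model on that fold's training rows only.
scores = cross_val_score(toy_pipeline, X_toy, y_toy, cv=5, scoring="roc_auc")
print(scores.mean())
```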

## Train Pipeline

In [63]:
pipeline.fit(X_train, y_train)


# Evaluate Pipeline

In [67]:
y_test_prob = pipeline.predict_proba(X_test)[:, 1]

roc_auc_score(y_test, y_test_prob)




np.float64(0.8381306520515178)
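ROC AUC is computed from scores for the positive class, which is why the evaluation takes column 1 of `predict_proba`. A tiny hand-checkable example with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

The metric depends only on how the scores rank positives against negatives, not on their absolute values.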

## Inspect Feature Flow

In [70]:
pipeline


Helps trace:

- Raw → engineered → encoded → modeled
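Beyond the rendered diagram, intermediate stages can be inspected programmatically. The sketch below uses a simplified stand-in pipeline on synthetic data, with the same step names as the notebook's pipeline, to show step lookup through `named_steps` and slicing with `pipeline[:-1]` to obtain the matrix the model actually receives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.uniform(1.0, 10.0, size=(40, 2))
y_toy = (X_toy[:, 0] > X_toy[:, 1]).astype(int)

toy_pipeline = Pipeline(steps=[
    ("feature_engineering", FunctionTransformer(np.log)),
    ("preprocessing", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X_toy, y_toy)

# Look up a fitted step by name and read what it learned.
scaler = toy_pipeline.named_steps["preprocessing"]
print(scaler.mean_)

# Slice off the final estimator to see the model's input.
model_input = toy_pipeline[:-1].transform(X_toy)
print(model_input.shape)
```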

## Why This Pattern Is Correct

| Concern         | Addressed |
| --------------- | --------- |
| Leakage         | ✔         |
| Reproducibility | ✔         |
| Deployment      | ✔         |
| Governance      | ✔         |
| Feature drift   | ✔         |


## Common Mistakes (Avoided)

- ❌ Feature engineering outside pipeline
- ❌ Target-dependent features
- ❌ Hard-coded column indices
- ❌ Post-split feature mutation
- ❌ Inference-time mismatches

## Key Takeaways

- Feature engineering belongs inside pipelines

- Custom transformers are simple and powerful

- Always engineer features before encoding

- Pipelines must reflect production reality

- This pattern scales to complex systems

# Related Notebooks

09_Pipelines_and_Workflows/

├── [01_basic_pipeline.ipynb](01_basic_pipeline.ipynb)

├── [02_column_transformer_pipeline.ipynb](02_column_transformer_pipeline.ipynb)

├── [03_pipeline_with_feature_engineering.ipynb](03_pipeline_with_feature_engineering.ipynb)

├── [02_leakage_safe_cross_validation.ipynb](02_leakage_safe_cross_validation.ipynb)

├── [04_pipeline_with_model_tuning.ipynb](04_pipeline_with_model_tuning.ipynb)

├── [05_pipeline_serialization_and_inference.ipynb](05_pipeline_serialization_and_inference.ipynb)

├── [03_pipeline_monitoring_and_reusability.ipynb](03_pipeline_monitoring_and_reusability.ipynb)