# Pipeline with Model Tuning

Leakage-Safe Hyperparameter Optimization

## Objective

This notebook demonstrates how to:

- Combine preprocessing, feature engineering, and modeling into a single pipeline

- Perform GridSearchCV / RandomizedSearchCV safely

- Avoid tuning-induced data leakage

- Compare tuned vs untuned models fairly

It answers:

    How do we tune models without contaminating validation performance?

## Why Model Tuning Is a Leakage Risk

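Common (wrong) pattern:

- Preprocess the data globally

- Split the data or set up CV

- Tune the model

❌ This leaks information from the validation folds: the imputation and scaling statistics were computed on rows that are later used for validation.

📌 All preprocessing and tuning must occur inside the CV folds.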
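The leak can be made visible with a minimal sketch. It assumes a small synthetic array (`X_demo`, not the churn dataset) and compares a scaler fit on every row with one fit only on the training rows of a single fold; all names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (assumption: not the churn dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Wrong: the scaler sees every row, including rows later used for validation.
global_scaler = StandardScaler().fit(X_demo)

# Right: within a fold, the scaler is fit on the training rows only.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, valid_idx = next(iter(kfold.split(X_demo)))
fold_scaler = StandardScaler().fit(X_demo[train_idx])

# The learned means differ: the global statistics encode the validation rows.
mean_gap = np.abs(global_scaler.mean_ - fold_scaler.mean_)
print(mean_gap)
```

The gap is small for a scaler on well-behaved data, but the same mechanism applies to imputers, encoders, and feature selectors, where the effect on validation scores can be much larger.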

## Imports and Dataset

In [1]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder
)

from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import (
    StratifiedKFold,
    GridSearchCV,
    RandomizedSearchCV
)


In [2]:

DATA_PATH = "D:/GitHub/Data-Science-Techniques/datasets/synthetic_customer_churn_classification_complete.csv"
df = pd.read_csv(DATA_PATH)

X = df.drop(columns=["churn", "customer_id"])
y = df["churn"]


# Define Preprocessing Pipeline

In [6]:
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object", "category"]).columns

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

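categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # With drop="first", a category unseen during fit is encoded as all zeros,
    # which is the same encoding as the dropped first category.
    ("encoder", OneHotEncoder(
        handle_unknown="ignore",
        drop="first"
    ))
])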

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features)
    ]
)


# Full Modeling Pipeline

In [9]:
pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])


No preprocessing happens outside the pipeline.

# Hyperparameter Grid (Logistic Regression)

In [14]:
param_grid = {
    "model__C": [0.01, 0.1, 1.0, 10.0],
    "model__penalty": ["l1", "l2"],
    "model__solver": ["liblinear"]
}
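The `model__` prefix in the grid keys routes each value to the pipeline step named `model`; nested steps chain further prefixes with double underscores. A minimal sketch, assuming a stand-in pipeline (`demo_pipeline`) with the same step names as above, shows how `get_params()` can be used to check that every grid key exists before launching a search. The preprocessing key is a hypothetical addition, included only to show that preprocessing choices can be tuned the same way.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline with the same step names as the notebook (illustrative).
demo_pipeline = Pipeline(steps=[
    ("preprocessing", ColumnTransformer(transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), [0, 1]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# get_params() lists every tunable name, including nested ones.
available = demo_pipeline.get_params()

candidate_keys = [
    "model__C",
    "model__penalty",
    "preprocessing__num__imputer__strategy",  # hypothetical extra grid key
]
missing = [key for key in candidate_keys if key not in available]
print("missing keys:", missing)
```

A misspelled key makes the search fail at fit time, so checking the names up front is cheap insurance.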


# Leakage-Safe GridSearchCV

In [31]:
cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=2010
)

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=cv,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X, y)


Fitting 5 folds for each of 8 candidates, totalling 40 fits


## Results Interpretation

In [33]:
grid_search.best_params_


{'model__C': 0.01, 'model__penalty': 'l1', 'model__solver': 'liblinear'}

In [35]:
grid_search.best_score_


np.float64(1.0)

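This ROC-AUC is evaluated on held-out folds, so the tuning procedure itself does not leak. A perfect score of 1.0 still deserves scrutiny: it usually indicates a trivially separable synthetic dataset or a feature that encodes the target, rather than a benefit of tuning.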
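Looking only at `best_score_` hides the fold-to-fold spread. A minimal sketch, assuming synthetic data from `make_classification` and a reduced grid (both illustrative, not the churn setup), shows how `cv_results_` exposes the mean and standard deviation for every candidate.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

demo_search = GridSearchCV(
    estimator=Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=2010),
)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the held-out fold scores.
results = (
    pd.DataFrame(demo_search.cv_results_)
    [["param_model__C", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
```

If the difference between the top candidates is smaller than their `std_test_score`, the ranking is probably not meaningful.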

## Tuned vs Untuned Comparison

In [39]:
from sklearn.model_selection import cross_val_score

baseline_scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=cv,
    scoring="roc_auc"
)

baseline_scores.mean()


np.float64(1.0)

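Always compare tuned and untuned models under the same CV strategy: the same `cv` object, the same folds, and the same scoring metric. Here both the baseline and the tuned pipeline reach a mean ROC-AUC of 1.0, so tuning shows no measurable gain on this dataset.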

# RandomizedSearchCV (Tree Model)

In [44]:
rf_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", RandomForestClassifier(
        random_state=2010,
        class_weight="balanced"
    ))
])


In [46]:
param_dist = {
    "model__n_estimators": [100, 200, 400],
    "model__max_depth": [None, 4, 6, 8],
    "model__min_samples_split": [2, 5, 10],
    "model__max_features": ["sqrt", "log2"]
}


In [48]:
random_search = RandomizedSearchCV(
    estimator=rf_pipeline,
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",
    cv=cv,
    random_state=42,
    n_jobs=-1,
    verbose=1
)

random_search.fit(X, y)


Fitting 5 folds for each of 20 candidates, totalling 100 fits
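The lists above define a finite space of 72 combinations, from which `n_iter=20` are sampled. `RandomizedSearchCV` also accepts distribution objects, which is useful when a parameter has a wide numeric range. The following is a sketch under stated assumptions: synthetic data, a reduced pipeline, and `scipy.stats.randint` ranges chosen only for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

demo_rf_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(random_state=2010)),
])

# Distributions are sampled per iteration; lists are sampled uniformly.
demo_dist = {
    "model__n_estimators": randint(20, 61),       # integers 20..60
    "model__max_depth": [None, 4, 6, 8],
    "model__min_samples_split": randint(2, 11),   # integers 2..10
    "model__max_features": ["sqrt", "log2"],
}

demo_random_search = RandomizedSearchCV(
    estimator=demo_rf_pipeline,
    param_distributions=demo_dist,
    n_iter=5,  # kept small so the sketch runs quickly
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=2010),
    random_state=42,
)
demo_random_search.fit(X_demo, y_demo)
print(demo_random_search.best_params_)
```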


## Best Tuned Random Forest

In [51]:
random_search.best_params_


{'model__n_estimators': 200,
 'model__min_samples_split': 5,
 'model__max_features': 'sqrt',
 'model__max_depth': None}

In [53]:
random_search.best_score_

np.float64(1.0)

# Nested CV (Conceptual Overview)

📌 When tuning influences model choice, use nested CV:

- Outer CV → performance estimation

- Inner CV → hyperparameter tuning

(Not executed here due to runtime, but recommended for research-grade evaluation.)
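For reference, nested CV can be sketched by passing a search object to `cross_val_score`. This is a minimal illustration on synthetic data with a reduced grid, not a run on the churn dataset; the names `inner_cv`, `outer_cv`, and `tuned_pipeline` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # estimation

# The search is itself an estimator: each outer training fold is tuned
# independently, using only the inner splits of that fold.
tuned_pipeline = GridSearchCV(
    estimator=Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer score comes from rows never seen during tuning.
nested_scores = cross_val_score(
    tuned_pipeline, X_demo, y_demo, cv=outer_cv, scoring="roc_auc"
)
print(nested_scores.mean(), nested_scores.std())
```

The cost is roughly the number of outer folds times the cost of one full search, which is why it is often reserved for final reporting.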

# Common Tuning Mistakes

| Mistake                     | Impact           |
| --------------------------- | ---------------- |
| Preprocessing before tuning | Leakage          |
| Different CV for comparison | Bias             |
| Over-searching small data   | Overfitting      |
| Ignoring variance           | False confidence |


# Best Practices

- ✔ Always tune the pipeline, not the bare model
- ✔ Use the same CV for all candidates
- ✔ Log the best params and the CV config
- ✔ Prefer RandomizedSearchCV for large search spaces
- ✔ Use nested CV for final model claims
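One possible way to log the best params together with the CV config is to collect them into a plain dictionary and serialize it. The record layout below (`run_record` and its keys) is a hypothetical choice for this sketch, and the data is synthetic.

```python
import json

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2010)
demo_search = GridSearchCV(
    estimator=Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    param_grid={"model__C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=demo_cv,
)
demo_search.fit(X_demo, y_demo)

# Hypothetical record layout: enough to reproduce the search later.
run_record = {
    "best_params": demo_search.best_params_,
    "best_score": float(demo_search.best_score_),
    "scoring": demo_search.scoring,
    "cv": {
        "class": type(demo_cv).__name__,
        "n_splits": demo_cv.n_splits,
        "shuffle": demo_cv.shuffle,
        "random_state": demo_cv.random_state,
    },
}
record_json = json.dumps(run_record, indent=2)
print(record_json)
```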

# Key Takeaways

- Tuning is part of the modeling pipeline

- GridSearchCV is leakage-safe only when the estimator is a pipeline that contains all preprocessing

- Performance claims are only as valid as CV design

- Honest tuning protects production reliability

# Related Notebooks

09_Pipelines_and_Workflows/

├── [01_basic_pipeline.ipynb](01_basic_pipeline.ipynb)

├── [02_column_transformer_pipeline.ipynb](02_column_transformer_pipeline.ipynb)

├── [03_pipeline_with_feature_engineering.ipynb](03_pipeline_with_feature_engineering.ipynb)

├── [04_leakage_safe_cross_validation.ipynb](04_leakage_safe_cross_validation.ipynb)

├── [05_pipeline_with_model_tuning.ipynb](05_pipeline_with_model_tuning.ipynb) **← YOU ARE HERE**

├── [06_pipeline_serialization_and_inference.ipynb](06_pipeline_serialization_and_inference.ipynb)

└── [07_pipeline_monitoring_and_reusability.ipynb](07_pipeline_monitoring_and_reusability.ipynb)