# Basic ML Pipeline
Building Leakage-Safe, Reproducible End-to-End Pipelines
##  Objective

This notebook introduces basic machine learning pipelines using scikit-learn, focusing on:

- Why pipelines are mandatory (not optional)

- Safe preprocessing + modeling composition

- Preventing data leakage

- Creating reproducible, deployable ML workflows

It answers:

    How do we structure preprocessing and modeling so the system is correct, auditable, and production-ready?

##  Why Pipelines Matter

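Without pipelines:

- Preprocessing fitted on the full dataset leaks test-set information into training

- Training and test data can end up transformed inconsistently

- Models break in production when inference code skips or reorders a preprocessing step

- Results are hard to reproduce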

With pipelines:
- ✔ Leakage-free training
- ✔ One-object deployment
- ✔ Reproducibility
- ✔ Governance compliance

📌 Any model without a pipeline is incomplete.
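The leakage risk listed above can be made concrete with a small sketch. It uses synthetic data, not the churn dataset, and the names `X_demo`, `leaky_scaler`, and `safe_scaler` are illustrative only. Fitting a scaler before the split lets the test rows influence the learned mean; fitting on the training rows alone does not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 1))

X_demo_train, X_demo_test = train_test_split(
    X_demo, test_size=0.25, random_state=0
)

# Leaky: the mean and variance are computed on all rows, test rows included
leaky_scaler = StandardScaler().fit(X_demo)

# Safe: the statistics come from the training rows only
safe_scaler = StandardScaler().fit(X_demo_train)

# The gap is the amount of test-set information absorbed by the leaky scaler
leak_gap = abs(leaky_scaler.mean_[0] - safe_scaler.mean_[0])
print("mean (all rows):  ", leaky_scaler.mean_[0])
print("mean (train only):", safe_scaler.mean_[0])
print("gap:", leak_gap)
```

The gap is small here, but the same mechanism applies to imputation values and category vocabularies, and it grows with smaller datasets.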
## Imports and dataset

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder
)

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


In [7]:

# Machine-specific absolute path; adjust it to where the CSV lives locally
DATA_PATH = "D:/GitHub/Data-Science-Techniques/datasets/synthetic_customer_churn_classification_complete.csv"
df = pd.read_csv(DATA_PATH)

X = df.drop(columns=["churn", "customer_id"])
y = df["churn"]


# Train/Test Split (First and Mandatory)

In [41]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,
    random_state=2010
)
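As a hedged illustration of what `stratify=y` does in the split above, the sketch below uses a made-up label vector with a 20% positive rate (not the churn data) and checks that both partitions keep that rate. The names `y_demo`, `train_rate`, and `test_rate` are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives, 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,
    stratify=y_demo,     # keep the class ratio in both partitions
    random_state=2010,
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("positive rate in train:", train_rate)
print("positive rate in test: ", test_rate)
```

Without `stratify`, a small test set can end up with noticeably more or fewer positives than the training set, which makes metrics such as ROC AUC noisier.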


## Identify Feature Types

In [12]:
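# Column dtypes are schema information, not fitted statistics, so reading them from X leaks nothing
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object", "category"]).columns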
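An alternative to precomputing the column lists is scikit-learn's `make_column_selector`, which picks columns by dtype when it is called on a DataFrame. The sketch below is not part of the original notebook; the toy frame and its column names (`tenure_months`, `monthly_charge`, `plan`) are invented for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "tenure_months": [1, 12, 30],
    "monthly_charge": [29.9, 55.0, 80.5],
    "plan": pd.Categorical(["basic", "pro", "basic"]),
})

# Selectors are callables: they return matching column names for a given frame
numeric_selector = make_column_selector(dtype_include="number")
categorical_selector = make_column_selector(dtype_include=["object", "category"])

numeric_cols = numeric_selector(toy)
categorical_cols = categorical_selector(toy)
print(numeric_cols)
print(categorical_cols)
```

A selector can be passed to `ColumnTransformer` in place of a column list, in which case the columns are resolved at `fit` time.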


# Define Preprocessing Pipelines
## Numeric Pipeline

In [15]:
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])


✔ Handles missing values

✔ Scales features for linear models
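To see what the two numeric steps learn, the sketch below runs the same imputer and scaler on a four-row toy column. The data and the names `toy_train` and `toy_numeric` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One numeric column with a missing value (toy data)
toy_train = np.array([[1.0], [2.0], [np.nan], [9.0]])

toy_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

transformed = toy_numeric.fit_transform(toy_train)

# The median of the observed values (1, 2, 9) is what replaces the NaN
learned_median = toy_numeric.named_steps["imputer"].statistics_[0]
print("learned median:", learned_median)
print("transformed column:", transformed.ravel())
```

The median is used rather than the mean because it is less sensitive to outliers such as the 9.0 in this toy column.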

## Categorical Pipeline

In [20]:
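categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # handle_unknown="ignore" combined with drop="first" needs scikit-learn >= 1.0.
    # An unseen category is then encoded as all zeros, the same row as the
    # dropped first category.
    ("encoder", OneHotEncoder(
        handle_unknown="ignore",
        drop="first"
    ))
])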


✔ Imputes missing categories with the most frequent value

✔ Prevents inference-time crashes on unseen categories (`handle_unknown="ignore"`)
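The effect of `handle_unknown="ignore"` can be sketched on a toy column. The category values below are invented, and this sketch leaves out `drop="first"` so that the all-zero row for an unseen category is unambiguous.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen during training (toy values)
train_plans = np.array([["basic"], ["pro"], ["basic"]], dtype=object)

# At inference time a new category shows up
new_plans = np.array([["pro"], ["enterprise"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_plans)

# Columns follow the sorted training categories: basic, pro
encoded = encoder.transform(new_plans).toarray()
print(encoder.categories_[0])
print(encoded)
```

The first row activates the `pro` column; the unseen `enterprise` row is all zeros instead of raising an error.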

## ColumnTransformer (Feature Union)

In [23]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features)
    ]
)


This guarantees:

- Correct feature routing

- No manual joins

- Consistent transformations

# Full Modeling Pipeline

In [26]:
pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])


- ✔ Single object
- ✔ Serializable
- ✔ Deployable
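"Serializable" can be illustrated with a round trip through `joblib`. This is only a sketch on synthetic data with a reduced two-step pipeline; the file name `toy_pipeline.joblib` and the variable names are assumptions, not the notebook's deployment procedure.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_toy, y_toy)

# Save preprocessing and model together as one artifact, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored object reproduces the original predictions
same_predictions = np.allclose(
    toy_pipeline.predict_proba(X_toy),
    restored.predict_proba(X_toy),
)
print("round trip preserved predictions:", same_predictions)
```

A saved pipeline should be loaded with the same scikit-learn version that produced it, since pickled estimators are not guaranteed to be compatible across versions.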

## Train Pipeline

In [30]:
pipeline.fit(X_train, y_train)


All transformations (imputation values, scaling statistics, category vocabularies) are learned only from the training data.
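That claim can be checked by reading the fitted state back out of the pipeline. The sketch below uses synthetic data and a two-step pipeline (names such as `toy_pipeline` and `X_tr` are illustrative): the scaler's stored mean equals the training mean, and predicting on the test rows does not change it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X_toy = rng.normal(loc=10.0, size=(80, 2))
y_toy = (X_toy[:, 0] > 10.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

# fit() stored the training mean inside the scaler step
mean_after_fit = toy_pipeline.named_steps["scaler"].mean_.copy()
matches_train = np.allclose(mean_after_fit, X_tr.mean(axis=0))

# predict() only applies the stored statistics; it never refits them
toy_pipeline.predict(X_te)
unchanged = np.array_equal(
    mean_after_fit, toy_pipeline.named_steps["scaler"].mean_
)
print("mean equals training mean:", matches_train)
print("mean unchanged after predict:", unchanged)
```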

# Evaluate Pipeline

In [33]:
y_test_prob = pipeline.predict_proba(X_test)[:, 1]

roc_auc_score(y_test, y_test_prob)


np.float64(1.0)

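✔ Evaluated on held-out rows the pipeline never saw during fitting

✔ No preprocessing leakage

A ROC AUC of exactly 1.0 is unusual. The dataset here is synthetic, but a perfect score is still worth a check that no feature encodes the target directly.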
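A single split gives one estimate. Because the pipeline is one estimator, it can also be passed to `cross_val_score`, which refits every preprocessing step inside each fold. The sketch below uses `make_classification` data and a reduced pipeline as stand-ins; it is not the notebook's churn dataset, and the names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold fits the scaler and the model on that fold's training part only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(
    toy_pipeline, X_toy, y_toy, cv=cv, scoring="roc_auc"
)
print("per-fold ROC AUC:", auc_scores)
print("mean ROC AUC:", auc_scores.mean())
```

The spread across folds gives a rough sense of how much a single held-out score can vary.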

## Inspect Pipeline Structure

In [36]:
pipeline

Useful for:

- Debugging

- Governance reviews

- Model documentation
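Besides the rendered view, the structure can be read programmatically, which is handy for documentation or review checklists. This is a small sketch on a reduced pipeline; `toy_pipeline` and the step names are assumptions mirroring the ones used above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Ordered step names, as they will run at fit and predict time
step_names = [name for name, _ in toy_pipeline.steps]

# Nested parameters use the "<step>__<parameter>" naming convention
params = toy_pipeline.get_params()
model_max_iter = params["model__max_iter"]

print(step_names)
print("model__max_iter:", model_max_iter)
print("model__class_weight:", params["model__class_weight"])
```

The same `<step>__<parameter>` keys are what hyperparameter search tools expect when tuning a pipeline.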

## What This Pipeline Guarantees

| Risk                        | Mitigated |
| --------------------------- | --------- |
| Data leakage                | ✔         |
| Train/test mismatch         | ✔         |
| Manual preprocessing errors | ✔         |
| Production drift bugs       | ✔         |




## Common Mistakes (Avoided)

- ❌ Scaling before split
- ❌ Encoding full dataset
- ❌ Manual feature engineering outside pipeline
- ❌ Separate training and inference logic
- ❌ Hard-coding feature order

## When This Pipeline Is Enough


- ✔ Baseline models

- ✔ Linear / tree models

- ✔ Small-to-medium datasets

- ✔ Clear feature schema


Later notebooks will extend this.

## Key Takeaways

- Pipelines are non-negotiable

- Preprocessing belongs inside the pipeline

- ColumnTransformer is the backbone

- This structure is deployment-safe

- Every future notebook builds on this

# Related Notebooks

[09_Pipelines_and_Workflows/]()

├── [01_basic_pipeline.ipynb](01_basic_pipeline.ipynb)

├── [02_column_transformer_pipeline.ipynb](02_column_transformer_pipeline.ipynb)

├── [03_pipeline_with_feature_engineering.ipynb](03_pipeline_with_feature_engineering.ipynb)

├── [02_leakage_safe_cross_validation.ipynb](02_leakage_safe_cross_validation.ipynb)

├── [04_pipeline_with_model_tuning.ipynb](04_pipeline_with_model_tuning.ipynb)

├── [05_pipeline_serialization_and_inference.ipynb](05_pipeline_serialization_and_inference.ipynb)

├── [03_pipeline_monitoring_and_reusability.ipynb](03_pipeline_monitoring_and_reusability.ipynb)