# Pipeline Serialization and Inference

    Persisting, Loading, and Using ML Pipelines in Production
## Objective

This notebook demonstrates how to:

- Serialize full ML pipelines safely

- Reload pipelines for inference

- Perform single and batch predictions

- Ensure schema consistency

- Avoid common production failures

It answers:

    How do we move from a trained model → a usable production artifact?

## Why Serialization Matters

Without serialization:

- Models cannot be reused

- Training must be repeated

- Deployment is impossible

- Results are not reproducible

With serialization:

- ✔ Reproducibility
- ✔ Deployment readiness
- ✔ One-object model artifact
- ✔ Stable inference

📌 The pipeline, not just the model, must be saved.
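To illustrate why the whole pipeline has to be saved, here is a minimal sketch on synthetic data. The data, the names `demo_pipeline` and `restored_pipeline`, and the in-memory buffer are assumptions for the example and are not part of this notebook's churn workflow. It round-trips a scaler-plus-model pipeline through `joblib` and then compares it with the bare model fed unscaled inputs.

```python
import io

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only): a single feature on a large raw scale.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50_000, scale=10_000, size=(200, 1))
y_demo = (X_demo[:, 0] > 50_000).astype(int)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_demo, y_demo)

# Round-trip the whole pipeline through an in-memory buffer instead of a file.
buffer = io.BytesIO()
joblib.dump(demo_pipeline, buffer)
buffer.seek(0)
restored_pipeline = joblib.load(buffer)

# The restored pipeline scales raw inputs itself, so it reproduces the original.
pipeline_preds = restored_pipeline.predict(X_demo)
same_as_original = np.array_equal(pipeline_preds, demo_pipeline.predict(X_demo))

# The bare model was fitted on scaled values; raw inputs bypass the scaler.
bare_model = restored_pipeline.named_steps["model"]
bare_preds = bare_model.predict(X_demo)

pipeline_accuracy = float((pipeline_preds == y_demo).mean())
bare_accuracy = float((bare_preds == y_demo).mean())
print(same_as_original, pipeline_accuracy, bare_accuracy)
```

The bare model loads and predicts without raising any error, which is why this failure is easy to miss: it simply scores worse than the pipeline on the same rows.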

## Imports

In [4]:
import numpy as np
import pandas as pd
import joblib

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder
)

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression


In [6]:

DATA_PATH = "D:/GitHub/Data-Science-Techniques/datasets/synthetic_customer_churn_classification_complete.csv"
df = pd.read_csv(DATA_PATH)

X = df.drop(columns=["churn", "customer_id"])
y = df["churn"]


## Build Final Training Pipeline

In [9]:
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object", "category"]).columns

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

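categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Note: with drop="first", a category unseen during training is encoded as
    # all zeros, the same encoding as the dropped first category.
    # Older scikit-learn releases (before 1.0) reject this combination.
    ("encoder", OneHotEncoder(handle_unknown="ignore", drop="first"))
])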

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features)
    ]
)

pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])


## Train Final Model

In [12]:
pipeline.fit(X, y)

Train on the full dataset only once the model design is finalized.

## Serialize Pipeline

In [15]:
joblib.dump(pipeline, "churn_pipeline.joblib")

['churn_pipeline.joblib']

✔ Saves:

- Imputers

- Encoders

- Scalers

- Model weights

- Feature schema

📌 This is the complete production artifact.

## Load Serialized Pipeline

In [18]:
loaded_pipeline = joblib.load("churn_pipeline.joblib")

✔ Ready for inference immediately

✔ No retraining required

## Single Prediction (Production Scenario)

In [22]:
sample = X.iloc[[0]]

loaded_pipeline.predict(sample)


array([0])

In [24]:
loaded_pipeline.predict_proba(sample)

array([[9.99931664e-01, 6.83360438e-05]])
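In production, a single record usually arrives as a dictionary (for example a parsed JSON payload) rather than as a row of the training frame. The following is a minimal sketch of that conversion. The helper `record_to_frame`, the shortened feature list, and the payload values are all hypothetical. With the real pipeline, the column order would come from `loaded_pipeline.feature_names_in_`.

```python
import pandas as pd


def record_to_frame(record, feature_order):
    """Turn one dict-like record into a one-row DataFrame in training column order."""
    # reindex drops keys the model does not know and fills absent ones with NaN.
    return pd.DataFrame([record]).reindex(columns=feature_order)


# Illustrative subset of the feature names used in this notebook.
feature_order = ["age", "income", "region"]

# Hypothetical payload: keys in a different order, one extra key, one missing key.
payload = {"region": "north", "age": 42, "request_id": "abc-123"}

single_row = record_to_frame(payload, feature_order)
print(single_row)
```

The resulting frame can be passed to `loaded_pipeline.predict(single_row)`. Filling an absent key with `NaN` hands the decision to the pipeline's imputers. A stricter service might prefer to reject such a record instead.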

## Batch Inference

In [27]:
batch_predictions = loaded_pipeline.predict(X.head(10))
batch_probabilities = loaded_pipeline.predict_proba(X.head(10))[:, 1]

batch_predictions, batch_probabilities


(array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1]),
 array([6.83360438e-05, 2.93515811e-06, 7.32874514e-05, 4.21949547e-04,
        9.99968587e-01, 2.83586413e-04, 9.99972882e-01, 1.06017406e-04,
        4.55202577e-04, 9.99973529e-01]))
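For a binary logistic regression, the labels from `predict` correspond to a 0.5 cut-off on the positive-class probability. The sketch below reuses the probabilities printed above and applies the cut-off by hand. `apply_threshold` and the lower threshold are hypothetical, and show how a different operating point could be chosen without retraining.

```python
import numpy as np

# Churn probabilities copied from the batch output above.
batch_probabilities = np.array([
    6.83360438e-05, 2.93515811e-06, 7.32874514e-05, 4.21949547e-04,
    9.99968587e-01, 2.83586413e-04, 9.99972882e-01, 1.06017406e-04,
    4.55202577e-04, 9.99973529e-01,
])


def apply_threshold(probabilities, threshold=0.5):
    """Label a row as churn (1) when its churn probability exceeds the threshold."""
    return (probabilities > threshold).astype(int)


default_labels = apply_threshold(batch_probabilities)
# A much more lenient, purely illustrative threshold that flags more customers.
lenient_labels = apply_threshold(batch_probabilities, threshold=0.0004)

print(default_labels)  # [0 0 0 0 1 0 1 0 0 1]
print(lenient_labels)  # [0 0 0 1 1 0 1 0 1 1]
```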

## Schema Safety Check

In [30]:
list(loaded_pipeline.feature_names_in_)

['age',
 'income',
 'tenure_years',
 'avg_monthly_usage',
 'support_tickets_last_year',
 'satisfaction_level',
 'customer_segment',
 'region',
 'future_retention_offer']

📌 Ensure incoming production data has:

- Same column names

- Same data types

- Same feature order
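Column names can be compared against `feature_names_in_`, but the fitted pipeline does not expose the training dtypes in the same way. One option is to record them at training time and compare them at inference time. The following is a sketch under that assumption. `find_dtype_mismatches` and the two small frames are hypothetical stand-ins for the real training and production data.

```python
import pandas as pd


def find_dtype_mismatches(input_df, expected_dtypes):
    """Return {column: (expected, actual)} for columns whose dtype differs."""
    mismatches = {}
    for column, expected in expected_dtypes.items():
        if column not in input_df.columns:
            continue  # missing columns are handled by a separate check
        actual = str(input_df[column].dtype)
        if actual != expected:
            mismatches[column] = (expected, actual)
    return mismatches


# Stand-in for the training frame; dtypes are recorded as plain strings.
training_frame = pd.DataFrame({"age": [34, 51], "region": ["north", "south"]})
expected_dtypes = {col: str(dtype) for col, dtype in training_frame.dtypes.items()}

# Stand-in for production data where "age" arrived as text.
incoming = pd.DataFrame({"age": ["34", "51"], "region": ["north", "south"]})

dtype_mismatches = find_dtype_mismatches(incoming, expected_dtypes)
print(dtype_mismatches)
```

Storing the dtypes as strings keeps the recorded schema easy to serialize, for example to JSON next to the model artifact.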

## Handling Missing or Extra Columns

In [33]:
missing_cols = set(loaded_pipeline.feature_names_in_) - set(X.columns)
extra_cols = set(X.columns) - set(loaded_pipeline.feature_names_in_)

missing_cols, extra_cols


(set(), set())

Both sets are empty here because `X` is the training frame itself. In production systems, run the same check on incoming data before inference.

## Production Inference Function

In [36]:
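def predict_churn(model, input_df):
    """
    Safe production prediction wrapper.

    Raises ValueError if a column the model was trained on is missing.
    """
    # Schema validation: fail early with a clear message on missing columns
    required_cols = list(model.feature_names_in_)
    missing_cols = [col for col in required_cols if col not in input_df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    # Drop extra columns and restore the training column order.
    # Selecting by a list returns a new DataFrame, so the caller's data is untouched.
    input_df = input_df[required_cols]

    preds = model.predict(input_df)
    probs = model.predict_proba(input_df)[:, 1]

    return pd.DataFrame({
        "prediction": preds,
        "probability": probs
    })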


In [38]:
predict_churn(loaded_pipeline, X.head(5))

|     | prediction | probability |
| --- | ---------- | ----------- |
| 0   | 0          | 6.8e-05     |
| 1   | 0          | 3e-06       |
| 2   | 0          | 7.3e-05     |
| 3   | 0          | 0.000422    |
| 4   | 1          | 0.999969    |


## Common Serialization Pitfalls

| Problem                   | Impact                |
| ------------------------- | --------------------- |
| Saving model only         | Missing preprocessing |
| Different sklearn version | Load failure          |
| Feature mismatch          | Wrong predictions     |
| Encoding drift            | Silent bugs           |
| No schema validation      | Production crashes    |
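The version pitfall in the table can be caught before loading by recording library versions at training time and comparing them in the serving environment. The following is a minimal sketch. `find_version_mismatches` and the version numbers are hypothetical. In a real setup, the current versions could be read with `importlib.metadata.version("scikit-learn")`.

```python
def find_version_mismatches(recorded_versions, current_versions):
    """Return {package: (recorded, current)} for every package whose version differs."""
    mismatches = {}
    for package, recorded in recorded_versions.items():
        current = current_versions.get(package)  # None if the package is absent
        if current != recorded:
            mismatches[package] = (recorded, current)
    return mismatches


# Hypothetical version numbers, for illustration only.
recorded_at_training = {"scikit-learn": "1.4.2", "joblib": "1.4.0"}
serving_environment = {"scikit-learn": "1.5.0", "joblib": "1.4.0"}

version_mismatches = find_version_mismatches(recorded_at_training, serving_environment)
print(version_mismatches)  # {'scikit-learn': ('1.4.2', '1.5.0')}
```

Whether a mismatch should block loading or only log a warning is a policy choice. This sketch only reports the differences.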


## Best Practices

- ✔ Always save **full pipeline**
- ✔ Version control model artifacts
- ✔ Log training schema
- ✔ Validate schema before inference
- ✔ Use deterministic preprocessing
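One way to combine artifact versioning with schema logging is a small JSON manifest stored next to the serialized file. The sketch below uses only the standard library. The function names, the manifest layout, and the stand-in artifact bytes are assumptions for illustration, not a convention of `joblib` or scikit-learn.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_manifest(artifact_path, feature_names, manifest_path):
    """Write a JSON manifest with the artifact's SHA-256 digest and training schema."""
    digest = hashlib.sha256(Path(artifact_path).read_bytes()).hexdigest()
    manifest = {
        "artifact": Path(artifact_path).name,
        "sha256": digest,
        "feature_names": list(feature_names),
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def artifact_matches_manifest(artifact_path, manifest_path):
    """Check that the artifact on disk is the one the manifest was written for."""
    manifest = json.loads(Path(manifest_path).read_text())
    digest = hashlib.sha256(Path(artifact_path).read_bytes()).hexdigest()
    return digest == manifest["sha256"]


# Demo with a stand-in file; a real run would point at the saved pipeline.
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact = Path(tmp_dir) / "churn_pipeline.joblib"
    artifact.write_bytes(b"stand-in bytes, not a real pipeline")
    manifest_file = Path(tmp_dir) / "churn_pipeline.manifest.json"

    manifest = write_manifest(artifact, ["age", "income", "region"], manifest_file)
    intact = artifact_matches_manifest(artifact, manifest_file)

    # Overwriting the artifact changes its digest, so the check now fails.
    artifact.write_bytes(b"modified bytes")
    tampered = artifact_matches_manifest(artifact, manifest_file)

print(intact, tampered)  # True False
```

Because the manifest is plain text, it can be committed to version control even when the binary artifact itself lives elsewhere.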

## Key Takeaways

- Serialization completes the ML lifecycle

- The pipeline is the deployable unit

- Schema validation is mandatory

- Inference must mirror training exactly

- This notebook bridges **ML → Production**

## Related Notebooks

[09_Pipelines_and_Workflows/]()

├── [01_basic_pipeline.ipynb](01_basic_pipeline.ipynb)

├── [02_column_transformer_pipeline.ipynb](02_column_transformer_pipeline.ipynb)

├── [03_pipeline_with_feature_engineering.ipynb](03_pipeline_with_feature_engineering.ipynb)

├── [04_leakage_safe_cross_validation.ipynb](04_leakage_safe_cross_validation.ipynb)

├── [05_pipeline_with_model_tuning.ipynb](05_pipeline_with_model_tuning.ipynb)

├── [06_pipeline_serialization_and_inference.ipynb](06_pipeline_serialization_and_inference.ipynb) **← YOU ARE HERE**

├── [07_pipeline_monitoring_and_reusability.ipynb](07_pipeline_monitoring_and_reusability.ipynb)