# Pickle and Joblib Model Serialization
## Objective

Demonstrate reliable model serialization and deserialization using `joblib` and `pickle`, emphasizing production risks, reproducibility, and environment consistency.

This notebook explicitly frames serialization as a deployment contract, not a convenience function.

## Why Model Serialization Matters in Production
### Concepts Covered

- Difference between training-time objects and production artifacts
- Why “saving a model” is insufficient without:
    - Versioned dependencies
    - Feature consistency
    - Metadata tracking

### Key Risks

- Environment mismatch (Python / library versions)
- Broken pipelines after refactoring
- Silent inference failures
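Silent inference failures often come from inputs that no longer match the training-time feature layout. A minimal sketch of a guard against that, assuming features arrive as a dictionary; the helper name `check_features` and the feature list are hypothetical and only illustrate the idea:

```python
def check_features(expected, row):
    """Raise ValueError if `row` does not carry exactly the expected features."""
    missing = [name for name in expected if name not in row]
    unexpected = [name for name in row if name not in expected]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    # Return values in training order so the model sees the same column layout.
    return [row[name] for name in expected]


# Hypothetical feature list, for illustration only.
EXPECTED = ["age", "income", "region"]
ordered = check_features(EXPECTED, {"region": "north", "age": 41, "income": 52000})
print(ordered)
```

Failing loudly on a missing or extra column is usually preferable to letting a model score a misaligned row.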
## Serialization Strategies Overview
### Comparison Table
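| Method      | Use Case                     | Pros                                  | Cons                                        |
| ----------- | ---------------------------- | ------------------------------------- | ------------------------------------------- |
| pickle      | Python-only                  | Built in, handles most Python objects | Unsafe with untrusted input, fragile        |
| joblib      | Models with large NumPy data | Efficient array storage, compression  | Python-bound, same safety caveats as pickle |
| ONNX / PMML | Cross-platform               | Portable                              | Limited operators                           |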


> This notebook focuses on pickle and joblib only.
> ONNX and PMML are covered in a separate notebook.

## Baseline Example: Training a Simple Model
### Steps

- Load a dataset (e.g., from `sklearn.datasets`)
- Train a model with preprocessing
- Wrap preprocessing and model in a `Pipeline`

### Best Practice Emphasized

**Always serialize the full pipeline**, not just the estimator.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
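
The cell above only imports the building blocks. A minimal sketch of the three steps, assuming the breast cancer dataset from `sklearn.datasets` as a stand-in (the notebook does not name a dataset) and a hypothetical 80/20 split, could look like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: this dataset stands in for the notebook's data.
# as_frame=True returns pandas objects, so X_test.iloc works later on.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and estimator live in one object, so both get serialized.
pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
pipeline.fit(X_train, y_train)

print(f"Test accuracy: {pipeline.score(X_test, y_test):.3f}")
```

The names `pipeline` and `X_test` match the ones the later cells rely on.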


## Serialization with pickle
### Saving the Model

In [None]:
import pickle

with open("model_pickle.pkl", "wb") as f:
    pickle.dump(pipeline, f)

### Loading the Model

In [None]:
with open("model_pickle.pkl", "rb") as f:
    loaded_model = pickle.load(f)

### Critical Warnings (Explicit Section)

- ❌ Not secure against untrusted sources
- ❌ Sensitive to Python and library version changes
- ❌ Breaks easily with code refactoring, because classes are stored by import path
> Rule: Never load pickle files from unknown origins.
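
To make the warnings concrete, here is a small, harmless sketch using only the standard library. The `Payload` class is hypothetical. It shows that loading a pickle can call an arbitrary function, and that pickle records a class by its import path rather than by its code:

```python
import collections
import pickle


class Payload:
    """Harmless stand-in for a malicious object."""

    def __reduce__(self):
        # pickle will call eval("40 + 2") when the bytes are loaded.
        return (eval, ("40 + 2",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # eval runs here, during loading
print(type(result).__name__, result)

# Classes are stored by import path, not by value, which is why moving or
# renaming a class breaks previously saved files.
ref_blob = pickle.dumps(collections.OrderedDict(a=1))
print(b"collections" in ref_blob, b"OrderedDict" in ref_blob)
```

A real attacker would substitute something far less benign for the arithmetic expression, which is the reason behind the rule above.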

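## Serialization with joblib (Recommended)
### Why joblib Is Preferred

- Optimized for objects that carry large NumPy arrays
- Efficient disk I/O for those arrays, with optional compression
- Supports memory-mapped loading (`mmap_mode`), which can reduce memory use

> joblib is built on pickle, so the same rule applies: never load joblib files from unknown origins.

### Saving the Model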

In [None]:
import joblib

joblib.dump(pipeline, "model_joblib.joblib")

### Loading the Model

In [None]:
loaded_model = joblib.load("model_joblib.joblib")
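
As a rough illustration of joblib's handling of NumPy-heavy objects, the sketch below saves the same object with and without the `compress` option of `joblib.dump`. The dictionary is a stand-in for a fitted model, and the file names are hypothetical:

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a fitted model: a dict holding a large, compressible array.
artifact = {"weights": np.zeros((1000, 100)), "label": "demo"}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.joblib")
    small_path = os.path.join(tmp, "compressed.joblib")

    joblib.dump(artifact, raw_path)
    joblib.dump(artifact, small_path, compress=3)  # integer level, 0 to 9

    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)
    restored = joblib.load(small_path)

print(f"raw={raw_size} bytes, compressed={small_size} bytes")
```

Compression trades some save and load time for a smaller file, so whether it helps depends on the model and the storage constraints.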

## Validation After Loading (Mandatory Step)
### Why This Matters

Serialization success ≠ inference correctness.

### Example

In [None]:
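import numpy as np

# Assumes X_test is the held-out DataFrame from the training split
sample = X_test.iloc[:5]
original_preds = pipeline.predict(sample)
loaded_preds = loaded_model.predict(sample)

# Fails loudly if any prediction differs after the round trip
np.testing.assert_array_equal(original_preds, loaded_preds)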

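This confirms that the loaded pipeline reproduces the original predictions on the sample. It does not prove byte-level equivalence of the artifact, so use a sample large enough to be representative.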
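
Hard labels can agree while the underlying probabilities differ slightly, so a stricter check may also compare `predict_proba`. The sketch below is one way to do that, assuming a classifier that exposes `predict_proba`. The helper name `assert_same_behaviour`, the synthetic data, and the tolerance are illustrative choices:

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def assert_same_behaviour(original, loaded, sample):
    """Compare hard predictions exactly and probabilities within tolerance."""
    np.testing.assert_array_equal(original.predict(sample), loaded.predict(sample))
    np.testing.assert_allclose(
        original.predict_proba(sample), loaded.predict_proba(sample), rtol=1e-9
    )


# Synthetic data keeps the sketch self-contained.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# In-memory round trip; a file-based joblib round trip is checked the same way.
reloaded = pickle.loads(pickle.dumps(model))

assert_same_behaviour(model, reloaded, X[:5])
print("round trip preserved predictions and probabilities")
```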

## Dependency and Environment Management
### Concepts

- Serialization does not include dependencies

- Model artifacts must ship with:

    - requirements.txt or environment.yml

    - Python version

    - OS (optional but recommended)

### Example: requirements.txt

    # Python 3.11 (pip cannot pin the interpreter; record it separately)
    scikit-learn==1.4.0
    numpy==1.26.3
    joblib==1.3.2
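
Alongside the pinned packages, the interpreter version and OS can be captured at save time. This is a minimal sketch using only the standard library. The function name `capture_environment` and the returned keys are assumptions made for illustration:

```python
import platform
from importlib import metadata


def capture_environment(packages):
    """Return interpreter, OS, and installed versions for the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


env = capture_environment(["scikit-learn", "numpy", "joblib"])
print(env)
```

Storing this dictionary next to the model makes an environment mismatch visible before it turns into a loading error.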


## Metadata and Artifact Structure (Best Practice)
### Recommended Structure

    artifacts/
    │
    ├── model.joblib
    ├── metadata.json
    ├── requirements.txt
    └── training_config.yaml

### Example Metadata Fields

In [None]:
{
  "model_name": "logistic_regression_v1",
  "training_date": "2026-02-01",
  "sklearn_version": "1.4.0",
  "features": ["age", "income", "region"]
}
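
One way to put fields like `sklearn_version` to work is to compare them with the serving environment before loading the model. The sketch below assumes the metadata layout shown above. The helper names are hypothetical, and the installed versions are passed in as a plain dictionary so the example stays self-contained:

```python
import json
import tempfile
from pathlib import Path


def write_metadata(directory, record):
    """Write the record to metadata.json inside the artifact directory."""
    path = Path(directory) / "metadata.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def version_mismatches(meta_path, installed):
    """Return {package: (recorded, installed)} for every version that differs."""
    record = json.loads(Path(meta_path).read_text())
    recorded = {"scikit-learn": record["sklearn_version"]}
    return {
        name: (want, installed.get(name))
        for name, want in recorded.items()
        if installed.get(name) != want
    }


with tempfile.TemporaryDirectory() as tmp:
    meta = write_metadata(
        tmp, {"model_name": "logistic_regression_v1", "sklearn_version": "1.4.0"}
    )
    ok = version_mismatches(meta, {"scikit-learn": "1.4.0"})
    drift = version_mismatches(meta, {"scikit-learn": "1.5.0"})

print(ok, drift)
```

In practice the `installed` dictionary would come from the running environment, for example through `importlib.metadata.version`.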

## What NOT to Do (Anti-Patterns)

- ❌ Serialize only `model.coef_` instead of the fitted pipeline
- ❌ Apply feature engineering by hand outside the pipeline, so the saved artifact cannot reproduce it
- ❌ Rely on implicit global variables
- ❌ Overwrite artifacts without versioning
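
The last anti-pattern can be avoided by writing each artifact set into a fresh, numbered directory. This is a sketch under assumed conventions: the `v001`, `v002` naming scheme and the helper `next_version_dir` are illustrative, not an established standard:

```python
import tempfile
from pathlib import Path


def next_version_dir(root):
    """Create and return the next unused directory: v001, v002, ..."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    existing = [
        int(p.name[1:])
        for p in root.iterdir()
        if p.is_dir() and p.name.startswith("v") and p.name[1:].isdigit()
    ]
    target = root / f"v{max(existing, default=0) + 1:03d}"
    target.mkdir()  # raises FileExistsError rather than overwriting
    return target


with tempfile.TemporaryDirectory() as tmp:
    first = next_version_dir(Path(tmp) / "artifacts")
    second = next_version_dir(Path(tmp) / "artifacts")
    names = (first.name, second.name)

print(names)
```

Each training run then saves `model.joblib` and `metadata.json` into its own directory, and earlier versions stay available for rollback.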

## When Pickle / Joblib Is NOT Enough

Transition to:

- ONNX → multi-language inference

- PMML → enterprise integration

- Model-serving tooling (e.g., BentoML, MLflow, or a FastAPI service)

This naturally leads to `02_onnx_and_pmml_export.ipynb`.

## Key Takeaways

- Serialization is a production boundary

- joblib is generally preferred over pickle for models that hold large NumPy arrays

- Pipelines must be serialized end-to-end

- Validation and metadata are non-negotiable

- Environment reproducibility is part of deployment

## Suggested Exercises (Optional)

- Break a loaded model by changing sklearn versions

- Serialize a model without preprocessing and compare predictions

- Add metadata tracking and version bump