# Data and Feature Versioning
## Objective

Demonstrate how to version datasets and feature definitions so that any trained or deployed model can be:

- Reproduced exactly

- Audited reliably

- Rolled back safely

> A model is reproducible only if its data and features are reproducible.

## Why Data and Feature Versioning Matters
### Common Failure Modes

- Same code + different data = different model

- Feature logic changes without retraining

- Training data overwritten or lost

- Inconsistent offline vs online features

### Core Principle

> Models do not fail alone — pipelines fail.

## What Needs to Be Versioned

| Asset                    | Why                      |
| ------------------------ | ------------------------ |
| Raw data snapshot        | Ground truth reference   |
| Cleaned / processed data | Reproducible training    |
| Feature definitions      | Input consistency        |
| Feature values           | Offline vs online parity |
| Data schema              | Prevent silent breaks    |


## Data Versioning Strategies
### Snapshot-Based Versioning (Foundational)

- Immutable snapshots

- Timestamped or hash-based

- Stored in object storage or data lake


        data/
        ├── raw/
        │   └── customers_2026_02_01.parquet
        ├── processed/
        │   └── customers_clean_v1.0.parquet
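The snapshot rules above (immutable, date-stamped files) can be sketched as a small helper. This is a minimal illustration, not a prescribed implementation: the function name `write_snapshot` and its parameters are hypothetical, and it assumes a local filesystem rather than object storage.

```python
import shutil
from datetime import date
from pathlib import Path


def write_snapshot(source: Path, snapshot_dir: Path, name: str, as_of: date) -> Path:
    """Copy `source` to a date-stamped snapshot file and refuse to overwrite it."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    # Produces names such as customers_2026_02_01.parquet
    target = snapshot_dir / f"{name}_{as_of:%Y_%m_%d}{source.suffix}"
    if target.exists():
        # Immutability: an existing snapshot is never replaced.
        raise FileExistsError(f"snapshot already exists: {target}")
    shutil.copy2(source, target)
    return target
```

Raising on an existing target, rather than silently overwriting, is what keeps a snapshot name tied to one fixed set of bytes.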


## Hash-Based Identification

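```python
import hashlib

def file_hash(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```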

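A content hash identifies the exact bytes used in training: any change to the file produces a different hash. MD5 is adequate for detecting accidental changes, but it is not collision-resistant, so prefer SHA-256 where tamper resistance matters.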
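A recorded hash is only useful if it is checked before training. The sketch below shows one way to do that; `verify_data_hash` is a hypothetical helper, and its `algorithm` parameter defaults to SHA-256 here (pass `"md5"` to check digests produced by `file_hash` above).

```python
import hashlib
from pathlib import Path


def verify_data_hash(path: Path, expected_hash: str, algorithm: str = "sha256") -> None:
    """Raise ValueError if the file's digest differs from the recorded one."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hash:
        raise ValueError(f"{path}: expected {expected_hash}, got {actual}")
```

Calling this at the start of a training job turns a silently changed dataset into an immediate, explicit failure.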

## DVC (Data Version Control)

- Git-like versioning for large datasets
- Tracks data, pipelines, and models
- Supports remote storage on S3, GCS, and Azure Blob Storage

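```bash
dvc add data/processed/customers_clean.parquet
# dvc add writes a small .dvc pointer file and a .gitignore entry; Git tracks those, not the data
git add data/processed/customers_clean.parquet.dvc data/processed/.gitignore
git commit -m "Add processed customer dataset v1.0"
```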

# Feature Versioning
## Why Feature Versioning Is Separate

- Feature logic changes more frequently than models
- Same model with new features ≠ same behavior

## Feature Definitions as Code

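```python
def compute_age_feature(df):
    # Approximate age in whole years; both columns must be datetime64.
    # Dividing by 365 ignores leap days, so results can be off by one near birthdays.
    return (df["reference_date"] - df["birth_date"]).dt.days // 365
```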

- Version feature code in Git

- Tag feature releases
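Git tags version the code as a whole; inside the code it can also help to make each feature's version explicit, so that an old definition remains callable after a new one is added. The registry below is a hypothetical sketch (`FEATURE_REGISTRY`, `feature`, and `get_feature` are illustrative names, not part of any library), and it works on plain integers rather than DataFrames to stay self-contained.

```python
from typing import Callable, Dict, Tuple

# Hypothetical in-process registry: (feature name, version) -> function.
FEATURE_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def feature(name: str, version: str):
    """Register a feature function under an explicit name and version."""
    def decorator(fn: Callable) -> Callable:
        key = (name, version)
        if key in FEATURE_REGISTRY:
            raise ValueError(f"{name} {version} is already registered; bump the version")
        FEATURE_REGISTRY[key] = fn
        return fn
    return decorator


@feature("age", "1.0.0")
def age_v1(days_since_birth: int) -> int:
    # Original definition: whole years, ignoring leap days.
    return days_since_birth // 365


@feature("age", "1.1.0")
def age_v1_1(days_since_birth: int) -> int:
    # Revised definition: average year length including leap days.
    return int(days_since_birth / 365.25)


def get_feature(name: str, version: str) -> Callable:
    """Look up the exact feature implementation a model was trained with."""
    return FEATURE_REGISTRY[(name, version)]
```

For the same input the two versions can disagree: `get_feature("age", "1.0.0")(7300)` gives 20 while version 1.1.0 gives 19, which is why a model should record the feature version it was trained with.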

## Feature Schema Versioning

```json
{
  "feature_set": "customer_features",
  "version": "2.1.0",
  "features": {
    "age": "int",
    "avg_purchase_value": "float",
    "region": "category"
  }
}
```
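One way to decide how the schema version should change is to compare the old and new `features` mappings. The sketch below assumes a semantic-versioning convention (an assumption; the notebook only shows a three-part version string), and `required_bump` is a hypothetical helper name.

```python
from typing import Dict


def required_bump(old_features: Dict[str, str], new_features: Dict[str, str]) -> str:
    """Classify a schema change as 'major', 'minor', or 'none'."""
    removed = old_features.keys() - new_features.keys()
    added = new_features.keys() - old_features.keys()
    retyped = {
        name
        for name in old_features.keys() & new_features.keys()
        if old_features[name] != new_features[name]
    }
    if removed or retyped:
        return "major"  # existing consumers may break
    if added:
        return "minor"  # backward-compatible addition
    return "none"
```

Under this convention, adding a feature to the set shown above would be a minor bump, while changing `age` from `int` to `float` would be a major one.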


## Feature Store Concepts (Introductory)

- Centralized feature definitions
- Offline and online consistency
- Time-travel support

> Full feature store implementation is beyond this notebook’s scope.
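The time-travel idea can still be illustrated without a feature store: given a feature's value history, return the value that was known at a particular moment. This is a minimal sketch with a hypothetical `value_as_of` function and integer timestamps; real feature stores implement this as a point-in-time join over tables.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def value_as_of(history: List[Tuple[int, float]], event_time: int) -> Optional[float]:
    """Return the latest feature value recorded at or before `event_time`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # no value was known yet at event_time
    return history[idx - 1][1]
```

Looking up values this way during training avoids using information that only became available after the event being predicted.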

# Linking Model ↔ Data ↔ Features
### Metadata Example

```json
{
  "model_version": "1.2.0",
  "data_snapshot_id": "customers_clean_v1.0",
  "data_hash": "a94a8fe5ccb19ba61c4c0873d391e987",
  "feature_set_version": "2.1.0"
}
```


> This linkage is mandatory for auditability.
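A record like the metadata example can be written alongside the model artifact at training time. The sketch below uses the same four fields; the class and function names (`TrainingMetadata`, `save_metadata`, `load_metadata`) are hypothetical, and a plain JSON file is assumed rather than a model registry.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingMetadata:
    model_version: str
    data_snapshot_id: str
    data_hash: str
    feature_set_version: str


def save_metadata(meta: TrainingMetadata, path: str) -> None:
    """Write the linkage record next to the model artifact as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(meta), f, indent=2, sort_keys=True)


def load_metadata(path: str) -> TrainingMetadata:
    """Read a linkage record back; unknown or missing keys raise TypeError."""
    with open(path, encoding="utf-8") as f:
        return TrainingMetadata(**json.load(f))
```

Using a frozen dataclass means the record cannot be mutated after creation, and loading fails loudly if a required field is absent.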

# Reproducible Training Pipeline
### Key Principles

- Deterministic preprocessing
- Explicit random seeds
- Immutable inputs

```python
# Single seed reused for every random operation (splits, shuffles, model initialization)
RANDOM_STATE = 2010
```
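To show what an explicit seed buys, here is a small standard-library sketch of a reproducible train/test split. `train_test_split_ids` is a hypothetical helper written for illustration; a library splitter that accepts a seed would serve the same purpose.

```python
import random
from typing import List, Sequence, Tuple

RANDOM_STATE = 2010


def train_test_split_ids(
    ids: Sequence[int], test_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Split ids reproducibly: the same ids and seed always give the same split."""
    ordered = sorted(ids)        # remove dependence on input order
    rng = random.Random(seed)    # local generator, no shared global state
    rng.shuffle(ordered)
    n_test = int(len(ordered) * test_fraction)
    return ordered[n_test:], ordered[:n_test]
```

Sorting before shuffling covers the "deterministic preprocessing" principle: the split depends only on the set of ids and the seed, not on the order in which rows happened to arrive.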

# Offline vs Online Feature Consistency
## The Training–Serving Skew Problem

| Risk                    | Impact            |
| ----------------------- | ----------------- |
| Different feature logic | Wrong predictions |
| Time leakage            | Inflated metrics  |
| Missing features        | Inference failure |


## Best Practices

- Share feature code
- Validate schemas
- Monitor feature distributions
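Monitoring feature distributions can start with a single drift score per feature. The sketch below computes the Population Stability Index (PSI) between a training sample and a serving sample; the equal-width binning, the `1e-6` floor for empty bins, and the function name are choices made for this illustration, and any alert threshold would need to be tuned per application.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training sample (`expected`) and a serving sample (`actual`)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range serving values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Small floor keeps log() defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Identical samples score exactly zero, and the score grows as the serving distribution moves away from the training one.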

# Schema Validation (Production Guardrail)

```python
from pydantic import BaseModel

class FeatureSchema(BaseModel):
    age: int
    avg_purchase_value: float
    region: str
```

> Reject inputs that violate schema contracts.
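Rejecting an input in practice means catching pydantic's `ValidationError` at the service boundary. The sketch below repeats the schema so it runs on its own; `parse_features` is a hypothetical wrapper, and printing stands in for whatever logging the service uses.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class FeatureSchema(BaseModel):
    age: int
    avg_purchase_value: float
    region: str


def parse_features(payload: dict) -> Optional[FeatureSchema]:
    """Return a validated record, or None if the payload violates the schema."""
    try:
        return FeatureSchema(**payload)
    except ValidationError as exc:
        # In a real service this would be logged and the request rejected.
        print(f"rejected payload: {len(exc.errors())} schema violation(s)")
        return None
```

Note that pydantic's default mode coerces some compatible values (for example the string `"34"` to the integer `34`); its strict types can be used when such coercion is not wanted.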

## Anti-Patterns to Avoid

- ❌ Using mutable datasets for training
- ❌ Recomputing features differently in production
- ❌ Not tracking data lineage
- ❌ Keeping feature logic only inside notebooks

## Key Takeaways

- Data and feature versioning are non-negotiable for production ML

- Snapshots + hashes enable reproducibility

- Feature logic must be versioned independently

- Models should reference exact data and feature versions

### Transition Forward

➡ 04_production_workflows/

- Batch inference pipelines

- Real-time inference systems