# Data Drift Detection
## Objective

Demonstrate how to detect, quantify, and monitor data drift between training (reference) data and production (current) data in deployed machine learning systems.

The notebook emphasizes:

- Statistical rigor

- Feature-level visibility

- Operational decision-making (alerts, retraining triggers)

## Why Data Drift Detection Is Critical
### What Happens Without Monitoring

- Model accuracy degrades silently

- Business decisions drift from reality

- Retraining happens too late or blindly

### Types of Drift (Contextualized)

| Type               | Description                        | Detectable Here  |
| ------------------ | ---------------------------------- | ---------------- |
| Covariate Drift    | Input feature distribution changes | ✅                |
| Concept Drift      | Relationship X → y changes         | ❌ (needs labels) |
| Data Quality Drift | Missing, invalid, corrupted values | ✅                |


This notebook focuses on covariate and data quality drift.

## Reference vs Current Data
### Definitions

- Reference data: training or validation dataset

- Current data: recent production batch / time window

### Operational Best Practice

- Use time-windowed snapshots (e.g., last 7 days vs training)

- Never compare against a single point in time
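
As a minimal sketch of a time-windowed snapshot, the snippet below selects the most recent 7 days from a timestamped table. The frame `events`, its column names, and the dates are hypothetical placeholders, not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical production log: one row per day, for illustration only.
events = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=30, freq="D"),
    "value": range(30),
})

# Current window = the 7 days ending at the latest observed timestamp.
window_end = events["timestamp"].max()
window_start = window_end - pd.Timedelta(days=7)
current_window = events[
    (events["timestamp"] > window_start) & (events["timestamp"] <= window_end)
]
```

The window is half-open on the left so that consecutive windows do not share rows.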

## Dataset Setup
### Steps

- Load a structured dataset

- Split into:

    - Reference dataset

    - Simulated production dataset (with injected drift)

In [3]:
import pandas as pd
import numpy as np
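
The steps above can be sketched with a synthetic dataset standing in for a real one. The column names (`age`, `income`, `region`), the sample size, and the size of each injected shift are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical structured dataset standing in for real training data.
data = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10.5, 0.4, n),
    "region": rng.choice(["north", "south", "east", "west"], n),
})

# First half is the reference; second half plays the role of production.
reference_df = data.iloc[: n // 2].reset_index(drop=True)
current_df = data.iloc[n // 2 :].reset_index(drop=True).copy()

# Covariate drift: shift the mean of `age`, skew `region` toward one level.
current_df["age"] = current_df["age"] + 8
flip = rng.random(len(current_df)) < 0.3
current_df.loc[flip, "region"] = "north"

# Data quality drift: blank out roughly 5% of `income`.
missing = rng.random(len(current_df)) < 0.05
current_df.loc[missing, "income"] = np.nan
```

Injecting drift by hand makes it possible to check that each detector fires on the feature that was actually changed.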

## Feature Classification
### Why This Matters

Different feature types require different statistical tests.

| Feature Type | Examples         | Tests            |
| ------------ | ---------------- | ---------------- |
| Numerical    | age, income      | PSI, KS-test     |
| Categorical  | country, product | PSI, Chi-squared |
| Binary       | yes/no           | PSI              |
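
One possible way to assign the feature types in the table is a dtype-and-cardinality heuristic. The helper name `classify_features` and the example frame are hypothetical, and the rule itself is an assumption rather than a standard.

```python
import pandas as pd

def classify_features(df: pd.DataFrame) -> dict:
    """Map each column to 'binary', 'numerical', or 'categorical'."""
    feature_types = {}
    for col in df.columns:
        n_unique = df[col].nunique(dropna=True)
        if n_unique <= 2:
            # Two or fewer distinct values, regardless of dtype.
            feature_types[col] = "binary"
        elif pd.api.types.is_numeric_dtype(df[col]):
            feature_types[col] = "numerical"
        else:
            feature_types[col] = "categorical"
    return feature_types

# Small illustrative frame.
example = pd.DataFrame({
    "age": [23, 35, 47, 51],
    "country": ["DE", "FR", "DE", "US"],
    "subscribed": ["yes", "no", "yes", "yes"],
})
feature_types = classify_features(example)
```

A heuristic like this labels integer-coded categories as numerical, so an explicit per-feature override is usually worth keeping alongside it.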

## Population Stability Index (PSI)
### What PSI Measures

- Distribution shift between two populations

- Widely used in regulated industries
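
For reference, PSI is usually defined over binned proportions. The symbols are chosen here for illustration: $e_i$ is the share of reference (expected) observations in bin $i$, $a_i$ is the share of current (actual) observations in that bin, and $B$ is the number of bins.

$$
\mathrm{PSI} = \sum_{i=1}^{B} \left(e_i - a_i\right) \ln\frac{e_i}{a_i}
$$

Every term is non-negative, so PSI is zero only when the two samples have identical proportions in every bin. An empty bin makes the logarithm undefined, which is why implementations typically add a small constant to each proportion.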

### PSI Interpretation

| PSI Value  | Interpretation       |
| ---------- | -------------------- |
| < 0.1      | No significant drift |
| 0.1 – 0.25 | Moderate drift       |
| > 0.25     | Severe drift         |


### PSI Implementation (Numerical)

In [6]:
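def calculate_psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    # Bin edges are derived from the reference (expected) sample.
    expected_counts, bin_edges = np.histogram(expected, bins=bins)
    # Clip so production values outside the reference range fall in the outer bins.
    actual_counts, _ = np.histogram(
        np.clip(actual, bin_edges[0], bin_edges[-1]), bins=bin_edges
    )

    # PSI compares bin proportions; density=True would return densities
    # (scaled by bin width) rather than proportions.
    expected_percents = expected_counts / expected_counts.sum() + eps
    actual_percents = actual_counts / actual_counts.sum() + eps

    psi = np.sum(
        (expected_percents - actual_percents) *
        np.log(expected_percents / actual_percents)
    )
    return psi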
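
The feature-type table also lists PSI for categorical and binary features. A minimal sketch of that variant treats each category as one bin. The function name `calculate_psi_categorical` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def calculate_psi_categorical(expected, actual, eps=1e-6):
    """PSI over category frequencies; each category acts as one bin."""
    expected = pd.Series(expected)
    actual = pd.Series(actual)
    # Union of levels, so a category seen in only one sample still counts.
    categories = sorted(set(expected.dropna()) | set(actual.dropna()))
    expected_percents = (
        expected.value_counts(normalize=True).reindex(categories, fill_value=0.0)
    )
    actual_percents = (
        actual.value_counts(normalize=True).reindex(categories, fill_value=0.0)
    )
    e = expected_percents.to_numpy() + eps
    a = actual_percents.to_numpy() + eps
    return float(np.sum((e - a) * np.log(e / a)))

# Illustrative data: the share of "north" rises from 50% to 80%.
reference_region = ["north"] * 50 + ["south"] * 50
current_region = ["north"] * 80 + ["south"] * 20
psi_same = calculate_psi_categorical(reference_region, reference_region)
psi_shifted = calculate_psi_categorical(reference_region, current_region)
```

With `eps` this small, a category that appears in only one of the two samples contributes a large term, which is usually the desired behavior for a brand-new level.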

## Statistical Tests
### Kolmogorov-Smirnov Test (Numerical)

In [None]:
from scipy.stats import ks_2samp

# `reference` and `current` are 1-D arrays holding the same numerical feature
stat, p_value = ks_2samp(reference, current)

- Sensitive to shape changes

- Assumes continuous variables; with many tied values (discrete features) the p-value becomes conservative
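
A small sketch of the test on synthetic data, comparing a reference sample against one unshifted and one mean-shifted sample. The sample sizes, the 0.5 shift, and the 0.05 significance level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=2_000)
current_same = rng.normal(loc=0.0, scale=1.0, size=2_000)
current_shifted = rng.normal(loc=0.5, scale=1.0, size=2_000)

# The statistic is the largest gap between the two empirical CDFs.
stat_same, p_same = ks_2samp(reference, current_same)
stat_shifted, p_shifted = ks_2samp(reference, current_shifted)

alpha = 0.05  # illustrative significance level
drift_same = p_same < alpha
drift_shifted = p_shifted < alpha
```

On large production batches the p-value can fall below any fixed level for shifts too small to matter, so it helps to read it alongside an effect-size measure such as the KS statistic or PSI.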

### Chi-Squared Test (Categorical)

In [None]:
from scipy.stats import chi2_contingency

- Compares frequency distributions

- Requires aligned categories
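
A minimal sketch of the category alignment step followed by the test itself. The country codes and counts are made up for illustration. Note that the current sample contains a level (`BR`) that the reference never saw.

```python
import pandas as pd
from scipy.stats import chi2_contingency

reference_cat = pd.Series(["DE"] * 500 + ["FR"] * 300 + ["US"] * 200)
current_cat = pd.Series(["DE"] * 350 + ["FR"] * 300 + ["US"] * 300 + ["BR"] * 50)

# Align categories: reindex both count vectors on the union of observed levels.
categories = sorted(set(reference_cat) | set(current_cat))
ref_counts = reference_cat.value_counts().reindex(categories, fill_value=0)
cur_counts = current_cat.value_counts().reindex(categories, fill_value=0)

# Rows = dataset (reference, current), columns = category.
contingency = pd.DataFrame({"reference": ref_counts, "current": cur_counts}).T
chi2, p_value, dof, expected_freq = chi2_contingency(contingency.to_numpy())
```

Using the union of observed levels keeps every column total above zero. `chi2_contingency` raises an error when an expected frequency is zero, which would happen if a level absent from both samples were included.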

## Feature-Level Drift Report
### Example Output Structure


| Feature | Type        | PSI  | p-value | Drift Flag |
| ------- | ----------- | ---- | ------- | ---------- |
| age     | Numerical   | 0.32 | 0.001   | 🚨         |
| income  | Numerical   | 0.05 | 0.61    | ✅         |
| region  | Categorical | 0.27 | 0.003   | 🚨         |


> Feature-level visibility is mandatory for actionability.
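
A report with this structure could be assembled roughly as follows. The function `drift_report`, its helper `_psi`, and the flag rule (PSI above 0.25 and p-value below 0.05) are assumptions made for this sketch; the notebook does not prescribe how the two signals are combined.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp

def _psi(e_counts, a_counts, eps=1e-6):
    # PSI from raw bin counts; eps guards against empty bins.
    e = e_counts / e_counts.sum() + eps
    a = a_counts / a_counts.sum() + eps
    return float(np.sum((e - a) * np.log(e / a)))

def drift_report(reference_df, current_df, psi_threshold=0.25, alpha=0.05, bins=10):
    rows = []
    for col in reference_df.columns:
        ref = reference_df[col].dropna()
        cur = current_df[col].dropna()
        if pd.api.types.is_numeric_dtype(ref):
            ftype = "Numerical"
            ref_v, cur_v = ref.to_numpy(), cur.to_numpy()
            e_counts, edges = np.histogram(ref_v, bins=bins)
            a_counts, _ = np.histogram(np.clip(cur_v, edges[0], edges[-1]), bins=edges)
            p_value = ks_2samp(ref_v, cur_v).pvalue
        else:
            ftype = "Categorical"
            cats = sorted(set(ref) | set(cur))
            e_counts = ref.value_counts().reindex(cats, fill_value=0).to_numpy()
            a_counts = cur.value_counts().reindex(cats, fill_value=0).to_numpy()
            _, p_value, _, _ = chi2_contingency(np.vstack([e_counts, a_counts]))
        psi = _psi(e_counts, a_counts)
        rows.append({
            "Feature": col,
            "Type": ftype,
            "PSI": round(psi, 3),
            "p-value": float(p_value),
            # Assumed rule: flag only when both signals agree.
            "Drift Flag": bool(psi > psi_threshold and p_value < alpha),
        })
    return pd.DataFrame(rows)

# Synthetic check: `age` is shifted by one standard deviation, `region` is not.
rng = np.random.default_rng(1)
reference_df = pd.DataFrame({
    "age": rng.normal(40, 10, 2_000),
    "region": rng.choice(["north", "south"], 2_000),
})
current_df = pd.DataFrame({
    "age": rng.normal(50, 10, 2_000),
    "region": rng.choice(["north", "south"], 2_000),
})
report = drift_report(reference_df, current_df)
```

Requiring both signals reduces false alarms on large samples, at the cost of missing drift that only one of the two measures picks up.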

## Thresholds and Alerting Logic
### Example Rules

In [None]:
if psi > 0.25:
    alert_level = "CRITICAL"
elif psi > 0.1:
    alert_level = "WARNING"
else:
    alert_level = "OK"

### Operational Actions

- Log drift metrics

- Trigger alerts

- Schedule retraining

- Block inference (optional)
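
One way these actions might be wired to the alert levels is sketched below. The logger name, the function names, and the mapping from level to action are all assumptions for illustration; a real system would call its own alerting and scheduling services instead of returning strings.

```python
import json
import logging

logger = logging.getLogger("drift_monitor")  # hypothetical logger name

def alert_level_for(psi: float) -> str:
    # Same thresholds as the example rules above.
    if psi > 0.25:
        return "CRITICAL"
    if psi > 0.1:
        return "WARNING"
    return "OK"

def handle_drift(feature: str, psi: float, block_on_critical: bool = False) -> list:
    """Log the metric and return the follow-up actions for one feature."""
    level = alert_level_for(psi)
    # Always log, so drift trends can be reconstructed later.
    logger.info(json.dumps({"feature": feature, "psi": psi, "alert_level": level}))
    actions = ["log"]
    if level in ("WARNING", "CRITICAL"):
        actions.append("alert")
    if level == "CRITICAL":
        actions.append("schedule_retraining")
        if block_on_critical:
            # Optional and off by default: blocking inference is a drastic step.
            actions.append("block_inference")
    return actions

actions_ok = handle_drift("income", 0.05)
actions_critical = handle_drift("age", 0.32, block_on_critical=True)
```

Returning the action list instead of executing it keeps the decision logic easy to test in isolation.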

## Drift Is Not a Retraining Trigger (By Default)
### Why

- Some drift is seasonal or expected

- Blind retraining increases risk

### Recommended Policy

- Drift → investigation

- Drift + performance drop → retraining
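
The policy can be written as a small decision function. The name `recommend_action` and the return labels are hypothetical. The policy does not cover a performance drop without detected drift, so this sketch routes that case to investigation as an assumption.

```python
def recommend_action(drift_detected: bool, performance_drop: bool) -> str:
    """Apply the recommended policy to two boolean monitoring signals."""
    if drift_detected and performance_drop:
        return "retrain"
    if drift_detected:
        return "investigate"
    if performance_drop:
        # Not covered by the policy; assumed here to warrant investigation.
        return "investigate"
    return "no_action"

decision_drift_only = recommend_action(drift_detected=True, performance_drop=False)
decision_both = recommend_action(drift_detected=True, performance_drop=True)
```

Keeping the two signals as separate inputs makes it explicit that drift alone never triggers retraining.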

## Visualization (Optional but Recommended)

- Distribution plots (reference vs current)

- PSI heatmaps across features

- Drift trends over time

Visualization supports human-in-the-loop decisions.

## Integration into Production Systems
### Where This Runs

- Batch monitoring jobs

- Streaming pipelines

- Model monitoring services

### Output Destinations

- Logs

- Dashboards

- Alerting systems

## Anti-Patterns to Avoid

- ❌ Monitoring only aggregate drift
- ❌ Using one test for all feature types
- ❌ Ignoring feature importance
- ❌ Comparing against outdated reference data

## Key Takeaways

- Data drift monitoring is mandatory in production ML

- Feature-level detection enables targeted responses

- PSI + statistical tests complement each other

- Drift ≠ failure, but unmonitored drift is

### Transition to Next Notebook

➡ 02_performance_monitoring.ipynb

- Detect output degradation

- Link drift signals to model performance

### Optional Exercises

- Inject synthetic drift and observe PSI behavior

- Compare KS-test sensitivity vs PSI

- Build a rolling drift monitor across time windows