# Model Monitoring: Detecting Data Drift

## Objectives
- Understand the concept of **Data Drift** (often called Feature Drift or Covariate Shift), which occurs when the statistical properties of the input data change over time.
- Implement standard statistical tests (e.g., Kolmogorov-Smirnov test) to identify if an infrastructure metric's distribution has fundamentally changed.

## Dataset
- Two synthetic datasets of server request latencies. One represents "historical" data (what the model was trained on), and the other represents "recent" data (production data over the last 24 hours).

## Expected Outcome
- Visual and statistical evidence that the recent data has drifted from the historical distribution, which means the ML model's predictions may be less reliable and the model should be re-evaluated (and possibly retrained).

## Challenge
- Build a simple Python function that takes two arrays, runs a KS test, and issues a Python warning (via `warnings.warn`) if drift is detected.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp

np.random.seed(42)
sns.set_theme(style="darkgrid")

### 1. Simulating Data Drift
Let's imagine you trained an anomaly detection model on server latencies last month. This month, a new microservice update was pushed, slightly shifting the baseline latency.

In [None]:
# Historical Training Data (Last Month)
# e.g., mean latency is 150ms with a standard deviation of 20
historical_latency = np.random.normal(loc=150, scale=20, size=1000)

# Production Data (Past 24 Hours) - The update slowed things down slightly
# Mean latency is now 165ms, with a slightly wider spread (std of 22)
recent_latency = np.random.normal(loc=165, scale=22, size=500)

### 2. Visualizing Drift
The quickest first check is visual: overlay the density plots of the two samples and look for a shift in location or spread.

In [None]:
plt.figure(figsize=(10, 5))
sns.kdeplot(historical_latency, fill=True, label='Historical (Training Data)')
sns.kdeplot(recent_latency, fill=True, label='Recent (Production Data)')
plt.title("Detecting Data Drift in Server Latency")
plt.xlabel("Latency (ms)")
plt.ylabel("Density")
plt.legend()
plt.show()
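
A plot is easier to interpret next to a few numbers. The following is a minimal sketch of a summary table for the two samples. The helper name `summarize`, the choice of statistics, and the separate random generator are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize(samples):
    """Return mean, std, median, and 95th percentile for each named sample."""
    rows = {}
    for name, values in samples.items():
        values = np.asarray(values, dtype=float)
        rows[name] = {
            "mean": values.mean(),
            "std": values.std(ddof=1),  # sample standard deviation
            "p50": np.percentile(values, 50),
            "p95": np.percentile(values, 95),
        }
    # One row per sample, columns in a fixed order
    table = pd.DataFrame.from_dict(rows, orient="index")
    return table[["mean", "std", "p50", "p95"]]

# Illustrative data with the same parameters as the simulation above
rng = np.random.default_rng(0)
demo_samples = {
    "historical": rng.normal(loc=150, scale=20, size=1000),
    "recent": rng.normal(loc=165, scale=22, size=500),
}

summary = summarize(demo_samples)
print(summary.round(1))
```

In the notebook itself you could pass `historical_latency` and `recent_latency` directly instead of the demo arrays.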

### 3. Statistical Testing: The Kolmogorov-Smirnov Test
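Visualizations are great for humans, but automated monitoring pipelines need a numeric criterion. The two-sample **Kolmogorov-Smirnov (KS) Test** compares the empirical cumulative distribution functions (ECDFs) of two samples; its statistic is the largest vertical gap between them.
- **Null Hypothesis (H0):** The two samples come from the same underlying distribution.
- **p-value:** If $p < 0.05$ (a conventional threshold), we reject the null hypothesis and treat the result as evidence of drift. With large samples even small shifts become statistically significant, so read the p-value together with the KS statistic.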
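
To make the statistic concrete, here is a minimal sketch that computes the largest gap between the two ECDFs by hand and compares it with `scipy.stats.ks_2samp`. The function name `ks_statistic_manual` and the demo arrays are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic_manual(sample_a, sample_b):
    """Largest vertical gap between the ECDFs of two samples."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # Evaluate both ECDFs at every observed value
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side="right") / a.size
    ecdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(ecdf_a - ecdf_b)))

# Illustrative data with the same parameters as the simulation above
rng = np.random.default_rng(0)
demo_hist = rng.normal(loc=150, scale=20, size=1000)
demo_recent = rng.normal(loc=165, scale=22, size=500)

manual_d = ks_statistic_manual(demo_hist, demo_recent)
scipy_d = ks_2samp(demo_hist, demo_recent).statistic
print(f"manual D = {manual_d:.4f}, scipy D = {scipy_d:.4f}")
```

The two values should agree, since the KS statistic is defined as exactly this maximum ECDF difference.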

In [None]:
stat, p_value = ks_2samp(historical_latency, recent_latency)

print(f"KS Statistic: {stat:.4f}")
print(f"p-value: {p_value:.3e}")  # scientific notation keeps very small p-values readable

if p_value < 0.05:
    print("\n[ALERT] DATA DRIFT DETECTED: The samples appear to come from different distributions. Retraining may be required.")
else:
    print("\n[OK] No significant drift detected.")
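
For the challenge stated at the top, one possible sketch is shown here. The function name `check_drift`, the `DataDriftWarning` category, the `alpha` parameter, and the returned tuple are all hypothetical design choices, not a prescribed solution.

```python
import warnings

import numpy as np
from scipy.stats import ks_2samp

class DataDriftWarning(UserWarning):
    """Hypothetical warning category used only in this sketch."""

def check_drift(reference, current, alpha=0.05):
    """Run a two-sample KS test and issue a warning if p < alpha.

    Returns a tuple (statistic, p_value, drifted).
    """
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    if reference.size == 0 or current.size == 0:
        raise ValueError("Both samples must be non-empty.")

    result = ks_2samp(reference, current)
    drifted = bool(result.pvalue < alpha)
    if drifted:
        warnings.warn(
            f"Possible data drift: KS statistic={result.statistic:.4f}, "
            f"p-value={result.pvalue:.3e} (alpha={alpha})",
            category=DataDriftWarning,
            stacklevel=2,
        )
    return float(result.statistic), float(result.pvalue), drifted

# Illustrative data with the same parameters as the simulation above
rng = np.random.default_rng(0)
baseline = rng.normal(loc=150, scale=20, size=1000)
shifted = rng.normal(loc=165, scale=22, size=500)

d_stat, d_p, drifted = check_drift(baseline, shifted)
print(f"drifted={drifted}, D={d_stat:.4f}, p={d_p:.3e}")
```

Using a dedicated warning subclass lets a pipeline filter or escalate drift alerts separately, for example with `warnings.simplefilter("error", DataDriftWarning)` to turn them into exceptions.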