### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., gender) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test to compare the baseline and production distributions of the categorical feature.
4. If significant drift is detected, investigate the cause and update the model as needed.

In [1]:
# Chi-squared test for drift in the categorical feature 'gender'
import pandas as pd
from scipy.stats import chi2_contingency

# Step 1: Load baseline training data
baseline_data = pd.DataFrame({
    'gender': ['Male'] * 500 + ['Female'] * 500
})

# Step 2: Load current production data
production_data = pd.DataFrame({
    'gender': ['Male'] * 300 + ['Female'] * 650 + ['Other'] * 50
})

# Step 3: Prepare frequency counts
baseline_counts = baseline_data['gender'].value_counts().sort_index()
production_counts = production_data['gender'].value_counts().sort_index()

# Align indexes so both series cover every category (e.g., 'Other' is absent from the baseline)
all_categories = sorted(set(baseline_counts.index).union(set(production_counts.index)))
baseline_counts = baseline_counts.reindex(all_categories, fill_value=0)
production_counts = production_counts.reindex(all_categories, fill_value=0)

# Combine into a 2 x k contingency table (row 0: baseline, row 1: production)
contingency_table = pd.DataFrame([baseline_counts, production_counts])

# Chi-squared test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Step 4: Interpret results
print("Chi-squared Statistic:", chi2_stat)
print("P-value:", p_value)

if p_value < 0.05:  # conventional 5% significance level
    print("⚠️ Significant data drift detected in 'gender' distribution.")
else:
    print("✅ No significant data drift detected in 'gender' distribution.")


Chi-squared Statistic: 119.56521739130434
P-value: 1.0882857177594273e-26
⚠️ Significant data drift detected in 'gender' distribution.
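
As a possible starting point for the "investigate the cause" part of step 4, the sketch below reports how much each category's share moved and adds an effect size. It reuses the illustrative counts from the cell above. The names `counts`, `shift`, and `cramers_v` are introduced here for illustration and are not part of the original exercise. Cramér's V is a suggested supplement, not a required step: with large samples the p-value can be tiny even for small shifts, so an effect size helps judge whether the drift is large enough to act on.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Same illustrative counts as the cell above (rows: baseline, production).
counts = pd.DataFrame(
    {"Female": [500, 650], "Male": [500, 300], "Other": [0, 50]},
    index=["baseline", "production"],
)

chi2_stat, p_value, dof, expected = chi2_contingency(counts)

# Share of each category within each dataset, and the change between them.
proportions = counts.div(counts.sum(axis=1), axis=0)
shift = proportions.loc["production"] - proportions.loc["baseline"]

# Cramér's V: effect size in [0, 1] derived from the chi-squared statistic.
n = counts.to_numpy().sum()          # total number of observations
min_dim = min(counts.shape) - 1      # min(rows, columns) - 1
cramers_v = np.sqrt(chi2_stat / (n * min_dim))

print("Change in category share (production - baseline):")
print(shift.round(3))
print("Cramér's V:", round(cramers_v, 3))
```

Here the largest movement is in the 'Male' share, and the 'Other' category appears only in production, which may point to a change in data collection rather than in the underlying population.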
