### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., `gender`) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test to compare the baseline and production distributions of the feature.
4. If significant drift is detected, investigate the cause and update the model as needed.
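
The steps above can be sketched as a single function. This is a minimal sketch, not a definitive implementation: the function name `detect_categorical_drift` and the result keys `drift_detected` and `p_value` are taken from the tests in this notebook, while the `alpha` threshold of 0.05 and the extra `chi2` and `dof` keys are assumptions. It uses `scipy.stats.chi2_contingency` on a table of category counts from the two datasets.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def detect_categorical_drift(baseline, production, feature, alpha=0.05):
    """Compare one categorical feature across two DataFrames with a chi-squared test.

    `alpha` is an assumed significance threshold; tune it for your use case.
    """
    for name, df in (("baseline", baseline), ("production", production)):
        if feature not in df.columns:
            raise ValueError(f"Feature '{feature}' not found in {name} data")

    # Build a table of category counts with one column per dataset.
    # Categories missing from one dataset get a count of 0.
    counts = pd.concat(
        [baseline[feature].value_counts(), production[feature].value_counts()],
        axis=1,
        keys=["baseline", "production"],
    ).fillna(0)

    # Rows = datasets, columns = categories.
    chi2, p_value, dof, _ = chi2_contingency(counts.T.values)

    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        # Cast to a plain bool so `is True` / `is False` checks behave as expected.
        "drift_detected": bool(p_value < alpha),
    }
```

A low p-value only says the two samples are unlikely to share a distribution. With very large production samples even small shifts become significant, so the threshold may need adjusting alongside an effect-size check.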

In [1]:
import pandas as pd
import pytest
from drift_detector import detect_categorical_drift  # Replace drift_detector with the name of your own module

def test_detect_drift_with_clear_difference():
    baseline = pd.DataFrame({'gender': ['Male'] * 90 + ['Female'] * 10})
    production = pd.DataFrame({'gender': ['Male'] * 50 + ['Female'] * 50})
    result = detect_categorical_drift(baseline, production, 'gender')
    assert result["drift_detected"] is True
    assert 0 <= result["p_value"] <= 1

def test_detect_drift_no_difference():
    baseline = pd.DataFrame({'gender': ['Male'] * 50 + ['Female'] * 50})
    production = pd.DataFrame({'gender': ['Male'] * 50 + ['Female'] * 50})
    result = detect_categorical_drift(baseline, production, 'gender')
    assert result["drift_detected"] is False

def test_missing_feature_raises_error():
    df = pd.DataFrame({'age': [25, 30]})
    with pytest.raises(ValueError):
        detect_categorical_drift(df, df, 'gender')


ModuleNotFoundError: No module named 'pytest'