### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., gender) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test to compare the two distributions of the categorical feature.
4. If significant drift is detected (e.g., p-value below 0.05), investigate the cause and update the model as needed.

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency

try:
    # Baseline training data categorical feature distribution
    train_data = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female']})

    # Production data categorical feature distribution
    prod_data = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female']})

    # Calculate frequency counts
    train_counts = train_data['gender'].value_counts().sort_index()
    prod_counts = prod_data['gender'].value_counts().sort_index()

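    # Combine counts into a contingency table (rows: categories, columns: datasets);
    # a category missing from one dataset gets a count of 0
    contingency_table = pd.DataFrame({
        'train': train_counts,
        'prod': prod_counts
    }).fillna(0).astype(int)

    # Chi-squared test of independence between dataset and category.
    # Note: with samples this small the expected counts fall below 5,
    # so the p-value here is illustrative only.
    chi2, p_value, _, _ = chi2_contingency(contingency_table.T)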

    print(f"Chi-squared test p-value: {p_value:.4f}")
    if p_value < 0.05:
        print("Significant data drift detected in categorical feature 'gender'. Investigate and update the model.")
    else:
        print("No significant drift detected in categorical feature 'gender'.")

except Exception as e:
    print(f"Error: {e}")
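
The steps above can also be wrapped in a reusable function. The following is a minimal sketch, not a definitive implementation: the function name `categorical_drift_test`, its return keys, and the sample data are hypothetical and chosen for illustration. It additionally reports the smallest expected count, because the chi-squared approximation is commonly considered unreliable when expected counts drop below about 5.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def categorical_drift_test(baseline, current, alpha=0.05):
    """Chi-squared drift check for one categorical feature (illustrative helper)."""
    # Count each category per dataset; a category missing from one side gets 0.
    table = pd.DataFrame({
        "baseline": pd.Series(baseline).value_counts(),
        "current": pd.Series(current).value_counts(),
    }).fillna(0).astype(int)

    # chi2_contingency returns the statistic, p-value, degrees of freedom,
    # and the table of expected frequencies under the no-drift hypothesis.
    chi2, p_value, dof, expected = chi2_contingency(table)

    return {
        "p_value": float(p_value),
        "min_expected": float(expected.min()),
        # Rule of thumb (assumption): treat the test as reliable only if
        # every expected count is at least 5.
        "reliable": bool(expected.min() >= 5),
        "drift": bool(p_value < alpha),
    }


# Hypothetical data: 50/50 split in training, 20/80 split in production.
baseline = ["Male"] * 100 + ["Female"] * 100
current = ["Male"] * 40 + ["Female"] * 160
result = categorical_drift_test(baseline, current)
print(result)
```

If `reliable` comes back `False`, as it would for the six- and seven-row samples in the cell above, collecting more data before acting on the p-value is the safer choice.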
