## Model Monitoring and Data Drift Analysis

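In this section, we analyze data drift with the alibi-detect library. We compare the numeric features of the training dataset against those of the production dataset to determine whether significant drift exists. TabularDrift runs a univariate Kolmogorov-Smirnov test on each numeric feature and, by default, aggregates the results with a Bonferroni correction. The output reports an overall drift flag, together with a p-value and a distance (the test statistic) for each feature.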


In [3]:
import pandas as pd
from alibi_detect.cd import TabularDrift
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

def detect_data_drift(train_file: str, prod_file: str, numeric_features: list, p_value_threshold: float = 0.05) -> pd.DataFrame:
    """
    Detects data drift between training and production datasets using alibi_detect's TabularDrift.

    Parameters:
        train_file (str): File path to the training dataset (Parquet format).
        prod_file (str): File path to the production dataset (Parquet format).
        numeric_features (list): List of numeric feature column names to be analyzed.
        p_value_threshold (float): The significance threshold for drift detection (default is 0.05).

    Returns:
        pd.DataFrame: A DataFrame summarizing drift detection metrics including:
                      - 'Drift Detected': Overall drift flag (0 or 1), aggregated across features
                        after multiple-testing correction
                      - 'P-Value': Per-feature p-values of the univariate drift tests
                      - 'Distance': Per-feature test statistics (distance between the distributions)
    """
    # Load the training and production datasets from Parquet files.
    train_df = pd.read_parquet(train_file)
    prod_df = pd.read_parquet(prod_file)
    
    # Drop the target column 'y' to focus only on the features.
    X_train = train_df.drop(columns=['y'])
    X_prod = prod_df.drop(columns=['y'])
    
    # Extract the numeric features from the datasets as numpy arrays.
    X_train_numeric = X_train[numeric_features].values
    X_prod_numeric = X_prod[numeric_features].values
    
    # Initialize the TabularDrift detector with the training data as the reference distribution.
    cd = TabularDrift(X_train_numeric, p_val=p_value_threshold)
    
    # Run the drift detector on the production data.
    preds = cd.predict(X_prod_numeric)
    
    # Extract drift detection results.
    drift_detected = preds['data']['is_drift']
    p_value = preds['data']['p_val']
    distance = preds['data']['distance']
    
    # Organize the metrics into a DataFrame for tabular display.
    results = pd.DataFrame({
        'Metric': ['Drift Detected', 'P-Value', 'Distance'],
        'Value': [drift_detected, p_value, distance]
    })
    
    return results

# Define the list of numeric features as per the dataset schema.
numeric_features = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

# Disable column width truncation so that full lists are displayed.
pd.set_option('display.max_colwidth', None)

# Call the function to perform drift detection on the training and production datasets.
results = detect_data_drift(
    train_file='Datasets/Processed/banking_data_train.parquet',
    prod_file='Datasets/Processed/banking_data_prod.parquet',
    numeric_features=numeric_features
)

# Print the drift detection results with full details.
print(results.to_string())


           Metric                                                                                          Value
0  Drift Detected                                                                                              0
1         P-Value                     [0.3136755, 0.52932245, 0.80658126, 0.3400289, 0.9677855, 1.0, 0.99993527]
2        Distance  [0.011649788, 0.009800986, 0.007752979, 0.011388713, 0.005971547, 0.0024533956, 0.0039208494]
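The p-values and distances above are arrays packed into single cells, in the same order as `numeric_features`. As a minimal sketch, the reported values can be rearranged into one row per feature, which makes it easier to see which feature comes closest to the threshold. The `per_feature` name is introduced here for illustration only; the numbers are copied from the output above.

```python
import pandas as pd

# Feature order matches the column order passed to the detector.
numeric_features = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

# Values copied from the printed drift report above.
p_values = [0.3136755, 0.52932245, 0.80658126, 0.3400289, 0.9677855, 1.0, 0.99993527]
distances = [0.011649788, 0.009800986, 0.007752979, 0.011388713,
             0.005971547, 0.0024533956, 0.0039208494]

# One row per feature, sorted so the smallest p-value comes first.
per_feature = (
    pd.DataFrame({'Feature': numeric_features, 'P-Value': p_values, 'Distance': distances})
    .sort_values('P-Value')
    .reset_index(drop=True)
)

print(per_feature.to_string(index=False))
```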


***No statistically significant drift was detected in the numeric features: every p-value is at least 0.31, far above the 0.05 threshold, and every distance is below 0.012. Failing to reject the null hypothesis does not prove the distributions are identical, but these results give no evidence that changes in the numeric inputs will degrade the model's performance.***
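To make the numeric test more concrete, the sketch below runs a two-sample Kolmogorov-Smirnov test per feature with SciPy and applies a Bonferroni-corrected threshold, which is the general approach alibi-detect documents for numeric columns in TabularDrift. It is an illustration under stated assumptions, not the library's implementation: the helper `ks_drift_per_feature` is hypothetical, and the data are synthetic rather than the banking dataset.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_drift_per_feature(x_ref: np.ndarray, x_new: np.ndarray, alpha: float = 0.05) -> dict:
    """Feature-wise two-sample K-S tests with a Bonferroni-corrected threshold."""
    n_features = x_ref.shape[1]
    threshold = alpha / n_features  # Bonferroni: split alpha across the tests
    p_vals = np.empty(n_features)
    distances = np.empty(n_features)
    for j in range(n_features):
        result = ks_2samp(x_ref[:, j], x_new[:, j])
        distances[j] = result.statistic  # max gap between the two empirical CDFs
        p_vals[j] = result.pvalue
    return {
        "is_drift": int((p_vals < threshold).any()),  # drift if any feature is significant
        "p_val": p_vals,
        "distance": distances,
        "threshold": threshold,
    }


# Synthetic illustration: three standard-normal features, with the first
# feature of the "new" batch shifted by half a standard deviation.
rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 3))
shifted = rng.normal(size=(2000, 3))
shifted[:, 0] += 0.5

report = ks_drift_per_feature(reference, shifted)
print(report["is_drift"], report["p_val"].round(4), report["distance"].round(4))
```

With a shift this large and 2,000 rows per batch, the first feature's p-value falls far below the corrected threshold, so the overall flag is set.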

## Categorical Data Drift Analysis

In this section, we perform drift detection for categorical features. For each categorical column, we compare the frequency distributions in the training and production datasets using a chi-square test. The output includes the chi-square statistic, p-value, and a flag indicating whether drift is detected (using a significance threshold of 0.05).


In [4]:
import pandas as pd
from scipy.stats import chi2_contingency
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

def detect_categorical_drift(train_file: str, prod_file: str, categorical_features: list, alpha: float = 0.05) -> pd.DataFrame:
    """
    Detects drift in categorical features using the chi-square test.
    
    For each categorical feature, this function computes the frequency distributions in both the training 
    and production datasets, aligns them by their categories, and then applies the chi-square test to determine 
    if the distributions differ significantly.
    
    Parameters:
        train_file (str): File path to the training dataset (Parquet format).
        prod_file (str): File path to the production dataset (Parquet format).
        categorical_features (list): List of categorical feature column names to be analyzed.
        alpha (float): Significance level for the chi-square test (default is 0.05).
        
    Returns:
        pd.DataFrame: A DataFrame summarizing the drift detection results for each feature, including:
                      - Feature: Name of the categorical feature.
                      - Chi2 Statistic: The chi-square test statistic.
                      - p-value: The p-value from the chi-square test.
                      - Drift Detected: 1 if drift is detected (p < alpha), otherwise 0.
    """
    # Load training and production datasets from Parquet files.
    train_df = pd.read_parquet(train_file)
    prod_df = pd.read_parquet(prod_file)
    
    # Drop the target column 'y' from both datasets.
    X_train = train_df.drop(columns=['y'])
    X_prod = prod_df.drop(columns=['y'])
    
    # List to store results for each categorical feature.
    results = []
    
    # Iterate over each categorical feature to perform drift detection.
    for col in categorical_features:
        # Get frequency counts for each category in training and production datasets.
        train_counts = X_train[col].value_counts().sort_index()
        prod_counts = X_prod[col].value_counts().sort_index()
        
        # Determine the union of categories present in either dataset.
        all_categories = sorted(set(train_counts.index) | set(prod_counts.index))
        
        # Reindex counts to include all categories; fill missing values with 0.
        train_counts = train_counts.reindex(all_categories, fill_value=0)
        prod_counts = prod_counts.reindex(all_categories, fill_value=0)
        
        # Build a table of counts with one row per category and one column per dataset.
        contingency_table = pd.DataFrame({
            'train': train_counts,
            'prod': prod_counts
        })
        
        # Perform the chi-square test on the transposed contingency table.
        # Transposing so that each row represents one dataset.
        chi2, p, dof, expected = chi2_contingency(contingency_table.T)
        
        # Determine if drift is detected based on the p-value.
        drift_flag = 1 if p < alpha else 0
        
        # Append the results for this feature.
        results.append({
            'Feature': col,
            'Chi2 Statistic': chi2,
            'p-value': p,
            'Drift Detected': drift_flag
        })
    
    # Convert the list of results into a DataFrame and return it.
    return pd.DataFrame(results)

# Define the list of categorical features as per the dataset schema.
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Call the function to perform drift detection on the training and production datasets.
categorical_results = detect_categorical_drift(
    train_file='Datasets/Processed/banking_data_train.parquet',
    prod_file='Datasets/Processed/banking_data_prod.parquet',
    categorical_features=categorical_features
)

# Ensure full output is displayed without truncation.
pd.set_option('display.max_colwidth', None)
print(categorical_results.to_string())


     Feature  Chi2 Statistic   p-value  Drift Detected
0        job        7.223676  0.780691               0
1    marital        3.574436  0.167425               0
2  education       10.889839  0.012337               1
3    default        1.053602  0.304679               0
4    housing        0.146600  0.701806               0
5       loan        0.320922  0.571054               0
6    contact        2.442812  0.294815               0
7      month       10.720298  0.466985               0
8   poutcome        5.239725  0.155062               0


***The chi-square test flags drift in the 'education' feature (p ≈ 0.012, below 0.05); the other eight categorical features show no significant difference, so the training and production splits are largely consistent. Because nine features are each tested at the 0.05 level, at least one false positive is fairly likely even when nothing has changed: roughly a 37% chance if the tests were independent. Under a Bonferroni-corrected threshold of 0.05 / 9 ≈ 0.0056, 'education' would not be flagged. Since both datasets come from a random split of the same data, the flag is most plausibly sampling variability rather than a true change in the underlying distribution.***
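The multiple-comparison arithmetic can be checked directly against the p-values reported in the table above. This is a minimal sketch: the family-wise error estimate assumes the nine tests are independent, which is a simplification, and the variable names are introduced here for illustration.

```python
alpha = 0.05

# p-values copied from the categorical drift table above.
p_values = {
    'job': 0.780691, 'marital': 0.167425, 'education': 0.012337,
    'default': 0.304679, 'housing': 0.701806, 'loan': 0.571054,
    'contact': 0.294815, 'month': 0.466985, 'poutcome': 0.155062,
}

n_tests = len(p_values)

# Chance of at least one false positive across all tests (independence assumed).
family_wise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: each test must clear alpha / n_tests.
bonferroni_alpha = alpha / n_tests
flagged = [feature for feature, p in p_values.items() if p < bonferroni_alpha]

print(f"Family-wise error rate: {family_wise_error:.3f}")
print(f"Bonferroni threshold:   {bonferroni_alpha:.4f}")
print(f"Features flagged after correction: {flagged}")
```

Bonferroni is conservative; it is used here because it is simple to verify by hand, not because it is the only reasonable correction.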