# Customer Churn Prediction - Model Evaluation and Monitoring

This notebook demonstrates how to evaluate model performance, monitor for degradation, and manage model versions.

## Overview

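We'll cover:
1. Loading a model and test data
2. Computing performance metrics
3. Analyzing the confusion matrix
4. Plotting the ROC curve and computing AUC
5. Checking model calibration
6. Analyzing predicted probability distributions
7. Storing predictions for validation
8. Detecting performance degradation
9. Managing model versions and rollback
10. Generating a detailed classification report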

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.calibration import calibration_curve

# Import our services
from services.prediction import ChurnPredictor
from services.monitoring import ModelEvaluator, AlertService, PredictionStore
from services.model_repository import ModelRepository

# Set display options
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

print("✓ All imports successful")

## Step 1: Load Model and Test Data

In [None]:
# Initialize predictor
predictor = ChurnPredictor()

print(f"✓ Loaded model version: {predictor.model_version}")

# Load test data
test_data = pd.read_csv('../data/raw/test_data.csv')
X_test = test_data.drop('churn', axis=1)
y_test = test_data['churn']

print(f"Test data shape: {test_data.shape}")
print(f"Churn rate: {y_test.mean():.2%}")

## Step 2: Evaluate Model Performance

Compute precision, recall, F1-score, and accuracy on the held-out test set, then check recall against the minimum acceptable threshold.

In [None]:
# Initialize evaluator
evaluator = ModelEvaluator()

# Evaluate model
metrics = evaluator.evaluate(predictor.model, predictor.transformer, X_test, y_test)

print("Model Performance Metrics:")
print(f"  Precision: {metrics['precision']:.4f}")
print(f"  Recall: {metrics['recall']:.4f}")
print(f"  F1-Score: {metrics['f1_score']:.4f}")
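# Accuracy may be absent from the metrics dict; format it like the others when present
accuracy = metrics.get('accuracy')
print(f"  Accuracy: {accuracy:.4f}" if accuracy is not None else "  Accuracy: N/A")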

# Check if meets threshold
MIN_RECALL = 0.85
if metrics['recall'] >= MIN_RECALL:
    print(f"\n✓ Model meets recall threshold (>= {MIN_RECALL:.0%})")
else:
    print(f"\n⚠ Model recall is below {MIN_RECALL:.0%} threshold")

## Step 3: Confusion Matrix Analysis

In [None]:
# Generate predictions (transform the features once and reuse the result)
X_test_transformed = predictor.transformer.transform(X_test)
y_pred = predictor.model.predict(X_test_transformed)
y_pred_proba = predictor.model.predict_proba(X_test_transformed)[:, 1]

# Compute confusion matrix
cm = evaluator.compute_confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(cm)
print(f"\nBreakdown:")
print(f"  True Negatives (correctly predicted no churn): {cm[0, 0]}")
print(f"  False Positives (predicted churn, but didn't): {cm[0, 1]}")
print(f"  False Negatives (predicted no churn, but did): {cm[1, 0]}")
print(f"  True Positives (correctly predicted churn): {cm[1, 1]}")

In [None]:
# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Churn', 'Churn'],
            yticklabels=['No Churn', 'Churn'],
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

# Calculate rates
tn, fp, fn, tp = cm.ravel()
total = tn + fp + fn + tp

print(f"\nRates:")
print(f"  True Negative Rate (Specificity): {tn/(tn+fp):.2%}")
print(f"  False Positive Rate: {fp/(tn+fp):.2%}")
print(f"  True Positive Rate (Recall/Sensitivity): {tp/(tp+fn):.2%}")
print(f"  False Negative Rate: {fn/(tp+fn):.2%}")
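
Unpacking `cm.ravel()` into four values only works when the matrix is 2x2, and the rate formulas divide by zero when a class is missing from the batch. The sketch below shows one way to guard against both, assuming the matrix is built with scikit-learn's `confusion_matrix`. The helper `safe_rate` and the toy arrays are hypothetical, and whether `evaluator.compute_confusion_matrix` already handles these cases is not known from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def safe_rate(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return float(numerator) / float(denominator) if denominator else float("nan")


# Toy batch with no actual churners (hypothetical data)
y_true_demo = np.array([0, 0, 0, 0])
y_pred_demo = np.array([0, 0, 1, 0])

# labels=[0, 1] forces a 2x2 matrix even if a class is absent
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn_d, fp_d, fn_d, tp_d = cm_demo.ravel()

specificity_demo = safe_rate(tn_d, tn_d + fp_d)
recall_demo = safe_rate(tp_d, tp_d + fn_d)  # undefined here: no positives in the batch

print(cm_demo)
print(f"Specificity: {specificity_demo:.2f}, Recall: {recall_demo}")
```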

## Step 4: ROC Curve and AUC

In [None]:
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"AUC Score: {roc_auc:.4f}")
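
The arrays returned by `roc_curve` can also be used to pick an operating threshold, which is useful when recall must stay above a floor such as the 85% used in this notebook. The following is a minimal sketch under that assumption. The function name `threshold_for_recall` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve


def threshold_for_recall(y_true, y_score, min_recall=0.85):
    """Return (threshold, recall, fpr) for the highest threshold meeting min_recall."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # thresholds are in decreasing order and tpr is non-decreasing, so the
    # first index where tpr >= min_recall corresponds to the highest threshold
    idx = int(np.argmax(tpr >= min_recall))
    return float(thresholds[idx]), float(tpr[idx]), float(fpr[idx])


# Toy labels and scores (hypothetical data)
y_true_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score_demo = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.3])

thr_demo, recall_at_thr, fpr_at_thr = threshold_for_recall(y_true_demo, y_score_demo)
print(f"threshold={thr_demo:.2f}, recall={recall_at_thr:.2f}, fpr={fpr_at_thr:.2f}")

# Apply the chosen threshold instead of the default 0.5
y_pred_at_thr = (y_score_demo >= thr_demo).astype(int)
```

Lowering the threshold raises recall at the cost of more false positives, so the chosen value should be validated on data that was not used to select it.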

## Step 5: Model Calibration

Check whether predicted probabilities match observed outcomes: among customers assigned a churn probability near 0.7, roughly 70% should actually churn.

In [None]:
# Compute calibration curve
prob_true, prob_pred = evaluator.compute_calibration_curve(y_test, y_pred_proba)

# Plot calibration curve
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', linewidth=2, label='Model')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfect Calibration')
plt.xlabel('Predicted Probability')
plt.ylabel('Observed Fraction of Churners')
plt.title('Calibration Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("Calibration Analysis:")
print("If the curve is close to the diagonal, the model is well-calibrated.")
print("Above diagonal = model underestimates probability")
print("Below diagonal = model overestimates probability")
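
A calibration plot is easier to track over time when it is paired with a number. The sketch below assumes scikit-learn's `calibration_curve` and `brier_score_loss`, and compares scores that are calibrated by construction with a deliberately distorted copy. The data is synthetic, and `evaluator.compute_calibration_curve` may bin differently.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Synthetic scores that are calibrated by construction:
# each label is drawn with exactly the stated probability
p_demo = rng.uniform(0, 1, size=5000)
y_demo = (rng.uniform(0, 1, size=5000) < p_demo).astype(int)

# A distorted copy that systematically underestimates the churn probability
p_distorted = p_demo ** 2


def max_calibration_gap(y_true, y_prob, n_bins=10):
    """Largest absolute gap between observed frequency and mean prediction per bin."""
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    return float(np.max(np.abs(frac_pos - mean_pred)))


gap_calibrated = max_calibration_gap(y_demo, p_demo)
gap_distorted = max_calibration_gap(y_demo, p_distorted)
brier_calibrated = brier_score_loss(y_demo, p_demo)
brier_distorted = brier_score_loss(y_demo, p_distorted)

print(f"Calibrated: max gap={gap_calibrated:.3f}, Brier={brier_calibrated:.3f}")
print(f"Distorted:  max gap={gap_distorted:.3f}, Brier={brier_distorted:.3f}")
```

A lower Brier score is better. Because the distorted scores sit below the observed frequencies, their calibration curve would lie above the diagonal.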

## Step 6: Probability Distribution Analysis

In [None]:
# Analyze probability distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Histogram by actual class
axes[0, 0].hist(y_pred_proba[y_test == 0], bins=30, alpha=0.6, label='No Churn', color='green')
axes[0, 0].hist(y_pred_proba[y_test == 1], bins=30, alpha=0.6, label='Churn', color='red')
axes[0, 0].axvline(x=0.5, color='black', linestyle='--', label='Threshold')
axes[0, 0].set_xlabel('Predicted Probability')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Probability Distribution by Actual Class')
axes[0, 0].legend()

# 2. Box plot by actual class
prob_df = pd.DataFrame({
    'probability': y_pred_proba,
    'actual': ['Churn' if y == 1 else 'No Churn' for y in y_test]
})
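prob_df.boxplot(column='probability', by='actual', ax=axes[0, 1])
# pandas adds a figure-level "Boxplot grouped by ..." title when `by` is used; clear it
fig.suptitle('')
axes[0, 1].set_title('Probability Box Plot by Actual Class')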
axes[0, 1].set_xlabel('Actual Class')
axes[0, 1].set_ylabel('Predicted Probability')

# 3. Cumulative distribution
sorted_probs = np.sort(y_pred_proba)
cumulative = np.arange(1, len(sorted_probs) + 1) / len(sorted_probs)
axes[1, 0].plot(sorted_probs, cumulative, linewidth=2)
axes[1, 0].axvline(x=0.5, color='red', linestyle='--', label='Threshold')
axes[1, 0].set_xlabel('Predicted Probability')
axes[1, 0].set_ylabel('Cumulative Proportion')
axes[1, 0].set_title('Cumulative Distribution of Probabilities')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

# 4. Probability bins
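bins = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
# include_lowest=True keeps probabilities of exactly 0.0 in the first bin
prob_df['bin'] = pd.cut(prob_df['probability'], bins=bins, include_lowest=True)
bin_counts = (
    prob_df.groupby('bin', observed=False)['actual']
    .value_counts()
    .unstack(fill_value=0)
    # unstack sorts columns alphabetically ('Churn' first); fix the order so
    # the colors and legend labels below match the right class
    .reindex(columns=['No Churn', 'Churn'], fill_value=0)
)
bin_counts.plot(kind='bar', ax=axes[1, 1], color=['green', 'red'])
axes[1, 1].set_title('Actual Outcomes by Probability Bin')
axes[1, 1].set_xlabel('Probability Bin')
axes[1, 1].set_ylabel('Count')
axes[1, 1].legend(['No Churn', 'Churn'])
axes[1, 1].tick_params(axis='x', rotation=45)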

plt.tight_layout()
plt.show()

## Step 7: Store Predictions for Future Validation

In [None]:
# Initialize prediction store
store = PredictionStore()

# Store predictions with actual outcomes
stored_count = 0
for idx in range(len(test_data)):
    customer_id = f"TEST_{idx:04d}"
    prediction = float(y_pred_proba[idx])
    actual = int(y_test.iloc[idx])
    
    store.store(
        customer_id=customer_id,
        prediction=prediction,
        actual_outcome=actual,
        model_version=predictor.model_version
    )
    stored_count += 1

print(f"✓ Stored {stored_count} predictions")

In [None]:
# Retrieve and analyze stored predictions
stored_predictions = store.retrieve(model_version=predictor.model_version)

print(f"Retrieved {len(stored_predictions)} predictions")
print(f"\nSample stored predictions:")
print(stored_predictions.head())
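
Stored predictions become useful once outcomes are known, because metrics can then be recomputed per model version. The sketch below assumes the retrieved data is a DataFrame whose columns mirror the keyword arguments passed to `store.store` above. That schema, the helper `recall_by_version`, and the toy rows are assumptions, not the documented behavior of `PredictionStore`.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical stand-in for the DataFrame returned by store.retrieve()
stored_demo = pd.DataFrame({
    "customer_id": ["A1", "A2", "A3", "B1", "B2", "B3"],
    "prediction": [0.9, 0.2, 0.7, 0.4, 0.8, 0.1],
    "actual_outcome": [1, 0, 1, 1, 0, 0],
    "model_version": ["v1", "v1", "v1", "v2", "v2", "v2"],
})


def recall_by_version(df, threshold=0.5):
    """Compute recall per model version from rows with a known outcome."""
    labelled = df.dropna(subset=["actual_outcome"])
    results = {}
    for version, group in labelled.groupby("model_version"):
        y_hat = (group["prediction"] >= threshold).astype(int)
        y_actual = group["actual_outcome"].astype(int)
        results[version] = recall_score(y_actual, y_hat, zero_division=0)
    return results


recall_demo_by_version = recall_by_version(stored_demo)
print(recall_demo_by_version)
```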

## Step 8: Performance Monitoring and Alerts

In [None]:
# Initialize alert service
alert_service = AlertService()

# Check for performance degradation
# Reuse MIN_RECALL from Step 2 so the threshold is defined in one place
alert = alert_service.check_performance(metrics['recall'], threshold=MIN_RECALL)

if alert:
    print(f"⚠ ALERT: {alert}")
else:
    print("✓ No performance alerts - model is performing within acceptable range")

In [None]:
# Simulate monitoring over time
# In production, you would track these metrics over days/weeks

monitoring_data = {
    'date': pd.date_range(start='2025-01-01', periods=10, freq='D'),
    'recall': [0.87, 0.86, 0.85, 0.84, 0.83, 0.82, 0.81, 0.80, 0.79, 0.78],
    'precision': [0.52, 0.51, 0.50, 0.49, 0.48, 0.47, 0.46, 0.45, 0.44, 0.43],
    'f1_score': [0.65, 0.64, 0.63, 0.62, 0.61, 0.60, 0.59, 0.58, 0.57, 0.56]
}

monitoring_df = pd.DataFrame(monitoring_data)

# Plot performance over time
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(monitoring_df['date'], monitoring_df['recall'], marker='o', linewidth=2)
axes[0].axhline(y=0.85, color='red', linestyle='--', label='Threshold')
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Recall')
axes[0].set_title('Recall Over Time')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].tick_params(axis='x', rotation=45)

axes[1].plot(monitoring_df['date'], monitoring_df['precision'], marker='o', linewidth=2, color='orange')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision Over Time')
axes[1].grid(alpha=0.3)
axes[1].tick_params(axis='x', rotation=45)

axes[2].plot(monitoring_df['date'], monitoring_df['f1_score'], marker='o', linewidth=2, color='green')
axes[2].set_xlabel('Date')
axes[2].set_ylabel('F1-Score')
axes[2].set_title('F1-Score Over Time')
axes[2].grid(alpha=0.3)
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

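print("Performance Trend Analysis:")
print(f"  Recall degradation: {monitoring_df['recall'].iloc[0] - monitoring_df['recall'].iloc[-1]:.3f}")
# Guard against the no-breach case, where .min() on an empty selection returns NaT
breach_dates = monitoring_df.loc[monitoring_df['recall'] < 0.85, 'date']
if breach_dates.empty:
    print("  No alert: recall stayed at or above the threshold")
else:
    print(f"  Alert triggered on: {breach_dates.min():%Y-%m-%d}")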
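
A single reading below the threshold can be noise, so one common refinement is to alert only after several consecutive breaches. The sketch below illustrates that idea on the same simulated recall series. The function `first_sustained_breach` and the choice of three consecutive days are assumptions, and `AlertService` may implement something different.

```python
import pandas as pd


def first_sustained_breach(series, threshold, min_consecutive=3):
    """Return the index label where the metric completes a run of
    min_consecutive observations below threshold, or None if it never does."""
    run_length = 0
    for label, value in series.items():
        run_length = run_length + 1 if value < threshold else 0
        if run_length >= min_consecutive:
            return label
    return None


# Same simulated recall values as the monitoring example above
recall_demo = pd.Series(
    [0.87, 0.86, 0.85, 0.84, 0.83, 0.82, 0.81, 0.80, 0.79, 0.78],
    index=pd.date_range(start="2025-01-01", periods=10, freq="D"),
)

alert_date = first_sustained_breach(recall_demo, threshold=0.85)
print(f"Sustained breach confirmed on: {alert_date}")
```

Requiring a sustained breach delays the alert by a couple of days in exchange for fewer false alarms. The right window depends on how quickly outcomes arrive and how noisy the daily metric is.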

## Step 9: Model Version Management

In [None]:
# Initialize repository
repo = ModelRepository()

# List all versions
versions = repo.list_versions()

print(f"Total model versions: {len(versions)}\n")

# Create comparison DataFrame
version_data = []
for v in versions:
    version_data.append({
        'version': v.version,
        'recall': v.metadata.get('recall', None),
        'precision': v.metadata.get('precision', None),
        'f1_score': v.metadata.get('f1_score', None),
        'timestamp': v.metadata.get('timestamp', None)
    })

version_df = pd.DataFrame(version_data)
print("Model Version Comparison:")
print(version_df)

In [None]:
# Visualize version performance
if len(version_df) > 1:
    fig, ax = plt.subplots(figsize=(12, 6))
    
    x = range(len(version_df))
    width = 0.25
    
    ax.bar([i - width for i in x], version_df['recall'], width, label='Recall', alpha=0.8)
    ax.bar(x, version_df['precision'], width, label='Precision', alpha=0.8)
    ax.bar([i + width for i in x], version_df['f1_score'], width, label='F1-Score', alpha=0.8)
    
    ax.axhline(y=0.85, color='red', linestyle='--', label='Recall Threshold')
    ax.set_xlabel('Model Version')
    ax.set_ylabel('Score')
    ax.set_title('Model Performance Across Versions')
    ax.set_xticks(x)
    ax.set_xticklabels(version_df['version'], rotation=45, ha='right')
    ax.legend()
    ax.grid(alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()

## Step 10: Model Rollback Simulation

Demonstrate how to roll back to a previous version if needed.

In [None]:
# Get current version
current_version = repo.get_latest_version()
print(f"Current deployed version: {current_version}")

# Simulate rollback (uncomment to actually perform rollback)
# Note: versions[-2] assumes list_versions() returns versions ordered oldest to newest
# if len(versions) > 1:
#     previous_version = versions[-2].version
#     print(f"\nRolling back to: {previous_version}")
#     repo.rollback(previous_version)
#     print("✓ Rollback complete")
#     
#     # Verify
#     new_current = repo.get_latest_version()
#     print(f"New deployed version: {new_current}")

print("\nNote: Rollback code is commented out to prevent accidental execution.")
print("Uncomment the code above to perform an actual rollback.")

## Step 11: Detailed Classification Report

In [None]:
# Generate detailed classification report
print("Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))

# Additional metrics
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score

mcc = matthews_corrcoef(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)

print(f"\nAdditional Metrics:")
print(f"  Matthews Correlation Coefficient: {mcc:.4f}")
print(f"  Cohen's Kappa: {kappa:.4f}")
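
Both additional metrics can be derived from the four confusion-matrix counts, which helps when interpreting them. The sketch below recomputes the Matthews correlation coefficient and Cohen's kappa by hand for binary labels and compares them with scikit-learn's results. The toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, matthews_corrcoef

# Toy labels and predictions (hypothetical data)
y_true_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0])
y_pred_demo = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])

tn_d, fp_d, fn_d, tp_d = (
    float(v) for v in confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()
)
n_d = tn_d + fp_d + fn_d + tp_d

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc_manual = (tp_d * tn_d - fp_d * fn_d) / np.sqrt(
    (tp_d + fp_d) * (tp_d + fn_d) * (tn_d + fp_d) * (tn_d + fn_d)
)

# Kappa = (p_o - p_e) / (1 - p_e): observed agreement vs agreement expected by chance
p_observed = (tp_d + tn_d) / n_d
p_expected = ((tp_d + fp_d) * (tp_d + fn_d) + (tn_d + fn_d) * (tn_d + fp_d)) / n_d**2
kappa_manual = (p_observed - p_expected) / (1 - p_expected)

print(f"MCC:   manual={mcc_manual:.4f}, sklearn={matthews_corrcoef(y_true_demo, y_pred_demo):.4f}")
print(f"Kappa: manual={kappa_manual:.4f}, sklearn={cohen_kappa_score(y_true_demo, y_pred_demo):.4f}")
```

Both metrics account for class imbalance, which makes them a useful complement to accuracy on churn data, where non-churners usually dominate.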

## Summary

In this notebook, we:

1. ✓ Loaded and evaluated a trained model
2. ✓ Computed comprehensive performance metrics
3. ✓ Analyzed confusion matrices
4. ✓ Examined ROC curves and AUC
5. ✓ Checked model calibration
6. ✓ Analyzed probability distributions
7. ✓ Stored predictions for validation
8. ✓ Set up performance monitoring and alerts
9. ✓ Compared model versions
10. ✓ Demonstrated rollback capabilities
11. ✓ Generated a detailed classification report with MCC and Cohen's kappa

## Key Takeaways

- **Continuous Monitoring**: Track metrics over time to detect degradation early
- **Multiple Metrics**: Don't rely on a single metric; read precision, recall, F1, and AUC together
- **Calibration Matters**: Well-calibrated probabilities are needed whenever a decision depends on the predicted probability itself rather than just the predicted class
- **Version Control**: Keep previous model versions available so you can roll back quickly
- **Alert Systems**: Automated alerts help catch degradation before it affects the business

## Best Practices for Production Monitoring

1. **Set Up Automated Monitoring**
   - Schedule daily/weekly evaluation jobs
   - Track metrics in a time-series database
   - Set up email/Slack alerts for degradation

2. **Monitor Data Drift**
   - Track feature distributions over time
   - Compare production data to training data
   - Alert on significant distribution shifts

3. **A/B Testing**
   - Test new models on a subset of traffic
   - Compare performance before full deployment
   - Gradually roll out improvements

4. **Feedback Loop**
   - Collect actual outcomes for predictions
   - Retrain models with new data
   - Continuously improve performance

5. **Documentation**
   - Document model changes and reasons
   - Track business impact of model updates
   - Maintain audit trail for compliance
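
The "Monitor Data Drift" practice above can be sketched with the Population Stability Index (PSI), one of several ways to compare a production feature distribution with its training distribution. The implementation below uses only NumPy. The function name, the quantile binning, and the synthetic samples are assumptions for illustration, and the 0.1 and 0.25 cutoffs in the comments are a commonly quoted rule of thumb rather than a fixed standard.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges taken from quantiles of the reference sample;
    # values beyond the outermost edges fall into the first or last bin
    interior_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]

    exp_counts = np.bincount(
        np.searchsorted(interior_edges, expected, side="right"), minlength=n_bins
    )
    act_counts = np.bincount(
        np.searchsorted(interior_edges, actual, side="right"), minlength=n_bins
    )

    # Clip to avoid log(0) when a bin is empty
    exp_frac = np.clip(exp_counts / len(expected), eps, None)
    act_frac = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # reference sample
prod_stable = rng.normal(loc=0.0, scale=1.0, size=5000)     # same distribution
prod_shifted = rng.normal(loc=1.0, scale=1.0, size=5000)    # mean has drifted

psi_stable = population_stability_index(train_feature, prod_stable)
psi_shifted = population_stability_index(train_feature, prod_shifted)

# Rule of thumb: below 0.1 is often read as stable, above 0.25 as a significant shift
print(f"PSI (stable):  {psi_stable:.4f}")
print(f"PSI (shifted): {psi_shifted:.4f}")
```

In a scheduled monitoring job, a check like this would run per feature, with the training sample as the reference and the latest production window as the comparison.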