# Customer Churn Prediction - Prediction and Explanation Workflow

This notebook demonstrates how to use the trained model to make predictions and generate SHAP explanations.

## Overview

We'll cover:
1. Loading a trained model from the repository
2. Making single customer predictions
3. Batch predictions for multiple customers
4. Computing SHAP explanations
5. Visualizing SHAP values
6. Ranking features by mean absolute SHAP value
7. Verifying the SHAP additivity property
8. Comparing high-risk and low-risk customers

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap

# Import our services
from services.prediction import ChurnPredictor, CustomerRecord
from services.explainability import SHAPExplainer
from services.model_repository import ModelRepository

# Set display options
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

# Initialize SHAP
shap.initjs()

print("✓ All imports successful")

## Step 1: Load Trained Model

First, let's see what models are available and load one.

In [None]:
# Initialize repository
repo = ModelRepository()

# List available versions
versions = repo.list_versions()

print(f"Available model versions: {len(versions)}\n")
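for v in versions[-5:]:  # Show the last 5 entries
    recall = v.metadata.get('recall')
    precision = v.metadata.get('precision')
    print(f"Version: {v.version}")
    # Applying :.4f to an 'N/A' fallback string raises ValueError, so format conditionally
    print(f"  Recall: {recall:.4f}" if recall is not None else "  Recall: N/A")
    print(f"  Precision: {precision:.4f}" if precision is not None else "  Precision: N/A")
    print()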

In [None]:
# Initialize predictor (loads latest model by default)
predictor = ChurnPredictor()

print(f"✓ Loaded model version: {predictor.model_version}")
print(f"  Number of features: {len(predictor.feature_names)}")

## Step 2: Single Customer Prediction

Let's make a prediction for a single customer.

In [None]:
# Create a customer record
customer = CustomerRecord(
    customer_id="CUST_001",
    features={
        'tenure_months': 12,
        'monthly_charges': 75.50,
        'total_charges': 906.00,
        'contract_type': 'Month-to-month',
        'payment_method': 'Electronic check',
        'internet_service': 'Fiber optic',
        'online_security': 'No',
        'online_backup': 'No',
        'device_protection': 'No',
        'tech_support': 'No',
        'streaming_tv': 'Yes',
        'streaming_movies': 'Yes'
    }
)

# Make prediction with explanations
result = predictor.predict_single(customer, include_explanations=True)

print(f"Customer ID: {result.customer_id}")
print(f"Churn Probability: {result.churn_probability:.3f}")
print(f"High Risk: {'YES' if result.is_high_risk else 'NO'}")
print(f"Model Version: {result.model_version}")
print(f"\nTop Contributing Features:")
for feature, value in result.top_features:
    direction = "increases" if value > 0 else "decreases"
    print(f"  {feature}: {value:.4f} ({direction} churn risk)")

## Step 3: Batch Predictions

Now let's make predictions for multiple customers at once.

In [None]:
# Load test data
test_data = pd.read_csv('../data/raw/test_data.csv')

print(f"Test data shape: {test_data.shape}")
test_data.head()

In [None]:
# Make batch predictions
results = predictor.predict_batch(test_data.head(50))  # Predict for first 50 customers

# Convert to DataFrame
results_df = pd.DataFrame([
    {
        'customer_id': r.customer_id,
        'churn_probability': r.churn_probability,
        'is_high_risk': r.is_high_risk,
        # Positional lookup; assumes predict_batch returns results in input row order
        'actual_churn': test_data['churn'].iloc[i]
    }
    for i, r in enumerate(results)
])

print(f"\nPrediction Summary:")
print(f"  Total customers: {len(results_df)}")
print(f"  High-risk accounts: {results_df['is_high_risk'].sum()} ({results_df['is_high_risk'].mean():.1%})")
print(f"  Average churn probability: {results_df['churn_probability'].mean():.3f}")
print(f"  Min probability: {results_df['churn_probability'].min():.3f}")
print(f"  Max probability: {results_df['churn_probability'].max():.3f}")

results_df.head(10)
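
Because the results table carries both the predicted risk flag and the actual label, the batch can be sanity-checked against the recall and precision stored in the model metadata. The following is a minimal sketch on plain Python lists with made-up values. The function name `precision_recall` is hypothetical, and it assumes `actual_churn` is encoded as 0/1. With `results_df`, the two inputs would be `results_df['is_high_risk']` and `results_df['actual_churn']`.

```python
def precision_recall(predicted_high_risk, actual_churn):
    """Precision and recall of the high-risk flag against known churn labels."""
    pairs = list(zip(predicted_high_risk, actual_churn))
    tp = sum(1 for flagged, churned in pairs if flagged and churned)
    fp = sum(1 for flagged, churned in pairs if flagged and not churned)
    fn = sum(1 for flagged, churned in pairs if not flagged and churned)
    # Return NaN when a metric is undefined (no flagged customers or no churners)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Illustrative data only, not taken from test_data.csv
flags = [True, True, False, True, False, False]
actual = [1, 0, 0, 1, 1, 0]

precision, recall = precision_recall(flags, actual)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
```

On a sample of 50 customers these estimates are noisy, so a large gap from the stored metrics is a prompt to look closer rather than proof of a problem.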

In [None]:
# Visualize prediction distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of probabilities
axes[0].hist(results_df['churn_probability'], bins=20, edgecolor='black', alpha=0.7)
axes[0].axvline(x=0.5, color='red', linestyle='--', label='High-Risk Threshold')
axes[0].set_xlabel('Churn Probability')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Churn Probabilities')
axes[0].legend()

# High-risk vs low-risk
risk_counts = results_df['is_high_risk'].value_counts()
axes[1].bar(['Low Risk', 'High Risk'], [risk_counts.get(False, 0), risk_counts.get(True, 0)], 
            color=['green', 'red'], alpha=0.7)
axes[1].set_ylabel('Count')
axes[1].set_title('Risk Classification')

plt.tight_layout()
plt.show()

## Step 4: SHAP Explanations

Let's compute detailed SHAP explanations to understand feature contributions.
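
To make concrete what a SHAP value is, the sketch below computes exact Shapley values by enumerating feature coalitions for a tiny hand-written scoring function. Everything in it (`toy_churn_score`, the feature names, the single baseline row) is hypothetical and unrelated to `SHAPExplainer`. Real explainers use far more efficient algorithms and a background dataset rather than one baseline row.

```python
from itertools import combinations
from math import factorial


def toy_churn_score(row):
    """Hypothetical score with an interaction term (not the trained model)."""
    score = 0.4
    score += 0.3 * row["month_to_month"]
    score -= 0.01 * row["tenure_months"]
    score += 0.1 * row["month_to_month"] * row["fiber_optic"]
    return score


def shapley_values(model, instance, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        mixed = {f: (instance[f] if f in coalition else baseline[f]) for f in names}
        return model(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {f}) - value(members))
        phi[f] = total
    return phi


instance = {"month_to_month": 1, "tenure_months": 12, "fiber_optic": 1}
baseline = {"month_to_month": 0, "tenure_months": 30, "fiber_optic": 0}

phi = shapley_values(toy_churn_score, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.4f}")
```

The interaction term is split evenly between `month_to_month` and `fiber_optic`, which is how Shapley values share credit among features that only matter together.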

In [None]:
# Initialize SHAP explainer
explainer = SHAPExplainer(predictor.model, predictor.transformer)

print("✓ SHAP explainer initialized")
print(f"  Background samples: {explainer.background_size}")

In [None]:
# Select a few customers for detailed explanation
sample_customers = test_data.head(10)

# Compute SHAP values
shap_values = explainer.explain(sample_customers)

print(f"SHAP values shape: {shap_values.shape}")
print(f"Number of features: {len(predictor.feature_names)}")

## Step 5: Visualize SHAP Explanations

Let's create various visualizations to understand feature importance.

In [None]:
# Waterfall plot for a single customer
customer_idx = 0

# Transform the selected row once and reuse it for the prediction and the plot
customer_row = predictor.transformer.transform(
    sample_customers.iloc[customer_idx:customer_idx + 1]
)

print(f"Explaining customer at index {customer_idx}")
print(f"Predicted probability: {predictor.model.predict_proba(customer_row)[0, 1]:.3f}")

# Create SHAP explanation object
shap_explanation = shap.Explanation(
    values=shap_values[customer_idx],
    base_values=explainer.explainer.expected_value,
    data=customer_row[0],
    feature_names=predictor.feature_names
)

# Waterfall plot
shap.plots.waterfall(shap_explanation, max_display=10)

In [None]:
# Force plot for a single customer
shap.plots.force(
    explainer.explainer.expected_value,
    shap_values[customer_idx],
    predictor.transformer.transform(sample_customers.iloc[customer_idx:customer_idx+1])[0],
    feature_names=predictor.feature_names
)

In [None]:
# Summary plot showing feature importance across all samples
shap.summary_plot(
    shap_values,
    predictor.transformer.transform(sample_customers),
    feature_names=predictor.feature_names,
    max_display=15
)

In [None]:
# Bar plot of mean absolute SHAP values
shap.summary_plot(
    shap_values,
    predictor.transformer.transform(sample_customers),
    feature_names=predictor.feature_names,
    plot_type="bar",
    max_display=15
)

## Step 6: Analyze Top Features

Let's identify the most important features across all customers.

In [None]:
# Calculate mean absolute SHAP values
mean_abs_shap = np.abs(shap_values).mean(axis=0)

# Create DataFrame
feature_importance_df = pd.DataFrame({
    'feature': predictor.feature_names,
    'mean_abs_shap': mean_abs_shap
}).sort_values('mean_abs_shap', ascending=False)

print("Top 10 Most Important Features:")
print(feature_importance_df.head(10))

In [None]:
# Visualize top features
plt.figure(figsize=(10, 6))
top_features = feature_importance_df.head(10)
plt.barh(top_features['feature'], top_features['mean_abs_shap'])
plt.xlabel('Mean Absolute SHAP Value')
plt.title('Top 10 Feature Importances (SHAP)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## Step 7: Verify SHAP Additivity

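SHAP values should satisfy the additivity property: for each customer, the sum of the SHAP values equals the model output minus the base value. The verification cell in this step compares against `predict_proba`, so it assumes the explainer produces SHAP values in probability space. If the values are in log-odds space, the comparison must use the log-odds output instead.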
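
As a worked illustration of the property, consider a linear model with features treated as independent. There, each SHAP value has the closed form `weight * (value - background mean)`, and the base value is the mean model output over the background data. The sketch below uses hypothetical weights, feature names, and background rows. It is not the trained churn model.

```python
# Hypothetical linear score: f(x) = bias + sum(w_i * x_i)
weights = {"tenure_months": -0.01, "monthly_charges": 0.004, "has_tech_support": -0.15}
bias = 0.5

# Small made-up background set used to define the base value
background = [
    {"tenure_months": 2, "monthly_charges": 90.0, "has_tech_support": 0},
    {"tenure_months": 24, "monthly_charges": 60.0, "has_tech_support": 1},
    {"tenure_months": 46, "monthly_charges": 30.0, "has_tech_support": 1},
]


def predict(row):
    return bias + sum(weights[name] * row[name] for name in weights)


# Base value: mean model output over the background rows
base_value = sum(predict(row) for row in background) / len(background)

# Per-feature background means
feature_means = {
    name: sum(row[name] for row in background) / len(background) for name in weights
}

customer = {"tenure_months": 12, "monthly_charges": 75.5, "has_tech_support": 0}

# Closed-form SHAP values for a linear model with independent features
shap_row = {
    name: weights[name] * (customer[name] - feature_means[name]) for name in weights
}

prediction = predict(customer)
shap_sum = sum(shap_row.values())
gap = abs(shap_sum - (prediction - base_value))

print(f"Prediction: {prediction:.4f}")
print(f"Base value: {base_value:.4f}")
print(f"SHAP sum: {shap_sum:.4f}")
print(f"Additivity gap: {gap:.2e}")
```

For a linear model the gap is zero up to floating-point rounding. For other model types, a small nonzero gap can come from approximation in the explainer, which is why a tolerance-based check is the usual approach.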

In [None]:
# Verify additivity for each customer
predictions = predictor.model.predict_proba(
    predictor.transformer.transform(sample_customers)
)[:, 1]

base_value = explainer.explainer.expected_value

print(f"Base value (expected value): {base_value:.4f}\n")

for i in range(min(5, len(sample_customers))):
    shap_sum = shap_values[i].sum()
    prediction = predictions[i]
    expected_sum = prediction - base_value
    is_valid = explainer.validate_shap_sum(shap_values[i], prediction, base_value)
    
    print(f"Customer {i}:")
    print(f"  Prediction: {prediction:.4f}")
    print(f"  SHAP sum: {shap_sum:.4f}")
    print(f"  Expected (pred - base): {expected_sum:.4f}")
    print(f"  Difference: {abs(shap_sum - expected_sum):.6f}")
    print(f"  Valid: {'✓' if is_valid else '✗'}")
    print()

## Step 8: Compare High-Risk vs Low-Risk Customers

Let's analyze the differences between high-risk and low-risk customers.

In [None]:
# Classify customers
high_risk_mask = predictions >= 0.5
low_risk_mask = predictions < 0.5

print(f"High-risk customers: {high_risk_mask.sum()}")
print(f"Low-risk customers: {low_risk_mask.sum()}")

# Compare average SHAP values
if high_risk_mask.sum() > 0 and low_risk_mask.sum() > 0:
    high_risk_shap = shap_values[high_risk_mask].mean(axis=0)
    low_risk_shap = shap_values[low_risk_mask].mean(axis=0)
    
    comparison_df = pd.DataFrame({
        'feature': predictor.feature_names,
        'high_risk_shap': high_risk_shap,
        'low_risk_shap': low_risk_shap,
        'difference': high_risk_shap - low_risk_shap
    }).sort_values('difference', ascending=False)
    
    print("\nTop features differentiating high-risk from low-risk:")
    print(comparison_df.head(10))

## Summary

In this notebook, we:

1. ✓ Loaded a trained model from the repository
2. ✓ Made single customer predictions
3. ✓ Performed batch predictions
4. ✓ Computed SHAP explanations
5. ✓ Visualized feature contributions
6. ✓ Verified SHAP additivity property
7. ✓ Analyzed differences between risk groups

## Key Insights

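- SHAP values attribute each prediction to individual features, so every churn score comes with an explanation
- Feature contributions vary by customer, so the global ranking from Step 6 does not describe every individual prediction
- The additivity check confirms that each customer's SHAP values sum to the gap between the prediction and the base value
- Waterfall, force, and summary plots help communicate model behavior to stakeholders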

## Next Steps

- **Deploy Predictions**: Integrate predictions into business workflows
- **Monitor Performance**: Track model accuracy over time (see next notebook)
- **Refine Threshold**: Adjust high-risk threshold based on business costs
- **Feature Engineering**: Use SHAP insights to create better features
- **A/B Testing**: Compare different model versions in production
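
For the threshold refinement item above, one simple approach is to scan candidate thresholds and keep the one with the lowest total misclassification cost on labeled data. The sketch below is illustrative only: `pick_threshold`, the cost figures, and the sample probabilities are all hypothetical, and real costs would come from the business (for example, lost revenue per missed churner versus the cost of a retention offer).

```python
def pick_threshold(probabilities, labels, cost_false_negative, cost_false_positive,
                   candidates=None):
    """Return the (threshold, total_cost) pair with the lowest total cost."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

    best_threshold, best_cost = None, float("inf")
    for threshold in candidates:
        pairs = list(zip(probabilities, labels))
        # Missed churners: actual churn but scored below the threshold
        false_negatives = sum(1 for p, y in pairs if y == 1 and p < threshold)
        # False alarms: no churn but scored at or above the threshold
        false_positives = sum(1 for p, y in pairs if y == 0 and p >= threshold)
        cost = (false_negatives * cost_false_negative
                + false_positives * cost_false_positive)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Made-up validation scores and labels (1 = churned)
probs = [0.10, 0.35, 0.42, 0.55, 0.62, 0.80, 0.91, 0.22]
labels = [0, 0, 1, 0, 1, 1, 1, 0]

threshold, cost = pick_threshold(
    probs, labels, cost_false_negative=500.0, cost_false_positive=50.0
)
print(f"Chosen threshold: {threshold:.2f} (total cost: {cost:.0f})")
```

When a missed churner costs much more than a wasted retention offer, as assumed here, the chosen threshold tends to fall below 0.5. The threshold should be selected on a validation set, not on the data used for final evaluation.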