# 🔍 Exercise 7: Local Validation Simulation

**Week 7 | AI in Healthcare Curriculum**

---

## Learning Objectives

By completing this exercise, you will:

- 🎯 Understand why local validation is essential before AI deployment
- 🎯 Validate a "vendor" model on local Australian data
- 🎯 Compare vendor-claimed vs locally-validated performance
- 🎯 Analyse performance gaps and investigate causes
- 🎯 Apply a structured go/no-go decision framework

---

## ⏱️ Estimated Time: 90 minutes

---

## Context

**Scenario:** A vendor approaches your Australian health service with a deterioration prediction model. The model was developed and validated using data from large US academic medical centres. The vendor provides impressive performance statistics.

**Your task:** Before recommending deployment, you must validate the model on your local patient population to determine whether the vendor's claimed performance translates to your setting.

**Why this matters:**
- Models trained on one population often perform worse on others
- Different clinical practices, patient demographics, and data systems affect performance
- Published or vendor-reported performance may not reflect your reality
- Local validation is now widely regarded as an essential step before deploying clinical AI

## Part 1: Setup and Load the "Vendor" Model

In [None]:
# Setup - run this first!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    roc_auc_score, confusion_matrix, roc_curve,
    precision_recall_curve, average_precision_score
)
from sklearn.calibration import calibration_curve
import pickle
import io
import base64

# Set display options
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("✅ Setup complete!")

In [None]:
# Simulate a "vendor" model trained on US data
# In reality, you would receive this from the vendor

def create_vendor_model():
    """
    Create a simulated vendor model trained on US hospital data.
    This represents what you might receive from a commercial vendor.
    """
    np.random.seed(123)  # Different seed for "US" data
    
    # Generate "US" training data with slightly different characteristics
    n_us = 5000
    
    us_data = pd.DataFrame({
        'age': np.random.normal(58, 18, n_us).clip(18, 95).astype(int),  # Older population
        'heart_rate': np.random.normal(82, 15, n_us).clip(40, 180).astype(int),
        'respiratory_rate': np.random.normal(17, 4, n_us).clip(8, 40).astype(int),
        'systolic_bp': np.random.normal(128, 20, n_us).clip(70, 200).astype(int),
        'temperature_f': np.random.normal(98.6, 1.2, n_us).clip(95, 105),  # Fahrenheit!
        'oxygen_saturation': np.random.normal(95, 4, n_us).clip(80, 100).astype(int),
        'bmi': np.random.normal(29, 6, n_us).clip(15, 50),  # Higher BMI
        'los_hours': np.random.exponential(24, n_us).clip(1, 200),
    })
    
    # US outcome model (different risk factors)
    us_risk = (
        0.015 * (us_data['age'] - 50) / 10 +
        0.008 * (us_data['heart_rate'] - 80) / 20 +
        0.005 * (us_data['bmi'] - 25) / 5 +
        -0.003 * (us_data['oxygen_saturation'] - 95) +
        np.random.normal(0, 0.04, n_us)
    )
    us_outcomes = (us_risk > 0.12).astype(int)
    
    # Split first so that "vendor reported" metrics come from held-out US data
    features = ['age', 'heart_rate', 'respiratory_rate', 'systolic_bp', 
                'oxygen_saturation', 'bmi']
    
    X_train, X_test, y_train, y_test = train_test_split(
        us_data[features], us_outcomes, test_size=0.2, random_state=42
    )
    
    # Train the "vendor" model on the development cohort only
    model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=5,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Calculate "vendor reported" performance on the US test data
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)
    
    vendor_performance = {
        'AUC': roc_auc_score(y_test, y_prob),
        'Sensitivity': recall_score(y_test, y_pred),
        'Specificity': 1 - y_pred[y_test == 0].mean(),
        'PPV': precision_score(y_test, y_pred),
        'N_development': len(X_train),
        'N_validation': len(X_test)
    }
    
    return model, features, vendor_performance

# Create the vendor model
vendor_model, expected_features, vendor_claimed = create_vendor_model()

print("="*60)
print("VENDOR MODEL SPECIFICATIONS")
print("="*60)
print(f"\nModel Name: DeterioratePredict Pro v2.1")
print(f"Vendor: HealthAI Solutions Inc.")
print(f"Development Data: US Academic Medical Centres (2019-2023)")
print(f"\nModel Type: {type(vendor_model).__name__}")
print(f"\nRequired Input Features:")
for feat in expected_features:
    print(f"  • {feat}")

print(f"\n" + "="*60)
print("VENDOR-CLAIMED PERFORMANCE (US Validation Cohort)")
print("="*60)
print(f"\nDevelopment cohort: n = {vendor_claimed['N_development']:,}")
print(f"Validation cohort: n = {vendor_claimed['N_validation']:,}")
print(f"\nPerformance Metrics:")
print(f"  AUC-ROC: {vendor_claimed['AUC']:.3f}")
print(f"  Sensitivity: {vendor_claimed['Sensitivity']:.1%}")
print(f"  Specificity: {vendor_claimed['Specificity']:.1%}")
print(f"  PPV: {vendor_claimed['PPV']:.1%}")

## Part 2: Prepare Local Australian Validation Data

Now let's load your local hospital data and check compatibility with the vendor model.
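
A compatibility check like this can be wrapped in small reusable helpers. The sketch below is illustrative only: `check_features` and `fahrenheit_to_celsius` are hypothetical helper names (not part of any vendor software), and the toy values are made up.

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    """Convert Fahrenheit to Celsius (works on scalars and pandas Series)."""
    return (temp_f - 32) * 5 / 9

def check_features(df, required):
    """Return the required feature names that are absent from df, sorted."""
    return sorted(set(required) - set(df.columns))

# Toy example with made-up values (hypothetical, not real patient data)
toy = pd.DataFrame({'age': [64, 71], 'temperature_f': [98.6, 101.3]})
toy['temperature_c'] = fahrenheit_to_celsius(toy['temperature_f']).round(1)

print(check_features(toy, ['age', 'heart_rate']))  # features the model needs but the data lacks
print(toy['temperature_c'].tolist())
```

In this exercise the same idea would apply to `au_data` and `expected_features`; checking units still needs a human to read the data dictionary, since a column name alone does not guarantee the unit.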

In [None]:
# Generate synthetic Australian hospital data
def generate_australian_data(n_patients=1000):
    """
    Generate synthetic Australian hospital data with:
    - Different demographic patterns
    - Metric units (Celsius, not Fahrenheit)
    - Australian-specific characteristics
    """
    np.random.seed(42)
    
    # Demographics - Australian patterns
    ages = np.random.normal(52, 22, n_patients).clip(18, 95).astype(int)  # Younger than US
    
    # Indigenous status (for later subgroup analysis)
    indigenous = np.random.choice(
        ['Non-Indigenous', 'Indigenous'],
        n_patients,
        p=[0.95, 0.05]
    )
    
    # Remoteness
    remoteness = np.random.choice(
        ['Metropolitan', 'Regional', 'Remote'],
        n_patients,
        p=[0.70, 0.25, 0.05]
    )
    
    # Clinical data - metric units, different baselines
    data = pd.DataFrame({
        'patient_id': [f'AU{i:05d}' for i in range(n_patients)],
        'age': ages,
        'indigenous_status': indigenous,
        'remoteness': remoteness,
        'heart_rate': np.random.normal(78, 16, n_patients).clip(40, 180).astype(int),
        'respiratory_rate': np.random.normal(16, 5, n_patients).clip(8, 40).astype(int),
        'systolic_bp': np.random.normal(122, 18, n_patients).clip(70, 200).astype(int),
        'temperature_c': np.random.normal(37.0, 0.6, n_patients).clip(35, 41),  # Celsius!
        'oxygen_saturation': np.random.normal(96, 3, n_patients).clip(80, 100).astype(int),
        'bmi': np.random.normal(27, 5, n_patients).clip(15, 50),  # Lower than US
        'comorbidity_count': np.random.poisson(1.2, n_patients).clip(0, 8),
    })
    
    # Australian outcomes (different risk profile)
    au_risk = (
        0.012 * (data['age'] - 50) / 10 +
        0.006 * (data['heart_rate'] - 78) / 20 +
        0.008 * data['comorbidity_count'] +
        -0.004 * (data['oxygen_saturation'] - 96) +
        0.02 * (data['remoteness'] == 'Remote').astype(int) +
        np.random.normal(0, 0.03, n_patients)
    )
    data['adverse_outcome'] = (au_risk > 0.10).astype(int)
    
    return data

# Generate local data
au_data = generate_australian_data(1000)

print("Local Australian Hospital Data Loaded")
print("="*60)
print(f"Total patients: {len(au_data):,}")
print(f"Adverse outcome rate: {au_data['adverse_outcome'].mean()*100:.1f}%")
print("\nFirst 5 rows:")
au_data.head()

In [None]:
# Check data compatibility with vendor model
print("="*60)
print("DATA COMPATIBILITY CHECK")
print("="*60)

print("\nVendor model expects these features:")
for feat in expected_features:
    print(f"  • {feat}")

print("\nLocal data columns:")
print(au_data.columns.tolist())

# Check for missing/different features
local_cols = set(au_data.columns)
required = set(expected_features)

missing = required - local_cols
extra = local_cols - required

print(f"\n⚠️ COMPATIBILITY ISSUES:")
if missing:
    print(f"  Missing required features: {missing}")
else:
    print(f"  No missing features ✓")

# Check for unit differences
print("\n⚠️ POTENTIAL UNIT MISMATCH:")
print("  Local data has 'temperature_c' (Celsius)")
print("  Vendor's US data likely records temperature in Fahrenheit")
print("  ➜ Conversion may be required!")

In [None]:
# Prepare local data for the vendor model
print("Preparing local data for vendor model...")
print("-" * 50)

# Create a copy for model input
au_model_data = au_data.copy()

# Handle missing/different features
# Note: We'll use the features the vendor expects, not temperature
# This simulates real-world data compatibility challenges

# Check we have the required features
model_features = au_model_data[['age', 'heart_rate', 'respiratory_rate', 
                                 'systolic_bp', 'oxygen_saturation', 'bmi']]

print(f"\nPrepared {len(model_features)} patients for validation")
print(f"Using features: {model_features.columns.tolist()}")
print("\n✅ Data prepared for vendor model")

## Part 3: Compare Populations

Before validating, let's understand how our local population differs from the vendor's development population.
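
One way to put a number on "how different" two populations are is the standardised mean difference (SMD): the difference in means divided by a pooled standard deviation. The sketch below is a minimal illustration under stated assumptions: the function name `standardised_mean_difference` is hypothetical, and the two samples are simulated from the age distributions used in this exercise rather than taken from real cohorts.

```python
import numpy as np

def standardised_mean_difference(a, b):
    """SMD = (mean_a - mean_b) / pooled SD, pooling as the average of the two variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Simulated ages (assumption: same means/SDs as the synthetic cohorts in this notebook)
rng = np.random.default_rng(0)
us_age = rng.normal(58, 18, 2000)
au_age = rng.normal(52, 22, 2000)

smd = standardised_mean_difference(us_age, au_age)
print(f"Age SMD (US vs AU): {smd:.2f}")
```

A commonly quoted rule of thumb treats an absolute SMD above roughly 0.1 as a noteworthy imbalance, although the right threshold depends on the feature and how strongly the model relies on it.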

In [None]:
# Compare population characteristics
print("="*60)
print("POPULATION COMPARISON")
print("="*60)

# Recreate US summary stats for comparison
us_summary = {
    'Age (mean)': '58 years',
    'Age (SD)': '18 years',
    'Heart Rate (mean)': '82 bpm',
    'BMI (mean)': '29 kg/m²',
    'SpO2 (mean)': '95%',
    'Outcome Rate': '~18%'
}

au_summary = {
    'Age (mean)': f"{au_data['age'].mean():.0f} years",
    'Age (SD)': f"{au_data['age'].std():.0f} years",
    'Heart Rate (mean)': f"{au_data['heart_rate'].mean():.0f} bpm",
    'BMI (mean)': f"{au_data['bmi'].mean():.1f} kg/m²",
    'SpO2 (mean)': f"{au_data['oxygen_saturation'].mean():.0f}%",
    'Outcome Rate': f"{au_data['adverse_outcome'].mean()*100:.1f}%"
}

comparison = pd.DataFrame({
    'US (Vendor Data)': us_summary,
    'Australian (Local)': au_summary
})

print("\n" + comparison.to_string())

print("\n💡 Key Differences:")
print("  • Australian patients are younger on average")
print("  • Australian BMI is lower than US")
print("  • Baseline outcome rates may differ")
print("  • These differences may affect model performance!")

In [None]:
# Visualise population differences
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Simulate US distributions for comparison
np.random.seed(123)
us_ages = np.random.normal(58, 18, 1000).clip(18, 95)
us_bmi = np.random.normal(29, 6, 1000).clip(15, 50)
us_hr = np.random.normal(82, 15, 1000).clip(40, 180)
us_spo2 = np.random.normal(95, 4, 1000).clip(80, 100)

# Age comparison
axes[0, 0].hist(us_ages, bins=20, alpha=0.6, label='US (Vendor)', color='steelblue')
axes[0, 0].hist(au_data['age'], bins=20, alpha=0.6, label='Australian (Local)', color='coral')
axes[0, 0].set_xlabel('Age (years)')
axes[0, 0].set_ylabel('Count')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].legend()

# BMI comparison
axes[0, 1].hist(us_bmi, bins=20, alpha=0.6, label='US (Vendor)', color='steelblue')
axes[0, 1].hist(au_data['bmi'], bins=20, alpha=0.6, label='Australian (Local)', color='coral')
axes[0, 1].set_xlabel('BMI (kg/m²)')
axes[0, 1].set_ylabel('Count')
axes[0, 1].set_title('BMI Distribution')
axes[0, 1].legend()

# Heart rate comparison
axes[1, 0].hist(us_hr, bins=20, alpha=0.6, label='US (Vendor)', color='steelblue')
axes[1, 0].hist(au_data['heart_rate'], bins=20, alpha=0.6, label='Australian (Local)', color='coral')
axes[1, 0].set_xlabel('Heart Rate (bpm)')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Heart Rate Distribution')
axes[1, 0].legend()

# SpO2 comparison
axes[1, 1].hist(us_spo2, bins=20, alpha=0.6, label='US (Vendor)', color='steelblue')
axes[1, 1].hist(au_data['oxygen_saturation'], bins=20, alpha=0.6, label='Australian (Local)', color='coral')
axes[1, 1].set_xlabel('SpO2 (%)')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Oxygen Saturation Distribution')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

print("\n⚠️ Population shift detected! This is a form of 'dataset shift'")
print("   that commonly causes AI models to underperform when deployed.")

## Part 4: External Validation

Now let's run the vendor model on our local data and compare performance.
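
A single AUC from a modest local sample can be misleading, so it helps to report an interval alongside the point estimate. The sketch below shows a percentile bootstrap for AUC. It is an illustration under assumptions: `bootstrap_auc_ci` is a hypothetical helper, the toy labels and scores are simulated, and the percentile method is only one of several ways to build such an interval.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_prob, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUC."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample patients with replacement
        # AUC is undefined when a resample contains only one class, so skip it
        if y_true[idx].min() == y_true[idx].max():
            continue
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Simulated toy data (assumption: not the exercise data)
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, 300)
p_demo = np.clip(0.3 * y_demo + rng.uniform(0, 0.7, 300), 0, 1)

lo, hi = bootstrap_auc_ci(y_demo, p_demo)
print(f"AUC = {roc_auc_score(y_demo, p_demo):.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

In this notebook the same helper could be called on `y_local` and `local_probabilities` once they exist. When adverse outcomes are rare the interval can be wide, which is worth stating in any governance report.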

In [None]:
# Run vendor model on local data
print("Running vendor model on local Australian data...")
print("="*60)

# Get predictions
X_local = au_data[expected_features]
y_local = au_data['adverse_outcome']

# Predictions and probabilities
local_predictions = vendor_model.predict(X_local)
local_probabilities = vendor_model.predict_proba(X_local)[:, 1]

# Calculate local performance metrics
local_performance = {
    'AUC': roc_auc_score(y_local, local_probabilities),
    'Sensitivity': recall_score(y_local, local_predictions),
    'Specificity': 1 - local_predictions[y_local == 0].mean(),
    'PPV': precision_score(y_local, local_predictions),
    'N': len(y_local)
}

print("\n✅ Validation complete!")

In [None]:
# Compare vendor claimed vs local validated performance
print("="*70)
print("PERFORMANCE COMPARISON: VENDOR CLAIMED vs LOCAL VALIDATED")
print("="*70)

comparison_metrics = ['AUC', 'Sensitivity', 'Specificity', 'PPV']

print(f"\n{'Metric':<15} {'Vendor (US)':<15} {'Local (AU)':<15} {'Difference':<15}")
print("-" * 60)

performance_gaps = {}
for metric in comparison_metrics:
    vendor_val = vendor_claimed[metric]
    local_val = local_performance[metric]
    diff = local_val - vendor_val
    performance_gaps[metric] = diff
    
    # Flag the size of the drop (check the larger threshold first,
    # otherwise the 🔴 branch can never be reached)
    diff_str = f"{diff:+.3f}"
    if diff < -0.10:
        diff_str += " 🔴"
    elif diff < -0.05:
        diff_str += " ⚠️"
    
    print(f"{metric:<15} {vendor_val:<15.3f} {local_val:<15.3f} {diff_str:<15}")

print("\n" + "-" * 60)
print(f"Vendor validation N: {vendor_claimed['N_validation']:,}")
print(f"Local validation N: {local_performance['N']:,}")

In [None]:
# Visualise the performance gap
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart comparison
x = np.arange(len(comparison_metrics))
width = 0.35

vendor_vals = [vendor_claimed[m] for m in comparison_metrics]
local_vals = [local_performance[m] for m in comparison_metrics]

bars1 = axes[0].bar(x - width/2, vendor_vals, width, label='Vendor (US)', color='steelblue')
bars2 = axes[0].bar(x + width/2, local_vals, width, label='Local (AU)', color='coral')

axes[0].set_xlabel('Metric')
axes[0].set_ylabel('Score')
axes[0].set_title('Performance: Vendor Claimed vs Local Validated')
axes[0].set_xticks(x)
axes[0].set_xticklabels(comparison_metrics)
axes[0].set_ylim(0, 1)
axes[0].axhline(0.8, color='green', linestyle='--', alpha=0.5, label='Acceptable threshold')
axes[0].legend()  # called after axhline so the threshold line appears in the legend

# ROC curve comparison (local only - we don't have US raw data)
fpr, tpr, _ = roc_curve(y_local, local_probabilities)
axes[1].plot(fpr, tpr, 'coral', linewidth=2, 
             label=f'Local AUC = {local_performance["AUC"]:.3f}')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].fill_between(fpr, tpr, alpha=0.3, color='coral')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title(f'Local Validation ROC Curve\n(Vendor claimed AUC: {vendor_claimed["AUC"]:.3f})')
axes[1].legend()

plt.tight_layout()
plt.show()

# Summary interpretation
auc_gap = performance_gaps['AUC']
if auc_gap < -0.10:
    print("\n🔴 SIGNIFICANT PERFORMANCE DEGRADATION DETECTED")
    print(f"   AUC dropped by {abs(auc_gap):.3f} from vendor claims")
elif auc_gap < -0.05:
    print("\n⚠️ MODERATE PERFORMANCE DEGRADATION DETECTED")
    print(f"   AUC dropped by {abs(auc_gap):.3f} from vendor claims")
else:
    print("\n✅ Performance reasonably consistent with vendor claims")

## Part 5: Subgroup Analysis

Does the model perform consistently across different patient subgroups in our local population?
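
Before reading too much into subgroup metrics, it is worth counting how many adverse outcomes each subgroup actually contains, because AUC and sensitivity are unstable when events are scarce. The sketch below is illustrative: `subgroup_event_counts` is a hypothetical helper, the toy data are made up, and the `min_events` cut-off is an arbitrary assumption rather than an established standard.

```python
import pandas as pd

def subgroup_event_counts(df, group_column, outcome_column, min_events=10):
    """Count patients and outcome events per subgroup and flag small-event groups."""
    summary = (
        df.groupby(group_column)[outcome_column]
          .agg(n='size', events='sum')
          .reset_index()
    )
    summary['too_few_events'] = summary['events'] < min_events
    return summary

# Made-up toy data (assumption: column names mirror those used in this exercise)
toy = pd.DataFrame({
    'remoteness': ['Metropolitan'] * 6 + ['Remote'] * 4,
    'adverse_outcome': [1, 0, 1, 1, 0, 0, 0, 1, 0, 0],
})

result = subgroup_event_counts(toy, 'remoteness', 'adverse_outcome', min_events=2)
print(result)
```

A subgroup flagged here is not evidence that the model works for that group. It means the local sample cannot answer the question, and more data or a longer silent-running period may be needed.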

In [None]:
# Add predictions to dataframe for subgroup analysis
au_data['predicted'] = local_predictions
au_data['probability'] = local_probabilities

def subgroup_performance(df, group_column):
    """Calculate performance metrics by subgroup."""
    results = []
    for group in df[group_column].unique():
        subset = df[df[group_column] == group]
        if len(subset) < 20:
            continue
        
        y_true = subset['adverse_outcome']
        y_pred = subset['predicted']
        y_prob = subset['probability']
        
        try:
            auc = roc_auc_score(y_true, y_prob)
        except ValueError:
            # AUC is undefined when a subgroup contains only one outcome class
            auc = np.nan
            
        results.append({
            'Group': group,
            'N': len(subset),
            'Outcome Rate': y_true.mean(),
            'AUC': auc,
            'Sensitivity': recall_score(y_true, y_pred, zero_division=0),
            'PPV': precision_score(y_true, y_pred, zero_division=0)
        })
    
    return pd.DataFrame(results)

# Analyse by remoteness
print("="*60)
print("SUBGROUP ANALYSIS: BY REMOTENESS")
print("="*60)
remoteness_perf = subgroup_performance(au_data, 'remoteness')
print(remoteness_perf.round(3).to_string(index=False))

In [None]:
# Analyse by Indigenous status
print("="*60)
print("SUBGROUP ANALYSIS: BY INDIGENOUS STATUS")
print("="*60)
indigenous_perf = subgroup_performance(au_data, 'indigenous_status')
print(indigenous_perf.round(3).to_string(index=False))

# Calculate disparity
if len(indigenous_perf) > 1:
    sens_values = indigenous_perf.set_index('Group')['Sensitivity']
    if 'Indigenous' in sens_values.index and 'Non-Indigenous' in sens_values.index:
        gap = sens_values['Non-Indigenous'] - sens_values['Indigenous']
        print(f"\n⚠️ Sensitivity gap: {gap*100:+.1f} percentage points")

In [None]:
# Create age groups for analysis
au_data['age_group'] = pd.cut(au_data['age'], 
                               bins=[0, 40, 60, 80, 100],
                               labels=['<40', '40-60', '60-80', '80+'],
                               right=False)  # left-closed bins so '<40' excludes age 40

print("="*60)
print("SUBGROUP ANALYSIS: BY AGE GROUP")
print("="*60)
age_perf = subgroup_performance(au_data, 'age_group')
print(age_perf.round(3).to_string(index=False))

In [None]:
# Visualise subgroup performance
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# By remoteness
rem_sorted = remoteness_perf.sort_values('AUC', ascending=True)
axes[0].barh(rem_sorted['Group'], rem_sorted['AUC'], color='steelblue')
axes[0].axvline(local_performance['AUC'], color='red', linestyle='--', label='Overall')
axes[0].axvline(vendor_claimed['AUC'], color='green', linestyle='--', label='Vendor claim')
axes[0].set_xlabel('AUC')
axes[0].set_title('AUC by Remoteness')
axes[0].legend()
axes[0].set_xlim(0.5, 1.0)

# By Indigenous status
ind_sorted = indigenous_perf.sort_values('AUC', ascending=True)
axes[1].barh(ind_sorted['Group'], ind_sorted['AUC'], color='steelblue')
axes[1].axvline(local_performance['AUC'], color='red', linestyle='--', label='Overall')
axes[1].axvline(vendor_claimed['AUC'], color='green', linestyle='--', label='Vendor claim')
axes[1].set_xlabel('AUC')
axes[1].set_title('AUC by Indigenous Status')
axes[1].legend()
axes[1].set_xlim(0.5, 1.0)

# By age group
age_sorted = age_perf.sort_values('AUC', ascending=True)
axes[2].barh(age_sorted['Group'].astype(str), age_sorted['AUC'], color='steelblue')
axes[2].axvline(local_performance['AUC'], color='red', linestyle='--', label='Overall')
axes[2].axvline(vendor_claimed['AUC'], color='green', linestyle='--', label='Vendor claim')
axes[2].set_xlabel('AUC')
axes[2].set_title('AUC by Age Group')
axes[2].legend()
axes[2].set_xlim(0.5, 1.0)

plt.tight_layout()
plt.show()

print("\n💡 Subgroup analysis reveals whether the model works for ALL your patients,")
print("   not just the 'average' patient.")

## Part 6: Calibration Analysis

Even if the model discriminates well (high AUC), are its probability estimates accurate?
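
If calibration turns out to be poor, one common remedy is logistic recalibration (Platt-style scaling): refit an intercept and slope on the logit of the original probabilities. The sketch below illustrates the idea on simulated data. It is not the vendor's method: the `logit` helper and the deliberately miscalibrated `p_overconfident` scores are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

# Simulated cohort: true risks, observed outcomes, and scores that double the true risk
rng = np.random.default_rng(0)
true_risk = rng.uniform(0.02, 0.30, 2000)
y_demo = rng.binomial(1, true_risk)
p_overconfident = np.clip(true_risk * 2, 0, 1)

# Logistic recalibration: regress outcomes on the logit of the original scores
# (large C means the fit is almost unregularised)
recal = LogisticRegression(C=1e6)
recal.fit(logit(p_overconfident).reshape(-1, 1), y_demo)
p_recal = recal.predict_proba(logit(p_overconfident).reshape(-1, 1))[:, 1]

print(f"Mean predicted vs observed: {p_overconfident.mean():.3f} vs {y_demo.mean():.3f}")
print(f"Brier score before: {brier_score_loss(y_demo, p_overconfident):.4f}")
print(f"Brier score after:  {brier_score_loss(y_demo, p_recal):.4f}")
```

Recalibration leaves the ranking of patients unchanged, so AUC stays the same. Here the recalibration is fitted and scored on the same simulated data, which is optimistic; in practice it should be fitted on one local sample and checked on another.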

In [None]:
# Calibration analysis
print("="*60)
print("CALIBRATION ANALYSIS")
print("="*60)

# Calculate calibration curve
prob_true, prob_pred = calibration_curve(y_local, local_probabilities, n_bins=10)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Calibration plot
axes[0].plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
axes[0].plot(prob_pred, prob_true, 'o-', color='coral', label='Model')
axes[0].set_xlabel('Mean Predicted Probability')
axes[0].set_ylabel('Fraction of Positives')
axes[0].set_title('Calibration Curve')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Prediction distribution
axes[1].hist(local_probabilities[y_local == 0], bins=20, alpha=0.6, 
             label='No deterioration', color='steelblue')
axes[1].hist(local_probabilities[y_local == 1], bins=20, alpha=0.6,
             label='Deterioration', color='coral')
axes[1].set_xlabel('Predicted Probability')
axes[1].set_ylabel('Count')
axes[1].set_title('Distribution of Predicted Probabilities')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n📊 Calibration Interpretation:")
print("  • If the model predicts 30% probability, ~30% of those patients")
print("    should actually deteriorate if well-calibrated")
print("  • Poor calibration means probability estimates are unreliable")
print("  • May need recalibration before deployment")

## Part 7: Go/No-Go Decision Framework

Based on your validation, complete the decision framework for governance.

In [None]:
# Generate validation report
print("="*70)
print("LOCAL VALIDATION REPORT")
print("="*70)

validation_report = {
    'Model Name': 'DeterioratePredict Pro v2.1',
    'Vendor': 'HealthAI Solutions Inc.',
    'Validation Date': '2024-XX-XX',
    'Local Sample Size': len(au_data),
    'Validation Setting': 'Australian Regional Hospital',
    '---': '---',
    'Vendor Claimed AUC': f"{vendor_claimed['AUC']:.3f}",
    'Local Validated AUC': f"{local_performance['AUC']:.3f}",
    'Performance Gap': f"{performance_gaps['AUC']:+.3f}",
    '----': '----',
    'Vendor Claimed Sensitivity': f"{vendor_claimed['Sensitivity']:.1%}",
    'Local Validated Sensitivity': f"{local_performance['Sensitivity']:.1%}",
    'Sensitivity Gap': f"{performance_gaps['Sensitivity']*100:+.1f} percentage points",
}

for key, value in validation_report.items():
    if key.startswith('-'):
        print("-" * 50)
    else:
        print(f"{key}: {value}")

# Decision criteria
print("\n" + "="*70)
print("DECISION CRITERIA ASSESSMENT")
print("="*70)

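# Pool every subgroup table from Part 5 so the AUC floor covers all of them,
# not only remoteness
all_subgroup_auc = pd.concat(
    [remoteness_perf, indigenous_perf, age_perf]
)['AUC'].dropna()

criteria = [
    ('AUC ≥ 0.75', local_performance['AUC'] >= 0.75),
    ('Sensitivity ≥ 70%', local_performance['Sensitivity'] >= 0.70),
    ('AUC gap ≤ 0.10 from vendor', abs(performance_gaps['AUC']) <= 0.10),
    ('No subgroup AUC < 0.70', (all_subgroup_auc >= 0.70).all()),
]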

print(f"\n{'Criterion':<40} {'Met?':<10}")
print("-" * 50)
all_met = True
for criterion, met in criteria:
    status = "✅ Yes" if met else "❌ No"
    if not met:
        all_met = False
    print(f"{criterion:<40} {status:<10}")

In [None]:
# Decision framework
print("\n" + "="*70)
print("GO / NO-GO DECISION FRAMEWORK")
print("="*70)

if all_met:
    recommendation = "CONDITIONAL GO"
    color = "🟡"
elif local_performance['AUC'] < 0.70:
    recommendation = "NO-GO"
    color = "🔴"
else:
    recommendation = "DEFER - Further evaluation required"
    color = "🟠"

print(f"\n{color} RECOMMENDATION: {recommendation}")

print("\nConditions for deployment (if approved):")
conditions = [
    "1. 3-month silent running period before clinical use",
    "2. Mandatory clinical override capability",
    "3. Weekly performance monitoring by subgroup",
    "4. Clear escalation pathway for performance degradation",
    "5. Staff training on model limitations",
    "6. Patient notification that AI is being used"
]

for condition in conditions:
    print(f"  {condition}")

print("\nQuestions for vendor:")
questions = [
    "1. Why does performance differ in Australian population?",
    "2. Can model be recalibrated on local data?",
    "3. What ongoing support is provided?",
    "4. How often is the model updated?",
    "5. What liability does vendor accept for failures?"
]

for question in questions:
    print(f"  {question}")

## Part 8: Your Validation Report

Complete the validation report template below for governance submission.

In [None]:
# ===== YOUR VALIDATION REPORT =====

your_report = """
============================================================
LOCAL VALIDATION REPORT FOR CLINICAL GOVERNANCE
============================================================

AI SYSTEM: DeterioratePredict Pro v2.1
VENDOR: HealthAI Solutions Inc.
EVALUATOR: [Your name]
DATE: [Date]
VALIDATION SETTING: [Your health service]

------------------------------------------------------------
1. EXECUTIVE SUMMARY
------------------------------------------------------------
[2-3 sentence summary of validation findings and recommendation]



------------------------------------------------------------
2. PERFORMANCE COMPARISON
------------------------------------------------------------

Metric          Vendor (US)    Local (AU)    Gap
-------         -----------    ----------    ---
AUC             [value]        [value]       [value]
Sensitivity     [value]        [value]       [value]
Specificity     [value]        [value]       [value]

------------------------------------------------------------
3. SUBGROUP ANALYSIS
------------------------------------------------------------
[Describe any concerning disparities by subgroup]



------------------------------------------------------------
4. POPULATION DIFFERENCES
------------------------------------------------------------
[Key differences between vendor development population and local]



------------------------------------------------------------
5. RECOMMENDATION
------------------------------------------------------------
[ ] APPROVE for deployment
[ ] APPROVE with conditions
[ ] DEFER pending [specify]
[ ] DO NOT APPROVE

Rationale:



------------------------------------------------------------
6. CONDITIONS FOR DEPLOYMENT (if applicable)
------------------------------------------------------------



------------------------------------------------------------
7. MONITORING PLAN
------------------------------------------------------------



============================================================
"""

print(your_report)

## Part 9: Reflection Questions

In [None]:
# ===== YOUR REFLECTIONS =====

reflections = """
1. How much performance degradation from vendor claims is acceptable?
   What if AUC drops from 0.85 to 0.78? To 0.72? Where's the line?
   Your answer:
   

2. If the model works well overall but poorly for Indigenous patients
   or remote populations, should you still deploy it?
   Your answer:
   

3. What additional information would you want from the vendor before
   making a final decision?
   Your answer:
   

4. What would 'silent running' look like in your clinical setting?
   How would you monitor performance without affecting care?
   Your answer:
   

5. If you had to explain this validation to a patient, what would
   you tell them?
   Your answer:
   

"""

print(reflections)

## 📝 Deliverable

**For your portfolio:**

Complete the validation report (Part 8) as if you were submitting it to your clinical governance committee. Include:

1. Performance comparison (vendor vs local)
2. Subgroup analysis findings
3. Clear go/no-go recommendation with rationale
4. Conditions for deployment (if recommending approval)
5. Monitoring plan

Submit via LMS by the Week 7 deadline.

## 🏁 Summary

In this exercise, you learned:

✅ **Local validation is essential** - vendor performance rarely translates directly

✅ **Population differences matter** - age, demographics, clinical practice all affect performance

✅ **Subgroup analysis reveals hidden failures** - overall performance can mask disparities

✅ **Calibration affects clinical utility** - probability estimates may need adjustment

✅ **Structured frameworks support decisions** - governance needs clear criteria

**Key takeaway:** Never deploy a vendor AI without local validation. "Validated" elsewhere doesn't mean validated here.

---

**Next exercise (Week 10):** We'll experiment with large language models (LLMs) for clinical applications.