# ⚖️ Exercise 5: Measuring Algorithmic Fairness

**Week 5 | AI in Healthcare Curriculum**

---

## Learning Objectives

By completing this exercise, you will:

- 🎯 Train a clinical prediction model and evaluate its overall performance
- 🎯 Calculate and interpret performance metrics stratified by demographic groups
- 🎯 Apply formal fairness metrics (demographic parity, equalised odds)
- 🎯 Understand the trade-offs between different fairness definitions
- 🎯 Develop a governance recommendation based on fairness analysis

---

## ⏱️ Estimated Time: 2 hours

---

## Context

In Week 3, you explored a healthcare dataset and identified representation gaps and outcome disparities. Now we'll take the next step: **training a model and measuring whether it performs fairly.**

An AI model might achieve excellent *overall* performance while systematically underperforming for certain patient groups. This exercise will help you:
- Identify such disparities
- Quantify them using formal metrics
- Make informed governance decisions

**Clinical Scenario:** You're evaluating a deterioration prediction model for possible deployment. Before recommending approval, you need to assess whether it performs equitably across different patient populations.

## Part 1: Setup and Data Preparation

In [None]:
# Setup - run this first!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    roc_auc_score, confusion_matrix, classification_report,
    roc_curve
)

# Set display options
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("✅ Setup complete!")

In [None]:
# Generate the same ED dataset from Exercise 3

def generate_ed_dataset(n_patients=2000):
    """
    Generate a synthetic ED dataset with realistic patterns,
    including intentional biases for educational purposes.
    """
    np.random.seed(42)
    
    # Demographics
    ages = np.random.normal(55, 20, n_patients).clip(18, 95).astype(int)
    genders = np.random.choice(['Male', 'Female'], n_patients, p=[0.52, 0.48])
    
    # Indigenous status - intentionally underrepresented
    indigenous_status = np.random.choice(
        ['Non-Indigenous', 'Aboriginal', 'Torres Strait Islander', 'Both', 'Not Stated'],
        n_patients,
        p=[0.85, 0.03, 0.01, 0.005, 0.105]
    )
    
    # Remoteness - urban overrepresentation
    remoteness = np.random.choice(
        ['Major City', 'Inner Regional', 'Outer Regional', 'Remote', 'Very Remote'],
        n_patients,
        p=[0.75, 0.15, 0.07, 0.02, 0.01]
    )
    
    # Socioeconomic status (SEIFA-like decile)
    seifa_decile = np.random.choice(range(1, 11), n_patients, 
                                     p=[0.05, 0.06, 0.07, 0.08, 0.09, 0.11, 0.12, 0.14, 0.14, 0.14])
    
    # Clinical data
    triage_category = np.random.choice([1, 2, 3, 4, 5], n_patients,
                                        p=[0.03, 0.12, 0.35, 0.40, 0.10])
    
    # Vital signs
    heart_rate = np.random.normal(85, 18, n_patients).clip(40, 180)
    respiratory_rate = np.random.normal(18, 5, n_patients).clip(8, 40)
    systolic_bp = np.random.normal(125, 22, n_patients).clip(70, 200)
    temperature = np.random.normal(37.0, 0.7, n_patients).clip(35, 41)
    oxygen_saturation = np.random.normal(96, 3, n_patients).clip(80, 100)
    
    # Comorbidities
    comorbidity_count = np.random.poisson(1.5, n_patients).clip(0, 8)
    
    # Outcomes - with bias related to socioeconomic status and remoteness
    base_risk = (
        0.01 * (ages - 50) / 10 +
        0.02 * (5 - triage_category) +
        0.01 * comorbidity_count +
        0.005 * (heart_rate - 80) / 20 +
        -0.005 * (seifa_decile - 5) +
        0.03 * np.isin(remoteness, ['Remote', 'Very Remote']).astype(int) +
        np.random.normal(0, 0.03, n_patients)
    )
    adverse_outcome = (base_risk > 0.15).astype(int)
    
    # Create DataFrame
    df = pd.DataFrame({
        'patient_id': [f'ED{i:05d}' for i in range(n_patients)],
        'age': ages,
        'gender': genders,
        'indigenous_status': indigenous_status,
        'remoteness': remoteness,
        'seifa_decile': seifa_decile,
        'triage_category': triage_category,
        'heart_rate': heart_rate.round(0).astype(int),
        'respiratory_rate': respiratory_rate.round(0).astype(int),
        'systolic_bp': systolic_bp.round(0).astype(int),
        'temperature': temperature.round(1),
        'oxygen_saturation': oxygen_saturation.round(0).astype(int),
        'comorbidity_count': comorbidity_count,
        'adverse_outcome': adverse_outcome
    })
    
    return df

# Generate dataset
ed_data = generate_ed_dataset(2000)

# Create grouped variables for fairness analysis
ed_data['indigenous_grouped'] = ed_data['indigenous_status'].apply(
    lambda x: 'Indigenous' if x in ['Aboriginal', 'Torres Strait Islander', 'Both'] 
    else ('Not Stated' if x == 'Not Stated' else 'Non-Indigenous')
)

ed_data['remoteness_grouped'] = ed_data['remoteness'].apply(
    lambda x: 'Remote' if x in ['Remote', 'Very Remote'] else 'Non-Remote'
)

ed_data['seifa_grouped'] = pd.cut(ed_data['seifa_decile'], 
                                   bins=[0, 3, 7, 10], 
                                   labels=['Low (1-3)', 'Medium (4-7)', 'High (8-10)'])

print("Emergency Department Dataset Generated")
print("="*60)
print(f"Total presentations: {len(ed_data):,}")
print(f"Adverse outcome rate: {ed_data['adverse_outcome'].mean()*100:.1f}%")
print("\nFirst 5 rows:")
ed_data.head()

## Part 2: Train a Clinical Prediction Model

We'll train a Random Forest classifier to predict adverse outcomes (deterioration). This simulates the type of model a vendor might offer.

In [None]:
# Prepare features for the model
# Note: Apart from age, we're using only clinical features, not demographics
# (gender, Indigenous status, remoteness and SEIFA are excluded).
# This is a common approach to avoid "encoding" demographics directly

feature_columns = [
    'age', 'triage_category', 'heart_rate', 'respiratory_rate',
    'systolic_bp', 'temperature', 'oxygen_saturation', 'comorbidity_count'
]

X = ed_data[feature_columns]
y = ed_data['adverse_outcome']

# Store demographic columns for fairness analysis
demographics = ed_data[['patient_id', 'gender', 'indigenous_grouped', 
                        'remoteness_grouped', 'seifa_grouped']].copy()

print("Features used for prediction:")
print("-" * 40)
for col in feature_columns:
    print(f"  • {col}")

print("\n⚠️ Note: Apart from age, demographic variables are NOT used as model inputs")
print("   However, this doesn't guarantee fair outcomes!")

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Split demographics to match: select rows by index label (.loc), not by position (.iloc)
demographics_test = demographics.loc[X_test.index].copy()

print(f"Training set: {len(X_train)} patients")
print(f"Test set: {len(X_test)} patients")
print(f"\nOutcome distribution in test set: {y_test.mean()*100:.1f}% adverse")

In [None]:
# Train the Random Forest model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    class_weight='balanced'  # Helps with imbalanced outcomes
)

model.fit(X_train, y_train)

print("✅ Model trained successfully!")
print(f"\nModel type: {type(model).__name__}")
print(f"Number of trees: {model.n_estimators}")
print(f"Max depth: {model.max_depth}")

In [None]:
# Generate predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # Probability of adverse outcome

# Add predictions to demographics dataframe for analysis
demographics_test['true_outcome'] = y_test.values
demographics_test['predicted_outcome'] = y_pred
demographics_test['predicted_probability'] = y_prob

print("Predictions generated!")
print(f"Predicted positive rate: {y_pred.mean()*100:.1f}%")
print(f"Actual positive rate: {y_test.mean()*100:.1f}%")

## Part 3: Overall Model Performance

Before examining fairness, let's understand how well the model performs overall.
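
The headline metrics reported in this part can all be read off a confusion matrix. The sketch below shows that arithmetic on a small set of made-up labels; `y_true_toy` and `y_pred_toy` are illustrative assumptions, not values from the exercise dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only, not taken from the exercise dataset)
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels [0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)                 # recall: share of actual positives detected
specificity = tn / (tn + fp)                 # share of actual negatives correctly cleared
ppv = tp / (tp + fp)                         # precision: share of alerts that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"PPV:         {ppv:.3f}")
print(f"Accuracy:    {accuracy:.3f}")
```

Accuracy is dominated by the larger class, which is one reason this exercise focuses on recall when the concern is missed deterioration.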

In [None]:
# Calculate overall performance metrics
print("="*60)
print("OVERALL MODEL PERFORMANCE")
print("="*60)

overall_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision (PPV)': precision_score(y_test, y_pred),
    'Recall (Sensitivity)': recall_score(y_test, y_pred),
    'AUC-ROC': roc_auc_score(y_test, y_prob)
}

for metric, value in overall_metrics.items():
    print(f"  {metric}: {value:.3f}")

print("\n💡 Interpretation:")
print(f"  • The model correctly identifies {overall_metrics['Recall (Sensitivity)']*100:.0f}% of patients who will deteriorate")
print(f"  • When it predicts deterioration, it's correct {overall_metrics['Precision (PPV)']*100:.0f}% of the time")

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['No Deterioration', 'Deterioration'],
            yticklabels=['No Deterioration', 'Deterioration'])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
axes[1].plot(fpr, tpr, 'b-', linewidth=2, label=f'AUC = {overall_metrics["AUC-ROC"]:.3f}')
axes[1].plot([0, 1], [0, 1], 'r--', label='Random classifier')
axes[1].set_xlabel('False Positive Rate (1 - Specificity)')
axes[1].set_ylabel('True Positive Rate (Sensitivity)')
axes[1].set_title('ROC Curve')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n⚠️ Key Question: Does this 'good' overall performance hold for all patient groups?")

## Part 4: Stratified Performance Analysis

Now let's examine whether performance varies across demographic groups. This is the heart of algorithmic fairness analysis.
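
As a minimal illustration of why stratification matters, the sketch below uses a made-up table with two hypothetical groups, `A` and `B`. The column names mirror those used in this exercise, but the data are invented: overall recall looks acceptable while the smaller group fares much worse.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Made-up data: group A is larger and well served, group B is smaller and poorly served
toy = pd.DataFrame({
    'group':             ['A'] * 8 + ['B'] * 4,
    'true_outcome':      [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
    'predicted_outcome': [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
})

# Recall over everyone pooled together
overall_recall = recall_score(toy['true_outcome'], toy['predicted_outcome'])

# Recall computed separately within each group
recall_by_group = {
    name: recall_score(sub['true_outcome'], sub['predicted_outcome'], zero_division=0)
    for name, sub in toy.groupby('group')
}

print(f"Overall recall: {overall_recall:.2f}")
for name, value in recall_by_group.items():
    print(f"  Group {name}: recall = {value:.2f}")
```

Because group A contributes most of the positives, the pooled figure mostly reflects group A's results.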

In [None]:
def calculate_group_metrics(df, group_column):
    """
    Calculate performance metrics for each subgroup.
    """
    results = []
    
    for group_value in df[group_column].unique():
        mask = df[group_column] == group_value
        subset = df[mask]
        
        n = len(subset)
        if n < 10:  # Skip very small groups
            continue
            
        y_true = subset['true_outcome']
        y_pred = subset['predicted_outcome']
        y_prob = subset['predicted_probability']
        
        # Calculate metrics (AUC is undefined if a group contains only one outcome class)
        try:
            auc = roc_auc_score(y_true, y_prob)
        except ValueError:
            auc = np.nan
            
        metrics = {
            'Group': group_value,
            'N': n,
            'Base Rate': y_true.mean(),
            'Accuracy': accuracy_score(y_true, y_pred),
            'Precision': precision_score(y_true, y_pred, zero_division=0),
            'Recall': recall_score(y_true, y_pred, zero_division=0),
            'AUC': auc,
            'Positive Rate': y_pred.mean()  # For demographic parity
        }
        results.append(metrics)
    
    return pd.DataFrame(results)

# Calculate metrics for each demographic grouping
print("Calculating stratified performance metrics...\n")

In [None]:
# Performance by Indigenous Status
print("="*70)
print("PERFORMANCE BY INDIGENOUS STATUS")
print("="*70)

indigenous_metrics = calculate_group_metrics(demographics_test, 'indigenous_grouped')

# Format for display
display_cols = ['Group', 'N', 'Base Rate', 'Recall', 'Precision', 'AUC']
display_df = indigenous_metrics[display_cols].copy()
display_df['Base Rate'] = (display_df['Base Rate'] * 100).round(1).astype(str) + '%'
display_df['Recall'] = (display_df['Recall'] * 100).round(1).astype(str) + '%'
display_df['Precision'] = (display_df['Precision'] * 100).round(1).astype(str) + '%'
display_df['AUC'] = display_df['AUC'].round(3)

print(display_df.to_string(index=False))

# Calculate disparity
if len(indigenous_metrics) > 1:
    max_recall = indigenous_metrics['Recall'].max()
    min_recall = indigenous_metrics['Recall'].min()
    print(f"\n⚠️ Recall disparity: {max_recall*100:.1f}% vs {min_recall*100:.1f}%")
    if min_recall > 0:  # The ratio is undefined when the lowest recall is zero
        print(f"   Ratio: {max_recall/min_recall:.2f}x")

In [None]:
# Performance by Remoteness
print("="*70)
print("PERFORMANCE BY REMOTENESS")
print("="*70)

remoteness_metrics = calculate_group_metrics(demographics_test, 'remoteness_grouped')

display_df = remoteness_metrics[display_cols].copy()
display_df['Base Rate'] = (display_df['Base Rate'] * 100).round(1).astype(str) + '%'
display_df['Recall'] = (display_df['Recall'] * 100).round(1).astype(str) + '%'
display_df['Precision'] = (display_df['Precision'] * 100).round(1).astype(str) + '%'
display_df['AUC'] = display_df['AUC'].round(3)

print(display_df.to_string(index=False))

if len(remoteness_metrics) > 1:
    max_recall = remoteness_metrics['Recall'].max()
    min_recall = remoteness_metrics['Recall'].min()
    print(f"\n⚠️ Recall disparity: {max_recall*100:.1f}% vs {min_recall*100:.1f}%")

In [None]:
# Performance by Socioeconomic Status
print("="*70)
print("PERFORMANCE BY SOCIOECONOMIC STATUS (SEIFA)")
print("="*70)

seifa_metrics = calculate_group_metrics(demographics_test, 'seifa_grouped')

display_df = seifa_metrics[display_cols].copy()
display_df['Base Rate'] = (display_df['Base Rate'] * 100).round(1).astype(str) + '%'
display_df['Recall'] = (display_df['Recall'] * 100).round(1).astype(str) + '%'
display_df['Precision'] = (display_df['Precision'] * 100).round(1).astype(str) + '%'
display_df['AUC'] = display_df['AUC'].round(3)

print(display_df.to_string(index=False))

if len(seifa_metrics) > 1:
    max_recall = seifa_metrics['Recall'].max()
    min_recall = seifa_metrics['Recall'].min()
    print(f"\n⚠️ Recall disparity: {max_recall*100:.1f}% vs {min_recall*100:.1f}%")

In [None]:
# Visualise performance disparities
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Indigenous status
indigenous_metrics_sorted = indigenous_metrics.sort_values('Recall', ascending=True)
axes[0].barh(indigenous_metrics_sorted['Group'], indigenous_metrics_sorted['Recall'] * 100, color='steelblue')
axes[0].axvline(overall_metrics['Recall (Sensitivity)'] * 100, color='red', linestyle='--', label='Overall')
axes[0].set_xlabel('Recall (%)')
axes[0].set_title('Recall by Indigenous Status')
axes[0].legend()

# Remoteness
remoteness_metrics_sorted = remoteness_metrics.sort_values('Recall', ascending=True)
axes[1].barh(remoteness_metrics_sorted['Group'], remoteness_metrics_sorted['Recall'] * 100, color='steelblue')
axes[1].axvline(overall_metrics['Recall (Sensitivity)'] * 100, color='red', linestyle='--', label='Overall')
axes[1].set_xlabel('Recall (%)')
axes[1].set_title('Recall by Remoteness')
axes[1].legend()

# SEIFA
seifa_metrics_sorted = seifa_metrics.sort_values('Recall', ascending=True)
axes[2].barh(seifa_metrics_sorted['Group'].astype(str), seifa_metrics_sorted['Recall'] * 100, color='steelblue')
axes[2].axvline(overall_metrics['Recall (Sensitivity)'] * 100, color='red', linestyle='--', label='Overall')
axes[2].set_xlabel('Recall (%)')
axes[2].set_title('Recall by SEIFA Group')
axes[2].legend()

plt.tight_layout()
plt.show()

print("\n💡 Key Insight: The model may have good OVERALL performance but")
print("   systematically miss deteriorating patients in certain groups.")

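
Some subgroups in the test set are small, so their recall estimates rest on only a handful of true positives and can swing widely. One way to express that uncertainty is a percentile bootstrap interval. The sketch below is a minimal version under stated assumptions: `bootstrap_recall_ci` is a hypothetical helper, not part of this exercise or of scikit-learn, and the toy arrays are invented.

```python
import numpy as np

def bootstrap_recall_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for recall (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    recalls = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample patients with replacement
        positives = y_true[idx] == 1
        if positives.sum() == 0:           # recall is undefined without any positives
            continue
        recalls.append(y_pred[idx][positives].mean())
    if not recalls:
        return (np.nan, np.nan)
    lower, upper = np.quantile(recalls, [alpha / 2, 1 - alpha / 2])
    return (float(lower), float(upper))

# Toy subgroup: 20 patients, 5 true positives, 3 of them detected
y_true_small = np.array([1] * 5 + [0] * 15)
y_pred_small = np.array([1, 1, 1, 0, 0] + [0] * 15)

point = y_pred_small[y_true_small == 1].mean()
low, high = bootstrap_recall_ci(y_true_small, y_pred_small)
print(f"Recall = {point:.2f}, 95% bootstrap interval: [{low:.2f}, {high:.2f}]")
```

The same helper could be applied to each subgroup's `true_outcome` and `predicted_outcome` columns in `demographics_test`. A wide interval suggests the observed gap may partly reflect sample size rather than a real difference in performance.
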
## Part 5: Formal Fairness Metrics

Now let's apply formal fairness definitions. There are several ways to define "fair", and they can conflict with each other.
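
The three definitions used in this part can be written compactly. The notation here is an assumption introduced for illustration: $\hat{Y}$ is the model's binary prediction, $Y$ is the true outcome, and $A$ is the group attribute, with $a$ and $b$ denoting any two groups.

```latex
\begin{align*}
\text{Demographic parity:} \quad & P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=b) \\
\text{Equal opportunity:} \quad & P(\hat{Y}=1 \mid Y=1, A=a) = P(\hat{Y}=1 \mid Y=1, A=b) \\
\text{Equalised odds:} \quad & P(\hat{Y}=1 \mid Y=y, A=a) = P(\hat{Y}=1 \mid Y=y, A=b), \quad y \in \{0, 1\}
\end{align*}
```

Equal opportunity is the $y = 1$ case of equalised odds (equal TPR). The $y = 0$ case is equal FPR.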

In [None]:
# Define fairness metric calculations

def calculate_fairness_metrics(df, group_column, reference_group=None):
    """
    Calculate common fairness metrics.
    """
    groups = df[group_column].unique()
    
    if reference_group is None:
        # Use largest group as reference
        reference_group = df[group_column].value_counts().idxmax()
    
    results = {}
    
    for group in groups:
        mask = df[group_column] == group
        subset = df[mask]
        
        # Positive prediction rate (for demographic parity)
        positive_rate = subset['predicted_outcome'].mean()
        
        # True positive rate (recall) and false positive rate (for equalised odds)
        positives = subset[subset['true_outcome'] == 1]
        negatives = subset[subset['true_outcome'] == 0]
        
        tpr = positives['predicted_outcome'].mean() if len(positives) > 0 else np.nan
        fpr = negatives['predicted_outcome'].mean() if len(negatives) > 0 else np.nan
        
        results[group] = {
            'N': len(subset),
            'Positive_Rate': positive_rate,
            'TPR': tpr,
            'FPR': fpr
        }
    
    return pd.DataFrame(results).T, reference_group

In [None]:
# Calculate fairness metrics by Indigenous status
print("="*70)
print("FAIRNESS METRICS BY INDIGENOUS STATUS")
print("="*70)

fairness_indigenous, ref_group = calculate_fairness_metrics(
    demographics_test, 'indigenous_grouped', 'Non-Indigenous'
)

print(f"\nReference group: {ref_group}")
print("\nMetrics by group:")
print(fairness_indigenous.round(3))

# Calculate disparities
print("\n" + "-"*50)
print("FAIRNESS ANALYSIS:")
print("-"*50)

ref_metrics = fairness_indigenous.loc[ref_group]

for group in fairness_indigenous.index:
    if group != ref_group:
        group_metrics = fairness_indigenous.loc[group]
        
        # Demographic Parity Ratio
        dp_ratio = group_metrics['Positive_Rate'] / ref_metrics['Positive_Rate']
        
        # Equalised Odds - TPR and FPR differences
        tpr_diff = group_metrics['TPR'] - ref_metrics['TPR']
        fpr_diff = group_metrics['FPR'] - ref_metrics['FPR']
        
        print(f"\n{group} vs {ref_group}:")
        print(f"  Demographic Parity Ratio: {dp_ratio:.2f}")
        print(f"    (1.0 = equal positive prediction rates)")
        print(f"  TPR Difference: {tpr_diff:+.3f}")
        print(f"  FPR Difference: {fpr_diff:+.3f}")
        print(f"    (0.0 = equal error rates = Equalised Odds)")

In [None]:
# Visualise fairness metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Demographic Parity - Positive prediction rates
axes[0].bar(fairness_indigenous.index, fairness_indigenous['Positive_Rate'] * 100, 
            color='steelblue', edgecolor='white')
axes[0].axhline(y=demographics_test['predicted_outcome'].mean() * 100, 
                color='red', linestyle='--', label='Overall rate')
axes[0].set_ylabel('Positive Prediction Rate (%)')
axes[0].set_title('Demographic Parity\n(Equal Prediction Rates)')
axes[0].legend()
axes[0].tick_params(axis='x', rotation=45)

# True Positive Rate (Recall)
axes[1].bar(fairness_indigenous.index, fairness_indigenous['TPR'] * 100,
            color='forestgreen', edgecolor='white')
axes[1].axhline(y=overall_metrics['Recall (Sensitivity)'] * 100,
                color='red', linestyle='--', label='Overall rate')
axes[1].set_ylabel('True Positive Rate / Recall (%)')
axes[1].set_title('Equal Opportunity\n(Equal TPR among Actual Positives)')
axes[1].legend()
axes[1].tick_params(axis='x', rotation=45)

# False Positive Rate
axes[2].bar(fairness_indigenous.index, fairness_indigenous['FPR'] * 100,
            color='coral', edgecolor='white')
axes[2].set_ylabel('False Positive Rate (%)')
axes[2].set_title('Predictive Equality\n(Equal FPR among Actual Negatives)')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n📚 Fairness Definitions:")
print("-" * 50)
print("• Demographic Parity: Equal positive prediction rates across groups")
print("• Equal Opportunity: Equal TPR (detecting true positives) across groups")
print("• Equalised Odds: Equal TPR AND FPR across groups")
print("\n⚠️ These definitions can conflict - you may need to choose which matters most!")

### 🔧 Your Turn: Analyse Fairness by SEIFA Group

In [None]:
# YOUR CODE: Calculate and interpret fairness metrics by SEIFA group
# Use the calculate_fairness_metrics function

fairness_seifa, ref_group = calculate_fairness_metrics(
    demographics_test, 'seifa_grouped', 'High (8-10)'
)

print("Fairness Metrics by SEIFA Group:")
print(fairness_seifa.round(3))

# Add your interpretation here:
# What disparities do you observe?
# Which groups might be disadvantaged by this model?

## Part 6: The Fairness Trade-off Challenge

Here's a crucial insight: **different fairness criteria often conflict**. Let's explore why.
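
The conflict can be seen with simple arithmetic. The sketch below takes two hypothetical groups with base rates of 30% and 15% and assumes a model that satisfies equalised odds exactly. The TPR of 0.80 and FPR of 0.10 are illustrative assumptions, and `group_rates` is a helper defined only for this example.

```python
def group_rates(base_rate, tpr, fpr):
    """Return (positive prediction rate, PPV) implied by a base rate, TPR and FPR."""
    # Flagged patients = detected true cases + false alarms
    positive_rate = tpr * base_rate + fpr * (1 - base_rate)
    # PPV = share of flagged patients who truly deteriorate
    ppv = tpr * base_rate / positive_rate
    return positive_rate, ppv

# Equalised odds holds by construction: same TPR and FPR in both groups
tpr, fpr = 0.80, 0.10
rate_a, ppv_a = group_rates(base_rate=0.30, tpr=tpr, fpr=fpr)
rate_b, ppv_b = group_rates(base_rate=0.15, tpr=tpr, fpr=fpr)

print(f"Group A: positive rate = {rate_a:.3f}, PPV = {ppv_a:.3f}")
print(f"Group B: positive rate = {rate_b:.3f}, PPV = {ppv_b:.3f}")
print(f"Demographic parity ratio (B / A): {rate_b / rate_a:.2f}")
```

Even with identical error rates, the group with the lower base rate is flagged less often and its alerts are less often correct. Under these assumptions, demographic parity and equal PPV both fail while equalised odds holds.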

In [None]:
# Demonstrate the impossibility theorem
print("="*70)
print("THE FAIRNESS TRADE-OFF DILEMMA")
print("="*70)

print("""
Consider two groups with DIFFERENT base rates of adverse outcomes:

  Group A: 30% adverse outcome rate (higher risk population)
  Group B: 15% adverse outcome rate (lower risk population)

A well-calibrated model should predict higher risk for Group A.
But this VIOLATES demographic parity (equal positive prediction rates).

If we FORCE demographic parity:
  - We might under-predict risk for Group A (missing deteriorating patients)
  - Or over-predict risk for Group B (unnecessary interventions)

This tension is related to the "impossibility theorem" of algorithmic fairness.
When base rates differ, an imperfect model CANNOT be well calibrated within
each group and also have equal TPR and FPR across groups.
""")

# Show base rates in our data
print("\nBase rates in our dataset:")
print("-" * 40)
for group in demographics_test['seifa_grouped'].unique():
    rate = demographics_test[demographics_test['seifa_grouped'] == group]['true_outcome'].mean()
    print(f"  {group}: {rate*100:.1f}%")

In [None]:
# Explore threshold adjustment as mitigation
print("\n" + "="*70)
print("MITIGATION EXPLORATION: Threshold Adjustment")
print("="*70)

# Currently using 0.5 threshold for all groups
# What if we use different thresholds per group?

def apply_threshold(prob, threshold):
    return (prob >= threshold).astype(int)

# Standard threshold
standard_threshold = 0.5

print(f"\nStandard threshold ({standard_threshold}) for all groups:")
for group in ['Indigenous', 'Non-Indigenous']:  # 'Not Stated' is left out of this comparison
    mask = demographics_test['indigenous_grouped'] == group
    subset = demographics_test[mask]
    pred = apply_threshold(subset['predicted_probability'], standard_threshold)
    recall = recall_score(subset['true_outcome'], pred, zero_division=0)
    print(f"  {group}: Recall = {recall*100:.1f}%")

print("\n💡 Group-specific thresholds could equalise recall,")
print("   but this raises ethical questions about treating groups differently.")

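
To make the idea of group-specific thresholds concrete, the sketch below searches for the highest threshold at which a group reaches a target recall. `threshold_for_target_recall` is a hypothetical helper and the scores are invented. This is one simple way to choose a threshold, not a recommended mitigation.

```python
import numpy as np

def threshold_for_target_recall(y_true, y_prob, target_recall):
    """Highest threshold whose recall meets the target (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if (y_true == 1).sum() == 0:
        return np.nan                        # recall is undefined without positives
    # Candidate thresholds are the observed scores, tried from highest to lowest
    for t in np.sort(np.unique(y_prob))[::-1]:
        pred = (y_prob >= t).astype(int)
        recall = pred[y_true == 1].mean()
        if recall >= target_recall:
            return float(t)
    return np.nan                            # only reached if target_recall > 1

# Invented scores: the model separates outcomes well in group A, less well in group B
y_true_a = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob_a = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
y_true_b = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob_b = np.array([0.7, 0.5, 0.4, 0.3, 0.45, 0.2, 0.1, 0.05])

t_a = threshold_for_target_recall(y_true_a, y_prob_a, target_recall=0.75)
t_b = threshold_for_target_recall(y_true_b, y_prob_b, target_recall=0.75)
print(f"Group A threshold for 75% recall: {t_a}")
print(f"Group B threshold for 75% recall: {t_b}")
```

In this toy data, group B needs a lower threshold to reach the same recall. The lower threshold also flags one of group B's true negatives, so equalising recall comes at the cost of more false alarms in that group.
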
## Part 7: Governance Decision Framework

Based on your analysis, you need to make a recommendation to the clinical governance committee.

In [None]:
# Generate governance summary report
print("="*70)
print("FAIRNESS ASSESSMENT SUMMARY FOR GOVERNANCE")
print("="*70)

print("\n📊 OVERALL MODEL PERFORMANCE:")
print("-" * 50)
for metric, value in overall_metrics.items():
    print(f"  {metric}: {value:.3f}")

print("\n⚠️ IDENTIFIED DISPARITIES:")
print("-" * 50)

# Indigenous
ind_recall = indigenous_metrics.set_index('Group')['Recall']
if 'Indigenous' in ind_recall.index and 'Non-Indigenous' in ind_recall.index:
    gap = ind_recall['Non-Indigenous'] - ind_recall['Indigenous']
    print(f"  Recall gap (Non-Indigenous minus Indigenous): {gap*100:+.1f} percentage points")

# Remoteness
rem_recall = remoteness_metrics.set_index('Group')['Recall']
if 'Remote' in rem_recall.index and 'Non-Remote' in rem_recall.index:
    gap = rem_recall['Non-Remote'] - rem_recall['Remote']
    print(f"  Recall gap (Non-Remote minus Remote): {gap*100:+.1f} percentage points")

# SEIFA
seifa_recall = seifa_metrics.set_index('Group')['Recall']
if 'Low (1-3)' in seifa_recall.index and 'High (8-10)' in seifa_recall.index:
    gap = seifa_recall['High (8-10)'] - seifa_recall['Low (1-3)']
    print(f"  Recall gap (High SEIFA minus Low SEIFA): {gap*100:+.1f} percentage points")

print("\n📋 CLINICAL IMPLICATIONS:")
print("-" * 50)
print("  • If deployed, the model may miss more deteriorating patients in:")
print("    - Indigenous populations")
print("    - Remote/rural areas")
print("    - Lower socioeconomic areas")
print("  • These are often the populations with least access to alternatives")

print("\n❓ GOVERNANCE QUESTIONS:")
print("-" * 50)
print("  1. What level of disparity is acceptable?")
print("  2. Who decides what 'fair enough' means?")
print("  3. Should we delay deployment until disparities are addressed?")
print("  4. Are there mitigation strategies we can implement?")
print("  5. How will we monitor for disparities post-deployment?")

## Part 8: Your Governance Recommendation

Based on your analysis, complete the governance recommendation below.

In [None]:
# ===== YOUR GOVERNANCE RECOMMENDATION =====

governance_recommendation = """
ALGORITHMIC FAIRNESS ASSESSMENT
AI System: Deterioration Prediction Model
Date: [Your date]
Assessor: [Your name]

============================================================

1. EXECUTIVE SUMMARY
------------------------------------------------------------
[Summarise overall performance and key fairness findings in 2-3 sentences]



2. KEY FINDINGS
------------------------------------------------------------
Overall Performance:
  - AUC: [value]
  - Recall: [value]

Identified Disparities:
  - [Disparity 1]
  - [Disparity 2]
  - [Disparity 3]

3. CLINICAL IMPACT ASSESSMENT
------------------------------------------------------------
[What are the real-world implications of these disparities?]
[Which patient groups are most affected?]
[What are the consequences of missed deterioration?]



4. RECOMMENDATION
------------------------------------------------------------
[ ] APPROVE for deployment without conditions
[ ] APPROVE with conditions (specify below)
[ ] DEFER pending further evaluation
[ ] DO NOT APPROVE

Conditions/Rationale:



5. PROPOSED MITIGATIONS
------------------------------------------------------------
[If recommending approval, what mitigations should be implemented?]



6. MONITORING REQUIREMENTS
------------------------------------------------------------
[How should fairness be monitored post-deployment?]



============================================================
"""

print(governance_recommendation)

## Part 9: Reflection Questions

In [None]:
# ===== YOUR REFLECTIONS =====

reflections = """
1. If the model has lower recall for Indigenous patients, what are the
   clinical implications? Who bears the cost of this disparity?
   Your answer:
   

2. Is a 5 percentage point (0.05) difference in AUC between groups acceptable? What about 10 points?
   Who should decide this threshold?
   Your answer:
   

3. If disparities exist, is the problem:
   a) The model itself?
   b) The training data?
   c) The underlying healthcare system?
   d) All of the above?
   Your answer:
   

4. Should we deploy an imperfect model if it still improves on current
   practice overall, even if it worsens disparities?
   Your answer:
   

5. What role should affected communities play in these decisions?
   Your answer:
   

"""

print(reflections)

## 📝 Deliverable

**For your portfolio:**

Complete the governance recommendation (Part 8) with:
1. Your fairness analysis findings
2. A clear recommendation (approve, approve with conditions, defer, or do not approve)
3. Proposed mitigations or conditions
4. Monitoring requirements

**Word count:** Approximately 500 words for the governance recommendation.

Submit via LMS by the Week 5 deadline.

## 🏁 Summary

In this exercise, you learned:

✅ **Overall performance can hide disparities** - always stratify by subgroup

✅ **Multiple fairness definitions exist** - demographic parity, equal opportunity, equalised odds

✅ **Fairness criteria can conflict** - the impossibility theorem means trade-offs are unavoidable

✅ **Context matters** - what's "fair enough" depends on clinical impact and alternatives

✅ **Governance decisions are ethical decisions** - they should involve diverse perspectives

**Key takeaway:** Algorithmic fairness is not just a technical problem; it requires human judgment about acceptable trade-offs and who should make those decisions.

---

**Next exercise (Week 7):** We'll examine how to validate a vendor model on your local population before deployment.