# 📊 Exercise 3: Data Exploration and Bias Detection

**Week 3 | AI in Healthcare Curriculum**

---

## Learning Objectives

By completing this exercise, you will:

- 🎯 Explore a healthcare dataset to understand its characteristics
- 🎯 Identify potential sources of bias in training data
- 🎯 Analyse outcomes by demographic subgroups
- 🎯 Examine missing data patterns and their implications
- 🎯 Understand how dataset composition affects AI behaviour

---

## ⏱️ Estimated Time: 90 minutes

---

## Context

**"Garbage in, garbage out"** - An AI system can only learn patterns present in its training data. If the data is biased, incomplete, or unrepresentative, the AI will inherit those limitations.

This week, we'll examine a healthcare dataset through a critical lens, asking:
- Who is represented in this data?
- Who is missing or underrepresented?
- What biases might an AI learn from this data?

**Important:** We use a synthetic dataset for teaching. The principles apply to real clinical data.

## Part 1: Setup and Data Loading

In [None]:
# Setup - run this first!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("✅ Setup complete!")

In [None]:
# Load the healthcare dataset
# This is a synthetic dataset representing Emergency Department presentations

def generate_ed_dataset(n_patients=2000):
    """
    Generate a synthetic ED dataset with realistic patterns,
    including intentional biases for educational purposes.
    """
    np.random.seed(42)
    
    # Demographics
    ages = np.random.normal(55, 20, n_patients).clip(18, 95).astype(int)
    
    # Gender with slight imbalance (realistic ED pattern)
    genders = np.random.choice(['Male', 'Female'], n_patients, p=[0.52, 0.48])
    
    # Indigenous status - small subgroups (about 4.5% combined), plus 10.5% 'Not Stated'
    indigenous_status = np.random.choice(
        ['Non-Indigenous', 'Aboriginal', 'Torres Strait Islander', 'Both', 'Not Stated'],
        n_patients,
        p=[0.85, 0.03, 0.01, 0.005, 0.105]  # Few Indigenous records; many 'Not Stated'
    )
    
    # Remoteness - urban overrepresentation
    remoteness = np.random.choice(
        ['Major City', 'Inner Regional', 'Outer Regional', 'Remote', 'Very Remote'],
        n_patients,
        p=[0.75, 0.15, 0.07, 0.02, 0.01]  # Urban overrepresentation
    )
    
    # Socioeconomic status (SEIFA-like decile)
    seifa_decile = np.random.choice(range(1, 11), n_patients, 
                                     p=[0.05, 0.06, 0.07, 0.08, 0.09, 0.11, 0.12, 0.14, 0.14, 0.14])
    
    # Clinical data
    triage_category = np.random.choice([1, 2, 3, 4, 5], n_patients,
                                        p=[0.03, 0.12, 0.35, 0.40, 0.10])
    
    # Vital signs (with some missing data patterns)
    heart_rate = np.random.normal(85, 18, n_patients).clip(40, 180)
    respiratory_rate = np.random.normal(18, 5, n_patients).clip(8, 40)
    systolic_bp = np.random.normal(125, 22, n_patients).clip(70, 200)
    temperature = np.random.normal(37.0, 0.7, n_patients).clip(35, 41)
    oxygen_saturation = np.random.normal(96, 3, n_patients).clip(80, 100)
    
    # Introduce missing data patterns - NOT random
    # More missing data for remote areas and Indigenous patients
    missing_mask = np.zeros(n_patients, dtype=bool)
    for i in range(n_patients):
        if remoteness[i] in ['Remote', 'Very Remote']:
            missing_mask[i] = np.random.random() < 0.25
        elif indigenous_status[i] in ['Aboriginal', 'Torres Strait Islander', 'Both']:
            missing_mask[i] = np.random.random() < 0.15
        else:
            missing_mask[i] = np.random.random() < 0.05
    
    # Apply missing data
    temperature = np.where(missing_mask, np.nan, temperature)
    
    # Create some missing pathology results
    pathology_available = np.where(
        np.isin(remoteness, ['Remote', 'Very Remote']),
        np.random.random(n_patients) < 0.6,
        np.random.random(n_patients) < 0.9
    )
    
    # Comorbidities
    comorbidity_count = np.random.poisson(1.5, n_patients).clip(0, 8)
    
    # Outcomes - with bias related to socioeconomic status
    base_risk = (
        0.01 * (ages - 50) / 10 +
        0.02 * (5 - triage_category) +
        0.01 * comorbidity_count +
        -0.005 * (seifa_decile - 5) +  # Lower SES = higher risk
        0.03 * np.isin(remoteness, ['Remote', 'Very Remote']).astype(int) +
        np.random.normal(0, 0.03, n_patients)
    )
    adverse_outcome = (base_risk > 0.15).astype(int)
    
    # Create DataFrame
    df = pd.DataFrame({
        'patient_id': [f'ED{i:05d}' for i in range(n_patients)],
        'age': ages,
        'gender': genders,
        'indigenous_status': indigenous_status,
        'remoteness': remoteness,
        'seifa_decile': seifa_decile,
        'triage_category': triage_category,
        'heart_rate': heart_rate.round(0).astype(int),
        'respiratory_rate': respiratory_rate.round(0).astype(int),
        'systolic_bp': systolic_bp.round(0).astype(int),
        'temperature': temperature.round(1),
        'oxygen_saturation': oxygen_saturation.round(0).astype(int),
        'pathology_available': pathology_available,
        'comorbidity_count': comorbidity_count,
        'adverse_outcome': adverse_outcome
    })
    
    return df

# Generate the dataset
ed_data = generate_ed_dataset(2000)

print("Emergency Department Dataset Generated")
print("="*60)
print(f"Total presentations: {len(ed_data):,}")
print(f"\nColumns available: {len(ed_data.columns)}")
print(ed_data.columns.tolist())
print("\nFirst 10 rows:")
ed_data.head(10)

## Part 2: Basic Dataset Exploration

Before diving into bias analysis, let's understand the basic characteristics of our data.

In [None]:
# Basic statistics
print("Dataset Overview")
print("="*60)
print(f"\nShape: {ed_data.shape[0]} rows × {ed_data.shape[1]} columns")
print(f"\nData types:")
print(ed_data.dtypes)

In [None]:
# Summary statistics for numerical variables
print("Summary Statistics for Clinical Variables")
print("="*60)
ed_data[['age', 'heart_rate', 'respiratory_rate', 'systolic_bp', 
         'temperature', 'oxygen_saturation', 'comorbidity_count']].describe()

In [None]:
# Missing data overview
print("Missing Data Analysis")
print("="*60)
missing_counts = ed_data.isnull().sum()
missing_pct = (ed_data.isnull().sum() / len(ed_data) * 100).round(1)

missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing %': missing_pct
})

print(missing_df[missing_df['Missing Count'] > 0])
print(f"\n⚠️ Note: Temperature has {missing_pct['temperature']:.1f}% missing values")
print("   We'll investigate whether this missingness is random later.")

## Part 3: Demographic Analysis - Who Is Represented?

A critical question for AI fairness: **Does this dataset reflect the population we want to serve?**

In [None]:
# Age distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(ed_data['age'], bins=20, edgecolor='white', color='steelblue')
axes[0].axvline(ed_data['age'].mean(), color='red', linestyle='--', label=f'Mean: {ed_data["age"].mean():.1f}')
axes[0].axvline(ed_data['age'].median(), color='orange', linestyle='--', label=f'Median: {ed_data["age"].median():.1f}')
axes[0].set_xlabel('Age (years)')
axes[0].set_ylabel('Count')
axes[0].set_title('Age Distribution in Dataset')
axes[0].legend()

# By age group (pd.cut bins are right-inclusive, so the last bin covers ages 86 and over)
age_groups = pd.cut(ed_data['age'], bins=[0, 25, 45, 65, 85, 100], 
                    labels=['18-25', '26-45', '46-65', '66-85', '86+'])
age_counts = age_groups.value_counts().sort_index()
axes[1].bar(age_counts.index, age_counts.values, color='steelblue', edgecolor='white')
axes[1].set_xlabel('Age Group')
axes[1].set_ylabel('Count')
axes[1].set_title('Presentations by Age Group')

plt.tight_layout()
plt.show()

print("\n💡 Question: Is this age distribution representative of your ED population?")
print("   Consider: Who presents to ED? Who might be underrepresented?")

In [None]:
# Gender distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

gender_counts = ed_data['gender'].value_counts()

# Pie chart
axes[0].pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%',
            colors=['steelblue', 'coral'], startangle=90)
axes[0].set_title('Gender Distribution')

# Comparison to population
comparison_data = pd.DataFrame({
    'Dataset': [gender_counts['Male']/len(ed_data)*100, gender_counts['Female']/len(ed_data)*100],
    'Australian Population': [49.3, 50.7]  # ABS 2021 Census
}, index=['Male', 'Female'])

comparison_data.plot(kind='bar', ax=axes[1], rot=0, color=['steelblue', 'coral'])
axes[1].set_ylabel('Percentage')
axes[1].set_title('Dataset vs Australian Population')
axes[1].legend(loc='upper right')

plt.tight_layout()
plt.show()

print("\n💡 Note: Males are slightly overrepresented in ED presentations.")
print("   This reflects real ED patterns, but an AI trained on this data")
print("   may have learned more about male presentations.")

In [None]:
# Indigenous status analysis
print("Indigenous Status in Dataset")
print("="*60)

indigenous_counts = ed_data['indigenous_status'].value_counts()
indigenous_pct = (indigenous_counts / len(ed_data) * 100).round(1)

for status in indigenous_counts.index:
    print(f"  {status}: {indigenous_counts[status]:,} ({indigenous_pct[status]}%)")

# Combine Aboriginal and Torres Strait Islander
indigenous_combined = ed_data['indigenous_status'].isin(
    ['Aboriginal', 'Torres Strait Islander', 'Both']
).sum()
indigenous_combined_pct = indigenous_combined / len(ed_data) * 100

print(f"\n  Combined Indigenous: {indigenous_combined:,} ({indigenous_combined_pct:.1f}%)")
print(f"\n⚠️ Critical Finding:")
print(f"   Aboriginal and Torres Strait Islander peoples make up ~3.8% of")
print(f"   the Australian population and {indigenous_combined_pct:.1f}% of this dataset,")
print(f"   which is only {indigenous_combined:,} records. A further")
print(f"   {indigenous_pct.get('Not Stated', 0)}% of records are 'Not Stated'.")
print(f"\n   With so few examples, an AI trained on this data has limited")
print(f"   scope to learn patterns specific to Indigenous patient populations.")

In [None]:
# Remoteness analysis
print("Geographic Remoteness in Dataset")
print("="*60)

remoteness_counts = ed_data['remoteness'].value_counts()
remoteness_pct = (remoteness_counts / len(ed_data) * 100).round(1)

# Compare to population distribution
population_remoteness = {
    'Major City': 72.0,
    'Inner Regional': 18.0,
    'Outer Regional': 7.0,
    'Remote': 2.0,
    'Very Remote': 1.0
}

comparison = pd.DataFrame({
    'Dataset %': [remoteness_pct.get(k, 0) for k in population_remoteness.keys()],
    'Population %': list(population_remoteness.values())
}, index=population_remoteness.keys())

fig, ax = plt.subplots(figsize=(10, 5))
comparison.plot(kind='bar', ax=ax, rot=45, color=['steelblue', 'coral'])
ax.set_ylabel('Percentage')
ax.set_title('Geographic Representation: Dataset vs Population')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()

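print("\n💡 Observation: Major City presentations are overrepresented and Inner Regional")
print("   underrepresented. Remote and Very Remote roughly match their population share,")
print("   but at about 3% of the dataset they contribute very few records, so an AI")
print("   may still perform poorly for rural/remote patients.")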
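
The dataset-versus-population comparisons in this part can be summarised with a representation ratio: each group's share of the dataset divided by its share of the reference population. The sketch below is illustrative rather than part of the exercise. `representation_ratio` is a hypothetical helper, and the toy `remoteness` series simply mirrors the sampling probabilities used in the generator.

```python
import pandas as pd

def representation_ratio(groups, population_pct):
    """Compare each group's share of the dataset with a reference population share.

    groups: Series of category labels, one per record.
    population_pct: dict mapping category -> reference percentage.
    A ratio below 1 suggests underrepresentation; above 1, overrepresentation.
    """
    dataset_pct = groups.value_counts(normalize=True) * 100
    table = pd.DataFrame({
        'dataset_pct': dataset_pct.reindex(list(population_pct)).fillna(0.0),
        'population_pct': pd.Series(population_pct),
    })
    table['ratio'] = (table['dataset_pct'] / table['population_pct']).round(2)
    return table

# Toy data (hypothetical): 100 records mirroring the generator's remoteness probabilities
remoteness = pd.Series(
    ['Major City'] * 75 + ['Inner Regional'] * 15 + ['Outer Regional'] * 7
    + ['Remote'] * 2 + ['Very Remote'] * 1
)
population_remoteness = {
    'Major City': 72.0, 'Inner Regional': 18.0, 'Outer Regional': 7.0,
    'Remote': 2.0, 'Very Remote': 1.0,
}
ratios = representation_ratio(remoteness, population_remoteness)
print(ratios)
```

In the notebook, the same helper could be called with `ed_data['remoteness']`. A ratio near 1 does not by itself mean a group is well covered: a small population share still yields few records.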

### 🔧 Your Turn: Socioeconomic Analysis

SEIFA decile (1 = most disadvantaged, 10 = least disadvantaged) indicates socioeconomic status.

Analyse the SEIFA distribution in the dataset:

In [None]:
# YOUR CODE: Analyse SEIFA distribution
# Hint: Use value_counts() and create a bar chart

# Get the distribution
seifa_counts = ed_data['seifa_decile'].value_counts().sort_index()

# Plot
plt.figure(figsize=(10, 5))
plt.bar(seifa_counts.index, seifa_counts.values, color='steelblue', edgecolor='white')
plt.axhline(y=len(ed_data)/10, color='red', linestyle='--', 
            label='Expected if equal (10% each)')
plt.xlabel('SEIFA Decile (1=Most Disadvantaged, 10=Least Disadvantaged)')
plt.ylabel('Count')
plt.title('Socioeconomic Distribution in Dataset')
plt.xticks(range(1, 11))
plt.legend()
plt.show()

print("\n🤔 Question: Are disadvantaged groups (lower SEIFA) adequately represented?")
print("   What might happen if an AI is trained primarily on higher SEIFA patients?")
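
Eyeballing the bars against the 10% line can be backed up with a chi-square goodness-of-fit statistic. The sketch below computes it by hand with NumPy. The counts are not taken from a run of this notebook: they are the expected counts implied by the generator's SEIFA probabilities for 2,000 patients. The value 16.92 is the standard 5% critical value for 9 degrees of freedom, and `chi_square_uniform` is a hypothetical helper name.

```python
import numpy as np

def chi_square_uniform(observed_counts):
    """Chi-square goodness-of-fit statistic against equal expected counts."""
    observed = np.asarray(observed_counts, dtype=float)
    expected = observed.sum() / len(observed)  # same expected count in every category
    return float(((observed - expected) ** 2 / expected).sum())

# Illustrative counts for deciles 1..10: generator probabilities x 2,000 patients
illustrative_counts = [100, 120, 140, 160, 180, 220, 240, 280, 280, 280]

statistic = chi_square_uniform(illustrative_counts)
CRITICAL_VALUE_DF9_5PCT = 16.92  # chi-square critical value, 9 degrees of freedom, alpha = 0.05

print(f"Chi-square statistic: {statistic:.1f}")
print(f"Departs from equal deciles at the 5% level: {statistic > CRITICAL_VALUE_DF9_5PCT}")
```

In the notebook, `seifa_counts.values` could be passed in place of the illustrative counts. A large statistic only says the deciles are unequal; the bar chart is still needed to see which deciles are short.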

## Part 4: Outcome Analysis by Subgroup

Now let's examine whether **outcomes differ between groups**. This is crucial for understanding potential algorithmic bias.

In [None]:
# Overall outcome rate
outcome_rate = ed_data['adverse_outcome'].mean() * 100
print(f"Overall adverse outcome rate: {outcome_rate:.1f}%")
print("\nNow let's see if this varies by demographic group...\n")

In [None]:
# Outcomes by demographic groups
def analyse_outcomes_by_group(data, group_column):
    """Analyse adverse outcome rates by a grouping variable."""
    grouped = data.groupby(group_column).agg({
        'adverse_outcome': ['count', 'sum', 'mean']
    })
    grouped.columns = ['n_patients', 'n_adverse', 'adverse_rate']
    grouped['adverse_rate_pct'] = (grouped['adverse_rate'] * 100).round(1)
    return grouped

# Analyse by multiple groups
groups_to_analyse = ['gender', 'indigenous_status', 'remoteness', 'triage_category']

for group in groups_to_analyse:
    print(f"\n{'='*60}")
    print(f"Adverse Outcomes by {group.replace('_', ' ').title()}")
    print('='*60)
    result = analyse_outcomes_by_group(ed_data, group)
    print(result[['n_patients', 'n_adverse', 'adverse_rate_pct']])
    
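    # Calculate disparity (guard against subgroups with no adverse outcomes)
    max_rate = result['adverse_rate_pct'].max()
    min_rate = result['adverse_rate_pct'].min()
    if min_rate > 0:
        print(f"\n  Disparity: {max_rate:.1f}% vs {min_rate:.1f}% (ratio: {max_rate/min_rate:.2f}x)")
    else:
        print(f"\n  Disparity: {max_rate:.1f}% vs {min_rate:.1f}% (ratio undefined: lowest rate is 0%)")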

In [None]:
# Visualise outcome disparities
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# By remoteness
remoteness_outcomes = ed_data.groupby('remoteness')['adverse_outcome'].mean() * 100
order = ['Major City', 'Inner Regional', 'Outer Regional', 'Remote', 'Very Remote']
remoteness_outcomes = remoteness_outcomes.reindex(order)
axes[0, 0].bar(range(len(remoteness_outcomes)), remoteness_outcomes.values, color='steelblue')
axes[0, 0].set_xticks(range(len(remoteness_outcomes)))
axes[0, 0].set_xticklabels(remoteness_outcomes.index, rotation=45, ha='right')
axes[0, 0].set_ylabel('Adverse Outcome Rate (%)')
axes[0, 0].set_title('Outcomes by Remoteness')
axes[0, 0].axhline(outcome_rate, color='red', linestyle='--', label='Overall rate')
axes[0, 0].legend()

# By SEIFA decile
seifa_outcomes = ed_data.groupby('seifa_decile')['adverse_outcome'].mean() * 100
axes[0, 1].bar(seifa_outcomes.index, seifa_outcomes.values, color='steelblue')
axes[0, 1].set_xlabel('SEIFA Decile')
axes[0, 1].set_ylabel('Adverse Outcome Rate (%)')
axes[0, 1].set_title('Outcomes by Socioeconomic Status')
axes[0, 1].axhline(outcome_rate, color='red', linestyle='--', label='Overall rate')
axes[0, 1].legend()

# By triage category
triage_outcomes = ed_data.groupby('triage_category')['adverse_outcome'].mean() * 100
axes[1, 0].bar(triage_outcomes.index, triage_outcomes.values, color='steelblue')
axes[1, 0].set_xlabel('Triage Category')
axes[1, 0].set_ylabel('Adverse Outcome Rate (%)')
axes[1, 0].set_title('Outcomes by Triage Category')

# By age group (pd.cut bins are right-inclusive, so the last bin covers ages 86 and over)
ed_data['age_group'] = pd.cut(ed_data['age'], bins=[0, 25, 45, 65, 85, 100], 
                              labels=['18-25', '26-45', '46-65', '66-85', '86+'])
age_outcomes = ed_data.groupby('age_group', observed=False)['adverse_outcome'].mean() * 100
axes[1, 1].bar(range(len(age_outcomes)), age_outcomes.values, color='steelblue')
axes[1, 1].set_xticks(range(len(age_outcomes)))
axes[1, 1].set_xticklabels(age_outcomes.index)
axes[1, 1].set_xlabel('Age Group')
axes[1, 1].set_ylabel('Adverse Outcome Rate (%)')
axes[1, 1].set_title('Outcomes by Age Group')

plt.tight_layout()
plt.show()

print("\n💡 Key Finding: Adverse outcome rates differ across demographic groups.")
print("   An AI that ignores these patterns may perpetuate or amplify disparities.")
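
Subgroup rates based on few patients are noisy, so it helps to attach an uncertainty range before reading much into a disparity. The sketch below uses the Wilson score interval, which is one reasonable choice among several. `wilson_interval` is a hypothetical helper and the subgroup sizes are made up for illustration.

```python
import math

def wilson_interval(n_events, n_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n_total == 0:
        return (float('nan'), float('nan'))
    p = n_events / n_total
    denom = 1 + z**2 / n_total
    centre = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))

# Hypothetical subgroups with the same observed rate (5%) but very different sizes
for label, events, total in [('Large subgroup', 75, 1500), ('Small subgroup', 1, 20)]:
    low, high = wilson_interval(events, total)
    print(f"{label}: {events}/{total} = {events/total:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Applied to the `n_adverse` and `n_patients` columns returned by `analyse_outcomes_by_group`, this shows which apparent disparities rest on only a handful of events.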

## Part 5: Missing Data Analysis - Not Random!

Missing data is rarely random in healthcare. Let's investigate whether missingness correlates with patient characteristics.

In [None]:
# Missing temperature by group
print("Missing Temperature Data Analysis")
print("="*60)

# Create missing indicator
ed_data['temp_missing'] = ed_data['temperature'].isnull()

# Analyse by different groups
for group in ['remoteness', 'indigenous_status']:
    print(f"\nMissing Rate by {group.replace('_', ' ').title()}:")
    missing_by_group = ed_data.groupby(group)['temp_missing'].mean() * 100
    for idx, val in missing_by_group.items():
        print(f"  {idx}: {val:.1f}%")

In [None]:
# Visualise missing data patterns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# By remoteness
missing_by_remoteness = ed_data.groupby('remoteness')['temp_missing'].mean() * 100
order = ['Major City', 'Inner Regional', 'Outer Regional', 'Remote', 'Very Remote']
missing_by_remoteness = missing_by_remoteness.reindex(order)
axes[0].bar(range(len(missing_by_remoteness)), missing_by_remoteness.values, color='coral')
axes[0].set_xticks(range(len(missing_by_remoteness)))
axes[0].set_xticklabels(missing_by_remoteness.index, rotation=45, ha='right')
axes[0].set_ylabel('% Missing Temperature')
axes[0].set_title('Missing Temperature by Remoteness')

# By Indigenous status
ed_data['indigenous_grouped'] = ed_data['indigenous_status'].apply(
    lambda x: 'Indigenous' if x in ['Aboriginal', 'Torres Strait Islander', 'Both'] 
    else ('Not Stated' if x == 'Not Stated' else 'Non-Indigenous')
)
missing_by_indigenous = ed_data.groupby('indigenous_grouped')['temp_missing'].mean() * 100
axes[1].bar(missing_by_indigenous.index, missing_by_indigenous.values, color='coral')
axes[1].set_ylabel('% Missing Temperature')
axes[1].set_title('Missing Temperature by Indigenous Status')

plt.tight_layout()
plt.show()

print("\n⚠️ Critical Finding: Data is more likely to be missing for:")
print("   - Patients from Remote/Very Remote areas")
print("   - Indigenous patients")
print("\n   Missingness that depends on patient group (not completely random) means:")
print("   - AI trained on complete cases will underlearn from these groups")
print("   - AI may perform worse for patients with missing data")
print("   - The missing data itself might be informative (limited access to care)")

In [None]:
# Pathology availability analysis
print("Pathology Results Availability")
print("="*60)

path_by_remoteness = ed_data.groupby('remoteness')['pathology_available'].mean() * 100
order = ['Major City', 'Inner Regional', 'Outer Regional', 'Remote', 'Very Remote']
path_by_remoteness = path_by_remoteness.reindex(order)

plt.figure(figsize=(10, 5))
plt.bar(range(len(path_by_remoteness)), path_by_remoteness.values, color='steelblue')
plt.xticks(range(len(path_by_remoteness)), path_by_remoteness.index, rotation=45, ha='right')
plt.ylabel('% with Pathology Available')
plt.title('Pathology Results Availability by Remoteness')
plt.tight_layout()
plt.show()

print("\n💡 Implication: If AI relies on pathology results, it may be less useful")
print("   for patients in remote areas where results aren't available.")
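
The point about training only on complete cases can be made concrete by comparing each group's share of the data before and after rows with missing values are dropped. This is a minimal sketch on toy data. `complete_case_retention` is a hypothetical helper and the twelve records are invented for illustration.

```python
import numpy as np
import pandas as pd

def complete_case_retention(df, group_col, required_cols):
    """Show how dropping rows with missing values changes each group's share."""
    complete = df.dropna(subset=required_cols)
    n_all = df[group_col].value_counts()
    n_complete = complete[group_col].value_counts().reindex(n_all.index, fill_value=0)
    return pd.DataFrame({
        'n_all': n_all,
        'n_complete': n_complete,
        'retained_pct': (n_complete / n_all * 100).round(1),
        'share_all_pct': (n_all / len(df) * 100).round(1),
        'share_complete_pct': (n_complete / len(complete) * 100).round(1),
    })

# Toy data (hypothetical): 8 urban and 4 remote records, with temperature
# missing for 1 urban and 2 remote patients
toy = pd.DataFrame({
    'remoteness': ['Major City'] * 8 + ['Remote'] * 4,
    'temperature': [36.8, 37.1, np.nan, 36.9, 37.4, 37.0, 36.6, 37.2,
                    np.nan, 37.3, np.nan, 36.7],
})
retention = complete_case_retention(toy, 'remoteness', ['temperature'])
print(retention)
```

In the notebook, calling it as `complete_case_retention(ed_data, 'remoteness', ['temperature'])` would show how much further the Remote and Very Remote share shrinks once incomplete rows are discarded.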

## Part 6: Simulating Dataset Shift

What happens when an AI model is deployed on a population different from its training data?

Let's simulate temporal shift by comparing patients from different "time periods."

In [None]:
# Create simulated time periods
# Imagine first 1500 patients are from "2020-2022" (training period)
# Last 500 patients are from "2023" (deployment period) with slight shift

np.random.seed(123)  # Different seed for shift

# Add a "period" column
ed_data['period'] = np.where(
    ed_data.index < 1500,
    'Training (2020-2022)',
    'Deployment (2023)'
)

# Simulate demographic shift (an ageing population) by shifting ages upward
# in the deployment period. Only 'age' is modified; outcomes are not regenerated.
deployment_mask = ed_data['period'] == 'Deployment (2023)'
ed_data.loc[deployment_mask, 'age'] = (ed_data.loc[deployment_mask, 'age'] + 
                                        np.random.normal(3, 2, deployment_mask.sum())).clip(18, 95).astype(int)

# Compare periods
print("Comparing Training vs Deployment Periods")
print("="*60)

comparison = ed_data.groupby('period').agg({
    'age': 'mean',
    'heart_rate': 'mean',
    'triage_category': 'mean',
    'adverse_outcome': 'mean'
}).round(2)

comparison.columns = ['Mean Age', 'Mean HR', 'Mean Triage', 'Outcome Rate']
comparison['Outcome Rate'] = (comparison['Outcome Rate'] * 100).round(1)
print(comparison)

In [None]:
# Visualise the shift
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Age distribution shift
for period in ed_data['period'].unique():
    subset = ed_data[ed_data['period'] == period]['age']
    axes[0].hist(subset, bins=20, alpha=0.6, label=period, edgecolor='white')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Count')
axes[0].set_title('Age Distribution Shift Over Time')
axes[0].legend()

# Outcome rate comparison
outcome_by_period = ed_data.groupby('period')['adverse_outcome'].mean() * 100
axes[1].bar(outcome_by_period.index, outcome_by_period.values, color=['steelblue', 'coral'])
axes[1].set_ylabel('Adverse Outcome Rate (%)')
axes[1].set_title('Outcome Rates by Period')

plt.tight_layout()
plt.show()

print("\n⚠️ Dataset Shift Implications:")
print("   - If the deployment population differs from training, performance may degrade")
print("   - This is why continuous monitoring is essential after AI deployment")
print("   - Models may need regular retraining or recalibration")
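
Continuous monitoring needs a number to track. One commonly used drift measure is the Population Stability Index (PSI), which compares the binned distribution of a feature in a reference sample with a current sample. The sketch below is an assumption-laden illustration, not the method used in this exercise: the function name, the equal-width binning, and the simulated ages are all choices made here.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples, using equal-width bins built from the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from the reference sample; the outer bins are open-ended
    inner_edges = np.histogram_bin_edges(reference, bins=n_bins)[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference, side='right'),
                             minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current, side='right'),
                             minlength=n_bins)
    # Clip shares away from zero so the logarithm is always defined
    ref_share = np.clip(ref_counts / len(reference), eps, None)
    cur_share = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

# Simulated ages (hypothetical): deployment sample shifted upward by 3 years
rng = np.random.default_rng(0)
training_age = rng.normal(55, 20, 1500).clip(18, 95)
deployment_age = (rng.normal(55, 20, 500) + 3).clip(18, 95)

psi_same = population_stability_index(training_age, training_age)
psi_shift = population_stability_index(training_age, deployment_age)
print(f"PSI, training vs itself:     {psi_same:.3f}")
print(f"PSI, training vs deployment: {psi_shift:.3f}")
```

In the notebook, the two inputs would be the `age` values for each `period`. What counts as a worrying PSI is a matter of convention and should be agreed locally rather than read off this sketch.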

## Part 7: Summary of Bias Sources

Let's consolidate our findings about potential biases in this dataset.

In [None]:
# Summary report
print("="*70)
print("BIAS AND DATA QUALITY ASSESSMENT SUMMARY")
print("="*70)

print("\n📊 REPRESENTATION ISSUES:")
print("-" * 50)

indigenous_pct = ed_data['indigenous_status'].isin(
    ['Aboriginal', 'Torres Strait Islander', 'Both']).mean() * 100
remote_pct = ed_data['remoteness'].isin(['Remote', 'Very Remote']).mean() * 100

print(f"  • Indigenous representation: {indigenous_pct:.1f}% (vs ~3.8% population)")
print(f"  • Remote/Very Remote: {remote_pct:.1f}% (vs ~3% population)")
print(f"  • Low SEIFA (1-3) representation: {(ed_data['seifa_decile'] <= 3).mean()*100:.1f}%")

print("\n📊 MISSING DATA PATTERNS:")
print("-" * 50)
print(f"  • Temperature missing overall: {ed_data['temperature'].isnull().mean()*100:.1f}%")
print(f"  • Temperature missing (Remote/VR): {ed_data[ed_data['remoteness'].isin(['Remote', 'Very Remote'])]['temperature'].isnull().mean()*100:.1f}%")
print(f"  • Pathology unavailable (Remote/VR): {(1-ed_data[ed_data['remoteness'].isin(['Remote', 'Very Remote'])]['pathology_available'].mean())*100:.1f}%")

print("\n📊 OUTCOME DISPARITIES:")
print("-" * 50)
remote_outcome = ed_data[ed_data['remoteness'].isin(['Remote', 'Very Remote'])]['adverse_outcome'].mean() * 100
urban_outcome = ed_data[ed_data['remoteness'] == 'Major City']['adverse_outcome'].mean() * 100
low_seifa = ed_data[ed_data['seifa_decile'] <= 3]['adverse_outcome'].mean() * 100
high_seifa = ed_data[ed_data['seifa_decile'] >= 8]['adverse_outcome'].mean() * 100

print(f"  • Remote vs Urban outcome rate: {remote_outcome:.1f}% vs {urban_outcome:.1f}%")
print(f"  • Low vs High SEIFA outcome rate: {low_seifa:.1f}% vs {high_seifa:.1f}%")

print("\n⚠️ IMPLICATIONS FOR AI TRAINING:")
print("-" * 50)
print("  1. Underrepresented groups may have worse AI performance")
print("  2. Missing data patterns correlate with disadvantage")
print("  3. Outcome disparities may be learned and perpetuated by AI")
print("  4. Dataset shift requires ongoing monitoring post-deployment")

print("\n" + "="*70)

## Part 8: Reflection Questions

Consider these questions and write your responses:

In [None]:
# ===== YOUR REFLECTIONS =====

reflections = """
1. Does this dataset reflect the population YOUR health service serves?
   Your answer:
   

2. What groups are most underrepresented? What are the implications?
   Your answer:
   

3. If missing data correlates with disadvantage, what happens when AI
   is trained only on "complete" cases?
   Your answer:
   

4. Outcomes vary by demographic group. Is this:
   a) Bias the AI should NOT learn?
   b) Real clinical differences the AI SHOULD learn?
   c) A mix of both?
   Your answer:
   

5. What questions would you ask a vendor about their AI's training data?
   Your answer:
   

"""

print(reflections)
print("\n✅ Reflections printed. Edit the text in this cell to record your answers.")

## 📝 Deliverable

**For your portfolio:**

Complete this analysis notebook with your observations on:
1. Key representation gaps you identified
2. Missing data patterns and their implications
3. Potential sources of bias an AI might learn
4. What you would want to know about any AI's training data

Submit the completed notebook via LMS by the Week 3 deadline.

## 🏁 Summary

In this exercise, you learned:

✅ **Representation matters** - AI can only learn from who's in the data

✅ **Missing data isn't random** - it often correlates with disadvantage

✅ **Outcomes vary by group** - this creates complex ethical questions

✅ **Dataset shift happens** - populations change over time

✅ **Critical data analysis** is essential before trusting any AI

**Key takeaway:** Understanding data characteristics is the first step to understanding AI limitations. Always ask: "Who is in this data, and who is missing?"

---

**Next exercise (Week 5):** We'll train a model on this data and measure whether it performs fairly across demographic groups.