<a href="https://colab.research.google.com/github/nmansour67/skills-introduction-to-github/blob/main/No_Show_Predictor_Data_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# ============================================================================
# SCRIPT 1: NO-SHOW PREDICTOR - DATA GENERATOR
# Purpose: Generate realistic hospital baseline data + AI predictions
# Output: 2 CSV files for download
# ============================================================================

print("="*80)
print("üìä DATA GENERATOR: NO-SHOW PREDICTOR AI VALIDATION")
print("="*80)
print("""
This script generates TWO realistic datasets:
  1. Hospital baseline data (ground truth patient appointments)
  2. AI model predictions (vendor AI outputs)

You will download these CSV files, then use them in Script 2 for analysis.
This simulates real-world workflow: Hospital data → AI validation
""")

# ============================================================================
# INSTALL LIBRARIES
# ============================================================================

print("\nüì¶ Installing required libraries...")
import subprocess
import sys

packages = ['pandas', 'numpy']
for package in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)  # Reproducibility

print("✅ Libraries loaded\n")

# ============================================================================
# CONFIGURATION
# ============================================================================

print("⚙️ CONFIGURATION")
print("="*80)

NUM_APPOINTMENTS = 500
PILOT_DURATION_DAYS = 14

# Zip code characteristics (simulating real socioeconomic geography)
ZIP_CODES = {
    '90210': {
        'name': 'Beverly Hills (High Income)',
        'percentage': 0.40,
        'no_show_rate': 0.15,  # LOW - good transit
        'avg_distance': 8,
        'transit_quality': 'Excellent'
    },
    '90005': {
        'name': 'Mid-Wilshire (Middle Income)',
        'percentage': 0.35,
        'no_show_rate': 0.25,  # MODERATE
        'avg_distance': 12,
        'transit_quality': 'Moderate'
    },
    '90011': {
        'name': 'South LA (Low Income)',
        'percentage': 0.25,
        'no_show_rate': 0.45,  # HIGH - transportation barriers!
        'avg_distance': 18,
        'transit_quality': 'Poor'
    }
}

print(f"Generating {NUM_APPOINTMENTS} MRI appointments over {PILOT_DURATION_DAYS} days")
print(f"\nüìç SOCIOECONOMIC GEOGRAPHY:")
for zip_code, info in ZIP_CODES.items():
    print(f"  {zip_code}: {info['name']}")
    print(f"    ‚Ä¢ {info['percentage']*100:.0f}% of patients")
    print(f"    ‚Ä¢ {info['no_show_rate']*100:.0f}% no-show rate (transportation: {info['transit_quality']})")

# ============================================================================
# GENERATE DATASET 1: HOSPITAL BASELINE DATA (GROUND TRUTH)
# ============================================================================

print("\n\nüìä DATASET 1: GENERATING HOSPITAL BASELINE DATA")
print("="*80)

patients = []

for i in range(NUM_APPOINTMENTS):
    patient_id = f"PT-{i+1:04d}"

    # Assign zip code based on realistic distribution
    zip_rand = np.random.random()
    if zip_rand < ZIP_CODES['90210']['percentage']:
        zip_code = '90210'
    elif zip_rand < ZIP_CODES['90210']['percentage'] + ZIP_CODES['90005']['percentage']:
        zip_code = '90005'
    else:
        zip_code = '90011'

    zip_info = ZIP_CODES[zip_code]

    # Demographics
    age = int(np.clip(np.random.normal(55, 15), 25, 75))
    gender = np.random.choice(['M', 'F'])

    # Insurance type (correlated with zip code - this is reality)
    if zip_code == '90210':
        insurance = np.random.choice(['Commercial', 'Medicare', 'Medicaid'],
                                    p=[0.70, 0.25, 0.05])
    elif zip_code == '90005':
        insurance = np.random.choice(['Commercial', 'Medicare', 'Medicaid'],
                                    p=[0.50, 0.35, 0.15])
    else:  # 90011
        insurance = np.random.choice(['Commercial', 'Medicare', 'Medicaid'],
                                    p=[0.25, 0.30, 0.45])

    # Distance to clinic (correlated with zip code)
    distance = max(2, np.random.normal(zip_info['avg_distance'], 3))

    # Prior no-show history (0-3 previous no-shows)
    if zip_code == '90210':
        prior_noshow = np.random.choice([0, 1, 2, 3], p=[0.70, 0.20, 0.08, 0.02])
    elif zip_code == '90005':
        prior_noshow = np.random.choice([0, 1, 2, 3], p=[0.55, 0.25, 0.15, 0.05])
    else:  # 90011
        prior_noshow = np.random.choice([0, 1, 2, 3], p=[0.40, 0.30, 0.20, 0.10])

    # Appointment scheduling
    day_offset = np.random.randint(0, PILOT_DURATION_DAYS)
    appointment_date = datetime(2025, 1, 1) + timedelta(days=day_offset)

    # Skip Sundays
    while appointment_date.weekday() == 6:
        day_offset = np.random.randint(0, PILOT_DURATION_DAYS)
        appointment_date = datetime(2025, 1, 1) + timedelta(days=day_offset)

    hour = np.random.randint(8, 18)
    minute = np.random.choice([0, 15, 30, 45])
    appointment_time = appointment_date.replace(hour=hour, minute=minute)

    day_of_week = appointment_date.strftime('%A')
    time_slot = 'Morning' if hour < 12 else 'Afternoon' if hour < 17 else 'Evening'
    lead_time = np.random.randint(1, 30)

    # GROUND TRUTH: Did patient show up?
    # Base probability from zip code (reflects transportation barriers)
    noshow_prob = zip_info['no_show_rate']

    # Adjust for other factors
    noshow_prob += prior_noshow * 0.05
    if insurance == 'Medicaid':
        noshow_prob += 0.10
    if distance > 20:
        noshow_prob += 0.10
    if lead_time < 3:
        noshow_prob -= 0.05

    noshow_prob = np.clip(noshow_prob, 0.05, 0.70)
    showed_up = np.random.random() > noshow_prob

    patients.append({
        'patient_id': patient_id,
        'age': age,
        'gender': gender,
        'zip_code': zip_code,
        'zip_name': zip_info['name'],
        'insurance_type': insurance,
        'distance_miles': round(distance, 1),
        'prior_noshow_count': prior_noshow,
        'appointment_datetime': appointment_time.strftime('%Y-%m-%d %H:%M'),
        'appointment_date': appointment_date.strftime('%Y-%m-%d'),
        'day_of_week': day_of_week,
        'time_slot': time_slot,
        'lead_time_days': lead_time,
        'showed_up': showed_up,
        'noshow': not showed_up
    })

    if (i + 1) % 100 == 0:
        print(f"  Generated {i+1}/{NUM_APPOINTMENTS} appointments...")

baseline_df = pd.DataFrame(patients)

print(f"\n✅ Dataset 1 complete: {len(baseline_df)} appointments")

# Summary statistics
overall_noshow = baseline_df['noshow'].sum() / len(baseline_df) * 100
print(f"\nüìä GROUND TRUTH SUMMARY:")
print(f"  Overall no-show rate: {overall_noshow:.1f}%")

for zip_code in ['90210', '90005', '90011']:
    zip_data = baseline_df[baseline_df['zip_code'] == zip_code]
    noshow_rate = zip_data['noshow'].sum() / len(zip_data) * 100
    print(f"  {zip_code}: {noshow_rate:.1f}% no-show ({zip_data['noshow'].sum()}/{len(zip_data)})")

# ============================================================================
# GENERATE DATASET 2: AI MODEL PREDICTIONS
# ============================================================================

print("\n\nü§ñ DATASET 2: GENERATING AI MODEL PREDICTIONS")
print("="*80)

print("""
⚠️ THE AI'S PROBLEMATIC LOGIC:
The vendor's AI learned from historical data where ZIP CODE became the
strongest predictor (demographic proxy = BIAS).

AI essentially learned:
  • "Zip 90011 = High Risk" → Flags 80% of low-income patients
  • "Zip 90210 = Low Risk"  → Flags only 10% of high-income patients

This achieves 90% accuracy BUT through biased pattern recognition!
""")

ai_predictions = []

for idx, patient in baseline_df.iterrows():
    # AI RISK SCORING (Biased Black Box)
    risk_score = 0

    # PRIMARY FACTOR: Zip Code (PROBLEMATIC - demographic proxy!)
    if patient['zip_code'] == '90011':
        risk_score += 50  # Heavily weights low-income area
    elif patient['zip_code'] == '90005':
        risk_score += 25
    else:  # 90210
        risk_score += 5

    # SECONDARY: Insurance (also problematic proxy)
    if patient['insurance_type'] == 'Medicaid':
        risk_score += 20
    elif patient['insurance_type'] == 'Medicare':
        risk_score += 10

    # TERTIARY: Prior no-shows (legitimate but confounded)
    risk_score += patient['prior_noshow_count'] * 8

    # Distance
    if patient['distance_miles'] > 15:
        risk_score += 10

    # Add algorithmic "noise"
    risk_score += np.random.normal(0, 5)
    risk_score = np.clip(risk_score, 0, 100)

    # Risk categorization
    if risk_score < 30:
        risk_category = 'Low Risk'
        recommended_action = 'Normal booking'
    elif risk_score < 60:
        risk_category = 'Medium Risk'
        recommended_action = 'Monitor'
    else:
        risk_category = 'High Risk'
        recommended_action = 'DOUBLE-BOOK'

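    # Binary prediction. Note: this cutoff (55) is lower than the High Risk
    # cutoff (60), so scores in (55, 60) predict a no-show while still being
    # categorized as Medium Risk.
    ai_predicts_noshow = risk_score > 55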

    ai_predictions.append({
        'patient_id': patient['patient_id'],
        'ai_risk_score': round(risk_score, 1),
        'ai_risk_category': risk_category,
        'ai_recommended_action': recommended_action,
        'ai_predicts_noshow': ai_predicts_noshow
    })

    if (idx + 1) % 100 == 0:
        print(f"  Generated predictions for {idx+1}/{NUM_APPOINTMENTS} patients...")

ai_df = pd.DataFrame(ai_predictions)

print(f"\n✅ Dataset 2 complete: {len(ai_df)} predictions")

# Quick bias check
total_high_risk = (ai_df['ai_risk_category'] == 'High Risk').sum()
print(f"\nüö® AI FLAGGING PREVIEW:")
print(f"  Total flagged as HIGH RISK: {total_high_risk} ({total_high_risk/len(ai_df)*100:.1f}%)")

# Merge to show bias by zip
merged_preview = baseline_df.merge(ai_df, on='patient_id')
for zip_code in ['90210', '90005', '90011']:
    zip_data = merged_preview[merged_preview['zip_code'] == zip_code]
    high_risk = (zip_data['ai_risk_category'] == 'High Risk').sum()
    pct = high_risk / len(zip_data) * 100
    print(f"  {zip_code}: {pct:.1f}% flagged ({high_risk}/{len(zip_data)})")

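# Report the measured disparity instead of a hard-coded ratio
high_risk_flags = merged_preview['ai_risk_category'] == 'High Risk'
rate_90011 = high_risk_flags[merged_preview['zip_code'] == '90011'].mean()
rate_90210 = high_risk_flags[merged_preview['zip_code'] == '90210'].mean()
if rate_90210 > 0:
    print(f"\n⚠️ BIAS VISIBLE: Zip 90011 flagged at {rate_90011 / rate_90210:.1f}x the rate of Zip 90210")
else:
    # Avoid dividing by zero when no 90210 patient is flagged
    print(f"\n⚠️ BIAS VISIBLE: Zip 90011 flagged at {rate_90011*100:.1f}% vs 0.0% for Zip 90210")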

# ============================================================================
# SAVE DATASETS TO CSV
# ============================================================================

print("\n\nüíæ SAVING DATASETS TO CSV FILES")
print("="*80)

# Dataset 1: Hospital baseline
baseline_filename = 'hospital_appointments_baseline.csv'
baseline_df.to_csv(f'/tmp/{baseline_filename}', index=False)
print(f"✅ Saved: {baseline_filename}")
print(f"   Rows: {len(baseline_df)} | Columns: {len(baseline_df.columns)}")

# Dataset 2: AI predictions
ai_filename = 'ai_model_predictions.csv'
ai_df.to_csv(f'/tmp/{ai_filename}', index=False)
print(f"✅ Saved: {ai_filename}")
print(f"   Rows: {len(ai_df)} | Columns: {len(ai_df.columns)}")

# ============================================================================
# DOWNLOAD FILES
# ============================================================================

print("\n\nüì• DOWNLOADING FILES TO YOUR COMPUTER")
print("="*80)

from google.colab import files

print("\nüîΩ Starting download...")

files.download(f'/tmp/{baseline_filename}')
print(f"✅ Downloaded: {baseline_filename}")

files.download(f'/tmp/{ai_filename}')
print(f"✅ Downloaded: {ai_filename}")

print("\n" + "="*80)
print("✅ DATA GENERATION COMPLETE")
print("="*80)

print(f"""
📦 YOU NOW HAVE 2 CSV FILES:

1. {baseline_filename}
   → Hospital ground truth data ({len(baseline_df)} appointments)
   → Contains: Demographics, appointment details, actual outcomes

2. {ai_filename}
   → AI vendor predictions ({len(ai_df)} predictions)
   → Contains: Risk scores, risk categories, recommendations

NEXT STEPS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. ✅ Save both CSV files to your computer
2. 📂 Open a NEW Google Colab notebook
3. 📋 Copy and paste SCRIPT 2 (Data Analyzer)
4. 📤 Upload these 2 CSV files when prompted
5. ▶️ Run Script 2 to perform comparative analysis

This simulates real-world workflow:
  Hospital exports data → Uploads to analysis tool → Validates AI

Ready to proceed to Script 2!
""")

📊 DATA GENERATOR: NO-SHOW PREDICTOR AI VALIDATION

This script generates TWO realistic datasets:
  1. Hospital baseline data (ground truth patient appointments)
  2. AI model predictions (vendor AI outputs)

You will download these CSV files, then use them in Script 2 for analysis.
This simulates real-world workflow: Hospital data → AI validation


📦 Installing required libraries...
✅ Libraries loaded

⚙️ CONFIGURATION
Generating 500 MRI appointments over 14 days

📍 SOCIOECONOMIC GEOGRAPHY:
  90210: Beverly Hills (High Income)
    • 40% of patients
    • 15% no-show rate (transportation: Excellent)
  90005: Mid-Wilshire (Middle Income)
    • 35% of patients
    • 25% no-show rate (transportation: Moderate)
  90011: South LA (Low Income)
    • 25% of patients
    • 45% no-show rate (transportation: Poor)


📊 DATASET 1: GENERATING HOSPITAL BASELINE DATA
  Generated 100/500 appointments...
  Generated 200/500 appointments...
  Generated 300/500 appointments

✅ Downloaded: hospital_appointments_baseline.csv


✅ Downloaded: ai_model_predictions.csv

✅ DATA GENERATION COMPLETE

📦 YOU NOW HAVE 2 CSV FILES:

1. hospital_appointments_baseline.csv
   → Hospital ground truth data (500 appointments)
   → Contains: Demographics, appointment details, actual outcomes

2. ai_model_predictions.csv
   → AI vendor predictions (500 predictions)
   → Contains: Risk scores, risk categories, recommendations

NEXT STEPS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. ✅ Save both CSV files to your computer
2. 📂 Open a NEW Google Colab notebook
3. 📋 Copy and paste SCRIPT 2 (Data Analyzer)
4. 📤 Upload these 2 CSV files when prompted
5. ▶️ Run Script 2 to perform comparative analysis

This simulates real-world workflow:
  Hospital exports data → Uploads to analysis tool → Validates AI

Ready to proceed to Script 2!
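
Script 2 is not shown in this notebook, so the following is only a minimal sketch of the kind of first check it might run on the two CSV files: join them on `patient_id` and compare, per zip code, the actual no-show rate with the share of patients flagged "High Risk". The column names come from the generator above; the function name `flag_rates_by_zip` and the four toy rows are hypothetical stand-ins, not part of the original workflow.

```python
import pandas as pd


def flag_rates_by_zip(baseline: pd.DataFrame, predictions: pd.DataFrame) -> pd.DataFrame:
    """Per-zip actual no-show rate next to the share flagged 'High Risk'."""
    # validate= raises if patient_id is duplicated on either side of the join
    merged = baseline.merge(
        predictions, on="patient_id", how="inner", validate="one_to_one"
    )
    merged = merged.assign(flagged=merged["ai_risk_category"].eq("High Risk"))
    return merged.groupby("zip_code").agg(
        appointments=("patient_id", "size"),
        actual_noshow_rate=("noshow", "mean"),
        flagged_rate=("flagged", "mean"),
    )


# Toy rows (hypothetical values) standing in for the two downloaded CSV files
baseline = pd.DataFrame({
    "patient_id": ["PT-0001", "PT-0002", "PT-0003", "PT-0004"],
    "zip_code": ["90210", "90210", "90011", "90011"],
    "noshow": [False, False, True, False],
})
predictions = pd.DataFrame({
    "patient_id": ["PT-0001", "PT-0002", "PT-0003", "PT-0004"],
    "ai_risk_category": ["Low Risk", "Low Risk", "High Risk", "High Risk"],
})

summary = flag_rates_by_zip(baseline, predictions)
print(summary)
```

When loading the real files, passing `dtype={"zip_code": str}` to `pd.read_csv` keeps zip codes as strings; otherwise pandas parses them as integers and comparisons such as `== '90011'` silently match nothing.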

