<a href="https://colab.research.google.com/github/nmansour67/skills-introduction-to-github/blob/main/In_Silico_Protocol_Generating_CSV_Files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# ============================================================================
# ICU EARLY MOBILITY PROTOCOL EVALUATION
# Synthetic Dataset Generation with Realistic Selection Bias
# ============================================================================
#
# Clinical Context:
# High-acuity ICU evaluating Early Mobility protocol vs. Standard Bedrest
#
# Selection Bias (Real-World):
# Sicker patients (higher SOFA scores, more comorbidities) are kept on bedrest
# due to perceived instability, creating confounding in the observational study
#
# This code generates two realistic patient cohorts with embedded bias
# ============================================================================

print("="*80)
print("🏥 ICU EARLY MOBILITY PROTOCOL EVALUATION")
print("="*80)
print("""
CLINICAL SCENARIO:
Your ICU is evaluating whether early mobilization reduces delirium compared
to traditional bedrest protocols. However, clinical decision-making introduces
selection bias: sicker patients are kept on bedrest.

This code generates realistic data that mirrors this real-world bias.
""")

# ============================================================================
# SECTION 1: INSTALL LIBRARIES & SETUP
# ============================================================================

print("\n📦 Installing required libraries...")

import subprocess
import sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "pandas"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numpy"])

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries loaded\n")

# ============================================================================
# SECTION 2: CLINICAL PARAMETERS & BIAS MODELING
# ============================================================================

print("⚙️ SECTION 2: Configuring Clinical Parameters")
print("="*80)

# Number of patients per group
N_MOBILITY = 10  # Early Mobility group
N_BEDREST = 10   # Bedrest group

print(f"Generating {N_MOBILITY} patients in Early Mobility group")
print(f"Generating {N_BEDREST} patients in Bedrest group")

# Clinical parameter distributions
print("\n📊 CLINICAL PARAMETER MODELING:")

# SELECTION BIAS PARAMETERS
print("\n⚠️ SELECTION BIAS (Real-World Clinical Practice):")
print("  • Bedrest group: Higher SOFA scores (sicker patients)")
print("  • Bedrest group: More comorbidities (medical complexity)")
print("  • This mimics real ICU decision-making where unstable patients")
print("    are deemed 'too sick to mobilize'")

# Age distribution (similar across groups - not a selection criterion)
AGE_MEAN = 65
AGE_STD = 12

# SOFA Score distribution (Sequential Organ Failure Assessment, 0-24)
# Higher scores = more organ dysfunction = sicker patients
SOFA_MOBILITY_MEAN = 6.5   # Moderate severity (patients deemed safe to mobilize)
SOFA_MOBILITY_STD = 2.0

SOFA_BEDREST_MEAN = 11.5   # Higher severity (BIAS: sicker patients kept on bedrest)
SOFA_BEDREST_STD = 2.5

print(f"\n  SOFA Score Distributions:")
print(f"    Mobility group: Mean {SOFA_MOBILITY_MEAN} ± {SOFA_MOBILITY_STD}")
print(f"    Bedrest group:  Mean {SOFA_BEDREST_MEAN} ± {SOFA_BEDREST_STD} (⚠️ SICKER)")

# Comorbidity Index (Charlson Comorbidity Index, 0-37)
# Higher scores = more chronic diseases = medical complexity
COMORBIDITY_MOBILITY_MEAN = 3.0    # Fewer comorbidities
COMORBIDITY_MOBILITY_STD = 1.5

COMORBIDITY_BEDREST_MEAN = 5.8     # More comorbidities (BIAS)
COMORBIDITY_BEDREST_STD = 1.8

print(f"\n  Comorbidity Index Distributions:")
print(f"    Mobility group: Mean {COMORBIDITY_MOBILITY_MEAN} ± {COMORBIDITY_MOBILITY_STD}")
print(f"    Bedrest group:  Mean {COMORBIDITY_BEDREST_MEAN} ± {COMORBIDITY_BEDREST_STD} (⚠️ MORE COMPLEX)")

# ============================================================================
# SECTION 3: OUTCOME MODELING (DELIRIUM)
# ============================================================================

print("\n\n🎯 SECTION 3: Outcome Modeling (Delirium Incidence)")
print("="*80)

print("""
DELIRIUM RISK FACTORS (Evidence-Based):
  1. Higher SOFA scores → Increased delirium risk
  2. More comorbidities → Increased delirium risk
  3. Advanced age → Increased delirium risk
  4. Immobility (bedrest) → Increased delirium risk

CONFOUNDING CHALLENGE:
Even if early mobility reduces delirium, the bedrest group is SICKER,
which independently increases their delirium risk. This makes it hard
to isolate the true effect of mobility.

This is why randomized controlled trials are needed, but this dataset
simulates the observational study you'd see in real practice.
""")

def calculate_delirium_probability(age, sofa_score, comorbidity_index, is_bedrest):
    """
    Calculate the probability of delirium from clinical risk factors.

    Risk model (additive, in percentage points, clipped to 5%-95%):
    - Base risk: 20%
    - Age: +0.5% per year above 65
    - SOFA: +3% per point
    - Comorbidity: +2% per point
    - Bedrest: +15% (immobility effect)
    - Early mobility: -10% (protective effect)
    """

    # Base risk
    risk = 0.20

    # Age contribution
    if age > 65:
        risk += (age - 65) * 0.005

    # SOFA contribution (major risk factor)
    risk += sofa_score * 0.03

    # Comorbidity contribution
    risk += comorbidity_index * 0.02

    # Bedrest contribution (the intervention effect we're studying!)
    # Early mobility is PROTECTIVE, bedrest is HARMFUL
    if is_bedrest:
        risk += 0.15  # Bedrest increases delirium risk
    else:
        risk -= 0.10  # Early mobility is protective

    # Clip probability to [0.05, 0.95] so no outcome is ever certain
    risk = np.clip(risk, 0.05, 0.95)

    return risk

print("✅ Delirium risk model configured")
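
# ----------------------------------------------------------------------------
# WORKED EXAMPLE (illustrative sketch, not part of the original generator)
# To see how confounding enters, the additive model above can be evaluated at
# the configured group means from Section 2 (SOFA 6.5 vs. 11.5, comorbidity
# 3.0 vs. 5.8) with age held at 65. The helper `_sketch_risk` and the
# `_sketch_*` variables are hypothetical names used only for this example;
# the coefficients are copied from the risk model above.
# ----------------------------------------------------------------------------

def _sketch_risk(sofa, comorbidity, is_bedrest):
    """Additive delirium risk at age 65 (no age term), clipped to [0.05, 0.95]."""
    risk = 0.20 + 0.03 * sofa + 0.02 * comorbidity
    risk += 0.15 if is_bedrest else -0.10
    return min(max(risk, 0.05), 0.95)

# Average mobility patient, mobilized
_sketch_mobility = _sketch_risk(6.5, 3.0, is_bedrest=False)
# Average bedrest patient, kept on bedrest
_sketch_bedrest = _sketch_risk(11.5, 5.8, is_bedrest=True)
# The same average bedrest patient, had they been mobilized instead
_sketch_counterfactual = _sketch_risk(11.5, 5.8, is_bedrest=False)

# Split the expected gap into a protocol part and a baseline-severity part
_sketch_gap = _sketch_bedrest - _sketch_mobility
_sketch_protocol_part = _sketch_bedrest - _sketch_counterfactual
_sketch_severity_part = _sketch_counterfactual - _sketch_mobility

print("\nWorked example at the configured group means (age 65):")
print(f"  Expected risk gap (bedrest - mobility): {_sketch_gap:.3f}")
print(f"  Part due to the protocol itself:        {_sketch_protocol_part:.3f}")
print(f"  Part due to baseline severity:          {_sketch_severity_part:.3f}")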

# ============================================================================
# SECTION 4: GENERATE GROUP A - EARLY MOBILITY
# ============================================================================

print("\n\n👥 SECTION 4: Generating Group A - Early Mobility Protocol")
print("="*80)

mobility_patients = []

for i in range(N_MOBILITY):
    patient_id = f"MOB-{i+1:03d}"

    # Demographics
    age = int(np.clip(np.random.normal(AGE_MEAN, AGE_STD), 25, 95))

    # Clinical severity (LOWER severity - selection bias)
    sofa_score = int(np.clip(np.random.normal(SOFA_MOBILITY_MEAN, SOFA_MOBILITY_STD), 2, 15))

    # Comorbidities (FEWER comorbidities - selection bias)
    comorbidity_index = int(np.clip(np.random.normal(COMORBIDITY_MOBILITY_MEAN, COMORBIDITY_MOBILITY_STD), 0, 10))

    # Calculate delirium probability
    delirium_prob = calculate_delirium_probability(age, sofa_score, comorbidity_index, is_bedrest=False)

    # Determine outcome (probabilistic)
    outcome_delirium = "Yes" if np.random.random() < delirium_prob else "No"

    mobility_patients.append({
        'Patient_ID': patient_id,
        'Age': age,
        'SOFA_Score': sofa_score,
        'Comorbidity_Index': comorbidity_index,
        'Outcome_Delirium': outcome_delirium
    })

mobility_df = pd.DataFrame(mobility_patients)

print(f"✅ Generated {len(mobility_df)} patients in Early Mobility group")
print(f"\n📊 GROUP A SUMMARY STATISTICS:")
print(f"  Age: {mobility_df['Age'].mean():.1f} ± {mobility_df['Age'].std():.1f} years")
print(f"  SOFA Score: {mobility_df['SOFA_Score'].mean():.1f} ± {mobility_df['SOFA_Score'].std():.1f}")
print(f"  Comorbidity Index: {mobility_df['Comorbidity_Index'].mean():.1f} ± {mobility_df['Comorbidity_Index'].std():.1f}")
print(f"  Delirium Rate: {(mobility_df['Outcome_Delirium']=='Yes').sum()}/{len(mobility_df)} ({(mobility_df['Outcome_Delirium']=='Yes').sum()/len(mobility_df)*100:.0f}%)")
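
# ----------------------------------------------------------------------------
# ALTERNATIVE SKETCH (illustrative, not part of the original generator)
# The per-patient loop above could also be written with vectorized NumPy draws.
# This sketch uses a local np.random.default_rng(0) so the global seed set in
# Section 1 is left untouched; the `_vec_*` names, the sample size of 1000 and
# the seed are assumptions made for the example. Parameter values are the
# Early Mobility settings from Section 2.
# ----------------------------------------------------------------------------
import numpy as np

_vec_rng = np.random.default_rng(0)
_vec_n = 1000

# Draw, clip to the clinical range, then truncate to whole numbers like int()
_vec_age = np.clip(_vec_rng.normal(65, 12, _vec_n), 25, 95).astype(int)
_vec_sofa = np.clip(_vec_rng.normal(6.5, 2.0, _vec_n), 2, 15).astype(int)
_vec_cci = np.clip(_vec_rng.normal(3.0, 1.5, _vec_n), 0, 10).astype(int)

# Same additive risk formula, applied to whole arrays (mobility arm: -0.10)
_vec_risk = np.clip(
    0.20 + 0.005 * np.maximum(_vec_age - 65, 0)
    + 0.03 * _vec_sofa + 0.02 * _vec_cci - 0.10,
    0.05, 0.95,
)
_vec_delirium = _vec_rng.random(_vec_n) < _vec_risk

# Truncation drops the fractional part, so realized means tend to sit below
# the configured means (6.5 for SOFA here)
print(f"\nVectorized sketch: mean SOFA {_vec_sofa.mean():.2f}, "
      f"delirium rate {_vec_delirium.mean():.0%}")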

# ============================================================================
# SECTION 5: GENERATE GROUP B - BEDREST
# ============================================================================

print("\n\n🛏️ SECTION 5: Generating Group B - Standard Bedrest Protocol")
print("="*80)

bedrest_patients = []

for i in range(N_BEDREST):
    patient_id = f"BED-{i+1:03d}"

    # Demographics (similar age distribution)
    age = int(np.clip(np.random.normal(AGE_MEAN, AGE_STD), 25, 95))

    # Clinical severity (HIGHER severity - selection bias!)
    sofa_score = int(np.clip(np.random.normal(SOFA_BEDREST_MEAN, SOFA_BEDREST_STD), 6, 20))

    # Comorbidities (MORE comorbidities - selection bias!)
    comorbidity_index = int(np.clip(np.random.normal(COMORBIDITY_BEDREST_MEAN, COMORBIDITY_BEDREST_STD), 2, 12))

    # Calculate delirium probability
    delirium_prob = calculate_delirium_probability(age, sofa_score, comorbidity_index, is_bedrest=True)

    # Determine outcome (probabilistic)
    outcome_delirium = "Yes" if np.random.random() < delirium_prob else "No"

    bedrest_patients.append({
        'Patient_ID': patient_id,
        'Age': age,
        'SOFA_Score': sofa_score,
        'Comorbidity_Index': comorbidity_index,
        'Outcome_Delirium': outcome_delirium
    })

bedrest_df = pd.DataFrame(bedrest_patients)

print(f"✅ Generated {len(bedrest_df)} patients in Bedrest group")
print(f"\n📊 GROUP B SUMMARY STATISTICS:")
print(f"  Age: {bedrest_df['Age'].mean():.1f} ± {bedrest_df['Age'].std():.1f} years")
print(f"  SOFA Score: {bedrest_df['SOFA_Score'].mean():.1f} ± {bedrest_df['SOFA_Score'].std():.1f} (⚠️ HIGHER)")
print(f"  Comorbidity Index: {bedrest_df['Comorbidity_Index'].mean():.1f} ± {bedrest_df['Comorbidity_Index'].std():.1f} (⚠️ HIGHER)")
print(f"  Delirium Rate: {(bedrest_df['Outcome_Delirium']=='Yes').sum()}/{len(bedrest_df)} ({(bedrest_df['Outcome_Delirium']=='Yes').sum()/len(bedrest_df)*100:.0f}%)")

# ============================================================================
# SECTION 6: COMPARATIVE ANALYSIS (DEMONSTRATING BIAS)
# ============================================================================

print("\n\n⚖️ SECTION 6: Comparative Analysis - Demonstrating Selection Bias")
print("="*80)

print("\n📊 BETWEEN-GROUP COMPARISONS:")

# Age comparison
age_diff = bedrest_df['Age'].mean() - mobility_df['Age'].mean()
print(f"\nAge:")
print(f"  Mobility: {mobility_df['Age'].mean():.1f} years")
print(f"  Bedrest:  {bedrest_df['Age'].mean():.1f} years")
print(f"  Difference: {age_diff:+.1f} years ({'Similar' if abs(age_diff) < 3 else 'Different'})")

# SOFA comparison (KEY BIAS!)
sofa_diff = bedrest_df['SOFA_Score'].mean() - mobility_df['SOFA_Score'].mean()
print(f"\nSOFA Score (Severity):")
print(f"  Mobility: {mobility_df['SOFA_Score'].mean():.1f}")
print(f"  Bedrest:  {bedrest_df['SOFA_Score'].mean():.1f}")
print(f"  Difference: {sofa_diff:+.1f} points")
print(f"  ⚠️ SELECTION BIAS: Bedrest patients are {sofa_diff:.1f} points SICKER")

# Comorbidity comparison (KEY BIAS!)
comorbidity_diff = bedrest_df['Comorbidity_Index'].mean() - mobility_df['Comorbidity_Index'].mean()
print(f"\nComorbidity Index:")
print(f"  Mobility: {mobility_df['Comorbidity_Index'].mean():.1f}")
print(f"  Bedrest:  {bedrest_df['Comorbidity_Index'].mean():.1f}")
print(f"  Difference: {comorbidity_diff:+.1f} points")
print(f"  ⚠️ SELECTION BIAS: Bedrest patients have {comorbidity_diff:.1f} more comorbidities")
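
# ----------------------------------------------------------------------------
# ILLUSTRATIVE SKETCH (not part of the original generator)
# One common way to express baseline imbalance on a unit-free scale is the
# standardized mean difference (SMD): the difference in means divided by the
# pooled standard deviation. Values above roughly 0.1 are often read as
# meaningful imbalance. The function name and the `_smd_*` variables are
# hypothetical; the inputs are the configured distribution parameters from
# Section 2, not the realized sample values.
# ----------------------------------------------------------------------------
import math

def standardized_mean_difference(mean_a, std_a, mean_b, std_b):
    """SMD of group B vs. group A using the pooled standard deviation."""
    pooled_std = math.sqrt((std_a ** 2 + std_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_std

# Group A = Early Mobility, group B = Bedrest
_smd_sofa = standardized_mean_difference(6.5, 2.0, 11.5, 2.5)
_smd_comorbidity = standardized_mean_difference(3.0, 1.5, 5.8, 1.8)
_smd_age = standardized_mean_difference(65, 12, 65, 12)

print("\nStandardized mean differences implied by the configuration:")
print(f"  Age:               {_smd_age:.2f}")
print(f"  SOFA Score:        {_smd_sofa:.2f}")
print(f"  Comorbidity Index: {_smd_comorbidity:.2f}")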

# Outcome comparison
mobility_delirium_rate = (mobility_df['Outcome_Delirium']=='Yes').sum() / len(mobility_df) * 100
bedrest_delirium_rate = (bedrest_df['Outcome_Delirium']=='Yes').sum() / len(bedrest_df) * 100
delirium_diff = bedrest_delirium_rate - mobility_delirium_rate

print(f"\nDelirium Incidence:")
print(f"  Mobility: {mobility_delirium_rate:.0f}%")
print(f"  Bedrest:  {bedrest_delirium_rate:.0f}%")
print(f"  Difference: {delirium_diff:+.0f} percentage points")

if bedrest_delirium_rate > mobility_delirium_rate:
    print(f"  📈 Bedrest group has HIGHER delirium rate")
    print(f"  ❓ Is this because:")
    print(f"     (a) Bedrest CAUSES more delirium? OR")
    print(f"     (b) Bedrest patients were SICKER to begin with?")
    print(f"  → This is the CONFOUNDING problem in observational studies!")
elif bedrest_delirium_rate < mobility_delirium_rate:
    print(f"  📉 Mobility group has HIGHER delirium rate")
    print(f"  ❓ This might seem counterintuitive, but remember:")
    print(f"     Selection bias can mask true treatment effects")
else:
    print(f"  Both groups have the SAME delirium rate")
    print(f"     Selection bias can mask true treatment effects")
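
# ----------------------------------------------------------------------------
# ILLUSTRATIVE SKETCH (not part of the original generator)
# The "(a) or (b)" question above can be explored on a larger simulated
# cohort. Because the generator's risk model is additive, this sketch uses an
# ordinary least squares linear probability model (np.linalg.lstsq) rather
# than logistic regression. Assumptions: 10,000 patients per arm, a local
# random generator (seed 0) so the global seed is untouched, continuous
# covariates without the int() truncation or range clipping used above, and
# hypothetical `_adj_*` names throughout.
# ----------------------------------------------------------------------------
import numpy as np

_adj_rng = np.random.default_rng(0)
_adj_n = 10_000  # patients per arm

def _adj_simulate(sofa_mean, sofa_std, cci_mean, cci_std, is_bedrest):
    """Simulate one arm; returns (design matrix, 0/1 delirium outcomes)."""
    age_over_65 = np.maximum(_adj_rng.normal(65, 12, _adj_n) - 65, 0)
    sofa = _adj_rng.normal(sofa_mean, sofa_std, _adj_n)
    cci = _adj_rng.normal(cci_mean, cci_std, _adj_n)
    risk = 0.20 + 0.005 * age_over_65 + 0.03 * sofa + 0.02 * cci
    risk += 0.15 if is_bedrest else -0.10
    outcome = _adj_rng.random(_adj_n) < np.clip(risk, 0.05, 0.95)
    treatment = np.full(_adj_n, float(is_bedrest))
    # Columns: intercept, bedrest indicator, years over 65, SOFA, comorbidity
    design = np.column_stack([np.ones(_adj_n), treatment, age_over_65, sofa, cci])
    return design, outcome.astype(float)

_adj_X_mob, _adj_y_mob = _adj_simulate(6.5, 2.0, 3.0, 1.5, is_bedrest=False)
_adj_X_bed, _adj_y_bed = _adj_simulate(11.5, 2.5, 5.8, 1.8, is_bedrest=True)

# Crude comparison: difference in observed delirium rates
_adj_crude = _adj_y_bed.mean() - _adj_y_mob.mean()

# Adjusted comparison: coefficient on the bedrest indicator after controlling
# for age, SOFA and comorbidity
_adj_X = np.vstack([_adj_X_mob, _adj_X_bed])
_adj_y = np.concatenate([_adj_y_mob, _adj_y_bed])
_adj_coef, *_ = np.linalg.lstsq(_adj_X, _adj_y, rcond=None)
_adj_adjusted = _adj_coef[1]

print("\nRegression adjustment sketch (simulated cohort, 10,000 per arm):")
print(f"  Crude risk difference (bedrest - mobility): {_adj_crude:+.3f}")
print(f"  Adjusted bedrest coefficient:               {_adj_adjusted:+.3f}")
# +0.15 for bedrest versus -0.10 for mobility gives a built-in gap of 0.25
print("  Effect built into the generator:            +0.250")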

# ============================================================================
# SECTION 7: SAVE CSV FILES
# ============================================================================

print("\n\n💾 SECTION 7: Saving CSV Files")
print("="*80)

# Save Group A (Early Mobility)
mobility_filename = 'mobility_group.csv'
mobility_df.to_csv(f'/tmp/{mobility_filename}', index=False)
print(f"✅ Saved: {mobility_filename}")
print(f"   Location: /tmp/{mobility_filename}")
print(f"   Rows: {len(mobility_df)}")
print(f"   Columns: {', '.join(mobility_df.columns)}")

# Save Group B (Bedrest)
bedrest_filename = 'bedrest_group.csv'
bedrest_df.to_csv(f'/tmp/{bedrest_filename}', index=False)
print(f"\n✅ Saved: {bedrest_filename}")
print(f"   Location: /tmp/{bedrest_filename}")
print(f"   Rows: {len(bedrest_df)}")
print(f"   Columns: {', '.join(bedrest_df.columns)}")

# ============================================================================
# SECTION 8: DISPLAY SAMPLE DATA
# ============================================================================

print("\n\n📋 SECTION 8: Sample Data Preview")
print("="*80)

print("\n📊 GROUP A - EARLY MOBILITY (First 5 patients):")
print(mobility_df.head())

print("\n📊 GROUP B - BEDREST (First 5 patients):")
print(bedrest_df.head())

# ============================================================================
# SECTION 9: DOWNLOAD FILES
# ============================================================================

print("\n\n📥 SECTION 9: Downloading Files")
print("="*80)

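# google.colab only exists inside Colab; skip the browser download elsewhere
try:
    from google.colab import files
except ImportError:
    files = None

if files is not None:
    print("\n🔽 Downloading CSV files to your computer...\n")

    files.download(f'/tmp/{mobility_filename}')
    print(f"✅ Downloaded: {mobility_filename}")

    files.download(f'/tmp/{bedrest_filename}')
    print(f"✅ Downloaded: {bedrest_filename}")
else:
    print("\nNot running in Colab; the CSV files remain in /tmp/")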

# ============================================================================
# SECTION 10: EDUCATIONAL SUMMARY
# ============================================================================

print("\n\n" + "="*80)
print("📚 EDUCATIONAL SUMMARY: UNDERSTANDING SELECTION BIAS")
print("="*80)

print(f"""
🎓 WHAT YOU JUST CREATED:

Two ICU patient cohorts with REALISTIC SELECTION BIAS:

GROUP A - EARLY MOBILITY:
  • {N_MOBILITY} patients
  • Average SOFA: {mobility_df['SOFA_Score'].mean():.1f} (moderate severity)
  • Average Comorbidities: {mobility_df['Comorbidity_Index'].mean():.1f}
  • Delirium Rate: {mobility_delirium_rate:.0f}%
  • Clinical Profile: Stable enough for mobilization

GROUP B - BEDREST:
  • {N_BEDREST} patients
  • Average SOFA: {bedrest_df['SOFA_Score'].mean():.1f} (HIGHER severity - BIAS!)
  • Average Comorbidities: {bedrest_df['Comorbidity_Index'].mean():.1f} (MORE complex - BIAS!)
  • Delirium Rate: {bedrest_delirium_rate:.0f}%
  • Clinical Profile: Deemed "too unstable" for mobilization

⚠️ THE SELECTION BIAS PROBLEM:

In real ICU practice, clinicians use clinical judgment to decide who gets
mobilized. Sicker patients are kept on bedrest due to safety concerns.

This creates a FUNDAMENTAL CONFOUNDING problem:
  • If bedrest patients have worse outcomes, is it because:
    (a) Bedrest is harmful? OR
    (b) They were sicker to begin with?

You CANNOT determine causation from this observational data alone!

💡 IMPLICATIONS FOR RESEARCH:

This dataset demonstrates why you need:
  1. Propensity score matching (adjust for baseline differences)
  2. Multivariate regression (control for confounders)
  3. Randomized controlled trials (eliminate selection bias)

📊 NEXT STEPS FOR ANALYSIS:

1. Load both CSVs into a statistical package (R, SPSS, Python)
2. Perform propensity score matching on SOFA + Comorbidity
3. Compare matched cohorts
4. Use logistic regression controlling for confounders
5. Interpret results cautiously (observational data limitations!)

🎯 THE CORE LESSON:

"Association ≠ Causation"

Just because bedrest patients have different outcomes doesn't mean
bedrest CAUSED those outcomes. Selection bias matters!

This is why rigorous study design (RCTs) and advanced statistical
methods (propensity matching, instrumental variables) are essential
in healthcare research.

You now have realistic data that teaches this fundamental principle.
""")

print("="*80)
print("✅ DATA GENERATION COMPLETE")
print("="*80)
print("""
📦 YOU NOW HAVE:
  • mobility_group.csv (Early Mobility patients)
  • bedrest_group.csv (Standard Bedrest patients)
  • Realistic selection bias built into the data
  • A teaching tool for understanding confounding

USE THIS DATA TO:
  • Teach research methodology
  • Practice statistical adjustment techniques
  • Understand limitations of observational studies
  • Demonstrate need for RCTs in clinical research

Ready for your analysis! 📊
""")

🏥 ICU EARLY MOBILITY PROTOCOL EVALUATION

CLINICAL SCENARIO:
Your ICU is evaluating whether early mobilization reduces delirium compared
to traditional bedrest protocols. However, clinical decision-making introduces
selection bias: sicker patients are kept on bedrest.

This code generates realistic data that mirrors this real-world bias.


📦 Installing required libraries...
✅ Libraries loaded

⚙️ SECTION 2: Configuring Clinical Parameters
Generating 10 patients in Early Mobility group
Generating 10 patients in Bedrest group

📊 CLINICAL PARAMETER MODELING:

⚠️ SELECTION BIAS (Real-World Clinical Practice):
  • Bedrest group: Higher SOFA scores (sicker patients)
  • Bedrest group: More comorbidities (medical complexity)
  • This mimics real ICU decision-making where unstable patients
    are deemed 'too sick to mobilize'

  SOFA Score Distributions:
    Mobility group: Mean 6.5 ± 2.0
    Bedrest group:  Mean 11.5 ± 2.5 (⚠️ SICKER)

  Comorbidity Index Distr

✅ Downloaded: mobility_group.csv

✅ Downloaded: bedrest_group.csv


📚 EDUCATIONAL SUMMARY: UNDERSTANDING SELECTION BIAS

🎓 WHAT YOU JUST CREATED:

Two ICU patient cohorts with REALISTIC SELECTION BIAS:

GROUP A - EARLY MOBILITY:
  • 10 patients
  • Average SOFA: 6.1 (moderate severity)
  • Average Comorbidities: 2.7
  • Delirium Rate: 50%
  • Clinical Profile: Stable enough for mobilization

GROUP B - BEDREST:
  • 10 patients
  • Average SOFA: 10.6 (HIGHER severity - BIAS!)
  • Average Comorbidities: 5.1 (MORE complex - BIAS!)
  • Delirium Rate: 70%
  • Clinical Profile: Deemed "too unstable" for mobilization

⚠️ THE SELECTION BIAS PROBLEM:

In real ICU practice, clinicians use clinical judgment to decide who gets
mobilized. Sicker patients are kept on bedrest due to safety concerns.

This creates a FUNDAMENTAL CONFOUNDING problem:
  • If bedrest patients have worse outcomes, is it because:
    (a) Bedrest is harmful? OR
    (b) They were sicker to begin with?

You CANNOT determine cau