# 🤖 Exercise 10: Clinical LLM Experimentation

**Week 10 | AI in Healthcare Curriculum**

---

## Learning Objectives

By completing this exercise, you will:

- 🎯 Set up and interact with a clinical LLM via API
- 🎯 Systematically evaluate clinical knowledge accuracy
- 🎯 Test clinical reasoning capabilities and limitations
- 🎯 Probe for hallucination, temporal limits, and Australian-specific gaps
- 🎯 Experiment with prompt engineering techniques
- 🎯 Assess LLM utility for clinical documentation support

---

## ⏱️ Estimated Time: 2 hours

---

## Context

Large Language Models (LLMs) like GPT-4 and Claude are increasingly being explored for clinical applications. Unlike the ML models we've examined previously, LLMs are **general-purpose** systems trained on vast text corpora, including medical literature.

This creates both opportunities and risks:
- **Opportunities:** Flexible, conversational interfaces; broad knowledge; documentation support
- **Risks:** Hallucination; confident but wrong answers; outdated information; gaps in local context

**Your task:** Systematically evaluate an LLM's capabilities and limitations for clinical use cases relevant to your practice.

**Important:** This exercise uses a simulated LLM API for teaching purposes. The principles apply to any clinical LLM evaluation.

## Part 1: Setup and LLM Access

In [None]:
# Setup - run this first!

import numpy as np
import pandas as pd
import time
import json
import warnings
warnings.filterwarnings('ignore')

# For real API access, you would use:
# !pip install anthropic
# import anthropic

print("✅ Setup complete!")

In [None]:
# LLM Query Function
# For this exercise, we'll create a simulated LLM that demonstrates
# realistic behaviours including occasional errors and limitations

# In production, you would use the actual API:
# 
# import anthropic
# client = anthropic.Anthropic(api_key="your-api-key")
#
# def query_llm(prompt, system_prompt="You are a helpful clinical assistant."):
#     response = client.messages.create(
#         model="claude-sonnet-4-20250514",
#         max_tokens=1024,
#         system=system_prompt,
#         messages=[{"role": "user", "content": prompt}]
#     )
#     return response.content[0].text

class SimulatedClinicalLLM:
    """
    A simulated LLM for educational purposes that demonstrates
    realistic clinical AI behaviours including strengths and limitations.
    """
    
    def __init__(self):
        self.responses = self._load_responses()
        
    def _load_responses(self):
        """Pre-defined responses demonstrating various LLM behaviours."""
        return {
            # Accurate clinical knowledge
            "sepsis-3": """The Sepsis-3 definition (2016) defines sepsis as life-threatening organ dysfunction caused by a dysregulated host response to infection. 

Key criteria:
• Suspected or documented infection, PLUS
• Acute increase of ≥2 points in SOFA score

The qSOFA (quick SOFA) screening tool for patients outside ICU includes:
• Respiratory rate ≥22/min
• Altered mentation (GCS <15)
• Systolic blood pressure ≤100 mmHg

Two or more qSOFA criteria suggest a higher risk of poor outcome.

Septic shock is defined as sepsis with:
• Persisting hypotension requiring vasopressors to maintain MAP ≥65 mmHg
• Serum lactate >2 mmol/L despite adequate fluid resuscitation""",

            "tension-pneumothorax": """Management of tension pneumothorax in the pre-hospital setting:

IMMEDIATE DECOMPRESSION is required - this is a clinical diagnosis, do not delay for imaging.

Needle thoracostomy:
• Site: 2nd intercostal space, midclavicular line (traditional) OR
• 4th-5th intercostal space, anterior axillary line (increasingly preferred)
• Use large-bore cannula (14G minimum, ideally longer needle in larger patients)
• Insert perpendicular to chest wall, just above the rib
• Rush of air confirms diagnosis

Finger thoracostomy (if trained):
• More definitive than needle decompression
• 4th-5th intercostal space, anterior axillary line
• Blunt dissection through chest wall
• Finger sweep to confirm pleural space

Post-procedure:
• Monitor for re-accumulation
• Definitive chest drain at hospital
• Reassess vital signs frequently""",

            "thrombolysis-contraindications": """Contraindications to thrombolysis in acute ischaemic stroke:

ABSOLUTE CONTRAINDICATIONS:
• Intracranial haemorrhage on imaging
• Ischaemic stroke or severe head trauma within 3 months
• Intracranial neoplasm, AVM, or aneurysm
• Recent intracranial or spinal surgery
• Active internal bleeding
• Aortic dissection
• Bleeding diathesis (platelets <100,000, INR >1.7, aPTT elevated)

RELATIVE CONTRAINDICATIONS:
• Uncontrolled hypertension (SBP >185 or DBP >110 despite treatment)
• Major surgery within 14 days
• GI/urinary haemorrhage within 21 days
• Recent arterial puncture at non-compressible site
• Seizure at stroke onset
• Pregnancy
• Recent MI (relative - within 3 months)

TIME WINDOW:
• IV alteplase: within 4.5 hours of symptom onset
• Tenecteplase: emerging evidence, similar window
• Extended window may apply with perfusion imaging""",

            # Hallucination example - fictional trial
            "retrieval-2-trial": """The RETRIEVAL-2 trial (Richardson et al., 2023) was a multicentre randomised controlled trial examining adrenaline dosing in out-of-hospital cardiac arrest.

Key findings:
• n = 4,892 patients across 45 EMS services
• Compared standard 1mg adrenaline vs. weight-based dosing (10mcg/kg)
• Primary outcome: Survival to hospital discharge with favourable neurological outcome

Results showed weight-based dosing improved survival (12.3% vs 9.8%, p=0.02) with better neurological outcomes (mRS 0-2: 8.1% vs 5.9%).

The trial recommended updating resuscitation guidelines to adopt weight-based adrenaline dosing.""",

            # Australian-specific gaps
            "pbs-apixaban": """I can provide general information about apixaban PBS listing in Australia, though I recommend verifying current details with the PBS website as listings can change.

Apixaban (Eliquis) is PBS-listed for:
• Prevention of stroke and systemic embolism in non-valvular atrial fibrillation
• Treatment and prevention of deep vein thrombosis (DVT) and pulmonary embolism (PE)
• VTE prophylaxis following hip or knee replacement surgery

Restrictions typically apply based on:
• CHA2DS2-VASc score for AF indication
• Prior treatment history
• Specialist initiation for some indications

NOTE: I may not have the most current PBS schedule information. Please check www.pbs.gov.au for current restrictions and authority requirements.""",

            # Temporal limitation
            "anzcor-2024": """I don't have reliable information about specific 2024 ANZCOR guideline updates. My knowledge has a training cutoff, and clinical guidelines are updated regularly.

For the most current ANZCOR (Australian and New Zealand Committee on Resuscitation) guidelines, I recommend:

• Official ANZCOR website: www.resus.org.au
• Check the specific guideline number you need (e.g., ANZCOR Guideline 11.2 for cardiac arrest)
• Note the publication/revision date on each guideline

Key points about ANZCOR guidelines:
• Aligned with ILCOR (International Liaison Committee on Resuscitation) evidence reviews
• Updated following major ILCOR consensus conferences
• May differ slightly from AHA or ERC guidelines

Would you like me to discuss the general principles of a specific resuscitation topic instead?""",

            # Clinical reasoning
            "chest-pain-reasoning": """Clinical reasoning for chest pain assessment:

IMMEDIATE CONSIDERATIONS (life-threatening):
1. Acute Coronary Syndrome (STEMI/NSTEMI/Unstable Angina)
2. Aortic dissection
3. Pulmonary embolism
4. Tension pneumothorax
5. Cardiac tamponade
6. Oesophageal rupture (Boerhaave syndrome)

KEY DISCRIMINATING FEATURES:

ACS features:
• Pressure/squeezing quality
• Radiation to arm, jaw, back
• Associated diaphoresis, nausea
• Risk factors: age, smoking, diabetes, hypertension, family history

Aortic dissection red flags:
• Sudden onset, "tearing" quality
• Maximal at onset
• Pulse/BP differential between arms
• Neurological symptoms
• History of hypertension, connective tissue disorder

PE features:
• Pleuritic component
• Associated dyspnoea, tachycardia
• Risk factors: immobility, recent surgery, malignancy, OCP

INITIAL WORKUP:
• ECG (immediate)
• Troponin (serial)
• CXR
• D-dimer (if PE suspected, use clinical probability score)
• Consider CT aortogram/CTPA based on clinical suspicion""",

            # Documentation support
            "discharge-summary": """DRAFT DISCHARGE SUMMARY

PATIENT: [Name]
DOB: [Date]
MRN: [Number]
ADMISSION DATE: [Date]
DISCHARGE DATE: [Date]
ATTENDING: [Physician]

PRINCIPAL DIAGNOSIS:
Community-acquired pneumonia (right lower lobe)

SECONDARY DIAGNOSES:
• Type 2 diabetes mellitus
• Hypertension
• Chronic kidney disease stage 3

PRESENTING COMPLAINT:
3-day history of productive cough, fevers, and increasing dyspnoea.

HOSPITAL COURSE:
72-year-old male admitted with community-acquired pneumonia confirmed on chest X-ray showing right lower lobe consolidation. Initial observations: T 38.9°C, HR 102, BP 128/76, RR 24, SpO2 91% on room air. CRP 187, WCC 14.2. CURB-65 score of 2.

Commenced on IV amoxicillin/clavulanate per hospital CAP guidelines. Required supplemental oxygen (2L NP) for first 48 hours. Transitioned to oral antibiotics day 3 with clinical improvement. Blood cultures negative.

DISCHARGE MEDICATIONS:
[To be completed - verify against medication reconciliation]

FOLLOW-UP:
• GP review in 1 week
• Repeat CXR in 6 weeks to confirm resolution
• Smoking cessation support referral

---
NOTE: This is a draft requiring clinical review and verification of all details."""
        }
    
    def query(self, prompt, system_prompt=None):
        """Simulate an LLM query with appropriate response."""
        prompt_lower = prompt.lower()
        
        # Add slight delay to simulate API call
        time.sleep(0.5)
        
        # Match to pre-defined responses based on keywords
        if 'sepsis' in prompt_lower and ('criteria' in prompt_lower or 'definition' in prompt_lower or 'sepsis-3' in prompt_lower):
            return self.responses['sepsis-3']
        elif 'tension pneumothorax' in prompt_lower or ('pneumothorax' in prompt_lower and 'pre-hospital' in prompt_lower):
            return self.responses['tension-pneumothorax']
        elif 'thrombolysis' in prompt_lower and ('contraindication' in prompt_lower or 'stroke' in prompt_lower):
            return self.responses['thrombolysis-contraindications']
        elif 'retrieval-2' in prompt_lower or ('retrieval' in prompt_lower and 'trial' in prompt_lower and 'adrenaline' in prompt_lower):
            return self.responses['retrieval-2-trial']
        elif 'pbs' in prompt_lower and 'apixaban' in prompt_lower:
            return self.responses['pbs-apixaban']
        elif 'anzcor' in prompt_lower and ('2024' in prompt_lower or 'latest' in prompt_lower or 'current' in prompt_lower):
            return self.responses['anzcor-2024']
        elif 'chest pain' in prompt_lower and ('differential' in prompt_lower or 'reasoning' in prompt_lower or 'assessment' in prompt_lower or 'approach' in prompt_lower):
            return self.responses['chest-pain-reasoning']
        elif 'discharge' in prompt_lower and 'summary' in prompt_lower:
            return self.responses['discharge-summary']
        else:
            return self._generate_generic_response(prompt)
    
    def _generate_generic_response(self, prompt):
        """Generate a generic response for unmatched queries."""
        return f"""I can help with clinical questions, though I should note some important limitations:

1. My training data has a knowledge cutoff, so recent guidelines or evidence may not be reflected
2. I may not have complete information about Australian-specific contexts (PBS, local guidelines)
3. All clinical information should be verified against authoritative sources

Regarding your question about: "{prompt[:100]}..."

I'd be happy to provide general clinical information, but please verify any specific recommendations with current guidelines and local protocols.

Could you clarify what specific aspect you'd like me to address?"""

# Create the simulated LLM
llm = SimulatedClinicalLLM()

def query_llm(prompt, system_prompt="You are a helpful clinical assistant."):
    """Query the LLM with a prompt."""
    return llm.query(prompt, system_prompt)

print("✅ LLM interface ready!")
print("\n📌 Note: This exercise uses a simulated LLM for teaching.")
print("   Real API code is provided in comments for production use.")

In [None]:
# Test the LLM connection
print("Testing LLM connection...")
print("="*60)

test_response = query_llm("What are the diagnostic criteria for sepsis according to Sepsis-3?")

print("\n✅ Connection successful!")
print("\nTest response preview (first 200 chars):")
print(test_response[:200] + "...")

## Part 2: Clinical Knowledge Assessment

Let's systematically test the LLM's clinical knowledge across different domains.

In [None]:
# Define clinical knowledge test questions
clinical_questions = [
    {
        'domain': 'Critical Care',
        'question': "What are the diagnostic criteria for sepsis according to Sepsis-3?",
        'expected_elements': ['SOFA score', 'organ dysfunction', 'infection', 'qSOFA'],
        'source': 'Sepsis-3 Consensus Definitions (JAMA 2016)'
    },
    {
        'domain': 'Emergency Medicine',
        'question': "Describe the management of a tension pneumothorax in the pre-hospital setting.",
        'expected_elements': ['needle decompression', 'intercostal space', 'immediate', 'clinical diagnosis'],
        'source': 'ANZCOR/ERC Guidelines'
    },
    {
        'domain': 'Neurology',
        'question': "What are the contraindications to thrombolysis in acute ischaemic stroke?",
        'expected_elements': ['haemorrhage', 'time window', 'blood pressure', 'recent surgery'],
        'source': 'AHA/ASA Stroke Guidelines'
    }
]

print("Clinical Knowledge Assessment")
print("="*70)

In [None]:
# Run clinical knowledge tests
knowledge_results = []

for i, q in enumerate(clinical_questions, 1):
    print(f"\n{'='*70}")
    print(f"Question {i}: {q['domain']}")
    print(f"{'='*70}")
    print(f"\n📝 Question: {q['question']}")
    print(f"\n🔍 Expected elements: {', '.join(q['expected_elements'])}")
    print(f"\n💬 LLM Response:")
    print("-"*50)
    
    response = query_llm(q['question'])
    print(response)
    
    # Check for expected elements
    response_lower = response.lower()
    elements_found = [elem for elem in q['expected_elements'] if elem.lower() in response_lower]
    elements_missing = [elem for elem in q['expected_elements'] if elem.lower() not in response_lower]
    
    print(f"\n📊 Assessment:")
    print(f"   Elements found: {len(elements_found)}/{len(q['expected_elements'])}")
    if elements_missing:
        print(f"   Missing: {', '.join(elements_missing)}")
    
    knowledge_results.append({
        'domain': q['domain'],
        'elements_found': len(elements_found),
        'elements_total': len(q['expected_elements']),
        'score': len(elements_found) / len(q['expected_elements'])
    })

In [None]:
# Summarise knowledge assessment
print("\n" + "="*70)
print("CLINICAL KNOWLEDGE SUMMARY")
print("="*70)

results_df = pd.DataFrame(knowledge_results)
print(f"\n{'Domain':<25} {'Score':<15} {'Rating':<15}")
print("-"*55)

for _, row in results_df.iterrows():
    score = row['score']
    if score >= 0.75:
        rating = "✅ Good"
    elif score >= 0.5:
        rating = "⚠️ Partial"
    else:
        rating = "❌ Poor"
    print(f"{row['domain']:<25} {score:.0%}{'':>8} {rating:<15}")

print(f"\nOverall average: {results_df['score'].mean():.0%}")
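
Plain substring matching is a blunt scoring instrument. The stroke response above mentions "SBP >185" and "Major surgery within 14 days", yet it is marked as missing "blood pressure" and "recent surgery" because those literal phrases never appear. A minimal sketch of a more forgiving matcher follows, assuming a hand-written synonym table. The `SYNONYMS` dictionary and the `find_elements` helper are illustrative and not part of the exercise code.

```python
import re

# Hypothetical synonym table: each expected element maps to acceptable phrasings.
SYNONYMS = {
    "blood pressure": ["blood pressure", "sbp", "dbp", "hypertension"],
    "recent surgery": ["recent surgery", "major surgery", "spinal surgery"],
}

def find_elements(response, expected_elements, synonyms=SYNONYMS):
    """Return (found, missing) using whole-word, case-insensitive matching."""
    found, missing = [], []
    for element in expected_elements:
        variants = synonyms.get(element.lower(), [element])
        # \b anchors stop short keywords from matching inside longer words
        hit = any(
            re.search(r"\b" + re.escape(v) + r"\b", response, flags=re.IGNORECASE)
            for v in variants
        )
        (found if hit else missing).append(element)
    return found, missing

sample = "Uncontrolled hypertension (SBP >185) and major surgery within 14 days."
found, missing = find_elements(sample, ["blood pressure", "recent surgery", "haemorrhage"])
print("Found:", found)
print("Missing:", missing)
```

Even with synonyms, a keyword score only shows that a term was mentioned, not that the statement about it is correct, so it complements rather than replaces clinician review.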

## Part 3: Probing for Limitations

Now let's probe the LLM's limitations: hallucination, temporal knowledge gaps, and Australian-specific knowledge.

In [None]:
# Test 1: Hallucination - Ask about a fictional trial
print("="*70)
print("LIMITATION TEST 1: Hallucination Detection")
print("="*70)

print("\n📝 Testing with a FICTIONAL trial name...")
print("\nQuestion: 'What did the RETRIEVAL-2 trial show about adrenaline dosing?'")
print("\n⚠️ NOTE: RETRIEVAL-2 is a fictional trial - it does not exist!")
print("-"*50)

hallucination_response = query_llm("What did the RETRIEVAL-2 trial show about adrenaline dosing?")
print(f"\n💬 LLM Response:\n{hallucination_response}")

print("\n" + "-"*50)
print("📊 Assessment:")
if "don't have" in hallucination_response.lower() or "cannot find" in hallucination_response.lower() or "not aware" in hallucination_response.lower():
    print("   ✅ LLM appropriately indicated uncertainty")
else:
    print("   ⚠️ WARNING: LLM may have fabricated information about a non-existent trial!")
    print("   This is a HALLUCINATION - the trial does not exist.")
    print("\n   🔴 This demonstrates why LLM outputs must ALWAYS be verified.")
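
Checking for refusal phrases catches only one signal. A fabricated answer usually contains specifics that can be checked against a primary source: an author and year, a sample size, percentages, p-values. One way to turn a response into a verification checklist is sketched below with regular expressions. The patterns and the `extract_checkable_claims` name are assumptions for illustration and will miss many phrasings.

```python
import re

def extract_checkable_claims(text):
    """Pull out specifics a reviewer should verify against a primary source."""
    return {
        # e.g. "Richardson et al., 2023"
        "citations": re.findall(r"[A-Z][a-z]+ et al\.,? \d{4}", text),
        # e.g. "n = 4,892"
        "sample_sizes": re.findall(r"\bn\s*=\s*[\d,]+", text),
        # e.g. "12.3%"
        "percentages": re.findall(r"\d+(?:\.\d+)?%", text),
        # e.g. "p=0.02"
        "p_values": re.findall(r"\bp\s*[=<]\s*0?\.\d+", text),
    }

sample = (
    "The RETRIEVAL-2 trial (Richardson et al., 2023) enrolled n = 4,892 patients. "
    "Survival improved (12.3% vs 9.8%, p=0.02)."
)
claims = extract_checkable_claims(sample)
for kind, items in claims.items():
    print(f"{kind}: {items}")
```

Each extracted item is a prompt for a manual lookup (trial registry, journal, guideline), not evidence in itself: a hallucinated trial produces a perfectly tidy checklist.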

In [None]:
# Test 2: Temporal limitations
print("="*70)
print("LIMITATION TEST 2: Temporal Knowledge Limits")
print("="*70)

print("\n📝 Testing knowledge of recent guidelines...")
print("\nQuestion: 'What are the latest 2024 ANZCOR guidelines?'")
print("-"*50)

temporal_response = query_llm("What are the latest 2024 ANZCOR guidelines?")
print(f"\n💬 LLM Response:\n{temporal_response}")

print("\n" + "-"*50)
print("📊 Assessment:")
if "knowledge" in temporal_response.lower() and ("cutoff" in temporal_response.lower() or "training" in temporal_response.lower() or "don't have" in temporal_response.lower()):
    print("   ✅ LLM appropriately acknowledged temporal limitations")
else:
    print("   ⚠️ LLM may not have clearly indicated its knowledge cutoff")

print("\n💡 Key Learning: LLMs have training cutoff dates. Recent guidelines,")
print("   evidence, or events may not be reflected in their responses.")

In [None]:
# Test 3: Australian-specific knowledge
print("="*70)
print("LIMITATION TEST 3: Australian-Specific Knowledge")
print("="*70)

print("\n📝 Testing Australian PBS knowledge...")
print("\nQuestion: 'What PBS restrictions apply to apixaban in Australia?'")
print("-"*50)

australian_response = query_llm("What PBS restrictions apply to apixaban in Australia?")
print(f"\n💬 LLM Response:\n{australian_response}")

print("\n" + "-"*50)
print("📊 Assessment:")
if "verify" in australian_response.lower() or "pbs.gov.au" in australian_response.lower() or "current" in australian_response.lower():
    print("   ✅ LLM appropriately recommended verification")
if "authority" in australian_response.lower() or "restriction" in australian_response.lower():
    print("   ✅ LLM demonstrated some PBS knowledge")
else:
    print("   ⚠️ Response may lack Australian-specific detail")

print("\n💡 Key Learning: LLMs trained primarily on US data may have gaps")
print("   in Australian-specific knowledge (PBS, TGA, AHPRA, Medicare).")

### 🔧 Your Turn: Design a Limitation Test

Create your own test to probe an LLM limitation relevant to your clinical context.

In [None]:
# YOUR CODE: Design and run your own limitation test

# Example structure:
my_test = {
    'name': 'Your Test Name',
    'category': 'Hallucination / Temporal / Local Context / Other',
    'question': 'Your question here',
    'why_this_tests_limitation': 'Explain what limitation this tests',
    'expected_good_response': 'What should a good response include?'
}

# Run your test
print(f"YOUR LIMITATION TEST: {my_test['name']}")
print("="*60)
print(f"\nCategory: {my_test['category']}")
print(f"\nQuestion: {my_test['question']}")
print(f"\nWhy this tests a limitation: {my_test['why_this_tests_limitation']}")

# Uncomment to run:
# response = query_llm(my_test['question'])
# print(f"\nResponse: {response}")

## Part 4: Clinical Reasoning Evaluation

Can the LLM demonstrate clinical reasoning, not just knowledge recall?

In [None]:
# Test clinical reasoning with a case
print("="*70)
print("CLINICAL REASONING EVALUATION")
print("="*70)

reasoning_prompt = """A 58-year-old male presents with sudden onset chest pain that is severe, 
tearing in quality, radiating to his back. He has a history of hypertension. 
On examination, his BP is 180/100 in the right arm and 150/90 in the left arm.

What is your approach to the differential diagnosis and initial assessment?"""

print(f"\n📋 Clinical Case:\n{reasoning_prompt}")
print("\n" + "-"*50)

reasoning_response = query_llm(reasoning_prompt)
print(f"\n💬 LLM Response:\n{reasoning_response}")

In [None]:
# Evaluate clinical reasoning quality
print("\n" + "="*70)
print("REASONING QUALITY ASSESSMENT")
print("="*70)

reasoning_criteria = {
    'Identifies key diagnosis': ['aortic dissection', 'dissection'],
    'Notes red flags': ['tearing', 'sudden', 'bp differential', 'pulse'],
    'Structured approach': ['differential', 'workup', 'investigation'],
    'Appropriate urgency': ['urgent', 'immediate', 'emergency', 'life-threatening'],
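    # Multi-word phrases avoid false positives: a bare 'ct' also matches inside 'dissection'
    'Mentions key test': ['ct aortogram', 'ct angiogram', 'cta', 'aortogram', 'imaging']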
}

response_lower = reasoning_response.lower()

print(f"\n{'Criterion':<35} {'Met?':<10}")
print("-"*45)

criteria_met = 0
for criterion, keywords in reasoning_criteria.items():
    met = any(kw in response_lower for kw in keywords)
    status = "✅ Yes" if met else "❌ No"
    if met:
        criteria_met += 1
    print(f"{criterion:<35} {status:<10}")

print(f"\nOverall: {criteria_met}/{len(reasoning_criteria)} criteria met")

if criteria_met >= 4:
    print("\n✅ LLM demonstrated reasonable clinical reasoning")
elif criteria_met >= 2:
    print("\n⚠️ LLM showed partial clinical reasoning")
else:
    print("\n❌ LLM failed to demonstrate adequate clinical reasoning")

## Part 5: Prompt Engineering Experiments

How does prompt design affect LLM output quality?

In [None]:
# Compare different prompt styles
print("="*70)
print("PROMPT ENGINEERING EXPERIMENTS")
print("="*70)

base_question = "chest pain differential diagnosis"

prompt_styles = {
    'Basic': "Tell me about chest pain differential diagnosis.",
    
    'Specific': "What is the approach to chest pain differential diagnosis in the emergency department, focusing on life-threatening causes?",
    
    'Role-based': """You are an experienced emergency physician. 
A junior doctor asks you to explain your systematic approach to chest pain assessment. 
Focus on the key discriminating features that help differentiate life-threatening causes.""",
    
    'Structured': """Provide a systematic approach to chest pain differential diagnosis.

Format your response as:
1. IMMEDIATE LIFE THREATS (list with brief descriptions)
2. KEY DISCRIMINATING FEATURES (for each major diagnosis)
3. INITIAL WORKUP (ordered by priority)

Be concise and clinically focused."""
}

print("\nComparing 4 different prompt styles for the same clinical question...")

In [None]:
# Run prompt comparison (just show the structured one for brevity)
print("\n" + "="*70)
print("PROMPT STYLE: Structured")
print("="*70)

print(f"\n📝 Prompt:\n{prompt_styles['Structured']}")
print("\n" + "-"*50)

structured_response = query_llm(prompt_styles['Structured'])
print(f"\n💬 Response:\n{structured_response}")
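
To compare all four styles rather than one, the prompts can be run in a loop and summarised with a few crude metrics. The sketch below takes the query function as a parameter and uses a stand-in `fake_llm` so that it runs on its own. Both `compare_prompt_styles` and `fake_llm` are hypothetical names, and the heading count is only a rough proxy for structure.

```python
def compare_prompt_styles(prompt_styles, query_fn):
    """Run every prompt style and collect simple, comparable metrics."""
    rows = []
    for style, prompt in prompt_styles.items():
        response = query_fn(prompt)
        rows.append({
            "style": style,
            "prompt_words": len(prompt.split()),
            "response_words": len(response.split()),
            # Count ALL-CAPS lines ending in ':' as a crude proxy for structure
            "headings": sum(
                1 for line in response.splitlines()
                if line.strip().endswith(":") and line.strip().isupper()
            ),
        })
    return rows

# Stand-in for query_llm so the sketch is self-contained (hypothetical behaviour)
def fake_llm(prompt):
    if "format" in prompt.lower():
        return "IMMEDIATE LIFE THREATS:\nACS, dissection\nINITIAL WORKUP:\nECG"
    return "Chest pain has many causes."

styles = {
    "Basic": "Tell me about chest pain.",
    "Structured": "Format your response as headings for chest pain.",
}
rows = compare_prompt_styles(styles, fake_llm)
for row in rows:
    print(row)
```

In the notebook, `compare_prompt_styles(prompt_styles, query_llm)` would be the equivalent call. Note that the simulated LLM routes all four chest pain prompts to the same canned response, so real differences between styles only show up against a live model.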

In [None]:
# Prompt engineering principles
print("\n" + "="*70)
print("PROMPT ENGINEERING PRINCIPLES FOR CLINICAL USE")
print("="*70)

principles = """
1. BE SPECIFIC
   ❌ "Tell me about sepsis"
   ✅ "Explain the Sepsis-3 diagnostic criteria and qSOFA score"

2. PROVIDE CONTEXT
   ❌ "What antibiotics should I use?"
   ✅ "For community-acquired pneumonia in a 70yo with penicillin allergy, 
      CURB-65 score 2, what antibiotics per Australian guidelines?"

3. SPECIFY FORMAT
   ❌ "Explain the differential"
   ✅ "List the top 5 differentials in order of likelihood, with one key 
      discriminating feature for each"

4. SET APPROPRIATE ROLE
   ❌ Generic query
   ✅ "As a clinical decision support tool, provide evidence-based guidance..."

5. REQUEST UNCERTAINTY ACKNOWLEDGMENT
   ❌ "What is the answer?"
   ✅ "Provide your assessment and indicate areas of uncertainty or where 
      guidelines may have changed since your training"

6. ASK FOR SOURCES
   ✅ "Cite the guideline or evidence source for each recommendation"
"""

print(principles)
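
The six principles can be applied consistently by assembling prompts from named parts instead of writing each one by hand. A minimal sketch follows. The `build_clinical_prompt` function and its parameter names are assumptions made for illustration, and the wording of each part is only one reasonable choice.

```python
def build_clinical_prompt(question, context=None, output_format=None,
                          role=None, ask_uncertainty=True, ask_sources=True):
    """Assemble a prompt from the six principles; every part except the question is optional."""
    parts = []
    if role:                      # Principle 4: set an appropriate role
        parts.append(role)
    if context:                   # Principle 2: provide context
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")   # Principle 1: be specific
    if output_format:             # Principle 3: specify format
        parts.append(f"Format: {output_format}")
    if ask_uncertainty:           # Principle 5: request uncertainty acknowledgment
        parts.append("Indicate areas of uncertainty or where guidelines "
                     "may have changed since your training.")
    if ask_sources:               # Principle 6: ask for sources
        parts.append("Cite the guideline or evidence source for each recommendation.")
    return "\n\n".join(parts)

prompt = build_clinical_prompt(
    question="Which antibiotics are recommended per Australian guidelines?",
    context="Community-acquired pneumonia, 70yo, penicillin allergy, CURB-65 score 2",
    output_format="Bulleted list, first-line option first",
    role="You are acting as a clinical decision support tool.",
)
print(prompt)
```

Keeping the uncertainty and source requests on by default means they are not forgotten under time pressure. Any citations the model returns still need checking, since sources can be fabricated too.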

## Part 6: Documentation Support Evaluation

Can LLMs assist with clinical documentation tasks?

In [None]:
# Test documentation support
print("="*70)
print("DOCUMENTATION SUPPORT EVALUATION")
print("="*70)

documentation_prompt = """Generate a draft discharge summary for:

Patient: 72-year-old male
Admission: Community-acquired pneumonia (RLL)
Comorbidities: Type 2 diabetes, hypertension, CKD stage 3
Hospital course: IV antibiotics x 3 days, required O2 48hrs, improving
Discharge: Day 4, oral antibiotics to complete course

Include standard sections and note any items requiring clinical verification."""

print(f"\n📝 Documentation Request:\n{documentation_prompt}")
print("\n" + "-"*50)

documentation_response = query_llm(documentation_prompt)
print(f"\n💬 Generated Draft:\n{documentation_response}")

In [None]:
# Assess documentation quality
print("\n" + "="*70)
print("DOCUMENTATION QUALITY ASSESSMENT")
print("="*70)

doc_criteria = {
    'Standard sections present': ['diagnosis', 'hospital course', 'discharge', 'follow'],
    'Clinical accuracy': ['pneumonia', 'antibiotics', 'oxygen'],
    'Safety markers': ['verify', 'review', 'draft', 'clinical'],
    'Medication reconciliation note': ['medication', 'reconcil', 'verify']
}

response_lower = documentation_response.lower()

print(f"\n{'Criterion':<35} {'Met?':<10}")
print("-"*45)

for criterion, keywords in doc_criteria.items():
    met = any(kw in response_lower for kw in keywords)
    status = "✅ Yes" if met else "❌ No"
    print(f"{criterion:<35} {status:<10}")

print("\n⚠️ CRITICAL REMINDER:")
print("   LLM-generated documentation must ALWAYS be reviewed and verified")
print("   by the responsible clinician before use. LLMs can:")
print("   • Fabricate plausible-sounding details")
print("   • Miss important information")
print("   • Use incorrect medication doses or names")
print("   • Create documentation that looks correct but contains errors")
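
The simulated draft illustrates the first risk directly: it reports observations and results (T 38.9°C, HR 102, CRP 187, negative blood cultures) that were never supplied in the documentation request. One rough way to surface such additions is to list numeric values that appear in the draft but not in the source. The sketch below assumes a simple regular expression for numbers, and `unsupported_numbers` is a hypothetical helper name. It cannot detect fabricated text without numbers, such as the smoking cessation referral.

```python
import re

def unsupported_numbers(source_text, draft_text):
    """Return numeric values in the draft that never appear in the source."""
    pattern = r"\d+(?:\.\d+)?"
    source_numbers = set(re.findall(pattern, source_text))
    draft_numbers = re.findall(pattern, draft_text)
    # Preserve first-seen order while dropping duplicates
    seen, flagged = set(), []
    for number in draft_numbers:
        if number not in source_numbers and number not in seen:
            seen.add(number)
            flagged.append(number)
    return flagged

source = "72-year-old male, CAP (RLL), IV antibiotics x 3 days, O2 48hrs, discharge day 4"
draft = "72-year-old male. T 38.9, HR 102, CRP 187. Oxygen for 48 hours. Oral antibiotics day 3."
flagged = unsupported_numbers(source, draft)
print(flagged)
```

In the notebook, `unsupported_numbers(documentation_prompt, documentation_response)` would be the equivalent call. A flagged value is not necessarily wrong, but it did not come from the information provided and must be confirmed against the patient record.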

## Part 7: Structured Capability Assessment

Complete a structured assessment of LLM capabilities for three clinical use cases.

In [None]:
# Structured assessment template
print("="*70)
print("STRUCTURED CAPABILITY ASSESSMENT")
print("="*70)

assessment_template = """
============================================================
LLM CAPABILITY ASSESSMENT FOR CLINICAL USE
============================================================

Evaluator: [Your name]
Date: [Date]
LLM System: Clinical LLM (simulated for teaching)

------------------------------------------------------------
USE CASE 1: Clinical Knowledge Queries
------------------------------------------------------------
Description: Using LLM to answer clinical knowledge questions

CAPABILITY ASSESSMENT:
[ ] Accurate for established guidelines (Sepsis-3, etc.)
[ ] Appropriate uncertainty expression
[ ] Structured, usable responses

LIMITATIONS IDENTIFIED:
[ ] Knowledge cutoff affects recent guidelines
[ ] Australian-specific gaps (PBS, TGA, local protocols)
[ ] Risk of hallucination for unfamiliar queries

RISK LEVEL: [ ] Low [ ] Medium [ ] High

RECOMMENDATION:
[ ] Suitable for use with verification
[ ] Suitable with significant caveats
[ ] Not recommended

Required safeguards:
•
•

------------------------------------------------------------
USE CASE 2: Clinical Reasoning Support
------------------------------------------------------------
Description: Using LLM to assist with differential diagnosis

CAPABILITY ASSESSMENT:
[ ] Identifies major differentials
[ ] Recognises red flags
[ ] Appropriate clinical reasoning structure

LIMITATIONS IDENTIFIED:
[ ]
[ ]

RISK LEVEL: [ ] Low [ ] Medium [ ] High

RECOMMENDATION:
[ ] Suitable for use with verification
[ ] Suitable with significant caveats
[ ] Not recommended

Required safeguards:
•
•

------------------------------------------------------------
USE CASE 3: Documentation Support
------------------------------------------------------------
Description: Using LLM to draft clinical documentation

CAPABILITY ASSESSMENT:
[ ] Generates appropriate structure
[ ] Incorporates provided information
[ ] Flags items needing verification

LIMITATIONS IDENTIFIED:
[ ]
[ ]

RISK LEVEL: [ ] Low [ ] Medium [ ] High

RECOMMENDATION:
[ ] Suitable for use with verification
[ ] Suitable with significant caveats
[ ] Not recommended

Required safeguards:
•
•

------------------------------------------------------------
OVERALL SUMMARY
------------------------------------------------------------

Key capabilities:
1.
2.
3.

Key limitations:
1.
2.
3.

Essential safeguards for ANY clinical LLM use:
1. All outputs must be verified by qualified clinician
2. Never rely on LLM for time-critical decisions without verification
3. Be aware of knowledge cutoff and local context gaps
4. Document when AI assistance was used
5. Report any errors or concerning outputs

============================================================
"""

print(assessment_template)

## Part 8: Your Evaluation Report

In [None]:
# ===== YOUR EVALUATION REPORT =====

your_evaluation = """
============================================================
CLINICAL LLM EVALUATION REPORT
============================================================

Evaluator: [Your name]
Date: [Date]
Clinical Context: [Your specialty/setting]

------------------------------------------------------------
1. EXECUTIVE SUMMARY
------------------------------------------------------------
[2-3 sentence summary of your findings about LLM capabilities
and suitability for clinical use]



------------------------------------------------------------
2. CAPABILITY ASSESSMENT BY USE CASE
------------------------------------------------------------

Use Case 1: Clinical Knowledge Queries
  Capability Rating: [High/Medium/Low]
  Key Strength:
  Key Limitation:
  Recommendation:

Use Case 2: Clinical Reasoning Support
  Capability Rating: [High/Medium/Low]
  Key Strength:
  Key Limitation:
  Recommendation:

Use Case 3: Documentation Support
  Capability Rating: [High/Medium/Low]
  Key Strength:
  Key Limitation:
  Recommendation:

------------------------------------------------------------
3. LIMITATIONS OBSERVED
------------------------------------------------------------

Hallucination Risk:
[Describe what you observed]

Temporal Knowledge Gaps:
[Describe what you observed]

Australian Context Gaps:
[Describe what you observed]

Other Limitations:
[Any additional limitations observed]

------------------------------------------------------------
4. RECOMMENDED SAFEGUARDS
------------------------------------------------------------

Essential safeguards for clinical LLM use:
1.
2.
3.
4.
5.

------------------------------------------------------------
5. CONCLUSION
------------------------------------------------------------

[Your overall assessment of LLM readiness for clinical use]



============================================================
"""

print(your_evaluation)

## Part 9: Reflection Questions

In [None]:
# ===== YOUR REFLECTIONS =====

reflections = """
1. What surprised you most about the LLM's capabilities or limitations?
   Your answer:
   

2. If an LLM confidently provides incorrect information (hallucination),
   how might this affect clinical decision-making?
   Your answer:
   

3. What safeguards would you require before allowing LLM use in your
   clinical environment?
   Your answer:
   

4. How might LLMs change clinical practice over the next 5 years?
   What opportunities and risks do you foresee?
   Your answer:
   

5. Would you personally use an LLM to assist with clinical tasks?
   Which tasks, and with what caveats?
   Your answer:
   

"""

print(reflections)

## 📝 Deliverable

**For your portfolio:**

Complete the LLM Evaluation Report (Part 8) with:

1. Assessment of three clinical use cases
2. Documented limitations you observed (hallucination, temporal, local context)
3. Recommended safeguards for clinical use
4. Overall conclusion on clinical readiness

This assessment directly supports your Capstone project's emerging technology analysis.

Submit via LMS by the Week 10 deadline.

## 🏁 Summary

In this exercise, you learned:

✅ **LLMs have impressive but inconsistent clinical knowledge** - verification is essential

✅ **Hallucination is a real risk** - LLMs can fabricate plausible-sounding information

✅ **Temporal and local context gaps exist** - recent guidelines and Australian-specific knowledge may be limited

✅ **Prompt engineering affects output quality** - specific, structured prompts tend to yield more usable results

✅ **Documentation support is promising but requires verification** - never use LLM output without review

**Key takeaway:** LLMs are powerful tools with significant limitations. Clinical use requires robust safeguards, verification workflows, and ongoing monitoring. The clinician remains responsible for all clinical decisions.

---

**Congratulations!** You've completed the practical computing stream. These skills will serve you well as you navigate the evolving landscape of healthcare AI.