# <a id='toc1_'></a>[Diagnosis coding from clinical summary using prompt engineering](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Diagnosis coding from clinical summary using prompt engineering](#toc1_)    
- [Project Background](#toc2_)    
    - [The need for automated clinical coding](#toc2_1_1_)    
    - [Challenges for automation](#toc2_1_2_)    
    - [What is ICD-10-AM diagnostic coding, and what is ICD-10-AM?](#toc2_1_3_)    
    - [What is the principal diagnosis code (in the context of ICD-10-AM)?](#toc2_1_4_)    
    - [What is your research question for this project?](#toc2_1_5_)    
- [Dataset](#toc3_)    
    - [Dev and test set creation](#toc3_1_1_)    
- [Imports](#toc4_)    
  - [Dataset creation instruction](#toc4_1_)    
  - [Evaluation Data Frame](#toc4_2_)    
- [Question 3 – LLM prompt engineering](#toc5_)    
    - [Zero-shot prompt](#toc5_1_1_)    
    - [One-shot prompt](#toc5_1_2_)    
    - [Three-shot prompt](#toc5_1_3_)    
    - [Self-check prompt](#toc5_1_4_)    
    - [Zero-shot chain-of-thought prompt](#toc5_1_5_)    
- [Prompt evaluation and conclusion](#toc6_)    
- [Appendix A.](#toc7_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

Date: 02.12.2025

This study project is part of the HDAT9510 Health Data Analytics: Machine Learning II (T3/25) course at UNSW. I reworked it to read as a coherent story and added facts from research papers about the problem. 

**Original task**:  
You are required to explore different prompt-engineering techniques and run inference using an LLM hosted on Google Colab. Your task is to provide a hospital-course summary as model input and generate the corresponding principal diagnosis code using Hugging Face framework. A synthetic clinical dataset from the Hugging Face repository will be used for this activity.  

Authors of original project task:  
- Larry Bi: bokang.bi@unsw.edu.au   
- Oscar Perez-Concha: o.perezconcha@unsw.edu.au  

Recommended reading:  
- Chapter 1, "Understanding Large Language Models", of the textbook *Build a Large Language Model (From Scratch)* (Manning Publications, 2024) by Sebastian Raschka   
- Chapter 6, "Prompt Engineering", of the textbook *Hands-On Large Language Models* (O'Reilly Media, 2024) by Jay Alammar

Setup:  
- [Google Colab extension for VS Code](https://github.com/googlecolab/colab-vscode)  
    -  It lets you use Colab GPUs together with the local extensions in your VS Code setup.  
        - I like it because I often ran into errors when using GitHub from Colab.

Libraries:  
- Transformers  
- [Datasets](https://pypi.org/project/datasets/) 
    - [Quickstart page](https://huggingface.co/docs/datasets/quickstart)  
- Pandas  

# <a id='toc2_'></a>[Project Background](#toc0_)

Clinical coding is a non-trivial task for humans. The process of coding usually includes data abstraction or summarisation. More specifically, an expert clinical coder is expected to decipher a large number of documents about a patient’s episode of care, and to select the most accurate codes from a large classification system (or an ontology), according to the contexts in the various documents and the regularly updated coding guidelines.

### <a id='toc2_1_1_'></a>[The need for automated clinical coding](#toc0_)

There is considerable room for improvement:  
- A clinical coder in NHS Scotland usually codes about 60 cases a day (equivalent to 7–8 min for each case) and an NHS coding department of around 25–30 coders usually codes over 20,000 cases per month.  
- The average accuracy of coding in the UK was around 83%.

### <a id='toc2_1_2_'></a>[Challenges for automation](#toc0_)

- Clinical documents are variously structured, notational, lengthy, and incomplete.  
- Clinical coding systems are dynamically updated. 
    - The ICD-11 system contains around 17,000 unique codes for injuries, diseases and causes of death, underpinned by more than 120,000 codable terms and can code more than 1.6 million clinical situations using code combinations.  

Source: [Automated clinical coding: what, why, and where we are? Dong, Hang. et al. npj Digital Medicine](https://www.nature.com/articles/s41746-022-00705-7)

### <a id='toc2_1_3_'></a>[What is ICD-10-AM diagnostic coding, and what is ICD-10-AM?](#toc0_)

ICD-10-AM (International Statistical Classification of Diseases and Related Health Problems, Tenth Revision, Australian Modification) is the Australian national standard for classifying diagnoses and health conditions recorded during episodes of patient care. It is an adaptation of the World Health Organisation’s ICD-10 system, expanded and modified to meet the administrative needs of the Australian healthcare system. ICD-10-AM incorporates Australian Coding Standards, providing a framework for consistent clinical documentation, morbidity reporting, hospital funding, and health service evaluation.  

Diagnostic coding using ICD-10-AM involves translating clinicians’ notes describing medical history, symptoms, and investigations into standardised alphanumeric codes. These codes represent diseases, disorders, and injuries. Clinical coders assign the appropriate codes by reviewing the entire medical record, applying the Australian Coding Standards, and selecting the code that best reflects the patient’s diagnoses and conditions treated or investigated.  
Accurate ICD-10-AM diagnostic codes support hospital activity-based funding, health statistics for research, and ensure comparability of clinical data across time and settings. Uniform coding also helps monitor disease trends, evaluate healthcare outcomes, and guide resource allocation.  

*I used GenAI to help brainstorm ideas about the broader purposes of clinical coding beyond hospital funding and statistics*

### <a id='toc2_1_4_'></a>[What is the principal diagnosis code (in the context of ICD-10-AM)?](#toc0_)
In ICD-10-AM, the principal diagnosis is the condition that, after the whole record has been studied, is considered responsible for the patient’s admission. The condition established after study may or may not confirm the admitting diagnosis. It is chosen based on the circumstances of care, not just the first condition listed. As an example from the Australian Coding Standards illustrates, when a patient with diabetes and coronary artery disease is admitted with severe chest pain and found to have a myocardial infarction, the myocardial infarction is coded as the principal diagnosis because it led to the hospitalisation.

> EXAMPLE:  
> Diagnoses as listed on the front sheet:  
> - Diabetes mellitus  
> - Coronary artery disease  
> - Myocardial infarction  
>
> History of present illness:  
> Patient experienced severe chest pain on the morning of admission and was transported by ambulance to hospital and admitted to the coronary care unit.  
>
> In this example, the information from the clinical record indicates that myocardial infarction is the principal diagnosis.  

Source: [Australian Coding Standards 2019](https://ar-drg.laneprint.com.au/wp-content/uploads/2020/10/ACS-Sample.pdf)

### <a id='toc2_1_5_'></a>[What is your research question for this project?](#toc0_)
“How does the choice between different prompt strategies (one-shot prompting, few-shot prompting, and chain-of-thought prompts) influence the accuracy of the tested LLM (Qwen2.5-0.5B-Instruct) in predicting ICD-10-AM principal diagnosis codes from hospital-course summaries in the Asclepius Synthetic Clinical Notes dataset?”

# <a id='toc3_'></a>[Dataset](#toc0_)

The Asclepius dataset was created from **publicly available case reports** in the PMC-Patients collection (PubMed). The dataset's authors explained that one of their motivations was to avoid the restrictions that privacy risks place on real clinical notes. While datasets like MIMIC-IV exist, access to them and even to products derived from them is “**only limited to credentialed individuals, such as those who have completed CITI training**”.
To overcome these limitations, the authors turned to case reports as an open and privacy-safe alternative. However, case reports differ from hospital notes: they are written as polished academic narratives, whereas clinical notes are semi-structured, use abbreviations, and often contain non-standard language and grammatical errors. To make the data usable for training a clinical LLM, the **authors used GPT-3.5 to convert case reports into synthetic discharge summaries** that mimic the style and structure of real EHR documents. To limit model hallucination, they added safeguards to prevent the introduction of new clinical entities. Throughout the process, clinicians reviewed samples to ensure the rewritten notes remained accurate and didn’t introduce any new medical details.  

To fine-tune an LLM capable of performing various clinical NLP tasks, an **instruction–answer pair dataset** is necessary. To build it, the authors defined eight key clinical NLP tasks and created five clinician-verified example questions for each. These examples served as seeds for GPT-3.5-turbo, which was given a synthetic clinical note and asked to generate new task-specific instructions. The model was then prompted again, this time with both the instruction and the note, to produce the corresponding answer. This pipeline resulted in more than **158,000 instruction–answer pairs** grounded entirely in the synthetic notes.  

By doing so, the authors bypass regulatory constraints and make the dataset available to the broader research community.

*I used GenAI to create a part of the answer. I asked it to create a short version of the author's thought process from the paper to include in the answer. Still, I modified the generated answer to my style and liking; however, it saved me some time because it explained the whole process in a couple of sentences. I provided AI with text from section 2.1 of the paper and asked it to summarise and shorten it.*

Sources:  

- [HuggingFace dataset card](https://huggingface.co/datasets/starmpcc/Asclepius-Synthetic-Clinical-Notes)   
- [Original paper](https://arxiv.org/abs/2309.00237) Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes. Kweon, Sunjun, et al. Findings of the Association for Computational Linguistics: ACL 2024.

### <a id='toc3_1_1_'></a>[Dev and test set creation](#toc0_)

**Development set `patient_id`: 825, 1411, 4399, 4644, 5353**

**Test set ``patient_id``: 418, 608, 2678, 3824, 3972, 4046, 4175, 4679, 4758, 5545**

# <a id='toc4_'></a>[Imports](#toc0_)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
import re
from datasets import load_dataset
from pprint import pprint

# https://huggingface.co/docs/transformers/main_classes/logging#transformers.utils.logging.get_verbosity
from transformers.utils import logging
logging.set_verbosity_error()


## <a id='toc4_1_'></a>[Dataset creation instruction](#toc0_)

**Development set `patient_id`: 825, 1411, 4399, 4644, 5353**

**Test set ``patient_id``: 418, 608, 2678, 3824, 3972, 4046, 4175, 4679, 4758, 5545**

**Ground truth for the development set:**



*   Patient_id: 825. ICD-10-AM: G05
*   Patient_id: 1411. ICD-10-AM: A52
*   Patient_id: 4399. ICD-10-AM: K75
*   Patient_id: 4644. ICD-10-AM: S83
*   Patient_id: 5353. ICD-10-AM: L03

## <a id='toc4_2_'></a>[Evaluation Data Frame](#toc0_)

The code in the following cell creates the pandas DataFrame containing the ground-truth labels (ICD-10-AM principal diagnosis codes) for evaluation in Question 4.

DO NOT modify this cell; Run it to create the `Eval` DataFrame for Question 4.

In [2]:

data = [
    (418,  "C79"),
    (608,  "M85"),
    (2678, "C49"),
    (3824, "I97"),
    (3972, "A18"),
    (4046, "N49"),
    (4175, "K25"),
    (4679, "D16"),
    (4758, "J85"),
    (5545, "K42"),
]


Eval = pd.DataFrame(data, columns=[
    "patient_id",
    "ICD-10-AM principal code"
])


Eval["Model generated ICD-10-AM Code"] = ""

In [3]:

ds = load_dataset("starmpcc/Asclepius-Synthetic-Clinical-Notes")
print(ds)


DatasetDict({
    train: Dataset({
        features: ['patient_id', 'note', 'question', 'answer', 'task'],
        num_rows: 158114
    })
})


In [4]:
# demonstration of dataset content
example = ds["train"][0]
pprint(example)

{'answer': 'The healthcare team used a gradual approach to changing the '
           "patient's position to avoid worsening of the respiratory status "
           'and prevent respiratory failure.',
 'note': 'Discharge Summary:\n'
         '\n'
         'Patient: 60-year-old male with moderate ARDS from COVID-19\n'
         '\n'
         'Hospital Course:\n'
         '\n'
         'The patient was admitted to the hospital with symptoms of fever, dry '
         'cough, and dyspnea. During physical therapy on the acute ward, the '
         'patient experienced coughing attacks that induced oxygen '
         'desaturation and dyspnea with any change of position or deep '
         'breathing. To avoid rapid deterioration and respiratory failure, a '
         'step-by-step approach was used for position changes. The breathing '
         'exercises were adapted to avoid prolonged coughing and oxygen '
         'desaturation, and with close monitoring, the patient managed to '
         'perfo

In [5]:
# Define the patient IDs for development and test sets
dev_patient_ids = [825, 1411, 4399, 4644, 5353]
test_patient_ids = [418, 608, 2678, 3824, 3972, 4046, 4175, 4679, 4758, 5545]

# I used GenAI to help with this code snippet.
# Filter the dataset to create the development and test sets.
# Dataset.filter keeps the rows for which the lambda returns True:
# https://huggingface.co/docs/datasets/v1.4.0/package_reference/main_classes.html#datasets.Dataset.filter
dev_set = ds['train'].filter(lambda x: x['patient_id'] in dev_patient_ids)
test_set = ds['train'].filter(lambda x: x['patient_id'] in test_patient_ids)

print(f"Development set size: {len(dev_set)}")
print(f"Test set size: {len(test_set)}")

# Verify the patient IDs in each set
dev_ids_found = sorted(list(set(dev_set['patient_id'])))
test_ids_found = sorted(list(set(test_set['patient_id'])))

print(f"\nDevelopment set patient IDs found: {dev_ids_found}")
print(f"Expected development IDs: {sorted(dev_patient_ids)}")

print(f"\nTest set patient IDs found: {test_ids_found}")
print(f"Expected test IDs: {sorted(test_patient_ids)}")

Development set size: 5
Test set size: 10

Development set patient IDs found: [825, 1411, 4399, 4644, 5353]
Expected development IDs: [825, 1411, 4399, 4644, 5353]

Test set patient IDs found: [418, 608, 2678, 3824, 3972, 4046, 4175, 4679, 4758, 5545]
Expected test IDs: [418, 608, 2678, 3824, 3972, 4046, 4175, 4679, 4758, 5545]


# <a id='toc5_'></a>[Question 3 – LLM prompt engineering](#toc0_)

Your task is to create and test a range of prompts for running inference on the designated LLM (Qwen/Qwen2.5-0.5B-Instruct) in a Google Colab environment.

- Use the hospital course summary as the input to your prompt.
- Your goal is to generate the corresponding principal diagnosis ICD-10-AM code.
- Your prompt must produce exactly one ICD-10-AM code in the model’s response for each admission.
- You must output only the ICD-10-AM category, i.e., the first three characters of the code (one letter followed by two numbers).
- You should experiment with several prompt-engineering techniques introduced in class and explore different inference hyperparameters.
- For this question, you must use only the development set when designing and refining your prompts (similar to working with a combined training–validation set).

You will not be assessed on ICD-10-AM coding accuracy, but rather on the quality of your prompt-development process and your understanding of LLM inference.

Keep a record of all prompts you tried as evidence of your prompt-development process. Select the five that best represent your approach and rationale, and include these in Question 3 of your Jupyter Notebook. Any additional prompts should be placed in an appendix at the end of the notebook, accompanied by comments and text cells explaining them.

Run the cell below to download the `Qwen/Qwen2.5-0.5B-Instruct` model from Hugging Face and load it. With `device_map="auto"` the weights are placed on a GPU if one is available and on the CPU otherwise.

In [6]:
# Load tokenizer and model from Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    dtype="auto",
    device_map="auto"
)

Use `from_pretrained()` to load the weights and configuration file from the Hub into the model and preprocessor class.
- dtype="auto"
    - directly initializes the model weights in the data type they’re stored in, which can help avoid loading the weights twice.
    - (PyTorch loads weights in torch.float32 by default.)

- device_map="auto"   
    - automatically allocates the model weights to your fastest device first.

Source: [Transformers docs](https://huggingface.co/docs/transformers/v4.57.3/en/quicktour#pretrained-models)
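
To get a feel for why the storage data type matters, here is a back-of-the-envelope sketch. It starts from the size of the `model.safetensors` checkpoint for this model (about 988 MB) and assumes, without checking the config, that the weights are stored in a 2-byte format such as bfloat16. The helper name `weight_memory_gb` is made up for this illustration.

```python
# Rough estimate of weight memory at different data types.
CHECKPOINT_BYTES = 988e6  # approximate size of model.safetensors for Qwen2.5-0.5B-Instruct

BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}

# Assumption: the checkpoint stores each weight in 2 bytes (bfloat16).
n_params = CHECKPOINT_BYTES / BYTES_PER_PARAM["bfloat16"]


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed for the weights alone (no activations, no KV cache)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


print(f"Estimated parameters: {n_params / 1e6:.0f}M")
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(n_params, dtype):.2f} GB")
```

Under that assumption the estimate comes out at roughly 494M parameters, which fits the "0.5B" in the model name, and loading in float32 would roughly double the memory needed for the weights.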

I put the full prompt-discovery process in the Appendix.

In [7]:
def run_prompt(prompt):
    # Wrap prompt as a chat message
    messages = [{"role": "user", "content": prompt}]

    # Apply the model/tokenizer chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )

    # Tokenize
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Generate
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=8,  # enough for something like "I21"
        do_sample=False,   # greedy decoding: pick the highest-probability token at each step
        # temperature is not passed: it has no effect when sampling is disabled
    )

    # Extract only generated tokens
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

    # Decode raw output
    content = tokenizer.decode(output_ids, skip_special_tokens=True).strip()

    # Extract ICD-10-AM category (Letter + 2 digits)
    match = re.search(r"[A-Z][0-9]{2}", content.upper())
    if match:
        icd10_cat = match.group(0)
    else:
        icd10_cat = "UNK"

    return content, icd10_cat
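
The extraction step at the end of `run_prompt` can be tried in isolation. The sketch below copies the same regular expression into a standalone function and runs it on a few raw strings of the kind the model returns; the last sample is made up to show a false positive. `extract_category_strict` is a hypothetical variant (not used anywhere in this notebook) that only accepts a match at the start of a word.

```python
import re


def extract_category(text: str) -> str:
    """Same rule as in run_prompt: first 'letter + two digits' anywhere in the text."""
    match = re.search(r"[A-Z][0-9]{2}", text.upper())
    return match.group(0) if match else "UNK"


def extract_category_strict(text: str) -> str:
    """Hypothetical stricter variant: the code must start at a word boundary."""
    match = re.search(r"\b[A-Z][0-9]{2}", text.upper())
    return match.group(0) if match else "UNK"


samples = ["M80.1", "C628", "I210800", "MENINGITIS", "689", "A", "covid19 pneumonia"]
for s in samples:
    print(f"{s!r:>22} -> {extract_category(s):<4} (strict: {extract_category_strict(s)})")
```

The permissive pattern turns over-long outputs such as `I210800` into a valid-looking category, but it also pulls `D19` out of `covid19`, so a parsed code only shows that the output contained the right shape, not that the model intended a code.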

### <a id='toc5_1_1_'></a>[Zero-shot prompt](#toc0_)

In [8]:
zero_shot_prompt = """
    You are an expert Australian clinical coder.

    From the hospital course summary below, identify the PRINCIPAL DIAGNOSIS and output its ICD-10-AM CATEGORY code.

    Important:
    - Output EXACTLY ONE code.
    - Output ONLY the ICD-10-AM CATEGORY (first three characters: one letter followed by two digits).
    - Do NOT output any extra words, punctuation, or explanation.

    Hospital course summary:
    {note_text}

    Now output ONLY the ICD-10-AM CATEGORY code:
    """

In [9]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = zero_shot_prompt.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: M80.1
ICD-10-AM category: M80
Entry 1:
raw model output: C628
ICD-10-AM category: C62
Entry 2:
raw model output: C045.1
ICD-10-AM category: C04
Entry 3:
raw model output: A85.4
ICD-10-AM category: A85
Entry 4:
raw model output: C078
ICD-10-AM category: C07


### <a id='toc5_1_2_'></a>[One-shot prompt](#toc0_)

In [10]:
one_shot_prompt = """
    You are an expert Australian clinical coder.

    Example:
    Hospital course summary:
    "Patient admitted with chest pain; after study diagnosed with acute myocardial infarction."
    ICD-10-AM code:
    I21

    Now code this case:
    "{note_text}"

    Rules:
    - Output EXACTLY ONE ICD-10-AM CATEGORY (1 letter + 2 digits).
    - No words, no punctuation, no explanations.
    - If unclear, output: I don't know

    Code:

"""

In [11]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = one_shot_prompt.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: I210800
ICD-10-AM category: I21
Entry 1:
raw model output: I2108
ICD-10-AM category: I21
Entry 2:
raw model output: I2100
ICD-10-AM category: I21
Entry 3:
raw model output: I210500
ICD-10-AM category: I21
Entry 4:
raw model output: I2100
ICD-10-AM category: I21


### <a id='toc5_1_3_'></a>[Three-shot prompt](#toc0_)

In [12]:
three_shot_prompt = """
    You are an expert Australian clinical coder.

Examples:

1)
Summary: "Chest pain; after study = acute myocardial infarction."
Code: I21

2)
Summary: "Fever, cough, hypoxia; CXR = pneumonia."
Code: J18

3)
Summary: "Polyuria, polydipsia, high glucose; dx = type 2 diabetes."
Code: E11

Now code this case:
{note_text}

Rules:
- Output ONE ICD-10-AM CATEGORY (1 letter + 2 digits)
- No words, no punctuation, no explanations
- If unclear, output: I don't know

Code:

"""

In [13]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = three_shot_prompt.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: I21
ICD-10-AM category: I21
Entry 1:
raw model output: I21
ICD-10-AM category: I21
Entry 2:
raw model output: I21
ICD-10-AM category: I21
Entry 3:
raw model output: I21
ICD-10-AM category: I21
Entry 4:
raw model output: I21
ICD-10-AM category: I21


### <a id='toc5_1_4_'></a>[Self-check prompt](#toc0_)

In [14]:
self_check_prompt = """
    You are an expert Australian clinical coder.

    Hospital course summary:
    {note_text}

    First, silently check:
    - What condition was chiefly responsible for the admission?
    - Is it explicitly documented?
    - Does it correspond to exactly one ICD-10-AM CATEGORY code?

    Now output ONLY the 3-character ICD-10-AM code (1 letter + 2 digits), with no explanation:
    """

In [15]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = self_check_prompt.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: MENINGITIS
ICD-10-AM category: UNK
Entry 1:
raw model output: 689
ICD-10-AM category: UNK
Entry 2:
raw model output: 111111
ICD-10-AM category: UNK
Entry 3:
raw model output: A84.50
ICD-10-AM category: A84
Entry 4:
raw model output: C078
ICD-10-AM category: C07


### <a id='toc5_1_5_'></a>[Zero-shot chain-of-thought prompt](#toc0_)

In [16]:
cot_hidden_prompt = """
    You are an expert Australian clinical coder.

    Hospital course summary:
    {note_text}

    Let’s think step by step about which condition is the PRINCIPAL DIAGNOSIS and what its ICD-10-AM CATEGORY code is.

    After thinking, output ONLY the 3-character ICD-10-AM CATEGORY code (1 letter + 2 digits), with no words, no punctuation, no explanation.
"""

In [17]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = cot_hidden_prompt.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: MENINGITIS
ICD-10-AM category: UNK
Entry 1:
raw model output: SPHONYRIC ARthritis
ICD-10-AM category: UNK
Entry 2:
raw model output: CPR
ICD-10-AM category: UNK
Entry 3:
raw model output: A
ICD-10-AM category: UNK
Entry 4:
raw model output: C078
ICD-10-AM category: C07
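
To compare the five strategies side by side, the extracted codes printed above can be tabulated with a few lines of plain Python. This is only a sketch: the lists are typed in by hand from the outputs in this notebook, and because I have not verified that `dev_set[i]` follows the order of the listed patient IDs, the check against the ground truth is order-independent (it counts predictions that appear anywhere in the set of true development codes, which is an upper bound on correct answers).

```python
# Ground-truth categories for the development set (order not assumed).
dev_truth = {"G05", "A52", "K75", "S83", "L03"}

# Extracted categories per prompt, copied from the cell outputs above.
extracted = {
    "zero_shot":     ["M80", "C62", "C04", "A85", "C07"],
    "one_shot":      ["I21", "I21", "I21", "I21", "I21"],
    "three_shot":    ["I21", "I21", "I21", "I21", "I21"],
    "self_check":    ["UNK", "UNK", "UNK", "A84", "C07"],
    "zero_shot_cot": ["UNK", "UNK", "UNK", "UNK", "C07"],
}


def summarise(codes, truth):
    """Count parseable outputs, distinct codes, and an upper bound on correct ones."""
    parsed = [c for c in codes if c != "UNK"]
    return {
        "parsed": len(parsed),
        "distinct": len(set(parsed)),
        "max_hits": sum(c in truth for c in parsed),
    }


summary = {name: summarise(codes, dev_truth) for name, codes in extracted.items()}
for name, stats in summary.items():
    print(f"{name:<14} parsed={stats['parsed']}/5  distinct={stats['distinct']}  max_hits={stats['max_hits']}")
```

A low `distinct` count with a full `parsed` count is the signature of the model copying the example code instead of reading the note.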


# <a id='toc6_'></a>[Prompt evaluation and conclusion](#toc0_)

For the evaluation, we were provided with ten data samples from the test set. The original task was to compare coding accuracy against the ground-truth labels. The code in the following cell generates a pandas DataFrame containing the ground-truth labels (ICD-10-AM principal diagnosis codes) for evaluation. 


In [18]:
# Evaluation dataset creation

data = [
    (418,  "C79"),
    (608,  "M85"),
    (2678, "C49"),
    (3824, "I97"),
    (3972, "A18"),
    (4046, "N49"),
    (4175, "K25"),
    (4679, "D16"),
    (4758, "J85"),
    (5545, "K42"),
]


Eval = pd.DataFrame(data, columns=[
    "patient_id",
    "ICD-10-AM principal code"
])


Eval["Model generated ICD-10-AM Code"] = ""

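On the development set, the concise zero-shot prompt was the only strategy that followed the output format without collapsing: it returned a parseable and different code for each of the five notes. The one-shot and three-shot prompts copied `I21` from the example for every note, and the self-check and chain-of-thought prompts mostly returned free text or digits that could not be parsed (3 and 4 of 5 outputs were `UNK`). None of the extracted codes matched a development-set ground-truth code, so I picked the zero-shot prompt for its instruction-following behaviour rather than its accuracy.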

In [19]:
# the zero-shot prompt from above, reused unchanged
best_prompt = """
    You are an expert Australian clinical coder.

    From the hospital course summary below, identify the PRINCIPAL DIAGNOSIS and output its ICD-10-AM CATEGORY code.

    Important:
    - Output EXACTLY ONE code.
    - Output ONLY the ICD-10-AM CATEGORY (first three characters: one letter followed by two digits).
    - Do NOT output any extra words, punctuation, or explanation.

    Hospital course summary:
    {note_text}

    Now output ONLY the ICD-10-AM CATEGORY code:
    """

Here I run the best prompt on the test set. I also added the ICD-10-AM code descriptions to the Eval DataFrame to make the results easier to follow.

In [20]:
# Run the best prompt on the test set and collect predictions

predictions = {}  # patient_id and predicted ICD-10-AM category
raw_outputs = {}  # raw model outputs

for i in range(len(test_set)):
    patient_id = test_set[i]["patient_id"]
    note_text = test_set[i]["note"]

    # Insert note into the best-performing prompt template
    prompt = best_prompt.format(note_text=note_text)

    # Run the model and extract predicted ICD-10-AM 3-character category
    raw_output, predicted_code = run_prompt(prompt)

    # Store prediction
    predictions[patient_id] = predicted_code
    raw_outputs[patient_id] = raw_output

    # Print progress
    print(f"Patient {patient_id}: model output = {raw_output}, extracted code = {predicted_code}")


Patient 418: model output = C030, extracted code = C03
Patient 608: model output = C25.8, extracted code = C25
Patient 2678: model output = C57, extracted code = C57
Patient 3824: model output = C024, extracted code = C02
Patient 3972: model output = C045.3, extracted code = C04
Patient 4046: model output = B75.2, extracted code = B75
Patient 4175: model output = C82.64, extracted code = C82
Patient 4679: model output = C049.1, extracted code = C04
Patient 4758: model output = C62.5, extracted code = C62
Patient 5545: model output = U060.45, extracted code = U06


In [21]:
# Dictionary of ICD-10-AM code → description
ground_truth_icd10_meanings = {
    "C79": "Secondary malignant neoplasm (metastatic cancer)",
    "M85": "Other disorders of bone density and structure",
    "C49": "Malignant neoplasm of other connective and soft tissue",
    "I97": "Postprocedural disorders of circulatory system, not elsewhere classified",
    "A18": "Tuberculosis of other organs",
    "N49": "Inflammatory disorders of male genital organs",
    "K25": "Gastric ulcer",
    "D16": "Benign neoplasm of bone and articular cartilage",
    "J85": "Abscess of lung and mediastinum",
    "K42": "Umbilical hernia"
}

# Add a new column with the ground-truth code descriptions
Eval["True Diagnosis"] = Eval["ICD-10-AM principal code"].map(ground_truth_icd10_meanings)


# Insert predictions into the Eval DataFrame

Eval["Model generated ICD-10-AM Code"] = Eval["patient_id"].map(predictions)

model_extracted_code_meanings = {
    "C03": "Malignant neoplasm of gum",
    "C25": "Malignant neoplasm of pancreas",
    "C57": "Malignant neoplasm of other and unspecified female genital organs",
    "C02": "Malignant neoplasm of other and unspecified parts of tongue",
    "C04": "Malignant neoplasm of floor of mouth",
    "B75": "Trichinellosis",
    "C82": "Follicular lymphoma",
    "C62": "Malignant neoplasm of testis",
    "U06": "Emergency use codes (provisional assignment; not a specific clinical condition)"
}

# Add a new column with the code descriptions for generated codes
Eval["Predicted Diagnosis"] = Eval["Model generated ICD-10-AM Code"].map(model_extracted_code_meanings)

display(Eval)

| | patient_id | ICD-10-AM principal code | Model generated ICD-10-AM Code | True Diagnosis | Predicted Diagnosis |
|---|---|---|---|---|---|
| 0 | 418 | C79 | C03 | Secondary malignant neoplasm (metastatic cancer) | Malignant neoplasm of gum |
| 1 | 608 | M85 | C25 | Other disorders of bone density and structure | Malignant neoplasm of pancreas |
| 2 | 2678 | C49 | C57 | Malignant neoplasm of other connective and sof... | Malignant neoplasm of other and unspecified fe... |
| 3 | 3824 | I97 | C02 | Postprocedural disorders of circulatory system... | Malignant neoplasm of other and unspecified pa... |
| 4 | 3972 | A18 | C04 | Tuberculosis of other organs | Malignant neoplasm of floor of mouth |
| 5 | 4046 | N49 | B75 | Inflammatory disorders of male genital organs | Trichinellosis |
| 6 | 4175 | K25 | C82 | Gastric ulcer | Follicular lymphoma |
| 7 | 4679 | D16 | C04 | Benign neoplasm of bone and articular cartilage | Malignant neoplasm of floor of mouth |
| 8 | 4758 | J85 | C62 | Abscess of lung and mediastinum | Malignant neoplasm of testis |
| 9 | 5545 | K42 | U06 | Umbilical hernia | Emergency use codes (provisional assignment; n... |
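
Before going through the cases one by one, the strict score can be computed directly. The sketch below uses plain dictionaries typed in from the results above (instead of the `Eval` DataFrame) so that it runs on its own. The "same leading letter" count is my own crude proxy for "roughly the right area" and not part of the original task; ICD-10 chapters do not map one-to-one onto letters (neoplasms, for example, span C00–D48).

```python
# Ground truth and extracted predictions for the ten test admissions.
truth = {
    418: "C79", 608: "M85", 2678: "C49", 3824: "I97", 3972: "A18",
    4046: "N49", 4175: "K25", 4679: "D16", 4758: "J85", 5545: "K42",
}
predicted = {
    418: "C03", 608: "C25", 2678: "C57", 3824: "C02", 3972: "C04",
    4046: "B75", 4175: "C82", 4679: "C04", 4758: "C62", 5545: "U06",
}

# Exact match on the three-character category.
exact = sum(predicted[pid] == truth[pid] for pid in truth)

# Crude proxy: prediction and ground truth start with the same letter.
same_letter = sum(predicted[pid][0] == truth[pid][0] for pid in truth)

print(f"Exact category match: {exact}/{len(truth)}")
print(f"Same leading letter:  {same_letter}/{len(truth)}")
```

On these results the exact-match count is 0 out of 10, and two predictions (patients 418 and 2678) share the leading letter with the ground truth.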


I used GenAI to convert the full case descriptions into short vignettes, which are much easier to follow.

Case #1:  

Predicted: C03 Malignant neoplasm of gum  
Ground truth: C79 Secondary malignant neoplasm (metastatic cancer)  

The patient is a 64-year-old man with metastatic lung adenocarcinoma to the pituitary gland, presenting with fatigue, nausea, scalp tenderness, and xeroderma. MRI confirmed a pituitary mass, and he underwent transsphenoidal resection followed by whole-brain radiation, with ongoing endocrine dysfunction requiring chronic steroid replacement.  

0 points.

Case #2:  
Predicted: C25 Malignant neoplasm of pancreas    
Ground Truth: M85 Other disorders of bone density and structure  
A 9-year-old girl presented with a severe right femoral deformity caused by recurrent pathological fractures secondary to a bone tumor. She underwent resection of the affected bone segment, deformity correction, and gradual limb lengthening using an Ilizarov fixator, with successful healing and restoration of function. Three years post-surgery, her alignment remains stable with no activity limitations.  
0 points.   

Case #3:  
Predicted code: C57 Malignant neoplasm of other and unspecified female genital organs  
Ground Truth: C49 Malignant neoplasm of other connective and soft tissue  
A 61-year-old woman was admitted with chest pain and fatigue, leading to the discovery of a left atrial mass initially presumed to be a myxoma and resected with mitral valve replacement. Postoperatively, she developed acute heart failure due to paravalvular leak, and pathology revealed the mass was actually a high-grade dedifferentiated liposarcoma. With rapid clinical deterioration and suspected metastasis, she ultimately chose hospice care.  

With a very generous simplification, this case can be counted as partially correct: the model picked a code for a malignant tumor in a female patient (ignoring the genital site, which is wrong). Let's give it 0.5 points.

Case #4:  
Predicted code: C02 Malignant neoplasm of other and unspecified parts of tongue  
Ground Truth: I97 Postprocedural disorders of circulatory system, not elsewhere classified  
A 67-year-old man developed worsening dyspnea and cough after pacemaker implantation, ultimately found to have a large right-sided exudative pleural effusion and a small pericardial effusion. He was treated with pleural drainage, antibiotics, and correction of coagulopathy, with subsequent clinical and radiological improvement. The presentation was determined to be an atypical form of post-cardiac injury syndrome (PCIS) with predominantly pulmonary symptoms.  
0 points.  

Case #5:  
Predicted code: C04 Malignant neoplasm of floor of mouth  
Ground truth: A18 Tuberculosis of other organs  
A 68-year-old man with extensive comorbidities presented with dyspnea and weakness, and was found to have a left-sided exudative pleural effusion and ascites. Imaging and biopsy revealed peritoneal carcinomatosis-like lesions that were ultimately diagnosed as necrotizing granulomatous inflammation due to Mycobacterium tuberculosis. He improved rapidly after starting RIPE therapy and was discharged to continue tuberculosis treatment with close outpatient follow-up.  
0 points.  

Case #6:  
Predicted code: B75 Trichinellosis   
Ground Truth: N49 Inflammatory disorders of male genital organs  
A patient with significant comorbidities presented with penile swelling and infection following an unintentional bite injury 🙄. Imaging revealed subcutaneous emphysema concerning for necrotizing soft tissue infection, requiring multiple operative debridements, glansectomy, and ultimately split-thickness skin graft reconstruction. With targeted antibiotics for polymicrobial infection, the patient recovered well and showed improvement at follow-up.  

The model was most likely triggered by the phrase “necrotizing soft tissue infection,” which it may loosely associate with parasitic infections such as trichinellosis. It is still a miss, but we can make an educated guess about why it happened.  

[Wiki](https://en.wikipedia.org/wiki/Trichinosis)  
0 points.  

Case #7:  
Predicted code: C82 Follicular lymphoma   
Ground Truth: K25 Gastric ulcer    

A 75-year-old woman presented with anemia and was found to have a bleeding gastric ulcer, which was successfully treated with endoscopic clipping and transfusions. Shortly afterward, she developed posterior reversible encephalopathy syndrome (PRES), likely triggered by blood pressure fluctuations, and her mental status gradually improved with conservative management. At discharge, her overall responsiveness had recovered, though her visual deficits persisted, and she was advised to monitor her blood pressure and follow up with her primary care physician.  

0 points. It seems that the model tends to classify every severe case as an oncology presentation.  

Case #8:  
Predicted code: C04 Malignant neoplasm of floor of mouth  
Ground Truth: D16 Benign neoplasm of bone and articular cartilage    
A 22-year-old man underwent complete excision of a calcified mass in the left maxillary sinus associated with an impacted third molar, accessed intraorally via a Caldwell-Luc approach. Histopathology confirmed a benign cementoblastoma, and postoperative recovery was uncomplicated. Follow-up imaging showed successful removal with no recurrence after one year.  

0 points. For some reason, the model is very fond of the C04 code.

Case #9:  
Predicted: C62 Malignant neoplasm of testis  
Ground Truth: J85 Abscess of lung and mediastinum  
A 3-year-old boy with persistent hemoptysis, cough, and chest pain was found to have significant right lung consolidation, pleural thickening, and a mass with severe pleural fibrosis despite prior treatment for suspected tuberculosis. Thoracoscopic surgery successfully removed the mass and necrotic tissue while repairing air leaks. His postoperative course was uncomplicated, and he was discharged with routine follow-up arrangements.

The model seems to have used the pattern of a 3-year-old boy with a severe presentation to classify the case as oncology-related. It then probably associated a young male patient with testicular malignancy and output C62 instead of recognising the actual presentation. 0 points.

Case #10:  
Predicted: U06 Emergency use codes (provisional assignment; not a specific clinical condition)  
Ground Truth: K42 Umbilical hernia  
A 35-year-old woman at 39 weeks' pregnancy was admitted with a strangulated umbilical hernia and underlying pre-eclampsia. After multidisciplinary planning, she underwent an emergency caesarean section combined with hernia repair and excision of incidentally found sub-serosal uterine leiomyomas. Both mother and baby recovered well, and her postoperative course up to six months was uneventful.

Even without a parsing error, the model completely hallucinated the code.

In [22]:
# Automatically compute accuracy on the test set
# Compare ground truth vs model predictions
correct = (Eval["ICD-10-AM principal code"] == Eval["Model generated ICD-10-AM Code"]).sum()
total = len(Eval)

accuracy = correct / total

print(f"Correct predictions: {correct}/{total}")
print(f"Accuracy: {accuracy:.2f}")


Correct predictions: 0/10
Accuracy: 0.00



Overall, the model demonstrated poor ICD-10-AM coding performance, achieving 0/10 exact matches on the test set. On manual review, one prediction showed partial alignment: in Case #3 the model correctly recognised a malignancy but selected the wrong anatomical site. This is still only a very superficial level of semantic correctness, but I decided to grant the model 0.5 points, considering its parameter size and the complexity of the task.  
Manual assessment: 0.5/10.  
The model tended to rely on simple pattern-matching rather than genuinely understanding the clinical stories. It frequently jumped to cancer-related codes whenever a case seemed severe or complex, and it was easily thrown off by isolated phrases like “necrotizing infection.” In several cases, it also appeared to guess from broad associations (age or gender) rather than from the anatomy or diagnosis actually described. Taken together, the model’s behaviour looks more like a reaction to familiar keywords than an interpretation of the clinical picture.
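
The manual half-point can be complemented with a reproducible, more lenient count. The sketch below is not part of the original notebook: it re-enters the ten ground-truth/predicted pairs from the case review above and counts exact matches, matches on the first letter only (a rough, assumed proxy for "same broad group", since ICD-10 chapters do not map one-to-one onto letters), and predictions that start with `C` (the malignant neoplasm range). The names `pairs`, `exact`, `same_letter` and `c_predictions` are illustrative.

```python
# (ground truth, predicted) category for the ten test cases reviewed above.
pairs = [
    ("C79", "C03"), ("M85", "C25"), ("C49", "C57"), ("I97", "C02"),
    ("A18", "C04"), ("N49", "B75"), ("K25", "C82"), ("D16", "C04"),
    ("J85", "C62"), ("K42", "U06"),
]

# Strict metric: the full three-character category must agree.
exact = sum(truth == pred for truth, pred in pairs)

# Lenient metric (assumption): only the leading letter must agree.
same_letter = sum(truth[0] == pred[0] for truth, pred in pairs)

# How often the model answered with a code from the C range at all.
c_predictions = sum(pred.startswith("C") for _, pred in pairs)

print(f"Exact matches: {exact}/{len(pairs)}")
print(f"Same first letter: {same_letter}/{len(pairs)}")
print(f"Predictions starting with C: {c_predictions}/{len(pairs)}")
```

On these pairs the counts are 0 exact matches, 2 same-letter matches (Cases #1 and #3) and 8 of 10 predictions in the C range, which is consistent with the oncology bias described above. Note that the letter-level metric also credits Case #1, which the manual review scored as 0.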


# <a id='toc7_'></a>[Appendix A.](#toc0_)

This is the full version of my prompt-engineering thought process.

The zero-shot prompt was designed to elicit precise ICD-10-AM category-level coding from a language model without providing any prior examples. The prompt establishes a specific expert persona (“an expert Australian clinical coder”) to orient the model toward the expected domain reasoning. The task is to identify the principal diagnosis based solely on the provided hospital course summary.
To minimise variability in output formatting, the prompt includes strict formatting constraints. The model is instructed to output exactly one ICD-10-AM code, explicitly defined as a one-letter code followed by two digits. The prompt prohibits any explanatory text, punctuation, additional diagnoses, or alternative formats, reducing the likelihood of hallucinated or extraneous content. The final instruction appears at the end of the prompt to exploit the recency effect.


In [45]:
zero_shot_prompt = """
    You are an expert Australian clinical coder.

    From the hospital course summary below, identify the PRINCIPAL DIAGNOSIS and output its ICD-10-AM CATEGORY code.

    Important:
    - Output EXACTLY ONE code.
    - Output ONLY the ICD-10-AM CATEGORY (first three characters: one letter followed by two digits).
    - Do NOT output any extra words, punctuation, or explanation.

    Hospital course summary:
    \"\"\"{note_text}\"\"\"

    Now output ONLY the ICD-10-AM CATEGORY code:
    """

In [46]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = zero_shot_prompt.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    pprint(note_text)
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)


Entry 0:
('Hospital Course:\n'
 'The patient, a 15-month-old girl, was admitted to Peking Union Medical '
 'College Hospital (PUMCH) with high fever, irritability, and refusal to walk. '
 'She was initially seen at a local clinic for fever and constipation but was '
 "treated with ibuprofen. Later, she presented to Haikou People's Hospital "
 'with persistent high fever and a lumbar puncture was performed revealing an '
 'opening pressure of 140 mm H2O and clear CSF with 120 × 106/L white blood '
 'cells. The patient was treated for viral meningitis with an antiviral for 2 '
 'weeks but her fever persisted and she refused to walk. \n'
 '\n'
 'Clinical Findings:\n'
 'On physical examination, the patient had a weight of 11.5 kg and a '
 'temperature of 40°C. Rashes, lymphadenectasis, and joint redness were not '
 'observed. Skin sensation could not be evaluated because the patient '
 "responded to any skin contact with exaggeration and crying. The patient's "
 'muscle strength and tone

Case #1:  
Clinical summary (generated by ChatGPT 5.1):  
A 15-month-old girl presented with persistent high fever, irritability, and refusal to walk. CSF and blood tests showed marked eosinophilia, and the child improved rapidly with antiparasitic and steroid treatment. This clinical picture is consistent with eosinophilic meningitis.  

Correct ICD-10-AM category:  
G05 - Encephalitis, myelitis and encephalomyelitis in diseases classified elsewhere.  

Model output:  
The model produced “B85”, corresponding to pediculosis and phthiriasis, which is an incorrect classification.  

Case #2:  
Clinical summary:  
A 71-year-old man presented with progressive right hip pain, shortening of the limb, and inability to walk. Imaging and synovial biopsy revealed characteristic vascular and inflammatory changes consistent with syphilitic arthritis, and symptoms resolved after surgical repair and antibiotic therapy.  

Correct ICD-10-AM category:  
A52 - Late syphilis.  

Model output:  
B86 - Scabies (Incorrect).


Case #3:   
Clinical summary:  
A 41-year-old man presented with fever, anorexia, nausea, and abnormal liver function tests. Imaging revealed portal vein thrombosis and inflammatory changes, and blood cultures grew Streptococcus anginosus. Further investigations identified a cholecysto-colonic fistula. He was diagnosed with thrombophlebitis of the portal vein with associated hepatobiliary inflammation, treated with antibiotics, anticoagulation, and later cholecystectomy with partial colectomy.  

Correct ICD-10-AM category:  
K75 - Other inflammatory liver diseases.  

Model output:  
The model generated “C04.1”, reduced to C04, corresponding to a malignant neoplasm of the floor of the mouth, which is clinically implausible for this case.  

Case #4:  
Clinical summary:  
A 23-year-old man presented with knee pain and limited motion after a football injury. Imaging confirmed a lateral patellar dislocation, and unsuccessful closed reduction required arthroscopic relocation. A medial patellar retinaculum lesion was noted. The patient recovered fully after splinting, bracing, and physiotherapy.  

Correct ICD-10-AM category:  
S83 - Dislocation and sprain of joints and ligaments of knee.  

Model output:  
The model generated “A87.5”, reduced to A87, which corresponds to viral meningitis.  

Case #5:  
Clinical summary:
A 56-year-old man presented with acute right lower extremity pain, redness, swelling, and fever. He was diagnosed with cellulitis of the right leg and required intravenous antibiotics, bedside and operative debridement, negative-pressure wound therapy, and later skin grafting. The wound progressively healed, and the patient was discharged with follow-up arranged.

Correct ICD-10-AM category:  
L03 - Cellulitis.  

Model output:  
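The model produced “C078.1”, reduced to C07, which corresponds to malignant neoplasm of the parotid gland. (Read as C78, the raw output would instead point to secondary malignant neoplasm of respiratory and digestive organs.) Either way, the code is incorrect for cellulitis.  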
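
The "reduced to" step in these reviews refers to trimming the raw model output to a three-character category. The notebook's own parsing happens inside `run_prompt`, which is not shown in this appendix, so the function below is only a guess at equivalent logic: `extract_category` and its `UNK` fallback are assumed names chosen to mirror the printed results.

```python
import re

# One uppercase letter followed by two digits, anchored at the start of the output.
CATEGORY_RE = re.compile(r"[A-Z][0-9]{2}")


def extract_category(raw: str) -> str:
    """Return the leading 3-character category, or 'UNK' if the output has none."""
    match = CATEGORY_RE.match(raw.strip())
    return match.group(0) if match else "UNK"


for raw in ["C04.1", "A87.5", "C078.1", "Lateral Patellar Dislocation", "I don't know"]:
    print(f"{raw!r} -> {extract_category(raw)}")
```

Because the pattern only looks at the first three characters, a reply such as `C078.1` is truncated to `C07` rather than rejected, so a malformed code can still turn into a valid-looking but unrelated category. A stricter variant could require the whole reply to match and return `UNK` otherwise.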



The second prompt provides more explicit decision guidance by clarifying how to choose the principal diagnosis when multiple conditions are present. It also instructs the model to select the most defensible diagnosis when documentation is ambiguous. Overall, it includes more detailed reasoning constraints while maintaining the same strict output format.


In [47]:
zero_shot_prompt_v2 = """
    You are an expert Australian clinical coder.

    Your task:
    Determine the PRINCIPAL DIAGNOSIS from the hospital course summary, using ICD-10-AM coding rules (ACS 0001).
    Select the single condition that, after study, is chiefly responsible for the admission.

    Output rules:
    - Output EXACTLY ONE code.
    - Output ONLY the ICD-10-AM CATEGORY (first 3 characters: one letter + two digits).
    - No explanations, no text, no punctuation, no justification.
    - If multiple diagnoses are present, choose the principal diagnosis per ACS 0001.
    - If the note is unclear, select the *most defensible* principal diagnosis; NEVER output more than one code.

    Hospital course summary:
    {note_text}

    Output ONLY the ICD-10-AM CATEGORY code:

    """

In [48]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = zero_shot_prompt_v2.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: I0001
ICD-10-AM category: I00
Entry 1:
raw model output: **ICD-10-AM
ICD-10-AM category: UNK
Entry 2:
raw model output: **ICD-10-AM
ICD-10-AM category: UNK
Entry 3:
raw model output: Lateral Patellar Dislocation
ICD-10-AM category: UNK
Entry 4:
raw model output: ICD-10-AM CATEGORY
ICD-10-AM category: UNK


Case #1: I00, Rheumatic fever without mention of heart involvement (incorrect).  
Cases #2, #3 and #5: no code; the model echoed fragments of the prompt.  
Case #4: the model named the diagnosis (lateral patellar dislocation) but gave no code.  

Let's use an LLM to create a stricter version of the previous prompt:

In [49]:
zero_shot_prompt_v3 = """
    You are an expert Australian clinical coder with deep knowledge of ICD-10-AM.

    Hospital course summary:
    {note_text}

    Your task:
    Identify the PRINCIPAL DIAGNOSIS according to ICD-10-AM rules.

    Specific instructions:
    - Output EXACTLY ONE ICD-10-AM CATEGORY code (3 characters: one letter + two digits).
    - Output ONLY the code - no words, no punctuation, no explanation.
    - If the correct code cannot be confidently determined from the documentation, output: I don't know
    - Never guess or invent clinical information not stated in the summary.

    Now output ONLY the ICD-10-AM CATEGORY code (or “I don't know”):

    """

In [50]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = zero_shot_prompt_v3.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: Meningococcal Meningitis
ICD-10-AM category: UNK
Entry 1:
raw model output: I don't know
ICD-10-AM category: UNK
Entry 2:
raw model output: I don't know
ICD-10-AM category: UNK
Entry 3:
raw model output: Lateral Patellar Dislocation
ICD-10-AM category: UNK
Entry 4:
raw model output: C078
ICD-10-AM category: C07


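The model now returns a parsable code for only one entry. It still named a diagnosis in plain text for two entries (a form of meningitis for entry 0 and lateral patellar dislocation for entry 3), even though it did not turn them into codes.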

Add more instructions:

In [51]:
zero_shot_prompt_v4 = """
    You are an expert Australian clinical coder who strictly applies ICD-10-AM and ACS 0001 standards.
    Your reasoning must rely ONLY on the documented clinical facts in the hospital course summary.

    Hospital course summary:
    {note_text}

    Your task:
    Identify the PRINCIPAL DIAGNOSIS - the condition established after study to be chiefly responsible for the admission (ACS 0001).

    Strict output rules:
    - Output EXACTLY ONE ICD-10-AM CATEGORY code (format: one letter + two digits, e.g., I21).
    - Output MUST match the regex: ^[A-Z][0-9]{{2}}$
    - Do NOT output subcategories, decimals, diagnosis names, or multiple codes.
    - Do NOT provide explanations, clarifications, or commentary.
    - If the documentation does NOT provide enough clear evidence to confidently determine a principal diagnosis, output exactly: I don't know
    - Never guess, infer undocumented conditions, or hallucinate clinical details.

    Tone:
    - Extremely concise
    - Neutral and factual
    - No additional text of any kind

    Now output ONLY the ICD-10-AM CATEGORY code (or “I don't know”):
    """

In [52]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = zero_shot_prompt_v4.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: I21
ICD-10-AM category: I21
Entry 1:
raw model output: I21
ICD-10-AM category: I21
Entry 2:
raw model output: I21
ICD-10-AM category: I21
Entry 3:
raw model output: I21
ICD-10-AM category: I21
Entry 4:
raw model output: I21
ICD-10-AM category: I21


It appears the model simply copies the example code (I21) from the prompt. This can be explained by **anchoring bias**: the model tends to repeat a code it has seen in the prompt rather than reasoning from the note. The next version removes the example code.
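
One way to check for this kind of anchoring is to look for category codes that appear literally in the prompt template and measure how many predictions simply repeat them. This is a small sketch rather than part of the original notebook; `codes_in_template` and `anchoring_rate` are hypothetical helper names, and the template string is a shortened stand-in for the prompt above.

```python
import re


def codes_in_template(template: str) -> set[str]:
    """Collect every standalone 'letter + two digits' token found in the template text."""
    return set(re.findall(r"\b[A-Z][0-9]{2}\b", template))


def anchoring_rate(template: str, predictions: list[str]) -> float:
    """Share of predictions that repeat a code already present in the template."""
    leaked = codes_in_template(template)
    if not predictions:
        return 0.0
    return sum(pred in leaked for pred in predictions) / len(predictions)


# Shortened stand-in for the output rule that contains the example code.
template = "Output EXACTLY ONE ICD-10-AM CATEGORY code (format: one letter + two digits, e.g., I21)."
predictions = ["I21", "I21", "I21", "I21", "I21"]  # parsed outputs printed above

print(codes_in_template(template))
print(anchoring_rate(template, predictions))
```

A rate close to 1.0 across clinically unrelated notes suggests the model is echoing the prompt rather than coding the case. The same check could be run on few-shot templates, where several example codes are present.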

In [53]:
zero_shot_prompt_v5 = """
    You are an expert Australian clinical coder who strictly applies ICD-10-AM and ACS 0001 standards.
    Your reasoning must rely ONLY on the documented clinical facts in the hospital course summary.

    Hospital course summary:
    {note_text}

    Your task:
    Identify the PRINCIPAL DIAGNOSIS - the condition established after study to be chiefly responsible for the admission.

    Strict output rules:
    - Output EXACTLY ONE ICD-10-AM CATEGORY code (format: one letter + two digits).
    - Output MUST match the regex: ^[A-Z][0-9]{{2}}$
    - Do NOT output subcategories, decimals, diagnosis names, or multiple codes.
    - Do NOT provide explanations, clarifications, or commentary.
    - If the documentation does NOT provide enough clear evidence to confidently determine a principal diagnosis, output exactly: I don't know
    - Never guess, infer undocumented conditions, or hallucinate clinical details.

    Tone:
    - Extremely concise
    - Neutral and factual
    - No additional text of any kind

    Now output ONLY the ICD-10-AM CATEGORY code (or “I don't know”):
    """

In [54]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = zero_shot_prompt_v5.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: I don't know
ICD-10-AM category: UNK
Entry 1:
raw model output: I don't know
ICD-10-AM category: UNK
Entry 2:
raw model output: I don't know
ICD-10-AM category: UNK
Entry 3:
raw model output: Lateral Patellar Dislocation
ICD-10-AM category: UNK
Entry 4:
raw model output: I don't know
ICD-10-AM category: UNK


Long prompts seem to hurt the performance of a small model. Let's try a more concise approach.

In [55]:
one_shot_prompt_v1 = """
    You are an expert Australian clinical coder.

    Example:
    Hospital course summary:
    "Patient admitted with chest pain; after study diagnosed with acute myocardial infarction."
    ICD-10-AM code:
    I21

    Now code this case:
    "{note_text}"

    Rules:
    - Output EXACTLY ONE ICD-10-AM CATEGORY (1 letter + 2 digits).
    - No words, no punctuation, no explanations.
    - If unclear, output: I don't know

    Code:

"""


In [56]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = one_shot_prompt_v1.format(note_text=note_text) # inject the current note into the prompt template
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: I210800
ICD-10-AM category: I21
Entry 1:
raw model output: I2108
ICD-10-AM category: I21
Entry 2:
raw model output: I2100
ICD-10-AM category: I21
Entry 3:
raw model output: I210500
ICD-10-AM category: I21
Entry 4:
raw model output: I2100
ICD-10-AM category: I21


The same problem again. It seems that once the model sees any code in the prompt, it sticks with it.

In [57]:
three_shot_prompt_v1 = """
    You are an expert Australian clinical coder.

Examples:

1)
Summary: "Chest pain; after study = acute myocardial infarction."
Code: I21

2)
Summary: "Fever, cough, hypoxia; CXR = pneumonia."
Code: J18

3)
Summary: "Polyuria, polydipsia, high glucose; dx = type 2 diabetes."
Code: E11

Now code this case:
{note_text}

Rules:
- Output ONE ICD-10-AM CATEGORY (1 letter + 2 digits)
- No words, no punctuation, no explanations
- If unclear, output: I don't know

Code:

"""

In [58]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = three_shot_prompt_v1.format(note_text=note_text)
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: I21
ICD-10-AM category: I21
Entry 1:
raw model output: I21
ICD-10-AM category: I21
Entry 2:
raw model output: I21
ICD-10-AM category: I21
Entry 3:
raw model output: I21
ICD-10-AM category: I21
Entry 4:
raw model output: I21
ICD-10-AM category: I21


Maybe we need a shorter version.

In [59]:
one_shot_prompt_v2 = """
    You are an expert Australian clinical coder.

    Example:
    Hospital course summary:
    "Patient admitted with chest pain; after study diagnosed with acute myocardial infarction."
    ICD-10-AM code:
    one letter + two digits

    Now summarise this case and code it:
    "{note_text}"

    Rules:
    - Output EXACTLY ONE ICD-10-AM CATEGORY (1 letter + 2 digits).
    - No words, no punctuation, no explanations.
    - If unclear, output: I don't know

    Code:

"""

In [60]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = one_shot_prompt_v2.format(note_text=note_text)
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: ICD-10-AM Category
ICD-10-AM category: UNK
Entry 1:
raw model output: I don't know
ICD-10-AM category: UNK
Entry 2:
raw model output: ICD-10-AM category
ICD-10-AM category: UNK
Entry 3:
raw model output: I don't know
ICD-10-AM category: UNK
Entry 4:
raw model output: ICD-10-AM category
ICD-10-AM category: UNK


The very first prompt performed the best.

In [61]:
zero_shot_prompt_v6 = """
    You are an expert Australian clinical coder.
    Identify the PRINCIPAL DIAGNOSIS from the hospital course summary and output its ICD-10-AM CATEGORY code.
    Rules:
    - Output EXACTLY ONE code.
    - Output ONLY the 3-character ICD-10-AM CATEGORY (1 letter + 2 digits).
    - No extra words, punctuation, or explanation.
    Hospital course summary:
    {note_text}
    Now output ONLY the ICD-10-AM CATEGORY code:
    """


In [62]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = zero_shot_prompt_v6.format(note_text=note_text)
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: B85.1
ICD-10-AM category: B85
Entry 1:
raw model output: C628
ICD-10-AM category: C62
Entry 2:
raw model output: C045.1
ICD-10-AM category: C04
Entry 3:
raw model output: A87.5
ICD-10-AM category: A87
Entry 4:
raw model output: C078.1
ICD-10-AM category: C07


At least it outputs codes, but they are still incorrect.
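
Two different things are being judged in this comparison: whether the model returns something that parses as a category, and whether that category is right. The sketch below separates them for this run, using the parsed outputs printed above and the correct categories listed in the case reviews earlier in this appendix (G05, A52, K75, S83, L03). The function name `score_run` is illustrative.

```python
def score_run(truth: list[str], predicted: list[str]) -> tuple[float, float]:
    """Return (parse rate, accuracy); 'UNK' marks an output with no parsable code."""
    n = len(truth)
    parsed = sum(code != "UNK" for code in predicted)
    correct = sum(t == p for t, p in zip(truth, predicted))
    return parsed / n, correct / n


truth = ["G05", "A52", "K75", "S83", "L03"]      # correct categories for dev entries 0-4
predicted = ["B85", "C62", "C04", "A87", "C07"]  # parsed outputs of zero_shot_prompt_v6

parse_rate, accuracy = score_run(truth, predicted)
print(f"Parse rate: {parse_rate:.2f}, accuracy: {accuracy:.2f}")
```

For this run the parse rate is 1.00 and the accuracy is 0.00. By contrast, `zero_shot_prompt_v5`, where every output was `UNK`, would get a parse rate of 0 under the same function.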

In [63]:
self_check_prompt = """
    You are an expert Australian clinical coder.

    Hospital course summary:
    {note_text}

    First, silently check:
    - What condition was chiefly responsible for the admission?
    - Is it explicitly documented?
    - Does it correspond to exactly one ICD-10-AM CATEGORY code?

    Now output ONLY the 3-character ICD-10-AM code (1 letter + 2 digits), with no explanation:
    """





In [64]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = self_check_prompt.format(note_text=note_text)
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: MENINGITIS
ICD-10-AM category: UNK
Entry 1:
raw model output: 689
ICD-10-AM category: UNK
Entry 2:
raw model output: 111111
ICD-10-AM category: UNK
Entry 3:
raw model output: A84.50
ICD-10-AM category: A84
Entry 4:
raw model output: C078
ICD-10-AM category: C07


<img src = 'images\2025_11_prompt_engineering_clinical_coding_project\MENINGITIS.jpg' width = 500 alt='meningitis'>

"I have no mouth, and I must scream"

In [65]:
cot_hidden_prompt = """
    You are an expert Australian clinical coder.

    Hospital course summary:
    {note_text}

    Let’s think step by step about which condition is the PRINCIPAL DIAGNOSIS and what its ICD-10-AM CATEGORY code is.

    After thinking, output ONLY the 3-character ICD-10-AM CATEGORY code (1 letter + 2 digits), with no words, no punctuation, no explanation.
"""


In [66]:
codes = []
for i in range(5):
    note_text = dev_set[i]["note"]
    prompt = cot_hidden_prompt.format(note_text=note_text)
    raw, code = run_prompt(prompt)
    print(f"Entry {i}:")
    print("raw model output:", raw)
    print("ICD-10-AM category:", code)
    codes.append(code)

Entry 0:
raw model output: MENINGITIS
ICD-10-AM category: UNK
Entry 1:
raw model output: SPHONYRIC ARthritis
ICD-10-AM category: UNK
Entry 2:
raw model output: CPR
ICD-10-AM category: UNK
Entry 3:
raw model output: A
ICD-10-AM category: UNK
Entry 4:
raw model output: C078
ICD-10-AM category: C07


<img src = 'images\2025_11_prompt_engineering_clinical_coding_project\sphony.jpg' width = 500 alt='sphonycitis'>    

I know, little bro, I know.  

It is time to stop. The first time I ran the prompt, the model hallucinated a "SPHONYCITIS" diagnosis. The next time it became "SPHONYRIC ARthritis".
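
This run-to-run drift can be made visible by sending the same prompt several times and tallying the raw outputs. The sketch below is an illustration only: `fake_run_prompt` is a stand-in that alternates between the two hallucinated strings mentioned above, and `output_stability` is a hypothetical helper name. In the notebook one would instead pass a thin wrapper around the real `run_prompt` (not shown in this section) that returns only the raw text.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def output_stability(generate: Callable[[str], str], prompt: str, n_runs: int = 4) -> Counter:
    """Call `generate` n_runs times with the same prompt and count each distinct output."""
    return Counter(generate(prompt) for _ in range(n_runs))


# Stand-in generator (assumption): alternates between the two observed hallucinations.
_outputs = cycle(["SPHONYCITIS", "SPHONYRIC ARthritis"])


def fake_run_prompt(prompt: str) -> str:
    """Ignore the prompt and return the next canned output."""
    return next(_outputs)


counts = output_stability(fake_run_prompt, "same prompt every time")
print(counts)
print("distinct outputs:", len(counts))
```

More than one distinct output for an identical prompt suggests that sampling is enabled during generation. If that is the case, fixing the random seed or switching to greedy decoding would likely make individual prompt comparisons easier to reproduce.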