# Automatic span annotations

We use GPT-4o (the `gpt-4o` model) to automatically annotate the generated text notes with spans marking where each symptom is mentioned. To do so, we tell the model which symptoms the patient experiences and ask it to extract from the clinical note all phrases that mention these symptoms (whether the symptom is reported as present or absent). If a symptom is not mentioned in the note at all, no phrase should be extracted for it.

We use different strategies for the normal notes and for the compact version of these notes (the "advanced" notes).

## Normal notes

We use the function below to generate the prompt for each note. 

In [4]:
import pickle
with open("../data/df_synsum.p", "rb") as file: 
    df = pickle.load(file)

In [9]:
sympt_dict = {"dyspnea": "dysp", "cough": "cough", "nasal symptoms": "nasal", "respiratory pain": "pain", "fever": "fever"}

def generate_prompt(row): 

    text_note = row["text"]
    text_note = text_note.replace("\n", " ")

    prompt = f"The following information is known about the patient's symptoms:\n"
    for sympt, sympt_col_name in sympt_dict.items():  
        sympt_val = row[sympt_col_name]
        prompt += f"- {sympt}: {sympt_val}\n"
    prompt += f"\nFollowing the instructions you received, please extract from the following clinical note all phrases (verbatim) that describe these symptoms:\n\"{text_note}\""

    return prompt

In [10]:
print(generate_prompt(df.loc[0]))

The following information is known about the patient's symptoms:
- dyspnea: no
- cough: no
- nasal symptoms: no
- respiratory pain: no
- fever: high

Following the instructions you received, please extract from the following clinical note all phrases (verbatim) that describe these symptoms:
"**History** Patient reports a significant increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever. There have been no respiratory symptoms such as pain, dyspnea, or cough. The patient illustrates general malaise and mentions feeling very fatigued due to the fever. No notable changes in daily routine or exposure to environments that might typically contribute to fever are reported. Recent stress levels and potential exposure to infectious agents during travels are also discussed.  **Physical Examination** Vital signs show elevated temperature (103 °F). Heart rate is slightly tachycardic at 98 bpm, corresponding with the fever. Oxygen saturation is within

In the system message, we show the LLM the annotation instructions. We explicitly ask the LLM to reply with a JSON object, so we can easily process the extracted phrases further. The annotation instructions are as follows:

---

I will show you a clinical note containing information on a patient's symptoms. For each symptom, I will tell you whether the patient suffers from this symptom or not. 

Your task is to extract phrases from the note that mention these symptoms. The annotation must have the following JSON structure:  
[ 
   {
      "symptom": one of the symptoms ("dyspnea", "cough", "respiratory pain", "fever" or "nasal symptoms")
      "text": phrase in the text that mentions the symptom and whether it is present or absent
   }  
   {
      "symptom": ...
      "text":...
   }
   ...
]

Keep the following instructions in mind:  
- The same symptom may be mentioned multiple times. Include all phrases in which a symptom is mentioned. Consider both the "history" portion of the note, and the "physical examination" portion of the note.
- Also annotate a symptom if the note mentions that the patient does not suffer from it. 
- The phrases do not need to be full sentences, but need to be verbatim as they appear in the note. You are not allowed to alter any words. If you leave out words, use ...
- Order does not matter.
- You will reply only with the JSON itself, and you will not wrap in JSON markers.
- You can only extract phrases from the "clinical note", not from any of the other text in the prompt. 
- Not all symptoms are necessarily mentioned in the note. Do not include a symptom in the JSON if you cannot find any implicit or explicit mention of it in the clinical note. 

---

In [None]:
import openai

SYS_MESSAGE_NORMAL = """I will show you a clinical note containing information on a patient's symptoms. For each symptom, I will tell you whether the patient suffers from this symptom or not. 

Your task is to extract phrases from the note that mention these symptoms. The annotation must have the following JSON structure:  
[ 
   {
      "symptom": one of the symptoms ("dyspnea", "cough", "respiratory pain", "fever" or "nasal symptoms")
      "text": phrase in the text that mentions the symptom and whether it is present or absent
   }  
   {
      "symptom": ...
      "text":...
   }
   ...
]

Keep the following instructions in mind:  
- The same symptom may be mentioned multiple times. Include all phrases in which a symptom is mentioned. Consider both the "history" portion of the note, and the "physical examination" portion of the note.
- Also annotate a symptom if the note mentions that the patient does not suffer from it. 
- The phrases do not need to be full sentences, but need to be verbatim as they appear in the note. You are not allowed to alter any words. If you leave out words, use ...
- Order does not matter.
- You will reply only with the JSON itself, and you will not wrap in JSON markers.
- You can only extract phrases from the "clinical note", not from any of the other text in the prompt. 
- Not all symptoms are necessarily mentioned in the note. Do not include a symptom in the JSON if you cannot find any implicit or explicit mention of it in the clinical note. 
"""

def prompt_GPT_full(row, compl="normal"): 

   messages = []
   system_message = {"role": "system", "content": SYS_MESSAGE_NORMAL}
   messages.append(system_message)

   messages.append({"role": "user", "content": generate_prompt(row)})
   res = openai.chat.completions.create(
      model = "gpt-4o", 
      temperature = 0.2, 
      max_tokens = 2048,
      messages = messages
    )
   response = res.choices[0].message.content # response

   return response

In [14]:
response = prompt_GPT_full(df.loc[0])

In [15]:
print(response)

[
   {
      "symptom": "fever",
      "text": "a significant increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever"
   },
   {
      "symptom": "fever",
      "text": "feeling very fatigued due to the fever"
   },
   {
      "symptom": "fever",
      "text": "Vital signs show elevated temperature (103 °F)"
   },
   {
      "symptom": "fever",
      "text": "corresponding with the fever"
   },
   {
      "symptom": "fever",
      "text": "there are no evident physical findings to explain the fever apart from the stated temperature elevation"
   },
   {
      "symptom": "dyspnea",
      "text": "There have been no respiratory symptoms such as pain, dyspnea, or cough"
   },
   {
      "symptom": "cough",
      "text": "There have been no respiratory symptoms such as pain, dyspnea, or cough"
   },
   {
      "symptom": "respiratory pain",
      "text": "There have been no respiratory symptoms such as pain, dyspnea, or cough"
   }
]


In [41]:
import json
resp = json.loads(response)
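
The system message asks for bare JSON, but a model may occasionally wrap its reply in a Markdown code fence or return malformed JSON. The sketch below is a defensive variant of the `json.loads` call above. `parse_llm_json` is a hypothetical helper that is not part of the original notebook, and whether such failures actually occurred in this pipeline is an assumption.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the example readable


def parse_llm_json(response):
    """Parse an LLM reply as JSON; return None if it cannot be parsed.

    Hypothetical helper: tolerates a reply wrapped in a Markdown code fence.
    """
    cleaned = response.strip()
    if cleaned.startswith(FENCE):
        # drop the opening fence line (it may carry a language tag such as "json")
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # drop the closing fence, if present
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


bare = '[{"symptom": "fever", "text": "high fever"}]'
wrapped = FENCE + "json\n" + bare + "\n" + FENCE

print(parse_llm_json(bare))
print(parse_llm_json(wrapped))
print(parse_llm_json("not json"))
```

Returning `None` instead of raising makes it easy to collect unparseable replies and re-prompt for them later.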

We noticed that the LLM often hallucinates phrases that are not actually present in the note. We therefore post-process the responses by matching each phrase against the note text. If an exact match fails, we fall back to a regex match that ignores capitalization, tolerates punctuation differences where spaces occur, and lets "..." stand for omitted words. We also check whether the symptoms are named correctly in the JSON response (to ensure that the LLM is not extracting additional hallucinated symptoms).

In [74]:
import re

def complete_annotations(ann_obj, compl, filter_empty=True): 

    completed_annotations = {}
    failed_attempts = {}

    for i, phrases in ann_obj.items():

        # retrieve note
        if compl == "normal":
            note = df.loc[int(i), "text"]
        elif compl == "adv": 
            note = df.loc[int(i), "advanced_text"]
        
        # find start and end character of phrase in the note
        # if simple regex fixes don't manage to find the phrase, we put the annotation aside for later
        for entry in phrases:

            phrase = entry["text"]

            if (len(phrase) != 0) or not filter_empty: # empty phrases are filtered out later
                start_idx = note.find(phrase) # check the note for the full phrase

                if start_idx == -1: 
                    regex_pattern = re.escape(phrase).replace(r'\.\.\.', r'.+?') # replace "..." with a regex pattern that matches any characters
                    regex_pattern = re.sub(r'\\ ', r'\\W*', regex_pattern)  # allow optional punctuation where spaces exist
                    match = re.search(regex_pattern, note, re.IGNORECASE) # ignore capital letters
                    if match: 
                        start_idx = match.start()
                        end_idx = match.end()
                    else: 
                        # always reset end_idx, so a failed phrase never inherits the end index of a previous phrase
                        end_idx = -1
                        failed_attempts[i] = phrases
                else: 
                    end_idx = start_idx + len(phrase)

                entry["start"] = start_idx
                entry["end"] = end_idx

                # check if symptoms have the correct names 
                if entry["symptom"] not in ["dyspnea", "cough", "respiratory pain", "fever", "nasal symptoms"]: 
                    failed_attempts[i] = phrases

                if i not in failed_attempts:
                    new_phrase = note[start_idx:end_idx]
                    entry["text"] = new_phrase # make sure the phrase exactly corresponds to what is found in the text

        if i not in failed_attempts:
            if filter_empty:
                completed_annotations[i] = [entry for entry in phrases if len(entry["text"]) != 0] # leave out phrases that are empty ("")
            else: 
                completed_annotations[i] = phrases
            
    return completed_annotations, failed_attempts 
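
To make the fallback matching concrete, here is a small self-contained sketch of the two regex rewrites used in `complete_annotations`. The helper name `build_pattern`, the note, and the phrases are made up for illustration and do not come from the dataset.

```python
import re


def build_pattern(phrase):
    """Turn an extracted phrase into a tolerant regex (same rewrites as in complete_annotations)."""
    pattern = re.escape(phrase).replace(r'\.\.\.', r'.+?')  # "..." stands for any omitted words
    pattern = re.sub(r'\\ ', r'\\W*', pattern)              # allow punctuation where spaces exist
    return pattern


note = "Patient reports a high fever, with chills. No cough."

# lower-case start and an ellipsis for the omitted words
phrase = "patient reports ... chills"
match = re.search(build_pattern(phrase), note, re.IGNORECASE)
print(match.group(0))  # -> Patient reports a high fever, with chills

# missing comma: the space in the phrase is allowed to match ", " in the note
phrase_no_comma = "high fever with chills"
match_no_comma = re.search(build_pattern(phrase_no_comma), note, re.IGNORECASE)
print(match_no_comma.group(0))  # -> high fever, with chills
```

Because the matched span is read back from the note, the stored annotation always contains the note's exact wording, even when the LLM's phrase differed in case or punctuation.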

In [43]:
ann, failed = complete_annotations({"0": resp}, compl="normal")

In [44]:
ann

{'0': [{'symptom': 'fever',
   'text': 'a significant increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever',
   'start': 28,
   'end': 143},
  {'symptom': 'fever',
   'text': 'feeling very fatigued due to the fever',
   'start': 271,
   'end': 309},
  {'symptom': 'fever',
   'text': 'Vital signs show elevated temperature (103 °F)',
   'start': 556,
   'end': 602},
  {'symptom': 'fever',
   'text': 'corresponding with the fever',
   'start': 650,
   'end': 678},
  {'symptom': 'fever',
   'text': 'there are no evident physical findings to explain the fever apart from the stated temperature elevation',
   'start': 985,
   'end': 1088},
  {'symptom': 'dyspnea',
   'text': 'There have been no respiratory symptoms such as pain, dyspnea, or cough',
   'start': 145,
   'end': 216},
  {'symptom': 'cough',
   'text': 'There have been no respiratory symptoms such as pain, dyspnea, or cough',
   'start': 145,
   'end': 216},
  {'symptom': 'respirato

In [45]:
failed

{}

Records for which at least one phrase could not be matched to the text go into the "failed" set. The other records receive "start" and "end" index annotations, indicating where each phrase can be found in the note.

Here, the "failed" set is empty. For the sake of showing what would happen to failed records, we adapt our correct annotation to a partially incorrect one. 

In [46]:
resp[0]["text"] = "an increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever"

In [47]:
ann, failed = complete_annotations({"0": resp}, compl="normal")

In [48]:
failed

{'0': [{'symptom': 'fever',
   'text': 'an increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever',
   'start': -1,
   'end': -1},
  {'symptom': 'fever',
   'text': 'feeling very fatigued due to the fever',
   'start': 271,
   'end': 309},
  {'symptom': 'fever',
   'text': 'Vital signs show elevated temperature (103 °F)',
   'start': 556,
   'end': 602},
  {'symptom': 'fever',
   'text': 'corresponding with the fever',
   'start': 650,
   'end': 678},
  {'symptom': 'fever',
   'text': 'there are no evident physical findings to explain the fever apart from the stated temperature elevation',
   'start': 985,
   'end': 1088},
  {'symptom': 'dyspnea',
   'text': 'There have been no respiratory symptoms such as pain, dyspnea, or cough',
   'start': 145,
   'end': 216},
  {'symptom': 'cough',
   'text': 'There have been no respiratory symptoms such as pain, dyspnea, or cough',
   'start': 145,
   'end': 216},
  {'symptom': 'respiratory pain',
  

Note that the phrase which could not be found in the text has start and end index -1. 

Due to the large number of notes, we cannot edit all these mistakes manually. We therefore constructed a second prompt that asks the LLM to correct these phrases to phrases that can actually be found in the text. 

In [49]:
SYS_MESSAGE = """I will show you a clinical note, together with one or more phrases that were extracted from it. However, some mistakes were made in extracting these phrases. You must correct them."""

def create_phrase_string(phrases): 
    str_format = ""
    for phrase in phrases: 
        str_format += f"- {phrase}\n"
    return str_format

def correcting_prompt(note, extracted_phrases): 
    
    messages = []
    system_message = {"role": "system", "content": SYS_MESSAGE}
    messages.append(system_message)

    note = note.replace("\n", " ")
    user_msg = f"The following is a clinical note:\n{note}\n\n"
    user_msg += "The following phrases were extracted from this note. However, they do not exactly match the text:\n"
    user_msg += create_phrase_string(extracted_phrases)
    user_msg += f"\n\nPlease correct the phrases so they map exactly to a phrase in the text. "
    user_msg += """You must reply with the following JSON format: 
{
   original phrase: corrected phrase
}

You will reply only with the JSON itself, and you will not wrap in JSON markers."""

    print(user_msg)

    messages.append({"role": "user", "content": user_msg})
    res = openai.chat.completions.create(
        model = "gpt-4o", 
        temperature = 0.2, 
        max_tokens = 2048,
        messages = messages
        )
    
    response = res.choices[0].message.content
    
    return response

In [52]:
idx = "0"
wrong_phrases = set([phrase["text"] for phrase in failed[idx] if phrase["start"] == -1])
note = df.loc[int(idx), "text"]
response = correcting_prompt(note, wrong_phrases)

The following is a clinical note:
**History** Patient reports a significant increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever. There have been no respiratory symptoms such as pain, dyspnea, or cough. The patient illustrates general malaise and mentions feeling very fatigued due to the fever. No notable changes in daily routine or exposure to environments that might typically contribute to fever are reported. Recent stress levels and potential exposure to infectious agents during travels are also discussed.  **Physical Examination** Vital signs show elevated temperature (103 °F). Heart rate is slightly tachycardic at 98 bpm, corresponding with the fever. Oxygen saturation is within normal limits at 98%, and lungs are clear to auscultation without any added sounds. Abdominal examination is normal, without tenderness or organomegaly. Skin shows no rashes, warmth, or lesions. Capillary refill time is adequate. Neurological assessment is n

In [53]:
print(response)

{
   "an increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever": "a significant increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever"
}


We then use the complete_annotations function to check again whether the corrected phrase can be found in the note. In this case it can, so the annotation is corrected. This worked for the majority of the incorrectly extracted phrases. A small set of notes (around 10) could not be corrected this way, so we corrected them manually.
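
The notebook does not show how the corrected phrases are written back before the second check. A minimal sketch of that step, assuming the LLM reply has the `{original phrase: corrected phrase}` format requested above, could look as follows. `apply_corrections` is a hypothetical helper name and the phrases are shortened, made-up examples.

```python
import json


def apply_corrections(phrases, corrections):
    """Return new annotation entries with corrected texts substituted.

    Hypothetical helper: `phrases` is a list of {"symptom", "text", ...} dicts,
    `corrections` maps an original phrase to its corrected version.
    Start/end indices are dropped, so they can be recomputed afterwards.
    """
    corrected = []
    for entry in phrases:
        new_text = corrections.get(entry["text"], entry["text"])
        corrected.append({"symptom": entry["symptom"], "text": new_text})
    return corrected


failed_phrases = [
    {"symptom": "fever", "text": "an increase in body temperature", "start": -1, "end": -1},
    {"symptom": "cough", "text": "no cough", "start": 10, "end": 18},
]
llm_reply = '{"an increase in body temperature": "a significant increase in body temperature"}'

corrected_phrases = apply_corrections(failed_phrases, json.loads(llm_reply))
print(corrected_phrases)
```

The resulting list can then be passed through `complete_annotations` again, for example as `complete_annotations({idx: corrected_phrases}, compl="normal")`. Notes that still end up in the failed set are the ones left for manual correction.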

The final span annotations for the normal notes can be found in "data/spans/normal_span_annotations.json". 

In [55]:
with open("../data/spans/normal_span_annotations.json", "r") as file:
    ann = json.load(file)

Example:

In [56]:
ann["5623"]

[{'symptom': 'fever',
  'text': 'mild fever onset over the past two days',
  'start': 28,
  'end': 67},
 {'symptom': 'fever',
  'text': 'Temperature: 37.8°C (low-grade fever)',
  'start': 436,
  'end': 473},
 {'symptom': 'respiratory pain',
  'text': 'notable respiratory pain primarily during deep inspiration and chest movements',
  'start': 78,
  'end': 156},
 {'symptom': 'cough',
  'text': 'The patient denies any cough',
  'start': 158,
  'end': 186},
 {'symptom': 'dyspnea',
  'text': 'The patient denies any cough or dyspnea',
  'start': 158,
  'end': 197},
 {'symptom': 'nasal symptoms',
  'text': 'The patient reports no recent travel, sick contacts, or changes in routine that could explain the symptoms',
  'start': 302,
  'end': 408}]

In [58]:
import textwrap
print(textwrap.fill(df.loc[5623, "text"], 100))

**History** Patient reports mild fever onset over the past two days. There is notable respiratory
pain primarily during deep inspiration and chest movements. The patient denies any cough or dyspnea.
It is stated that regular activities have become uncomfortable due to the persistent chest
discomfort. The patient reports no recent travel, sick contacts, or changes in routine that could
explain the symptoms.  **Physical Examination** Temperature: 37.8°C (low-grade fever). Lungs
auscultated with clear breath sounds and no adventitious sounds. Chest palpation reveals tenderness
in the costal region without any apparent soft tissue swelling. Respirations are regular, with no
signs of labored or distressed breathing. Vital signs: blood pressure 120/80 mmHg, heart rate 76
bpm, respiratory rate 16 breaths per minute. Cardiovascular examination is unremarkable. Skin: warm
and dry, no rash or lesions.


## Advanced notes

Getting the LLM to accurately extract symptom spans from scratch proved more challenging for the advanced notes. We therefore start from the phrases extracted from the normal note, and ask the LLM to match these to phrases in the advanced note.

The system instructions are as follows.

---

I will show you two versions of a clinical note. The first version describes a patient's visit to the doctor's office. The second one describes the same visit, but in a more compact style (using abbreviations and shortcuts), while preserving the overall message. 

I will show you a set of phrases which were extracted from the first version of the note. Your task is to map these to phrases in the second version of the note. 

You must reply with the following JSON format.
{
   phrase in version 1 : corresponding phrase extracted in version 2
}

Keep the following instructions in mind: 
- Please extract phrases verbatim. 
- Please use the empty string if you cannot find a phrase with the same meaning.
- The phrases you extract must have the same meaning, you cannot simply copy phrases that are in the same spot in the text.
- You will reply only with the JSON itself, and you will not wrap in JSON markers.

---

The full prompt is created as follows:

In [63]:
def create_phrase_string(phrases):
    list_str = "[\n"
    sel_phrases = []
    for phrase in phrases: 
        sent = phrase["text"]
        if sent not in sel_phrases: # only include each phrase once
            sel_phrases.append(sent)
            list_str += f"\"{sent}\",\n"
    list_str = list_str[:-2]
    list_str += "\n]"
    return list_str

In [62]:
SYS_MESSAGE_ADV = """I will show you two versions of a clinical note. The first version describes a patient's visit to the doctor's office. The second one describes the same visit, but in a more compact style (using abbreviations and shortcuts), while preserving the overall message. 

I will show you a set of phrases which were extracted from the first version of the note. Your task is to map these to phrases in the second version of the note. 

You must reply with the following JSON format.
{
   phrase in version 1 : corresponding phrase extracted in version 2
}

Keep the following instructions in mind: 
- Please extract phrases verbatim. 
- Please use the empty string if you cannot find a phrase with the same meaning.
- The phrases you extract must have the same meaning, you cannot simply copy phrases that are in the same spot in the text.
- You will reply only with the JSON itself, and you will not wrap in JSON markers.
"""

def prompt_GPT_advanced(note_ver1, note_ver2, phrases): 

   if len(phrases) == 0: 
      return "{}" # if no phrases to be extracted, don't prompt the LLM

   messages = []
   system_message = {"role": "system", "content": SYS_MESSAGE_ADV}
   messages.append(system_message)

   note_ver1 = note_ver1.replace("\n", " ")
   note_ver2 = note_ver2.replace("\n", " ")
   user_msg = f"Clinical note, version 1:\n{note_ver1}\n\n"
   user_msg += "Extracted phrases:\n"
   user_msg += create_phrase_string(phrases)
   user_msg += f"\n\nPlease extract the equivalent phrases from the second version of the note.\n\nClinical note, version 2:\n{note_ver2}"

   print(user_msg)

   messages.append({"role": "user", "content": user_msg})
   res = openai.chat.completions.create(
      model = "gpt-4o", 
      temperature = 0.2, 
      max_tokens = 2048,
      messages = messages
    )
   first_response = res.choices[0].message.content
   messages.append({"role": "assistant", "content": first_response}) # add response of assistant to chat history

   # additional check
   check_msg = "Please check if each extracted phrase has the same meaning as the original phrase. If not, substitute it by the empty string (\"\"). The rest of the JSON must remain unchanged."
   messages.append({"role": "user", "content": check_msg})

   res = openai.chat.completions.create(
      model = "gpt-4o", 
      temperature = 0.2, 
      max_tokens = 2048,
      messages = messages
    )
   final_response = res.choices[0].message.content

   return final_response

In [65]:
normal_note = df.loc[0, "text"]
adv_note = df.loc[0, "advanced_text"]
extracted_phrases = ann["0"]
response = prompt_GPT_advanced(normal_note, adv_note, extracted_phrases)

Clinical note, version 1:
**History** Patient reports a significant increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever. There have been no respiratory symptoms such as pain, dyspnea, or cough. The patient illustrates general malaise and mentions feeling very fatigued due to the fever. No notable changes in daily routine or exposure to environments that might typically contribute to fever are reported. Recent stress levels and potential exposure to infectious agents during travels are also discussed.  **Physical Examination** Vital signs show elevated temperature (103 °F). Heart rate is slightly tachycardic at 98 bpm, corresponding with the fever. Oxygen saturation is within normal limits at 98%, and lungs are clear to auscultation without any added sounds. Abdominal examination is normal, without tenderness or organomegaly. Skin shows no rashes, warmth, or lesions. Capillary refill time is adequate. Neurological assessment is non-focal

In [66]:
print(response)

{
   "significant increase in body temperature over the last 48 hours, exceeding normal ranges, indicating a high fever": "Pt reports high fever for 48 hrs",
   "feeling very fatigued due to the fever": "Describes significant fatigue and malaise",
   "Vital signs show elevated temperature (103 °F)": "VS: Temp 103 °F",
   "corresponding with the fever": "",
   "There have been no respiratory symptoms such as pain, dyspnea, or cough": "denies resp pain, dyspnea, or cough"
}


Note that since the advanced note is more compact, it sometimes leaves out phrases. In this case, "corresponding with the fever" was extracted from the normal note, but could not be matched with any phrase in the advanced note (since it is simply not there). When there is only one such case, we simply remove it from the annotations. When there is more than one such case for a particular note, we try again with another LLM call (see further).
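
The rule described above can be sketched as follows. This is an illustration under one assumption: unmatched cases are counted per annotation entry (counting unique phrases instead would be an equally plausible reading of the text). `triage_unmatched` is a hypothetical helper name that does not appear in the original notebook.

```python
def triage_unmatched(adv_entries):
    """Decide what to do with entries whose phrase was not found in the advanced note.

    Returns (entries, needs_retry):
    - at most one unmatched entry: drop it and keep the rest;
    - two or more unmatched entries: keep everything and flag the note for another LLM call.
    """
    unmatched = [entry for entry in adv_entries if entry["text"] == ""]
    if len(unmatched) <= 1:
        kept = [entry for entry in adv_entries if entry["text"] != ""]
        return kept, False
    return adv_entries, True


one_missing = [
    {"symptom": "fever", "text": "Pt reports high fever for 48 hrs"},
    {"symptom": "fever", "text": ""},
]
two_missing = [
    {"symptom": "fever", "text": "VS: Temp 103 °F"},
    {"symptom": "cough", "text": ""},
    {"symptom": "dyspnea", "text": ""},
]

print(triage_unmatched(one_missing))
print(triage_unmatched(two_missing))
```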

Once we have these matched phrase pairs, we go through the spans extracted from the normal note again and look up the matched phrase for each of them. Beforehand, we anticipated that the LLM might alter the phrase extracted from the normal note in its response, making it impossible to find a match. In practice, however, this never happened.

In [68]:
def annotate_advanced_note(normal_ann, responses):

    adv_ann = {}
    failed = {}
    for i in responses: 
        orig_phrases = normal_ann[i]
        adv_phrases = responses[i]
        adv_entries = []
        for entry in orig_phrases: 
            phrase = entry["text"]
            try: 
                matched_phrase = adv_phrases[phrase]
            except KeyError: 
                # the LLM altered the original phrase, so there is no exact match:
                # leave the phrase empty and flag the note for later inspection
                print("no exact match found!")
                matched_phrase = ""
                failed[i] = adv_phrases

            adv_entries.append({"symptom": entry["symptom"], "text": matched_phrase})
        adv_ann[i] = adv_entries

    return adv_ann, failed

In [67]:
resp = json.loads(response)

In [77]:
adv_ann, failed = annotate_advanced_note({"0": ann["0"]}, {"0": resp})

In [78]:
adv_ann

{'0': [{'symptom': 'fever', 'text': 'Pt reports high fever for 48 hrs'},
  {'symptom': 'fever', 'text': 'Describes significant fatigue and malaise'},
  {'symptom': 'fever', 'text': 'VS: Temp 103 °F'},
  {'symptom': 'fever', 'text': ''},
  {'symptom': 'dyspnea', 'text': 'denies resp pain, dyspnea, or cough'},
  {'symptom': 'cough', 'text': 'denies resp pain, dyspnea, or cough'},
  {'symptom': 'respiratory pain',
   'text': 'denies resp pain, dyspnea, or cough'}]}

Like before, we then try to match the phrases extracted from the advanced note with the text. If a phrase cannot be matched, the annotation is added to the batch of failed notes, and we correct the phrase using the same prompt we used for the normal notes. 

In [79]:
adv_ann, failed = complete_annotations(adv_ann, compl="adv", filter_empty=False)

In [80]:
adv_ann

{'0': [{'symptom': 'fever',
   'text': 'Pt reports high fever for 48 hrs',
   'start': 12,
   'end': 44},
  {'symptom': 'fever',
   'text': 'Describes significant fatigue and malaise',
   'start': 83,
   'end': 124},
  {'symptom': 'fever', 'text': 'VS: Temp 103 °F', 'start': 245, 'end': 260},
  {'symptom': 'fever', 'text': '', 'start': 0, 'end': 0},
  {'symptom': 'dyspnea',
   'text': 'denies resp pain, dyspnea, or cough',
   'start': 46,
   'end': 81},
  {'symptom': 'cough',
   'text': 'denies resp pain, dyspnea, or cough',
   'start': 46,
   'end': 81},
  {'symptom': 'respiratory pain',
   'text': 'denies resp pain, dyspnea, or cough',
   'start': 46,
   'end': 81}]}

As mentioned before, we keep track of the annotations for which no match could be found in the advanced note. If there are two or more such cases in the same note, we try to fix them with an additional LLM call. Instead of using the original annotations found in the normal note, and matching them with the advanced note, we ask the LLM to extract symptom phrases from the advanced note directly. By way of example, we adapt the annotations in our example so they fall in this category. Imagine the LLM did not find the denial of pain, dyspnea and cough in the advanced note: 

In [81]:
adv_ann["0"][4]["text"] = ""
adv_ann["0"][5]["text"] = ""
adv_ann["0"][6]["text"] = ""

We use the following prompt to correct these cases. The system instructions are:

---

I will show you a clinical note containing information on a patient's symptoms. For each symptom, I will tell you whether the patient suffers from this symptom or not. 

You will get to see phrases that have been extracted from the note, describing some of these symptoms. This input will have the following JSON structure:
[ 
   {
      "symptom": one of the symptoms ("dyspnea", "cough", "respiratory pain", "fever" or "nasal symptoms")
      "text": phrase in the text that mentions the symptom and whether it is present or absent
   }  
   {
      "symptom": ...
      "text":...
   }
   ...
]

Some phrases have not been filled in yet (indicated by "?" in the "text" field). Your task is to fill in these phrases, leaving the rest of the JSON and its structure untouched.

Keep the following instructions in mind:  
- Also annotate a symptom if the note mentions that the patient does not suffer from it. 
- The phrases do not need to be full sentences, but need to be verbatim as they appear in the note. You are not allowed to alter any words. If you leave out words, use ...
- You will reply only with the JSON itself, and you will not wrap in JSON markers.
- You can only extract phrases from the "clinical note", not from any of the other text in the prompt. 
- Not all symptoms are necessarily mentioned in the note. Simply fill in "" if you cannot find any implicit or explicit mention of it in the clinical note. 

---

In [83]:
def create_JSON_str(phrases):
    list_str = "[\n"
    for phrase in phrases: 
        sent = phrase["text"]
        sympt = phrase["symptom"]
        if len(sent) == 0:
            list_str += f"   {{\"symptom\":\"{sympt}\", \"text\": \"?\"}},\n"
        else: 
            list_str += f"   {{\"symptom\":\"{sympt}\", \"text\": \"{sent}\"}},\n"
    list_str = list_str[:-2]
    list_str += "\n]"
    return list_str

SYS_MESSAGE_CORR = """I will show you a clinical note containing information on a patient's symptoms. For each symptom, I will tell you whether the patient suffers from this symptom or not. 

You will get to see phrases that have been extracted from the note, describing some of these symptoms. This input will have the following JSON structure:
[ 
   {
      "symptom": one of the symptoms ("dyspnea", "cough", "respiratory pain", "fever" or "nasal symptoms")
      "text": phrase in the text that mentions the symptom and whether it is present or absent
   }  
   {
      "symptom": ...
      "text":...
   }
   ...
]

Some phrases have not been filled in yet (indicated by "?" in the "text" field). Your task is to fill in these phrases, leaving the rest of the JSON and its structure untouched.

Keep the following instructions in mind:  
- Also annotate a symptom if the note mentions that the patient does not suffer from it. 
- The phrases do not need to be full sentences, but need to be verbatim as they appear in the note. You are not allowed to alter any words. If you leave out words, use ...
- You will reply only with the JSON itself, and you will not wrap in JSON markers.
- You can only extract phrases from the "clinical note", not from any of the other text in the prompt. 
- Not all symptoms are necessarily mentioned in the note. Simply fill in "" if you cannot find any implicit or explicit mention of it in the clinical note. 
"""

sympt_dict = {"dyspnea": "dysp", "cough": "cough", "nasal symptoms": "nasal", "respiratory pain": "pain", "fever": "fever"}

def prompt_GPT_advanced_corrected(row, phrases): 

   if len(phrases) == 0: 
      return "{}" # if no phrases to be extracted, don't prompt the LLM

   messages = []
   system_message = {"role": "system", "content": SYS_MESSAGE_CORR}
   messages.append(system_message)

   note_adv = row["advanced_text"].replace("\n", " ")

   user_msg = f"The following information is known about the patient's symptoms:\n"
   for sympt, sympt_col_name in sympt_dict.items():  
      sympt_val = row[sympt_col_name]
      user_msg += f"- {sympt}: {sympt_val}\n"
   
   user_msg += f"\n Following the instructions you received, please extract from the following clinical note all phrases (verbatim) that describe these symptoms:\n{note_adv}\n\n"
   user_msg += "Please fill in the missing text (indicated with \"?\"). If you cannot find any mention of a symptom, simply fill in \"\".\n"
   user_msg += create_JSON_str(phrases)

   print(user_msg)
   
   messages.append({"role": "user", "content": user_msg})
   res = openai.chat.completions.create(
      model="gpt-4o",
      temperature=0.2,
      max_tokens=2048,
      messages=messages,
   )
   response = res.choices[0].message.content

   return response

In [85]:
response = prompt_GPT_advanced_corrected(df.loc[0], adv_ann["0"])

The following information is known about the patient's symptoms:
- dyspnea: no
- cough: no
- nasal symptoms: no
- respiratory pain: no
- fever: high

 Following the instructions you received, please extract from the following clinical note all phrases (verbatim) that describe these symptoms:
**History** Pt reports high fever for 48 hrs, denies resp pain, dyspnea, or cough. Describes significant fatigue and malaise. No recent routine changes or known infectious exposures. Discusses recent stress and travel.  **Physical Examination** VS: Temp 103 °F, HR 98 bpm (tachycardic), O2 sat 98%. Lungs clear, no adventitious sounds. Abd: non-tender, no organomegaly. Skin: no rashes/lesions, normal CRT. Neuro: non-focal. Overall: WNL apart from elevated fever.

Please fill in the missing text (indicated with "?"). If you cannot find any mention of a symptom, simply fill in "".
[
   {"symptom":"fever", "text": "Pt reports high fever for 48 hrs"},
   {"symptom":"fever", "text": "Describes significant

In [86]:
print(response)

[
   {"symptom":"fever", "text": "Pt reports high fever for 48 hrs"},
   {"symptom":"fever", "text": "Describes significant fatigue and malaise"},
   {"symptom":"fever", "text": "VS: Temp 103 °F"},
   {"symptom":"fever", "text": "Overall: WNL apart from elevated fever"},
   {"symptom":"dyspnea", "text": "denies resp pain, dyspnea, or cough"},
   {"symptom":"cough", "text": "denies resp pain, dyspnea, or cough"},
   {"symptom":"respiratory pain", "text": "denies resp pain, dyspnea, or cough"}
]


In [88]:
resp = json.loads(response)

In [89]:
adv_ann["0"] = resp
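
The `json.loads` call above raises an exception if the model wraps its reply in code-fence markers despite the instructions. A minimal sketch of a more forgiving parser, assuming the reply is either bare JSON or JSON wrapped in a Markdown fence (`parse_llm_json` is a hypothetical helper, not part of the original pipeline):

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_llm_json(reply):
    """Parse a model reply that should contain a JSON list of annotations.

    Hypothetical helper: strips an optional Markdown fence before parsing
    and returns None when the reply is still not valid JSON.
    """
    text = reply.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


parsed = parse_llm_json('[{"symptom": "fever", "text": "VS: Temp 103 °F"}]')
```

Returning `None` instead of raising lets the caller collect failed notes and re-prompt them later, which is one possible design; raising would be equally reasonable.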

With the partially filled-in JSON as guidance, the LLM is now able to extract the phrases that point to these symptoms. We then use the "complete_annotations" function, as before, to complete the annotations. For the failed notes, the phrases that could not be found in the text are corrected with an additional LLM call (as before). This time we remove the empty annotations, since after a second prompt it is likely that these symptoms really are not mentioned in the advanced note. 

In [90]:
adv_ann, failed = complete_annotations(adv_ann, compl="adv", filter_empty=True)

In [92]:
adv_ann

{'0': [{'symptom': 'fever',
   'text': 'Pt reports high fever for 48 hrs',
   'start': 12,
   'end': 44},
  {'symptom': 'fever',
   'text': 'Describes significant fatigue and malaise',
   'start': 83,
   'end': 124},
  {'symptom': 'fever', 'text': 'VS: Temp 103 °F', 'start': 245, 'end': 260},
  {'symptom': 'fever',
   'text': 'Overall: WNL apart from elevated fever',
   'start': 425,
   'end': 463},
  {'symptom': 'dyspnea',
   'text': 'denies resp pain, dyspnea, or cough',
   'start': 46,
   'end': 81},
  {'symptom': 'cough',
   'text': 'denies resp pain, dyspnea, or cough',
   'start': 46,
   'end': 81},
  {'symptom': 'respiratory pain',
   'text': 'denies resp pain, dyspnea, or cough',
   'start': 46,
   'end': 81}]}
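
Judging from the output above, `start` and `end` are character offsets into the note with `end` exclusive, so slicing the note should reproduce each phrase. A small sanity check under that assumption (`check_offsets` is a hypothetical helper; the note is the first sentence of the compact note shown earlier):

```python
def check_offsets(note, spans):
    """Return the spans whose offsets do not reproduce their text."""
    return [s for s in spans if note[s["start"]:s["end"]] != s["text"]]


note = "**History** Pt reports high fever for 48 hrs, denies resp pain, dyspnea, or cough."
spans = [
    {"symptom": "fever", "text": "Pt reports high fever for 48 hrs", "start": 12, "end": 44},
    {"symptom": "cough", "text": "denies resp pain, dyspnea, or cough", "start": 46, "end": 81},
]

mismatches = check_offsets(note, spans)
print(mismatches)  # an empty list means every offset matches its phrase
```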

Finally, we delete duplicates from the annotations (exact same symptom and exact same phrase). This then leaves us with the full annotations for the advanced notes, of which an example is shown below. 
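
A minimal sketch of the deduplication step, assuming two annotations count as duplicates when both `symptom` and `text` match and that the first occurrence is kept (`drop_duplicate_spans` is a hypothetical name; the code actually used for this step is not shown here):

```python
def drop_duplicate_spans(spans):
    """Keep the first annotation for every (symptom, text) pair."""
    seen = set()
    unique = []
    for span in spans:
        key = (span["symptom"], span["text"])
        if key not in seen:
            seen.add(key)
            unique.append(span)
    return unique


spans = [
    {"symptom": "fever", "text": "Temp 39.2°C", "start": 368, "end": 379},
    {"symptom": "fever", "text": "Temp 39.2°C", "start": 368, "end": 379},
    {"symptom": "cough", "text": "no dyspnea or cough", "start": 110, "end": 129},
]
deduplicated = drop_duplicate_spans(spans)
```

Note that the same phrase annotated for two different symptoms is kept twice, which matches the "exact same symptom and exact same phrase" criterion.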

In [93]:
with open("../data/spans/adv_span_annotations.json", "r") as file:
    ann = json.load(file)

Example:

In [97]:
ann["8596"]

[{'symptom': 'dyspnea', 'text': 'no dyspnea', 'start': 110, 'end': 120},
 {'symptom': 'cough', 'text': 'no dyspnea or cough', 'start': 110, 'end': 129},
 {'symptom': 'nasal symptoms',
  'text': 'No notable PMHx of resp illnesses or chronic conditions.',
  'start': 285,
  'end': 341},
 {'symptom': 'respiratory pain',
  'text': 'sharp resp pain during deep breaths',
  'start': 73,
  'end': 108},
 {'symptom': 'respiratory pain',
  'text': 'Pain localized to lower right thorax, worsens with movement or cough.',
  'start': 131,
  'end': 200},
 {'symptom': 'fever',
  'text': 'sudden onset of high fever x 2 days',
  'start': 23,
  'end': 58},
 {'symptom': 'fever', 'text': 'Temp 39.2°C', 'start': 368, 'end': 379}]

In [98]:
import textwrap
print(textwrap.fill(df.loc[5623, "advanced_text"], 100))

**History** Pt reports mild fever onset x 2 days. Notable respiratory pain w/ deep inspiration &
chest movements. Denies cough or dyspnea. Stated regular activities uncomfortable due to chest
discomfort. No recent travel, sick contacts, or routine changes explained.  **Physical Examination**
Temp: 37.8°C (low-grade fever). Lungs: clear breath sounds, no adventitious sounds. Chest: tender on
costal palp, no soft tissue swelling. Regular respirations, no labored breathing. Vitals: BP 120/80
mmHg, HR 76 bpm, RR 16 bpm. CV exam unremarkable. Skin: warm, dry, no rash/lesions.


## Quality evaluation

We manually evaluated 100 randomly selected notes. The results can be found in "eval/spans/span_eval_results.xlsx", with "eval/spans/span_eval_dataset.txt" containing the subset of notes we evaluated. 

We went over every note (normal and compact versions) and annotated the following: 
- missing phrases: how many additional phrases mentioning a symptom were found in the text, that were not extracted by the LLM
- incorrect phrases: how many of the phrases extracted by the LLM were incorrect

These metrics allow us to calculate precision (how many phrases are correct, over the total number of extracted phrases per note) and recall (how many correct phrases are recovered, over the total number of correct phrases). The total number of correct phrases is the sum of the correct phrases (the extracted phrases minus the incorrect phrases) and the phrases that were missed. 

We calculate precision and recall for every note, and then average them at the end.
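
As a worked example of these definitions, using made-up counts rather than values from the evaluation sheet: a note with 5 extracted phrases, of which 1 is incorrect, and 1 missed phrase has 4 correct phrases, so precision is 4/5 and recall is 4/(4+1). The sketch below (the name `note_precision_recall` is illustrative) reproduces that arithmetic, scores a note with an empty denominator as 1, and then averages over notes:

```python
def note_precision_recall(n_extracted, n_incorrect, n_missing):
    """Per-note precision and recall from the manually annotated counts."""
    n_correct = n_extracted - n_incorrect  # true positives
    precision = n_correct / n_extracted if n_extracted else 1.0
    n_total = n_correct + n_missing  # true positives + false negatives
    recall = n_correct / n_total if n_total else 1.0
    return precision, recall


# (n_extracted, n_incorrect, n_missing) for three illustrative notes
notes = [(5, 1, 1), (0, 0, 0), (4, 0, 1)]
scores = [note_precision_recall(*counts) for counts in notes]

# average the per-note scores
macro_precision = sum(p for p, _ in scores) / len(scores)
macro_recall = sum(r for _, r in scores) / len(scores)
print(f"precision: {macro_precision:.3f}, recall: {macro_recall:.3f}")
```

Averaging per note gives every note equal weight, regardless of how many phrases it contains.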

In [107]:
import pandas as pd
eval = pd.read_csv("../eval/spans/span_eval_results.csv", sep=";", header=[0])

In [110]:
eval.columns = ["note_idx", "missing_normal", "incorrect_normal", "missing_adv", "incorrect_adv"]
eval.set_index("note_idx", inplace=True)

In [113]:
import json
with open("../data/spans/normal_span_annotations.json", "r") as file: 
    ann_normal = json.load(file)
with open("../data/spans/adv_span_annotations.json", "r") as file: 
    ann_adv = json.load(file)

We need the total number of extracted phrases for each note. 

In [114]:
for idx, row in eval.iterrows(): 
    eval.loc[idx, "n_phrases_normal"] = len(ann_normal[str(idx)])
    eval.loc[idx, "n_phrases_adv"] = len(ann_adv[str(idx)])

We calculate precision and recall for each note. 

In [115]:
def calc_prec(row, suffix="normal"): 

    n_incorrect = row[f"incorrect_{suffix}"] # number of phrases that were incorrectly extracted (false positives)
    n_extracted = row[f"n_phrases_{suffix}"] # number of phrases that were extracted (true positives + false positives)

    if n_extracted != 0:
        precision = 1 - n_incorrect/n_extracted # 1 - false positives/(true positives + false positives)
    else: 
        precision = 1

    return precision

In [116]:
def calc_recall(row, suffix="normal"): 
    
    n_incorrect = row[f"incorrect_{suffix}"] # number of phrases that were incorrectly extracted (false positives)
    n_missing = row[f"missing_{suffix}"] # number of phrases that were missed (false negatives)
    n_extracted = row[f"n_phrases_{suffix}"] # number of phrases that were extracted (true positives + false positives)

    n_correct = n_extracted - n_incorrect # number of phrases that were correctly extracted (true positives)
    if n_correct < 0: 
        print(f"Warning: note {row.name} has more incorrect phrases than extracted phrases")
    n_total = n_correct + n_missing # number of phrases that should have been extracted (true positives + false negatives)

    if n_total != 0:
        recall = n_correct/n_total # true positives/(true positives + false negatives)
    else: 
        recall = 1

    return recall

In [117]:
eval["precision_normal"] = eval.apply(calc_prec, axis=1, args=("normal",))
eval["recall_normal"] = eval.apply(calc_recall, axis=1, args=("normal",))

In [118]:
eval["precision_adv"] = eval.apply(calc_prec, axis=1, args=("adv",))
eval["recall_adv"] = eval.apply(calc_recall, axis=1, args=("adv",))

In [119]:
eval.describe().loc[["mean"], ["precision_normal", "precision_adv"]]

| | precision_normal | precision_adv |
|---|---|---|
| mean | 0.940643 | 0.935536 |


In [120]:
eval.describe().loc[["mean"], ["recall_normal", "recall_adv"]]

| | recall_normal | recall_adv |
|---|---|---|
| mean | 0.98781 | 0.945036 |


## Statistics

We calculate some statistics on the spans, such as in how many spans a symptom is found, or how long the spans are on average. 

In [123]:
import pickle
with open("../data/df_synsum.p", "rb") as file: 
    df = pickle.load(file)
df = df[["dysp", "cough", "pain", "fever", "nasal", "text", "advanced_text"]]

As a pre-processing step, we calculate the number of spans extracted per symptom, for every note. 

In [150]:
sympt_dict = {"dysp": "dyspnea", "cough": "cough", "nasal": "nasal symptoms", "pain": "respiratory pain", "fever": "fever"}
def count_sympt(row, ann, sympt): 
    n = 0
    spans = ann[str(row.name)]
    for span in spans: 
        if span["symptom"] == sympt_dict[sympt]:
            n += 1
    return n

for sympt in sympt_dict:  
    df[f"count_{sympt}"] = df.apply(count_sympt, axis=1, args=(ann_normal, sympt,))

**Number of spans per symptom, per note, on average**

In [164]:
for sympt in sympt_dict:
    n = df[f"count_{sympt}"].mean()
    print(f"Average # of spans per note for {sympt}: {n:.2f}")
n = (df["count_dysp"]+df["count_cough"]+df["count_pain"]+df["count_fever"]+df["count_nasal"]).mean()
print(f"Average # of spans per note: {n:.2f}")

Average # of spans per note for dysp: 0.96
Average # of spans per note for cough: 1.03
Average # of spans per note for nasal: 0.85
Average # of spans per note for pain: 0.52
Average # of spans per note for fever: 0.79
Average # of spans per note: 4.16


**Distribution of symptoms over all extracted spans**

In [127]:
for sympt in sympt_dict:
    n = df[f"count_{sympt}"].sum()
    n_total = (df["count_dysp"]+df["count_cough"]+df["count_pain"]+df["count_fever"]+df["count_nasal"]).sum()
    print(f"Perc. of spans belonging to {sympt}: {n/n_total*100:.2f}%")

Perc. of spans belonging to dysp: 23.16%
Perc. of spans belonging to cough: 24.90%
Perc. of spans belonging to nasal: 20.35%
Perc. of spans belonging to pain: 12.59%
Perc. of spans belonging to fever: 19.01%


**How long are the spans on average?**

In [128]:
import numpy as np
def count_avg_length(ann, sympt):
    char_len = []
    word_len = []
    for i in ann: 
        spans = ann[i]
        for span in spans: 
            if (sympt is None) or (span["symptom"] == sympt_dict[sympt]):
                words = span["text"].split(" ")
                char_len.append(len(span["text"]))
                word_len.append(len(words))
    return np.mean(char_len), np.mean(word_len)

In [129]:
for sympt in sympt_dict:
    char_len, word_len = count_avg_length(ann_normal, sympt)
    print(f"{sympt} avg number of words: {word_len:.2f}, avg number of chars: {char_len:.2f}")
char_len, word_len = count_avg_length(ann_normal, None)
print(f"avg number of words: {word_len:.2f}, avg number of chars: {char_len:.2f}")

dysp avg number of words: 8.34, avg number of chars: 57.21
cough avg number of words: 8.83, avg number of chars: 56.70
nasal avg number of words: 8.92, avg number of chars: 63.05
pain avg number of words: 9.34, avg number of chars: 63.54
fever avg number of words: 7.04, avg number of chars: 44.82
avg number of words: 8.46, avg number of chars: 56.71


**What percentage of spans is found in the history portion of the note? What percentage is found in the physical examination portion of the note?**

In [133]:
import pickle
with open("../data/emb/df_train_emb_part1.p", "rb") as file:
    df_emb_train1 = pickle.load(file)
with open("../data/emb/df_train_emb_part2.p", "rb") as file:
    df_emb_train2 = pickle.load(file)
df_train = pd.concat([df_emb_train1, df_emb_train2])
with open("../data/emb/df_test_emb.p", "rb") as file:
    df_test = pickle.load(file)
df = pd.merge(df, df_train[["text_hist", "text_phys_exam", "advanced_text_hist", "advanced_text_phys_exam"]], how="left", left_index=True, right_index=True)
df.loc[df_test.index, "text_hist"] = df_test["text_hist"]
df.loc[df_test.index, "text_phys_exam"] = df_test["text_phys_exam"]
df.loc[df_test.index, "advanced_text_hist"] = df_test["advanced_text_hist"]
df.loc[df_test.index, "advanced_text_phys_exam"] = df_test["advanced_text_phys_exam"]

In [141]:
def count_hist_phys(ann, df, sympt, compl): 
    n_hist = 0
    n_phys = 0
    if compl == "normal": 
        text_hist = df["text_hist"]
        text_phys = df["text_phys_exam"]
    else: 
        text_hist = df["advanced_text_hist"]
        text_phys = df["advanced_text_phys_exam"]
    for i in ann: 
        for span in ann[i]:
            if (sympt is None) or span["symptom"] == sympt_dict[sympt]:
                if span["text"] in text_hist.loc[int(i)]: 
                    n_hist += 1
                elif span["text"] in text_phys.loc[int(i)]: 
                    n_phys += 1
                else: 
                    print(f"Span not found in either section of note {i}: {span['text']!r}")
    return n_hist, n_phys

In [142]:
for sympt in sympt_dict: 
    n_hist, n_phys = count_hist_phys(ann_normal, df, sympt, "normal")
    print(f"{sympt} span found in hist: {n_hist/(n_hist+n_phys)*100:.2f}%, {sympt} span found in phys: {n_phys/(n_hist+n_phys)*100:.2f}%")
n_hist, n_phys = count_hist_phys(ann_normal, df, None, "normal")
print(f"span found in hist: {n_hist/(n_hist+n_phys)*100:.2f}%, span found in phys: {n_phys/(n_hist+n_phys)*100:.2f}%")

dysp span found in hist: 79.30%, dysp span found in phys: 20.70%
cough span found in hist: 91.54%, cough span found in phys: 8.46%
nasal span found in hist: 63.03%, nasal span found in phys: 36.97%
pain span found in hist: 81.57%, pain span found in phys: 18.43%
fever span found in hist: 65.87%, fever span found in phys: 34.13%
span found in hist: 76.77%, span found in phys: 23.23%
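
The substring test above counts a phrase as belonging to the history section whenever its text occurs there, even if the annotated occurrence lies in the physical examination. An alternative sketch, assuming every note contains the literal header `**Physical Examination**` (as in the examples shown) and that `start` is a character offset into the full note, classifies each span by position instead (`span_section` is a hypothetical helper):

```python
PHYS_HEADER = "**Physical Examination**"


def span_section(note, span):
    """Return 'hist' or 'phys' depending on where the span starts."""
    boundary = note.find(PHYS_HEADER)
    if boundary == -1:
        return "unknown"  # header missing, cannot decide
    return "hist" if span["start"] < boundary else "phys"


note = "**History** Pt reports high fever for 48 hrs.  **Physical Examination** VS: Temp 103 °F."
hist_span = {"symptom": "fever", "text": "high fever", "start": note.index("high fever")}
phys_span = {"symptom": "fever", "text": "Temp 103 °F", "start": note.index("Temp 103 °F")}

print(span_section(note, hist_span), span_section(note, phys_span))
```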


**Conditional on whether a symptom is present in the patient, how often is it mentioned in the note?**

In [143]:
for sympt in sympt_dict:
    if sympt != "fever":
        df_subset = df[df[sympt] == "yes"]
        perc = len(df_subset[df_subset[f"count_{sympt}"] != 0])/len(df_subset)
        print(f"Perc mentioned for {sympt}=yes: {perc*100:.2f}%")
sympt = "fever"
df_subset = df[df[sympt] == "low"]
perc = len(df_subset[df_subset[f"count_{sympt}"] != 0])/len(df_subset)
print(f"Perc mentioned for {sympt}=low: {perc*100:.2f}%")
sympt = "fever"
df_subset = df[df[sympt] == "high"]
perc = len(df_subset[df_subset[f"count_{sympt}"] != 0])/len(df_subset)
print(f"Perc mentioned for {sympt}=high: {perc*100:.2f}%")

Perc mentioned for dysp=yes: 99.05%
Perc mentioned for cough=yes: 96.97%
Perc mentioned for nasal=yes: 95.66%
Perc mentioned for pain=yes: 81.72%
Perc mentioned for fever=low: 71.60%
Perc mentioned for fever=high: 95.89%


In [144]:
for sympt in sympt_dict:
    if sympt != "fever":
        df_subset = df[df[sympt] == "no"]
        perc = len(df_subset[df_subset[f"count_{sympt}"] != 0])/len(df_subset)
        print(f"Perc mentioned for {sympt}=no: {perc*100:.2f}%")
sympt = "fever"
df_subset = df[df[sympt] == "none"]
perc = len(df_subset[df_subset[f"count_{sympt}"] != 0])/len(df_subset)
print(f"Perc mentioned for {sympt}=none: {perc*100:.2f}%")

Perc mentioned for dysp=no: 66.54%
Perc mentioned for cough=no: 64.48%
Perc mentioned for nasal=no: 26.55%
Perc mentioned for pain=no: 39.91%
Perc mentioned for fever=none: 44.44%


**Conditional on whether it is present in a patient, how many times is a symptom mentioned in a note on average?**

In [146]:
for sympt in sympt_dict:
    if sympt != "fever":
        df_subset = df[df[sympt] == "yes"]
        avg = df_subset[f"count_{sympt}"].mean()
        print(f"# times mentioned when {sympt}=yes: {avg:.2f}")
sympt = "fever"
for level in ["low", "high"]:  # fever is graded rather than yes/no
    df_subset = df[df[sympt] == level]
    avg = df_subset[f"count_{sympt}"].mean()
    print(f"# times mentioned when {sympt}={level}: {avg:.2f}")

# times mentioned when dysp=yes: 2.03
# times mentioned when cough=yes: 1.78
# times mentioned when nasal=yes: 2.47
# times mentioned when pain=yes: 1.28
# times mentioned when fever=low: 1.55
# times mentioned when fever=high: 2.46


In [147]:
for sympt in sympt_dict:
    if sympt != "fever":
        df_subset = df[df[sympt] == "no"]
        avg = df_subset[f"count_{sympt}"].mean()
        print(f"# times mentioned when {sympt}=no: {avg:.2f}")
sympt = "fever"
df_subset = df[df[sympt] == "none"]
avg = df_subset[f"count_{sympt}"].mean()
print(f"# times mentioned when {sympt}=none: {avg:.2f}")

# times mentioned when dysp=no: 0.70
# times mentioned when cough=no: 0.65
# times mentioned when nasal=no: 0.30
# times mentioned when pain=no: 0.40
# times mentioned when fever=none: 0.52


**How many notes have no spans at all?**

In [148]:
n = 0
for i in ann_normal: 
    spans = ann_normal[i]
    if len(spans) == 0: 
        n += 1
print(f"# notes where no symptom is mentioned at all: {n} ({n/len(ann_normal)*100:.2f}%)")

# notes where no symptom is mentioned at all: 1628 (16.28%)


We can re-run the code above for the compact notes as well. 
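
One way to do this without duplicating cells is to wrap each statistic in a function that takes the annotation dictionary as an argument and to call it once per note version. A minimal sketch with toy annotations (`spans_per_symptom` and the toy dictionaries are illustrative; in practice the arguments would be `ann_normal` and `ann_adv`):

```python
from collections import Counter


def spans_per_symptom(ann):
    """Count the extracted spans per symptom over all notes."""
    counts = Counter()
    for spans in ann.values():
        counts.update(span["symptom"] for span in spans)
    return counts


# toy stand-ins for the normal and compact annotation dictionaries
toy_normal = {
    "0": [{"symptom": "fever", "text": "high fever"}, {"symptom": "cough", "text": "no cough"}],
    "1": [{"symptom": "fever", "text": "Temp 39.2°C"}],
}
toy_adv = {
    "0": [{"symptom": "fever", "text": "high fever"}],
    "1": [],
}

for name, ann in [("normal", toy_normal), ("adv", toy_adv)]:
    print(name, dict(spans_per_symptom(ann)))
```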