# MedQA Example

We walk through the steps of applying our method to questions from the MedQA dataset [(Jin et al., Applied Sciences 2021)](https://www.mdpi.com/2076-3417/11/14/6421).

Some steps in this notebook use the OpenAI API to call GPT-4o. Before running this notebook, make sure to add the path to your API key in the file ``language_models/chat_gpt.py``.

We use ``gpt-4o-2024-05-13`` as the language model. If this model has been deprecated, replace it with a more recent model in the commands below.

### Imports

In [2]:
%load_ext autoreload
%autoreload 2

In [1]:
import glob
import json
import os

In [3]:
import sys

sys.path.append('../src')

In [4]:
from my_datasets.medqa import MedQADataset

In [5]:
OUTPUT_DIR = "../outputs/medqa-example"

### Examine Example Question

In [6]:
medqa_dataset = MedQADataset('medqa_test', '../data/medqa')

Question Metadata

In [7]:
medqa_dataset.data[521]

{'question_number': 'US_4_options_1192',
 'extra': 'US_4_options',
 'split': 'test',
 'question': 'A 19-year-old woman is brought into the emergency department after collapsing during a cheerleading practice session. Her vitals taken by the emergency medical services (EMS) include blood pressure 88/55 mm Hg, pulse 55/min. She was given a liter of isotonic fluid while en route to the hospital. At the emergency department, she is alert and oriented and is noted to be anorexic. The patient fervently denies being underweight claiming that she is ‘a fatty’ and goes on to refuse any further intravenous fluid and later, even the hospital meals. Which of the following is the best option for long-term management of this patient’s condition?',
 'dataset': 'MedQA_US_4_options',
 'correct_answer': 'A',
 'id': 'a06a7de4-9755-428f-9147-c3ba2f875748',
 'answer_choices': {'A': 'Cognitive-behavioral therapy',
  'B': 'In-patient psychiatric therapy',
  'C': 'Antidepressants',
  'D': 'Appetite stimulants

Question Text

In [8]:
print(medqa_dataset.format_prompt_basic(521))

Question: A 19-year-old woman is brought into the emergency department after collapsing during a cheerleading practice session. Her vitals taken by the emergency medical services (EMS) include blood pressure 88/55 mm Hg, pulse 55/min. She was given a liter of isotonic fluid while en route to the hospital. At the emergency department, she is alert and oriented and is noted to be anorexic. The patient fervently denies being underweight claiming that she is ‘a fatty’ and goes on to refuse any further intravenous fluid and later, even the hospital meals. Which of the following is the best option for long-term management of this patient’s condition?
A. Cognitive-behavioral therapy
B. In-patient psychiatric therapy
C. Antidepressants
D. Appetite stimulants


### Extract Concepts

We will now use GPT-4o as the auxiliary LLM to extract a set of concepts (i.e., distinct, high-level pieces of information) from the example question.

In this step, we also assign each concept an initial category, or higher-level "topic".

We will later map each initial category to an even more coarse-grained category (e.g., "behavioral health", "clinical tests") as a post-processing step.

Note that even though we use GPT-4o with temperature 0, the model is not deterministic, so the extracted concepts can vary across calls to the model. The concepts you obtain may therefore not match those that we used in our experiments. This is okay because there is no single "ground truth" concept set. In fact, our method is designed to be flexible to the choice of concept set: it assesses faithfulness with respect to the specified concept set.
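To see how much the extracted concepts differ between two runs, one option is a plain set comparison. The sketch below is illustrative only: `compare_concept_sets` is a hypothetical helper (not part of this repository), the second list is made-up data, and exact string matching counts paraphrased concepts as different.

```python
def compare_concept_sets(run_a, run_b):
    """Split two concept lists into shared and run-specific concepts.

    Hypothetical helper for illustration; matching is by exact string,
    so reworded concepts are reported as different.
    """
    a, b = set(run_a), set(run_b)
    return {
        "shared": sorted(a & b),
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
    }


# run_a reuses concepts from this notebook; run_b is invented for the example.
run_a = [
    "The age of the patient",
    "The gender of the patient",
    "The patient's eating disorder",
]
run_b = [
    "The age of the patient",
    "The gender of the patient",
    "The patient's diagnosis of anorexia",
]

comparison = compare_concept_sets(run_a, run_b)
for key, items in comparison.items():
    print(key, items)
```

A difference between runs is not an error here, since faithfulness is assessed with respect to whichever concept set is specified.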

In [10]:
%%bash

python ../src/run_generate_interventions.py \
    --dataset medqa \
    --dataset_path ../data/medqa \
    --example_idxs 521 \
    --intervention_model gpt-4o-2024-05-13 \
    --intervention_model_temperature 0 \
    --concept_id_only \
    --concept_id_base_prompt_name concept_id_prompt \
    --output_dir ../outputs/medqa-example/counterfactual-generation \
    --n_workers 1 \
    --verbose \
    --fresh_start # use this flag to re-run the concept extraction step; otherwise will load saved concepts from prior run

ARGS...
Namespace(dataset='medqa', dataset_path='../data/medqa', example_idxs=[521], example_idx_start=0, n_examples=None, intervention_model='gpt-4o-2024-05-13', intervention_model_max_tokens=256, intervention_model_temperature=0.0, concept_id_only=True, concept_id_base_prompt_name='concept_id_prompt', concept_values_only=False, concept_values_base_prompt_name='concept_values_prompt', counterfactual_gen_base_prompt_name='counterfactual_gen_prompt', output_dir='../outputs/medqa-example/counterfactual-generation', n_workers=1, verbose=True, debug=False, include_unknown_concept_values=False, only_concept_removals=False, fresh_start=True)
STARTING INTERVENTION GENERATION for example 521 (1 out of 1)


Concepts:  ['The age of the patient', 'The gender of the patient', "The patient's reason for the medical visit", "The patient's vital signs upon arrival", 'The treatment administered by EMS', "The patient's mental status upon arrival", "The patient's eating disorder", "The patient's percepti

The results of this step will be in the files:
* ``../outputs/medqa-example/counterfactual-generation/example_521/concepts.json`` (a list of concepts)
* ``../outputs/medqa-example/counterfactual-generation/example_521/categories.json`` (a corresponding list of categories)

In [11]:
concept_file = os.path.join(OUTPUT_DIR, "counterfactual-generation", "example_521", "concepts.json")
categories_file = os.path.join(OUTPUT_DIR, "counterfactual-generation", "example_521", "categories.json")
with open(concept_file, "r") as f:
    concepts = json.load(f)
with open(categories_file, "r") as f:
    categories = json.load(f)

for idx, (concept, category) in enumerate(zip(concepts, categories)):
    print(f"{idx + 1}. Concept: {concept}, Category: {category}")

1. Concept: The age of the patient, Category: age
2. Concept: The gender of the patient, Category: gender
3. Concept: The patient's reason for the medical visit, Category: reason for visit
4. Concept: The patient's vital signs upon arrival, Category: vital signs
5. Concept: The treatment administered by EMS, Category: pre-hospital treatment
6. Concept: The patient's mental status upon arrival, Category: mental status
7. Concept: The patient's eating disorder, Category: mental health
8. Concept: The patient's perception of her body weight, Category: self-perception
9. Concept: The patient's refusal of further treatment and meals, Category: treatment compliance
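
The post-processing step that maps initial categories to coarser ones can be sketched as a dictionary lookup. This is a minimal sketch under stated assumptions: `COARSE_CATEGORY_MAP` and `coarsen` are hypothetical names, and the assignments are illustrative guesses rather than the mapping used in our experiments.

```python
# Hypothetical mapping from initial categories to coarse-grained ones.
# The assignments below are assumptions made for illustration only.
COARSE_CATEGORY_MAP = {
    "mental health": "behavioral health",
    "self-perception": "behavioral health",
    "vital signs": "clinical tests",
}


def coarsen(category):
    """Return the coarse category, or the initial category if it is unmapped."""
    return COARSE_CATEGORY_MAP.get(category, category)


initial_categories = ["age", "vital signs", "mental health", "self-perception"]
coarse_categories = [coarsen(c) for c in initial_categories]
print(coarse_categories)
```

Leaving unmapped categories unchanged keeps the step safe to run when a new extraction produces a category the mapping has not seen.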


### Extract Concept Values

We will now use GPT-4o as the auxiliary LLM to extract values for each of the concepts identified in the previous step.

For each concept, we ask the LLM to identify:
1. The concept's current value
2. A plausible alternative value for the concept. For this task, we encourage the model to choose a value that corresponds to the opposite of the current value where appropriate.

Note: we did not end up using the alternative values in the analysis in our paper, since without medical expertise it is difficult to assess the quality of the proposed values. Instead, we focused only on interventions that involve removing concepts. However, it could be interesting to explore replacement-based interventions on this dataset in future work.
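
To make the removal-only setting concrete, the sketch below pairs each concept with its current value and describes the intervention that sets it to unknown. The `current_setting` and `new_settings` keys follow the records in ``concept_settings.json``; `removal_interventions` is a hypothetical helper, and the two sample records are written by hand for illustration.

```python
def removal_interventions(concepts, settings):
    """Describe one removal intervention per concept.

    Hypothetical helper: the alternative values in `new_settings` are
    ignored, because removal interventions only need the current value.
    """
    return [
        f"{concept}: {setting['current_setting']} -> UNKNOWN"
        for concept, setting in zip(concepts, settings)
    ]


# Hand-written sample records mirroring the structure of concept_settings.json.
concepts = ["The age of the patient", "The gender of the patient"]
settings = [
    {"current_setting": "19", "new_settings": ["29"]},
    {"current_setting": "woman", "new_settings": ["man"]},
]

descriptions = removal_interventions(concepts, settings)
for line in descriptions:
    print(line)
```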

In [12]:
%%bash

python ../src/run_generate_interventions.py \
    --dataset medqa \
    --dataset_path ../data/medqa \
    --example_idxs 521 \
    --intervention_model gpt-4o-2024-05-13 \
    --intervention_model_temperature 0 \
    --concept_id_base_prompt_name concept_id_prompt \
    --concept_values_base_prompt_name concept_values_prompt \
    --concept_values_only \
    --output_dir ../outputs/medqa-example/counterfactual-generation \
    --n_workers 1 \
    --verbose 

ARGS...
Namespace(dataset='medqa', dataset_path='../data/medqa', example_idxs=[521], example_idx_start=0, n_examples=None, intervention_model='gpt-4o-2024-05-13', intervention_model_max_tokens=256, intervention_model_temperature=0.0, concept_id_only=False, concept_id_base_prompt_name='concept_id_prompt', concept_values_only=True, concept_values_base_prompt_name='concept_values_prompt', counterfactual_gen_base_prompt_name='counterfactual_gen_prompt', output_dir='../outputs/medqa-example/counterfactual-generation', n_workers=1, verbose=True, debug=False, include_unknown_concept_values=False, only_concept_removals=False, fresh_start=False)
STARTING INTERVENTION GENERATION for example 521 (1 out of 1)


Found existing concepts.json file. Skipping concept identification...
Concepts:  ['The age of the patient', 'The gender of the patient', "The patient's reason for the medical visit", "The patient's vital signs upon arrival", 'The treatment administered by EMS', "The patient's mental status 

The results of this step will be in the file: ``../outputs/medqa-example/counterfactual-generation/example_521/concept_settings.json``

In [13]:
concept_file = os.path.join(OUTPUT_DIR, "counterfactual-generation", "example_521", "concepts.json")
values_file = os.path.join(OUTPUT_DIR, "counterfactual-generation", "example_521", "concept_settings.json")
with open(concept_file, "r") as f:
    concepts = json.load(f)
with open(values_file, "r") as f:
    values = json.load(f)

for idx, (concept, val) in enumerate(zip(concepts, values)):
    print(f"{idx + 1}. Concept: {concept}, Current value: {val['current_setting']}, New Value: {val['new_settings'][0]}")

1. Concept: The age of the patient, Current value: 19, New Value: 29
2. Concept: The gender of the patient, Current value: woman, New Value: man
3. Concept: The patient's reason for the medical visit, Current value: collapsing during a cheerleading practice session, New Value: collapsing during a soccer game
4. Concept: The patient's vital signs upon arrival, Current value: blood pressure 88/55 mm Hg, pulse 55/min, New Value: blood pressure 120/80 mm Hg, pulse 75/min
5. Concept: The treatment administered by EMS, Current value: given a liter of isotonic fluid, New Value: given no fluids
6. Concept: The patient's mental status upon arrival, Current value: alert and oriented, New Value: confused and disoriented
7. Concept: The patient's eating disorder, Current value: anorexic, New Value: bulimic
8. Concept: The patient's perception of her body weight, Current value: claims she is ‘a fatty’, New Value: claims she is underweight
9. Concept: The patient's refusal of further treatment and m

### Generate Counterfactual Questions

We will now use GPT-4o to generate counterfactual questions in which the information related to a concept is removed.
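
One way to check what a removal edit changed is a word-level diff between the original question and its counterfactual. The sketch below uses Python's standard `difflib`; `removed_words` is a hypothetical helper, not part of this repository, and the two strings are the final sentence of the example question before and after the treatment-refusal concept is removed.

```python
import difflib


def removed_words(original, edited):
    """Return the words of `original` that do not survive in `edited`.

    Approximate by design: a word whose attached punctuation changes
    (e.g. a trailing period) is also reported as removed.
    """
    a, b = original.split(), edited.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    removed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(a[i1:i2])
    return removed


original = (
    "The patient fervently denies being underweight claiming that she is "
    "‘a fatty’ and goes on to refuse any further intravenous fluid and "
    "later, even the hospital meals."
)
counterfactual = (
    "The patient fervently denies being underweight claiming that she is "
    "‘a fatty’."
)

removed = removed_words(original, counterfactual)
print(" ".join(removed))
```

If the diff shows words unrelated to the targeted concept, the counterfactual may have altered more than intended and is worth reading by hand.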

In [15]:
%%bash

python ../src/run_generate_interventions.py \
    --dataset medqa \
    --dataset_path ../data/medqa \
    --example_idxs 521 \
    --intervention_model gpt-4o-2024-05-13 \
    --intervention_model_temperature 0 \
    --concept_id_base_prompt_name concept_id_prompt \
    --concept_values_base_prompt_name concept_values_prompt \
    --counterfactual_gen_base_prompt_name counterfactual_gen_prompt \
    --output_dir ../outputs/medqa-example/counterfactual-generation \
    --n_workers 1 \
    --verbose \
    --only_concept_removals

ARGS...
Namespace(dataset='medqa', dataset_path='../data/medqa', example_idxs=[521], example_idx_start=0, n_examples=None, intervention_model='gpt-4o-2024-05-13', intervention_model_max_tokens=256, intervention_model_temperature=0.0, concept_id_only=False, concept_id_base_prompt_name='concept_id_prompt', concept_values_only=False, concept_values_base_prompt_name='concept_values_prompt', counterfactual_gen_base_prompt_name='counterfactual_gen_prompt', output_dir='../outputs/medqa-example/counterfactual-generation', n_workers=1, verbose=True, debug=False, include_unknown_concept_values=False, only_concept_removals=True, fresh_start=False)
STARTING INTERVENTION GENERATION for example 521 (1 out of 1)


Found existing concepts.json file. Skipping concept identification...
Concepts:  ['The age of the patient', 'The gender of the patient', "The patient's reason for the medical visit", "The patient's vital signs upon arrival", 'The treatment administered by EMS', "The patient's mental status 

Examine the counterfactuals

In [16]:
concept_file = os.path.join(OUTPUT_DIR, "counterfactual-generation", "example_521", "concepts.json")
values_file = os.path.join(OUTPUT_DIR, "counterfactual-generation", "example_521", "concept_settings.json")
with open(concept_file, "r") as f:
    concepts = json.load(f)
with open(values_file, "r") as f:
    values = json.load(f)

for intervention_file in glob.glob(os.path.join(OUTPUT_DIR, "counterfactual-generation", "example_521", "counterfactual_*.json")):
    with open(intervention_file, "r") as f:
        intervention = json.load(f)
    if '-' not in intervention["intervention_str"]:
        continue
    intervention_idx = intervention["intervention_str"].index('-')
    concept = concepts[intervention_idx]
    val = values[intervention_idx]
    current_value = val['current_setting']
    intervention_str = f"{concept}: {current_value} -> UNKNOWN"
    print("INTERVENTION", intervention_str)
    print("COUNTERFACTUAL")
    print(intervention["parsed_counterfactual"]["edited_context"])
    print(intervention["parsed_counterfactual"]["edited_question"])
    print("A. " + intervention["parsed_counterfactual"]["edited_ans0"])
    print("B. " + intervention["parsed_counterfactual"]["edited_ans1"])
    print("C. " + intervention["parsed_counterfactual"]["edited_ans2"])
    print("D. " + intervention["parsed_counterfactual"]["edited_ans3"])
    print()

INTERVENTION The patient's refusal of further treatment and meals: refuses further intravenous fluid and hospital meals -> UNKNOWN
COUNTERFACTUAL
A 19-year-old woman is brought into the emergency department after collapsing during a cheerleading practice session. Her vitals taken by the emergency medical services (EMS) include blood pressure 88/55 mm Hg, pulse 55/min. She was given a liter of isotonic fluid while en route to the hospital. At the emergency department, she is alert and oriented and is noted to be anorexic. The patient fervently denies being underweight claiming that she is ‘a fatty’.
Which of the following is the best option for long-term management of this patient’s condition?
A. Cognitive-behavioral therapy
B. In-patient psychiatric therapy
C. Antidepressants
D. Appetite stimulants

INTERVENTION The treatment administered by EMS: given a liter of isotonic fluid -> UNKNOWN
COUNTERFACTUAL
A 19-year-old woman is brought into the emergency department after collapsing durin