In [None]:
import pandas as pd
import json


We use the MedQA-Open dataset from the paper "Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering".  
The dataset is available in the ancillary files section of:  
https://arxiv.org/abs/2403.04890  

The authors of the paper used the USMLE-MedQA dataset (Jin et al., 2021), a medical exam dataset that consists of questions   
sourced from professional medical board exams in the USA.  


The MedQA dataset (Zhang et al., 2018) is a publicly available collection of complex medical questions  
with multiple choices based on the United States medical license exams. To emulate real-world medical scenarios,   
the authors converted these multiple-choice questions into open-ended questions by (1) removing the multiple-choice options and   
(2) rephrasing each question to be open-ended using an LLM, creating MedQA-Open.   

The dataset contains more than 10k questions from different medical fields, including psychiatry.   
The following script uses an LLM to classify each question as psychiatry-related or not psychiatry-related.  
It is a screening step that uses a fast, cost-efficient LLM to filter out non-psychiatry questions   
before more expensive LLMs are used for detailed analysis.  

In [11]:
ORIGINAL_DATASET_PATH = "MedQA_open_dataset.xlsx"
original_df = pd.read_excel(ORIGINAL_DATASET_PATH)
print(len(original_df), "rows in the original dataset")

10178 rows in the original dataset


We used an LLM (Gemini 2.0 Flash) with a rather simple prompt to screen for psychiatry-related questions. Google was chosen as the LLM provider because its free tier offers generous limits for experimenting with LLMs:

In [17]:
question = ""
answer = "" # Just placeholders

prompt = f"""
    Question: {question}
    Answer: {answer}
    Is this question related to psychiatry? Respond with only 'psychiatry' or 'non-psychiatry'."""

If either the question or the answer column was empty, we classified the row as invalid and did not send the empty query to the LLM.
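
The screening rule can be sketched roughly as follows. This is a minimal illustration, not the script that produced the results: `build_prompt`, `is_empty`, `screen_row`, and `ask_llm` are hypothetical names, and `ask_llm` stands in for whatever client call sends a prompt to the model and returns its text reply. Normalizing the reply by stripping whitespace and lowercasing it is also an assumption.

```python
import math


def build_prompt(question: str, answer: str) -> str:
    # Same wording as the screening prompt shown above.
    return f"""
    Question: {question}
    Answer: {answer}
    Is this question related to psychiatry? Respond with only 'psychiatry' or 'non-psychiatry'."""


def is_empty(value) -> bool:
    # Empty spreadsheet cells are read by pandas as NaN; also treat None and
    # whitespace-only strings as empty.
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return not str(value).strip()


def screen_row(question, answer, ask_llm) -> str:
    # Rows with a missing question or answer are labeled without calling the model.
    if is_empty(question) or is_empty(answer):
        return "invalid"
    reply = ask_llm(build_prompt(question, answer))
    # Assumed normalization of the model's reply.
    return reply.strip().lower()
```

Passing the model call in as a function keeps the rule easy to try with a stub, for example `screen_row("Q", "A", lambda prompt: "psychiatry")`, before spending any API quota.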

In [14]:
SCREENED_QUESTIONS_PATH = "MedQA_open_dataset_classified.xlsx"
screened_df = pd.read_excel(SCREENED_QUESTIONS_PATH)
print(len(screened_df), "rows in the screened dataset")
screened_df["classification"].value_counts()

10178 rows in the screened dataset


classification
non-psychiatry    8780
psychiatry         886
invalid            512
Name: count, dtype: int64

After the initial screening, only 886 questions were considered psychiatry-related. However, the prompt was weak, and the model was selected for speed rather than precision. As the next step, we verified that each question actually asks about psychiatry rather than merely mentioning psychiatric concepts in a case vignette. To do so, we prompted a newer model, Gemini 2.5 Flash, with the following prompt:

In [19]:
prompt = f"""
    Act as an experienced clinical psychiatrist and medical educator. 

    Question: {question}

    Evaluate provided question and reasoning based on the following criteria:

    CLINICAL PSYCHIATRY FOCUS: Is this question primarily testing knowledge of clinical psychiatry, mental health disorders, psychiatric treatments, or psychological concepts? 
    - Questions that merely mention mental health terms but primarily test other medical knowledge (like diabetes, cardiology, etc.) should be excluded
    - Questions should focus on psychiatric diagnosis, treatment, symptoms, or mental health concepts as the main learning objective

    Provide your response in the following JSON format:
    {{
        "classification": "INCLUDE" or "EXCLUDE",
        "reasoning": "Brief explanation of why you included or excluded this question"
    }}

    Classification options:
    - "INCLUDE" if the question is primarily focused on clinical psychiatry AND the reasoning is useful
    - "EXCLUDE" if the first criteria fail
    """
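
Because this prompt asks for a JSON object, the model's reply has to be parsed before it can be stored. The sketch below shows one way this could be done; it is not the code used for the results in this notebook. `parse_evaluation` is a hypothetical helper, the output key `psychiatry_classification` is borrowed from the column name printed later in this notebook, and mapping any unparseable or unexpected reply to `re-do` is an assumption about how that label might arise.

```python
import json
import re

REDO = {"psychiatry_classification": "re-do", "reasoning": ""}


def parse_evaluation(reply: str) -> dict:
    # Take the outermost {...} span so that any text the model adds
    # around the JSON object is ignored.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return dict(REDO)
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(REDO)
    # Normalize "INCLUDE"/"EXCLUDE" to lowercase labels.
    label = str(parsed.get("classification", "")).strip().lower()
    if label not in {"include", "exclude"}:
        return dict(REDO)
    return {
        "psychiatry_classification": label,
        "reasoning": str(parsed.get("reasoning", "")),
    }
```

Returning an explicit `re-do` record instead of raising keeps a long batch run going and leaves the failed rows easy to find afterwards.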

In [28]:
with open("MedQA_open_dataset_classified_psychiatry_evaluation.json", 'r', encoding='utf-8') as file:
    data = json.load(file)

verified_df = pd.DataFrame(data)
print("Classification value counts:")
print(verified_df["psychiatry_classification"].value_counts())


Classification value counts:
psychiatry_classification
include    737
exclude    147
re-do        2
Name: count, dtype: int64


Overall, 737 of the 886 screened questions were verified as psychiatry-related; 147 were excluded and 2 were marked re-do.
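
Selecting the verified questions for later analysis could look like the sketch below. It runs on a toy frame rather than the real `verified_df`; the `question` column and the names `toy_df`, `included_df`, and `needs_rerun_df` are assumptions made for illustration, and only the `psychiatry_classification` column and its labels come from the output above.

```python
import pandas as pd

# Toy stand-in for verified_df; the real frame is loaded from the JSON file.
toy_df = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q4"],
    "psychiatry_classification": ["include", "exclude", "include", "re-do"],
})

# Keep only the rows the verification step marked as psychiatry-focused.
included_df = toy_df[toy_df["psychiatry_classification"] == "include"].reset_index(drop=True)

# Rows marked re-do can be collected separately and sent through verification again.
needs_rerun_df = toy_df[toy_df["psychiatry_classification"] == "re-do"]

print(len(included_df), "included;", len(needs_rerun_df), "to re-run")
```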