## MedMCQA Dataset Sample (Kaggle)

### Importing Dataset

In [1]:
# %pip install kagglehub
import kagglehub
import os
import pandas as pd
import random

path = kagglehub.dataset_download("thedevastator/medmcqa-medical-mcq-dataset")
dataset_path = os.path.join(path, 'train.csv')
df = pd.read_csv(dataset_path)
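
The path join above assumes `train.csv` sits at the top level of the downloaded directory. A minimal sketch for checking that assumption; `find_csv_files` is a hypothetical helper, not part of `kagglehub`:

```python
import os

def find_csv_files(root):
    """Return the paths of all .csv files under root, relative to root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".csv"):
                full_path = os.path.join(dirpath, name)
                found.append(os.path.relpath(full_path, root))
    return sorted(found)
```

In the notebook this would be called as `find_csv_files(path)`, with the `path` returned by `kagglehub.dataset_download`.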




### Cleaning Dataset

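Exclude questions with multiple correct answers (rows where `choice_type` is `'multi'`), then drop the ID, `choice_type`, `exp`, and subject name columns. Only the question, the four options, and the correct-option index (`cop`) are kept.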

In [2]:
#from the dataset, remove all rows where choice_type is 'multi' 
df = df[df['choice_type'] != 'multi']

df = df[['question', 'opa', 'opb', 'opc', 'opd', 'cop']]    #remove the id, choice_type, explanation, and subject name columns -> not useful for the model


#size of the dataset
print(df.shape)
df

(120765, 6)


| | question | opa | opb | opc | opd | cop |
|---|---|---|---|---|---|---|
| 0 | Chronic urethral obstruction due to benign pri... | Hyperplasia | Hyperophy | Atrophy | Dyplasia | 2 |
| 1 | Which vitamin is supplied from only animal sou... | Vitamin C | Vitamin B7 | Vitamin B12 | Vitamin D | 2 |
| 4 | Growth hormone has its effect on growth through? | Directly | IG1-1 | Thyroxine | Intranuclear receptors | 1 |
| 5 | Scrub typhus is transmitted by: September 2004 | Louse | Tick | Mite | Milk | 2 |
| 7 | Per rectum examination is not a useful test fo... | Anal fissure | Hemorrhoid | Pilonidal sinus | Rectal ulcer | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 182814 | Pseudopolyposis is seen in - | Crohn's disease | Ulcerative colitis | Juvenile polyposis | Enteric fever | 1 |
| 182815 | Organism that causes emphysematous cholecystit... | Salmonella typhi | Cytomegalovirus | Clostridium perfringens | Bacteroides | 2 |
| 182816 | Which of the following conditions may necessit... | Cellulitis | Ludwig's angina | Cavernous sinus thrombosis | Maxillary sinusitis | 1 |
| 182817 | Most common site for extra mammary Paget&;s di... | Vulva | Vagina | Penis | Anus | 0 |
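
The row labels in the output above have gaps (0, 1, 4, 5, 7, ...) because filtering keeps the original index. A minimal sketch of the same filter and column selection on a toy frame; the values are made up, and only the column names come from the cell above:

```python
import pandas as pd

# Toy rows with the same column names as the notebook; the values are invented.
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "question": ["Q1?", "Q2?", "Q3?"],
    "opa": ["w1", "w2", "w3"],
    "opb": ["x1", "x2", "x3"],
    "opc": ["y1", "y2", "y3"],
    "opd": ["z1", "z2", "z3"],
    "cop": [0, 3, 1],
    "choice_type": ["single", "multi", "single"],
})

keep = ["question", "opa", "opb", "opc", "opd", "cop"]

# Drop multi-answer rows and keep only the modelling columns in one step.
filtered = toy.loc[toy["choice_type"] != "multi", keep]
print(filtered.index.tolist())  # [0, 2]

# Optional: renumber the rows from 0 so the labels are contiguous again.
clean = filtered.reset_index(drop=True)
```

Resetting the index is not required for the later cells, since they iterate over rows and ignore the labels.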


### Polishing and Saving Dataset

Build one prompt per question and record the corresponding answer letter for each row. The first 5,000 rows are saved as a sample dataset.

In [None]:
prompts = []
answers = []
options = ['A', 'B', 'C', 'D']
count = 0

for i, each in df.iterrows():

    if count == 5000:
        break

    prompt = f"""Question: {each['question']}

A) {each['opa']}
B) {each['opb']}
C) {each['opc']}
D) {each['opd']}

Choose the best option."""
    
    print(prompt)

    ans = options[each['cop']]  #convert to A, B, C, D

    prompts.append(prompt)
    answers.append(ans)
    count+=1


new_df = pd.DataFrame({'prompt': prompts, 'answer': answers})

new_df.to_csv("MedMCQA_5000_sample.csv", index=False)
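
The prompt layout and the index-to-letter mapping used in the loop above can be isolated into small functions, which makes them easy to check on a single row. This is a sketch only; `build_prompt` and `answer_letter` are hypothetical names, and the example row is taken from the table output earlier in this notebook:

```python
OPTION_LETTERS = ["A", "B", "C", "D"]

def build_prompt(question, choices):
    """Format a question and its four choices in the notebook's prompt layout."""
    lines = [f"Question: {question}", ""]
    lines += [f"{letter}) {choice}" for letter, choice in zip(OPTION_LETTERS, choices)]
    lines += ["", "Choose the best option."]
    return "\n".join(lines)

def answer_letter(cop):
    """Map the 0-based correct-option index (cop) to its letter."""
    return OPTION_LETTERS[int(cop)]

example_prompt = build_prompt(
    "Growth hormone has its effect on growth through?",
    ["Directly", "IG1-1", "Thyroxine", "Intranuclear receptors"],
)
example_answer = answer_letter(1)  # cop is 1 for this row, so the answer is "B"
```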

### Introducing Positional Variation
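
When the options are shuffled, the answer letter has to follow the correct option to its new position. Otherwise the saved answer points at whatever text happens to land in the old slot. A minimal sketch of that remapping, assuming `cop` is the 0-based index of the correct option before shuffling; `shuffle_choices` is a hypothetical helper:

```python
import random

OPTION_LETTERS = ["A", "B", "C", "D"]

def shuffle_choices(choices, cop, rng):
    """Shuffle choices and return (shuffled_choices, new_answer_letter)."""
    # order[k] is the original index of the choice shown at position k.
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[j] for j in order]
    # The correct choice now sits wherever its original index landed.
    new_letter = OPTION_LETTERS[order.index(cop)]
    return shuffled, new_letter

rng = random.Random(0)  # seeded only to make this sketch reproducible
choices = ["Directly", "IG1-1", "Thyroxine", "Intranuclear receptors"]
shuffled, letter = shuffle_choices(choices, cop=1, rng=rng)

# The new letter must still point at the originally correct text.
assert shuffled[OPTION_LETTERS.index(letter)] == choices[1]
```

Shuffling indices rather than the option texts also keeps the mapping unambiguous if two options happen to share the same text.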

In [None]:
prompts = []
answers = []
options = ['A', 'B', 'C', 'D']
count = 0

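for i, each in df.iterrows():

    if count == 5000:
        break

    #shuffle the option positions (not the texts) so the correct answer can be tracked
    choices = [each['opa'], each['opb'], each['opc'], each['opd']]
    order = [0, 1, 2, 3]
    random.shuffle(order)
    rand_opt = [choices[j] for j in order]

    prompt = f"""Question: {each['question']}

A) {rand_opt[0]}
B) {rand_opt[1]}
C) {rand_opt[2]}
D) {rand_opt[3]}

Choose the best option."""
    
    print(prompt)

    ans = options[order.index(each['cop'])]  #letter of the correct option after shuffling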

    prompts.append(prompt)
    answers.append(ans)
    count+=1


new_df = pd.DataFrame({'prompt': prompts, 'answer': answers})

new_df.to_csv("SHUFFLED_MedMCQA_5000_sample.csv", index=False)

Question: Chronic urethral obstruction due to benign prismatic hyperplasia can lead to the following change in kidney parenchyma

A) Atrophy
B) Dyplasia
C) Hyperophy
D) Hyperplasia

Choose the best option.
Question: Which vitamin is supplied from only animal source:

A) Vitamin C
B) Vitamin B12
C) Vitamin D
D) Vitamin B7

Choose the best option.
Question: Growth hormone has its effect on growth through?

A) Intranuclear receptors
B) IG1-1
C) Thyroxine
D) Directly

Choose the best option.
Question: Scrub typhus is transmitted by: September 2004

A) Louse
B) Mite
C) Tick
D) Milk

Choose the best option.
Question: Per rectum examination is not a useful test for diagnosis of

A) Anal fissure
B) Pilonidal sinus
C) Hemorrhoid
D) Rectal ulcer

Choose the best option.
Question: Characteristics of Remifentanyl – a) Metabolised by plasma esteraseb) Short half lifec) More potent than Alfentanyld) Dose reduced in hepatic and renal diseasee) Duration of action more than Alfentanyl

A) ab
B) bcd
C) 