# Prepare training data for Phi2 fine-tuning

In this notebook, we prepare high-quality multiple-choice questions (MCQs) for fine-tuning Phi2, a small LLM.
The training set consists of MCQs with an average evaluation score of at least 0.7 and below 0.85. The test set consists of MCQs with an average evaluation score of 0.85 or above.
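
The band boundaries can be made explicit with a small sketch. It assumes a toy DataFrame with only an `average` column; the helper name `assign_split` and the labels `rejected`, `train`, and `test` are illustrative and not part of the project code. Bins are closed on the left, so a score of exactly 0.85 lands in the test set, which matches the filters applied later in this notebook.

```python
import pandas as pd


def assign_split(scores: pd.Series) -> pd.Series:
    """Label each score as 'rejected', 'train', or 'test' (illustrative names)."""
    # right=False closes every bin on the left: [0, 0.7), [0.7, 0.85), [0.85, inf)
    return pd.cut(
        scores,
        bins=[0.0, 0.7, 0.85, float("inf")],
        labels=["rejected", "train", "test"],
        right=False,
    )


# Toy scores chosen to sit on and around the two thresholds
toy = pd.DataFrame({"average": [0.696, 0.70, 0.818, 0.85, 0.911]})
toy["split"] = assign_split(toy["average"])
print(toy)
```

Using a single `pd.cut` call keeps the three bands mutually exclusive by construction, which is easier to audit than separate boolean masks.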

## Imports

In [1]:
import sys
from pathlib import Path

# Add the root directory to sys.path
root_path = Path().resolve().parent
sys.path.append(str(root_path))

%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
from src.training_formatter import TrainingFormatter

## Prepare training data

In [3]:
EVAL_CSV = "../data/evaluation_scores/mcq_eval_scores_1000.csv"
# All MCQs, regardless of score
OUT_JSONL_1 = "../data/training_jsonl/mcqs_all_1000.jsonl"
# Training set: 0.7 <= average < 0.85
OUT_JSONL_2 = "../data/training_jsonl/mcqs_filtered_07_1000.jsonl"
# Test set: average >= 0.85
OUT_JSONL_3 = "../data/training_jsonl/mcqs_filtered_085_1000.jsonl"

In [4]:
df = pd.read_csv(EVAL_CSV)
df

| Unnamed: 0 | focus | qtypes | summary | objective | question | options | answer | explanation | relevance_to_summary | alignement_with_objective | plausibility_of_distractors | plausibility_qa | medical_validity | average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Diphtheria | ['information'] | Diphtheria is a serious bacterial infection ca... | Describe the importance of timely vaccination ... | A patient presents with symptoms of diphtheria... | {'A': "The patient's immunity is likely to pro... | D | Given the patient's vaccination history and th... | 0.708517 | 0.736968 | 0.833 | 1.0 | 1.0 | 0.856 |
| 1 | trichothiodystrophy | ['frequency', 'genetic changes', 'information'... | Trichothiodystrophy is a rare, inherited condi... | Analyze the inheritance pattern of a rare gene... | A patient is diagnosed with Trichothiodystroph... | {'A': 'Autosomal dominant pattern, where one m... | C | Trichothiodystrophy is inherited in an autosom... | 0.792817 | 0.544787 | 0.750 | 1.0 | 1.0 | 0.818 |
| 2 | Nemaline myopathy | ['information'] | Nemaline myopathy is a genetic disorder that p... | Identify the primary inheritance pattern of Ne... | What is the typical genetic inheritance patter... | {'A': 'Autosomal Dominant', 'B': 'X-linked Rec... | D | Nemaline Myopathy is primarily inherited in an... | 0.801020 | 0.921066 | 0.833 | 1.0 | 1.0 | 0.911 |
| 3 | Gracile bone dysplasia | ['symptoms'] | Gracile bone dysplasia is a rare genetic disor... | Identify the primary cause of respiratory comp... | What is the primary reason why individuals wit... | {'A': 'Cardiac anomalies leading to pulmonary ... | C | Individuals with Gracile bone dysplasia are at... | 0.683419 | 0.878695 | 0.917 | 1.0 | 1.0 | 0.896 |
| 4 | Gastrocutaneous syndrome | ['symptoms'] | Gastrocutaneous syndrome is a complex conditio... | Identify a characteristic symptom associated w... | Which of the following skin lesions is commonl... | {'A': 'Pityriasis alba-like macules', 'B': 'Me... | B | Melanocytic nevi are a characteristic skin les... | 0.678607 | 0.729024 | 1.000 | 1.0 | 1.0 | 0.882 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Cataract | ['susceptibility', 'symptoms', 'treatment', 'i... | A cataract is a clouding of the lens in the ey... | Analyze the relationship between environmental... | A patient with a history of prolonged exposure... | {'A': 'Macular degeneration', 'B': 'Cataract',... | B | Prolonged exposure to ultraviolet sunlight has... | 0.474832 | 0.507629 | 1.000 | 1.0 | 0.5 | 0.696 |
| 996 | Pericarditis | ['information', 'causes', 'susceptibility', 's... | Pericarditis is a condition characterized by i... | Evaluate the appropriate initial treatment app... | A 35-year-old man presents with sudden onset o... | {'A': 'Order an emergency echocardiogram to ev... | B | Colchicine and prednisone are anti-inflammator... | 0.579525 | 0.427334 | 0.833 | 1.0 | 1.0 | 0.768 |
| 997 | Ectopic Pregnancy | ['information'] | An ectopic pregnancy occurs when a fertilized ... | Identify the typical location where an ectopic... | Where does an ectopic pregnancy most commonly ... | {'A': 'Ovaries', 'B': 'Fallopian tubes', 'C': ... | B | The correct answer is B, as fallopian tubes ar... | 0.626850 | 0.858277 | 1.000 | 1.0 | 0.5 | 0.797 |
| 998 | 15q24 microdeletion | ['frequency', 'genetic changes', 'information'... | 15q24 microdeletion is a rare chromosomal diso... | Evaluate the relationship between a genetic co... | A patient diagnosed with 15q24 microdeletion p... | {'A': 'The deletion of 1-6 million DNA base pa... | B | Individuals with 15q24 microdeletion are at ri... | 0.737288 | 0.625763 | 1.000 | 1.0 | 1.0 | 0.873 |


In [5]:
formatter = TrainingFormatter()

In [6]:
data_all = formatter.format_dataframe(df)
formatter.save_jsonl(data_all, OUT_JSONL_1)

Saved 1000 formatted mcq in ../data/training_jsonl/mcqs_all_1000.jsonl
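
`TrainingFormatter` is imported from `src.training_formatter`, and its internals are not shown in this notebook. As a rough idea of what a formatting step of this kind could look like, here is a hypothetical sketch. The function names `format_row` and `write_jsonl`, the `prompt`/`completion` keys, and the prompt layout are all assumptions, not the project's actual format. It also assumes the `options` column is a stringified dict, as the preview of `df` suggests, and it uses a made-up toy question rather than real data.

```python
import ast
import json
from pathlib import Path


def format_row(row: dict) -> dict:
    """Turn one MCQ row into a prompt/completion pair (hypothetical layout)."""
    options = row["options"]
    # After a CSV round trip the options are a string such as "{'A': '...', 'B': '...'}"
    if isinstance(options, str):
        options = ast.literal_eval(options)
    option_lines = "\n".join(f"{key}. {text}" for key, text in sorted(options.items()))
    prompt = f"Question: {row['question']}\n{option_lines}\nAnswer:"
    completion = f" {row['answer']}\nExplanation: {row['explanation']}"
    return {"prompt": prompt, "completion": completion}


def write_jsonl(records: list, path: str) -> None:
    """Write one JSON object per line, creating the parent folder if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# Toy row, invented for illustration only
toy_row = {
    "question": "Which letter comes first in the alphabet?",
    "options": "{'A': 'A', 'B': 'Z'}",
    "answer": "A",
    "explanation": "The alphabet starts with A.",
}
toy_record = format_row(toy_row)
print(toy_record["prompt"])
```

With a DataFrame, the same helper could be applied as `[format_row(r) for r in df.to_dict("records")]`, assuming the column names match the keys used above.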


In [7]:
# Training set: keep MCQs with 0.7 <= average < 0.85
df_07 = df[(df["average"] >= 0.7) & (df["average"] < 0.85)]
data_07 = formatter.format_dataframe(df_07)
formatter.save_jsonl(data_07, OUT_JSONL_2)

Saved 635 formatted mcq in ../data/training_jsonl/mcqs_filtered_07_1000.jsonl


(635, 14)

In [8]:
# Test set: keep MCQs with average >= 0.85
df_085 = df[df["average"] >= 0.85]
data_085 = formatter.format_dataframe(df_085)
formatter.save_jsonl(data_085, OUT_JSONL_3)

Saved 294 formatted mcq in ../data/training_jsonl/mcqs_filtered_085_1000.jsonl


(294, 14)
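
A quick read-back can confirm that each file holds the number of records reported above. The sketch below assumes one JSON object per line; `count_jsonl_records` is an illustrative helper, not part of `TrainingFormatter`. Going by the counts printed in this notebook, 635 training and 294 test MCQs add up to 929, so the remaining 71 of the 1000 MCQs would fall below the 0.7 threshold, provided every row has a numeric `average`.

```python
import json
from pathlib import Path


def count_jsonl_records(path: str) -> int:
    """Count the non-empty lines of a JSONL file, checking that each parses as JSON."""
    count = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from err
            count += 1
    return count
```

For example, `count_jsonl_records(OUT_JSONL_2)` would be expected to return 635 if the file matches the log line above.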