# Finetune llama 3.2 on medical dataset with Hugging Face and peft for fine-tuning

# Mia Hallage

In this notebook, we will train a llama 3.2 model on a medical dataset with Hugging Face and peft for fine-tuning. We will follow all the typical steps of a training pipeline, from loading the model and tokenizer, to training, evaluating and saving the model. Then we will test the model with a simple inference function to see if it's working as expected 🤗

If you are not familiar with the peft library, you can read more about it here

In [1]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
import json


Check your GPU ressources the code below is using the MPS backend for macs silicon. If you have a GPU, re write this code to use the CUDA backend or run this notebook on colab

In [3]:
torch.cuda.is_available()


True

Load the meta-llama/Llama-3.2-1B-Instruct model from hugging face hub and pass it to the AutoTokenizer and AutoModelForCausalLM classes below.

In [5]:
model_name = "meta-llama/Llama-3.2-1B-Instruct"

In [13]:
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model_name = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16,
    device_map="auto",
)

print(f"✅ Model loaded: {model_name}")

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

✅ Model loaded: meta-llama/Llama-3.2-1B-Instruct


Initialize LoRA configuration with the following parameters:

- r=16: The rank of the LoRA matrices
- lora_alpha=32: The scaling factor for the LoRA matrices
- target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]: The modules to apply LoRA to read more about it here
- lora_dropout=0.05: The dropout rate for the LoRA matrices
- bias="none": The bias for the LoRA matrices
- task_type=TaskType.CAUSAL_LM: The type of task to train for (only task supported yet)

In [15]:
print("\n⚙️ Configuring LoRA...")
lora_config = LoraConfig(

)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()



⚙️ Configuring LoRA...
trainable params: 851,968 || all params: 1,236,666,368 || trainable%: 0.0689


load and format the dataset with the formating function below and use only 500 examples for training.

In [16]:
def format_prompt(example):
    """Format with CORRECT field names"""
    # Use the ACTUAL field names from the dataset
    question = example.get('Open-ended Verifiable Question', '')
    answer = example.get('Ground-True Answer', '')

    # Validate we have real content
    if not question or len(question) < 10:
        return None

    if not answer or len(answer) < 2:
        return None

    # Format with Llama 3 template
    # Note: This dataset doesn't have step-by-step reasoning,
    # we'll create a simpler format
    text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The answer is: {answer}<|eot_id|>"""

    return {"text": text}

this function will format the dataset into our desired prompt for the model 🤖

In [19]:

print("\n📊 Loading dataset...")
dataset = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem")

print("\n Formatting dataset...")

train_dataset = dataset["train"].map(format_prompt)
train_dataset = train_dataset.filter(lambda x: x is not None)
train_dataset = train_dataset.select(range(500))

print(f"✅ Training on {len(train_dataset)} examples")


📊 Loading dataset...

 Formatting dataset...
✅ Training on 500 examples


In [20]:
train_dataset

Dataset({
    features: ['Open-ended Verifiable Question', 'Ground-True Answer', 'text'],
    num_rows: 500
})

tokenize the train dataset with the tokenizer and the tokenize_function below.

In [21]:
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,  # Shorter for Mac memory
        return_tensors="pt"
    )
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized

Then apply the tokenize_function to the train dataset with the .map method with the following parameters:

- tokenize_function: out function defined above to apply to the dataset
- batched=True
- remove_columns=train_dataset.column_names

This will tokenize the train dataset and return a new dataset with the tokenized text.

In [24]:
# Ensure tokenizer has a pad token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("Tokenizer padding token set to:", tokenizer.pad_token)


Tokenizer padding token set to: <|eot_id|>


In [25]:

print("🔄 Tokenizing...")

tokenized_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names
)

print("Tokenization complete!")




🔄 Tokenizing...


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Tokenization complete!


Set up the training arguments with the TrainingArguments class with the following parameters:

- output_dir="./results": The directory to save the results
- num_train_epochs=3: The number of training epochs
- per_device_train_batch_size=1: The batch size for the training
- gradient_accumulation_steps=4: The number of gradient accumulation steps
- learning_rate=2e-4: The learning rate
- warmup_steps=10: The number of warmup steps
- logging_steps=10: The number of logging steps
- save_steps=100: The number of steps to save the model
- save_total_limit=2: The number of total models to save
- fp16=False: Whether to use fp16 training
- logging_dir="./logs": The directory to save the logs
- report_to="none": The report to save the logs
- use_mps_device=True: Whether to use mps device ⚠️ only if you are on macos silicon else use cuda

In [28]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)


In [29]:
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()


trainable params: 851,968 || all params: 1,236,666,368 || trainable%: 0.0689




In [33]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,

    learning_rate=2e-4,
    warmup_steps=10,

    logging_steps=10,
    save_steps=100,
    save_total_limit=2,

    fp16=True,
    logging_dir="./logs",
    report_to="none",

    remove_unused_columns=False,
)


In [34]:
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()


The model is already on multiple devices. Skipping the move to device specified in `args`.


Step,Training Loss
10,3.1471
20,2.9249
30,2.4373
40,2.239
50,2.2278
60,2.1294
70,2.146
80,2.1836
90,1.9691
100,2.1056


TrainOutput(global_step=375, training_loss=2.0862343521118163, metrics={'train_runtime': 272.9245, 'train_samples_per_second': 5.496, 'train_steps_per_second': 1.374, 'total_flos': 4488183152640000.0, 'train_loss': 2.0862343521118163, 'epoch': 3.0})

 Use a DataCollatorForLanguageModeling class to collate the data for the training.

Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset more about it here

In [35]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

What is the purpose of the mlm parameter ?

The mlm parameter controls whether the model trains with Masked Language Modeling (mlm=True) or causal next-token prediction (mlm=False).


For LLaMA and other causal LLMs, we set mlm=False because they do not mask tokens and instead learn to predict the next token in the sequence.


Save the model and the tokenizer with the .save_pretrained method.

Now let's test the model with a simple inference function to see if it's working as expected on unseen question-answering data 🤖

Before starting this exercise, ensure you have:

- Completed the fine-tuning of your model on the first 1000 examples of the medical dataset
- Your fine-tuned model loaded and ready for inference
- The medical-o1-verifiable-problem dataset from FreedomIntelligence
- Required libraries installed: transformers, torch, datasets, random, json

## Step 1: Load and Split the Dataset

Load the complete dataset
Define your train/test split:
Training set: Examples 0-999 (used during our fine-tuning)
Test set: Examples 1000+ (held out for our evaluation purposes)
Verify the total dataset size and confirm the split boundaries

In [38]:
import random
import torch
from datasets import load_dataset
import time
import re
import numpy as np

print("\n Loading dataset...")
dataset = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem")

# Extract full train split from dataset
full_data = dataset["train"]

# TRAIN = 0–999
train_data_eval = full_data.select(range(0, 1000))

# TEST = 1000+
test_data = full_data.select(range(1000, len(full_data)))

print(f" Total dataset size: {len(full_data)}")
print(f" Train split: {len(train_data_eval)}")
print(f" Test split:  {len(test_data)}")



 Loading dataset...
 Total dataset size: 40644
 Train split: 1000
 Test split:  39644


## Step 2: Sample Test Examples

Set a random seed (e.g., 42) for reproducibility
Randomly select 20 examples from the test set
Record the indices of selected examples for reference

In [41]:
random.seed(42)

selected_indices = random.sample(range(len(test_data)), 20)

print(f"Randomly selected {len(selected_indices)} test examples")
print(f"Indices: {selected_indices[:5]}... (showing first 5)")


Randomly selected 20 test examples
Indices: [7296, 1639, 18024, 16049, 14628]... (showing first 5)


## Step 3: Create the Inference Function

Implement a get_prediction() function that:

1- Formats the question using the proper chat template (with user/assistant headers)

2- Tokenizes the input and moves it to the appropriate device

3- Generates a response using appropriate parameters:
- max_new_tokens=50 (adjust as needed)
- temperature=0.3 (lower for more deterministic answers)
- top_p=0.9

4- Extracts and returns only the assistant's response (removing special tokens)

In [42]:
def get_prediction(question, max_tokens=50):
    """
    Generates answer using the fine-tuned LLaMA model
    """

    # Format according to LLaMA chat template
    prompt = f"""<begin_of_text>|<start_header_id|user|end_header_id>
{question}
<eot_id>|<start_header_id|assistant|end_header_id>
"""

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=512
    ).to("cuda")

    # Generate prediction
    with torch.no_grad():
        outputs = peft_model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.3,
            top_p=0.9
        )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract assistant answer (everything after "assistant")
    if "assistant" in decoded:
        decoded = decoded.split("assistant")[-1].strip()

    return decoded


## Step 4: Implement Accuracy Checking

Create a check_accuracy() function that:

1- Compares the model's prediction against the ground truth answer

2- Implements two types of matching:
- Exact match: Ground truth appears verbatim in prediction
- Partial match: At least 70% of key medical terms from ground truth appear in prediction

3- Filters out common stop words when checking partial matches

4- Returns whether the prediction is correct and the match type


In [43]:
import string

STOPWORDS = set(["the","a","an","is","are","and","or","of","to","with","in","on","for","that"])

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def check_accuracy(prediction, ground_truth):

    pred_words = preprocess(prediction)
    gt_words   = preprocess(ground_truth)

    # EXCLUDE stopwords
    pred_words = [w for w in pred_words if w not in STOPWORDS]
    gt_words   = [w for w in gt_words if w not in STOPWORDS]

    # Exact match = GT is substring of prediction
    if ground_truth.lower() in prediction.lower():
        return True, "exact"

    # Partial match: ≥70% overlap of key medical terms
    if len(gt_words) > 0:
        overlap = sum(1 for w in gt_words if w in pred_words)
        ratio = overlap / len(gt_words)

        if ratio >= 0.7:
            return True, "partial"

    return False, "no_match"


## Step 5: Run Evaluation Loop

For each of the 20 test examples you will :

1- Extract the question and ground truth answer

2- Display the question (truncated if long)

3- Generate a prediction using your model

4- Check if the prediction is correct using your accuracy function

5- Display the result (✅ correct or ❌ incorrect)

6- Track running accuracy and timing metrics

In [44]:
print("\n" + "="*80)
print("EVALUATING MODEL")
print("="*80)

results = []
correct_exact = 0
correct_partial = 0
total = 0
start_time = time.time()

for i, idx in enumerate(selected_indices, 1):

    entry = test_data[idx]
    question = entry["Open-ended Verifiable Question"]
    ground_truth = entry["Ground-True Answer"]

    print(f"\n{i}. Question: {question[:120]}...")

    prediction = get_prediction(question)

    correct, match_type = check_accuracy(prediction, ground_truth)

    if correct:
        if match_type == "exact":
            correct_exact += 1
        else:
            correct_partial += 1

    total += 1

    current_accuracy = 100 * (correct_exact + correct_partial) / total

    print(f"   → Prediction: {prediction[:100]}...")
    print(f"   → Ground truth: {ground_truth}")
    print(f"   → Correct? {correct} ({match_type})")
    print(f"   🎯 Running accuracy: {current_accuracy:.1f}%")

    results.append({
        "index": idx,
        "question": question,
        "ground_truth": ground_truth,
        "prediction": prediction,
        "correct": correct,
        "match_type": match_type
    })

total_time = time.time() - start_time


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



EVALUATING MODEL

1. Question: After a 60-year-old man underwent a successful orthotopic liver transplantation, the transplanted liver exhibited poor f...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Hypoxia-inducible factor-1 alpha (HIF-1α)....
   → Ground truth: Reactive oxygen species
   → Correct? False (no_match)
   🎯 Running accuracy: 0.0%

2. Question: In a 37-year-old female patient with a fractured clavicle where the junction of the inner and middle third of the bone s...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Malunion of the clavicle. The correct answer is: Malunion of the clav...
   → Ground truth: Thrombosis of the subclavian vein, causing a pulmonary embolism
   → Correct? False (no_match)
   🎯 Running accuracy: 0.0%

3. Question: In which condition does the antagonism of histamine by H1 antihistaminics not afford any benefit?...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Asthma. <end_of_text>...
   → Ground truth: Common cold
   → Correct? False (no_match)
   🎯 Running accuracy: 0.0%

4. Question: A 74-year-old man has a 1.5-centimeter, faintly erythematous, raised lesion with irregular borders on his right forearm....


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Keratinization of the epidermis....
   → Ground truth: Irreversible nuclear changes in the stratum basale
   → Correct? False (no_match)
   🎯 Running accuracy: 0.0%

5. Question: A 24-year-old male presents to the psychiatry emergency department with symptoms of excitement, grandiosity, lack of sle...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Risperidone....
   → Ground truth: Risperidone
   → Correct? True (exact)
   🎯 Running accuracy: 20.0%

6. Question: An 18-year-old pregnant woman, who is 10 weeks along, presents at her first prenatal visit reporting nausea with occasio...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Metronidazole. The best answer is The answer is: Metronidazole. The b...
   → Ground truth: Treat with nitrofurantoin for seven days.
   → Correct? False (no_match)
   🎯 Running accuracy: 16.7%

7. Question: A 40-year-old male presented with right loin pain referred to the right iliac fossa. After an ultrasound and a non-contr...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: 2nd ureter. <end_of_text>...
   → Ground truth: Mid ureter
   → Correct? False (no_match)
   🎯 Running accuracy: 14.3%

8. Question: What is the most common functioning pancreatic islet cell tumor?...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>...
   → Ground truth: Insulinoma
   → Correct? False (no_match)
   🎯 Running accuracy: 12.5%

9. Question: In an MRI scan showing a transaxial section through the head, which structure may be obliterated by a pituitary tumor wi...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Pituitary gland. The pituitary gland is a small endocrine gland locat...
   → Ground truth: The optic chiasm.
   → Correct? False (no_match)
   🎯 Running accuracy: 11.1%

10. Question: What artery is a direct branch of the gastroduodenal artery?...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: The hepatic artery....
   → Ground truth: Right gastroepiploic artery
   → Correct? False (no_match)
   🎯 Running accuracy: 10.0%

11. Question: A patient diagnosed with bronchiectasis has now presented with nephrotic syndrome. What is the most likely underlying co...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Glomerulonephritis. The patient's symptoms of nephrotic syndrome and ...
   → Ground truth: Amyloidosis
   → Correct? False (no_match)
   🎯 Running accuracy: 9.1%

12. Question: What is the most general term for the process by which the amount of active drugs in the body is reduced after absorptio...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Decrease in drug concentration....
   → Ground truth: Elimination
   → Correct? False (no_match)
   🎯 Running accuracy: 8.3%

13. Question: A 7-year-old boy presents with developmental delay, intellectual disability, and a history of cerebral venous thrombosis...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Elevated serum homocysteine levels. Based on the patient's symptoms a...
   → Ground truth: Decreased methionine concentration
   → Correct? False (no_match)
   🎯 Running accuracy: 7.7%

14. Question: A 27-year-old male presents with a palpable mass in his scrotum and mild testicular pain. Upon physical examination, the...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
Answer: Torsion of the left testicular vein. Based on the symptoms and examination f...
   → Ground truth: Compression of the left renal vein at the aortic origin of the superior mesenteric artery
   → Correct? False (no_match)
   🎯 Running accuracy: 7.1%

15. Question: A farmer has a black mole on the cheek that has increased in size to more than 6mm with sharply defined borders and a ce...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Pigmented macule....
   → Ground truth: Superficial spreading melanoma
   → Correct? False (no_match)
   🎯 Running accuracy: 6.7%

16. Question: Which viruses are known to cause hemorrhagic fever?...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
Answer: Ebola virus, Marburg virus, Lassa virus, and Hantavirus....
   → Ground truth: Lassa fever virus, Yellow fever virus, Crimean-Congo hemorrhagic fever virus
   → Correct? False (no_match)
   🎯 Running accuracy: 6.2%

17. Question: What is the name of the vertical crest found in the fundus of the internal auditory canal?...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: The superior semicircular crest....
   → Ground truth: Bill's bar
   → Correct? False (no_match)
   🎯 Running accuracy: 5.9%

18. Question: What is the structure not involved in the Unhappy triad of O'Donoghue?...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: blood pressure. The Unhappy triad consists of blood pressure, heart r...
   → Ground truth: Fibular collateral ligament
   → Correct? False (no_match)
   🎯 Running accuracy: 5.6%

19. Question: What is the most likely diagnosis for a 22-year-old woman who develops small itchy wheals after physical exertion, expos...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


   → Prediction: |end_header_id>
The answer is: Hypersensitivity reaction to food. <end_of_text>...
   → Ground truth: Cholinergic urticaria
   → Correct? False (no_match)
   🎯 Running accuracy: 5.3%

20. Question: What is the most reliable feature indicating the malignant transformation of pheochromocytoma?...
   → Prediction: |end_header_id>
The answer is: Increased cell proliferation rate....
   → Ground truth: Presence of metastasis to other organs
   → Correct? False (no_match)
   🎯 Running accuracy: 5.0%


## Step 6: Calculate Final Metrics

Compute and display :

- Total number of examples evaluated
- Number and percentage of exact matches
- Number and percentage of partial matches
- Overall accuracy percentage
- Number of incorrect predictions
- Total evaluation time and average time per example

In [45]:
accuracy = 100 * (correct_exact + correct_partial) / total

print("\n" + "="*80)
print("FINAL RESULTS")
print("="*80)
print(f"Total examples evaluated: {total}")
print(f"Exact matches:  {correct_exact} ({100*correct_exact/total:.1f}%)")
print(f"Partial matches:{correct_partial} ({100*correct_partial/total:.1f}%)")
print(f"Overall accuracy: {accuracy:.1f}%")
print(f"Incorrect predictions: {total - (correct_exact + correct_partial)}")
print(f"Total time: {total_time:.1f} sec")
print(f"Avg time per example: {total_time/total:.2f} sec")



FINAL RESULTS
Total examples evaluated: 20
Exact matches:  1 (5.0%)
Partial matches:0 (0.0%)
Overall accuracy: 5.0%
Incorrect predictions: 19
Total time: 19.8 sec
Avg time per example: 0.99 sec


## Step 7: Analyze Detailed Results

Review and display :

1- Incorrect examples: Show all questions where the model failed, with ground truth vs. prediction

2- Correct examples: Show a sample (first 5) of successful predictions

3-Understand patterns in successes and failures

In [46]:
print("\n" + "="*80)
print("DETAILED RESULTS")
print("="*80)

incorrect = [r for r in results if not r["correct"]]
correct = [r for r in results if r["correct"]]

if incorrect:
    print(f"\n❌ Incorrect examples ({len(incorrect)}):")
    for r in incorrect:
        print("\nQ:", r["question"])
        print("GT:", r["ground_truth"])
        print("Pred:", r["prediction"][:120], "...")
else:
    print("🎉 ALL CORRECT!")

print(f"\n\n✅ Showing first 5 correct examples ({len(correct)} total)\n")
for r in correct[:5]:
    print("Q:", r["question"])
    print("GT:", r["ground_truth"])
    print("Pred:", r["prediction"][:120], "...")
    print("Match:", r["match_type"], "\n")



DETAILED RESULTS

❌ Incorrect examples (19):

Q: After a 60-year-old man underwent a successful orthotopic liver transplantation, the transplanted liver exhibited poor function and produced minimal bile for the first 3 days. This poor graft function is thought to result from 'reperfusion injury.' What substance is most likely responsible for causing reperfusion injury in the transplanted liver?
GT: Reactive oxygen species
Pred: |end_header_id>
The answer is: Hypoxia-inducible factor-1 alpha (HIF-1α). ...

Q: In a 37-year-old female patient with a fractured clavicle where the junction of the inner and middle third of the bone shows overriding of the medial and lateral fragments, and the arm is rotated medially but not laterally, what medical condition is likely to occur as a complication of this fracture?
GT: Thrombosis of the subclavian vein, causing a pulmonary embolism
Pred: |end_header_id>
The answer is: Malunion of the clavicle. The correct answer is: Malunion of the clavicle. The

## Step 8: Assess Performance

Interpret your results using these benchmarks :

- ≥80% accuracy: Excellent - Fine-tuning was highly successful
- 60-79% accuracy: Good - Model learned successfully
- 40-59% accuracy: Moderate - Consider training longer or using more data
- 20-39% accuracy: Poor - Check data quality and training parameters
- <20% accuracy: Very poor - Verify data formatting and retrain

In [47]:
print("\n" + "="*80)
print("PERFORMANCE ASSESSMENT")
print("="*80)

if accuracy >= 80:
    print("EXCELLENT! Model is performing very well!")
elif accuracy >= 60:
    print("GOOD! Model learned successfully!")
elif accuracy >= 40:
    print("MODERATE. More training may help.")
elif accuracy >= 20:
    print("POOR. Model needs improvement.")
else:
    print("VERY POOR. Check data formatting or train longer.")



PERFORMANCE ASSESSMENT
VERY POOR. Check data formatting or train longer.


## Step 9: Save Results

Create a comprehensive results dictionary containing:

- All accuracy metrics
- Timing information
- Selected test indices
- Detailed results for each example
- Save to evaluation_results.json for future reference and analysis

In [48]:
results_summary = {
    "accuracy": accuracy,
    "correct_exact": correct_exact,
    "correct_partial": correct_partial,
    "total": total,
    "selected_indices": selected_indices,
    "results": results
}

with open("evaluation_results.json", "w") as f:
    json.dump(results_summary, f, indent=2)

print("Results saved to evaluation_results.json")
print("\n" + "="*80)
print("EVALUATION COMPLETE")
print("="*80)


Results saved to evaluation_results.json

EVALUATION COMPLETE


#What's about the next steps ?

##Part A : Model Improvement Strategies

**Question 1: Improving Model Performance**

Based on your evaluation results, propose at least 2 or 3 specific strategies to improve your model's accuracy. For each strategy, explain what you would change, why it helps, and potential trade-offs.

- Train on more or better data for the model to see more diverse medical cases. But more data is longer training
- Train longer, increase num_train_epochs. This will let the layers better adapt instead of underfitting. However, risk of over fitting

**Question 2: Analyzing Failure Patterns**

Review your incorrect predictions and identify patterns in failures. What can you tell about the model errors ?

- Long questions, model only answers part of the question
- Rare conditions

**Question 3: Data Quality vs. Quantity**

What do you think it's better between training on 2000 examples (same quality) or 500 curated high-quality examples ?

- For small LoRA fine tuning, quality usually matters more than quantity
- Once the quality is under control, then scaling to 2000+ examples becomes useful

##Part B : Resource-Constrained Inference

**Question 4: Optimizing for limited resources**

How can you design a strategie to reduce inference time/memory for deployment in constrained environments ?

- For deployment on a single GPU or CPU, I would use a small quantized model, short prompts, and reduced max_new_tokens

**Question 5: Speed vs. Accuracy Trade-offs**

Analyze how changing generation parameters affects speed, quality, and consistency 🥸

- I would prioritize low temperature, moderate top_p, and a moderate max_new_tokens to favor consistent, safe, and relatively fast answers over creativity in a medical setting

##Part C : Evaluation Methodology

**Question 7: Improving Evaluation Metrics**

Analyze limitations of current exact/partial match evaluation and propose improvements. Do you think you have false negatives or false positives ? What can we do about it ?

- Yes, we likely have both false negatives (paraphrased correct answers) and false positives (answers that mention the right term but overall are wrong). Better evaluation would combine semantic similarity and human/LLM judgment

**Question 8: Test Set Size and Confidence**

Test other test size and observe the result. What can you say about the results ? How can you improve it ?

- Our current evaluation on 20 examples is more of a rough indication; to be confident in the true performance we should evaluate on a larger, well-stratified test set

##Part D : Real-World deployment scenario

**Question 9: Production Considerations**

What can you do to address safety, reliability, updates, and edge cases for deploying in a medical assistance application ?

- The system must include strong guardrails, human oversight, and mechanisms to prevent harmful or incorrect medical advice. It should regularly be updated, monitored, and versioned to reflect evolving medical knowledge, while ensuring privacy and regulatory compliance (e.g., HIPAA/GDPR). The model must also detect uncertainty or out-of-scope queries and redirect users to qualified professionals when necessary

# Final question

Use this notebook to finetune a model on a medical dataset and evaluate the performance of your model. You can choose any dataset from this list : https://huggingface.co/collections/mfmezger/medical-datasets-for-llms

If you feel brave you can use this dataset https://huggingface.co/datasets/openlifescienceai/medmcqa and train a model to answer questions like a medical student / future doctor 🧙 If you have CUDA backend enabled you can use this lib to speed up the training : https://github.com/unslothai/unsloth?tab=readme-ov-file

In [49]:
dataset = load_dataset("openlifescienceai/medmcqa")


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/85.9M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/936k [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/1.48M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/182822 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6150 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4183 [00:00<?, ? examples/s]

In [50]:
full = dataset["train"]

train_data = full.select(range(0, 1000))
test_data = full.select(range(1000, len(full)))


In [57]:
def get_ground_truth(example):
    label = example["cop"]   # integer 0,1,2,3

    option_map = {
        0: example["opa"],
        1: example["opb"],
        2: example["opc"],
        3: example["opd"]
    }

    return option_map[label]



In [58]:
def build_question(example):
    return (
        example["question"] + "\n"
        f"A) {example['opa']}\n"
        f"B) {example['opb']}\n"
        f"C) {example['opc']}\n"
        f"D) {example['opd']}\n"
        "Select the correct option."
    )


In [59]:
import random
random.seed(42)

selected_indices = random.sample(range(len(test_data)), 20)


In [60]:
correct_exact = 0
correct_partial = 0
total = 0
results = []

for idx in selected_indices:
    example = test_data[idx]

    question = build_question(example)
    ground_truth = get_ground_truth(example)

    prediction = get_prediction(question)

    correct, match_type = check_accuracy(prediction, ground_truth)

    if correct:
        if match_type == "exact":
            correct_exact += 1
        else:
            correct_partial += 0

    total += 1

    results.append({
        "idx": idx,
        "question": question,
        "ground_truth": ground_truth,
        "prediction": prediction,
        "correct": correct,
        "match_type": match_type
    })


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for

In [61]:
def check_accuracy(prediction, ground_truth):
    # Exact match if the correct option text appears in prediction
    if ground_truth.lower() in prediction.lower():
        return True, "exact"
    return False, "no_match"


In [62]:
accuracy = 100 * (correct_exact + correct_partial) / total
print(f"Accuracy on MedMCQA = {accuracy:.1f}%")


Accuracy on MedMCQA = 25.0%


In [63]:
def mcq_exact(pred, gt):
    return gt.lower() in pred.lower()
