# CS614 Individual Assignment: Fine-Tuning an LLM for Medical Question Answering

**Student:** Ikhwan Wahid  
**Module:** CS614 - Generative AI with LLMs  
**Date:** February 2026

---
## 1. Task

The goal of this assignment is to fine-tune a pre-trained large language model (LLM) to improve its performance on a domain-specific task. Specifically, we fine-tune **Mistral-7B-Instruct-v0.3** on the **MedQA-USMLE** dataset — a collection of 4-option multiple-choice questions from the United States Medical Licensing Examination (USMLE).

The task is framed as a classification problem: given a clinical vignette and four answer options (A, B, C, D), the model must select the single best answer. We use **QLoRA** (Quantized Low-Rank Adaptation) to make fine-tuning feasible on a single GPU, and evaluate across **6 hyperparameter configurations** to study the effect of LoRA rank, learning rate, training duration, and regularization.

We also conduct additional analyses beyond the core fine-tuning:
- **Zero-shot and 3-shot baselines** to establish the base model's performance before fine-tuning
- **Per-topic accuracy breakdown** across 13 medical specialties
- **Error analysis** with confusion matrices
- **Confidence calibration** (Expected Calibration Error)
- **Answer position bias analysis** with chi-squared statistical tests
- **Prompt template sensitivity** testing across 4 different system prompts

---
## 2. Dataset

**Source:** [`GBaker/MedQA-USMLE-4-options-hf`](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options-hf) on HuggingFace Hub

| Split | Examples |
|-------|----------|
| Train | 10,178 |
| Validation | 1,272 |
| Test | 1,273 |

Each example contains:
- A clinical vignette (`sent1`) describing a patient scenario
- Four answer options (`ending0` through `ending3`)
- A gold label (0-3, mapped to A-D)

The questions are drawn from USMLE Step 1, Step 2, and Step 3 exams, covering a broad range of medical topics. Using a keyword-based topic classifier, we tagged test-set questions with one of 13 topic categories (12 specialties plus a general-medicine catch-all):

| Topic | Test Examples | % of Test Set |
|-------|:---:|:---:|
| Other (General Medicine) | 145 | 11.4% |
| Gastroenterology | 157 | 12.3% |
| Cardiology | 158 | 12.4% |
| Infectious Disease | 81 | 6.4% |
| Neurology | 65 | 5.1% |
| Pulmonology | 90 | 7.1% |
| Nephrology | 59 | 4.6% |
| Endocrinology | 57 | 4.5% |
| Hematology | 46 | 3.6% |
| Obstetrics/Gynecology | 108 | 8.5% |
| Psychiatry | 68 | 5.3% |
| Oncology | 50 | 3.9% |
| Pharmacology | 8 | 0.6% |
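
A keyword-based tagger of the kind described above could look like the following minimal sketch. The keyword lists, the three-topic subset, and the tie-breaking rule are all hypothetical; the report does not list the actual keywords or scoring rule.

```python
# Hypothetical keyword lists; the report's actual keywords are not given.
TOPIC_KEYWORDS = {
    "Cardiology": ("myocardial", "murmur", "ecg", "chest pain"),
    "Pulmonology": ("dyspnea", "asthma", "pneumonia", "copd"),
    "Psychiatry": ("depression", "hallucination", "anxiety", "mania"),
}
FALLBACK = "Other (General Medicine)"


def classify_topic(vignette: str) -> str:
    """Assign the topic whose keywords occur most often; ties go to the earlier topic."""
    text = vignette.lower()
    scores = {
        topic: sum(text.count(keyword) for keyword in keywords)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # first topic with the top score
    return best if scores[best] > 0 else FALLBACK


print(classify_topic("A 60-year-old man has crushing chest pain and a new murmur."))
print(classify_topic("Pneumonia with pleuritic chest pain."))  # ambiguous vignette
```

The second call shows why such a tagger is only approximate: a vignette that matches two topics equally is assigned to whichever topic happens to be listed first.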

The answer label distribution in the training set is approximately uniform across A/B/C/D, which means no class rebalancing was needed.

---
## 3. Model Choice

### Base Model: Mistral-7B-Instruct-v0.3

We selected [`mistralai/Mistral-7B-Instruct-v0.3`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) for several reasons:

1. **7B parameters:** large enough to be capable, small enough to train on a single GPU
2. **Instruction-tuned:** the model already understands the instruct format (`[INST] ... [/INST]`), so it can follow prompts out of the box
3. **Strong general reasoning:** Mistral-7B was reported to outperform larger models such as Llama 2 13B on standard benchmarks
4. **Wide QLoRA support:** well-tested with the PEFT, bitsandbytes, and trl libraries

### Why QLoRA?

Full fine-tuning of a 7B-parameter model in FP32 with AdamW requires roughly **112 GB of VRAM** before activations are counted (7B params x 16 bytes: 4 for weights, 4 for gradients, and 8 for AdamW's two moment estimates). This exceeds even an 80GB H100. QLoRA makes fine-tuning feasible through three techniques:

| Technique | Effect |
|-----------|--------|
| **4-bit NF4 quantization** | Compresses base model from ~14 GB (FP16) to ~4 GB in GPU memory |
| **LoRA adapters** | Injects small trainable matrices (0.5-4% of total params in our configs), leaving the quantized base frozen |
| **Gradient checkpointing** | Trades compute for memory by recomputing activations during backward pass |

This combination reduces VRAM usage to ~15-20 GB, fitting comfortably on a single GPU.
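
The memory figures in this section can be sanity-checked with simple arithmetic. The sketch below assumes FP32 weights and gradients plus two FP32 AdamW moment estimates per parameter, and 4 bits per weight for NF4 (ignoring quantization constants). Mixed-precision setups change the per-parameter byte counts, and activations are excluded.

```python
# Rough VRAM accounting for a 7B-parameter model (decimal GB, activations excluded).
# Assumption: FP32 weights and gradients plus two FP32 AdamW moments per parameter.
N_PARAMS = 7e9
GB = 1e9


def gigabytes(n_params: float, bytes_per_param: float) -> float:
    """Memory in GB for n_params values stored at bytes_per_param each."""
    return n_params * bytes_per_param / GB


full_ft = {
    "weights (fp32)": gigabytes(N_PARAMS, 4),
    "gradients (fp32)": gigabytes(N_PARAMS, 4),
    "adamw moments (2 x fp32)": gigabytes(N_PARAMS, 8),
}
full_ft_total = sum(full_ft.values())

base_fp16 = gigabytes(N_PARAMS, 2)   # half-precision copy of the base model
base_nf4 = gigabytes(N_PARAMS, 0.5)  # 4 bits per weight, quantization constants ignored

for name, size in full_ft.items():
    print(f"{name:26s} {size:6.1f} GB")
print(f"{'full fine-tuning total':26s} {full_ft_total:6.1f} GB")
print(f"{'frozen base, fp16':26s} {base_fp16:6.1f} GB")
print(f"{'frozen base, nf4':26s} {base_nf4:6.1f} GB")
```

Under these assumptions the optimizer state alone is twice the size of the weights, which is why freezing the base model and training only small adapters saves so much memory.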

### LoRA Configuration

We target **all 7 linear layers** in each Mistral transformer block:

```
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
```

This is more comprehensive than the common approach of only targeting attention projections (`q_proj`, `v_proj`), and gives the adapter more capacity to learn task-specific representations.
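
To see how much capacity the seven target modules add, the adapter size can be computed from the layer shapes. This is a sketch: the dimensions below are assumptions taken from the published Mistral-7B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of 128 dims, 32 layers), not values stated in this report.

```python
# Count LoRA parameters for Mistral-7B when adapting all seven linear layers.
# Assumed dimensions (from the published Mistral-7B config, not from this report).
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
N_LAYERS = 32

# (in_features, out_features) of each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank): rank * (in + out) weights."""
    per_block = sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGETS.values())
    return per_block * N_LAYERS


for r in (8, 16, 32, 64):
    print(f"r={r:<2d} -> {lora_param_count(r) / 1e6:.1f}M trainable parameters")
```

Adapter size grows linearly with rank. Restricting the targets to `q_proj` and `v_proj` would cut the count to roughly a sixth of the all-layer figure under the same assumptions, since the three MLP projections account for most of the added weights.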

---
## 4. Fine-Tuning Process

### Training Format

Each training example is formatted in the Mistral instruct template:

```
<s>[INST] You are a medical expert. Answer the following USMLE-style
multiple-choice question by selecting the single best answer.
Respond with ONLY the letter (A, B, C, or D) of the correct answer.

{clinical vignette}

A) {option_0}
B) {option_1}
C) {option_2}
D) {option_3} [/INST] {answer_letter}</s>
```

The model is trained to predict only the answer letter (A/B/C/D) — a single token — after seeing the full prompt.
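
The template above can be produced with a small formatting helper. This is a minimal sketch, not the project's actual code: the function name is hypothetical, the `label` field name is assumed from the dataset description, and the system prompt is joined with spaces on the assumption that the line breaks in the template are display wrapping.

```python
# Build one training string in the Mistral instruct template shown above.
SYSTEM_PROMPT = (
    "You are a medical expert. Answer the following USMLE-style "
    "multiple-choice question by selecting the single best answer. "
    "Respond with ONLY the letter (A, B, C, or D) of the correct answer."
)
LETTERS = "ABCD"


def format_example(example: dict, include_answer: bool = True) -> str:
    """Render a MedQA row (sent1, ending0..ending3, label) as a prompt string."""
    options = "\n".join(
        f"{letter}) {example['ending' + str(i)]}" for i, letter in enumerate(LETTERS)
    )
    # Note: many tokenizers add <s> themselves; drop the literal if yours does.
    prompt = f"<s>[INST] {SYSTEM_PROMPT}\n\n{example['sent1']}\n\n{options} [/INST]"
    if include_answer:
        # Training target: one answer letter followed by end-of-sequence.
        prompt += f" {LETTERS[example['label']]}</s>"
    return prompt


row = {  # hypothetical row for illustration, not taken from the dataset
    "sent1": "A 25-year-old woman has dysuria and urinary frequency. "
             "Which organism is the most likely cause?",
    "ending0": "Escherichia coli",
    "ending1": "Staphylococcus aureus",
    "ending2": "Candida albicans",
    "ending3": "Pseudomonas aeruginosa",
    "label": 0,
}
print(format_example(row))
```

Calling `format_example(row, include_answer=False)` yields the same prompt without the target, which is the form needed at inference time.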

### Hyperparameter Sweep

We trained **6 configurations** that systematically vary LoRA rank, learning rate, training duration, and dropout:

| Config | LoRA Rank (r) | Alpha | Learning Rate | Epochs | Dropout | Trainable Params | What It Tests |
|--------|:---:|:---:|:---:|:---:|:---:|:---:|-------|
| 1 (Baseline) | 16 | 32 | 2e-4 | 2 | 0.05 | 42M (1.10%) | Standard QLoRA defaults |
| 2 (Low Rank) | 8 | 16 | 2e-4 | 2 | 0.05 | 21M (0.55%) | Fewer params sufficient? |
| 3 (High Rank) | 64 | 128 | 1e-4 | 2 | 0.05 | 168M (4.27%) | More capacity helps? |
| 4 (Low LR) | 16 | 32 | 5e-5 | 3 | 0.05 | 42M (1.10%) | Slower, more stable learning? |
| 5 (Extended) | 16 | 32 | 2e-4 | 3 | 0.05 | 42M (1.10%) | More epochs help? |
| 6 (Aggressive) | 32 | 64 | 3e-4 | 2 | 0.10 | 84M (2.18%) | Speed + regularization |

**Fixed settings across all configs:**
- Effective batch size: 16 (batch=4 x gradient accumulation=4)
- Optimizer: Paged AdamW 8-bit
- LR scheduler: Cosine with 5% warmup
- Max sequence length: 1024 tokens
- Weight decay: 0.01
- Early stopping: patience = 3 evaluations (evaluation runs every 100 training steps)
- Precision: BFloat16
- Quantization: NF4 with double quantization
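
The fixed settings determine how many optimizer steps each epoch budget contains. The arithmetic below is a rough sketch: it rounds the final partial batch up, whereas the trainer's exact step count may differ by one, and it assumes warmup is computed as a fraction of total steps.

```python
import math

TRAIN_EXAMPLES = 10_178   # training split size from Section 2
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4
WARMUP_RATIO = 0.05
EVAL_EVERY = 100          # training steps between evaluations
PATIENCE = 3              # evaluations without improvement before stopping

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
# Approximation: the last partial batch is counted as a full step.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)


def schedule(epochs: int) -> dict:
    """Total optimizer steps and warmup steps for a given epoch budget."""
    total = steps_per_epoch * epochs
    return {"total_steps": total, "warmup_steps": math.ceil(total * WARMUP_RATIO)}


for epochs in (2, 3):
    print(f"{epochs} epochs: {schedule(epochs)}")

# Early stopping fires once PATIENCE consecutive evaluations fail to improve,
# i.e. PATIENCE * EVAL_EVERY steps after the best checkpoint.
steps_after_best = PATIENCE * EVAL_EVERY
print(f"steps between best checkpoint and stop: {steps_after_best}")
```

Under these assumptions one epoch is roughly 637 optimizer steps, so a run that stops near step 600 has seen slightly less than one full pass over the training data.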

### Training Results

| Config | Best Eval Loss | Training Time | Steps Before Early Stop |
|--------|:---:|:---:|:---:|
| 1 (Baseline) | 0.9594 | 29.5 min | ~600 |
| 2 (Low Rank) | 0.9583 | 29.6 min | ~600 |
| 3 (High Rank) | 0.9527 | 30.0 min | ~600 |
| 4 (Low LR) | **0.9515** | 49.2 min | ~1500 |
| 5 (Extended) | 0.9685 | 29.6 min | ~600 |
| 6 (Aggressive) | 0.9810 | 42.9 min | ~600 |

**Key observations from training:**
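- **Early stopping activated in all configs**, limiting overfitting. No config completed its full epoch budget.
- **Config 4** (low LR) trained for the most steps (~1500): the lower learning rate allowed gradual improvement, so early stopping triggered much later than in the other configs.
- **Configs 5 and 6** had the worst eval losses. Config 6 used the highest peak learning rate (3e-4). Config 5 shares Config 1's peak rate (2e-4), but its 3-epoch cosine schedule decays more slowly, so its learning rate stays higher over the first ~600 steps. Both results are consistent with higher effective learning rates leading to faster overfitting.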
- **Total training time** for all 6 configs: approximately **3.5 hours** on an NVIDIA H100 80GB.

---
## 5. Evaluation

### Evaluation Setup

We evaluate three approaches on the **full test set** (1,273 examples):

1. **Zero-shot** — Base Mistral-7B with no examples, just the system prompt
2. **3-shot** — Base Mistral-7B with 3 random training examples as in-context exemplars
3. **Fine-tuned** — Best QLoRA config (config 4) applied to the base model

Inference uses greedy decoding (`do_sample=False`) with `max_new_tokens=5`. The generated text is parsed to extract the answer letter using a multi-strategy regex parser (exact first-char match, "The answer is X" patterns, standalone letter detection).
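
A multi-strategy parser along these lines could be sketched as follows. The exact patterns used in the project are not shown in this report, so the regular expressions and the function name below are assumptions that follow the three strategies listed above.

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Return 'A'..'D' parsed from generated text, or None on extraction failure."""
    cleaned = text.strip()
    # Strategy 1: the output starts with the letter, e.g. "B" or "B) Aspirin".
    match = re.match(r"([A-D])(?![A-Za-z])", cleaned)
    if match:
        return match.group(1)
    # Strategy 2: phrases such as "The answer is C" or "Answer: (C)".
    # Only the phrase is case-insensitive; the letter itself must be uppercase.
    match = re.search(
        r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\)?(?![A-Za-z])", cleaned
    )
    if match:
        return match.group(1)
    # Strategy 3: any standalone capital letter A-D.
    match = re.search(r"\b([A-D])\b", cleaned)
    return match.group(1) if match else None


for sample in ["B", " C) Metformin", "The answer is D.", "Answer: (A)", "I cannot determine"]:
    print(repr(sample), "->", extract_answer(sample))
```

Outputs for which the function returns `None` are what the results tables count as extraction failures. The negative lookahead keeps words that merely begin with A-D (such as "Diabetes") from being read as an answer.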

### Main Results

| Metric | Zero-Shot | 3-Shot | Fine-Tuned | Delta (FT vs ZS) |
|--------|:---:|:---:|:---:|:---:|
| **Accuracy** | 49.10% | 49.02% | **57.42%** | **+8.32 pp** |
| **Macro F1** | 0.4929 | 0.4888 | **0.5695** | +0.0766 |
| **Extraction Failure Rate** | 1.18% | 1.65% | **0.16%** | -1.02 pp |

### Validation Accuracy Across All 6 Configs

| Config | Val Accuracy | Val Macro F1 | Extraction Failures |
|--------|:---:|:---:|:---:|
| **4 (Low LR)** | **57.39%** | **0.5716** | **0.00%** |
| 3 (High Rank) | 56.68% | 0.5667 | 0.00% |
| 2 (Low Rank) | 56.29% | 0.5620 | 0.00% |
| 1 (Baseline) | 55.82% | 0.5562 | 0.00% |
| 5 (Extended) | 52.99% | 0.5283 | 0.00% |
| 6 (Aggressive) | 51.34% | 0.5115 | 0.00% |

### Per-Class Performance (Fine-Tuned, Test Set)

| Answer | Precision | Recall | F1-Score | Support |
|:---:|:---:|:---:|:---:|:---:|
| A | 0.62 | 0.58 | 0.60 | 353 |
| B | 0.52 | 0.59 | 0.55 | 309 |
| C | 0.61 | 0.63 | 0.62 | 346 |
| D | 0.54 | 0.48 | 0.50 | 265 |

Answer **D** has the lowest recall (0.48), meaning the model fails to identify D as correct ~52% of the time. This is consistent with the position bias analysis in Section 6.
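
For reference, per-class precision, recall, and macro F1 can be computed directly from gold and predicted letters. This is a minimal sketch on toy data: it assumes extraction failures (`None`) count against recall but not toward any class's precision, which may differ from how the reported figures were computed.

```python
from collections import Counter

LABELS = "ABCD"


def per_class_metrics(gold: list, pred: list) -> dict:
    """Precision, recall and F1 per answer letter; unparsed predictions are None."""
    tp, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        gold_count[g] += 1
        if p is not None:
            pred_count[p] += 1
        if g == p:
            tp[g] += 1
    metrics = {}
    for label in LABELS:
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / gold_count[label] if gold_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(metrics: dict) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Toy labels for illustration only (not the report's predictions).
gold = list("AABBCCDD")
pred = ["A", "B", "B", "B", "C", "C", "D", None]
scores = per_class_metrics(gold, pred)
print(scores["D"], round(macro_f1(scores), 4))
```

Because macro F1 weights each letter equally, a weak class such as D pulls the average down regardless of how many examples it has.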

### Per-Topic Accuracy

Fine-tuning improved accuracy across **all 13 medical topics**:

| Topic | Zero-Shot | Fine-Tuned | Delta |
|-------|:---:|:---:|:---:|
| Obstetrics/Gynecology | 52.8% | **63.0%** | +10.2 |
| Cardiology | 50.2% | **62.0%** | +11.8 |
| Psychiatry | 47.1% | **61.8%** | **+14.7** |
| Other (General Medicine) | 59.3% | 60.0% | +0.7 |
| Oncology | 54.0% | 60.0% | +6.0 |
| Gastroenterology | 53.5% | 59.2% | +5.7 |
| Pulmonology | 45.6% | 55.6% | +10.0 |
| Nephrology | 49.2% | 54.2% | +5.1 |
| Infectious Disease | 46.9% | 53.1% | +6.2 |
| Endocrinology | 40.4% | 52.6% | **+12.3** |
| Hematology | 37.0% | 50.0% | **+13.0** |
| Neurology | 43.1% | 47.7% | +4.6 |
| Pharmacology | 25.0% | 37.5% | +12.5 |

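**Largest gains:** Psychiatry (+14.7 pp), Hematology (+13.0 pp), Pharmacology (+12.5 pp, but only 8 test examples), and Endocrinology (+12.3 pp). Hematology, Pharmacology, and Endocrinology were also the three weakest topics zero-shot, so they likely had the most room to benefit from learning the answer format and basic pattern recognition.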

**Weakest after fine-tuning:** Pharmacology (37.5%, but only 8 test examples) and Neurology (47.7%). These topics may require specialized reasoning that the base model's pre-training did not cover.

### Error Analysis

Of 1,273 test examples, the fine-tuned model made **542 errors** (42.6% error rate):

| Error Type | Count |
|-----------|:---:|
| Substantive errors (wrong answer selected) | 540 |
| Extraction failures (no valid A/B/C/D produced) | 2 |

**Most confused answer pairs** (gold -> predicted):

| Gold | Predicted | Count |
|:---:|:---:|:---:|
| A | B | 62 |
| C | B | 55 |
| D | B | 51 |
| A | C | 48 |
| B | C | 48 |

The model disproportionately predicts **B** when it is wrong, consistent with the position bias analysis below.

**Error rate by topic** (highest to lowest):

| Topic | Error Rate |
|-------|:---:|
| Pharmacology | 62.5% |
| Neurology | 52.3% |
| Hematology | 50.0% |
| Endocrinology | 47.4% |
| Infectious Disease | 46.9% |
| Nephrology | 45.8% |
| Pulmonology | 44.4% |
| Gastroenterology | 40.8% |
| Other | 40.0% |
| Oncology | 40.0% |
| Psychiatry | 38.2% |
| Cardiology | 38.0% |
| Obstetrics/Gynecology | 37.0% |

### Confidence Calibration

We computed the model's confidence by examining the softmax probability over the A/B/C/D token logits at the first generated position:

| Metric | Value |
|--------|:---:|
| Expected Calibration Error (ECE) | **0.2536** |
| Avg. confidence on correct predictions | 57.97% |
| Avg. confidence on incorrect predictions | 54.42% |

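The ECE of 0.254 indicates **poor calibration**. Confidence barely separates right from wrong: the gap between average confidence on correct (58.0%) and incorrect (54.4%) predictions is only 3.6 percentage points, whereas a well-calibrated model should show a much larger gap. Averaged over all predictions, confidence is about 56.5%, close to the 57.4% accuracy, so the overconfidence is concentrated in the high-confidence predictions rather than spread uniformly.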

The calibration curve shows that for predictions with >90% confidence, the model is only correct ~41% of the time — a severe overconfidence problem.
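
The calibration metric can be illustrated with a short sketch. It assumes 10 equal-width confidence bins (the bin count used for the reported ECE is not stated) and takes confidence as the largest softmax probability over the four answer-letter logits. The inputs are toy values.

```python
import math


def softmax_confidence(logits):
    """Largest softmax probability over the answer-letter logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for numerical stability
    return max(exps) / sum(exps)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Toy inputs for illustration only (not the report's predictions).
toy_conf = [0.95, 0.95, 0.55, 0.55]
toy_hit = [1, 0, 1, 1]
toy_ece = expected_calibration_error(toy_conf, toy_hit)
print(round(toy_ece, 3), softmax_confidence([0.0, 0.0, 0.0, 0.0]))
```

In the toy data the high-confidence bin is overconfident and the mid-confidence bin is underconfident. Both contribute to ECE, so a large ECE does not by itself say in which direction a model is miscalibrated.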

---
## 6. Results & Analysis

### 6.1 Where the Fine-Tuned Model Improved

**1. Overall accuracy: +8.3 percentage points over zero-shot baseline.**

The fine-tuned model (57.4%) substantially outperformed both the zero-shot (49.1%) and 3-shot (49.0%) baselines. This indicates that QLoRA fine-tuning is effective at adapting a general-purpose LLM to the medical MCQ domain, even with only about 42M trainable adapter parameters (1.10% in the sweep table).

**2. Extraction failures nearly eliminated.**

The base model failed to produce a valid A/B/C/D answer 1.2-1.7% of the time (15-21 examples). After fine-tuning, this dropped to 0.16% (2 examples). The model learned the expected output format almost perfectly, a direct benefit of supervised fine-tuning on structured answer templates.

**3. Improvement across all 13 medical topics.**

No topic regressed after fine-tuning. The gains ranged from +0.7 pp (General Medicine, already the strongest) to +14.7 pp (Psychiatry). This broad improvement suggests the fine-tuning transferred general medical reasoning rather than memorizing topic-specific patterns.

**4. Reduced answer position bias.**

The zero-shot model had a severe positional bias, over-predicting D by 134 examples (chi-squared = 55.0, p < 0.001). Fine-tuning reduced this to a milder B-bias of +42 examples (chi-squared = 29.9, p < 0.001). While still statistically non-uniform, the fine-tuned model's predictions are much closer to the gold label distribution.
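
A goodness-of-fit test of this kind can be sketched without external libraries. The report does not state whether its tests compared predictions against a uniform split or against the gold-label distribution, so the helper below supports both. The function names and the prediction counts are hypothetical.

```python
import math


def chi_squared_gof(observed, expected=None):
    """Pearson goodness-of-fit statistic; expected defaults to a uniform split."""
    total = sum(observed)
    if expected is None:
        expected = [total / len(observed)] * len(observed)
    else:
        # Rescale the reference counts so both vectors sum to the same total.
        scale = total / sum(expected)
        expected = [e * scale for e in expected]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df3(statistic):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    half = statistic / 2
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * statistic / math.pi) * math.exp(-half)


# Hypothetical prediction counts for A/B/C/D (illustrative only).
predicted = [300, 350, 330, 293]
gold_support = [353, 309, 346, 265]  # per-class support from Section 5

for name, reference in (("uniform", None), ("gold labels", gold_support)):
    stat = chi_squared_gof(predicted, reference)
    print(f"vs {name}: chi-squared = {stat:.2f}, p = {p_value_df3(stat):.4f}")
```

With four answer positions the test has 3 degrees of freedom, which is why the closed-form tail probability suffices. The choice of reference matters: a model that perfectly matched a non-uniform gold distribution would still register as biased against a uniform reference.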

**5. Robust to prompt template changes.**

Testing 4 different system prompts on the fine-tuned model yielded accuracy within a narrow band of 0.8 percentage points:

| Prompt Template | Accuracy |
|----------------|:---:|
| Original (training prompt) | 57.42% |
| Minimal ("Answer with only the letter") | 57.50% |
| CoT-style ("Think step by step") | 56.87% |
| Role-emphasis ("experienced physician") | **57.66%** |

This suggests the model learned to answer medical questions generally, rather than memorizing the specific training prompt wording. The role-emphasis prompt scored 0.24 pp above the original, a difference of about 3 questions out of 1,273, which is too small to conclude that contextual priming helps after fine-tuning.

### 6.2 Where the Fine-Tuned Model Did Not Improve (or Showed Limitations)

**1. Accuracy ceiling at ~57%, regardless of hyperparameters.**

All 6 configurations reached validation accuracy between 51.3% and 57.4% (55.8-57.4% for the four best), with best eval losses within a narrow range (0.95-0.98). This ceiling is likely set by Mistral-7B's pre-trained medical knowledge, not the fine-tuning method. QLoRA can teach the output format and sharpen existing knowledge, but on this evidence it does not appear to inject medical facts that were not learned during pre-training.

**2. Few-shot prompting provided no benefit.**

The 3-shot baseline (49.0%) performed no better than zero-shot (49.1%); the gap corresponds to a single test question, so the two are effectively tied. This runs against the common expectation that in-context examples help. For a 7B model on complex medical reasoning, the exemplars may lengthen the prompt without adding usable signal, or introduce confusing patterns. This contrasts with larger models (e.g., GPT-4), where few-shot prompting is generally reported to be more effective.

**3. Persistent answer position bias.**

Despite reducing bias severity, the fine-tuned model still exhibits non-uniform prediction distributions (chi-squared = 29.9, p < 0.001). It over-predicts B (+42 over gold) and under-predicts D (-30 under gold). Answer D correspondingly has only 0.48 recall: when D is correct, the model misses it slightly more than half the time.

**4. Severe overconfidence (ECE = 0.254).**

The model cannot reliably distinguish when it is right vs wrong. Average confidence for correct predictions (58.0%) is barely higher than for incorrect predictions (54.4%). In the highest confidence bin (>90%), accuracy is only 41.4%. This means the model's confidence scores are essentially meaningless for decision-making.

**5. Pharmacology and Neurology remain weak.**

Despite fine-tuning, Pharmacology (37.5%) and Neurology (47.7%) remain below 50% accuracy. Pharmacology improved from 25% (random-chance level) to 37.5%, but with only 8 examples this is a change of a single question (2 to 3 correct), so it is unreliable. Neurology's modest +4.6 pp gain suggests the base model may lack foundational neurological reasoning.

**6. Aggressive hyperparameters hurt performance.**

Config 6 (lr=3e-4, r=32, dropout=0.1) was the worst performer at 51.3% validation accuracy, only about 2 pp above the zero-shot test accuracy (a rough comparison, since the splits differ). Config 5 (3 epochs at lr=2e-4) also underperformed at 53.0%. Both showed severe train-eval loss divergence within ~600 steps, which points to overfitting as the primary risk when fine-tuning LLMs on domain-specific data.

### 6.3 Observed Limitations and Failure Modes

1. **Knowledge ceiling from pre-training.** The ~57% accuracy ceiling is determined by what the base model learned during pre-training, not by the fine-tuning method. All 6 configs converge to similar eval losses (~0.95-0.98), suggesting QLoRA adapts output formatting and sharpens existing knowledge but cannot compensate for missing medical facts.

2. **Answer position bias (D under-predicted).** The model under-predicts D by 30 examples, resulting in D having the lowest recall (0.48) of any answer position. The confusion matrix shows that when D is the correct answer, the model most often incorrectly predicts B (51 times). This systematic pattern suggests the model may be learning positional shortcuts rather than fully understanding answer content.

3. **Overconfidence renders confidence scores unusable.** With ECE of 0.254 and near-identical confidence for correct (58.0%) and incorrect (54.4%) predictions, the model's uncertainty estimates provide almost no signal. In a medical setting, this is particularly dangerous as users cannot rely on confidence scores to flag uncertain predictions.

4. **CoT prompting slightly hurts accuracy.** The chain-of-thought prompt (56.87%) slightly underperformed the original prompt (57.42%). Since the model was trained to output a single letter and we limit generation to 5 tokens, the CoT instruction creates a mismatch — the model cannot actually reason step-by-step within the token budget.

5. **Heuristic topic classifier.** Our keyword-based topic tagger is approximate — questions spanning multiple specialties (e.g., a cardiology question involving pharmacology) may be misclassified, affecting per-topic accuracy estimates.

### 6.4 Alternative Design Choices

To push beyond the current accuracy ceiling, several alternative approaches could be considered. The expected-impact figures are rough estimates in percentage points, not measured results:

| Alternative | Expected Impact | Rationale |
|------------|:-:|---|
| **Answer shuffling during training** | +2-5% | Randomly permuting option positions (A/B/C/D) during training forces the model to learn answer content rather than positional shortcuts. Our bias analysis shows this would specifically improve D recall (currently only 0.48). |
| **Chain-of-thought fine-tuning** | +3-8% | Training on "Question -> Explanation -> Answer" sequences rather than direct letter prediction. This engages the model's reasoning capabilities and has been shown to improve medical QA accuracy. Requires CoT-annotated training data (e.g., from GPT-4 explanations). |
| **Retrieval-Augmented Generation (RAG)** | +5-15% | Augmenting each question with retrieved medical textbook passages at inference time. This directly addresses the knowledge ceiling by providing external evidence the base model lacks. |
| **Medical-domain base model (BioMistral)** | +3-8% | Starting from BioMistral-7B (pre-trained on PubMed) instead of general Mistral-7B provides a richer medical knowledge base before fine-tuning. |
| **Larger base model (70B)** | +5-10% | Fine-tuning Llama-3-70B or Mixtral-8x7B with QLoRA would provide more pre-trained medical knowledge, but at 4-8x the compute cost. |
| **Ensemble of top configs** | +1-3% | Majority vote across the top 3 configs (4, 3, 2) could correct individual model errors. Increases inference cost linearly but requires no retraining. |
| **Data augmentation (MedMCQA)** | +2-5% | Adding 183K Indian medical exam questions from MedMCQA would broaden the training distribution, especially for underrepresented topics. |
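
Of these alternatives, answer shuffling is the simplest to prototype. The sketch below permutes the four options of one row and remaps the gold label accordingly. The function name and the example row are hypothetical, and the `label` field name is assumed from the dataset description.

```python
import random


def shuffle_options(example: dict, rng: random.Random) -> dict:
    """Return a copy of a MedQA row with its four options permuted and the label remapped."""
    order = list(range(4))
    rng.shuffle(order)  # order[new_position] = old_position
    shuffled = dict(example)
    for new_pos, old_pos in enumerate(order):
        shuffled[f"ending{new_pos}"] = example[f"ending{old_pos}"]
    # The gold option now sits wherever its old index landed in the permutation.
    shuffled["label"] = order.index(example["label"])
    return shuffled


row = {  # hypothetical row for illustration, not taken from the dataset
    "sent1": "Which vitamin deficiency causes scurvy?",
    "ending0": "Vitamin A",
    "ending1": "Vitamin B12",
    "ending2": "Vitamin C",
    "ending3": "Vitamin D",
    "label": 2,
}

rng = random.Random(42)  # fixed seed so the augmentation is reproducible
augmented = [shuffle_options(row, rng) for _ in range(4)]
for item in augmented:
    print(item["label"], [item[f"ending{i}"] for i in range(4)])
```

Applying a fresh permutation each epoch would spread every correct answer across all four positions over the course of training, which is the property the table entry relies on to discourage positional shortcuts.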

---
## 7. Ethical Considerations

1. **Clinical risk.** At 57.4% accuracy, the model answers roughly 43% of medical questions incorrectly. It must **not** be used for clinical decision-making. Even significantly higher accuracy (e.g., 90%) would require rigorous clinical validation before any deployment.

2. **Cultural bias.** USMLE-based training data reflects US-centric medical practice, drug formularies, and clinical guidelines. The model's performance would likely degrade on questions about non-US treatment protocols or disease prevalence patterns.

3. **Overconfidence.** The calibration analysis (ECE = 0.254) shows the model cannot reliably signal when it is uncertain. In a medical context, a confident wrong answer could lead to harm if the model were misused for decision support.

4. **Demographic blind spots.** Clinical vignettes in MedQA may underrepresent certain patient demographics (age groups, ethnicities, socioeconomic backgrounds), potentially leading to performance disparities across populations.

---
## 8. Reproducibility

| Parameter | Value |
|-----------|-------|
| Base model | `mistralai/Mistral-7B-Instruct-v0.3` |
| Dataset | `GBaker/MedQA-USMLE-4-options-hf` (10,178 / 1,272 / 1,273) |
| Method | QLoRA (4-bit NF4, bfloat16 compute, double quantization) |
| Best config | lr=5e-5, r=16, alpha=32, cosine schedule, 3 epochs, early stopping (patience=3) |
| Trainer | HuggingFace `trl.SFTTrainer` v0.28 with `SFTConfig` |
| Optimizer | Paged AdamW 8-bit |
| Libraries | transformers, peft, trl, bitsandbytes, accelerate |
| Seed | 42 |
| Compute | Google Colab, NVIDIA H100 80GB HBM3 |
| Total training time | ~3.5 hours (all 6 configs) |
| Code | Available in `src/` directory with modular architecture |
| Results | Saved to Google Drive (`/content/drive/MyDrive/cs614_results/`) |