# ReadMe
In this notebook we fine-tune Qwen 2.5 VL (3B and 7B) using LoRA adapters with Unsloth. At the end of the notebook we test the models' performance using our evaluation function.

## Installation

In [None]:
%%capture
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
!pip install unsloth==2025.5.6
!pip install VLLM==0.8.4
!pip install transformers==4.51.3

## Unsloth
For the fine-tuning process we use the Unsloth library. Unsloth offers faster and more memory-efficient fine-tuning than the standard Hugging Face libraries; with plain HF we would have needed a more powerful GPU.

In [None]:
from unsloth import FastVisionModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-05-23 15:07:15.937804: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748012836.148091      35 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748012836.214150      35 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-23 15:07:35 [__init__.py:239] Automatically detected platform cuda.


In [None]:


model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-3B-Instruct-unsloth-bnb-4bit",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
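
As a rough sketch of why `load_in_4bit = True` matters on a small GPU, the snippet below estimates the memory taken by the weights alone. The parameter counts are nominal round numbers (3e9 and 7e9), and the helper `weight_memory_gib` is our own illustration, not part of Unsloth. It ignores activations, optimizer state, the LoRA adapters, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3

# Nominal parameter counts (assumption: round numbers, not measured from the checkpoints).
for name, num_params in [("3B", 3e9), ("7B", 7e9)]:
    fp16 = weight_memory_gib(num_params, 16)
    int4 = weight_memory_gib(num_params, 4)
    print(f"{name}: ~{fp16:.1f} GiB in 16-bit, ~{int4:.1f} GiB in 4-bit")
```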

### Picking the layers to train

In [None]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True,
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,

    r = 32,
    lora_alpha = 32,
    lora_dropout = 0.1,
    bias = "none",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,

)
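
To make the `r` and `lora_alpha` settings more concrete: a LoRA adapter on a linear layer adds two small matrices, one of shape `r x d_in` and one of shape `d_out x r`, and scales their product by `lora_alpha / r` (or by `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled). The sketch below counts the extra parameters for a single layer. The 2048 x 2048 layer size and both helper names are illustrative assumptions, not values read from the model.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update."""
    # Standard LoRA uses alpha / r; rank-stabilized LoRA uses alpha / sqrt(r).
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r

d_in = d_out = 2048  # hypothetical layer size, for illustration only
extra = lora_extra_params(d_in, d_out, r=32)
full = d_in * d_out
print(f"LoRA adds {extra} parameters vs {full} in the full layer ({extra / full:.2%})")
print(f"Scaling with r=32, lora_alpha=32: {lora_scaling(32, 32)}")
```

With `r = 32` and `lora_alpha = 32` the scaling factor is 1, so the low-rank update is applied unscaled.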

##  Data Preparation
Since the dataset is very large and our compute is limited, we use a dataset sampled from the original [NIH-CXR14-BiomedCLIP-Features](https://huggingface.co/datasets/Yasintuncer/NIH-CXR14-BiomedCLIP-Features/viewer/default/train?row=0&views%5B%5D=train) dataset.

There are 15 labels, with 200 samples per label. We chose this number because the label with the fewest samples, `Hernia`, has approximately 250 samples, whereas the label with the second-fewest, `Emphysema`, has approximately 1000. To reduce the bias caused by this class imbalance, we picked 200 samples per label for training and 50 for testing.
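
The per-label cap described above can be sketched as follows. This is not the code that was used to build the sampled dataset; the record layout (dicts with a `labels` list) and the helper name `sample_per_label` are assumptions made for illustration.

```python
import random
from collections import defaultdict

def sample_per_label(records, n_per_label, seed=0):
    """Pick up to n_per_label records for every label (records may carry several labels)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, rec in enumerate(records):
        for label in rec["labels"]:
            by_label[label].append(idx)

    chosen = set()
    for label in sorted(by_label):
        idxs = by_label[label]
        # Rare labels contribute everything they have; common ones are capped.
        chosen.update(rng.sample(idxs, min(n_per_label, len(idxs))))
    return [records[i] for i in sorted(chosen)]

# Toy data: a common label ("Effusion") and a rare one ("Hernia").
records = [{"id": i, "labels": ["Effusion"]} for i in range(20)]
records += [{"id": 100 + i, "labels": ["Hernia"]} for i in range(3)]

subset = sample_per_label(records, n_per_label=5)
counts = defaultdict(int)
for rec in subset:
    for label in rec["labels"]:
        counts[label] += 1
print(dict(counts))
```

Because a chest X-ray can carry several labels, a record picked for one label also counts toward its other labels, so per-label totals in a real multi-label dataset can end up above the cap.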

In [None]:
from datasets import load_dataset

dataset = load_dataset("Martingkc/MediBert_Dataset")
# Split the original test split again: 80% becomes validation, 20% is held out for testing.
dataset_testvalid = dataset["test"].train_test_split(test_size=0.2)
dataset_train = dataset["train"]
dataset_test = dataset_testvalid["test"]
dataset_valid = dataset_testvalid["train"]

README.md:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/480M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/485M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2380 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2380 [00:00<?, ? examples/s]

## Examples from the DS
#### Image

In [None]:
dataset_train[2]["Image"]

#### Text

In [None]:
dataset_train[2]["Texts"]

### Formatting DS and defining a prompt
- Unsloth accepts data structured as follows, where `Q` is the prompt, `image` is the input image, and `A` is the reference answer:

```python
[
{ "role": "user",
  "content": [{"type": "text",  "text": Q}, {"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": A} ]
},
]
```
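
A small sanity check for this structure can catch formatting mistakes before training. The sketch below only verifies the shape shown above (one user turn with text and image, one assistant turn with text); the helper `check_conversation` is hypothetical and does not cover everything Unsloth's collator may accept.

```python
def check_conversation(messages):
    """Return True if messages is a user turn (text + image) followed by an assistant text turn."""
    if len(messages) != 2:
        return False
    user, assistant = messages
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return False
    user_types = sorted(part.get("type", "") for part in user.get("content", []))
    assistant_types = [part.get("type", "") for part in assistant.get("content", [])]
    return user_types == ["image", "text"] and assistant_types == ["text"]

# A placeholder string stands in for the PIL image here.
example = [
    {"role": "user",
     "content": [{"type": "text", "text": "Describe the image."},
                 {"type": "image", "image": "<image placeholder>"}]},
    {"role": "assistant",
     "content": [{"type": "text", "text": "A chest x-ray."}]},
]
print(check_conversation(example))
```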

In [None]:
instruction = """You are an expert radiologist AI assistant. Your task is to carefully and thoroughly analyze a given chest X-ray image and provide a concise, two-sentence summary. Your response MUST strictly adhere to the following format:

**Sentence 1 (Findings):**
This sentence MUST begin with "This photo of a chest x-ray shows ".
*   If there are **multiple abnormalities**, it must continue with: "multiple findings including [FINDING_1, FINDING_2, ...]." (e.g., "multiple findings including Atelectasis, Cardiomegaly, and Effusion.")
*   If there is **exactly one abnormality**, it must continue with: "a [FINDING] finding." (e.g., "a Hernia finding.")
*   If there are **no abnormalities**, it must continue with: "no findings." (e.g., "no findings.")

The ONLY permissible terms for [FINDING] are: Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Hernia, Pleural Thickening.

**Sentence 2 (Projection):**
This sentence MUST be exactly: "The image is taken from a [PROJECTION] view."
*   [PROJECTION] MUST be one of: "PA" or "AP".

**CRITICAL: Do not deviate from this two-sentence structure and wording. Use only the provided terms for findings.**
"""

def convert_to_conversation(sample):
    """Wrap one dataset sample (image + reference text) in the chat format shown above."""
    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : sample["Image"]} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : sample["Texts"]} ]
        },
    ]
    return { "messages" : conversation }

### Convert the DS to the Unsloth Format

In [None]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset_train]
# Use the named splits so that the validation and test data stay disjoint.
converted_dataset_valid = [convert_to_conversation(sample) for sample in dataset_valid]
converted_dataset_test = [convert_to_conversation(sample) for sample in dataset_test]

In [None]:
converted_dataset[0]

## Pre Finetune Test

In [None]:
FastVisionModel.for_inference(model)

image = dataset['train'][0]["Image"]

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)


### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`, with 2 epochs, a low learning rate of 5e-5, and a validation set that is evaluated every 100 steps. For the LoRA adapters we chose an r value of 32 and an alpha value of 32, applied to the vision, language, attention, and MLP layers.

In [None]:
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
from transformers import AutoProcessor



FastVisionModel.for_training(model)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer),
    train_dataset = converted_dataset,
    eval_dataset = converted_dataset_valid,

    args = SFTConfig(
        num_train_epochs = 2,                 # two full passes over the training set
        warmup_steps = 20,                    # short linear warm-up
        lr_scheduler_type = "cosine",         # smooth decay
        learning_rate = 5e-5,                 # gentle LR
        per_device_train_batch_size = 2,      # or bump to 4 if you can
        gradient_accumulation_steps = 4,      # effective batch size of 8 per device
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.02,
        logging_steps = 50,
        eval_strategy = "steps",
        eval_steps = 100,
        save_strategy = "steps",
        save_steps = 100,
        load_best_model_at_end = True,
        metric_for_best_model = "loss",
        report_to = "none",
        remove_unused_columns = False,
        dataset_text_field = "caption",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)
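
To get a feel for what these settings mean in optimizer steps, here is a back-of-the-envelope sketch. It assumes training runs on a single GPU and uses the 2380 training examples reported when the dataset was loaded; the exact step count reported by the trainer may differ slightly depending on rounding. The `lr_at` helper is our own approximation of linear warm-up followed by cosine decay, not a call into the trainer.

```python
import math

num_samples = 2380      # size of the train split
per_device_batch = 2
grad_accum = 4
epochs = 2
warmup_steps = 20
base_lr = 5e-5

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

def lr_at(step: int) -> float:
    """Linear warm-up to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"warm-up covers {warmup_fraction:.1%} of training")
for step in (0, 20, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions, evaluating and saving every 100 steps gives roughly five checkpoints over the run, and the 20 warm-up steps cover only a few percent of training.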

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.



In [None]:
from kaggle_secrets import UserSecretsClient

# Read the Hugging Face token from the Kaggle secrets store.
user_secrets = UserSecretsClient()
HF_TOKEN = user_secrets.get_secret("Hugging_Face_Token")

# Push the LoRA adapters and the tokenizer to the Hub.
model.push_to_hub("Martingkc/Qwen_2.5VL_3B_3_NIHCXR14_LORA", token = HF_TOKEN)
tokenizer.push_to_hub("Martingkc/Qwen_2.5VL_3B_3_NIHCXR14_LORA", token = HF_TOKEN)

### Test & Inference

In [None]:
dataset_test[2]["Texts"]

In [None]:
FastVisionModel.for_inference(model)

image = dataset_test[2]["Image"]

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

## Inference using LoRA Adapters on HF
Load our LoRA adapters. Unsloth downloads the adapters from the HF repo and applies them to the base model automatically.

In [None]:

from unsloth import FastVisionModel

MODEL_7B = "Martingkc/Qwen_2.5VL_NIHCXR14_LORA"
MODEL_3B = "Martingkc/Qwen_2.5VL_3B_3_NIHCXR14_LORA"
model, tokenizer = FastVisionModel.from_pretrained(
    model_name=MODEL_3B,
    max_seq_length=1024,
    load_in_4bit=True,
)




==((====))==  Unsloth 2025.5.6: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.4.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



In [None]:
from transformers import TextStreamer

def generate_caption(image):
    """Generate a caption for a single chest X-ray image."""
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}
    ]
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
    inputs = tokenizer(
        image,
        input_text,
        add_special_tokens = False,
        return_tensors = "pt",
    ).to("cuda")

    # Stream the completion to stdout while it is being generated.
    text_streamer = TextStreamer(tokenizer, skip_prompt = True)
    output_ids = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                                use_cache = True, temperature = 1.5, min_p = 0.1)
    # Each row of output_ids holds the prompt plus the completion for one input.
    decoded = [tokenizer.decode(sequence) for sequence in output_ids]

    # Keep only the assistant's reply and drop the end-of-turn marker.
    return "".join(decoded).split("<|im_start|>assistant\n")[-1].replace("<|im_end|>", "")
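
Since the captions follow a fixed two-sentence template, the findings and the projection can be recovered from a caption with simple string matching. The sketch below is one possible way to do it; `parse_caption` and `FINDINGS` are hypothetical names introduced here, and underscores are normalized because the captions may write `Pleural_Thickening` with an underscore.

```python
import re

FINDINGS = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration",
    "Mass", "Nodule", "Pneumonia", "Pneumothorax",
    "Consolidation", "Edema", "Emphysema", "Fibrosis",
    "Hernia", "Pleural Thickening",
]

def parse_caption(text):
    """Extract (findings, projection) from a caption in the two-sentence format."""
    normalized = text.replace("_", " ").lower()
    # Findings are returned in the order of FINDINGS, not the order in the caption.
    findings = [name for name in FINDINGS if name.lower() in normalized]
    match = re.search(r"from an? (ap|pa) view", normalized)
    projection = match.group(1).upper() if match else None
    return findings, projection

caption = ("This photo of a chest x-ray shows multiple findings including Infiltration, "
           "Mass, Nodule, Pleural_Thickening, and Pneumothorax. "
           "The image is taken from a PA view.")
print(parse_caption(caption))
```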


### Testing
We defined the following evaluation function. It takes two lists, the predicted texts and the reference (base) texts, extracts the labels from each, and returns the F1, recall, precision, and accuracy scores. It also computes the ROUGE scores between the reference and predicted texts. Unsloth did not give us access to the logits, so we were unable to compute probability-based scores such as ROC AUC.

In [None]:
!pip install rouge_score evaluate

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=14b42eb8407e2b8a370d602e7ba453e5f960bf77fd366e30bb54b19b7404eeb2
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score, evaluate
Successfully installed evaluate-0.4.3 rouge_score-0.1.2


In [None]:
import evaluate
import numpy as np
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score

rouge = evaluate.load('rouge')
label_cols = [
"Atelectasis", "Cardiomegaly", "Effusion", "Infiltration",
"Mass", "Nodule", "Pneumonia", "Pneumothorax",
"Consolidation", "Edema", "Emphysema", "Fibrosis",
"Hernia", "Pleural Thickening", "no findings","AP", "PA"
]

def eval_model(pred_texts, base_texts):
    """
    Parameters:
        - `pred_texts`: list of predicted sentences (100 in our test)
        - `base_texts`: list of reference sentences, in the same order
    """
    pred_labels = []
    base_labels = []
    for p, b in zip(pred_texts, base_texts):
        # Normalize case and underscores so "Pleural_Thickening" matches "Pleural Thickening".
        p = p.lower().replace("_", " ")
        b = b.lower().replace("_", " ")

        # Build one multi-hot vector per text via substring matching.
        p_vec = []
        b_vec = []
        for lbl in label_cols:
            lbl_lc = lbl.lower()
            p_vec.append(1 if lbl_lc in p else 0)
            b_vec.append(1 if lbl_lc in b else 0)

        pred_labels.append(p_vec)
        base_labels.append(b_vec)

    rouge_scores = rouge.compute(predictions=pred_texts, references=base_texts)
    # scikit-learn metrics expect (y_true, y_pred), i.e. the references first.
    f1 = f1_score(base_labels, pred_labels, average='macro', zero_division=0)
    recall = recall_score(base_labels, pred_labels, average='micro', zero_division=0)
    precision = precision_score(base_labels, pred_labels, average='micro', zero_division=0)
    accuracy = accuracy_score(base_labels, pred_labels)

    return {
        "f1@100": f1,
        "recall@100": recall,
        "accuracy@100": accuracy,
        "precision@100": precision,
        "rouge_scores@100": rouge_scores
    }
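
To illustrate what these metrics measure on multi-hot label vectors, here is a minimal pure-Python sketch with toy data. The helper names are ours. Two things are worth noting: scikit-learn's metric functions take the reference labels first (`y_true`, then `y_pred`), and swapping the two swaps precision and recall; and `accuracy_score` on multi-label input is subset accuracy, so a sample only counts as correct when every label matches, which is why it tends to be much lower than the other scores.

```python
def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall over all (sample, label) pairs."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector matches exactly."""
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    return exact / len(y_true)

# Toy example: 2 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 0]]

print(micro_precision_recall(y_true, y_pred))   # references first
print(micro_precision_recall(y_pred, y_true))   # swapped arguments swap the two scores
print(subset_accuracy(y_true, y_pred))
```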


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [None]:
import random

# Draw 100 distinct random test indices (starting at 0 so the first sample is eligible too).
test_samples_100 = dataset_test.select(random.sample(range(len(dataset_test)), 100))

base_text = []
pred_text = []

for row in test_samples_100:
    base_text.append(row["Texts"])
    pred_text.append(generate_caption(row["Image"]))






This photo of a chest x-ray shows multiple findings including Atelectasis, Infiltration, and Mass. The image is taken from a AP view.<|im_end|>
This photo of a chest x-ray shows no findings available. The image is taken from a PA view.<|im_end|>
This photo of a chest x-ray shows multiple findings including Atelectasis, Emphysema, Infiltration, Mass, and Pneumonia. The image is taken from a PA view.<|im_end|>
This photo of a chest x-ray shows multiple findings including Infiltration, Mass, Nodule, Pleural_Thickening, and Pneumothorax. The image is taken from a PA view.<|im_end|>
This photo of a chest x-ray shows no findings available. The image is taken from a AP view.<|im_end|>
This photo of a chest x-ray shows multiple findings including Emphysema, Mass, Nodule. The image is taken from a PA view.<|im_end|>
This photo of a chest x-ray shows multiple findings including Effusion, Infiltration, Pneumothorax, and Consolidation. The image is taken from a PA view.<|im_end|>
This photo of a c

In [None]:
eval_model(pred_text, base_text)

{'f1@100': 0.19134492879408804,
 'recall@100': 0.3523489932885906,
 'accuracy@100': 0.05,
 'precision@100': 0.38181818181818183,
 'rouge_scores@100': {'rouge1': 0.8294648597229838,
  'rouge2': 0.7386126460556653,
  'rougeL': 0.8279210471233425,
  'rougeLsum': 0.8278508397958884}}

## Results for the 7B-Parameter Qwen 2.5 VL
{'f1@100': 0.1685669275460801,
 'recall@100': 0.3839541547277937,
 'accuracy@100': 0.01,
 'precision@100': 0.47017543859649125,
 'rouge_scores@100': {'rouge1': 0.8524994674908319,
  'rouge2': 0.7473405414870823,
  'rougeL': 0.8443709933441017,
  'rougeLsum': 0.8439830183982576}}