# Fine-Tuning a Vision Language Model (Qwen2.5-VL-3B) with the Hugging Face Ecosystem (TRL) to Annotate Alchemic Objects in Historic Book Illustrations

Notebook based on this resource: https://huggingface.co/learn/cookbook/fine_tuning_vlm_trl

Training performed on an NVIDIA RTX A6000 on Hyperstack: https://www.hyperstack.cloud/

## 1. Setup

Let’s start by installing the essential libraries we’ll need for fine-tuning! 🚀


In [3]:
!pip install  -U -q git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git datasets bitsandbytes peft qwen-vl-utils wandb accelerate
# Tested with transformers==4.47.0.dev0, trl==0.12.0.dev0, datasets==3.0.2, bitsandbytes==0.44.1, peft==0.13.2, qwen-vl-utils==0.0.8, wandb==0.18.5, accelerate==1.0.1

In [4]:
!pip install -U -q torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121

Let's check the available resources.

In [5]:
!nvidia-smi

Sat Mar  8 09:57:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A6000               Off |   00000000:00:06.0 Off |                  Off |
| 30%   32C    P8             12W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

Log in to Hugging Face to upload your fine-tuned model! 🗝️

You’ll need to authenticate with your Hugging Face account to save and share your model directly from this notebook.


In [7]:
import huggingface_hub
huggingface_hub.login()


## 2. Load Dataset 📁

In this section, we’ll load the dataset. This dataset contains book pages with illustrations paired with the desired annotation.

Next, we’ll generate a system message for the VLM. In this case, we want to create a system that acts as an expert in analyzing historic illustrations and provides an output in JSON format.

In [8]:
system_message = """You are a Vision Language Model specialized in interpreting alchemic objects in historic book illustrations.
Your task is to analyze the provided book page and respond in JSON format only. In total there are 12 classes of alchemic objects you should detect.
The classes are: ampullae, animal, cucurbitae, cucurbitae-ambix, ollae, cucurbitae-retorte, cucurbitae-rosenhut, furnace, human, mineral-metal, other-equipment, plant.
Respond in a well-structured JSON format in which you always name all of the 12 classes and then their number of occurences in the provided image.
An output should look like this, for example: 
{
  "ampullae": 0,
  "animal": 0,
  "cucurbitae": 0,
  "cucurbitae-ambix": 0,
  "cucurbitae-retorte": 0,
  "cucurbitae-rosenhut": 0,
  "furnace": 0,
  "human": 0,
  "mineral-metal": 0,
  "other-equipment": 0,
  "plant": 0,
  "ollae": 0
}
Focus on delivering accurate, succinct answers based on the visual information. 
Don't add any additional explanation."""
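Because the system message asks for JSON with a fixed set of keys, a model answer can be checked mechanically. The following is a minimal sketch of such a check; the helper name `parse_counts` is hypothetical and not part of the original notebook, and the class list is copied from the system message above.

```python
import json

# The 12 class names listed in the system message.
CLASSES = [
    "ampullae", "animal", "cucurbitae", "cucurbitae-ambix", "ollae",
    "cucurbitae-retorte", "cucurbitae-rosenhut", "furnace", "human",
    "mineral-metal", "other-equipment", "plant",
]


def parse_counts(answer: str) -> dict:
    """Parse a JSON answer and verify it names exactly the 12 classes
    with non-negative integer counts (hypothetical helper)."""
    counts = json.loads(answer)
    if not isinstance(counts, dict):
        raise ValueError("answer is not a JSON object")
    missing = set(CLASSES) - set(counts)
    unexpected = set(counts) - set(CLASSES)
    if missing or unexpected:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(unexpected)}")
    for name, value in counts.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"invalid count for {name!r}: {value!r}")
    return counts


# Usage: an all-zero answer, as in the example inside the system message.
example_answer = json.dumps({name: 0 for name in CLASSES})
print(parse_counts(example_answer)["furnace"])
```

A check like this could be applied to generated answers after fine-tuning to measure how often the model follows the requested format.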

Because the task is the same for every sample, we use one general query.

In [9]:
query = """Which alchemic objects are in the image?
Provide an answer in a well-structured JSON format with 12 classes (ampullae, animal, cucurbitae, cucurbitae-ambix,
ollae, cucurbitae-retorte, cucurbitae-rosenhut, furnace, human, mineral-metal,
other-equipment, plant) and the number of their occurences."""

We’ll format the dataset into a chatbot structure for interaction. Each interaction will consist of a system message, followed by the image and the query, and finally, the answer to the query.

In [10]:
def format_data(sample):
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": system_message
                }
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": sample["image"],
                },
                {
                    "type": "text",
                    "text": query,
                }
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": sample["labels"]
                }
            ],
        },
    ]
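The structure returned by `format_data` includes the assistant turn, which is what the model learns from during training. For inference or evaluation, one would typically pass only the system and user turns and keep the assistant text as the reference. The sketch below illustrates this on a toy conversation with the same shape; `to_inference_prompt` is a hypothetical helper and the placeholder contents are assumptions, not data from the notebook.

```python
def to_inference_prompt(conversation):
    """Return the conversation without assistant turns (hypothetical helper)."""
    return [turn for turn in conversation if turn["role"] != "assistant"]


# Toy conversation with the same layout as the output of format_data.
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "system prompt"}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": None},  # placeholder instead of a real image
            {"type": "text", "text": "query"},
        ],
    },
    {"role": "assistant", "content": [{"type": "text", "text": '{"furnace": 1}'}]},
]

prompt = to_inference_prompt(conversation)
reference = conversation[-1]["content"][0]["text"]
print([turn["role"] for turn in prompt], reference)
```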

In [11]:
from datasets import load_dataset

dataset_id = "maychnix/AlchObj"
train_dataset, eval_dataset, test_dataset = load_dataset(dataset_id, split=['train', 'valid', 'test'])

Generating train split:   0%|          | 0/434 [00:00<?, ? examples/s]

Generating valid split:   0%|          | 0/93 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/94 [00:00<?, ? examples/s]

Let’s take a look at the structure of the dataset. It includes an image, the labels (which serve as the answer), and two other features (`idx` and `source`) that we’ll discard.


In [12]:
train_dataset

Dataset({
    features: ['image', 'labels', 'idx', 'source'],
    num_rows: 434
})
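Before formatting, it can be useful to see how often each class occurs across a split. The sketch below assumes that each `labels` entry is a JSON string mapping class names to counts, as the system message suggests; this is an assumption about the dataset, and the toy strings stand in for real labels. `class_totals` is a hypothetical helper.

```python
import json
from collections import Counter


def class_totals(label_strings):
    """Sum the per-class counts over a list of JSON label strings."""
    totals = Counter()
    for label in label_strings:
        # Counter.update adds the counts of a mapping to the running totals.
        totals.update(json.loads(label))
    return totals


# Toy labels standing in for sample["labels"] values.
toy_labels = [
    '{"furnace": 1, "human": 2}',
    '{"furnace": 2, "human": 0}',
]
totals = class_totals(toy_labels)
print(totals.most_common())
```

With the real data, the input could be built as `[sample["labels"] for sample in train_dataset]` before the dataset is converted to the chat structure.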

Now, let’s format the data using the chatbot structure. This will allow us to set up the interactions appropriately for our model.


In [13]:
train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]
test_dataset = [format_data(sample) for sample in test_dataset]

In [14]:
train_dataset[200]

[{'role': 'system',
  'content': [{'type': 'text',
    'text': 'You are a Vision Language Model specialized in interpreting alchemic objects in historic book illustrations.\nYour task is to analyze the provided book page and respond in JSON format only. In total there are 12 classes of alchemic objects you should detect.\nThe classes are: ampullae, animal, cucurbitae, cucurbitae-ambix, ollae, cucurbitae-retorte, cucurbitae-rosenhut, furnace, human, mineral-metal, other-equipment, plant.\nRespond in a well-structured JSON format in which you always name all of the 12 classes and then their number of occurences in the provided image.\nAn output should look like this, for example: \n{\n  "ampullae": 0,\n  "animal": 0,\n  "cucurbitae": 0,\n  "cucurbitae-ambix": 0,\n  "cucurbitae-retorte": 0,\n  "cucurbitae-rosenhut": 0,\n  "furnace": 0,\n  "human": 0,\n  "mineral-metal": 0,\n  "other-equipment": 0,\n  "plant": 0,\n  "ollae": 0\n}\nFocus on delivering accurate, succinct answers based on the

## 3. Fine-Tune the Model using TRL


### Load the Quantized Model for Training ⚙️

Next, we’ll load the quantized model using [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index). 

In [29]:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info

model_id="Qwen/Qwen2.5-VL-3B-Instruct"

In [20]:
from transformers import BitsAndBytesConfig

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model with 4-bit quantization (the processor is loaded below)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)

MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28

processor = Qwen2_5_VLProcessor.from_pretrained(model_id, 
                                                min_pixels=MIN_PIXELS, 
                                                max_pixels=MAX_PIXELS)
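The `MIN_PIXELS` and `MAX_PIXELS` values bound how many visual tokens an image can produce. As a rough sketch, assuming that one visual token covers about a 28 × 28 pixel area (which is what the `28 * 28` factor in the constants suggests), the limits translate to between 256 and 1280 visual tokens per image. The function `approx_visual_tokens` is hypothetical and ignores the rounding of image sides that the real processor performs.

```python
PATCH = 28  # assumed side length (pixels) of the area covered by one visual token
MIN_PIXELS = 256 * PATCH * PATCH
MAX_PIXELS = 1280 * PATCH * PATCH


def approx_visual_tokens(width: int, height: int) -> int:
    """Rough estimate of visual tokens after clamping the pixel count."""
    pixels = width * height
    clamped = min(max(pixels, MIN_PIXELS), MAX_PIXELS)
    return clamped // (PATCH * PATCH)


print(MIN_PIXELS, MAX_PIXELS)            # pixel budget per image
print(approx_visual_tokens(2000, 3000))  # a large scan is capped at the maximum
print(approx_visual_tokens(100, 100))    # a tiny image is raised to the minimum
```

Lowering `MAX_PIXELS` shortens the sequences and saves memory, at the price of less visual detail for small objects on a page.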

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



### Set Up QLoRA and SFTConfig 🚀

Next, we will configure [QLoRA](https://github.com/artidoro/qlora) for our training setup. QLoRA enables efficient fine-tuning of large language models while significantly reducing the memory footprint compared to full fine-tuning. Standard LoRA saves memory by freezing the base model and training only small low-rank adapter matrices. QLoRA goes a step further by also quantizing the frozen base model weights to 4-bit (here NF4 with double quantization, as set in `bnb_config` above), while the LoRA adapters themselves stay in higher precision and remain trainable. This lowers memory requirements further while aiming to keep quality close to that of 16-bit fine-tuning.




In [21]:
from peft import LoraConfig, get_peft_model

# Configure LoRA
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters
peft_model.print_trainable_parameters()

trainable params: 1,843,200 || all params: 3,756,466,176 || trainable%: 0.0491
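The trainable parameter count can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters. The sketch below assumes the language model dimensions of Qwen2.5-VL-3B (hidden size 2048, 36 layers, 16 attention heads, 2 key/value heads); these values are assumptions taken from the model configuration, not printed in this notebook.

```python
# Assumed language model dimensions (not shown in the notebook output).
HIDDEN_SIZE = 2048
NUM_LAYERS = 36
NUM_HEADS = 16
NUM_KV_HEADS = 2
RANK = 8  # r in the LoraConfig above

head_dim = HIDDEN_SIZE // NUM_HEADS  # size of one attention head
q_out = NUM_HEADS * head_dim         # output width of q_proj
v_out = NUM_KV_HEADS * head_dim      # output width of v_proj (grouped-query attention)


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters of one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = lora_params(HIDDEN_SIZE, q_out, RANK) + lora_params(HIDDEN_SIZE, v_out, RANK)
total = per_layer * NUM_LAYERS
print(per_layer, total)
```

Under these assumptions the total is 1,843,200, which matches the number reported above and suggests that only the attention projections of the language model, not the vision encoder, receive adapters with this `target_modules` setting.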


We will use Supervised Fine-Tuning (SFT) to refine our model’s performance on the task at hand. To do this, we'll define the training arguments using the [SFTConfig](https://huggingface.co/docs/trl/sft_trainer) class from the [TRL library](https://huggingface.co/docs/trl/index). SFT allows us to provide labeled data, helping the model learn to generate more accurate responses based on the input it receives. This approach ensures that the model is tailored to our specific use case, leading to better performance in understanding and responding to visual queries.






In [22]:
from trl import SFTConfig

# Configure training arguments
training_args = SFTConfig(
    output_dir="qwen2.5-3b-instruct-trl-sft-AlchObj",  # Directory to save the model
    num_train_epochs=12,  # Number of training epochs
    per_device_train_batch_size=4,  # Batch size for training
    per_device_eval_batch_size=4,  # Batch size for evaluation
    gradient_accumulation_steps=8,  # Steps to accumulate gradients
    gradient_checkpointing=True,  # Enable gradient checkpointing for memory efficiency
    # Optimizer and scheduler settings
    optim="adamw_torch_fused",  # Optimizer type
    learning_rate=2e-4,  # Learning rate for training
    lr_scheduler_type="constant",  # Type of learning rate scheduler
    # Logging and evaluation
    logging_steps=10,  # Steps interval for logging
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",  # Strategy for evaluation
    save_strategy="steps",  # Strategy for saving the model
    save_steps=20,  # Steps interval for saving
    metric_for_best_model="eval_loss",  # Metric to evaluate the best model
    greater_is_better=False,  # Whether higher metric values are better
    load_best_model_at_end=True,  # Load the best model after training
    # Mixed precision and gradient settings
    bf16=True,  # Use bfloat16 precision; v100 doesn't support: https://github.com/lm-sys/FastChat/issues/399
    tf32=True,  # Use TensorFloat-32 precision
    max_grad_norm=0.3,  # Maximum norm for gradient clipping
    warmup_ratio=0.03,  # Ratio of total steps for warmup; has no effect with the plain "constant" scheduler set above
    # Hub and reporting
    push_to_hub=True,  # Whether to push model to Hugging Face Hub
    report_to="wandb",  # Reporting tool for tracking metrics
    # Gradient checkpointing settings
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Options for gradient checkpointing
    # Dataset configuration
    dataset_text_field="",  # Text field in dataset
    dataset_kwargs={"skip_prepare_dataset": True},  # Additional dataset options
    #max_seq_length=1024  # Maximum sequence length for input
)

training_args.remove_unused_columns = False  # Keep unused columns in dataset
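The batch settings above determine how many optimizer steps the run takes. The sketch below works this out for the 434 training samples, assuming a single GPU and assuming that a partial accumulation window at the end of an epoch is not counted as its own step (floor division). The second assumption is chosen because it reproduces the 156 steps that the training run reports.

```python
import math

train_samples = 434   # size of the train split
per_device_batch = 4  # per_device_train_batch_size
grad_accum = 8        # gradient_accumulation_steps
epochs = 12           # num_train_epochs

# Samples contributing to one optimizer update (single GPU assumed).
effective_batch = per_device_batch * grad_accum

# Mini-batches per epoch; the last one may be smaller.
batches_per_epoch = math.ceil(train_samples / per_device_batch)

# Optimizer updates per epoch (assumed floor division, see lead-in).
updates_per_epoch = batches_per_epoch // grad_accum

total_steps = updates_per_epoch * epochs
print(effective_batch, batches_per_epoch, updates_per_epoch, total_steps)
```

With `eval_steps=10`, this corresponds to an evaluation roughly every 0.8 epochs.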

### Training the Model 🏃

We will log our training progress using [Weights & Biases (W&B)](https://wandb.ai/). Let’s connect our notebook to W&B to capture essential information during training.


In [25]:
import wandb
import datetime

wandb.login()


now = str(datetime.datetime.now())
now = now.replace(" ", "").replace(":",".")

wandb.init(
    project=("qwen2.5-3b-instruct-trl-sft-AlchObj" + now),  
    name=("qwen2.5-3b-instruct-trl-sft-AlchObj" + now),  
    config=training_args,
)

We need a collator function to properly retrieve and batch the data during the training procedure. This function will handle the formatting of our dataset inputs, ensuring they are correctly structured for the model. Let's define the collator function below.


In [26]:
# Create a data collator to encode text and image pairs
def collate_fn(examples):
    # Get the texts and images, and apply the chat template
    texts = [processor.apply_chat_template(example, tokenize=False) for example in examples]  # Prepare texts for processing
    image_inputs = [process_vision_info(example)[0] for example in examples]  # Process the images to extract inputs

    # Tokenize the texts and process the images
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)  # Encode texts and images into tensors

    # The labels are the input_ids, and we mask the padding tokens in the loss computation
    labels = batch["input_ids"].clone()  # Clone input IDs for labels
    labels[labels == processor.tokenizer.pad_token_id] = -100  # Mask padding tokens in labels

    # Ignore the image token index in the loss computation (model specific)
    if isinstance(processor, Qwen2_5_VLProcessor):  # Check if the processor is a Qwen2_5_VLProcessor
        image_tokens = [151652, 151653, 151655]  # Image-related special token IDs for Qwen2.5-VL
    else:
        image_tokens = [processor.tokenizer.convert_tokens_to_ids(processor.image_token)]  # Convert image token to ID

    # Mask image token IDs in the labels
    for image_token_id in image_tokens:
        labels[labels == image_token_id] = -100  # Mask image token IDs in labels

    batch["labels"] = labels  # Add labels to the batch

    return batch  # Return the prepared batch
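The masking step in the collator can be illustrated without a model. In the sketch below, plain Python lists stand in for the tensors. The three image-related IDs are those used in `collate_fn`; the pad ID of `0` and the other token IDs are made up for the example. Positions labelled `-100` are skipped by PyTorch's cross-entropy loss, whose default `ignore_index` is `-100`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
PAD_ID = 0           # hypothetical pad token ID for this toy example
IMAGE_TOKEN_IDS = {151652, 151653, 151655}  # IDs masked in collate_fn


def mask_labels(input_ids, pad_id=PAD_ID, image_token_ids=IMAGE_TOKEN_IDS):
    """Copy input_ids into labels, replacing pad and image tokens by -100."""
    return [
        IGNORE_INDEX if token == pad_id or token in image_token_ids else token
        for token in input_ids
    ]


# One toy sequence: text token, image span, two text tokens, two pads.
row = [101, 151652, 151655, 151655, 151653, 2040, 2041, 0, 0]
labels = mask_labels(row)
print(labels)
```

Note that, as in `collate_fn`, the remaining text tokens of the system message and query still contribute to the loss; only padding and image tokens are excluded.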

Now, we will define the [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer), which is a wrapper around the [transformers.Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class and inherits its attributes and methods. This class simplifies the fine-tuning process by properly initializing the [PeftModel](https://huggingface.co/docs/peft/v0.6.0/package_reference/peft_model) when a [PeftConfig](https://huggingface.co/docs/peft/v0.6.0/en/package_reference/config#peft.PeftConfig) object is provided. By using `SFTTrainer`, we can efficiently manage the training workflow and ensure a smooth fine-tuning experience for our Vision Language Model.



In [27]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    processing_class=processor.tokenizer,  # see https://github.com/huggingface/trl/pull/2162
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Time to Train the Model! 🎉

In [30]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]


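| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10 | 1.8864 | 1.581891 |
| 20 | 1.4621 | 1.252273 |
| 30 | 1.1539 | 0.96552 |
| 40 | 0.8234 | 0.6144 |
| 50 | 0.4421 | 0.255934 |
| 60 | 0.1493 | 0.053951 |
| 70 | 0.0366 | 0.029417 |
| 80 | 0.0239 | 0.020717 |
| 90 | 0.0165 | 0.018322 |
| 100 | 0.0164 | 0.017704 |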


Could not locate the best model at qwen2.5-3b-instruct-trl-sft-AlchObj/checkpoint-150/pytorch_model.bin, if you are running a distri

TrainOutput(global_step=156, training_loss=0.39060245998776877, metrics={'train_runtime': 17910.1627, 'train_samples_per_second': 0.291, 'train_steps_per_second': 0.009, 'total_flos': 1.8508428415873843e+17, 'train_loss': 0.39060245998776877})

Let's save the results 💾

In [31]:
trainer.save_model(training_args.output_dir)