<a href="https://colab.research.google.com/github/agalashov/m2ls_vlm_tutorial_private/blob/main/vlm_tutorial_practical_3_students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# M2LS 2025: Vision-Language Models -- Practical 3
---
- Alexandre Galashov (agalashov@google.com)
- Petra Bevandic (Petra.Bevandic@fer.hr)

In this practical session, you'll learn how to use and adapt **Vision-Language Models (VLMs)** for real-world tasks without having to train them from scratch.

Large and powerful VLMs, such as [Gemma-3-12B-IT](https://huggingface.co/google/gemma-3-12b-it), will often perform well on a wide variety of real-world tasks straight out of the box. Smaller and **more efficient models** may also be used for specific use cases, but they will often require additional fine-tuning. This is not a major limitation, as relatively inexpensive and efficient fine-tuning methods are being developed alongside large foundation models. In this practical, we will use [**Low-Rank Adaptation (LoRA)**](https://arxiv.org/abs/2106.09685).

**In this tutorial, you will:**

1. Evaluate a pre-trained VLM by testing its out-of-the-box (zero-shot) performance on the task of generating Amazon product descriptions.

2. Fine-tune a smaller VLM using LoRA to significantly boost its effectiveness.

**Disclaimer**: You will mainly be required to complete the code blocks marked **"Your code here"**. We took care of most of the boilerplate code for you. However, please feel free to deviate from the code we prepared and implement things in the way you feel is right!

---

## Compute requirements

This notebook requires a GPU with large memory in order to train a model. If you do not have access to such a GPU, we also provide a fine-tuned model for you. It is available as `"agalashov/vlm-tutorial-finetuned-llm-final"` on Hugging Face.

## Install dependencies

In [None]:
!pip install  -U -q transformers trl datasets bitsandbytes peft accelerate

## Preliminary Setup (Hugging Face account)
---

1. Create a Hugging Face account if you do not already have one.

2. Create an access token on Hugging Face (if you have not done so already).

3. Either store the token as the `HF_TOKEN` secret in Colab secrets or paste it into `MANUALLY_ENTERED_HF_TOKEN` below.

In [None]:
from google.colab import userdata
from huggingface_hub import login

MANUALLY_ENTERED_HF_TOKEN = '' # If the `HF_TOKEN` secret is not set, enter your token here.

try:
  HF_TOKEN = userdata.get('HF_TOKEN')
except userdata.SecretNotFoundError:
  HF_TOKEN = MANUALLY_ENTERED_HF_TOKEN

login(token=HF_TOKEN)

## Exercise 1 - Zero-shot generation of Amazon product descriptions

In this exercise, we will see how we can use VLMs to automatically generate Amazon product descriptions given their images.

To do this, we will:

1. Load a pretrained VLM and inspect its components;
2. Load the dataset of Amazon product descriptions and inspect the data;
3. Create a prompt from the data that can be used for inference with the pretrained VLM;
4. Inspect the performance of this pretrained VLM on the product descriptions.

We start by importing the required modules.

In [None]:
#@title Run imports
from datasets import load_dataset
from PIL import Image
import torch

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

### **Prep task 1**: Load the model

We will load one of two models depending on the value of `SMOL_MODEL`, either [`SmolVLM-256M-Base`](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Base) (when `SMOL_MODEL=True`) or [`Gemma-3-4B`](https://huggingface.co/google/gemma-3-4b-pt) (when `SMOL_MODEL=False`).

[`SmolVLM-256M-Base`](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Base) is one of the smallest VLMs available on Hugging Face. This makes it a good choice for getting acquainted with the fine-tuning task, as it requires significantly fewer computational resources than [`Gemma-3-4B`](https://huggingface.co/google/gemma-3-4b-pt) (or any other bigger model). Keep in mind that its small size will limit its performance in real-world applications.

In [None]:
SMOL_MODEL = True # Choose whether to use a small model or not

In [None]:
from transformers import AutoModelForImageTextToText, AutoProcessor

if SMOL_MODEL:
  model_id = "HuggingFaceTB/SmolVLM-256M-Base"
  chat_template_model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
else:
  model_id = "google/gemma-3-4b-pt"
  chat_template_model_id = "google/gemma-3-4b-it"

processor = AutoProcessor.from_pretrained(chat_template_model_id)
model = AutoModelForImageTextToText.from_pretrained(
      model_id,
      device_map="auto",
      torch_dtype=torch.bfloat16,
      # In principle, you could use flash attention to speed up performance.
      _attn_implementation="eager",
  ).to(DEVICE)

Have a look at the architecture of the model.

In [None]:
model

If you are using `SmolVLM`, the underlying model class is `Idefics3Model`. Its documentation is available here:

https://huggingface.co/docs/transformers/v4.53.3/en/model_doc/idefics3#transformers.Idefics3Model

The corresponding code is available here:

https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/models/idefics3/modeling_idefics3.py#L601

**Question**: How are the visual and text inputs processed by the model? You can study the code [here](https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/models/idefics3/modeling_idefics3.py#L601).



**Answer**: To fill

In [None]:
print('Do not forget to answer! :)')

### **Prep Task 2**: Load and inspect dataset

We load the dataset of Amazon product descriptions. Please inspect the entries of the dataset.

In [None]:
# Load dataset from the hub
dataset = load_dataset("philschmid/amazon-product-descriptions-vlm", split="train")

Inspect the first element of the dataset and the information it contains.

In [None]:
data = dataset[0]
data

The image of the product is stored in the `image` value.

In [None]:
data['image']

The corresponding product description is stored in the `description` entry.

In [None]:
data['description']

The data entry also contains some additional metadata, which can be useful.

In [None]:
print(data["Product Name"])
print(data["Category"])
# ....

### Zero shot generation of product descriptions

For this part of the practical, we will use a pre-trained VLM to automatically generate product descriptions in a structured format.

We will be doing this in a **zero-shot setting**. This means we will prompt the model to perform the task directly, relying entirely on its pre-trained capabilities without any additional training or fine-tuning.

#### Visual Question Answering (VQA) task

VLMs are trained to solve the **Visual Question Answering (VQA) task**: given a **text question** and **an image** (or multiple images), they are supposed to provide a text **answer**.

When querying a VLM, we provide a `query_prompt` which contains the `question` and special image tokens (often `<image>`). The image input is provided separately, and the `<image>` token is replaced by the corresponding fixed-size image embedding.

Here is an example of a `query_prompt`:

`query_prompt="What do you see?<image>"`

A pseudocode of a VLM query:

`answer=vlm(query_prompt=query_prompt,image=image)`

When calling `vlm`, the `query_prompt` is processed as follows:

* We embed the question `What do you see?` into a `[text_embedding]`. For example, it can have shape `<num_text_tokens,embed_dim>`, where `num_text_tokens` is the number of text tokens in the question and `embed_dim` is some fixed embedding dimension.

* We embed the image `image` into an `[image_embedding]` of fixed size `<fixed_num_image_tokens,embed_dim>`, where `fixed_num_image_tokens` is a hardcoded number of image tokens (often 256) and `embed_dim` is the same embedding dimension as for text.

* Both embeddings are combined into `query_embedding=[text_embedding][image_embedding]`, a matrix of shape `<num_text_tokens+fixed_num_image_tokens, embed_dim>`.

* The `query_embedding` is then processed by the corresponding transformer (LLM).
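
The steps above can be sketched with plain Python lists standing in for tensors. The sizes (`EMBED_DIM = 4`, three image tokens) and the helpers `embed_text` and `embed_image` are made up for illustration; a real VLM uses learned token embeddings and a vision encoder instead.

```python
# Toy sketch: combine text and image embeddings into one sequence.
# All sizes and helper names are illustrative assumptions, not the model's API.
EMBED_DIM = 4
FIXED_NUM_IMAGE_TOKENS = 3  # real models often use many more (e.g. 256)

def embed_text(tokens):
    # One EMBED_DIM-sized vector per text token (dummy values).
    return [[float(len(tok))] * EMBED_DIM for tok in tokens]

def embed_image(image):
    # A fixed number of vectors, regardless of the image content.
    return [[0.0] * EMBED_DIM for _ in range(FIXED_NUM_IMAGE_TOKENS)]

text_tokens = ["What", "do", "you", "see", "?"]
text_embedding = embed_text(text_tokens)
image_embedding = embed_image(image=None)

# Concatenate along the sequence dimension.
query_embedding = text_embedding + image_embedding
shape = (len(query_embedding), len(query_embedding[0]))
print(shape)  # (8, 4)
```

The sequence length grows with the number of image tokens, while the embedding dimension stays the same for both modalities.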

#### Chat format


VLM prompts should follow a consistent format, which is often VLM-specific. This means that we will have to carefully design both the `query_prompt` and the `answer` to match the expected structure.

A widely used option is the chat template provided by the [`processor.apply_chat_template`](https://huggingface.co/docs/transformers/en/chat_templating) function. Since many LLMs and VLMs are trained with conversational data, they tend to perform better when prompts and responses are expressed in this format as well.

When constructing a prompt for a VLM, we typically provide the following entries:

* **System prompt**: A short instruction to guide the overall VLM behaviour. Typically fixed for all different **User prompts**.

* **User prompt**: The actual input which contains an optional task description, a question and the associated image.

* **Assistant prompt**: The expected response that needs to be generated by the model (needed only for training/fine-tuning).

The **chat format** for a system identifying animals in images would look something like this:

```
messages = [
  {
    "role": "system", "content": [
      {"type": "text", "text": "You are friendly AI."}
    ]
  },
  {
    "role": "user", "content": [
      {"type": "text", "text": "What do you see on the picture?"},
      {"type": "image", "image": "<image>"}
    ]
  },
  # You do not add this in the inference phase
  {
    "role": "assistant", "content": [
      {"type": "text", "text": "This is a picture of a cat."}
    ]
  },
]
```


### **Task 1**: Create a `system_prompt` and a `user_prompt`

Come up with a potential `system_prompt` and `user_prompt` for product descriptions. You can use any information from `data` in both prompts. Just do not include `description`, since it is the answer. :)

In [None]:
################################################################################
# YOUR CODE HERE
################################################################################
...
system_prompt = "" # TO FILL
user_prompt = "" # TO FILL

### **Task 2**: Create chat-style prompt.

Write a function that generates a chat-style prompt for product description (see above). The function should follow the chat format.

Since this function will also be reused in the fine-tuning exercise, include an option to control whether an assistant (answer) message is added (e.g., through a parameter like `add_assistant_message`).

In [None]:
def format_data(data, system_prompt, user_prompt, add_assistant_message=True):
  messages = []
  ##############################################################################
  # YOUR CODE HERE
  ##############################################################################
  ...
  return messages

In [None]:
example = {'Product Name': 'Hello', 'Category': 'Cheese.', 'description': 'This is cheesecake.'}
formatted_data = format_data(example, system_prompt, user_prompt, add_assistant_message=True)

Take a look at `formatted_data` and confirm that it is consistent with the chat template.

In [None]:
formatted_data

Let's see what we get as a result of `apply_chat_template`.

In [None]:
text = processor.apply_chat_template(formatted_data, add_generation_prompt=False, tokenize=False)
text

Note that we use `add_generation_prompt=False`. If you set `add_generation_prompt=True`, an additional `"Assistant:"` string is appended to the end of the prompt, encouraging the VLM to respond.

See how it behaves:

In [None]:
text = processor.apply_chat_template(formatted_data, add_generation_prompt=True, tokenize=False)
text

However, this is not ideal, because the prompt now contains two `"Assistant:"` strings.

Here is how we fix it:

In [None]:
example = {'Product Name': 'Hello', 'Category': 'Cheese.', 'description': 'This is cheesecake.'}
formatted_data = format_data(example, system_prompt, user_prompt, add_assistant_message=False)
text = processor.apply_chat_template(formatted_data, add_generation_prompt=True, tokenize=False)
text

Ultimately, the prompt that ends with `"Assistant:"` is the one we want to use for **inference**, while the prompt that already contains a pre-defined answer (`"Assistant: some answer"`) should be used for **training**.
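
To make the difference between the two prompt types concrete, here is a toy renderer. It is a deliberate simplification and not the processor's real template: the function `render_chat` and the exact output layout are assumptions made for illustration only.

```python
# Toy chat renderer (a simplification, NOT the processor's real template).
def render_chat(messages, add_generation_prompt):
    parts = []
    for message in messages:
        role = message["role"].capitalize()
        text = "".join(
            "<image>" if item["type"] == "image" else item["text"]
            for item in message["content"]
        )
        parts.append(f"{role}: {text}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so that the model completes it.
        parts.append("Assistant:")
    return "".join(parts)

user_turn = {"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the product."},
]}
assistant_turn = {"role": "assistant", "content": [
    {"type": "text", "text": "A red mug."},
]}

# Training: the answer is part of the text, no generation prompt.
training_text = render_chat([user_turn, assistant_turn], add_generation_prompt=False)
# Inference: no answer, the text ends with an open assistant turn.
inference_text = render_chat([user_turn], add_generation_prompt=True)
print(training_text)
print(inference_text)
```

Combining an assistant message with `add_generation_prompt=True` would produce the duplicated assistant marker discussed above.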

**Important**: Make sure that "\<image\>" is part of the text. If it is not, you must modify `format_data` accordingly

In [None]:
if "<image>" not in text:
  raise ValueError("Tag <image> must be in the text!")

For later use, it will be handy to have a function which takes a data entry, formats it and applies the chat template.

In [None]:
def prepare_text_for_vlm(data, system_prompt, user_prompt, use_for_training):
  text = ""
  ##############################################################################
  # YOUR CODE HERE
  ##############################################################################
  ...

  return text

Confirm that it works the way you expect.

In [None]:
result = prepare_text_for_vlm(example, system_prompt, user_prompt, use_for_training=True)
assert len(result.split('Assistant:')[1]) > 0
assert result.endswith("<end_of_utterance>\n")
assert "<image>" in result
print('`prepare_text_for_vlm` is correctly implemented for training')
print()
result

In [None]:
result = prepare_text_for_vlm(example, system_prompt, user_prompt, use_for_training=False)
assert result.endswith("Assistant:")
assert "<image>" in result
print('`prepare_text_for_vlm` is correctly implemented for inference')
print()
result

Now we are ready to query the VLM for product descriptions.

### **Task 3**: Run the VLM to get zero-shot product descriptions.

Now, we will query the VLM to generate the product descriptions. First, we select any example from a dataset.

In [None]:
# Check a dataset example
example_idx = 2 # Feel free to modify
data = dataset[example_idx]
print(f"Product name: {data['Product Name']}")
print(f"Category: {data['Category']}")
print(f"Description: {data['description']}")
data['image'].convert('RGB')

We need to produce `inputs` to the model as follows.

In [None]:
query_prompt = prepare_text_for_vlm(data, system_prompt, user_prompt, use_for_training=False)
image = data['image'].convert('RGB')
# Preprocessing inputs
inputs = processor(text=[query_prompt], images=[image], return_tensors="pt", padding=True).to(DEVICE)

Inspect `inputs`

In [None]:
inputs

Take a look at which keys are produced.

In [None]:
for k in inputs.keys():
  print(k)

Now, we will implement a generation (inference) function.

We will use the `model.generate` function to achieve that. Have a look at the signature of this function.

In [None]:
model.generate

We will implement `generate_product_description`, which receives `query_prompt` and `image` as well as `max_new_tokens` (which limits the number of output tokens). Specifying this parameter is useful for reducing the inference time.

The function should then call `model.generate` to generate the product description.

In [None]:
def generate_product_description(query_prompt, image, max_new_tokens = 256):
  # Preprocessing inputs
  inputs = processor(text=[query_prompt], images=[image], return_tensors="pt", padding=True).to(DEVICE)
  ##############################################################################
  # YOUR CODE HERE
  ##############################################################################
  generated_ids = ...

  # `generated_ids` contains both, input ids (from the prompt) as well as the
  # generated ones. Thus, we need to "trim" it only to keep the output part.
  generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
  output_text = processor.batch_decode(
      generated_ids_trimmed,
      skip_special_tokens=True,
      clean_up_tokenization_spaces=False)

  return output_text[0]
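
The trimming step in `generate_product_description` can be illustrated with plain lists. The token ids below are made up; the only assumption is that each generated sequence starts with a copy of its prompt, as described in the code comment above.

```python
# Toy token ids: each output row starts with a copy of its prompt.
input_ids = [[101, 7, 8], [101, 9, 10]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

# Drop the first len(in_ids) tokens of every row to keep only the new tokens.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[42, 43], [44, 45]]
```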

Let's generate the descriptions

In [None]:
output_text = generate_product_description(query_prompt, image, max_new_tokens=32)
output_text

As we can see, the model's output is mostly nonsense.

Thus, we need to fine-tune it.

First, let's unload the model and free the GPU memory.

In [None]:
del model
torch.cuda.empty_cache()

## Exercise 2 - Finetuning the VLM to improve product description generation

In our case, the zero-shot performance of the model for generating product descriptions is quite poor. This is because the model was not instruction-tuned (i.e. it does not know how to follow user instructions in general) and is not adapted to the specialized task of writing product descriptions.

To adapt the model to the task at hand, we should fine-tune it.

For that, we are going to use [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) on top of a quantized base model. This combination is called [QLoRA](https://arxiv.org/abs/2305.14314) and enables cheap, memory-efficient fine-tuning.

To achieve this, we will be using the [SFTConfig](https://huggingface.co/docs/trl/sft_trainer) class from the [TRL library](https://huggingface.co/docs/trl/index).

#### Pretrained model checkpoint

Note that during the tutorial it might be challenging to actually run the finetuning because you can run out of memory. We have prepared a finetuned model checkpoint for you (available only when `SMOL_MODEL=True`). You can use it by setting `USE_PRETRAINED_MODEL_CHECKPOINT=True`.

In [None]:
PRETRAINED_MODEL_CHECKPOINT = "agalashov/vlm-tutorial-finetuned-llm-final"

USE_PRETRAINED_MODEL_CHECKPOINT = False

if USE_PRETRAINED_MODEL_CHECKPOINT:
  if not SMOL_MODEL:
    raise ValueError('Must use SMOL_MODEL = True.')

### Load the int-4 quantized model

We will use the `BitsAndBytesConfig` class, which lets us load the model in quantized form in order to reduce memory usage. This is what we are going to use for QLoRA.



In [None]:
from transformers import BitsAndBytesConfig

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    # `load_in_4bit=True`: This is the main argument that enables the 4-bit
    # quantization. When set to `True`, the model's weights will be loaded
    # and stored in 4-bit format, drastically reducing memory consumption.
    load_in_4bit=True,
    # `bnb_4bit_use_double_quant=True`: This enables "double quantization."
    # It quantizes the quantization constants themselves, which are the values
    # used to de-quantize the 4-bit weights back to their original form. This
    # can save an additional 0.4 bits per parameter, leading to a further
    # small reduction in memory.
    bnb_4bit_use_double_quant=True,
    # `bnb_4bit_quant_type="nf4"`: This specifies the type of 4-bit
    # quantization to use. "nf4" stands for "NormalFloat 4-bit." It is a
    # quantization data type that is theoretically optimal for weights that
    # follow a normal distribution, which is common in transformer models.
    bnb_4bit_quant_type="nf4",
    # `bnb_4bit_compute_dtype=torch.bfloat16`: This sets the data type for
    # computations that are performed during the forward and backward passes.
    # Even though the weights are stored in 4-bit, the actual matrix
    # multiplications need to be done in a higher precision. `bfloat16`
    # (Brain Floating Point 16-bit) is a good choice as it provides a
    # larger dynamic range than `float16`, which is beneficial for training.
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
      model_id,
      device_map="auto",
      torch_dtype=torch.bfloat16,
      quantization_config=bnb_config,
      _attn_implementation="eager",
  ).to(DEVICE)
processor = AutoProcessor.from_pretrained(chat_template_model_id)
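
As a rough, back-of-the-envelope illustration of why 4-bit loading helps, the sketch below estimates weight memory. It assumes about 256M parameters (taken from the model name) and that every weight is quantized, which is a simplification. The block sizes for the quantization constants follow the description in the QLoRA paper.

```python
# Back-of-the-envelope weight memory, assuming roughly 256M parameters
# (taken from the model name; the exact count differs) and that all
# weights are quantized (a simplification).
num_params = 256e6

bf16_mb = num_params * 16 / 8 / 1e6   # 16 bits per weight
nf4_mb = num_params * 4 / 8 / 1e6     # 4 bits per weight

# Quantization constants, using the block sizes described in the QLoRA paper:
# one 32-bit constant per 64 weights without double quantization ...
overhead_bits = 32 / 64
# ... versus 8-bit constants plus one 32-bit constant per 256 of them.
overhead_bits_dq = 8 / 64 + 32 / (64 * 256)

saved_bits = overhead_bits - overhead_bits_dq
print(f"bf16 weights: {bf16_mb:.0f} MB, 4-bit weights: {nf4_mb:.0f} MB")
print(f"double quantization saves about {saved_bits:.3f} bits per parameter")
```

This only counts the weights. Activations, gradients of the LoRA parameters and optimizer state come on top during training.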

### **Task 1**: Set Up QLoRA

Next, we will configure [QLoRA](https://github.com/artidoro/qlora) for our training setup. QLoRA allows efficient fine-tuning of large models by reducing the memory footprint. Like standard LoRA, it trains small low-rank adapter matrices while the base model stays frozen. In addition, QLoRA stores the frozen base model weights in 4-bit precision, which lowers memory usage even further.

We took care of most of the boilerplate; just fill in the arguments for `LoraConfig`!

In [None]:
from peft import LoraConfig, get_peft_model

lora_rank = 8

# Configure LoRA
peft_config = LoraConfig(
    # `r`: This is the LoRA rank. It determines the size of the low-rank
    # matrices that are added to the model. A higher rank means more trainable
    # parameters and potentially better performance, but also more memory
    # usage.

    # YOUR CODE HERE
    r=TO_FILL,

    # `lora_alpha`: This is the scaling factor for the LoRA update. The
    # low-rank update is scaled by `lora_alpha / r`, so choosing
    # `lora_alpha` equal to `r` gives a scaling factor of 1. A higher
    # alpha makes the updates more significant.

    # YOUR CODE HERE
    lora_alpha=TO_FILL,

    # `lora_dropout`: This is the dropout probability for the LoRA
    # layers. It helps prevent overfitting during fine-tuning by randomly
    # zeroing some of the inputs to the LoRA branch during training.

    # YOUR CODE HERE
    lora_dropout=TO_FILL,


    # `target_modules`: This specifies the names of the model's layers
    # (or modules) that the LoRA adapter will be applied to. It's common
    # to target the attention layers' linear projections, such as `q_proj`
    # (query), `k_proj` (key), `v_proj` (value), and `o_proj` (output).
    # Targeting the MLP layers (`gate_proj`, `up_proj`, `down_proj`) is
    # also a common practice.

    # YOUR CODE HERE -- You can investigate `model` structure to see which
    # parameters to target.
    target_modules=TO_FILL,

    # `use_dora`: This enables "DoRA" (Weight-Decomposed Low-Rank
    # Adaptation), a variant of LoRA. DoRA decomposes each pretrained weight
    # into a magnitude and a direction, and applies the low-rank update to
    # the direction. This can improve fine-tuning performance, especially
    # at low ranks.

    # YOUR CODE HERE
    use_dora=TO_FILL,

    # `init_lora_weights="gaussian"`: This determines how the LoRA matrices
    # are initialized. `"gaussian"` means the weights are drawn from a
    # Gaussian (normal) distribution. Other options include `"loftq"`.
    init_lora_weights="gaussian"
)



Once you have configured LoRA, apply it to the model to obtain a model in which only the LoRA parameters are trainable.

In [None]:
# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters
peft_model.print_trainable_parameters()

Notice that the number of parameters we are going to train is significantly lower than the number of parameters in the original model.
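
To see where the savings come from, consider a single linear layer. The sketch below counts parameters with and without LoRA. The layer size is a made-up example and is not read from the model; `r` and `lora_alpha` are example values.

```python
# Parameter count for one linear layer with and without LoRA.
# The layer size below is a made-up example, not read from the model.
d_in, d_out = 576, 576
r, lora_alpha = 8, 8

full_params = d_in * d_out          # frozen weight W (d_out x d_in)
lora_params = r * d_in + d_out * r  # trainable A (r x d_in) and B (d_out x r)
scaling = lora_alpha / r            # factor applied to the update B @ A

print(full_params, lora_params, scaling)
print(f"trainable fraction: {lora_params / full_params:.2%}")
```

Because the adapter size grows linearly with `r`, doubling the rank doubles the number of trainable parameters for each targeted layer.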

### **Task 2**: Set Up SFTConfig

We will use Supervised Fine-Tuning (SFT) to improve our model's performance on the specific task. To achieve this, we'll define the training arguments with the [SFTConfig](https://huggingface.co/docs/trl/sft_trainer) class from the [TRL library](https://huggingface.co/docs/trl/index). SFT leverages labeled data to help the model generate more accurate responses, adapting it to the task. This approach enhances the model's ability to understand and respond to visual queries more effectively.

Configure `SFTConfig`. You can read comments for the different hyperparameters. It is already preconfigured for you, but you can modify it in any way you want.

In [None]:
from trl import SFTConfig

YOUR_OUTPUT_DIRECTORY = "vlm-tutorial-finetuned-llm"

# Configure training arguments using SFTConfig
args = SFTConfig(
    output_dir=YOUR_OUTPUT_DIRECTORY,           # directory to save and repository id
    num_train_epochs=2,                         # number of training epochs
    per_device_train_batch_size=4,              # batch size per device during training
    gradient_accumulation_steps=4,              # number of batches to accumulate gradients over before each optimizer update
    gradient_checkpointing=True,                # use gradient checkpointing to save memory
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    logging_steps=5,                            # log every 5 steps
    save_strategy="epoch",                      # save checkpoint every epoch
    learning_rate=2e-4,                         # learning rate, based on QLoRA paper
    bf16=True,                                  # use bfloat16 precision
    max_grad_norm=0.3,                          # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                          # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",               # use constant learning rate scheduler
    push_to_hub=True,                           # push model to hub
    report_to="tensorboard",                    # report metrics to tensorboard
    gradient_checkpointing_kwargs={
        "use_reentrant": False
    },  # use non-reentrant checkpointing
    dataset_text_field="",                      # need a dummy field for collator
    dataset_kwargs={"skip_prepare_dataset": True},  # important for collator
)
args.remove_unused_columns = False # important for collator
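
As a quick sanity check of these settings, the effective batch size per optimizer update can be computed by hand. The number of devices and the dataset size below are assumptions for illustration; check `len(dataset)` for the real value.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1          # assumption: a single GPU
num_examples = 1600      # hypothetical dataset size, for illustration only

# Number of examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Approximate number of optimizer updates in one epoch.
updates_per_epoch = num_examples // effective_batch_size
print(effective_batch_size, updates_per_epoch)  # 16 100
```

If you run out of memory, you can lower `per_device_train_batch_size` and raise `gradient_accumulation_steps` so that their product stays the same.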

### Set up `SFTTrainer`

Now, we need to set up `SFTTrainer`, the class that will run the training for us.

In order to set it up we need to define:

* `model` -- already defined
* `args` -- SFTConfig, already defined
* `peft_config` -- peft_config, already defined
* `processing_class` -- processor, already defined
* `train_dataset` -- a dataset over which it will iterate. **NEED TO DEFINE**
* `data_collator` -- a function which can parse every example from a dataset. **NEED TO DEFINE**

### **Task 3**: Create a dataset of text-image data

Set up a dataset `prompts_and_images` which will contain training prompts and images.

In [None]:
prompts_and_images = []
for data in dataset:
  image = data['image'].convert('RGB')
  ##############################################################################
  # Your code here
  ##############################################################################
  ...



In [None]:
assert len(prompts_and_images) == len(dataset)
assert isinstance(prompts_and_images[0][0], str)
assert isinstance(prompts_and_images[0][1], Image.Image)
print('Looks like `prompts_and_images` is correctly implemented')

In [None]:
prompts_and_images[0][0]

### **Task 4**: Implement the dataset processor

We will implement `collate_fn(list_of_examples)`, which receives a list (a batch) of entries from `train_dataset` (in our case it is `prompts_and_images`).

Note that we can call `processor(text=list_of_prompts, images=list_of_images, ...)` to process the whole list of prompts and images at once.

In [None]:
# Create a data collator to encode text and image pairs
def collate_fn(examples):
  texts, images = [], []
  ##############################################################################
  # YOUR CODE HERE
  ##############################################################################
  ...


  batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
  # We do an additional masking here.
  # The labels are the input_ids, and we mask the padding tokens and image
  # tokens in the loss computation
  labels = batch["input_ids"].clone()
  # Id of the image placeholder token (a plain int, so that the comparison
  # below is element-wise)
  image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
  # Mask tokens for not being used in the loss computation
  labels[labels == processor.tokenizer.pad_token_id] = -100
  labels[labels == image_token_id] = -100
  # 262144 is the image soft-token id used by Gemma 3
  labels[labels == 262144] = -100
  batch["labels"] = labels
  return batch
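
The masking idea in `collate_fn` can be shown on plain lists. The token ids below are hypothetical; the value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions with this label do not contribute to the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

# Hypothetical token ids, chosen only for this example.
pad_token_id = 0
image_token_id = 99

input_ids = [
    [99, 99, 5, 6, 7, 0, 0],      # shorter example, padded at the end
    [99, 99, 8, 9, 10, 11, 12],
]

# Replace padding and image tokens by IGNORE_INDEX, keep everything else.
masked = {pad_token_id, image_token_id}
labels = [
    [IGNORE_INDEX if tok in masked else tok for tok in row]
    for row in input_ids
]
print(labels)
```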

#### Create `SFTTrainer` instance

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=prompts_and_images,
    peft_config=peft_config,
    processing_class=processor,
    data_collator=collate_fn,
)

### **Task 5**: Push the button on finetuning!

In [None]:
if not USE_PRETRAINED_MODEL_CHECKPOINT:
  # Start training, the model will be automatically saved to the Hub and the output directory
  trainer.train()

  # Save the final model again to the Hugging Face Hub
  trainer.save_model()

In [None]:
if not USE_PRETRAINED_MODEL_CHECKPOINT:
  # need to free the memory
  del model
  del trainer
  torch.cuda.empty_cache()

`trainer.save_model()` saves only the LoRA parameters and not the full model. For convenience, you might want to save the full model.

In [None]:
if not USE_PRETRAINED_MODEL_CHECKPOINT:
  from peft import PeftModel

  # Load the base model
  model = AutoModelForImageTextToText.from_pretrained(model_id, low_cpu_mem_usage=True)

  # Merge LoRA and base model and save
  peft_model = PeftModel.from_pretrained(model, args.output_dir)
  merged_model = peft_model.merge_and_unload()
  merged_model.save_pretrained("merged_model", safe_serialization=True, max_shard_size="2GB")

  processor = AutoProcessor.from_pretrained(args.output_dir)
  processor.save_pretrained("merged_model")
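
Conceptually, merging folds the scaled low-rank update into the base weight, so the merged layer gives the same output as the base layer plus the adapter. The toy check below uses tiny hand-written matrices and a small `matmul` helper. It sketches the idea under these assumptions and is not how `merge_and_unload` is implemented.

```python
# Toy check that merging LoRA weights does not change the layer's output.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (2 x 2)
B = [[1.0], [0.5]]             # LoRA B (2 x r), with r = 1
A = [[2.0, -1.0]]              # LoRA A (r x 2)
scaling = 2.0                  # lora_alpha / r
x = [[1.0], [1.0]]             # input column vector

# Unmerged: base path plus scaled low-rank path.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + scaling * l[0]] for b, l in zip(base_out, lora_out)]

# Merged: fold the update into a single weight matrix.
delta = matmul(B, A)
W_merged = [
    [w + scaling * d for w, d in zip(w_row, d_row)]
    for w_row, d_row in zip(W, delta)
]
merged = matmul(W_merged, x)

print(unmerged, merged)  # [[5.0], [8.0]] [[5.0], [8.0]]
```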

### **Task 6**: Verify your finetuned model

Now, we are going to query our finetuned model to see whether it provides more accurate product descriptions.

First, we reload our newly trained model (or a pretrained checkpoint).

In [None]:
import torch

if USE_PRETRAINED_MODEL_CHECKPOINT:
  model_dir = PRETRAINED_MODEL_CHECKPOINT
else:
  model_dir = args.output_dir

# Load Model with PEFT adapter
model = AutoModelForImageTextToText.from_pretrained(
  model_dir,
  device_map="auto",
  torch_dtype=torch.bfloat16,
  _attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(model_dir)

Let's query a specific example

In [None]:
example_idx = 2 # You can modify it
data = dataset[example_idx]
query_prompt = prepare_text_for_vlm(data, system_prompt, user_prompt, use_for_training=False)
image = data['image'].convert('RGB')
image

In [None]:
output_text = generate_product_description(query_prompt, image, max_new_tokens=32)
output_text

The output here should make much more sense!

## Using VLM for alternative real-world problems

You now know how to fine-tune a VLM for a specific task, which primarily involves correct data formatting and scripting. As we've discussed, the standard VLM architecture solves tasks like **Visual Question Answering (VQA)** by combining text with images processed by a dedicated vision encoder.

The key takeaway is that this is not limited to images. In fact, the image modality can be replaced by any other data type, such as:

* Video clips

* Audio recordings

* Anything else

The fundamental mechanism for combining the non-textual data with the text prompts can stay the same. By simply swapping the vision encoder for one suited to a different modality, you can adapt the VLM framework to a whole new range of problems.