## PaliGemma Fine-tuning

In this notebook, we will fine-tune [pretrained PaliGemma](https://huggingface.co/google/paligemma2-3b-pt-448) on the train split of the `kaischue/ACLFigVQA` dataset, following the same recipe that was used for a small split of the [VQAv2](https://huggingface.co/datasets/HuggingFaceM4/VQAv2) dataset. Let's get started by installing the necessary libraries.

In [None]:
!nvidia-smi

Thu Dec 19 13:00:25 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [1]:
!pip install -q -U datasets bitsandbytes peft git+https://github.com/huggingface/transformers.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done
ERROR: pip's dependency

We will authenticate to access the model using `notebook_login()`.

In [2]:
from huggingface_hub import notebook_login
notebook_login()


In [3]:
!pip install wandb

import wandb
wandb.login()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [4]:
%env WANDB_PROJECT=VQA

env: WANDB_PROJECT=VQA


Let's load the dataset.

In [5]:
from datasets import load_dataset
ds = load_dataset('kaischue/ACLFigVQA', token=True)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/1.13k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/27.2M [00:00<?, ?B/s]

val-00000-of-00001.parquet:   0%|          | 0.00/10.5M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/8.40M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1491 [00:00<?, ? examples/s]

Generating val split:   0%|          | 0/508 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/493 [00:00<?, ? examples/s]

In [6]:
train_ds = ds["train"]

In [7]:
train_ds

Dataset({
    features: ['img_file_name', 'image', 'label', 'caption', 'inline_reference', 'metadata', 'acl_paper_id', 'pdf_text', 'question_german', 'question_english', 'corrected_answer_german', 'corrected_answer_english', 'short_answer_german', 'short_answer_english', 'category', 'context'],
    num_rows: 1491
})

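When a dataset is general and similar to the datasets PaliGemma was trained on, there is usually no need to fine-tune the image encoder or the multimodal projector, and training only the text decoder is enough. Note that the cell below sets `requires_grad = True` for both the vision tower and the multimodal projector, so in this run they are trained together with the text decoder. Set these flags to `False` if you want to freeze them.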

In [8]:
from transformers import PaliGemmaProcessor
model_id = "google/paligemma2-3b-pt-448"  # or your favorite PaliGemma

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

In [9]:
from transformers import PaliGemmaForConditionalGeneration
import torch
device = "cuda"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

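# Keep the image encoder trainable (set requires_grad = False to freeze it)
for param in model.vision_tower.parameters():
    param.requires_grad = True

# Keep the multimodal projector trainable (set requires_grad = False to freeze it)
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True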


config.json:   0%|          | 0.00/1.36k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/75.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.07G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

Alternatively, if you want to do LoRA or QLoRA fine-tuning, you can run the cell below to load the model with an adapter, either in full precision (LoRA) or quantized (QLoRA).

In [None]:
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from peft import get_peft_model, LoraConfig
import torch

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# For QLoRA, also pass quantization_config=bnb_config to from_pretrained (left out here, so this is plain LoRA)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

trainable params: 11,876,352 || all params: 3,045,003,504 || trainable%: 0.3900
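The trainable parameter count printed above can be reproduced by hand. The following is a rough sketch, not PEFT's own accounting: it assumes each LoRA adapter adds `r * (in_features + out_features)` weights, and it assumes the layer shapes listed below for the text decoder and the vision tower. Those shapes are not stated anywhere in this notebook, so treat them as assumptions. Under them, the total matches the printed value, which also suggests that the `q_proj`, `k_proj`, and `v_proj` names in `target_modules` match layers in the vision tower as well as in the text decoder.

```python
R = 8  # LoRA rank, taken from lora_config


def lora_params(d_in, d_out, r=R):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# ASSUMED text decoder shapes (in_features, out_features) and depth
decoder_layers = 26
decoder_shapes = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}

# ASSUMED vision tower shapes and depth; only q/k/v are assumed to match target_modules
vision_layers = 27
vision_shapes = {
    "q_proj": (1152, 1152),
    "k_proj": (1152, 1152),
    "v_proj": (1152, 1152),
}

decoder_total = decoder_layers * sum(lora_params(*s) for s in decoder_shapes.values())
vision_total = vision_layers * sum(lora_params(*s) for s in vision_shapes.values())
total = decoder_total + vision_total

print(f"decoder: {decoder_total:,}  vision: {vision_total:,}  total: {total:,}")
```

If you only want adapters on the text decoder, you would need a more specific `target_modules` pattern than bare layer names.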


We need to cast the processed inputs to the same dtype as the model, so we store the model's dtype in a variable.

In [10]:
DTYPE = model.dtype

Load the processor to preprocess the dataset.

In [11]:
processor = PaliGemmaProcessor.from_pretrained(model_id)

preprocessor_config.json:   0%|          | 0.00/425 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/243k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/34.6M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/733 [00:00<?, ?B/s]

In [12]:
device = "cuda"

We will now preprocess our examples. We build each prompt from a template that contains the text input and pass the prompts to the processor together with a batch of images. The answers are passed as `suffix`, which makes the processor return them as `labels`, with the prompt, image, and padding positions set to -100 so that the loss ignores them. Training on these labels is how the model learns to generate responses.
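To make the role of -100 concrete, here is a minimal, framework-free sketch of label masking. It is not the processor's implementation. The token ids and the `mask_labels` helper are hypothetical, and the sketch assumes that every position is tagged with a type id of 0 (prompt, image, or padding) or 1 (answer).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_labels(input_ids, token_type_ids):
    """Keep answer tokens as labels and replace everything else with IGNORE_INDEX."""
    return [
        tok if type_id == 1 else IGNORE_INDEX
        for tok, type_id in zip(input_ids, token_type_ids)
    ]


# Hypothetical ids: 0 = padding, 7 = image placeholder, 11-13 = prompt, 21-22 = answer, 1 = end of sequence
input_ids = [0, 0, 7, 7, 11, 12, 13, 21, 22, 1]
token_type_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

labels = mask_labels(input_ids, token_type_ids)
print(labels)
```

Only the last three positions contribute to the loss, so the model is trained to produce the answer given the image and prompt rather than to reproduce the prompt itself.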

In [None]:
import torch

image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
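# VQAv2 version: relies on the `question` and `multiple_choice_answer` columns,
# which ACLFigVQA does not have. Use one of the collators below for this dataset.
def collate_fn(examples):
  texts = ["<image>answer en " + example["question"] for example in examples]
  labels = [example["multiple_choice_answer"] for example in examples]
  images = [example["image"].convert("RGB") for example in examples]
  tokens = processor(text=texts, images=images, suffix=labels,
                    return_tensors="pt", padding="longest")

  # Cast floating-point tensors to the model dtype and move the batch to the GPU
  tokens = tokens.to(DTYPE).to(device)
  return tokens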


In [13]:
import torch

image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
def collate_fn(examples):
  # Convert image to RGB
  images = [example["image"].convert('RGB') for example in examples]
  # Combine the text inputs
  combined_inputs = [f"<image> answer de {example['question_german']} {example['caption']} {example['context']}" for example in examples]
  labels = [example["corrected_answer_german"] for example in examples]
  # Tokenize the data
  tokens = processor(text=combined_inputs, images=images, suffix=labels,
                  return_tensors="pt", padding="longest")
  tokens = tokens.to(DTYPE).to(device)
  return tokens

In [13]:
import torch

image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
def collate_fn(examples):
  # Convert image to RGB
  images = [example["image"].convert('RGB') for example in examples]
  # Combine the text inputs
  combined_inputs = [f"<image> answer en {example['question_english']} {example['caption']} {example['context']}" for example in examples]
  labels = [example["corrected_answer_english"] for example in examples]
  # Tokenize the data
  tokens = processor(text=combined_inputs, images=images, suffix=labels,
                  return_tensors="pt", padding="longest")
  tokens = tokens.to(DTYPE).to(device)
  return tokens
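The German and English collators above differ only in the language tag and the column names. As a small sketch, the prompt construction could be factored into one helper. `build_prompt` and the sample row are hypothetical and only mirror the f-strings used above.

```python
def build_prompt(example, lang="en"):
    """Build the PaliGemma prompt for one example, mirroring the f-strings in collate_fn."""
    question_key = {"en": "question_english", "de": "question_german"}[lang]
    return f"<image> answer {lang} {example[question_key]} {example['caption']} {example['context']}"


# Hypothetical row containing only the columns the prompt needs
row = {
    "question_english": "What does the figure show?",
    "question_german": "Was zeigt die Abbildung?",
    "caption": "Figure 1: Model overview.",
    "context": "The model has two encoders.",
}

print(build_prompt(row, "en"))
print(build_prompt(row, "de"))
```

Inside a collator, `combined_inputs = [build_prompt(example, "de") for example in examples]` would replace the inline f-string.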

We will now initialize the `TrainingArguments`.

In [14]:
from transformers import TrainingArguments
args = TrainingArguments(
    num_train_epochs=2,
    remove_unused_columns=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    learning_rate=2e-5,
    weight_decay=1e-6,
    adam_beta2=0.999,
    logging_steps=1,
    optim="adamw_hf",  # you can use paged optimizers like paged_adamw_8bit for QLoRA
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=1,
    output_dir="paligemma2-3b-pt-448-vis-ACLFigQA-de",
    hub_private_repo=True,
    bf16=True,
    dataloader_pin_memory=False,
    report_to="wandb",
    run_name="paligemma2-3b-pt-448-vis-finetune-de",
)
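The arguments above set `warmup_steps=2` and `learning_rate=2e-5` but no scheduler. As a sketch of what that implies, the function below assumes the default linear schedule: the learning rate ramps up linearly during warmup and then decays linearly to zero. `linear_lr` is a hypothetical helper, not a Trainer API, and `total_steps=744` is taken from the `global_step` reported in the training output of this notebook.

```python
def linear_lr(step, base_lr=2e-5, warmup_steps=2, total_steps=744):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at total_steps
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


for step in (0, 1, 2, 373, 744):
    print(step, linear_lr(step))
```

With only two warmup steps the schedule is almost pure linear decay, which is consistent with the final learning rate of 0.0 in the W&B summary.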


We can now start training.

In [15]:
from transformers import Trainer

trainer = Trainer(
        model=model,
        train_dataset=train_ds ,
        data_collator=collate_fn,
        args=args
        )


LoRA with a batch size of 2 fits on a Colab A100. You can apply gradient accumulation (enabled in this notebook via `gradient_accumulation_steps=4`) to simulate larger batch sizes.
There is currently an issue with QLoRA; we are investigating and will fix it soon.
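The effect of gradient accumulation on the effective batch size and on the number of optimizer steps can be worked out from the settings in this notebook. The sketch below assumes a single GPU and assumes that an incomplete accumulation group at the end of each epoch does not count as an extra optimizer step. Under those assumptions the result agrees with the `global_step=744` reported in the training output.

```python
import math

num_examples = 1491          # rows in the train split
per_device_batch_size = 1    # per_device_train_batch_size
grad_accum = 4               # gradient_accumulation_steps
epochs = 2                   # num_train_epochs

# Examples that contribute to each optimizer update (single GPU assumed)
effective_batch_size = per_device_batch_size * grad_accum

# Batches produced by the dataloader in one epoch
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# ASSUMPTION: leftover batches that do not fill an accumulation group are not counted
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * epochs

print(effective_batch_size, updates_per_epoch, total_updates)
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing memory use, at the cost of fewer optimizer steps per epoch.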

In [16]:
trainer.train()

wandb: Currently logged in as: kai-schueler (kai-schueler-technische-universit-t-berlin). Use `wandb login --relogin` to force relogin


It is strongly recommended to train Gemma2 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.


Step,Training Loss
1,16.8729
2,13.1997
3,14.527
4,6.6393
5,5.1551
6,7.633
7,8.3703
8,8.1463
9,11.1022
10,7.8869


TrainOutput(global_step=744, training_loss=4.21066762915542, metrics={'train_runtime': 1184.3152, 'train_samples_per_second': 2.518, 'train_steps_per_second': 0.628, 'total_flos': 4.846947619875408e+16, 'train_loss': 4.21066762915542, 'epoch': 1.995305164319249})

In [17]:
trainer.push_to_hub()

model-00002-of-00002.safetensors:   0%|          | 0.00/1.07G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.37k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

CommitInfo(commit_url='https://huggingface.co/kaischue/paligemma2-3b-pt-448-vis-ACLFigQA-de/commit/b908531ccfbf0591ba8962075abe87e719d6f455', commit_message='End of training', commit_description='', oid='b908531ccfbf0591ba8962075abe87e719d6f455', pr_url=None, repo_url=RepoUrl('https://huggingface.co/kaischue/paligemma2-3b-pt-448-vis-ACLFigQA-de', endpoint='https://huggingface.co', repo_type='model', repo_id='kaischue/paligemma2-3b-pt-448-vis-ACLFigQA-de'), pr_revision=None, pr_num=None)

In [18]:
wandb.finish()

0,1
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇██
train/global_step,▁▁▁▁▁▂▂▂▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▇▇▇▇▇▇▇▇▇███
train/grad_norm,▂▂▁▂▁▂▁█▂▂▇▂▄▁▂▂▂▁▂▂▁▂▂▁▁▁▂▂▁▁▁▁▁▁▁▂▂▁▁▂
train/learning_rate,████▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁
train/loss,▅█▅▅▄▃▅▆▄▆▄▃▆▄▆▄▇▃▂▄▁▃▃▂▁▁▁▃▂▂▃▁▁▁▄▂▂▁▂▁

0,1
total_flos,4.846947619875408e+16
train/epoch,1.99531
train/global_step,744.0
train/grad_norm,97.0
train/learning_rate,0.0
train/loss,2.7384
train_loss,4.21067
train_runtime,1184.3152
train_samples_per_second,2.518
train_steps_per_second,0.628


In [None]:
from google.colab import runtime
runtime.unassign()

You can find the steps for running inference [here](https://colab.research.google.com/drive/100IQcvMvGm9y--oelbLfI__eHCoz5Ser?usp=sharing).