## PaliGemma Fine-tuning

In this notebook, we will fine-tune [pretrained PaliGemma](https://huggingface.co/google/paligemma-3b-pt-448) on a small split of [VQAv2](https://huggingface.co/datasets/HuggingFaceM4/VQAv2) dataset. Let's get started by installing necessary libraries.

You can fine PaliGemma Blog here: https://huggingface.co/blog/paligemma


In [1]:
!pip install -q -U git+https://github.com/huggingface/transformers.git datasets accelerate

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


We will authenticate to access the model using `notebook_login()`.

In [None]:
!pip install peft

In [None]:
!pip install pip install bitsandbytes

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Load dataset.

In [5]:
from datasets import load_dataset
ds = load_dataset('HuggingFaceM4/VQAv2', split="train[:10%]")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
Repo card metadata block was not found. Setting CardData to empty.


In [6]:
cols_remove = ["question_type", "answers", "answer_type", "image_id", "question_id"]
ds = ds.remove_columns(cols_remove)

In [7]:
split_ds = ds.train_test_split(test_size=0.05) # we'll use a very small split for demo
train_ds = split_ds["test"]

In [8]:
train_ds

Dataset({
    features: ['multiple_choice_answer', 'question', 'image'],
    num_rows: 2219
})

In [9]:
train_ds.to_pandas()

Unnamed: 0,multiple_choice_answer,question,image
0,yes,Is the sandwich intact?,"{'bytes': None, 'path': '/root/.cache/huggingf..."
1,yes,Does this woman like bears?,"{'bytes': None, 'path': '/root/.cache/huggingf..."
2,2,How many different colors of sandals are in th...,"{'bytes': None, 'path': '/root/.cache/huggingf..."
3,3,How many people on the field?,"{'bytes': None, 'path': '/root/.cache/huggingf..."
4,none,What clothing label is a sponsor of this event?,"{'bytes': None, 'path': '/root/.cache/huggingf..."
...,...,...,...
2214,snowboarding,What are the people doing?,"{'bytes': None, 'path': '/root/.cache/huggingf..."
2215,window,What is reflected in the mirror?,"{'bytes': None, 'path': '/root/.cache/huggingf..."
2216,no,Is the bus in motion?,"{'bytes': None, 'path': '/root/.cache/huggingf..."
2217,0,How many clocks are there?,"{'bytes': None, 'path': '/root/.cache/huggingf..."


# Process Data

In [10]:
from transformers import PaliGemmaProcessor

In [11]:
model_id = "google/paligemma-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)


In [12]:
import torch
device = "cuda"

image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
def collate_fn(examples):
  texts = ["answer " + example["question"] for example in examples]
  labels= [example['multiple_choice_answer'] for example in examples]
  images = [example["image"].convert("RGB") for example in examples]
  tokens = processor(text=texts, images=images, suffix=labels,
                    return_tensors="pt", padding="longest",
                    tokenize_newline_separately=False)

  tokens = tokens.to(torch.bfloat16).to(device)
  return tokens


# Load Model

In [13]:
from transformers import PaliGemmaForConditionalGeneration
import torch

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

for param in model.vision_tower.parameters():
    param.requires_grad = False

for param in model.multi_modal_projector.parameters():
    param.requires_grad = False

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Alternatively, if you want to do LoRA and QLoRA fine-tuning, you can run below cells to load the adapter either in full precision or quantized.

In [19]:
from transformers import BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_type=torch.bfloat16
)

lora_config = LoraConfig(
    r=4,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


Unused kwargs: ['bnb_4bit_compute_type']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

trainable params: 5,649,408 || all params: 2,929,115,888 || trainable%: 0.1929


# Initialize the `TrainingArguments`.

In [20]:
from transformers import TrainingArguments


args=TrainingArguments(
            num_train_epochs=2,
            remove_unused_columns=False,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            learning_rate=2e-5,
            weight_decay=1e-6,
            adam_beta2=0.999,
            logging_steps=100,
            optim="adamw_hf",
            save_strategy="steps",
            save_steps=1000,
            push_to_hub=True,
            save_total_limit=1,
            output_dir="paligemma_vqav2",
            bf16=True,
            report_to=["tensorboard"],
            dataloader_pin_memory=False
        )


#Start Training.

In [21]:
from transformers import Trainer

trainer = Trainer(
        model=model,
        train_dataset=train_ds ,
        data_collator=collate_fn,
        args=args
        )



In [22]:
trainer.train()



Step,Training Loss
100,1.5539
200,0.9795


TrainOutput(global_step=276, training_loss=1.1635485939357593, metrics={'train_runtime': 569.5023, 'train_samples_per_second': 7.793, 'train_steps_per_second': 0.485, 'total_flos': 1.7238599866362336e+16, 'train_loss': 1.1635485939357593, 'epoch': 1.9891891891891893})

In [23]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/moma1820/paligemma_vqav2/commit/afa47d8f8cb2bf0c77efe0d2ed05e69505fd88ed', commit_message='End of training', commit_description='', oid='afa47d8f8cb2bf0c77efe0d2ed05e69505fd88ed', pr_url=None, pr_revision=None, pr_num=None)

You can find steps to infer [here](https://colab.research.google.com/drive/100IQcvMvGm9y--oelbLfI__eHCoz5Ser?usp=sharing).