# Finetuning of Gemma 3 4B model on Brain Tumor VQA task using Unsloth
We use the Unsloth library for finetuning because it provides dynamic quantization.

Check out the Unsloth documentation for notebooks covering other models: [Link](https://docs.unsloth.ai/get-started/unsloth-notebooks)



### Installation

In [None]:
%%capture
import os
!pip install unsloth vllm

In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [None]:
from unsloth import FastModel
import torch

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False,
    full_finetuning = False,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-12 08:17:05 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.19: Fast Gemma3 patching. Transformers: 4.50.0.dev0. vLLM: 0.8.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
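
To give a rough sense of why `load_in_4bit = True` matters on a T4 with about 14.7 GB of memory, the sketch below estimates weight-only memory at several bit widths. It assumes a round 4 billion parameters (the total Unsloth prints in the training banner later in this notebook) and ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so the numbers are rough lower bounds rather than measurements. `weight_memory_gb` is a hypothetical helper, not part of Unsloth.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 4e9  # assumption: round parameter count for Gemma 3 4B

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint by a factor of four, which is what leaves room for LoRA training on a single small GPU.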



### LoRA Adapters (defining parameters)



In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `base_model.model.vision_tower.vision_model` require gradients
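
To make the effect of `r` and `lora_alpha` more concrete: LoRA freezes a weight matrix of shape `(d_out, d_in)` and learns a low-rank update `B @ A`, with `B` of shape `(d_out, r)` and `A` of shape `(r, d_in)`, scaled by `lora_alpha / r`. The sketch below counts the extra trainable parameters for a single layer. The 2560 x 2560 shape is purely illustrative and is not a claim about Gemma 3's actual layer sizes; both helper functions are hypothetical.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r

d_in, d_out = 2560, 2560  # hypothetical layer shape, for illustration only
full = d_in * d_out       # parameters in the frozen weight matrix

for r in (8, 16, 64):
    extra = lora_extra_params(d_in, d_out, r)
    print(f"r={r:>2}: {extra:,} trainable vs {full:,} frozen ({extra / full:.2%})")

print("scaling with r=8, lora_alpha=8:", lora_scaling(8, 8))
```

With `r = 8` and `lora_alpha = 8` as configured here, the scaling factor is 1, and the adapter for such a layer is well under 1% of the frozen matrix.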


<a name="Data"></a>
### Data Prep
The dataset contains MRI/CT scan images labeled for brain tumor detection, with corresponding visual question answering (VQA) pairs. This small dataset was created for research purposes.

**Dataset** : "Kaith-jeet123/brain_tumor_vqa" [Link](https://huggingface.co/datasets/Kaith-jeet123/brain_tumor_vqa)

We use Unsloth's `get_chat_template` function to get the correct chat template. Unsloth supports `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [None]:
# Import the function to get a chat template from Unsloth
from unsloth.chat_templates import get_chat_template

# Initialize the tokenizer using the "gemma-3" chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-3",  # Specify the chat template for Gemma-3 model
)

In [None]:
# Import the function to load datasets
from datasets import load_dataset

# Load the dataset for brain tumor visual question answering (VQA), specifically the training split
dataset = load_dataset("Kaith-jeet123/brain_tumor_vqa", split="train")

README.md:   0%|          | 0.00/1.56k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/9.00M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/101 [00:00<?, ? examples/s]

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [None]:
# Import the function to standardize data formats for compatibility with Unsloth templates
from unsloth.chat_templates import standardize_data_formats

# Standardize the dataset format to ensure it aligns with the expected input structure for fine-tuning
dataset = standardize_data_formats(dataset)

In [None]:
dataset[10]

{'Question': 'Tell me the condition of the brain. Healthy or having a tumor?',
 'Answer': "This MRI scan reveals a contrast-enhancing lesion in the brain, suggesting the presence of a tumor. The lesion appears well-defined and located near the brain's surface, possibly indicative of a metastatic tumor or a meningioma.",
 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=377x500>}

We now apply the `Gemma-3` chat template to the conversations and save the result to `text`.

In [None]:
# Define a function to apply the chat template to each example in the dataset
def apply_chat_template(examples):
    # Create a list of conversations by pairing questions and answers from the dataset
    conversations = [
        [
            {"role": "user", "content": q},
            {"role": "assistant", "content": a}
        ] for q, a in zip(examples["Question"], examples["Answer"])
    ]

    # Apply chat template to text components
    texts = tokenizer.apply_chat_template(
        conversations,
        tokenize=False,
        add_generation_prompt=False
    )

    return {"text": texts, "image": examples["image"]}

# Apply the `apply_chat_template` function to every example in the dataset in batches for efficiency
dataset = dataset.map(apply_chat_template, batched=True)

Map:   0%|          | 0/101 [00:00<?, ? examples/s]
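
For intuition, the Gemma-3 template wraps each turn in `<start_of_turn>` / `<end_of_turn>` markers and names the assistant role `model`. The sketch below rebuilds that layout by hand for a single question-answer pair, based only on the formatted text this notebook prints. `format_gemma3_pair` is a hypothetical helper with made-up example strings; the real template covers more cases (system prompts, images, multiple turns), so `tokenizer.apply_chat_template` remains the thing to use in practice.

```python
def format_gemma3_pair(question: str, answer: str) -> str:
    """Hand-rolled approximation of the Gemma-3 layout for one user/model exchange."""
    return (
        "<bos>"
        f"<start_of_turn>user\n{question}<end_of_turn>\n"
        f"<start_of_turn>model\n{answer}<end_of_turn>\n"
    )

# Made-up example strings, not taken from the dataset
print(format_gemma3_pair("Is a tumor visible?", "No clear lesion is visible."))
```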

Let's see how the chat template did! Notice that the `Gemma-3` template adds a `<bos>` by default!

In [None]:
dataset[90]['text']

'<bos><start_of_turn>user\nDoes the MRI show any focal areas of abnormal signal intensity within the brain parenchyma?<end_of_turn>\n<start_of_turn>model\nBased on this axial T1-weighted MRI image, there are no clearly discernible focal areas of abnormal signal intensity within the brain parenchyma that are definitively suggestive of a tumor. There are no clearly visible focal areas of abnormal signal intensity within the brain parenchyma that strongly suggest a tumor.<end_of_turn>\n'

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but you can set `num_train_epochs=1` for a full run and turn off the step cap with `max_steps=None`.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 1, # Only takes effect when max_steps is not set
        max_steps = 30, # Overrides num_train_epochs
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",  #Specify the optimizer as AdamW with 8-bit precision for efficient training
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/101 [00:00<?, ? examples/s]
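
The interaction between batch size, gradient accumulation, and `max_steps` is easy to lose track of, so here is a back-of-the-envelope sketch using the values from the config above and the 101 training examples. It is plain arithmetic, not the trainer's exact bookkeeping, which may round partial batches differently.

```python
import math

num_examples = 101      # size of the train split
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_gpus = 1
max_steps = 30

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus

# Approximate optimizer steps needed for one pass over the data
steps_per_epoch = math.ceil(num_examples / effective_batch)

# Approximate number of passes that max_steps covers
epochs_covered = max_steps / steps_per_epoch

print(f"effective batch size: {effective_batch}")
print(f"approx. steps per epoch: {steps_per_epoch}")
print(f"approx. epochs covered by {max_steps} steps: {epochs_covered:.2f}")
```

With these numbers the 30-step run passes over the data a little more than twice, and `num_train_epochs = 1` has no effect because `max_steps` takes precedence.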

Use Unsloth's `train_on_responses_only` function to compute the loss only on the model's responses and ignore the loss on the user's inputs. This can improve fine-tuning accuracy, because the model is no longer trained to reproduce the questions.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/101 [00:00<?, ? examples/s]

In [None]:
tokenizer.decode(trainer.train_dataset[90]["input_ids"])

'<bos><bos><start_of_turn>user\nDoes the MRI show any focal areas of abnormal signal intensity within the brain parenchyma?<end_of_turn>\n<start_of_turn>model\nBased on this axial T1-weighted MRI image, there are no clearly discernible focal areas of abnormal signal intensity within the brain parenchyma that are definitively suggestive of a tumor. There are no clearly visible focal areas of abnormal signal intensity within the brain parenchyma that strongly suggest a tumor.<end_of_turn>\n'
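
The decoded example above begins with `<bos><bos>`: the chat template already wrote one `<bos>` into `text`, and tokenization appears to have prepended another. One common remedy is to strip the literal prefix from the templated string before it reaches the trainer. A minimal sketch, where `strip_leading_bos` is a hypothetical helper that was not used in this notebook's training run:

```python
BOS = "<bos>"

def strip_leading_bos(text: str, bos: str = BOS) -> str:
    """Remove one leading BOS marker, if present, and leave other text untouched."""
    if text.startswith(bos):
        return text[len(bos):]
    return text

templated = "<bos><start_of_turn>user\nHello<end_of_turn>\n"
print(repr(strip_leading_bos(templated)))
```

In `apply_chat_template`, this would be applied to each element of `texts` before returning them, so that the tokenizer contributes the only `<bos>`.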

Now let's print the masked out example - you should see only the answer is present:

In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[90]["labels"]]).replace(tokenizer.pad_token, " ")

'                          Based on this axial T1-weighted MRI image, there are no clearly discernible focal areas of abnormal signal intensity within the brain parenchyma that are definitively suggestive of a tumor. There are no clearly visible focal areas of abnormal signal intensity within the brain parenchyma that strongly suggest a tumor.<end_of_turn>\n'
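
The masking idea can be sketched without any library: labels are a copy of the input IDs, except that every position up to the start of the response is set to -100, the index that PyTorch's cross-entropy loss ignores by default. The token IDs and the `mask_before_response` helper below are made up for illustration; Unsloth's `train_on_responses_only` locates the `instruction_part` and `response_part` markers itself and also handles multi-turn conversations.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_before_response(input_ids, response_marker):
    """Return labels in which everything up to and including the first
    occurrence of `response_marker` is replaced by IGNORE_INDEX."""
    n = len(response_marker)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_marker:
            cut = start + n
            return [IGNORE_INDEX] * cut + input_ids[cut:]
    # Marker not found: there is no response to train on
    return [IGNORE_INDEX] * len(input_ids)

# Toy IDs (hypothetical): 1 = <bos>, [5, 6] = user marker, [5, 7] = model marker
input_ids = [1, 5, 6, 11, 12, 5, 7, 21, 22, 9]
labels = mask_before_response(input_ids, response_marker=[5, 7])
print(labels)
```

Only the positions after the model marker keep their token IDs, which mirrors the decoded output above where the question is blanked out and the answer remains.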

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.57 GB of memory reserved.


Let's train the model!

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 101 | Num Epochs = 3 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 19,248,896/4,000,000,000 (0.48% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 1 | 3.8425 |
| 2 | 3.243 |
| 3 | 3.8877 |
| 4 | 3.3141 |
| 5 | 3.0253 |
| 6 | 2.9492 |
| 7 | 2.8604 |
| 8 | 2.0905 |
| 9 | 2.0762 |
| 10 | 1.8301 |


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

159.226 seconds used for training.
2.65 minutes used for training.
Peak reserved memory = 5.57 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 37.786 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [None]:
from unsloth.chat_templates import get_chat_template
from PIL import Image

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

# Prepare the image (the example at index 10 of the dataset)
image = dataset[10]["image"]  # Already a PIL image from the dataset

#Create multimodal messages with image+text input
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "My close friend had a sudden onset of speech difficulties. This is their brain MRI. Can you interpret the findings?"}
    ]
}]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 100,
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

['<bos><start_of_turn>user\n<start_of_image>My close friend had a sudden onset of speech difficulties. This is their brain MRI. Can you interpret the findings?<end_of_turn>\n<start_of_turn>model\nThis MRI is most likely showing a significant structural abnormality in the brain, specifically a large, contrast-enhancing lesion in the left hemisphere. The MRI is showing a lesion of high signal intensity (bright white) in the left parietal and occipital lobes with irregular borders, suggesting brain tissue breakdown or fluid accumulation. This is highly suspicious for a tumor, potentially brain abscess, or severe stroke.<end_of_turn>']
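
To illustrate what `temperature`, `top_k`, and `top_p` do to the next-token distribution, here is a simplified sketch in plain Python. It is not the `transformers` implementation: `filter_distribution` is a hypothetical helper, the logits are a toy example, and real decoding operates on tensors over the full vocabulary.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, then top-p."""
    # Temperature: divide logits before the softmax (higher = flatter distribution)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax

    # Top-k: keep only the k most likely tokens
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    top_total = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / top_total
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens
    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}

toy_logits = [4.0, 3.0, 1.0, 0.0, -2.0]
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

A token is then sampled from the returned distribution; lowering `top_p` or `top_k` makes the output more conservative, while raising `temperature` makes it more varied.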

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model.

In [None]:
#model.save_pretrained("gemma-3")  # Local saving
#tokenizer.save_pretrained("gemma-3")
model.push_to_hub("HUB_ACCOUNT/gemma-3_4B_Brain_Tumor_VQA", token = "....") # Online saving
tokenizer.push_to_hub("HUB_ACCOUNT/gemma-3_4B_Brain_Tumor_VQA", token = "....") # Online saving


Saved model to https://huggingface.co/Kaith-jeet123/gemma-3_4B_Brain_Tumor_VQA
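
Since this section also covers loading, here is a hedged sketch of reloading the saved adapters. It assumes that `FastModel.from_pretrained` accepts a local adapter directory or a Hub repo ID in `model_name`, which is how Unsloth's example notebooks reload LoRA adapters. `load_finetuned` is a hypothetical wrapper, and `"gemma-3"` is the local directory name from the commented-out `save_pretrained` call above.

```python
def load_finetuned(model_path: str = "gemma-3", max_seq_length: int = 2048):
    """Reload LoRA adapters saved with save_pretrained or push_to_hub."""
    # Imported lazily so that defining this function does not require Unsloth or a GPU
    from unsloth import FastModel

    model, tokenizer = FastModel.from_pretrained(
        model_name = model_path,          # local folder or Hub repo with the adapters
        max_seq_length = max_seq_length,
        load_in_4bit = True,              # match the settings used for training
    )
    return model, tokenizer
```

Calling `load_finetuned()` on a machine with a supported GPU should return a model and tokenizer that can be used with the inference code from the previous section.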

