# Fine-tuning Gemma-3 on AMD Strix Halo (Unsloth Benchmarks)

**This notebook adapts the standard Hugging Face fine-tuning pipeline to [Unsloth](https://github.com/unslothai/unsloth) so the two can be compared one-to-one.**

### What is Unsloth?
Unsloth is an open-source library that speeds up LLM fine-tuning (the project reports 2x to 5x faster training) and reduces VRAM usage (up to 70% less) without degrading accuracy.

**How does it work?**
- It uses manually derived gradients for backpropagation, bypassing the overhead of PyTorch's generic autograd for those operations.
- It writes standard operations like RoPE embeddings, Cross Entropy Loss, and MLP forward passes in highly optimized custom **Triton kernels**.
- It recycles memory and dynamically quantizes weights on the fly.
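
To make the first bullet concrete, here is a minimal pure-Python sketch of what a "manually derived" gradient looks like for cross-entropy loss. It is an illustration only and not Unsloth's kernel code: the closed-form gradient (softmax minus one-hot target) is written by hand and checked against finite differences. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class."""
    return -math.log(softmax(logits)[target])

def cross_entropy_grad(logits, target):
    """Hand-derived gradient: softmax(logits) minus the one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, used only to check the closed form."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

logits = [2.0, -1.0, 0.5, 0.1]
analytic = cross_entropy_grad(logits, target=2)
numeric = numeric_grad(logits, target=2)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_gap:.2e}")
```

A fused kernel can compute the loss and this gradient in one pass instead of recording a graph of intermediate operations, which is the kind of saving the bullet refers to.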

This notebook mirrors the exact configurations (Full Finetune, LoRA, 8-bit, QLoRA) found in the standard HF notebook, but uses `unsloth.FastLanguageModel` as the backend engine to demonstrate the speed and memory improvements on AMD hardware.

In [None]:
import os
os.environ["UNSLOTH_SKIP_TORCHVISION_CHECK"] = "1"
import unsloth
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, PeftModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
from lib import full_cleanup

def reset_peak_mem():
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()

def report_peak_mem(tag: str = ""):
    if torch.cuda.is_available():
        print(f"Peak training memory{(' ' + tag) if tag else ''}: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

In [None]:
from huggingface_hub import login

# Log in to the Hugging Face Hub (uncomment and set your token)
#hf_token = ''
#login(hf_token)

# Model selection and training parameters

## What is Gemma?
[Gemma](https://ai.google.dev/gemma) is Google's family of open-weight large language models, released in pretrained and instruction-tuned (`-it`) variants for open research and efficient deployment.  
It is built from the same research and technology as the Gemini models, and is released for reproducible downstream fine-tuning and evaluation.  
Weights are available on Hugging Face under the `google/gemma-*` namespaces, in different sizes from **270M** to **27B** parameters.

- `gemma-3-270m-it` – smallest, fast for experimentation.  
- `gemma-3-1b-it` – light but expressive; good for quick SFT runs.  
- `gemma-3-4b-it` – balanced for instruction-tuning tests.  
- `gemma-3-12b-it` – large model, needs substantial memory.  
- `gemma-3-27b-it` – very large; QLoRA recommended on Strix Halo.

Reference: [Gemma model documentation](https://ai.google.dev/gemma/docs/overview)
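
As a rough guide to why the larger checkpoints call for quantization, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes the nominal parameter counts in the model names (the real checkpoints differ somewhat) and ignores gradients, optimizer state, and activations.

```python
# Nominal parameter counts read off the model names (assumption: the real
# checkpoints deviate slightly from these round numbers).
NOMINAL_PARAMS = {
    "gemma-3-270m-it": 270e6,
    "gemma-3-1b-it": 1e9,
    "gemma-3-4b-it": 4e9,
    "gemma-3-12b-it": 12e9,
    "gemma-3-27b-it": 27e9,
}
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4-bit": 0.5}

def weights_gb(n_params, precision):
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for name, n in NOMINAL_PARAMS.items():
    row = ", ".join(f"{p}: {weights_gb(n, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name:<18} {row}")
```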


In [None]:
#MODEL = "google/gemma-3-270m-it"
MODEL = "google/gemma-3-1b-it"  # Default model
#MODEL = "google/gemma-3-4b-it"
#MODEL = "google/gemma-3-12b-it"
#MODEL = "google/gemma-3-27b-it"

#MODEL = "Qwen/Qwen3-4B"
#MODEL = "Qwen/Qwen3-4B-Instruct-2507"

model_name = MODEL.split("/")[-1]

## About the training parameters
The parameters below control the supervised fine-tuning (SFT) phase.

| Parameter | Meaning | Typical Range / Source |
|:-----------|:--------|:----------------------|
| **LR** | Learning rate for optimizer (`adamw_*`) | `5e-5` to `1e-4` for small datasets; recommended baseline from Hugging Face TRL and Google Gemma guides. |
| **EPOCHS** | Number of full passes over the dataset | 1–3 are typical for small SFT datasets. |
| **BATCH_SIZE** | Samples per device per step | Adjust based on available VRAM / unified memory. |

References:  
- [Gemma fine-tuning examples](https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune)  
- [Hugging Face TRL SFTTrainer docs](https://huggingface.co/docs/trl)
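
To see how these parameters translate into training length, the sketch below counts optimizer steps for a single-device run without gradient accumulation. The figure of 800 training samples assumes this notebook's dataset setup (1000 samples with 20% held out for evaluation); the helper name is hypothetical.

```python
import math

def optimizer_steps(n_train, batch_size, epochs):
    """Optimizer updates for a single-device run; the last partial batch is kept."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

per_epoch, total = optimizer_steps(n_train=800, batch_size=4, epochs=2)
print(f"{per_epoch} steps per epoch, {total} steps in total")
```

Doubling `BATCH_SIZE` halves the number of steps, so a constant learning rate is applied to fewer, larger updates.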

In [None]:
LR = 5e-5
EPOCHS = 2
BATCH_SIZE = 4

# Dataset
We use a small dataset (`Abirate/english_quotes`) for demonstration.

Replace it with your dataset containing `messages` for chat-format SFT tasks.
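
If your own data is stored as flat prompt/response pairs, a small converter along these lines produces the `messages` structure. The `question` and `answer` column names are hypothetical; adjust them to your dataset.

```python
def to_messages(row, prompt_key="question", answer_key="answer"):
    """Convert one flat record into the `messages` structure used for chat SFT.

    The default key names are placeholders, not a convention of any library.
    """
    return {
        "messages": [
            {"role": "user", "content": row[prompt_key].strip()},
            {"role": "assistant", "content": row[answer_key].strip()},
        ]
    }

rows = [
    {"question": "What is LoRA? ", "answer": "A low-rank adapter method."},
]
converted = [to_messages(r) for r in rows]
print(converted[0]["messages"])
```

With a Hugging Face `Dataset`, the same function can be passed to `ds.map(to_messages, remove_columns=ds.column_names)`, in the same way `format_chat` is used in the next cell.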

In [None]:
ds = load_dataset("Abirate/english_quotes", split="train").shuffle(seed=42).select(range(1000))

def format_chat(ex):
    return {
        "messages": [
            {"role": "user", "content": f"Give me a quote about: {ex['tags']}"},
            {"role": "assistant", "content": f"{ex['quote']} - {ex['author']}"}
        ]
    }

ds = ds.map(format_chat, remove_columns=ds.column_names)

# Pre-format the dataset for the Unsloth SFTTrainer.
# get_chat_template and AutoTokenizer are already imported in the first cell.
# Note: the template is fixed to "gemma-3"; change it if you select a non-Gemma MODEL above.
tokenizer_for_data = AutoTokenizer.from_pretrained(MODEL)
tokenizer_for_data = get_chat_template(tokenizer_for_data, chat_template="gemma-3")

def apply_template(examples):
    # Render each conversation to text and drop the template's leading <bos>.
    texts = [
        tokenizer_for_data.apply_chat_template(
            m, tokenize=False, add_generation_prompt=False
        ).removeprefix('<bos>')
        for m in examples["messages"]
    ]
    return {"text": texts}

ds = ds.map(apply_template, batched=True)
ds = ds.train_test_split(test_size=0.2, seed=42)  # fixed seed for a reproducible split
print(f"Train: {len(ds['train'])}, Test: {len(ds['test'])}")

In [None]:
ds['train'][1]

# 1. Full fine-tuning
Full fine-tuning updates **all model parameters**, giving the highest adaptation quality and flexibility, but at the highest compute and memory cost.

- Trains every weight with full gradients and optimizer states.  
- Provides the best results when memory allows, but becomes infeasible for large models.  
- Recommended for small and medium models only.  
- Docs: [Gemma full fine-tune](https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune)

| Model | Peak Memory | Time (2 epochs) | Notes |
|:------|-------------:|----------------:|:------|
| Gemma-3-1B-IT | **19 GB** | ~2 min 52 s | Fits easily |
| Gemma-3-4B-IT | **46 GB** | ~9 min | Manageable |
| Gemma-3-12B-IT | **115 GB** | ~25 min | Pushes system limit |
| Gemma-3-27B-IT | — | ❌ Out of memory | Not feasible |
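
A back-of-the-envelope lower bound helps explain these numbers. The sketch below assumes bf16 weights, bf16 gradients, and two bf16 Adam moments per parameter (8 bytes per parameter in total), with nominal parameter counts; activations, logits, and allocator overhead are excluded, so measured peaks such as those in the table above should come out higher.

```python
def full_finetune_state_gb(n_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=2):
    """Lower bound on training memory: weights + gradients + two Adam moments.

    Assumes bf16 everywhere; activations and logits are NOT included.
    """
    per_param = weight_bytes + grad_bytes + 2 * adam_state_bytes
    return n_params * per_param / 1e9

for label, n in [("1B", 1e9), ("4B", 4e9), ("12B", 12e9), ("27B", 27e9)]:
    print(f"Gemma-3-{label}: at least {full_finetune_state_gb(n):.0f} GB")
```

If the optimizer keeps its moments in fp32 instead, pass `adam_state_bytes=4`, which raises the bound to 12 bytes per parameter.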


### Unsloth Advantage: Full Finetuning
Even without PEFT (LoRA), Unsloth reports speedups of up to 1.3x for full fine-tuning. The gains come from manually derived gradients and Triton kernels for operations such as cross-entropy loss and RoPE embeddings.


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = 512,
    dtype = None,
    load_in_4bit = False,
)
torch_dtype = model.dtype
print(f"Weights footprint: {model.get_memory_footprint()/1e9:.2f} GB")

In [None]:
args = SFTConfig(
    dataset_text_field="text",
    output_dir=f"output-unsloth-{model_name}-full",
    max_length=512,
    packing=False,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_checkpointing=False,
    optim="adamw_torch_fused",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    learning_rate=LR,
    fp16=(torch_dtype == torch.float16),
    bf16=(torch_dtype == torch.bfloat16),
    lr_scheduler_type="constant",
    report_to="none",
    dataset_kwargs={"add_special_tokens": False, "append_concat_token": True},
    save_total_limit=1,  # keep only latest checkpoint
)

trainer = SFTTrainer(model=model, args=args, train_dataset=ds['train'], eval_dataset=ds['test'], processing_class=tokenizer)
reset_peak_mem()
trainer.train()
report_peak_mem("full")
trainer.save_model()

In [None]:
# Free all memory
full_cleanup(model, trainer)


# 2. LoRA (Low-Rank Adaptation)
LoRA freezes the base model and inserts small low-rank adapter matrices into key projections, training only a tiny subset of weights.

- Typically ~0.5–1.5 % of parameters are trainable.  
- Big savings in memory and compute with minimal loss in quality.  
- Excellent balance for mid-sized models.  
- Docs: [PEFT LoRA](https://huggingface.co/docs/peft)

| Model | Peak Memory | Time (2 epochs) | Notes |
|:------|-------------:|----------------:|:------|
| Gemma-3-1B-IT | **15 GB** | ~2 min | Very efficient |
| Gemma-3-4B-IT | **30 GB** | ~5 min | Fast and stable |
| Gemma-3-12B-IT | **67 GB** | ~13 min | Heavier but fits |
| Gemma-3-27B-IT | — | ❌ Out of memory | Too large |
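
The small trainable fraction follows directly from the adapter shapes: a rank-`r` adapter on a `d_in x d_out` projection adds `r * (d_in + d_out)` parameters. The sketch below counts them for one transformer block using hypothetical layer sizes (not Gemma's real dimensions) and the `r=16`, `lora_alpha=32` values from this notebook.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical shapes for one transformer block (NOT Gemma's real dimensions).
HIDDEN, INTERMEDIATE = 2048, 8192
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}
R, LORA_ALPHA = 16, 32  # same values as the LoRA cells in this notebook

base_total = sum(d_in * d_out for d_in, d_out in TARGETS.values())
lora_total = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGETS.values())
print(f"base: {base_total:,}  lora: {lora_total:,}  "
      f"trainable: {100 * lora_total / base_total:.2f}%  scaling: {LORA_ALPHA / R}")
```

The ratio `lora_alpha / r` is the factor applied to the adapter output, so the settings used here scale the low-rank update by 2.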


### Unsloth Advantage: 16-bit LoRA
Unsloth is built around LoRA fine-tuning. Its custom kernels and its own `get_peft_model` implementation speed up PEFT training and lower peak memory consumption by reusing memory between operations.


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = 512,
    dtype = None,
    load_in_4bit = False,
)
torch_dtype = model.dtype

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model.print_trainable_parameters()
print(f"Weights footprint: {model.get_memory_footprint()/1e9:.2f} GB")

In [None]:
args = SFTConfig(
    dataset_text_field="text",
    output_dir=f"output-unsloth-{model_name}-lora",
    max_length=512,
    packing=False,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_checkpointing=False,
    optim="adamw_torch_fused",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    learning_rate=LR,
    fp16=(torch_dtype == torch.float16),
    bf16=(torch_dtype == torch.bfloat16),
    lr_scheduler_type="constant",
    report_to="none",
    dataset_kwargs={"add_special_tokens": False, "append_concat_token": True},
    save_total_limit=1,  # keep only latest checkpoint
)

trainer = SFTTrainer(model=model, args=args, train_dataset=ds['train'], eval_dataset=ds['test'], processing_class=tokenizer)
reset_peak_mem()
trainer.train()
report_peak_mem("lora")
trainer.save_model()

In [None]:
# Free all memory
full_cleanup(model, trainer)

# 3. 8-bit + LoRA
This configuration loads the base model in **int8** with BitsAndBytes and trains LoRA adapters in fp16/bf16, reducing memory further at the cost of slower training.

- Roughly halves the memory taken by the base weights compared to 16-bit LoRA; peak training memory drops by less.  
- The warning  
  `MatMul8bitLt: inputs will be cast from torch.float32 to torch.float16 during quantization`  
  is expected — it adds minor casting overhead and can slow training slightly on ROCm.  
- Docs: [BitsAndBytes integration](https://huggingface.co/blog/hf-bitsandbytes-integration)

| Model | Peak Memory | Time (2 epochs) | Notes |
|:------|-------------:|----------------:|:------|
| Gemma-3-1B-IT | **13 GB** | ~8 min | Works well |
| Gemma-3-4B-IT | **21 GB** | ~41 min | Stable |
| Gemma-3-12B-IT | **43 GB** | ~2 h 38 min | Slow but fits |
| Gemma-3-27B-IT | **32 GB** | ✳ May fail near end | Memory-tight |
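
For intuition about what int8 storage does to the weights, here is a simplified absmax quantization round trip in pure Python. It is a teaching sketch, not the BitsAndBytes algorithm, which scales per vector and handles outlier features separately; the function names are hypothetical.

```python
def quantize_absmax_int8(values):
    """Symmetric absmax quantization of a list of floats to int8 codes."""
    largest = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = largest / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.02, -0.51, 0.33, 1.27, -0.08]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max_err={max_err:.6f}")
```

Each weight is stored in one byte plus a shared scale, and the round-trip error is bounded by half the scale.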


### Unsloth Advantage: 8-bit LoRA
Standard HF 8-bit training goes through BitsAndBytes and casts inputs to `float16` on every forward pass. Unsloth's native `load_in_8bit=True` flag is meant to reduce that overhead in the underlying matrix multiplications, which should give speedups over the standard HF 8-bit configuration.


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = 512,
    dtype = None,
    load_in_8bit = True,
    load_in_4bit = False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model.print_trainable_parameters()
print(f"Weights footprint: {model.get_memory_footprint()/1e9:.2f} GB")

In [None]:
args = SFTConfig(
    dataset_text_field="text",
    output_dir=f"output-unsloth-{model_name}-8bit-lora",
    max_length=512,
    packing=False,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="adamw_8bit",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    learning_rate=LR,
    fp16=False,
    bf16=True,
    lr_scheduler_type="constant",
    report_to="none",
    dataset_kwargs={"add_special_tokens": False, "append_concat_token": True},
    save_total_limit=1,  # keep only latest checkpoint
)

trainer = SFTTrainer(model=model, args=args, train_dataset=ds['train'], eval_dataset=ds['test'], processing_class=tokenizer)
reset_peak_mem()
trainer.train()
report_peak_mem("8bit-lora")
trainer.save_model()

In [None]:
# Free all memory
full_cleanup(model, trainer)

# 4. QLoRA (4-bit NF4 + double quantization)
QLoRA compresses the base weights to **4-bit NF4** with double quantization while training LoRA adapters in bf16 for numerical stability.  
It offers the best trade-off between quality and memory use, allowing even large models to train.

- Base weights take roughly 4× less memory than in 16-bit precision, with minimal quality loss.  
- Best choice for large-scale fine-tuning under memory constraints.  
- Docs: [Gemma QLoRA guide](https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora)

| Model | Peak Memory | Time (2 epochs) | Notes |
|:------|-------------:|----------------:|:------|
| Gemma-3-1B-IT | **13 GB** | ~9 min | Efficient |
| Gemma-3-4B-IT | **13 GB** | ~9 min | Similar footprint to 1B |
| Gemma-3-12B-IT | **26 GB** | ~23 min | Fits comfortably |
| Gemma-3-27B-IT | **19 GB** | ✅ Runs successfully | Best option for 27B |
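
The effect of double quantization can be worked out with simple arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per block, 256 block scales per second-level block); whether the installed BitsAndBytes build uses exactly these values is an assumption.

```python
def bits_per_param(block_size=64, double_quant=True, second_block=256):
    """Storage cost per weight for 4-bit blockwise quantization.

    Block sizes follow the QLoRA paper (assumed, not read from the library).
    """
    if double_quant:
        # 8-bit scale per block, plus one fp32 constant per `second_block` scales.
        overhead = 8 / block_size + 32 / (block_size * second_block)
    else:
        # One fp32 scale per block.
        overhead = 32 / block_size
    return 4 + overhead

plain = bits_per_param(double_quant=False)
dq = bits_per_param(double_quant=True)
print(f"NF4: {plain:.3f} bits/param, NF4 + double quantization: {dq:.3f} bits/param")
print(f"Nominal 27B weights with double quantization: {27e9 * dq / 8 / 1e9:.1f} GB")
```

The saving from double quantization is about 0.37 bits per parameter, which adds up to roughly 1.3 GB on a nominal 27B model.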


### Unsloth Advantage: 4-bit QLoRA
This is Unsloth's primary focus. It combines "Dynamic 4-bit Quantization" with optimized Triton kernels. Unsloth reports up to 2x faster 4-bit fine-tuning than standard BitsAndBytes, and the dynamic scheme is designed to recover most of the accuracy that plain 4-bit quantization can lose.


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = 512,
    dtype = None,
    load_in_4bit = True,
)
model.config.use_cache = False

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model.print_trainable_parameters()
print(f"Weights footprint: {model.get_memory_footprint()/1e9:.2f} GB")

In [None]:
args = SFTConfig(
    dataset_text_field="text",
    output_dir=f"output-unsloth-{model_name}-qlora",
    max_length=512,
    packing=False,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",
    learning_rate=LR,  # same learning rate as the other configurations
    fp16=False,
    bf16=True,
    lr_scheduler_type="constant",
    report_to="none",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    dataset_kwargs={"add_special_tokens": False, "append_concat_token": True},
    save_total_limit=1,  # keep only latest checkpoint
)

trainer = SFTTrainer(model=model, args=args, train_dataset=ds['train'], eval_dataset=ds['test'], processing_class=tokenizer)
reset_peak_mem()
trainer.train()
report_peak_mem("qlora")
trainer.save_model()

In [None]:
# Free all memory
full_cleanup(model, trainer)

---
# Inference

## Inference with a fully fine-tuned model
This path loads the checkpoint produced by **full fine-tuning** (a directory matching `output-unsloth-*full`) with `FastLanguageModel.from_pretrained` and runs inference on it.

- The base model weights already include the fine-tuned parameters — no adapters to merge or attach.  
- `FastLanguageModel.for_inference()` switches the model to Unsloth's inference mode; the code below does not pass `attn_implementation` explicitly.  
- The Hugging Face `pipeline("text-generation")` handles tokenization and sampling.  
- Use this path when you trained with full fine-tuning or after you have merged adapters into the base model.


In [None]:
import glob
import os

# Find all full finetune output directories
full_checkpoints = sorted(glob.glob("output-unsloth-*full"))

if not full_checkpoints:
    raise FileNotFoundError("No full fine-tuning directories found (looking for 'output-unsloth-*full').")

print(f"Found {len(full_checkpoints)} checkpoints:")
for i, ckpt in enumerate(full_checkpoints):
    print(f"[{i}] {ckpt}")

selection = input(f"\nSelect checkpoint index [0-{len(full_checkpoints)-1}] (default 0): ").strip()
try:
    idx = int(selection) if selection else 0
    selected_path = full_checkpoints[idx]
except (ValueError, IndexError):
    print(f"Invalid selection '{selection}', defaulting to {full_checkpoints[0]}")
    selected_path = full_checkpoints[0]

print(f"Loading model from: {selected_path}")

# Load the fully fine-tuned checkpoint
from unsloth import FastLanguageModel
base, tokenizer = FastLanguageModel.from_pretrained(
    model_name = selected_path,
    max_seq_length = 512,
)
FastLanguageModel.for_inference(base)
tok = AutoTokenizer.from_pretrained(MODEL)

from transformers import pipeline, TextStreamer
streamer = TextStreamer(tok, skip_prompt=True)
pipe = pipeline("text-generation", model=base, tokenizer=tok, streamer=streamer)

sample = ds['test'][0]
prompt = tok.apply_chat_template(sample["messages"][:1], tokenize=False, add_generation_prompt=True)
# Print the labels first: the streamer writes tokens to stdout while pipe() runs.
print(f"User: {sample['messages'][0]['content']}")
print(f"\nExpected: {sample['messages'][1]['content']}")
print("\nGenerated: ", end="")
out = pipe(prompt, max_new_tokens=100, disable_compile=True)


In [None]:
ds['test'][1]

## Inference with LoRA or QLoRA adapters
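This path loads the **original base model** together with the **LoRA / QLoRA adapters** from a fine-tuned checkpoint (`output-unsloth-<model>-lora`, `output-unsloth-<model>-8bit-lora`, or `output-unsloth-<model>-qlora`).

- The base model remains frozen; the adapter layers modify its activations at runtime.  
- The code below passes the adapter directory to `FastLanguageModel.from_pretrained()`, which is expected to load the base model and attach the trained adapters. With plain PEFT you would load the base model first and then call `PeftModel.from_pretrained()`.  
- `FastLanguageModel.for_inference()` switches the model to Unsloth's inference mode; `attn_implementation` is not passed explicitly.  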
- Use this path when the model was trained with LoRA or QLoRA and you want to keep the adapters separate (for lightweight sharing or quick swapping).


In [None]:
import glob
import os

# Find all LoRA/QLoRA output directories (excluding full fine-tunes).
# "lora" also matches the 8-bit LoRA and QLoRA directory names.
lora_checkpoints = sorted(d for d in glob.glob("output-unsloth-*") if "full" not in d and "lora" in d)

if not lora_checkpoints:
    raise FileNotFoundError("No LoRA/QLoRA directories found (looking for 'output-unsloth-*lora*', excluding 'full').")

print(f"Found {len(lora_checkpoints)} adapter checkpoints:")
for i, ckpt in enumerate(lora_checkpoints):
    print(f"[{i}] {ckpt}")

selection = input(f"\nSelect checkpoint index [0-{len(lora_checkpoints)-1}] (default 0): ").strip()
try:
    idx = int(selection) if selection else 0
    selected_path = lora_checkpoints[idx]
except (ValueError, IndexError):
    print(f"Invalid selection '{selection}', defaulting to {lora_checkpoints[0]}")
    selected_path = lora_checkpoints[0]

print(f"Loading base model: {MODEL}")
print(f"Loading adapters from: {selected_path}")

# Adapter load path: the adapter directory is passed directly to from_pretrained
from unsloth import FastLanguageModel
base, tokenizer = FastLanguageModel.from_pretrained(
    model_name = selected_path,
    max_seq_length = 512,
)
FastLanguageModel.for_inference(base)
tok = AutoTokenizer.from_pretrained(MODEL)

from transformers import pipeline, TextStreamer
streamer = TextStreamer(tok, skip_prompt=True)
pipe = pipeline("text-generation", model=base, tokenizer=tok, streamer=streamer)

sample = ds['test'][0]
prompt = tok.apply_chat_template(sample["messages"][:1], tokenize=False, add_generation_prompt=True)
# Print the labels first: the streamer writes tokens to stdout while pipe() runs.
print(f"User: {sample['messages'][0]['content']}")
print(f"\nExpected: {sample['messages'][1]['content']}")
print("\nGenerated: ", end="")
out = pipe(prompt, max_new_tokens=100, disable_compile=True)


## Save and Export Options (Unsloth Native)
Unsloth can merge LoRA adapters back into the base model and export the result as a 16-bit Hugging Face checkpoint or as GGUF, without external conversion scripts.
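
Merging rests on a simple identity: adding `(alpha / r) * B @ A` to the frozen weight `W` gives the same output as running the adapter alongside the base layer. The sketch below checks this on tiny hypothetical matrices in pure Python; it shows only the algebra, not how Unsloth implements the merge or handles quantized weights.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A for W (out x in), B (out x r), A (r x in)."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Tiny hypothetical shapes: 2 outputs, 3 inputs, rank 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]      # r x in
B = [[2.0], [-1.0]]         # out x r
x = [[1.0], [2.0], [3.0]]   # input column vector
alpha, r = 2, 1

# Path 1: merge once, then use a single matrix.
y_merged = matmul(merge_lora(W, A, B, alpha, r), x)

# Path 2: keep the adapter separate and add its scaled output at runtime.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_separate = [[b_row[0] + (alpha / r) * a_row[0]] for b_row, a_row in zip(base_out, adapter_out)]

print(y_merged, y_separate)
```

A merged model needs no PEFT code at inference time, while separate adapters stay small and can be swapped on top of one shared base model.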


In [None]:
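# Merge LoRA adapters into a 16-bit Hugging Face model.
# `model` must be a live Unsloth LoRA model (the training cells above free theirs via full_cleanup).
# model.save_pretrained_merged("gemma-unsloth-merged-16bit", tokenizer, save_method="merged_16bit")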


In [None]:
# Export directly to GGUF for llama.cpp/Ollama ("q8_0" for 8-bit, "f16" for 16-bit)
# model.save_pretrained_gguf("gemma-unsloth-gguf", tokenizer, quantization_method="q8_0")
