#  3️⃣ Finetuning Llama models using Quantized-LoRA with _Adapters_

In this notebook, we show how to efficiently fine-tune a quantized **Llama-3.1** model using [**QLoRA** (Dettmers et al., 2023)](https://dl.acm.org/doi/10.5555/3666122.3666563) and the [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library.

For this example, we will finetune Llama-3.1 8B on supervised instruction tuning data collected by the [Open Assistant project](https://github.com/LAION-AI/Open-Assistant) for training chatbots. This is similar to the setup used to train the Guanaco models in the QLoRA paper.
You can simply replace this with any of your own domain-specific data!

Additionally, you can quickly adapt this notebook to use other **adapter methods such as bottleneck adapters, prefix tuning or prompt tuning.**


## LoRA vs. QLoRA

<!-- <img src="../../images/qlora.png" width="600"> -->
<img src="https://raw.githubusercontent.com/ivanvykopal/peft-kinit-2025/heads/master/images/qlora.png" alt="LoRA vs. QLoRA" width="500"/>

### How LoRA Works

**LoRA** (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. Instead of updating all model weights during training, LoRA freezes the original pre-trained weights and injects a small number of trainable low-rank matrices.

From the image (left side):

- **W (FP16)**: The original pre-trained model weights, kept frozen (non-trainable).
- **A and B matrices**: Low-rank trainable adapters. Matrix A projects the input into a lower-dimensional space, and matrix B projects it back. In effect, the product of B and A learns the weight update, i.e. the difference between the pre-trained weights and the fine-tuned weights.
- **D_in → D_int → D_out**: Dimensions of the input, intermediate (low-rank), and output spaces.
- **X**: Input data, duplicated and passed through both the frozen weights and the LoRA adapters.
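
The two paths in the figure can be sketched in a few lines of plain Python. This is a minimal illustration, not the `adapters` implementation: the helper names `matvec` and `lora_forward`, the toy dimensions, and the `alpha / r` scaling with a zero-initialized B (as described in the LoRA paper) are assumptions made for the example.

```python
import random

def matvec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)                  # frozen path through the original weights
    low_rank = matvec(B, matvec(A, x))   # D_in -> r -> D_out through the adapter
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 16      # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter contributes nothing, so training starts
# from exactly the behaviour of the pre-trained layer.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))
```

Real implementations use batched tensor operations on the GPU; the list-based version only makes the data flow explicit.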

### QLoRA

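**QLoRA** builds on LoRA, but applies the LoRA adapters on top of a quantized base model (typically 4-bit, as in the paper). The base weights stay frozen in low precision and are dequantized on the fly during the forward and backward pass, while only the LoRA matrices are trained in higher precision. The main benefit is a much smaller memory footprint. Note that this is not quantization-aware training: the quantized base weights are never updated.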
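
To get a feeling for the savings, here is a back-of-the-envelope estimate of the memory needed for the model weights alone at different precisions. The helper `weight_memory_gb` and the round figure of 8 billion parameters are assumptions for illustration; activations, optimizer state and quantization constants are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # assumption: round figure for an "8B" model

for label, bits in [("16-bit (FP16/BF16)", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>20}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

The actual footprint during training is higher, since the adapter weights, their gradients and the activations also have to fit on the GPU.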


You can also open this example in Google Colab:

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ivanvykopal/peft-kinit-2025/blob/master/examples/adapters/03_QLoRA_Llama.ipynb)

## Installation

Besides `adapters`, we require `bitsandbytes` for quantization and `accelerate` for training.

In [None]:
!pip install -qq -U adapters accelerate bitsandbytes datasets

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

## Load Open Assistant dataset

We use the [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset published by the QLoRA authors. It contains a small subset of conversations from the full Open Assistant database and was also used to finetune the Guanaco models in the QLoRA paper.
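
If you want to swap in your own data, each record needs a single `text` field in the same prompt template. The sketch below assumes hypothetical question/answer pairs and the `### Human:` / `### Assistant:` markers used later in this notebook for prompting; the newline separator is an assumption, so compare against a printed sample of the dataset before relying on it.

```python
def format_example(question: str, answer: str) -> dict:
    """Wrap one question/answer pair in the Human/Assistant template."""
    text = f"### Human: {question.strip()}\n### Assistant: {answer.strip()}"
    return {"text": text}

# Hypothetical domain-specific records
raw_pairs = [
    ("What is LoRA?", "A parameter-efficient fine-tuning method."),
    ("What does QLoRA add?", "It applies LoRA on top of a quantized base model."),
]
records = [format_example(q, a) for q, a in raw_pairs]
print(records[0]["text"])
```

A list of such dictionaries can then be turned into a dataset, for example with `datasets.Dataset.from_list(records)`.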

In [None]:
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")

The training split has roughly 10k samples:

In [None]:
dataset

In [None]:
print(dataset["train"][0]["text"])

## Load and prepare model and tokenizer

We download the official Llama-3.1 8B checkpoint from the HuggingFace Hub. **Note:** You must request access to this model on the HuggingFace website and use an API token to download it.

Via the `BitsAndBytesConfig`, we specify that the model should be loaded with 4-bit quantization and with double quantization for even better memory efficiency. See the [bitsandbytes documentation](https://huggingface.co/docs/bitsandbytes/main/en/index) for more on this.
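
Double quantization reduces the overhead of the quantization constants by quantizing those constants as well. The arithmetic below is a sketch that assumes the block sizes reported in the QLoRA paper (64 weights per block, 256 blocks per second-level group, 8-bit second-level constants); the function name and the 8 billion parameter figure are illustrative.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Bits per parameter spent on quantization constants."""
    if not double_quant:
        return 32 / block_size  # one FP32 constant per block of weights
    # 8-bit constant per block, plus one FP32 constant per group of blocks
    return 8 / block_size + 32 / (block_size * second_block_size)

plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
saved_gb = (plain - double) * 8e9 / 8 / 1e9  # assumption: 8e9 parameters

print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"saved for an 8B model:       {saved_gb:.2f} GB")
```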

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig

modelpath="meta-llama/Llama-3.1-8B"
# modelpath="Qwen/Qwen2.5-7B"   # Alternatively, you can use Qwen2.5-7B instead of Llama-3.1-8B

# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, # we are loading the model in 4-bit
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(modelpath)
tokenizer.pad_token = tokenizer.eos_token

We initialize the adapter functionality in the loaded model via `adapters.init()` and add a new LoRA adapter (named `"assistant_adapter"`) via `add_adapter()`.

In the call to `LoRAConfig()`, you can configure how and where LoRA layers are added to the model. Here, we add LoRA layers to the query, key and value projections of the self-attention modules (`attn_matrices=["q", "k", "v"]`) as well as to the intermediate and output linear layers.
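
To see how the rank `r` translates into trainable parameters, the sketch below counts the LoRA weights for the three attention projections only. The layer shapes (hidden size 4096, key/value width 1024, 32 layers) are assumptions about the Llama-3.1 8B architecture, and the intermediate and output layers are left out because their mapping to concrete modules depends on the library.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights of one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

R = 64             # rank, as in the config below
HIDDEN = 4096      # assumed hidden size
KV_DIM = 1024      # assumed key/value projection width (grouped-query attention)
NUM_LAYERS = 32    # assumed number of transformer layers

per_layer = {
    "q": lora_params(HIDDEN, HIDDEN, R),
    "k": lora_params(HIDDEN, KV_DIM, R),
    "v": lora_params(HIDDEN, KV_DIM, R),
}
attention_total = NUM_LAYERS * sum(per_layer.values())

print(per_layer)
print(f"attention LoRA parameters: {attention_total:,}")
print(f"share of one frozen q projection: {per_layer['q'] / (HIDDEN * HIDDEN):.1%}")
```

The `adapter_summary()` call in the next cell reports the actual number for the full configuration.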

In [None]:
import adapters
from adapters import LoRAConfig

adapters.init(model)

config = LoRAConfig(
    selfattn_lora=True,
    intermediate_lora=True,
    output_lora=True,
    attn_matrices=["q", "k", "v"], # We are using LoRA on the attention matrices q, k, v
    alpha=16,
    r=64,
    dropout=0.1
)
model.add_adapter("assistant_adapter", config=config) # This is the name of the LoRA adapter we are creating
model.train_adapter("assistant_adapter")

print(model.adapter_summary())

To correctly train bottleneck adapters or prefix tuning, uncomment the following line to move the adapter weights to the GPU explicitly:

In [None]:
# model.adapter_to("assistant_adapter", device="cuda")

Some final preparations for 4-bit training: we cast the one-dimensional parameters (such as layer norm weights) to float32 and make the LM head return float32 logits for stability.
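
A rough way to see why low-precision storage can hurt small parameters is to emulate bfloat16 in plain Python. The helper `to_bfloat16` is a simplification that truncates a float32 to its top 16 bits (hardware usually rounds to nearest instead), and the step size and step count are made up for the example.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

update = 1e-3          # hypothetical small update per step
w_bf16 = 1.0           # parameter stored in (emulated) bfloat16
w_ref = 1.0            # same parameter kept in higher precision

for _ in range(100):
    w_bf16 = to_bfloat16(w_bf16 + update)  # result is cast back after every step
    w_ref = w_ref + update

# Near 1.0 the spacing between bfloat16 values is 2**-7, so each small
# update is lost when the result is stored again.
print(w_bf16, w_ref)
```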

In [None]:
for param in model.parameters():
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

# Enable gradient checkpointing to reduce required memory if needed
# model.gradient_checkpointing_enable()
# model.enable_input_require_grads()

class CastOutputToFloat(torch.nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

In [None]:
model

In [None]:
# Verifying the datatypes.
dtypes = {}
for _, p in model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

## Prepare data for training

The dataset is tokenized and truncated.

In [None]:
import os 

def tokenize(element):
    return tokenizer(
        element["text"],
        truncation=True,
        max_length=512, # can set to longer values such as 2048
        add_special_tokens=False,
    )

dataset_tokenized = dataset.map(
    tokenize, 
    batched=True, 
    num_proc=os.cpu_count(),    # one worker process per CPU core
    remove_columns=["text"]     # don't need this anymore, we have tokens from here on
)

In [None]:
dataset_tokenized

## Training

We specify training hyperparameters and train the model using the `AdapterTrainer` class.

The hyperparameters here are similar to those chosen [in the official QLoRA repo](https://github.com/artidoro/qlora/blob/main/scripts/finetune_llama2_guanaco_7b.sh), but feel free to configure as you wish!
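
Before training, it can help to check what the batch settings in the next cell add up to. The sketch assumes a single GPU (as suggested by `CUDA_VISIBLE_DEVICES="0"` above) and uses the rounded figure of 10k training samples.

```python
per_device_batch = 1     # per_device_train_batch_size
grad_accum = 16          # gradient_accumulation_steps
num_gpus = 1             # assumption: single GPU
max_steps = 1000         # max_steps
train_samples = 10_000   # rounded size of the training split

effective_batch = per_device_batch * grad_accum * num_gpus
samples_seen = effective_batch * max_steps
epochs = samples_seen / train_samples

print(f"sequences per optimizer step: {effective_batch}")
print(f"sequences seen in total:      {samples_seen}")
print(f"approximate epochs:           {epochs:.1f}")
```

If you change the batch size or the number of GPUs, adjusting `gradient_accumulation_steps` keeps the effective batch size comparable.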

In [None]:
args = TrainingArguments(
    output_dir="output/llama_qlora",
    # output_dir="output/qwen_qlora",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    eval_strategy="steps",  # called evaluation_strategy in older transformers releases
    logging_steps=100,
    save_steps=250,
    eval_steps=187,
    save_total_limit=3,
    gradient_accumulation_steps=16,
    max_steps=1000,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",
    learning_rate=0.0002,
    group_by_length=True,
    bf16=True,
    warmup_ratio=0.03,
    max_grad_norm=0.3,
)

In [None]:
from adapters import AdapterTrainer
from transformers import DataCollatorForLanguageModeling

trainer = AdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
    args=args,
)

trainer.train()

In [None]:
trainer.save_model()

## Inference

Finally, we can prompt the model:

In [None]:
# Ignore warnings
from transformers import logging
logging.set_verbosity(logging.CRITICAL)

def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text}\n### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)
    
    model.eval()
    with torch.inference_mode(), torch.autocast("cuda"):
        output_tokens = model.generate(**batch, max_new_tokens=50)

    return tokenizer.decode(output_tokens[0], skip_special_tokens=True)


In [None]:
print(prompt_model(model, "Explain Calculus to a primary school student"))

## Merge LoRA weights

For lower inference latency, the LoRA weights can be merged with the base model:

In [None]:
model.merge_adapter("assistant_adapter")
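
Merging is possible because the adapter path is linear in the input. Using the notation from the LoRA section above, and assuming the usual LoRA scaling factor of alpha over r and an unquantized weight matrix W:

$$
h = W x + \frac{\alpha}{r} B A x = \left(W + \frac{\alpha}{r} B A\right) x = W_{\text{merged}}\, x
$$

After merging, a single matrix multiplication replaces the two parallel paths, which is where the latency saving comes from. How the merged weights are stored when the base model is 4-bit quantized depends on the library and is not covered by this identity.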

In [None]:
print(prompt_model(model, "Explain NLP in simple terms"))

## References

This tutorial is inspired by the [AdapterHub](https://adapterhub.ml) project and its associated codebase. In particular, the implementation is based on the official example notebook: [**QLoRA_Llama_Finetuning.ipynb**](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb).

**Citations:**

[1] Hu et al. (2021). [**LoRA: Low-Rank Adaptation of Large Language Models**](https://arxiv.org/abs/2106.09685) <br/>
[2] Dettmers et al. (2023). [**QLoRA: Efficient Finetuning of Quantized LLMs**](https://dl.acm.org/doi/10.5555/3666122.3666563) <br/>
[3] [**bitsandbytes**](https://github.com/bitsandbytes-foundation/bitsandbytes) <br/>
[4] Grattafiori et al. (2024). [**The Llama 3 Herd of Models**](https://arxiv.org/abs/2407.21783) <br/>
[5] Köpf et al. (2023). [**Open-Assistant**](https://arxiv.org/abs/2304.07327)