# 1️⃣ Training a LoRA for Seq2Seq Conditional Generation

In this example, we will fine-tune the `mt0-large` model to generate text labels based on the input. For this purpose, we will use the reparametrization PEFT method called **Low Rank Adaptation** (LoRA). We will use `transformers` to download and train the model, `datasets` to download the data, and `peft` for the LoRA reparametrization.


## How LoRA Works

<img src="https://raw.githubusercontent.com/ivanvykopal/peft-kinit-2025/heads/master/images/qlora.png" alt="LoRA vs. QLoRA" width="500"/>

**LoRA** (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. Instead of updating all model weights during training, LoRA freezes the original pre-trained weights and injects a small number of trainable low-rank matrices.

From the image (left side):

- **W (FP16)**: The original pre-trained model weights, kept frozen (non-trainable).
- **A and B matrices**: Low-rank trainable adapters. Matrix A projects the input into a lower-dimensional space, and matrix B projects it back. In effect, the adapters learn the difference between the pre-trained weights and the weights of the fine-tuned model.
- **D_in → D_int → D_out**: Dimensions of the input, intermediate (low-rank), and output spaces.
- **X**: Input data, duplicated and passed through both the frozen weights and the LoRA adapters.
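
The data flow in the list above can be sketched with NumPy. This is a minimal illustration, not the peft implementation: the toy dimensions, the random values, and the zero initialisation of B (taken from the LoRA paper) are assumptions, and the alpha scaling factor is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_int, d_out = 16, 4, 16  # toy sizes (assumed); d_int is the LoRA rank

W = rng.normal(size=(d_in, d_out))         # frozen pre-trained weights
A = rng.normal(size=(d_in, d_int)) * 0.01  # trainable down-projection
B = np.zeros((d_int, d_out))               # trainable up-projection, zero-initialised


def lora_forward(x, W, A, B):
    """Pass x through the frozen weights and the adapter path, then sum both."""
    return x @ W + (x @ A) @ B


x = rng.normal(size=(2, d_in))  # a batch of two input vectors

# With B = 0 the adapter contributes nothing, so the layer behaves like the frozen one.
assert np.allclose(lora_forward(x, W, A, B), x @ W)

# Pretend training has made B non-zero: the adapter path is then equivalent
# to adding the low-rank update A @ B to the frozen weights.
B = rng.normal(size=(d_int, d_out))
assert np.allclose(lora_forward(x, W, A, B), x @ (W + A @ B))
```

Because the adapter path is equivalent to adding `A @ B` to `W`, the update can never have a rank higher than `d_int`, which is what keeps the number of trainable values small.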


## Installation

Besides `transformers`, we require `datasets` for loading datasets and `peft` for training LoRA adapters.

In [None]:
%pip install -q --user transformers[torch]==4.36.0
%pip install -q --user datasets
%pip install -q --user peft

In [None]:
import torch
import os

# Import PEFT
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType
)

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    default_data_collator,
    get_linear_schedule_with_warmup,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    GenerationConfig
)

from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

We will fine-tune the pre-trained `mt0-large` model, which has 1.2B parameters. We will set the maximum **input length to 128 tokens** and train for **3 epochs** with a batch size of 32.


In [None]:
device = "cuda"
model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"

os.environ["TOKENIZERS_PARALLELISM"] = "false"

max_length = 128
lr = 1e-3
num_epochs = 3
batch_size = 32  # if you hit out-of-memory ("unable to allocate") errors, decrease the batch size (e.g. to 8 or 16)


checkpoint_name = "financial_sentiment_analysis_lora_v1.pt"
text_column = "sentence"
label_column = "text_label"

The A and B matrices will have an inner dimension (rank) of **r = 8**. A has size **d x r** and B has size **r x l**, so the product A x B has size **d x l**, the same size as the weight matrix W that it adapts.

We can also set the dropout rate and **alpha, which scales the adapter update**. In addition, we can specify the **target modules** (the names of the modules that we want to reparametrize). If none are given, the peft library selects them from its list of known model architectures.

In [None]:
r = 8 # Size of the low-rank matrices (rank)
lora_alpha = 32 # The alpha parameter for Lora scaling
lora_dropout = 0.1 # The dropout probability for Lora layers

# Experiment with different reparametrization
target_modules = None
# target_modules = "all-linear"
# target_modules = ["q", "k", "v"]
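
The shapes and the scaling set in this cell can be written out as a short worked example. This is a sketch under assumptions: the factor $\alpha / r$ follows the LoRA paper [1] and, as far as we know, the default behaviour of peft, and $d = l = 1024$ is an illustrative size rather than a value taken from this tutorial.

$$
W' = W + \frac{\alpha}{r} A B, \qquad A \in \mathbb{R}^{d \times r}, \quad B \in \mathbb{R}^{r \times l}, \quad A B \in \mathbb{R}^{d \times l}
$$

With $r = 8$ and $\alpha = 32$, the adapter update is scaled by $\alpha / r = 4$. For $d = l = 1024$, the number of values to train compares with the size of $W$ as follows:

$$
d r + r l = 8 \cdot 1024 + 8 \cdot 1024 = 16\,384, \qquad d l = 1024 \cdot 1024 = 1\,048\,576
$$

Under these assumptions, A and B together hold $1/64 \approx 1.56\%$ of the values in the matrix they adapt.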

**Now, we will create the PEFT model.**

The Hugging Face PEFT library will freeze the original weights and add the LoRA weights automatically.

Compare the model architectures with and without the added LoRA weights.

In [None]:
# creating model
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=target_modules,  # None lets peft pick its defaults for this architecture
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
model

The output of `print_trainable_parameters()` shows that we have reduced the number of trainable parameters to a mere **0.2% of the original model parameters**.
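
The figure of roughly 0.2% can be reproduced with simple arithmetic. The sketch below rests on assumptions that are not stated in this tutorial: the mT5-large-style dimensions, the default target modules (`q` and `v` in every attention block), and the total parameter count are all assumed values, so compare them with the printed model before relying on them.

```python
# Assumed mT5-large-style dimensions (not taken from this tutorial).
d_model = 1024           # hidden size
inner_dim = 16 * 64      # assumed num_heads * d_kv
num_encoder_layers = 24
num_decoder_layers = 24
r = 8                    # LoRA rank used above

# Assumed default targets: the q and v projections of every attention block.
targets_per_attention = 2
# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention).
attention_blocks = num_encoder_layers * 1 + num_decoder_layers * 2

params_per_module = d_model * r + r * inner_dim  # A is d x r, B is r x l
trainable = attention_blocks * targets_per_attention * params_per_module

total = 1_231_940_608    # assumed total parameter count, adapters included
share = 100 * trainable / total
print(f"trainable params: {trainable:,} ({share:.2f}% of {total:,})")
```

If `target_modules` is changed (for example to `"all-linear"`), more modules receive adapters and the trainable share grows accordingly.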

## Dataset and preprocessing

The dataset that we will be using is called [Financial Phrasebank](https://huggingface.co/datasets/financial_phrasebank), a polar sentiment dataset of sentences from financial news. It consists of 4.84k sentences from English-language financial news, categorised by sentiment. The dataset is divided into subsets by the agreement rate of the 5-8 annotators per sentence. We will use the subset on which all annotators agreed, which contains around **2.26k samples**.

We will also split the dataset in a ratio of **_80% : 10% : 10%_** into train, validation and test sets. Because we are doing seq2seq training, we also need to convert the integer labels to string labels: 0 to negative, 1 to neutral and 2 to positive.

In [None]:
# loading dataset
dataset = load_dataset("financial_phrasebank", "sentences_allagree")

dataset = dataset["train"].train_test_split(test_size=0.2)
validtest = dataset["test"].train_test_split(test_size=0.5)

dataset["validation"] = validtest["train"]
dataset["test"] = validtest["test"]

classes = dataset["train"].features["label"].names
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["label"]]},
    batched=True,
    num_proc=1,
)

dataset["train"][0]
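
The label conversion in this cell can be tried on a toy batch without downloading anything. In this sketch the class-name order and the three example sentences are assumptions made for illustration; in the notebook, `classes` is read from the dataset features.

```python
# Assumed class-name order; in the notebook this list comes from the dataset features.
classes = ["negative", "neutral", "positive"]


def add_text_label(batch):
    """Map a batch of integer labels to their class names."""
    return {"text_label": [classes[label] for label in batch["label"]]}


# Hypothetical sentences, shaped like a batch passed to `dataset.map(..., batched=True)`.
toy_batch = {
    "sentence": ["Profit rose.", "Sales fell.", "The meeting is in May."],
    "label": [2, 0, 1],
}
print(add_text_label(toy_batch))
# {'text_label': ['positive', 'negative', 'neutral']}
```

With `batched=True`, the function receives lists of values per column and must return lists of the same length, which is why the mapping is written as a list comprehension.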

Now we need to tokenize the datasets. We will use the tokenizer trained for the model and also tokenize the labels, padding them to a fixed length. The maximum length for the **target labels is set to 3 tokens**. The padding token ID is 0; in the labels, it is replaced with -100 so that padded positions are ignored by the loss.


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs


processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

train_dataset = processed_datasets["train"].shuffle()
eval_dataset = processed_datasets["validation"]
test_dataset = processed_datasets["test"]
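
The replacement of padding with -100 in `preprocess_function` can be illustrated in plain Python. This is a minimal sketch with made-up token IDs; the value -100 is the default index that PyTorch's cross-entropy loss ignores, and the helper names are hypothetical.

```python
PAD_TOKEN_ID = 0     # padding ID used by the tokenizer in this tutorial
IGNORE_INDEX = -100  # default index ignored by PyTorch's cross-entropy loss


def mask_padding(label_ids):
    """Replace padding IDs so that padded positions do not contribute to the loss."""
    return [[IGNORE_INDEX if tok == PAD_TOKEN_ID else tok for tok in row] for row in label_ids]


def unmask_padding(label_ids):
    """Restore padding IDs, which is needed before the labels can be decoded to text."""
    return [[PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in row] for row in label_ids]


# Hypothetical token IDs for two labels padded to length 3.
toy_labels = [[259, 1, 0], [4321, 567, 1]]
masked = mask_padding(toy_labels)
print(masked)  # [[259, 1, -100], [4321, 567, 1]]
```

The reverse step matters later: -100 is not a valid token ID, so it has to be mapped back to the padding ID before the labels are passed to the tokenizer for decoding.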

## Training and evaluation

For training, we use the Hugging Face [Seq2SeqTrainer](https://huggingface.co/docs/transformers/main_classes/trainer), configured with [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.TrainingArguments). To predict the labels during evaluation, we set `predict_with_generate=True`, so that the trainer calls the `model.generate()` method. The trainer also takes a `compute_metrics` function, which it uses to compute metrics during evaluation.

We would like to compute the accuracy (exact match) between two sets of strings.

In [None]:
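import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # -100 marks ignored positions and cannot be decoded, so map it back to the pad token
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # exact-match accuracy between predicted and reference label strings
    correct = sum(pred.strip() == true.strip() for pred, true in zip(preds, labels))
    return {"accuracy": correct / len(labels)}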


training_args = Seq2SeqTrainingArguments(
    "lora",
    per_device_train_batch_size=batch_size,
    learning_rate=lr,
    num_train_epochs=num_epochs,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    predict_with_generate=True,
    generation_config=GenerationConfig(max_new_tokens=10),
)
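
The exact-match accuracy itself can be checked on plain strings, independently of the tokenizer. The sketch below uses hypothetical predictions and labels; `exact_match_accuracy` is an illustrative helper, not part of transformers.

```python
def exact_match_accuracy(preds, labels):
    """Share of predictions that equal their label after stripping whitespace."""
    pairs = list(zip(preds, labels))
    correct = sum(pred.strip() == true.strip() for pred, true in pairs)
    return correct / len(pairs)


# Hypothetical decoded outputs; the third prediction is wrong.
toy_preds = ["positive", " neutral ", "negative", "neutral"]
toy_labels = ["positive", "neutral", "positive", "neutral"]
print(exact_match_accuracy(toy_preds, toy_labels))  # 0.75
```

Exact match is strict: a generated string such as "positive sentiment" would count as wrong, which is a reasonable choice here because the targets are single class names.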

Now we will run the training and evaluation. Take a quick look at the GPU memory usage: how much are we using? How would the memory usage change with full fine-tuning (FFT), where all model weights are updated?

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

trainer.evaluate(eval_dataset=test_dataset, metric_key_prefix="test")
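
As a rough way to reason about the memory question, the memory needed for weights, gradients and optimizer states can be estimated by hand. This is a back-of-the-envelope sketch under stated assumptions: 32-bit values, an Adam-style optimizer with two states per trainable parameter, and no activations or framework overhead, so real usage will be higher. The helper name is hypothetical.

```python
def training_memory_gb(total_params, trainable_params, bytes_per_value=4):
    """Rough memory for weights, gradients and Adam states (activations excluded)."""
    weights = total_params * bytes_per_value
    grads = trainable_params * bytes_per_value
    adam_states = 2 * trainable_params * bytes_per_value  # first and second moments
    return (weights + grads + adam_states) / 1e9


total = 1_200_000_000           # about 1.2B parameters, as stated in this tutorial
lora_trainable = 0.002 * total  # about 0.2% of them are trainable with LoRA

full_ft = training_memory_gb(total, total)
lora = training_memory_gb(total, lora_trainable)
print(f"full fine-tuning: ~{full_ft:.1f} GB")
print(f"LoRA:             ~{lora:.1f} GB")
```

Under these assumptions, full fine-tuning needs about four times as much memory for parameters and optimizer state as LoRA, because gradients and optimizer states are kept only for trainable parameters.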

## Save and load

Now we can save the model with the `save_pretrained` method, as we would for other Hugging Face transformers models. For a PEFT model, this stores only the adapter weights, which is why the checkpoint is small.

In [None]:
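import locale
# Workaround for notebook environments (e.g. Colab) where `!` shell commands may fail without a UTF-8 locale
locale.getpreferredencoding = lambda: "UTF-8"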

peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"
model.save_pretrained(peft_model_id)

ckpt = f"{peft_model_id}/adapter_model.safetensors"
!du -h $ckpt

We can now load the base model together with the saved adapter and give it a custom example.

Note that `save_pretrained` saves whatever weights the model currently holds. Because the training arguments set `load_best_model_at_end=True`, the trainer should have reloaded the checkpoint with the best validation loss at the end of training, so this is the version we saved rather than simply the last one.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)

In [None]:
model.eval()

inputs = tokenizer(input(), return_tensors="pt")
print(inputs)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10)
    print(outputs)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

## References

This tutorial is inspired by the code prepared by [Robert Belanec](https://kinit.sk/member/robert-belanec/). In particular, the implementation is based on the following example: [**lora_seq2seq.ipynb**](https://github.com/Wicwik/peft_tutorial/blob/main/examples/lora_seq2seq.ipynb).

**Citations:**

[1] Hu et al. (2021). [**LoRA: Low-Rank Adaptation of Large Language Models**](https://arxiv.org/abs/2106.09685) <br/>
[2] [**peft**](https://github.com/huggingface/peft) <br/>
[3] Muennighoff et al. (2022). [**Crosslingual Generalization through Multitask Finetuning**](https://arxiv.org/abs/2211.01786)  