# Language Models 2 | Parameter-Efficient Finetuning (PEFT), Low-Rank Adaptation (LoRA) and Supervised Fine-Tuning (SFT)


In this example, the authors use a new, higher-level library for training called [TRL](https://huggingface.co/docs/trl/en/index) ("Transformer Reinforcement Learning"), used here to fine-tune LLMs.

There are three main concepts here:
- **PEFT**: training only a small number of parameters in an LLM (much lighter than training all of them);
- **LoRA**: keeping the LLM's large weight matrices frozen and training only small low-rank matrices added alongside them (one PEFT technique);
- **Supervised Fine-Tuning**: after doing plain next-token-prediction training on large datasets, the next step is usually to finetune the LLM on a smaller dataset that has a specific *format* (for instance, the format of a chat). The training method remains the same, with some tweaks (dataset format, sometimes only allowing the training signal to come from the continuation and not the prompt/earlier chat messages, etc.).
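
To make the "training signal only from the continuation" idea concrete, here is a minimal sketch in plain Python. It assumes the common PyTorch/Hugging Face convention that label positions set to `-100` are skipped by the cross-entropy loss; the token ids and the helper `mask_prompt_labels` are made up for illustration and are not part of TRL.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt_labels(input_ids, prompt_length):
    """Copy the token ids as labels, hiding the first `prompt_length` positions from the loss."""
    labels = list(input_ids)
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


# hypothetical token ids: 3 prompt tokens followed by 4 continuation tokens
input_ids = [101, 2054, 2003, 1037, 14686, 1029, 102]
labels = mask_prompt_labels(input_ids, prompt_length=3)
print(labels)
```

Only the last four positions would contribute to the loss here; the model still reads the prompt tokens as context.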

First inspiration: [this blog post](https://huggingface.co/blog/gemma-peft) (original notebook [here](https://github.com/huggingface/notebooks/blob/main/peft/gemma_7b_english_quotes.ipynb)), with additional bits from [this](https://www.datacamp.com/tutorial/fine-tuning-qwen3) and [this](https://www.datacamp.com/tutorial/fine-tune-gemma-3) (and also ChatGPT).

### PEFT

For more on these techniques, see:
- the ['Quicktour'](https://huggingface.co/docs/peft/main/en/quicktour), and
- ['Configurations and models'](https://huggingface.co/docs/peft/main/en/tutorial/peft_model_config), as well as
- ['Integrations'](https://huggingface.co/docs/peft/main/en/tutorial/peft_integrations).

### LoRA

**LoRA** stands for "[Low-Rank Adaptation (of Large Language Models)](https://arxiv.org/abs/2106.09685)". This technique leverages linear algebra: instead of updating a large weight matrix directly, it keeps that matrix frozen and trains two much smaller (low-rank) matrices whose product is added to it. The result can later be merged back into the original weights, allowing users to train and share only "patches" (in HF they are called _adapters_). As often happens in this field, this sparked a wave of interest, leading to many different techniques (see [here](https://huggingface.co/docs/peft/main/en/conceptual_guides/adapter) for some of them).

<!-- ![LoRA illustration](images/lora_diagram.png) -->
![LoRA illustration](https://raw.githubusercontent.com/jchwenger/DMLCP/main/notebooks/images/lora_diagram.png)

([source](https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model))
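
As a rough sketch of the low-rank idea (not the PEFT library's implementation): a frozen weight matrix `W` of shape `(d_out, d_in)` is complemented by two small trainable matrices `B` of shape `(d_out, r)` and `A` of shape `(r, d_in)`, and the merged weight is `W + (alpha / r) * B @ A`. The toy matrices and the 4096-dimensional layer below are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(W, A, B, alpha, r):
    """Return the merged weight W + (alpha / r) * B @ A."""
    delta = matmul(B, A)  # shape (d_out, d_in), rank at most r
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def count_params(d_out, d_in, r):
    """Number of values in the full matrix vs. in the two LoRA factors."""
    return d_out * d_in, r * (d_out + d_in)


# toy frozen weight with d_out=3, d_in=4
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [3.0, 0.0, 1.0, 0.0]]
r = 2
A = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # (r, d_in), random in practice
B = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]          # (d_out, r), initialised to zero

merged = lora_merge(W, A, B, alpha=16, r=r)
print(merged == W)  # B is zero, so the adapter changes nothing before training

# hypothetical 4096 x 4096 projection with rank 8
full, lora = count_params(d_out=4096, d_in=4096, r=8)
print(full, lora)
```

With these assumed sizes, the two factors hold 65,536 values against 16,777,216 for the full matrix, which is where the memory savings during training come from.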

See also:
- [LoRA Conceptual Guide](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)
- [LoRA Methods](https://huggingface.co/docs/peft/main/en/task_guides/lora_based_methods)

### Supervised Fine-Tuning (& TRL)

- [TRL Quickstart](https://huggingface.co/docs/trl/en/quickstart)
- [Dataset formats](https://huggingface.co/docs/trl/en/dataset_formats)
- [`SFTTrainer` & `SFTConfig` references](https://huggingface.co/docs/trl/en/sft_trainer)

## Install & Workflow

In [None]:
import sys
if 'google.colab' in sys.modules:
    !pip install -Uq bitsandbytes trl

#### Drive

If you need to load/save to your drive:

```python
import os
import sys

if "google.colab" in sys.modules:
    from google.colab import drive
    drive.mount("/content/drive/")
    # change to your working directory on the drive (adapt the path to your setup)
    os.chdir("drive/My Drive/gold/IS53055B-DMLCP/DMLCP")
```

#### Huggingface login

For some models and datasets, and if you want to push your model to HF (same as GitHub, but for models) you need to be logged into your HF account.

For that, you need to create an account [here](https://huggingface.co/) and then go to ['/settings/tokens'](https://huggingface.co/settings/tokens) to create an access token.

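```python
import pathlib
from huggingface_hub import notebook_login

# default location of the saved token (unless the HF_HOME environment variable is set)
token_path = pathlib.Path.home() / ".cache" / "huggingface" / "token"
if not token_path.exists():
    notebook_login()
```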

## Imports

In [None]:
import os
import gc
import time

import torch

# Get cpu, gpu or mps device for training.
# See: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html#creating-models
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

# transformers
from transformers import AutoTokenizer
from transformers import GenerationConfig
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM

# datasets
from datasets import load_dataset

# parameter-efficient fine-tuning
from peft import LoraConfig
from peft import get_peft_model

# TRL: Hugging Face's training library, built on top of transformers
from trl import SFTConfig
from trl import SFTTrainer

### Printing Utils

In [None]:
# The textwrap module automatically formats text for you
import textwrap

# many more options, see them with textwrap.TextWrapper?
tw = textwrap.TextWrapper(
    # the formatted width we want
    width=79,
    # this will keep whitespace & line breaks in the original text
    replace_whitespace=False
)

def wrap_print(s):
    """Format text into Textwrapped lines and print it"""
    print("\n".join(tw.wrap(s)))

#### Model Loading & Config

We will fine-tune [Google's Gemma](https://huggingface.co/collections/google/gemma-3-release), but you can have a look at [other available recent models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

Note that model names ending in "-it" are "instruction-tuned" (i.e. chatbots that follow instructions), whereas "-pt" means "pre-trained": those are the base models (as are the models with no suffix).

In [None]:
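# thanks to these techniques you can fine-tune these on a 15GB T4, but you
# may still encounter memory issues (depending on dataset, batch size, etc.)

# NOTE: `MODEL_ID` is only defined in the "Load Model" section further down:
# define it (pointing to a Gemma checkpoint) before running this cell
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)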

# this time we quantize in 4-bits
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map={"":0},
)

# a LoRA config like this one is passed to the trainer (see `peft_config`
# further down), which applies it for us
lora_config = LoraConfig(
    # rank, aka LoRA attention dimension
    r=8,
    # `target_modules` describes the matrices to be reduced (which ones should be
    # chosen depends on the architecture, rule of thumb: follow what HF does...)
    target_modules=[
        "q_proj", "o_proj", "k_proj", "v_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
)
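
To get a feel for why 4-bit quantization helps on a 15GB GPU, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is only a sketch: the 4-billion parameter count is an assumption, and the estimate ignores quantization constants, activations, gradients and optimizer state, all of which add to the real footprint.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 4e9  # assumption: a model with roughly 4 billion parameters

for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB in float32 to about 2 GB in 4 bits, leaving room for the LoRA parameters and the training overhead.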

#### Inference

Let's test our model before training.

In [None]:
text = "Quote: Imagination"

inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

## Load Model


We use a member of the [`Qwen3` family](https://huggingface.co/collections/Qwen/qwen3) (you could also try [`Gemma3`](https://huggingface.co/collections/google/gemma-3-release)).

In the name, `-base` means it has not been fine-tuned to follow a chat template (same as `-pt` in Gemma).


In [None]:
MODEL_ID = "qwen/qwen3-4b-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# remove annoying warning
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
)

In [None]:
# encode context the generation is conditioned on
inputs = tokenizer("Quote:", return_tensors="pt").to(device)

# generate text (use a GenerationConfig or arguments to change the defaults)
output = model.generate(**inputs, max_new_tokens=100)

# decode back from tokens to text
text = tokenizer.decode(output[0], skip_special_tokens=True)

wrap_print(text)

## Load the dataset

See the dataset page [here](https://huggingface.co/datasets/Abirate/english_quotes).

For more on the `datasets` library, see the [`Quickstart`](https://huggingface.co/docs/datasets/en/quickstart) and the tutorials (on the left-hand side bar).

[Here](https://huggingface.co/datasets?modality=modality:text&format=format:text&sort=trending) is where you would search for other ready-made datasets.

In [None]:
DATASET_ID = "Abirate/english_quotes"
dataset = load_dataset(DATASET_ID, split="train")

In [None]:
print(dataset)

In [None]:
# print the first n examples
n = 2
for i, d in enumerate(dataset):
    if i == n:
        break
    print(d.keys())
    print()
    print(d["quote"])
    print(d["author"])
    print(d["tags"])
    print()
    print("---")

## Fine-Tuning

The main point of using `SFTConfig` and `SFTTrainer` instead of the regular `TrainingArguments` and `Trainer` is to work at a higher level (_supposedly_ simpler), and also to let you define a `formatting_func`, which will shape the behaviour of your model. Making your model follow a certain format tells you how the generation is likely to go, which helps you write code around it (for instance, you have a better idea of when to cut off generation once the end of an answer is reached).

In [None]:
def formatting_func(example):
    # change this freely (example['quote'] already has “” inside)
    formatted_str  = f"Quote:\n\n{example['quote']}\n"
    formatted_str += f"– by {example['author']}\n"
    formatted_str += f"(themes: {', '.join(example['tags'])})\n---"
    return formatted_str

# our configuration: which matrices to replace
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],  # Target modules for LoRA
    task_type="CAUSAL_LM",
)

# our final configuration
sft_config = SFTConfig(
    # output_dir="./qwen3-4b-base.abirate-quotes",
    # save_steps=50,
    # push_to_hub=True,
    per_device_train_batch_size=5,
    gradient_accumulation_steps=2,
    # num_train_epochs=1,
    # you can also train for less than one epoch!
    max_steps=40,
    warmup_steps=10,
    learning_rate=2e-4,
    logging_steps=10,
    bf16=True,
    # this can potentially be tweaked: longer sequences take up more memory
    max_length=512,
)

# our trainer object
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_func,
    peft_config=peft_config,
    args=sft_config,
)
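
To check what the trainer will actually be fed, one can call the formatting function on a hand-made record. The sketch below repeats the same template so that it runs on its own; the record itself is made up and only mimics the dataset's `quote`, `author` and `tags` fields.

```python
def formatting_func(example):
    # same template as in the training cell
    formatted_str  = f"Quote:\n\n{example['quote']}\n"
    formatted_str += f"– by {example['author']}\n"
    formatted_str += f"(themes: {', '.join(example['tags'])})\n---"
    return formatted_str


# a made-up record with the same fields as the dataset
example = {
    "quote": "“Blah blah blah”",
    "author": "Mildred McBlah",
    "tags": ["language", "nonsense"],
}

formatted = formatting_func(example)
print(formatted)
```

Every training example is turned into one such string, so the closing `---` line is something the model should learn to produce at the end of each quote.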

In [None]:
# we only train on a tiny fraction of the parameters!
trainer.model.print_trainable_parameters()

We train for only a few steps (very much a trial and error process: too little and you don't see a difference, too much and your model loses its prior knowledge).
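
How much data does such a short run actually see? The arithmetic below uses the values from the configuration in this notebook; the single GPU and the dataset size passed to `fraction_of_epoch` are assumptions (in the notebook, `len(dataset)` would give the real size).

```python
per_device_train_batch_size = 5
gradient_accumulation_steps = 2
max_steps = 40
num_devices = 1  # assumption: training on a single GPU

# examples consumed per optimizer step, and over the whole run
examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = examples_per_step * max_steps
print(examples_per_step, examples_seen)


def fraction_of_epoch(examples_seen, dataset_size):
    """Share of the dataset covered by the run (1.0 means one full epoch)."""
    return examples_seen / dataset_size


# hypothetical dataset size, for illustration only
print(fraction_of_epoch(examples_seen, dataset_size=2000))
```

Raising `max_steps` or the batch size increases the number of examples seen, which is one of the knobs to turn in the trial-and-error process described above.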

In [None]:
history = trainer.train()

The one thing to note is that the model now follows the format produced by `formatting_func`:

```
Quote:

“Blah blah blah”
– by Mildred McBlah
(themes: language, blather, nonsense, blah)
---
```

In [None]:
trainer.model.eval()

# encode context the generation is conditioned on
inputs = tokenizer("Quote:", return_tensors="pt").to(device)

# start from the model's default generation settings, then tweak the sampling
generation_config = GenerationConfig.from_pretrained(MODEL_ID)
generation_config.do_sample = True
generation_config.temperature = 0.9
generation_config.top_k = 20
generation_config.top_p = 0.95
generation_config.max_new_tokens = 100

# the trained model is available under `trainer.model`!
output = trainer.model.generate(**inputs, generation_config=generation_config)

# decode back from tokens to text
text = tokenizer.decode(output[0], skip_special_tokens=True)

print(text)
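
Because each training example ends with a `---` line, the generated text can be cut at the first such delimiter to keep a single quote. This is a minimal sketch in plain Python; the helper `extract_first_quote` and the sample string are hypothetical, and a quote that itself contains `---` would be cut too early.

```python
def extract_first_quote(generated, delimiter="---"):
    """Keep only the text before the first delimiter, without trailing whitespace."""
    head, _, _ = generated.partition(delimiter)
    return head.rstrip()


# made-up model output: one complete quote, then the start of a second one
sample = (
    "Quote:\n\n“Blah blah blah”\n– by Mildred McBlah\n(themes: language)\n---\n"
    "Quote:\n\n“More bl"
)

first = extract_first_quote(sample)
print(first)
```

If the delimiter never appears (for instance because `max_new_tokens` was reached first), the function simply returns the whole text.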

## Saving your model

There are two ways of saving your model. The first uses the `trainer` (this won't save the weights of the base model, since it's assumed that you will download those again from the HF Hub if needed):

```python
HF_USERNAME = "jchwenger"
# using some string manipulation to create the name
MODEL_DIR = f"{HF_USERNAME}/{MODEL_ID.split('/')[1]}.{DATASET_ID.replace('/', '_').lower()}"
trainer.save_model(MODEL_DIR)
# trainer.push_to_hub()
```

Or you can save the individual components (entire model):

```python
model.save_pretrained(MODEL_DIR)
tokenizer.save_pretrained(MODEL_DIR)
# `peft_config` is the LoRA configuration that was passed to the trainer
peft_config.save_pretrained(MODEL_DIR)
```