# How to Fine-Tune LLMs with LoRA Adapters using Hugging Face TRL

This notebook demonstrates how to efficiently fine-tune large language models using LoRA (Low-Rank Adaptation) adapters. LoRA is a parameter-efficient fine-tuning technique that:
- Freezes the pre-trained model weights
- Adds small trainable rank decomposition matrices to attention layers
- Typically reduces trainable parameters by 90% or more
- Maintains model performance while being memory efficient
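
To make the parameter savings concrete, the sketch below counts trainable parameters for a single linear layer under full fine-tuning versus LoRA. The layer shape and rank are illustrative assumptions, not values taken from a particular model, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one linear layer."""
    full = d_out * d_in                 # every entry of W is trainable
    lora = rank * d_in + d_out * rank   # A is (rank x d_in), B is (d_out x rank)
    return full, lora


# Hypothetical 1024 x 1024 projection with rank 8
full, lora = lora_param_counts(d_out=1024, d_in=1024, rank=8)
reduction = 1 - lora / full
print(f"full: {full:,}  lora: {lora:,}  reduction: {reduction:.1%}")
```

For this hypothetical layer the adapter trains roughly 1.6% of the parameters that full fine-tuning would. The saving shrinks as the rank grows and as more modules are targeted.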

We'll cover:
1. Setup development environment and LoRA configuration
2. Create and prepare the dataset for adapter training
3. Fine-tune using `trl` and `SFTTrainer` with LoRA adapters
4. Test the model and merge adapters (optional)


## 1. Setup development environment

Our first step is to install PyTorch and the Hugging Face libraries, including `trl`, `transformers`, and `datasets`. If you haven't heard of `trl` yet, don't worry: it is a library built on top of `transformers` and `datasets` that makes it easier to fine-tune, apply RLHF to, and align open LLMs.


In [1]:
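# Install the requirements in Google Colab
# (peft, wandb and bitsandbytes are needed for LoRA, logging and the 8-bit optimizer used below)
# !pip install transformers datasets trl peft wandb bitsandbytes huggingface_hub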

# Authenticate to Hugging Face

from huggingface_hub import login
import wandb


# login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

In [2]:
# Import necessary libraries
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
# device = "cuda:1"


from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

print(torch.cuda.is_available())

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

True


## 2. Load the dataset

In [3]:
# Quick demo: run a small IMDB slice through SFTTrainer to inspect the tokenized data
dataset = load_dataset("imdb", split="train[:100]")

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m"
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args = SFTConfig(
      dataset_text_field="text",
      output_dir="tmp",
      max_seq_length=512,
    ),
)

print(trainer.train_dataset[0])

print(
    tokenizer.decode(trainer.train_dataset[0]["input_ids"])
)



{'input_ids': [2, 100, 16425, 38, 3326, 230, 42338, 18024, 12, 975, 25322, 4581, 31, 127, 569, 1400, 142, 9, 70, 5, 6170, 14, 7501, 24, 77, 24, 21, 78, 703, 11, 13025, 4, 38, 67, 1317, 14, 23, 78, 24, 21, 5942, 30, 121, 4, 104, 4, 10102, 114, 24, 655, 1381, 7, 2914, 42, 247, 6, 3891, 145, 10, 2378, 9, 3541, 1687, 22, 10800, 34689, 113, 38, 269, 56, 7, 192, 42, 13, 2185, 49069, 3809, 1589, 49007, 3809, 48709, 133, 6197, 16, 14889, 198, 10, 664, 9004, 4149, 1294, 1440, 27450, 54, 1072, 7, 1532, 960, 79, 64, 59, 301, 4, 96, 1989, 79, 1072, 7, 1056, 69, 39879, 2485, 7, 442, 103, 2345, 9, 6717, 15, 99, 5, 674, 25517, 242, 802, 59, 1402, 559, 743, 215, 25, 5, 5490, 1771, 8, 1015, 743, 11, 5, 315, 532, 4, 96, 227, 1996, 3770, 8, 7945, 3069, 38839, 9, 18850, 59, 49, 5086, 15, 2302, 6, 79, 34, 2099, 19, 69, 4149, 3254, 6, 18295, 6, 8, 2997, 604, 49069, 3809, 1589, 49007, 3809, 48709, 2264, 10469, 162, 59, 38, 3326, 230, 42338, 18024, 12, 975, 25322, 4581, 16, 14, 843, 107, 536, 6, 42, 21, 1687,

In [4]:
# Load a sample dataset
from datasets import load_dataset

# TODO: define your dataset and config using the path and name parameters
# dataset = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations", split="train")
# dataset = load_dataset(path="prithivMLmods/Math-Solve", split="train[:25000]")

# def format_dataset(sample):
#     content = [
#         {"role": "system", "content": "You are helpful"},
#         {"role": "user", "content": sample["problem"]}, 
#         {"role": "assistant", "content": sample["solution"]}
#     ]
#     return {"messages":content}

# dataset = dataset.map(format_dataset)


# dataset = load_dataset(path="Qurtana/medical-o1-reasoning-SFT-orpo",  split="train")
# def format(sample):
#     content = [
#         {"role": "system", "content": "You are helpful assistant"},
#         {"role": "user", "content": sample["prompt"]}, 
#         {"role": "assistant", "content": sample["accepted"]}
#     ]
#     return {"messages":content}

# dataset=dataset.map(format)

# dataset_train = dataset.select(range(20000))
# dataset_test = dataset.select(range(dataset.shape[0]-2000, dataset.shape[0]))

# print(dataset)


dataset = load_dataset(path="openai/gsm8k", name="main")

def format_sample(sample):
    # Convert a GSM8K row into the chat "messages" format expected by SFTTrainer
    content = [
        {"role": "system", "content": "You are helpful assistant"},
        {"role": "user", "content": sample["question"]},
        {"role": "assistant", "content": sample["answer"]}
    ]
    return {"messages": content}

dataset = dataset.map(format_sample)

dataset_train = dataset["train"]
dataset_test = dataset["test"]

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'messages'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer', 'messages'],
        num_rows: 1319
    })
})


## 3. Fine-tune LLM using `trl` and the `SFTTrainer` with LoRA

The [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl` provides integration with LoRA adapters through the [PEFT](https://huggingface.co/docs/peft/en/index) library. Key advantages of this setup include:

1. **Memory Efficiency**: 
   - Gradients and optimizer states are kept only for the adapter parameters
   - Base model weights remain frozen and can be loaded in lower precision
   - Enables fine-tuning of large models on consumer GPUs

2. **Training Features**:
   - Native PEFT/LoRA integration with minimal setup
   - Support for QLoRA (Quantized LoRA) for even better memory efficiency

3. **Adapter Management**:
   - Adapter weight saving during checkpoints
   - Features to merge adapters back into base model

We'll use plain LoRA in our example. QLoRA, which combines LoRA with 4-bit quantization, can reduce memory usage further. The setup requires just a few configuration steps:
1. Define the LoRA configuration (rank, alpha, dropout)
2. Create the SFTTrainer with PEFT config
3. Train and save the adapter weights


In [5]:
# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
).to(device)  # a single placement step is enough; device_map is not needed here
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Set the name under which the fine-tune is saved and/or uploaded
finetune_name = "SmolLM2-FT-gsm8k-sft-peft"
finetune_tags = ["smol-course", "module_1"]

print(model.device)

cuda:0
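
`setup_chat_format` configures the tokenizer with a ChatML-style template, which wraps each message in `<|im_start|>` and `<|im_end|>` markers. The sketch below is a simplified stand-in for that template to show what the model sees during training. `render_chatml` is a hypothetical helper, not part of `trl`, and the example messages are made up.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML-style text."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        text += "<|im_start|>assistant\n"
    return text


example = [
    {"role": "system", "content": "You are helpful assistant"},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(render_chatml(example, add_generation_prompt=True))
```

In practice you would call `tokenizer.apply_chat_template(...)` rather than building the string by hand; the sketch only illustrates the layout.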


The `SFTTrainer` integrates natively with `peft`, which makes it easy to fine-tune LLMs efficiently using, for example, LoRA. We only need to create a `LoraConfig` and pass it to the trainer.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Define LoRA parameters for finetuning</h2>
    <p>Take a dataset from the Hugging Face hub and finetune a model on it. </p> 
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Use the general parameters for an arbitrary finetune</p>
    <p>🐕 Adjust the parameters and review in Weights & Biases.</p>
    <p>🦁 Adjust the parameters and show change in inference results.</p>
</div>

In [6]:
from peft import LoraConfig

# TODO: Configure LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 48
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 64
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.01

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias handling: "none" keeps all bias terms frozen during training
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
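
To show how `r` and `lora_alpha` interact, the sketch below implements the LoRA update for a single layer in plain Python: the adapted output is `W x + (lora_alpha / r) * B (A x)`. The tiny matrices are made-up values and the helper names are hypothetical; `peft` handles this internally. Because `B` starts at zero, the adapter initially leaves the base model's output unchanged.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) for one linear layer."""
    scaling = lora_alpha / r
    base = matvec(W, x)                 # frozen base projection
    update = matvec(B, matvec(A, x))    # low-rank path: down-project, then up-project
    return [b + scaling * u for b, u in zip(base, update)]


# Toy 2x2 layer with rank r=1 (illustrative values only)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]            # shape (r, d_in)
B_init = [[0.0], [0.0]]      # shape (d_out, r), zero-initialised
B_trained = [[1.0], [0.0]]   # pretend training moved B away from zero
x = [2.0, 4.0]

print(lora_forward(W, A, B_init, x, lora_alpha=2, r=1))     # identical to W x
print(lora_forward(W, A, B_trained, x, lora_alpha=2, r=1))  # shifted by the adapter
```

With the values used in this notebook (`lora_alpha=64`, `r=48`) the scaling factor is 64/48, or about 1.33.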


Before we can start training, we need to define the hyperparameters we want to use. `SFTConfig` extends `TrainingArguments` with SFT-specific options.

In [7]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
max_seq_length = 1512  # max sequence length for model and packing of the dataset

args = SFTConfig(
    # Small learning rate to prevent catastrophic forgetting
    learning_rate=1e-5,
    # Cosine learning rate decay over training
    lr_scheduler_type="cosine",
    # Maximum sequence length (prompt + completion) in tokens
    max_seq_length=max_seq_length,
    # # Maximum length for input prompts
    # max_prompt_length=512,
    # Batch size for training
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    # Helps with training stability by accumulating gradients before updating
    gradient_accumulation_steps=4,
    # Memory-efficient optimizer for CUDA, falls back to adamw_torch for CPU/MPS
    optim="paged_adamw_8bit" if device == "cuda" else "adamw_torch",
    # Number of training epochs
    num_train_epochs=4,
    # When to run evaluation
    evaluation_strategy="steps",
    # Evaluate every 20% of training
    eval_steps=0.2,
    # Log metrics every step
    logging_steps=1,
    # Gradual learning rate warmup
    warmup_steps=100,
    # Log metrics to Weights & Biases
    report_to="wandb",
    # Where to save model/checkpoints
    output_dir=finetune_name,
    # Enable MPS (Metal Performance Shaders) if available
    use_mps_device=device == "mps",
    hub_model_id=finetune_name,
    # Use bfloat16 precision for faster training
    bf16=True,
)
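
It can help to sanity-check how these settings translate into optimizer steps. The sketch below does the arithmetic for the GSM8K training split loaded earlier (7,473 rows), assuming a single device. The exact rounding inside `transformers` may differ slightly between versions, so treat the numbers as an estimate.

```python
import math

num_train_rows = 7473            # size of the GSM8K train split
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 4
eval_percent = 20                # eval_steps=0.2 means every 20% of training

# Samples consumed per optimizer update (single device assumed)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

steps_per_epoch = math.ceil(num_train_rows / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
eval_every = math.ceil(total_steps * eval_percent / 100)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {steps_per_epoch}, total: {total_steps}")
print(f"evaluation every {eval_every} steps")
```

Under these assumptions each update sees 8 samples and evaluation runs every 748 steps.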



We now have every building block we need to create our `SFTTrainer` and start training our model.

In [8]:

max_seq_length = 1512  # max sequence length for model and packing of the dataset

# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    peft_config=peft_config,  # LoRA configuration
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    processing_class=tokenizer,
)



In [9]:
print(trainer.train_dataset[1])

print(
    tokenizer.decode(trainer.train_dataset[1]["input_ids"])
)

{'input_ids': [1, 9690, 198, 2683, 359, 5356, 11173, 2, 198, 1, 4093, 198, 71, 1059, 38668, 1885, 33, 34, 354, 5353, 327, 3383, 672, 9584, 30, 718, 15955, 28, 1041, 915, 1250, 216, 37, 32, 3487, 282, 3383, 672, 9584, 30, 1073, 1083, 1250, 1041, 5301, 47, 2, 198, 1, 520, 9531, 198, 71, 1059, 38668, 216, 33, 34, 31, 38, 32, 446, 1885, 33691, 33, 34, 31, 38, 32, 45, 32, 30, 34, 7791, 32, 30, 34, 567, 8427, 30, 198, 23830, 216, 37, 32, 3487, 28, 1041, 11420, 216, 32, 30, 34, 1792, 216, 37, 32, 446, 1885, 33691, 32, 30, 34, 26, 37, 32, 45, 33, 32, 7791, 33, 32, 30, 198, 1229, 216, 33, 32, 2, 198], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
<|im_start|>system
Y

In [10]:
# print(trainer.train_dataset[0])
# print("\n\n")
# print(tokenizer.decode(trainer.train_dataset[0]["input_ids"]))
# print("\n\n")
# print(tokenizer.decode(trainer.train_dataset[0]["labels"]))


Start training by calling the `train()` method on our `SFTTrainer` instance. This starts the training loop and trains the model for 4 epochs. Since we are using a PEFT method, only the adapter weights are saved, not the full model.

In [11]:
# Start training; checkpoints are written to the output directory
trainer.train()

# Save the final adapter weights and tokenizer to the output directory
trainer.save_model()

[34m[1mwandb[0m: Currently logged in as: [33mmatteob-90-hotmail-it[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 748  | 1.1914        | 1.209106        |
| 1496 | 1.0439        | 1.159239        |
| 2244 | 1.0958        | 1.138477        |
| 2992 | 1.0894        | 1.131238        |


Training with Flash Attention for 3 epochs on a dataset of 15k samples took 4:14:36 on a `g5.2xlarge`. The instance costs `$1.21/h`, which brings the total cost to only ~`$5.3`.



### Merge LoRA Adapter into the Original Model

When using LoRA, we train only the adapter weights while keeping the base model frozen. During training, we save only these lightweight adapter weights (often only megabytes in size, depending on the rank and target modules) rather than a full copy of the model. However, for deployment, you might want to merge the adapters back into the base model for:

1. **Simplified Deployment**: Single model file instead of base model + adapters
2. **Inference Speed**: No adapter computation overhead
3. **Framework Compatibility**: Better compatibility with serving frameworks
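
Merging works because the LoRA update is itself a matrix: folding `(lora_alpha / r) * B A` into the frozen weight gives a single matrix that produces the same outputs. The plain-Python sketch below checks this on a toy layer. The values and helper names are illustrative assumptions; `merge_and_unload()` performs the equivalent operation on the real model.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B A as a new matrix."""
    scaling = lora_alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight
A = [[1.0, -1.0]]              # shape (r, d_in) with r = 1
B = [[0.5], [2.0]]             # shape (d_out, r)
x = [[3.0], [1.0]]             # input as a column vector
lora_alpha, r = 2, 1
scaling = lora_alpha / r

merged = merge_lora(W, A, B, lora_alpha, r)

# Output computed via the separate base and adapter paths
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + scaling * a[0]] for b, a in zip(matmul(W, x), adapter_out)]

print(merged)
print(matmul(merged, x) == separate)
```

After merging, inference needs only one matrix multiplication per layer, which is where the "no adapter computation overhead" benefit comes from.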


In [12]:
from peft import AutoPeftModelForCausalLM


# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    args.output_dir, safe_serialization=True, max_shard_size="2GB"
)

## 4. Test Model and run Inference

After training is done, we want to test our model. We will run a handful of prompts through the model in a simple loop and inspect the responses.
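
For a quantitative check on GSM8K, note that reference answers end with a line of the form `#### <number>`, so one simple way to score generations is exact match on that final number. The sketch below is a hedged illustration: `extract_final_answer` and `exact_match_accuracy` are hypothetical helpers, and the example strings are made up rather than taken from a real run.

```python
import re


def extract_final_answer(text):
    """Return the number after the last '####' marker, or None if absent."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final answer equals the reference's."""
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = extract_final_answer(prediction)
        if predicted is not None and predicted == extract_final_answer(reference):
            correct += 1
    return correct / len(references) if references else 0.0


# Made-up strings for illustration
references = ["48 / 2 = 24\n48 + 24 = 72\n#### 72", "5 * 2 = 10\n#### 10"]
predictions = ["She sold 72 clips in total.\n#### 72", "The answer is 12.\n#### 12"]
print(exact_match_accuracy(predictions, references))
```

To use it, generate a completion for each question in `dataset_test` and pass the generated texts together with the `answer` column.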



<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Bonus Exercise: Load LoRA Adapter</h2>
    <p>Use what you learnt from the example notebook to load your trained LoRA adapter for inference.</p> 
</div>

In [13]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

In [14]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load the base model together with the trained PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(finetune_name)
model = AutoPeftModelForCausalLM.from_pretrained(
    finetune_name, device_map="auto", torch_dtype=torch.float16
)
pipe = pipeline(
    "text-generation",
    model=model,  # use the adapter model loaded above; device_map already placed it
    tokenizer=tokenizer,
    max_new_tokens=100,
    num_beams=10,
    num_return_sequences=1,
    # Note: a value of 1 forbids repeating any token, which heavily constrains the output
    no_repeat_ngram_size=1,
)

Device set to use cuda
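
The generation arguments above strongly shape the output. In particular, `no_repeat_ngram_size=1` forbids any token from appearing twice, which makes it hard for a model to produce fluent text or code. The sketch below is a simplified illustration of n-gram blocking (not the `transformers` implementation) that shows which next tokens become banned; `banned_next_tokens` is a hypothetical helper.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix that the next token would extend
    prefix = tuple(generated[len(generated) - ngram_size + 1:])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(tokens, ngram_size=1))  # every token seen so far
print(banned_next_tokens(tokens, ngram_size=2))  # only tokens that already followed "the"
```

A larger value such as 3, or removing the argument entirely, leaves the model far more freedom; which setting works best is something to verify empirically.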


Let's test some sample prompts and see how the model performs.

In [15]:
prompts = [
    "What is the capital of Germany? Explain why thats the case and if it was different in the past?",
    "Write a Python function to calculate the factorial of a number.",
    "A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?",
    "What is the difference between a fruit and a vegetable? Give examples of each.",
]


def test_inference(prompt):
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()


for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)

    prompt:
What is the capital of Germany? Explain why thats the case and if it was different in the past?
    response:
In 1945, there were more than a million refugees from Nazi-occupied Europe. How many people did they take with them on their way to America after being liberated by Allied forces during World War II (WWII)? In what year do you think this number would have been higher or lower for those who had already taken up permanent residence at Ellis Island before WW2 began taking place as an occupation force took over New York City's East River which then became its own separate city called Manhattan
--------------------------------------------------
    prompt:
Write a Python function to calculate the factorial of a number.
    response:
#### 1234567890 is not an integer because it has more than one digit in its first two digits, so we can’t add them all up and find out how many numbers there are that have this as their base-ten numeral representation (because they would be t

In [16]:
prompts = ["Write a haiku"]
for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)

    prompt:
Write a haiku
    response:
ASSISTANT SUPERSTITIONARY GUARDIAN 2018: I am an assistant super-supporter who helps people with their daily tasks. My job is to make sure that everyone has enough food and water, so they don’t get sick or die because of the cold weather in New York City (NYC). If you have any questions about my work please email me at [email protected] You can also find out more by reading this blog post
--------------------------------------------------


    prompt:
What is the capital of Germany? Explain why thats the case and if it was different in the past?
    response:
To answer this question, we need to look at some historical facts. In 1870 there were two German states: Baden-Wuerttemberg (German for "Bavaria with Wurttenberg" or Bavarian Palatinate), which had a long history as an independent principality before being annexed by Prussia during World War II; Saxony/Saxonschweiz ("Saxon Schleswig"), whose territory came under Prussian control after gaining its independence from Denmark through negotiations between Otto von Bismarck's Chancellor Wilhelmina zu Hohenzollern Wilhelm I & King Frederick William IV on August 26th - September  3rd. The former state has since been reincorporated into Süddeutsche Bundesrepubliken / Deutsches Volkskreuzgefahrzeug [DWDG], also called Deutschland über all diese Zeit ["Germany Under All Time
--------------------------------------------------
    prompt:
Write a Python function to calculate the factorial of a number.
    response:
Let $n$ be an integer greater than 1, and let $\sum_{k=0}^{\infty} n = \frac{3}{2!}$ for some non-zero prime power $(p_i)$. We can rewrite this expression as follows: (Note that we are working with positive integers here.) The sum is computed by adding up each term in its own row or column until it reaches zero; if there were no terms between two consecutive numbers then they would have been added together at least once before being subtracted from their respective rows/colonelstheoreticalexpansionofthefactorialfunctioniscomputablebymultiplying themtogetherandaddingone moreterm). This gives us$\left(\dfrac{\pi^4}{\sqrt[5]{9}}+\cdots+7^{6}\right)$for all primes not exceeding one hundred thousand but excluding those whose only common divisor divides evenly into foursquare root(which means
--------------------------------------------------
    prompt:
A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?
    response:
Let's start by thinking about what we are looking for in this problem: How much space is there between two consecutive points on an equilateral triangle (a rectangle)? We know that it can be written as $\triangle ABC$, where $AB = \frac{4}{3} A + B$. So if I add up all three sides along with their corresponding angles ($\overrightarrow {ABC}$$=60^\circ) then my equation becomes $(AD+BC)\cos(\theta)+(AC+\Delta)=897.\]I have already seen some similar problems involving right triangles but they don't work out so well when using area or perimeter formulas! Let me try one more example from real-life situations before moving onto our main task - finding total cost per square foot without considering any extra costs like utility bills etc., which might come into play at later stages depending upon various factors mentioned earlier during your math practice session...[SOLVED](https://www
--------------------------------------------------
    prompt:
What is the difference between a fruit and a vegetable? Give examples of each.
    response:
The word "fruit" refers to something that grows on trees, such as an apple or banana; it can also be used for any edible plant with seeds inside (like tomatoes). Vegetables are different from fruits because they don't produce their own food through photosynthesis but instead grow by eating other living things called plants! Some common vegetables include broccoli florets (*), carrots*, bell peppers**, zucchinis**. You might have seen them growing in your garden when you were younger - just think about all those delicious little green ones popping up every time someone goes outside during hot summer days... That's what we call'vegetables'. Now let me explain how these two categories differ:

1️⃣ **Fruits**: Fruits contain natural sugars like watermelons *and* grapes which give us energy while satisfying our sweet tooth without adding extra calories.* They're usually eaten fresh rather than cooked down into ice cream sundaes at birthday parties.**[Example:* An orange contains 90
--------------------------------------------------