# Lab | QLoRA Tuning using PEFT from Hugging Face

<!-- ### Introduction to Quantization & Fine-tune a Quantized Model -->

**Note:** This is more or less the same notebook you saw in the previous lesson, and that is fine. This is an LLM fine-tuning lab: in class we used one set of datasets and models, and in the labs you are required to change the LLMs and the datasets, including the pre-processing pipelines.

# Brief Introduction to Quantization
The main idea of quantization is simple: Reduce the precision of floating-point numbers, which normally occupy 32 bits, to integers of 8 or even 4 bits.

This reduction occurs in the model’s parameters, specifically in the weights of the neural layers, and in the activation values that flow through the model’s layers.

This means that we not only achieve an improvement in the model’s storage size and memory consumption but also greater agility in its calculations.

Naturally, there is a loss of precision, but particularly in the case of 8-bit quantization, this loss is minimal.



## Let's see an example of a quantized number.

In reality, what I want to examine is the precision loss that occurs when transitioning from a 32-bit number to a quantized 8/4-bit number and then returning to its original 32-bit value.

First, I'm going to create a function to quantize and another to unquantize.

In [None]:
%pip install -q bitsandbytes==0.43.1
%pip install -q --upgrade bitsandbytes
%pip install -q triton==3.5.0

# 2️⃣ Install TRL (required for SFTTrainer)
%pip install -q trl

In [None]:
#Importing necessary libraries
import numpy as np
import math
import matplotlib.pyplot as plt

In [None]:
#Functions to quantize and unquantize
def quantize(value, bits=4):
    quantized_value = np.round(value * (2**(bits - 1) - 1))
    return int(quantized_value)

def unquantize(quantized_value, bits=4):
    value = quantized_value / (2**(bits - 1) - 1)
    return float(value)

Quantized values:

In [None]:
quant_4 = quantize(0.622, 4)
print (quant_4)
quant_8 = quantize(0.622, 8)
print(quant_8)

Unquantized values:

In [None]:
unquant_4 = unquantize(quant_4, 4)
print(unquant_4)
unquant_8 = unquantize(quant_8, 8)
print(unquant_8)

If we consider that the original number was 0.622, it can be said that 8-bit quantization barely loses precision, and the loss from 4-bit quantization is manageable.
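To put numbers on "barely loses precision", here is a minimal sketch that measures the round-trip error for 0.622. It re-declares the two toy functions with plain Python `round` instead of NumPy so that it runs on its own; the `worst_case` bound (half a grid step) is my own addition and only holds for this toy scheme with inputs in [-1, 1].

```python
# Same toy scheme as above: map a value in [-1, 1] onto a signed integer grid.
def quantize(value, bits=4):
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    return int(round(value * levels))

def unquantize(quantized_value, bits=4):
    levels = 2 ** (bits - 1) - 1
    return quantized_value / levels

original = 0.622
errors = {}
for bits in (4, 8):
    restored = unquantize(quantize(original, bits), bits)
    errors[bits] = abs(original - restored)
    # Rounding can be off by at most half a grid step.
    worst_case = 0.5 / (2 ** (bits - 1) - 1)
    print(f"{bits}-bit: restored={restored:.5f} "
          f"error={errors[bits]:.5f} worst_case={worst_case:.5f}")
```

The 4-bit grid has only 7 positive levels, so the error is roughly three orders of magnitude larger than with 8 bits for this particular value.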

In [None]:
x = np.linspace(-1, 1, 50)
y = [math.cos(val) for val in x]


y_quant_8bit = np.array([quantize(val, bits=8) for val in y])
y_unquant_8bit = np.array([unquantize(val, bits=8) for val in y_quant_8bit])

y_quant_4bit = np.array([quantize(val, bits=4) for val in y])
y_unquant_4bit = np.array([unquantize(val, bits=4) for val in y_quant_4bit])


Let’s plot a curve with the unquantized values of a cosine.


In [None]:
plt.figure(figsize=(10, 12))

plt.subplot(4, 1, 1)
plt.plot(x, y, label="Original")
plt.plot(x, y_unquant_8bit, label="unquantized_8bit")
plt.plot(x, y_unquant_4bit, label="unquantized_4bit")
plt.legend()
plt.title("Quantized Curves Graph Comparison")
plt.grid(True)

As you can see, the difference between the 8-bit and the original values is minimal. However, we need to use 4-bit quantization if we want to load the 7B Model into a 16GB GPU without problems.
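A quick back-of-the-envelope calculation helps to see why the bit width matters for a 16GB GPU. This is only a rough sketch under my own assumptions: it counts the weights alone (no activations, gradients, optimizer state or framework overhead), takes "7B" as exactly 7 billion parameters, and uses 1 GB = 10^9 bytes. The helper name `weight_memory_gb` is hypothetical.

```python
# Weight-only memory estimate; real usage during fine-tuning is higher.
def weight_memory_gb(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # a "7B" model, taken as exactly 7 billion for illustration
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the 32-bit weights alone do not fit in 16GB, while the 4-bit version leaves most of the memory free for everything else that training needs.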


# QLoRA. Fine-tuning a 4-bit Quantized Model using LoRA.
We are going to use LoRA to fine-tune a model quantized to 4 bits.

## Load the PEFT and Datasets Libraries.

The PEFT library contains the Hugging Face implementation of different fine-tuning techniques, like LoRA Tuning.

Using the Datasets library we have access to a huge number of datasets.

In [None]:
%pip install -q accelerate==1.4.0
%pip install -q bitsandbytes==0.47.0 # pip install -U bitsandbytes>=0.46.1
%pip install -q trl==0.8.6
%pip install -q peft==0.10.0

I'm going to download the peft and Transformers libraries from their repositories on GitHub instead of using pip. This is not strictly necessary, but this way, you can get the newest versions of the libraries with support for newer models. If you want to check one of the latest models, you can use this trick.


In [None]:
#Install the latest versions of the peft & transformers libraries, recommended
#if you want to work with the most recent models
%pip install -q git+https://github.com/huggingface/peft.git
%pip install -q git+https://github.com/huggingface/transformers.git

From the Transformers library, we import the necessary classes to load the model and the tokenizer.

The notebook is ready to work with different models. I tested it with models from the Bloom family and Llama-3.

I recommend that you test different models.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

print("CUDA Available:", torch.cuda.is_available())

CUDA Available: True


## Load Model

In [2]:
#Use any model you want, if you want to do some fast test, just use the smallest one.

#model_name = "bigscience/bloomz-560m"
#model_name="bigscience/bloom-1b1"
#model_name = "bigscience/bloom-7b1"
#target_modules = ["query_key_value"]

model_name = "EleutherAI/pythia-410m"
target_modules = ["query_key_value"] #YOU MAY CHANGE THIS BASED ON YOUR MODEL

To load the model, we need a configuration class that specifies how we want the quantization to be performed. We’ll achieve this with the BitsAndBytesConfig from the Transformers library.

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 # Needs a GPU with bfloat16 support (e.g. L4)
)

We are specifying the use of 4-bit quantization and also enabling double quantization, which quantizes the quantization constants themselves to save additional memory.

For the bnb_4bit_quant_type parameter, I've used the recommended value in the paper [QLoRA: Efficient Finetuning of Quantized LLMs.](https://arxiv.org/abs/2305.14314)
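To see what double quantization buys, here is a small arithmetic sketch of the storage overhead of the quantization constants, in bits per model parameter. The block sizes (64 weights per constant, 256 constants per second-level constant) and the 8-bit storage of the constants are the values I recall from the QLoRA paper, so treat them as assumptions; this is not how bitsandbytes is implemented internally.

```python
# Overhead of storing quantization constants, in bits per model parameter.
block_size = 64          # weights that share one quantization constant (assumed)
second_block_size = 256  # constants that share one second-level constant (assumed)

# Without double quantization: one 32-bit constant per block of 64 weights.
plain_overhead = 32 / block_size

# With double quantization: each constant is stored in 8 bits, plus one
# 32-bit second-level constant per 256 first-level constants.
double_overhead = 8 / block_size + 32 / (block_size * second_block_size)

saved = plain_overhead - double_overhead
print(f"plain:  {plain_overhead:.3f} bits/param")
print(f"double: {double_overhead:.3f} bits/param")
print(f"saved:  {saved:.3f} bits/param")
```

The saving per parameter is small, but it is multiplied by every weight in the model.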

Now, we can go ahead and load the model.

In [4]:
device_map = {"": 0}
foundation_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config=bnb_config,
                    device_map=device_map,
                    use_cache = False)



Now we have the quantized version of the model in memory. You can try to load the unquantized version to see if it's possible.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

## Inference with the pre-trained model.
I'm going to do a test with the pre-trained model without fine-tuning, to see if something changes after the fine-tuning.

In [6]:
# Returns the token ids generated by the given model for the tokenized inputs.
def get_outputs(model, inputs, max_new_tokens=100):  # Play with the arguments as you see fit
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.5,  # Penalize tokens that were already generated.
        early_stopping=False,  # Only relevant for beam search; no effect on the default decoding used here.
        eos_token_id=tokenizer.eos_token_id,
    )
    return outputs
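The `repetition_penalty=1.5` argument is easier to reason about with a toy example. The sketch below follows the rule as I understand it from the Transformers repetition-penalty logits processor (divide positive scores by the penalty, multiply negative ones), written in plain Python on a made-up logits list. The function name `apply_repetition_penalty` is hypothetical and this is not the library code.

```python
def apply_repetition_penalty(logits, generated_token_ids, penalty=1.5):
    """Return a copy of logits where already-generated tokens are less likely."""
    adjusted = list(logits)
    for token_id in set(generated_token_ids):
        score = adjusted[token_id]
        # Positive scores shrink, negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = [3.0, -1.0, 0.5, 2.0]  # made-up scores for a 4-token vocabulary
penalized = apply_repetition_penalty(toy_logits, generated_token_ids=[0, 1])
print(penalized)
```

Tokens 0 and 1 were already generated, so both lose score, while tokens 2 and 3 are untouched.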

The dataset used for the fine-tuning contains instruction-response pairs in the Alpaca format.

I'm going to ask the pre-trained model to explain what overfitting is.

In [7]:
#Inference original model
input_sentences = tokenizer("User: Explain what overfitting is in one paragraph. Assistant:", return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(foundation_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


['User: Explain what overfitting is in one paragraph. Assistant: What are the main characteristics of an under-fitted model?\n  * **2** - How can we improve our models to be more robust against noise and hyperparameters, or at least not so bad as they were before training on a dataset']


The base model does not really answer the question: it continues the text with new questions instead of giving an explanation. We will try to improve the quality with a short fine-tuning process.


## Preparing the Dataset.
The dataset used is:

https://huggingface.co/datasets/tatsu-lab/alpaca

In [8]:
from datasets import load_dataset

dataset = "tatsu-lab/alpaca"

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=256,
        padding=False,  # no static padding
    )

response_token_ids = tokenizer("### Response:", add_special_tokens=False)["input_ids"]

def mask_response_only(example):
    input_ids = example["input_ids"]
    labels = input_ids.copy()

    # Find where response starts
    for i in range(len(input_ids) - len(response_token_ids) + 1):
        if input_ids[i:i+len(response_token_ids)] == response_token_ids:
            response_start = i + len(response_token_ids)
            break
    else:
        # If not found, mask everything
        response_start = len(input_ids)

    # Mask everything before response
    labels[:response_start] = [-100] * response_start

    example["labels"] = labels
    return example

data = load_dataset(dataset)
data = data.map(tokenize_function, batched=True)
data = data.map(mask_response_only)

train_sample = data["train"]
train_sample = train_sample.remove_columns(
    ['instruction', 'input', 'output', 'text']
)



In [9]:
example = train_sample[0]
print(len(example["input_ids"]), len(example["labels"]))

83 83
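To make the response-only masking easier to follow, here is a minimal sketch on made-up token ids. The ids are hypothetical (90 and 91 stand in for the tokens of the `### Response:` marker) and the helper name `mask_before_response` is mine; the logic mirrors `mask_response_only` above.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def mask_before_response(input_ids, response_token_ids):
    labels = list(input_ids)
    n = len(response_token_ids)
    response_start = len(input_ids)  # default: marker not found, mask everything
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_token_ids:
            response_start = i + n  # first token after the marker
            break
    labels[:response_start] = [IGNORE_INDEX] * response_start
    return labels

# Prompt tokens 11-13, marker 90-91, response tokens 21-22, end-of-text 2.
toy_input_ids = [11, 12, 13, 90, 91, 21, 22, 2]
toy_labels = mask_before_response(toy_input_ids, [90, 91])
print(toy_labels)
```

Only the response tokens keep their ids as labels, so only they contribute to the training loss. The lengths of `input_ids` and `labels` stay equal, which is what the `83 83` check above verifies on a real sample.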


## Fine-Tuning.
The first step will be to create a LoRA configuration object where we will set the variables that specify the characteristics of the fine-tuning process.

In [10]:
# TARGET_MODULES
# https://github.com/huggingface/peft/blob/39ef2546d5d9b8f5f8a7016ec10657887a867041/src/peft/utils/other.py#L220

import peft
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16, # The larger r is, the more parameters are trained.
    lora_alpha=32, # Scaling factor for the LoRA update; higher values give the newly trained weights more influence.
    target_modules=target_modules,
    lora_dropout=0.05, # Helps to avoid overfitting.
    bias="none", # Specifies whether the bias parameters should be trained.
    task_type="CAUSAL_LM"
)

In [11]:
# Wrap model + LoRA [ONLY RUN ONCE]

model = get_peft_model(foundation_model, lora_config)

In [12]:
model.print_trainable_parameters()

trainable params: 1,572,864 || all params: 406,906,880 || trainable%: 0.3865


The most important parameter is **r**: it defines how many parameters will be trained. The larger the value, the more parameters are trained, which costs more memory and compute but lets the model learn more complicated relations between inputs and outputs.
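The trainable parameter count printed above can be reproduced by hand, which shows how **r** drives it. The architecture numbers below (hidden size 1024, 24 layers, a fused `query_key_value` projection with three times the hidden size as output) are my assumptions about `EleutherAI/pythia-410m`; they are not stated in this notebook, but they reproduce the printed figure.

```python
# Assumed architecture of EleutherAI/pythia-410m (not stated in the notebook).
hidden_size = 1024
num_layers = 24
r = 16

in_features = hidden_size
out_features = 3 * hidden_size  # query, key and value fused in one projection

# Each adapted layer gets two small matrices: A (r x in) and B (out x r).
params_per_layer = r * in_features + out_features * r
trainable = params_per_layer * num_layers

all_params = 406_906_880  # "all params" from the printed output above
trainable_pct = 100 * trainable / all_params
print(f"trainable params: {trainable:,} || trainable%: {trainable_pct:.4f}")
```

Doubling `r` doubles the number of trainable parameters, since both matrices grow linearly with it.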

You can find a list of the default **target_modules** for each model family in the [PEFT source code on GitHub](https://github.com/huggingface/peft/blob/39ef2546d5d9b8f5f8a7016ec10657887a867041/src/peft/utils/other.py#L220).

**lora_alpha**. The larger the number, the more weight the LoRA activations have, so the fine-tuning process has more impact as this value grows. By default, PEFT scales the LoRA update by lora_alpha / r.
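Here is a tiny sketch of how **r** and **lora_alpha** enter the forward pass of an adapted layer: the output is the frozen layer's output plus the low-rank update scaled by `lora_alpha / r`. The matrices are toy values in plain Python lists and the helper names are hypothetical. I use `r=1` and `lora_alpha=2`, which gives the same scaling factor (2.0) as the notebook's 32 / 16.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    scaling = lora_alpha / r
    base = matvec(W, x)               # frozen pre-trained weights
    update = matvec(B, matvec(A, x))  # low-rank path: B @ (A @ x)
    return [b + scaling * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (toy values)
A = [[1.0, 1.0]]              # r x in_features, with r = 1
x = [1.0, 2.0]

B_initial = [[0.0], [0.0]]    # B starts at zero, so the update is zero
B_trained = [[0.5], [0.0]]    # made-up values after some training

out_initial = lora_forward(W, A, B_initial, x, lora_alpha=2, r=1)
out_trained = lora_forward(W, A, B_trained, x, lora_alpha=2, r=1)
print(out_initial, out_trained)
```

With `B` at zero the adapted layer behaves exactly like the original one; once `B` has been trained, its contribution is multiplied by the scaling factor.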

**lora_dropout**. Like common dropout, it is used to avoid overfitting.

**bias**. I hesitated between *none* and *lora_only*. For text classification the most common value is *none*, and for chat or question answering, *all* or *lora_only*.

**task_type**. Indicates the task the model is being trained for. In this case, text generation.

In [13]:
#Path of the directory that will contain the model
import os
working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")

In the TrainingArguments we set the number of epochs we want to train, the output directory, and the learning_rate.

In [14]:
#Creating the TrainingArgs
import transformers
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=output_directory,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=100,
    save_strategy="no",
    bf16=True, # needs a GPU with bfloat16 support (e.g. L4)
    fp16=False, # leave off when bf16 is enabled
    tf32=True, # needs an Ampere-or-newer GPU (e.g. L4)
    optim="adamw_torch",
    report_to="none",
)

Now we can train the model.
To train the model we need:

* The model, already wrapped with the LoRA config by `get_peft_model`.
* The training_args.
* The dataset.
* The data collator, which gets the dataset ready to be processed in batches.





In [15]:
import torch
from transformers import DataCollatorWithPadding

class ResponseOnlyDataCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.pad_collator = DataCollatorWithPadding(tokenizer)

    def __call__(self, features):
        # Extract labels first
        labels = [f["labels"] for f in features]

        # Remove labels from features before padding
        features_no_labels = [{k: v for k, v in f.items() if k != "labels"} for f in features]

        # Pad input_ids & attention_mask
        batch = self.pad_collator(features_no_labels)

        max_length = batch["input_ids"].shape[1]

        # Pad labels manually with -100
        padded_labels = []
        for label in labels:
            padded = label + [-100] * (max_length - len(label))
            padded_labels.append(padded)

        batch["labels"] = torch.tensor(padded_labels, dtype=torch.long)

        return batch
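The label-padding step of the collator can be tried in isolation. This is a minimal plain-Python sketch with made-up label lists; the helper name `pad_labels` is hypothetical. Like the collator above, it pads on the right, which assumes the tokenizer also pads `input_ids` on the right.

```python
IGNORE_INDEX = -100  # padded positions must not contribute to the loss

def pad_labels(label_lists, pad_value=IGNORE_INDEX):
    max_length = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_length - len(labels))
            for labels in label_lists]

# Two samples of different length; prompt positions are already masked.
batch_labels = [[-100, -100, 5, 6, 2], [-100, 7, 2]]
padded = pad_labels(batch_labels)
print(padded)
```

Padding labels with -100 instead of the pad token id matters here because `pad_token` was set to `eos_token`: otherwise the model would be trained to predict end-of-text on every padded position.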


In [16]:
from transformers import Trainer

data_collator = ResponseOnlyDataCollator(tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_sample,
    data_collator=data_collator,  # pads inputs and labels per batch
)

In [17]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 2.023626 |
| 200  | 1.941585 |
| 300  | 1.911147 |
| 400  | 1.872009 |
| 500  | 1.889548 |
| 600  | 1.87478  |
| 700  | 1.875095 |
| 800  | 1.871446 |
| 900  | 1.83447  |
| 1000 | 1.811592 |


TrainOutput(global_step=3251, training_loss=1.8300443360050873, metrics={'train_runtime': 1029.1839, 'train_samples_per_second': 50.527, 'train_steps_per_second': 3.159, 'total_flos': 2.124267925836595e+16, 'train_loss': 1.8300443360050873, 'epoch': 1.0})
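As a rough sanity check on the `global_step=3251` reported above, the number of optimizer steps can be estimated from the batch settings. The dataset size (52,002 rows in the `tatsu-lab/alpaca` train split) and the single-GPU setup are my assumptions, and this is a simplified estimate rather than the Trainer's exact internal computation.

```python
import math

num_rows = 52_002        # assumed size of the tatsu-lab/alpaca train split
per_device_batch = 8     # per_device_train_batch_size
grad_accum = 2           # gradient_accumulation_steps
num_devices = 1          # assumption: a single GPU
num_epochs = 1

# Samples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(num_rows / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(f"effective batch: {effective_batch}, optimizer steps: {total_steps}")
```

Gradient accumulation halves the number of optimizer steps compared with a plain batch size of 8, without increasing the memory needed per forward pass.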

In [22]:
#Path where the LoRA adapter will be saved.
peft_model_path = os.path.join(output_directory, "lora_model")


In [23]:
#Save the model.
trainer.model.save_pretrained(peft_model_path)

In [None]:
#If you are having memory problems, run this cell to free some memory
import gc
import torch
del foundation_model
del trainer
del train_sample
torch.cuda.empty_cache()
gc.collect()

## Load the fine-tuned model

In [None]:
#import peft
from peft import AutoPeftModelForCausalLM, PeftConfig
#import os

device_map = {"": 0}
working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")
peft_model_path = os.path.join(output_directory, f"lora_model")


In [None]:
bnb_config2 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
#Load the Model.
loaded_model = AutoPeftModelForCausalLM.from_pretrained(
                                        peft_model_path,
                                        #torch_dtype=torch.bfloat16,
                                        is_trainable=False,
                                        #load_in_4bit=True,
                                        quantization_config=bnb_config2,
                                        device_map = 'cuda')

## Inference the fine-tuned model.

In [None]:
input_sentences = tokenizer("I want you to act as a motivational coach. ", return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

The result is really good. Let's compare the answer of the pre-trained model with the fine-tuned one:

* **Pretrained Model**: 'I want you to act as a motivational coach. \xa0You are going on an adventure with me, and I need your help.\nWe will be traveling through the land of “What If.” \xa0 This is not some place that exists in reality; it’s more like one those places we see when watching'

* **Fine-Tuned Model**: 'I want you to act as a motivational coach.  I will provide some information about an individual or group of people who need motivation, and your role is help them find the inspiration they require in order achieve their goals successfully! You can use techniques such as positive reinforcement, visualization exercises etc., depending on what'

As you can see, the result is really similar to the samples contained in the dataset used to fine-tune the model, even though we trained the model for only a few epochs and with a really small number of rows.

 - Complete the prompts similar to what we did in class. 
     - Try a few versions if you have time
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well? i.e., where the model either hallucinated or was wrong
 - What did you learn?

# Evaluation

In [None]:
model.eval()

In [None]:
# Plain prompt
prompt = "Explain what overfitting is in one paragraph."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
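The evaluation cells sample with `temperature=0.7` and `top_p=0.9`. The sketch below shows, on made-up logits and in plain Python, what these two settings do to the next-token distribution: temperature rescales the logits before the softmax, and top-p keeps the smallest set of most likely tokens whose total probability reaches the threshold. The helper names are hypothetical and this is an illustration of the idea, not the Transformers implementation.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_indices(probs, top_p):
    """Smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
for temperature in (1.0, 0.7):
    probs = softmax(toy_logits, temperature)
    kept = top_p_indices(probs, top_p=0.9)
    print(temperature, [round(p, 3) for p in probs], kept)
```

Lowering the temperature concentrates probability on the top token, so fewer candidates survive the top-p cut and the sampled text becomes less varied.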

In [None]:
# Alpaca-style format prompt
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain what overfitting is in one paragraph.

### Response:
"""


inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
# non-instruction prompt
prompt = """The capital of France is"""


inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

## Conclusion

`EleutherAI/pythia-410m` was chosen deliberately: its reduced parameter count makes it fast to train, while it is still large enough to show real behavioral shifts and its limitations.

The `tatsu-lab/alpaca` dataset provides an instruction -> response format, a polite explanatory tone, and structured completion behavior. This made it possible to run inference with several prompt styles and show where the model's strengths and weaknesses lay.

After a first training run, it was necessary to mask everything up to `### Response:` so that the loss covers only the response. Without this, the model memorized the template pattern and produced heavily repetitive output.

### Prompts

A) Plain prompt

`Explain what overfitting is in one paragraph` 

This tested:

* Has the model internalized instruction-following without template?

* Did alignment shift the general behavior boundary?

Result:
After masking + training → cleaner answers, less repetition.

B) Alpaca Template Prompt

`### Instruction:`
...
`### Response:`
...

This tested:

* Has the model learned the intended instruction-response mapping?

* Does it respect structural boundaries?

Before masking:

* It looped templates.

* It generated new instructions.

After response-only masking:

* Clean single response.

* No template loops.

* Objective correction worked.

This was the most important structural test.

C) Non-Instruction Continuation

This tested:

* Did we break base LM ability?

* Did the model overfit to instruction formatting?

* Does it inject `### Response:` randomly?

Result:

* No template injection.

* Continued normally.

* Base LM ability preserved.

This confirmed masking prevented over-steering.

### Key Takeaways

Low training loss does not explain model behavior. Structural correctness required changing the objective (masking).

Instruction tuning is not knowledge injection -> Our non-instruction prompt demonstrated that a small model's capacity translates into limited reasoning depth.

LoRA does not add intelligence. It reallocates capacity toward the supervised pattern.
