# Lab | QLoRA Tuning using PEFT from Hugging Face

<!-- ### Introduction to Quantization & Fine-tune a Quantized Model -->

**Note:** This is more or less the same notebook you saw in the previous lesson, and that is fine. This is an LLM fine-tuning lab: in class we used one set of datasets and models, and in the labs you are required to change the LLMs and the datasets, including the pre-processing pipelines.

# Brief Introduction to Quantization
The main idea of quantization is simple: Reduce the precision of floating-point numbers, which normally occupy 32 bits, to integers of 8 or even 4 bits.

This reduction occurs in the model’s parameters, specifically in the weights of the neural layers, and in the activation values that flow through the model’s layers.

This means that we not only reduce the model’s storage size and memory consumption but can also speed up its calculations.

Naturally, there is a loss of precision, but particularly in the case of 8-bit quantization, this loss is minimal.



## Let's see an example of a quantized number.

In reality, what I want to examine is the precision loss that occurs when transitioning from a 32-bit number to a quantized 8/4-bit number and then returning to its original 32-bit value.

First, I'm going to create a function to quantize and another to unquantize.

In [48]:
# Importing necessary libraries

import math

import numpy as np
import matplotlib.pyplot as plt

In [49]:
# Functions to quantize and unquantize

def quantize(value, bits = 4):
    quantized_value = np.round(value * (2 ** (bits - 1) - 1))
    return int(quantized_value)

def unquantize(quantized_value, bits = 4):
    value = quantized_value / (2 ** (bits - 1) - 1)
    return float(value)

Quantized values:

In [None]:
quant_4 = quantize(0.622, 4)

print(quant_4)

quant_8 = quantize(0.622, 8)

print(quant_8)

Unquantized values:

In [None]:
unquant_4 = unquantize(quant_4, 4)

print(unquant_4)

unquant_8 = unquantize(quant_8, 8)

print(unquant_8)

If we consider that the original number was 0.622, it can be said that 8-bit quantization barely loses precision, and the loss from 4-bit quantization is manageable.
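
To make the loss concrete, the sketch below repeats the two functions from above (so that it runs on its own) and measures the round-trip error for 0.622. The `round_trip_error` helper is a name I introduce only for this illustration, and the worst-case bound assumes inputs in the range [-1, 1].

```python
import numpy as np

def quantize(value, bits=4):
    # Scale a value in [-1, 1] to the signed integer range and round
    return int(np.round(value * (2 ** (bits - 1) - 1)))

def unquantize(quantized_value, bits=4):
    # Map the integer back to a float in [-1, 1]
    return float(quantized_value / (2 ** (bits - 1) - 1))

def round_trip_error(value, bits):
    # Hypothetical helper: absolute error after quantize -> unquantize
    return abs(value - unquantize(quantize(value, bits), bits))

for bits in (4, 8):
    scale = 2 ** (bits - 1) - 1      # 7 for 4 bits, 127 for 8 bits
    worst_case = 1 / (2 * scale)     # rounding is off by at most half a step
    print(bits, quantize(0.622, bits), round_trip_error(0.622, bits), worst_case)
```

With 4 bits, 0.622 is stored as the integer 4 and comes back as 4/7 (about 0.571). With 8 bits it is stored as 79 and comes back as 79/127 (about 0.622).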

In [52]:
x = np.linspace(-1, 1, 50)
y = [math.cos(val) for val in x]

y_quant_8bit = np.array([quantize(val, bits = 8) for val in y])
y_unquant_8bit = np.array([unquantize(val, bits = 8) for val in y_quant_8bit])

y_quant_4bit = np.array([quantize(val, bits = 4) for val in y])
y_unquant_4bit = np.array([unquantize(val, bits = 4) for val in y_quant_4bit])

Let’s plot a curve with the unquantized values of a cosine.


In [None]:
plt.figure(figsize = (10, 12))

plt.subplot(4, 1, 1)
plt.plot(x, y, label = 'Original')
plt.plot(x, y_unquant_8bit, label = 'unquantized_8bit')
plt.plot(x, y_unquant_4bit, label = 'unquantized_4bit')
plt.legend()
plt.title('Quantized Curves Graph Comparison')
plt.grid(True)

As you can see, the difference between the 8-bit and the original values is minimal. However, we need 4-bit quantization if we want to load a 7B or 8B model onto a 16GB GPU without problems.
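
A rough back-of-the-envelope check shows why the bit width matters so much. The sketch below counts only the memory for the weights, and the round figure of 7e9 parameters is an assumption; activations, gradients, and optimizer state come on top of this.

```python
def weight_memory_gib(n_params, bits_per_param):
    # Memory needed for the weights alone, in GiB
    return n_params * bits_per_param / 8 / 1024 ** 3

n_params = 7e9  # assumed round figure for a "7B" model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(n_params, bits):5.1f} GiB")
```

Under these assumptions the weights need about 26 GiB at 32 bits and a little over 3 GiB at 4 bits, which leaves room on a 16GB GPU for everything else that training requires.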


# QLoRA. Fine-tuning a 4-bit Quantized Model using LoRA.
We are going to use LoRA to fine-tune a model in the 7B to 8B range, quantized to 4 bits.

## Load the PEFT and Datasets Libraries.

The PEFT library contains the Hugging Face implementation of different fine-tuning techniques, like LoRA Tuning.

Using the Datasets library we have access to a huge number of datasets.

In [54]:
# !pip install -q accelerate==0.29.3
# !pip install -q bitsandbytes==0.43.1
# !pip install -q trl==0.8.6
# !pip install -q peft==0.10.0
# !pip install -q transformers==4.40.0

I'm going to download the peft and Transformers libraries from their repositories on GitHub instead of using pip. This is not strictly necessary, but this way, you can get the newest versions of the libraries with support for newer models. If you want to check one of the latest models, you can use this trick.


In [55]:
# Install the latest versions of peft & transformers library
# - recommended if you want to work with the most recent models

# !pip install -q git+https://github.com/huggingface/peft.git
# !pip install -q git+https://github.com/huggingface/transformers.git

From the Transformers library, we import the necessary classes to load the model and the tokenizer.

The notebook is ready to work with different models. I tested it with models from the Bloom Family and Llama-3.

I recommend that you test different models.

In [56]:
import torch

from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

## Hugging Face login

In [57]:
import os

from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv('../IHAI-lessons/000_lesson_data/044_llm/.env'))

HF_TOKEN  = os.getenv('HF_TOKEN')

# print(HF_TOKEN)

In [None]:
from huggingface_hub import login

login(token=HF_TOKEN)

from huggingface_hub import whoami

# print(whoami())

## Load Model

In [59]:
# Use any model you want. If you want to do some fast tests, just use the smallest one.

# model_name = 'bigscience/bloom-560m'
# model_name = 'bigscience/bloom-1b1'
# model_name = 'bigscience/bloom-7b1'

# target_modules = ['query_key_value']

model_name = 'meta-llama/Meta-Llama-3-8B'

target_modules = ['q_proj', 'v_proj']

To load the model, we need a configuration class that specifies how we want the quantization to be performed. We’ll achieve this with the BitsAndBytesConfig from the Transformers library.

In [60]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_use_double_quant = True,
    bnb_4bit_quant_type = 'nf4',  # or 'fp4'
    bnb_4bit_compute_dtype = torch.bfloat16  # float16
)

We are specifying the use of 4-bit quantization and also enabling double quantization, which quantizes the quantization constants themselves to save a little more memory.

For the bnb_4bit_quant_type parameter, I've used the recommended value in the paper [QLoRA: Efficient Finetuning of Quantized LLMs.](https://arxiv.org/abs/2305.14314)
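
To get a feel for what double quantization buys, the sketch below estimates the memory overhead of the quantization constants per parameter. The block sizes (64 weights per constant, 256 constants per second-level constant) and the 8-bit constant width are taken from the QLoRA paper as I understand it; treat them as assumptions rather than as the library's guaranteed defaults.

```python
WEIGHT_BLOCK = 64      # weights sharing one quantization constant (assumed)
CONSTANT_BLOCK = 256   # first-level constants sharing one second-level constant (assumed)

# Without double quantization: one 32-bit constant per block of 64 weights
overhead_single = 32 / WEIGHT_BLOCK

# With double quantization: 8-bit constants plus one 32-bit constant per 256 of them
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

saved_bits = overhead_single - overhead_double
saved_gib = saved_bits * 8e9 / 8 / 1024 ** 3   # for an assumed 8e9-parameter model

print(f"overhead without double quantization: {overhead_single:.3f} bits/param")
print(f"overhead with double quantization:    {overhead_double:.3f} bits/param")
print(f"memory saved on 8e9 parameters:       {saved_gib:.2f} GiB")
```

The saving per parameter is small, but it applies to every weight in the model, so it adds up to a few hundred megabytes at this scale.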

Now, we can go ahead and load the model.

In [61]:
device_map = 'auto'  # {'': 'cuda'}  # or (was) '': 0

foundation_model = AutoModelForCausalLM.from_pretrained(model_name,
                    mirror = 'https://hf-mirror.com',
                    token = HF_TOKEN,
                    quantization_config = bnb_config,
                    device_map = device_map,
                    use_cache = False)

Now we have the quantized version of the model in memory. You can try to load the unquantized version to see if it's possible.

In [62]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

## Inference with the pre-trained model.
I'm going to do a test with the pre-trained model without fine-tuning, to see if something changes after the fine-tuning.

In [63]:
# Return the generated token ids for the given model and tokenized inputs

def get_outputs(model, inputs, max_new_tokens = 100):

    outputs = model.generate(
        input_ids = inputs['input_ids'],
        attention_mask = inputs['attention_mask'],
        max_new_tokens = max_new_tokens,
        repetition_penalty = 1.5,  # Avoid repetition
        early_stopping = False,  # Only used by beam search; generation still ends at EOS or max_new_tokens
        eos_token_id = tokenizer.eos_token_id
    )
    
    return outputs
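
The `repetition_penalty = 1.5` setting deserves a short illustration. The sketch below is my own toy version of the rule as I understand it, not the library's code: the score of every token that already appears in the sequence is divided by the penalty when positive and multiplied by it when negative. The helper name and the 4-token vocabulary are hypothetical.

```python
import numpy as np

def apply_repetition_penalty(logits, previous_token_ids, penalty=1.5):
    # Hypothetical helper: penalize every token that already appears in the sequence
    logits = logits.copy()
    for token_id in set(previous_token_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # shrink positive scores
        else:
            logits[token_id] *= penalty   # push negative scores further down
    return logits

logits = np.array([2.0, -1.0, 0.5, 3.0])   # toy scores for a 4-token vocabulary
print(apply_repetition_penalty(logits, previous_token_ids=[0, 1]))
```

Tokens 0 and 1 become less likely to be picked again, while tokens 2 and 3 keep their original scores.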

The dataset used for the fine-tuning contains prompts to be used with Large Language Models.

I'm going to ask the pre-trained model to act as a motivational coach.

In [None]:
# Inference original model

input_sentence = tokenizer('I want you to act as a motivational coach. ', return_tensors = 'pt').to('cuda')

foundational_outputs_sentence = get_outputs(foundation_model, input_sentence, max_new_tokens = 50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens = True))

The answer is good enough; the model used is very well trained. Still, we will try to improve the quality with a short fine-tuning process.


## Preparing the Dataset.
The dataset used is:

https://huggingface.co/datasets/fka/awesome-chatgpt-prompts

In [None]:
from datasets import load_dataset

dataset = 'fka/awesome-chatgpt-prompts'

data = load_dataset(dataset)

data = data.map(lambda samples: tokenizer(samples['prompt']), batched = True)

train_sample = data['train'].select(range(50))

del data

train_sample = train_sample.remove_columns('act')

display(train_sample)

In [None]:
print(train_sample[0])

## Fine-Tuning.
The first step will be to create a LoRA configuration object where we will set the variables that specify the characteristics of the fine-tuning process.

In [None]:
# TARGET_MODULES
# https://github.com/huggingface/peft/blob/39ef2546d5d9b8f5f8a7016ec10657887a867041/src/peft/utils/other.py#L220

from peft import LoraConfig

lora_config = LoraConfig(
    r = 16,  # Rank of the update matrices: a larger r means more trainable parameters
    lora_alpha = 16,  # Scaling factor: the LoRA update is multiplied by lora_alpha / r
    target_modules = target_modules,
    lora_dropout = 0.05,  # Helps to avoid overfitting
    bias = 'none',  # This specifies if the bias parameter should be trained
    task_type = 'CAUSAL_LM'
)

The most important parameter is **r**: it defines how many parameters will be trained. The bigger the value, the more parameters are trained, which costs more memory and compute but lets the model learn more complicated relations between inputs and outputs.

You can find a list of the **target_modules** available in the [Hugging Face PEFT repository](https://github.com/huggingface/peft/blob/39ef2546d5d9b8f5f8a7016ec10657887a867041/src/peft/utils/other.py#L220).

**lora_alpha** is a scaling factor: the LoRA update is multiplied by lora_alpha / r, so the bigger the number, the more impact the fine-tuning process has on the model's output.

**lora_dropout** works like ordinary dropout and is used to reduce overfitting.

**bias** specifies which bias terms are trained. I hesitated between *none* and *lora_only*. For text classification the most common value is *none*, and for chat or question answering, *all* or *lora_only*.

**task_type**. Indicates the task the model is being trained for. In this case, text generation.
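
To see how **r** and **lora_alpha** translate into numbers, here is a small sketch. It relies on the standard LoRA layout, where each adapted weight of shape d_out x d_in gets two small matrices with r * (d_in + d_out) parameters in total. The Llama-3-8B layer shapes used below are my assumptions (they are not stated in this notebook), and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_in, d_out, r):
    # LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight
    return r * (d_in + d_out)

r, lora_alpha = 16, 16

# Assumed Llama-3-8B shapes: 32 layers, q_proj 4096 -> 4096, v_proj 4096 -> 1024
n_layers = 32
per_layer = lora_trainable_params(4096, 4096, r) + lora_trainable_params(4096, 1024, r)
total = n_layers * per_layer

print("trainable LoRA parameters:", total)
print("share of an 8e9-parameter model:", f"{total / 8e9:.4%}")
print("scaling applied to the update (lora_alpha / r):", lora_alpha / r)
```

Under these assumptions only a few million parameters are trained, well under 0.1% of the model, and with r equal to lora_alpha the update is applied with a scale of 1.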

In [68]:
# Create a directory to contain the Model

working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")

In the TrainingArguments we specify the number of epochs we want to train, the output directory, and the learning rate.

In [69]:
# Creating the TrainingArgs

import transformers

from transformers import TrainingArguments  #, Trainer

training_args = TrainingArguments(
    output_dir = output_directory,
    auto_find_batch_size = True,  # Retry with a smaller batch size after a CUDA out-of-memory error
    learning_rate = 2e-4,  # Higher learning rate than full fine-tuning
    per_device_train_batch_size = 4,  # Added by me
    num_train_epochs = 5
)
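
These settings imply a very short training run. The sketch below estimates the number of optimizer steps for the 50-row sample used in this lab; it assumes a single GPU, no gradient accumulation, no sequence packing, and that `auto_find_batch_size` never has to lower the batch size.

```python
import math

n_samples = 50     # rows selected for train_sample
batch_size = 4     # per_device_train_batch_size
epochs = 5         # num_train_epochs

# Assumptions: one GPU, gradient_accumulation_steps = 1, the last partial batch is kept
steps_per_epoch = math.ceil(n_samples / batch_size)
total_steps = steps_per_epoch * epochs

print("steps per epoch:", steps_per_epoch)
print("total optimizer steps:", total_steps)
```

If the progress bar shows a different total, one of the assumptions above does not hold in your environment.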

Now we can train the model.
To train the model we need:


* The model.
* The training_args.
* The dataset.
* The result of the DataCollator: the dataset ready to be processed in batches.
* The LoRA config.
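
As a rough picture of what the data collator contributes for causal language modeling, the sketch below pads a batch to a common length and builds the labels. It is my own NumPy illustration, not the `transformers` implementation; the helper name, the right-side padding, and the toy token ids are assumptions.

```python
import numpy as np

def collate_for_causal_lm(sequences, pad_token_id):
    # Hypothetical helper: right-pad token id lists to a common length
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_token_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1
    # Labels are a copy of the inputs; pad positions get -100 so the loss ignores them
    labels = input_ids.copy()
    labels[input_ids == pad_token_id] = -100
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_for_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])
print(batch["labels"])
```

One consequence worth knowing: because the pad token is set to the EOS token in this notebook, masking by pad id also hides genuine EOS tokens from the loss.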





In [None]:
tokenizer.pad_token = tokenizer.eos_token

trainer = SFTTrainer(
    model = foundation_model,
    args = training_args,
    train_dataset = train_sample,
    peft_config = lora_config,
    dataset_text_field = 'prompt',
    tokenizer = tokenizer,
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm = False)
)

trainer.train()

In [None]:
# Save the model

trainer.model.save_pretrained(output_directory)

In [None]:
# Free GPU memory before reloading the model; skip this cell if you are not having memory problems

import gc
import torch

del foundation_model
del trainer
del train_sample

torch.cuda.empty_cache()
gc.collect()

## Load the fine-tuned model

In [73]:
from peft import AutoPeftModelForCausalLM  #, PeftConfig

peft_model_path = output_directory

device_map = {'': 0}

In [None]:
# Load the Model

loaded_model = AutoPeftModelForCausalLM.from_pretrained(
                                        peft_model_path,
                                        # torch_dtype = torch.bfloat16,
                                        is_trainable = False,
                                        # load_in_4bit = True,
                                        quantization_config = bnb_config,
                                        device_map = 'cuda')

## Inference with the fine-tuned model

In [None]:
# Generate with the fine-tuned model, using the same prompt as before
finetuned_outputs_sentence = get_outputs(loaded_model, input_sentence, max_new_tokens = 50)

print(tokenizer.batch_decode(finetuned_outputs_sentence, skip_special_tokens = True))

The result is really good. Let's compare the answer of the pre-trained model with the fine-tuned one:

* **Pretrained Model**: 'I want you to act as a motivational coach. \xa0You are going on an adventure with me, and I need your help.\nWe will be traveling through the land of “What If.” \xa0 This is not some place that exists in reality; it’s more like one those places we see when watching'

* **Fine-Tuned Model**: 'I want you to act as a motivational coach.  I will provide some information about an individual or group of people who need motivation, and your role is help them find the inspiration they require in order achieve their goals successfully! You can use techniques such as positive reinforcement, visualization exercises etc., depending on what'

As you can see, the result is really similar to the samples contained in the dataset used to fine-tune the model. And we only trained the model for 5 epochs on a sample of just 50 rows.

 - Complete the prompts similar to what we did in class. 
     - Try a few versions if you have time
     - Be creative
 - Write a report summarizing your findings.
     - Were there variations that didn't work well, i.e., where the model either hallucinated or was wrong?
 - What did you learn?

In [None]:
# REPORT AND FINDINGS

# Switched the model to Llama 3 8b, and the results were really good, even with low epochs.
# The model felt more accurate and hallucinated less compared to my earlier attempts with other models.
# Memory spikes were a huge issue, especially during inference.
# I tried a few variations of prompts. Keeping the prompts more structured seemed to help a lot.