# Fine-tune LLaMa-2 7B Chat model with QLoRA

In this example, we explore how to use supervised fine-tuning (SFT) to train the LLaMa-2 7B chat model. To drastically reduce VRAM usage, we fine-tune the model in 4-bit precision, which is why we use QLoRA here.

As a result, this example runs on an RTX 3080 with only 10 GB of VRAM.
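
To see why 4-bit loading matters on a 10 GB card, here is a rough back-of-the-envelope sketch of the memory needed just to hold the weights at different precisions. It assumes a round 7 billion parameters and ignores activations, LoRA adapters, optimizer state, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7e9  # assumption: "7B" taken as a round 7 billion parameters

for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions, fp16 weights alone already exceed 10 GiB, while 4-bit weights leave headroom for activations and the adapter.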

## Setup

Make sure you have the following requirements installed:
```
bitsandbytes>=0.40.2
accelerate>=0.21.0
peft>=0.4.0
trl>=0.4.7
datasets>=2.17.0
transformers>=4.31.0
```
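
If you want to verify the installed versions before running the notebook, a small check like the one below may help. It is a minimal sketch using only the standard library; the helper names are hypothetical, and the version parsing is deliberately simple (it compares only the leading numeric components).

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# Minimum versions copied from the requirements list above.
REQUIREMENTS = {
    "bitsandbytes": "0.40.2",
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "trl": "0.4.7",
    "datasets": "2.17.0",
    "transformers": "4.31.0",
}


def parse_version(text: str) -> tuple:
    """Turn '4.31.0.dev0' into (4, 31, 0); stops at the first non-numeric part."""
    numbers = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def check_requirements(requirements: dict) -> dict:
    """Return a short status string for each required package."""
    report = {}
    for package, minimum in requirements.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            report[package] = "missing"
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[package] = f"{installed} (ok)" if ok else f"{installed} (needs >= {minimum})"
    return report


for package, status in check_requirements(REQUIREMENTS).items():
    print(f"{package}: {status}")
```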

In [1]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "AdiOO7/llama-2-finance"

# Fine-tuned model name
new_model = "llama-2-7b-test-finance"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

## Configure `bitsandbytes` for 4-bit quantization

In [2]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

print('loading model - this gonna take some time for the first timer')
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)
model.config.use_cache = True
model.config.pretraining_tp = 1
print('model loaded')

loading model - this gonna take some time for the first timer


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
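
To give a feel for what 4-bit storage means, the sketch below quantizes a small block of weights with plain absmax quantization to signed 4-bit integer codes and then dequantizes it. This is a simplified stand-in and not what `bitsandbytes` does internally: the `nf4` type uses a non-uniform set of levels, which is not reproduced here. The function names and sample values are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes in [-levels, levels] using absmax scaling."""
    levels = 2 ** (bits - 1) - 1          # 7 for signed 4-bit codes
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    codes = [round(v / absmax * levels) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, bits=4):
    """Recover approximate floats from the integer codes and the stored scale."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * absmax for c in codes]


weights = [0.1, -0.8, 0.33, 0.0]  # made-up values for illustration
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

print("codes:   ", codes)
print("restored:", [round(x, 3) for x in restored])
```

Each weight is stored as a small integer plus one shared scale per block, which is where the memory saving comes from; the price is a small rounding error per weight.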

## Load dataset
First of all, we want to load the dataset we defined. Here, our dataset is already preprocessed but, usually, this is where you would reformat the prompt, filter out bad text, combine multiple datasets, etc.

In [4]:
# Load dataset
dataset = load_dataset(dataset_name)

In [7]:
dataset["train"][0]

{'text': '### Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive} ### Human: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing . ### Assistant: neutral.'}
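
As an illustration of the reformatting step mentioned above, the sketch below builds a `text` field in the same `### Instruction` / `### Human` / `### Assistant` layout as the sample record. The function name and the raw field names (`instruction`, `input`, `output`) are hypothetical; a real raw dataset may use different columns.

```python
def format_example(record: dict) -> dict:
    """Build the single `text` field used for training from separate raw fields."""
    text = (
        f"### Instruction: {record['instruction']} "
        f"### Human: {record['input']} "
        f"### Assistant: {record['output']}"
    )
    return {"text": text}


# Made-up raw record, for illustration only.
raw = {
    "instruction": "What is the sentiment of this tweet?",
    "input": "The company is growing .",
    "output": "positive.",
}
print(format_example(raw)["text"])
```

With the `datasets` library, a function like this could be applied to every row with `dataset.map(format_example)`.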

## Load tokenizer

Next, we load the tokenizer from Hugging Face and set `padding_side` to "right" to work around an overflow issue with fp16 training.

In [8]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)


# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

Your GPU supports bfloat16: accelerate training with bf16=True
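
To make the `padding_side` setting concrete, here is a minimal pure-Python sketch of what right padding does to a batch of token id lists. The token ids are made up, and the pad id of 2 is an assumption standing in for the EOS id, since the pad token is set to the EOS token above.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that should be ignored.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_right([[5, 6, 7], [8]], pad_id=2)
print(ids)
print(mask)
```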


## PEFT parameters

Traditional fine-tuning of pre-trained language models (PLMs) requires updating all of the model's parameters, which is computationally expensive and requires massive amounts of data.

Parameter-Efficient Fine-Tuning (PEFT) updates only a small number of parameters, which makes it much more efficient. With LoRA, the method used here, the base model weights stay frozen and only small low-rank adapter matrices are trained. To learn about the parameters, read the official PEFT [doc](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).

In [9]:
print('Load LoRA configuration')
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

Load LoRA configuration
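
The effect of `r` and `lora_alpha` can be made concrete with a little arithmetic. For a frozen weight matrix of shape `d_out x d_in`, LoRA trains two matrices of shapes `d_out x r` and `r x d_in` and scales their product by `lora_alpha / r`. The sketch below counts trainable parameters under the assumption that adapters are attached to the query and value projections of all 32 layers with hidden size 4096. Which modules are actually targeted depends on the PEFT defaults for the model, so treat the total as an estimate.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter (matrices A and B)."""
    return r * (d_in + d_out)


LORA_R = 64
LORA_ALPHA = 16
HIDDEN_SIZE = 4096      # hidden size of the 7B model
NUM_LAYERS = 32         # number of transformer layers in the 7B model
ADAPTED_PER_LAYER = 2   # assumption: query and value projections only

per_matrix = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
total = per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS
scaling = LORA_ALPHA / LORA_R

print(f"trainable parameters per adapted matrix: {per_matrix:,}")
print(f"total trainable parameters (estimate):   {total:,}")
print(f"share of a 7e9-parameter model:          {total / 7e9:.2%}")
print(f"LoRA scaling factor (alpha / r):         {scaling}")
```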


## Training parameters

Below is a list of hyperparameters that can be used to optimize the training process:

- output_dir: Directory where the model predictions and checkpoints are stored.
- num_train_epochs: Number of training epochs (one here).
- fp16/bf16: Enable fp16/bf16 mixed-precision training (both disabled here).
- per_device_train_batch_size: Batch size per GPU for training.
- per_device_eval_batch_size: Batch size per GPU for evaluation.
- gradient_accumulation_steps: Number of steps over which gradients are accumulated before each weight update.
- gradient_checkpointing: Enable gradient checkpointing, which trades extra compute for lower memory usage.
- max_grad_norm: Maximum gradient norm used for gradient clipping.
- learning_rate: Initial learning rate.
- weight_decay: Weight decay applied to all layers except bias/LayerNorm weights.
- optim: Optimizer to use (paged_adamw_32bit here).
- lr_scheduler_type: Learning rate schedule (constant here).
- max_steps: Number of training steps; overrides num_train_epochs when set (-1 here, so it is not used).
- warmup_ratio: Ratio of steps for a linear warmup (from 0 to the learning rate).
- group_by_length: Group sequences of similar length into the same batch, which saves memory and speeds up training considerably.
- save_steps: Save a checkpoint every 25 update steps.
- logging_steps: Log every 25 update steps.
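
Several of these settings interact. The sketch below shows how the effective batch size, the number of optimizer steps, and the number of warmup steps follow from the values used in this example. The dataset size of 1,000 examples is a hypothetical figure chosen for illustration, and rounding the warmup up is an assumption about how the trainer converts the ratio into steps.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_gpus, num_epochs, warmup_ratio):
    """Approximate step counts implied by the batch and warmup settings."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


schedule = training_schedule(
    num_examples=1000,          # hypothetical dataset size
    per_device_batch_size=4,
    grad_accum_steps=1,
    num_gpus=1,
    num_epochs=1,
    warmup_ratio=0.03,
)
print(schedule)
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing the memory needed per step, at the cost of fewer weight updates per epoch.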

In [10]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)



## Train

Now we can do the actual training with SFT.

In [11]:
import time

start_time = time.time()

# Train model
trainer.train()

end_time = time.time()
execution_time = end_time - start_time
print("Execution time: ", execution_time, "seconds")

# Save trained model
trainer.model.save_pretrained(new_model)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss |
|------|---------------|
| 25 | 2.2513 |
| 50 | 1.0907 |
| 75 | 1.4452 |
| 100 | 0.9773 |
| 125 | 1.2893 |
| 150 | 0.9938 |
| 175 | 1.2913 |
| 200 | 1.0229 |
| 225 | 1.2476 |
| 250 | 1.0391 |




Execution time:  758.4831464290619 seconds


## Inference with new model

Once training is done, we can test the new model's input/output. Here I ask a question taken from the original dataset and, unsurprisingly, the model returns the correct answer. The problem is that the model repeats itself and keeps generating text after the answer. I have not found a fix yet; please raise an issue on GitHub if you know one.

In [16]:
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "the company has no plans to move all production to Russia"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"### Instruction: What is the sentiment of this tweet? Please choose an answer from {{negative/neutral/positive}} ### Human: {prompt} ### Assistant:")
print(result[0]['generated_text'])

OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 10.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 16.11 GiB is allocated by PyTorch, and 129.54 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [15]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is a large language model? [/INST]  A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate language outputs that are coherent and natural-sounding. everybody is talking about the next big thing in AI: large language models. Large language models are neural networks that are trained on vast amounts of text data to generate language outputs that are coherent and natural-sounding. These models are capable of generating text that is often indistinguishable from human-written text, and they have a wide range of applications, from chatbots and language translation to content generation and text summarization. In this article, we will explore the current state of large language models, their applications, and the challenges and limitations of these models. What is a large language model? A large language model is a type of neural network that is trained on a large dataset of text
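
Until the repetition itself is resolved, one possible workaround is to post-process the generated string and keep only the first answer. The sketch below is an assumption rather than a tested fix: it only trims the printed text and does not stop generation early. The marker strings come from the prompt format used in this example, and the function name is hypothetical.

```python
def extract_answer(generated_text: str,
                   answer_marker: str = "### Assistant:",
                   stop_marker: str = "###") -> str:
    """Keep the text after the first answer marker, up to the next section marker."""
    head, found, tail = generated_text.partition(answer_marker)
    # If the answer marker is absent, fall back to the whole text.
    answer = tail if found else head
    # Cut everything from the next marker on (the repeated continuation).
    answer = answer.split(stop_marker, 1)[0]
    return answer.strip()


sample = (
    "### Instruction: What is the sentiment of this tweet? "
    "### Human: the company has no plans to move all production to Russia "
    "### Assistant: neutral. ### Human: more generated text"
)
print(extract_answer(sample))
```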


## Store the trained model

How can we store our new llama-2-7b-test-finance model now? We need to merge the LoRA weights with the base model. Unfortunately, as far as I know, there is no straightforward way to do it: we need to reload the base model in FP16 precision and use the peft library to merge everything. Alas, this also causes a VRAM problem (even after emptying it), so I recommend restarting the notebook, re-executing the first three cells, and then executing the next one.

In [15]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 10.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 16.11 GiB is allocated by PyTorch, and 128.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
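
Conceptually, merging folds each adapter into the frozen weight it modifies: the merged weight is `W + (lora_alpha / r) * B @ A`, where `B` and `A` are the trained low-rank matrices. The pure-Python sketch below shows this on a tiny made-up example. It illustrates the arithmetic only and is not how `peft` implements `merge_and_unload`; the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_b, lora_a, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Tiny made-up example: a 2x2 weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]      # shape d_out x r
A = [[3.0, 4.0]]        # shape r x d_in
print(merge_lora(W, B, A, alpha=2, r=1))
```

After merging, the model has the same shape as the base model and no longer needs the adapter at inference time, which is why the merge has to be done against full-precision (here FP16) weights.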

## Push to HF

Our weights are merged and we reloaded the tokenizer. We can now push everything to the Hugging Face Hub to save our model.

In [None]:
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

## Acknowledgements

This example is greatly inspired by Maxime Labonne @mlabonne and his great blog post https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html 