# Fine Tuning LLaMa v2 with QLoRA

This [documentation](https://rentry.org/llm-training) and [blog post](https://towardsdatascience.com/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32) were extremely helpful for the creation of this notebook. Most of the code is adapted from the blog post linked above, but I try to add more explanation in the hope of enhancing my own understanding. I modified the dataset curation process as well.

## Installing Libraries

The following libraries are installed:
- [Accelerate](https://huggingface.co/docs/accelerate/index): This enables PyTorch code to run across different hardware configurations. Using Accelerate, we can dynamically configure how models consume different resources (GPU, CPU, etc).
- [PEFT](https://huggingface.co/docs/peft/index): Library that implements various parameter efficient fine-tuning methods for language models (LoRA, Prefix Tuning, Prompt Tuning, etc).
- [bitsandbytes](https://pypi.org/project/bitsandbytes/): In combination with the transformers library, bitsandbytes enables models to be loaded in 8-bit precision (and more recently 4-bit!), which is critical for reducing memory usage when fine-tuning LLMs.
- [Transformers](https://huggingface.co/docs/transformers/index): Provides the basis for all tools and models related to the [transformer architecture](https://arxiv.org/abs/1706.03762) on Hugging Face.
- [TRL](https://pypi.org/project/trl/): the Transformer Reinforcement Learning Library provides a set of tools for the various steps of RLHF. In particular, this tutorial uses SFTTrainer to implement supervised fine-tuning for LLaMa v2.




In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes transformers==4.31.0 trl==0.4.7

Let's log in to the Hugging Face CLI too.

In [None]:
!huggingface-cli login

While this next block is not necessary to fine-tune the model, I think it is important to get in the habit of tracking and reporting carbon dioxide emissions on model cards 🌱

We will use the library later to track emissions from our fine tuning step.

In [None]:
!pip install codecarbon

## Set up packages

In [4]:
import os
import torch

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

from datasets import load_dataset
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

#### Model and Dataset

First, the model and dataset names are defined. These will be used later when downloading a pretrained model, its tokenizer, and the dataset from Hugging Face using `AutoModelForCausalLM`, `AutoTokenizer`, and `load_dataset`. The name of the new fine-tuned model is specified here as well.

In [5]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

## Dataset Configuration ✨

This is where the magic happens to create different specialized versions of LLMs. If the dataset is not already preprocessed, it must be processed here into the prompt format that LLaMa v2 expects (e.g., `<s>[INST] {prompt} [/INST]`).

In [None]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")
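
As a rough illustration of the preprocessing step mentioned above, the sketch below wraps a prompt/response pair in the LLaMa v2 chat template. The helper name `format_llama2_example`, the raw column names (`instruction`, `output`), and the placement of the response followed by `</s>` are assumptions for illustration; only the `<s>[INST] {prompt} [/INST]` prefix comes from the text above.

```python
def format_llama2_example(prompt: str, response: str) -> str:
    """Wrap one prompt/response pair in the Llama 2 chat template."""
    return f"<s>[INST] {prompt.strip()} [/INST] {response.strip()} </s>"


# Hypothetical raw rows; the column names are made up for illustration.
raw_rows = [
    {
        "instruction": "What is QLoRA?",
        "output": "A memory-efficient fine-tuning method.",
    },
]

# The trainer later in this notebook reads a single "text" column.
formatted_rows = [
    {"text": format_llama2_example(row["instruction"], row["output"])}
    for row in raw_rows
]

print(formatted_rows[0]["text"])
```

With a Hugging Face dataset, the same function could be applied row by row through `dataset.map(...)` instead of a list comprehension.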

## Bitsandbytes Configuration

The section below sets up the arguments for the `BitsAndBytesConfig` instance. Using the 4-bit configuration, compatible Hugging Face models can be quantized to 4 bits as they are loaded. This significantly reduces the memory footprint of LLaMa v2!

In particular, the "nf4" (Normal Float 4) datatype is a 4-bit datatype that has been designed for values sampled from a normal distribution. This suits the weights of pretrained transformers, which are approximately normally distributed. If you are curious about what exactly is going on under the hood, refer to this [blog](https://huggingface.co/blog/hf-bitsandbytes-integration) and this [post](https://huggingface.co/blog/4bit-transformers-bitsandbytes). The Hugging Face team gives a great overview of quantization and walks through the optimization steps for 8-bit matrix multiplication in transformers.
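
To make the idea of 4-bit quantization more concrete, here is a toy sketch of block-wise absmax quantization in plain Python. It uses 16 evenly spaced levels, which is a simplification: the real "nf4" datatype uses a different, non-uniform set of levels, and bitsandbytes implements this in optimized GPU kernels. The function names and example weights are hypothetical.

```python
def quantize_block(values, num_levels=16):
    """Map floats to integer codes 0..num_levels-1 using absmax scaling."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    step = 2.0 / (num_levels - 1)                # spacing of levels in [-1, 1]
    codes = [round((v / absmax + 1.0) / step) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, num_levels=16):
    """Recover approximate floats from the codes and the block's scale."""
    step = 2.0 / (num_levels - 1)
    return [(c * step - 1.0) * absmax for c in codes]


# Hypothetical block of weights (uniform levels, NOT the real nf4 code book)
weights = [0.12, -0.50, 0.33, 0.90, -0.07, 0.0]
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max round-trip error: {max_error:.4f}")
```

Each weight is stored as a 4-bit code plus one shared scale (`absmax`) per block, which is where the memory savings come from.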

Let's break down what the `BitsAndBytesConfig` arguments do.

For documentation, please refer [here](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig). We can use `BitsAndBytesConfig` to configure how we want to quantize the model. In the code block below, the arguments specified accomplish the following:
- `load_in_4bit` specifies that we want to enable 4-bit quantization of the model.
- `bnb_4bit_quant_type` sets the quantization data type in the `bnb.nn.Linear4bit` layers to either "nf4" or "fp4".
- `bnb_4bit_compute_dtype` sets the computation type the model will be working with, which can be different from the input type. Using "float16" can add additional speed-ups to the fine-tuning process.
- `bnb_4bit_use_double_quant` enables nested quantization, where the quantization constants from the first quantization are quantized again.

Once the config object is all set up, we can pass it as an argument to `AutoModelForCausalLM.from_pretrained()`. Then, when we load LLaMa v2 from Hugging Face, 4-bit quantization is employed to compress the pretrained model!
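
A back-of-the-envelope calculation shows why this matters. The sketch below estimates the memory needed just for the weights of a nominal 7-billion-parameter model. The parameter count and the block sizes (64 weights per block, 256 constants per second-level block, as described in the QLoRA paper) are assumptions, and real usage is higher once activations, LoRA weights, and optimizer state are included.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9         # nominal size of a "7B" model (assumption)
BLOCK_SIZE = 64          # weights sharing one scaling constant (assumption)
SECOND_BLOCK_SIZE = 256  # constants per second-level block (assumption)

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)

# One 32-bit scaling constant per block adds 32/64 = 0.5 bits per weight.
nf4_gb = weight_memory_gb(NUM_PARAMS, 4 + 32 / BLOCK_SIZE)

# Double quantization stores the constants in 8 bits, plus one 32-bit
# constant for every SECOND_BLOCK_SIZE of them.
nf4_double_quant_gb = weight_memory_gb(
    NUM_PARAMS, 4 + 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)
)

print(f"fp16:               {fp16_gb:.2f} GB")
print(f"4-bit:              {nf4_gb:.2f} GB")
print(f"4-bit, double quant: {nf4_double_quant_gb:.2f} GB")
```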

In [None]:
# Compute dtype used for matrix multiplications on the 4-bit weights
compute_dtype = torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

# Check GPU compatibility with bfloat16 (compute capability 8.0 or higher)
if compute_dtype == torch.float16 and torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

## Load the Base Model and Tokenizer

As mentioned in the section above, we are using `AutoModelForCausalLM` to download a pretrained model from the Hugging Face Hub. You can utilize `Auto Classes` to automatically load the correct model architecture based on the name or path supplied. In this code, `AutoModelForCausalLM` is used to load a model with a causal language modeling head.

For more documentation, refer [here](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM).

In order to fine-tune the pretrained model, we need its tokenizer. Using `AutoTokenizer`, this can be accomplished in a single line of code. The additional lines set the padding token and padding side so that the tokenizer is compatible with "fp16" training.

In [8]:
# Load the entire model on the GPU 0
# We can use accelerate to infer a device map if we want to shard the model
# between the GPU and CPU
device_map = {"": 0}

In [None]:
# Load base model using bitsandbytes config created above
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
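
To illustrate what the two padding settings above control, here is a small pure-Python sketch of right padding. The token ids are hypothetical, and using `2` as the end-of-sequence id is an assumption about the LLaMa v2 vocabulary; the real tokenizer does this internally when batching.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that should be ignored
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


EOS_ID = 2  # assumed end-of-sequence id, reused as the padding token
batch = [[1, 15043, 3186], [1, 15043]]  # hypothetical token ids

input_ids, attention_mask = pad_right(batch, pad_id=EOS_ID)
print(input_ids)
print(attention_mask)
```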

We have successfully loaded in a compressed version of LLaMa v2 with 4-bit quantization and its tokenizer!

## LoRA Configuration

We are almost ready to initialize our `SFTTrainer` to finetune LLaMa v2. So far, we have our quantized 4-bit LLM and its tokenizer. The final step involves setting up a `LoraConfig` instance that enables us to use LoRA to drastically reduce the number of trainable parameters in our LLM.

At its core, LoRA relies on the observation that weight updates for the large matrices in a neural network can be well approximated by low-rank matrices. Aghajanyan et al. (2020) demonstrated that over-parametrized pretrained models have a low intrinsic dimension, meaning they can be fine-tuned effectively within a much smaller parameter subspace.

LoRA is a fine-tuning method for large language models. It is not an algorithm that replaces the actual training process. In order to be effective, LoRA requires that you have a set of pre-trained weights. If these weights are garbage, then your fine-tuned model using LoRA will have poor performance as well.

LoRA is predicated on the assumption that the weight updates needed to adapt a model to a specific task reside in a low-dimensional space relative to the full weight matrix. Based on this assumption, LoRA stores weight updates as low-rank matrix modules. During fine-tuning, only these modules (with far fewer trainable parameters) are updated through backpropagation. When training is done, they are added to the frozen weights of the base model (in this case LLaMa v2).
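
The low-rank update can be written as `W' = W + (alpha / r) * B @ A`, where `W` is the frozen weight matrix and `B`, `A` are the small trainable matrices. The toy sketch below, in plain Python with made-up numbers, checks that running the adapter alongside the frozen weights gives the same output as merging the update into `W` first. It is only an illustration of the arithmetic, not how PEFT implements it.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


d, k, r, alpha = 4, 4, 1, 2   # toy sizes: W is d x k, rank r
scale = alpha / r

W = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]  # frozen weights
B = [[1], [0], [-1], [2]]      # d x r, trainable
A = [[0.5, 0, 1, -1]]          # r x k, trainable
x = [[1], [2], [3], [4]]       # input column vector

# During training: frozen path plus scaled adapter path
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
y_adapter = [[b[0] + scale * l[0]] for b, l in zip(base_out, lora_out)]

# After training: fold the update into the weights, then apply once
delta = matmul(B, A)
W_merged = [
    [w + scale * dw for w, dw in zip(w_row, d_row)]
    for w_row, d_row in zip(W, delta)
]
y_merged = matmul(W_merged, x)

print("trainable:", r * (d + k), "vs full matrix:", d * k)
print(y_adapter, y_merged)
```

Even in this tiny case the adapter has half as many parameters as the full matrix, and the gap grows quickly as the matrices get larger.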

Below the arguments for the LoRA configuration are set. These will be used to create a `LoraConfig` instance for the "peft_config" argument for the `SFTTrainer`. For more information on what each argument is responsible for, please refer [here](https://huggingface.co/docs/peft/conceptual_guides/lora).

Let's briefly break down the arguments:
- `lora_alpha`: the LoRA scaling factor.
- `lora_dropout`: the dropout probability of the LoRA layers.
- `r`: the rank (inner dimension) of the LoRA modules that are added on top of the frozen weight matrices.
- `bias`: Specifies if the bias parameters should be trained. `"none"` means none of the bias parameters will be trained.
- `task_type`: the type of task the base model is responsible for.
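
To get a feel for these values, the sketch below estimates how many parameters LoRA trains with `r=64`. It assumes a hidden size of 4096, 32 transformer layers, and adapters attached only to the query and value projections of each attention block; these figures are assumptions about LLaMa v2 7B and PEFT's defaults, so treat the result as an estimate.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA module: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


HIDDEN_SIZE = 4096     # assumed hidden size of LLaMa v2 7B
NUM_LAYERS = 32        # assumed number of transformer layers
TARGETS_PER_LAYER = 2  # assumes adapters on q_proj and v_proj only

r, lora_alpha = 64, 16

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, r)
total_trainable = NUM_LAYERS * TARGETS_PER_LAYER * per_module
full_matrix = HIDDEN_SIZE * HIDDEN_SIZE
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"per module: {per_module:,} (full matrix: {full_matrix:,})")
print(f"total trainable: {total_trainable:,}")
print(f"scaling factor: {scaling}")
```

Under these assumptions, each adapted matrix trains about 3% as many parameters as the full matrix would.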

In [10]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

## SFT Trainer Configuration

🚨 Do not forget to adjust the `bf16` value for your hardware: set it to `True` only on GPUs that support bfloat16 (e.g., an A100 on Colab Pro), and to `False` otherwise!

Here is what each argument specifies:
- `output_dir`: Specifies the directory where the model predictions and checkpoints will be saved. E.g., "./results" will create or use the "results" folder in the current directory.

- `num_train_epochs`: The total number of training epochs (complete passes through the training data). For example, a value of 1 means the model will be trained on the entire dataset once.

- `per_device_train_batch_size`: Defines the batch size per GPU (or other hardware like TPU or CPU) for training. The total batch size will be this value multiplied by the number of devices used.

- `gradient_accumulation_steps`: The number of batches whose gradients are accumulated before a single optimizer update is performed. This emulates a larger batch size while holding only one small batch in GPU memory at a time.

- `optim`: The optimizer to be used during the training. Here, it is set to "paged_adamw_32bit", indicating a specific variant of the AdamW optimizer.

- `save_steps`: Determines the number of update steps between two checkpoint saves. For instance, setting it to 25 means the model will be saved every 25 update steps.

- `logging_steps`: Specifies how often to log training information such as loss, accuracy, etc. A value of 25 means logging every 25 update steps.

- `learning_rate`: The initial learning rate for the AdamW optimizer. This controls how quickly or slowly the model learns from the data.

- `weight_decay`: This represents the rate of decay applied to the weights during training, helping prevent overfitting by penalizing large weights.

- `fp16`: A boolean indicating whether to use 16-bit (mixed) precision training (FP16) instead of 32-bit training. It can make training faster, with a slight trade-off in precision.

- `bf16`: Similar to fp16 but uses brain floating-point 16-bit format (bf16). It can be set to True when training on specific hardware like NVIDIA's A100.

- `max_grad_norm`: This parameter sets the maximum gradient norm for gradient clipping, preventing excessively large updates that can destabilize training.

- `max_steps`: If set to a positive number, this defines the total number of training steps to perform, overriding num_train_epochs. A value of -1 means this is ignored.

- `warmup_ratio`: Determines the portion of total training steps used for a linear warmup of the learning rate from 0 to its initial value. It can help the model start learning more gently.

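- `group_by_length`: When set to True, sequences of similar length are grouped together in batches. This can improve efficiency by minimizing the padding required.

- `lr_scheduler_type`: Determines the type of learning rate scheduler to be used. In this case, it's set to "constant", meaning the learning rate does not change during training. Note that the plain "constant" schedule does not include a warmup phase, so `warmup_ratio` only takes effect with schedules that support warmup, such as "constant_with_warmup".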

- `report_to`: Defines where the training logs should be reported. Here, it's set to "tensorboard", so logs will be sent to TensorBoard for visualization.

The `TrainingArguments` documentation can be found [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
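
Gradient accumulation is the least intuitive of these settings, so here is a minimal sketch of why it works. For a toy one-parameter model with a mean squared error loss, accumulating the scaled gradients of two small batches gives the same gradient as one large batch. The model, data, and function name are made up for illustration; this notebook itself keeps `gradient_accumulation_steps=1`.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# One large batch of 4 examples
full_batch_grad = mse_grad(w, xs, ys)

# Two micro-batches of 2 examples, accumulated before a single update
accumulation_steps = 2
micro_batch_size = 2
accumulated_grad = 0.0
for start in range(0, len(xs), micro_batch_size):
    micro_x = xs[start:start + micro_batch_size]
    micro_y = ys[start:start + micro_batch_size]
    # Each micro-batch contribution is divided by the number of steps
    accumulated_grad += mse_grad(w, micro_x, micro_y) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```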

In [11]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=True, # (set bf16 to True with an A100)
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
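
As a sanity check on these settings, the sketch below estimates how many update steps and checkpoints one epoch produces. It assumes the dataset holds 1,000 examples (as its name suggests) and a single GPU; the helper name is hypothetical, and the step count reported by the trainer may differ slightly because of how it rounds.

```python
import math


def schedule_summary(num_examples, per_device_batch_size, grad_accum_steps,
                     num_epochs, save_steps, num_devices=1):
    """Rough estimate of update steps and checkpoints for a training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch_size": effective_batch,
        "total_update_steps": total_steps,
        "checkpoints_saved": total_steps // save_steps,
    }


summary = schedule_summary(
    num_examples=1000,        # assumed dataset size
    per_device_batch_size=4,
    grad_accum_steps=1,
    num_epochs=1,
    save_steps=25,
)
print(summary)
```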

With the training arguments set up, we can now create a supervised fine-tuning trainer for LLaMa v2!

If you are curious, you can find the Supervised Fine-tuning Trainer documentation [here](https://huggingface.co/docs/trl/main/en/sft_trainer).

In [12]:
# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
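
Packing is disabled here, but the idea is easy to sketch: tokenized examples are concatenated into one stream, separated by an end-of-sequence token, and the stream is cut into fixed-length chunks so that no space is wasted on padding. The code below is a simplified illustration in plain Python with hypothetical token ids, not TRL's actual implementation; dropping the leftover tokens at the end is a choice made for this sketch.

```python
def pack_sequences(tokenized_examples, seq_length, eos_id):
    """Concatenate examples (separated by EOS) and cut into equal chunks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_id])
    num_full_chunks = len(stream) // seq_length  # leftover tokens are dropped
    return [
        stream[i * seq_length:(i + 1) * seq_length]
        for i in range(num_full_chunks)
    ]


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # hypothetical token ids
packed = pack_sequences(examples, seq_length=4, eos_id=2)
print(packed)
```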

In [None]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config, # Here's where the LoRA config is used
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments, # Here's where the training_arguments are used
    packing=packing,
)
# Note since the quantized version of the model was loaded above, in order to
# use the QLoRA algorithm all that is needed is to specify it in the peft_config
# argument.

## Fine-tune and Save Trained LoRA Modules

With the `SFTTrainer` all set up, we are ready to begin finetuning!

Here I wrap the trainer in a simple helper function so we are able to track carbon emissions! The results will be written to an emissions.csv file in the current directory. You can change the output directory location by specifying it as an argument. [Here](https://mlco2.github.io/codecarbon/installation.html) is the documentation.

In [14]:
from codecarbon import track_emissions

@track_emissions()
def trainer_with_carbon_tracker(trainer):
  # Train model
  trainer.train()

  # Save trained model
  trainer.model.save_pretrained(new_model)

In [None]:
# Train and save model with emissions tracker enabled
trainer_with_carbon_tracker(trainer)

## Merge LoRA Modules to Base Model and Save

Once training is done, we reload the base model in FP16 and merge the LoRA modules into its weights. With this final step complete, the fine-tuned model is ready to be uploaded to the Hugging Face Hub 🤗

In [None]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)