# Fine-Tuning LLMs with PEFT and LoRA

## Problems with Fine-Tuning LLMs

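Full fine-tuning updates every weight in the model, so we end up with a full-size set of new weights:

1. Training needs a lot more computational power

2. File sizes are big, because every fine-tuned checkpoint is a complete copy of the model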
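
To make the file-size problem concrete, here is a rough back-of-the-envelope sketch. It uses the parameter counts that this notebook prints later for bloom-7b1 with LoRA adapters. It assumes each weight is simply stored at the stated width and ignores optimizer state, activations and file-format overhead, so treat the numbers as estimates only.

```python
# Rough storage estimate: number of parameters x bytes per parameter.
# Parameter counts come from the print_trainable_parameters output later in this notebook.
ALL_PARAMS = 7_076_880_384      # base model + LoRA adapters
ADAPTER_PARAMS = 7_864_320      # trainable LoRA parameters only
BASE_PARAMS = ALL_PARAMS - ADAPTER_PARAMS

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def size_gb(num_params, dtype):
    """Size in gigabytes (10^9 bytes) if every parameter is stored as `dtype`."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"base model in {dtype}: {size_gb(BASE_PARAMS, dtype):.1f} GB")
print(f"LoRA adapters in fp32: {size_gb(ADAPTER_PARAMS, 'fp32') * 1000:.1f} MB")
```

Under these assumptions a full fine-tuned copy is measured in gigabytes, while the adapter weights are measured in megabytes.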

## Parameter-Efficient Fine-Tuning (PEFT)

One PEFT technique is LoRA (Low-Rank Adaptation).

Big idea: fine-tune only a small number of extra weights in the model while freezing most of the parameters of the pre-trained network.

Advantages:
- We still have the original weights, which helps to prevent catastrophic forgetting (models forget what they were originally trained on if fine-tuned too much). PEFT doesn't face this problem because it adds extra/add-on weights and tunes those while the original ones stay frozen.
- Small size: only the add-on weights need to be saved.

LoRA Fine-tuning

LoRA trains adapters: small sets of extra weights added at various points in the model. We fine-tune only those to get the results we want.
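
As a minimal sketch of what such an adapter computes, the toy example below adds a scaled low-rank update `B (A x)` to a frozen projection `W x`. The sizes, the helper names (`matvec`, `lora_forward`) and the random values are made up for illustration; `r` and `lora_alpha` mirror the `LoraConfig` arguments used later in this notebook. It is plain Python, not the PEFT implementation.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base projection plus a scaled low-rank update: W x + (lora_alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, lora_alpha = 8, 6, 2, 4   # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen pre-trained weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, small random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapter contributes nothing, so the output equals the frozen model's output.
assert lora_forward(W, A, B, x, lora_alpha, r) == matvec(W, x)

full_params = d_out * d_in
adapter_params = r * (d_in + d_out)
print(f"full matrix: {full_params} weights, adapter: {adapter_params} weights")
```

Starting with `B` at zero follows the usual LoRA convention, so training begins from the behaviour of the original model. The saving grows with layer size: the adapter needs `r * (d_in + d_out)` weights instead of `d_in * d_out`.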

In [1]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is inc

In [2]:
from google.colab import userdata
sec_key=userdata.get("HF_TOKEN")
import os
os.environ["HF_TOKEN"]=sec_key

## Set up the model

In [3]:
"""
https://huggingface.co/bigscience/bloom-7b1

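load_in_8bit: Load the model weights in 8-bit precision
- reduces the memory needed to store the model weights (roughly half of fp16)
- does not by itself guarantee faster inference: the main benefit is memory, and quantized matrix multiplications can be slower than fp16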
"""

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb  # handles the 8-bit quantization so the model takes less GPU memory
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",
    load_in_8bit=True,  # 8-bit conversion via bitsandbytes; deprecated in favour of quantization_config=BitsAndBytesConfig(load_in_8bit=True)
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
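
To give a feel for what storing weights in 8 bits means, here is a simplified sketch of absmax quantization: scale the values so the largest magnitude maps to 127, round to integers, and divide by the same scale to recover approximate floats. The function names and the example values are made up. The scheme bitsandbytes actually uses is more elaborate (for example, it scales vector-wise and treats outlier features separately), so this shows only the basic idea.

```python
def quantize_absmax(values):
    """Map floats to the int8 range so that the largest magnitude lands on +/-127."""
    max_abs = max(abs(v) for v in values)
    scale = 127 / max_abs if max_abs > 0 else 1.0
    quantized = [round(v * scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q / scale for q in quantized]

weights = [0.12, -0.5, 0.031, 0.9, -0.27]   # made-up example values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)
for w, w_hat in zip(weights, restored):
    print(f"{w:+.3f} -> {w_hat:+.3f} (error {abs(w - w_hat):.4f})")
```

Each value now fits in one byte instead of two (fp16) or four (fp32), at the price of a rounding error of at most half a quantization step.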



## Freeze the original weights

In [4]:
# Iterates all the parameters/weights of the model
for param in model.parameters():
  # Model weight will not be updated during training. Gradients will not be computed for them.
  param.requires_grad = False  # freeze the model - train adapters later
  # Check if the parameter is a 1D tensor. Cast to float32 to maintain numerical stability
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

"""
Gradient checkpointing
- https://huggingface.co/docs/transformers/v4.18.0/en/performance
- https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9
- Instead of storing every activation from the forward pass, it keeps only strategically selected ones
  and recomputes the rest during the backward pass. This reduces the memory footprint but adds some computational cost.
- This does not contradict param.requires_grad = False. Freezing decides which parameters get updated;
  gradient checkpointing decides which activations are kept in memory for backpropagation.
- The bulk of the model (the pre-trained transformer layers) is frozen because we only want to fine-tune
  a small add-on (the LoRA adapters added below). That saves computational resources and reduces the risk of overfitting.
- The adapters still need gradients, and those gradients are backpropagated through the frozen layers,
  so activations are still needed during the backward pass. Checkpointing keeps that memory usage low.

enable_input_require_grads()
- https://huggingface.co/docs/transformers/en/main_classes/model
- Enables the gradients for the input embeddings. This is useful for fine-tuning adapter weights while keeping the model weights fixed.
- It marks the output of the input embedding layer as requiring gradients during the forward pass
  (requires_grad=True on the embedding output). This is a special case, because inputs normally do not require gradients.
- It does not unfreeze the embedding matrix: the embedding weights were frozen in the loop above and are not updated during training.
- With every base weight frozen, nothing at the start of the network would otherwise require gradients.
  Marking the embedding output keeps the backward pass connected through the checkpointed blocks, so gradients reach the adapter weights.
"""
model.gradient_checkpointing_enable()  # reduce number of stored activations, some activations are recomputed during the backwards pass
model.enable_input_require_grads()

"""
Cast the output of the lm_head (the language modeling head) to FP32

Avoid issues with numerical instability during the final stages of training or inference, particularly in models that are trained or fine-tuned in lower precision

Extend/override the behaviour of the lm_head/last layer of the model
"""
class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)
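
The memory-versus-compute trade-off of gradient checkpointing can be illustrated with a simple counting model. The sketch below assumes a plain chain of layers with one activation per layer, keeps only the input of each segment during the forward pass, and re-runs one segment at a time during the backward pass. The function name and the layer count are made up, and this is not how `gradient_checkpointing_enable()` is implemented; it only shows why storing fewer activations costs extra forward work.

```python
import math

def checkpointing_cost(num_layers, segment_size):
    """Count stored activations and recomputed layer forwards for a chain of layers.

    Assumed scheme: keep only the input of each segment during the forward pass,
    then re-run one segment at a time during the backward pass.
    """
    num_segments = math.ceil(num_layers / segment_size)
    # Segment inputs are kept for the whole step, plus the activations of the
    # single segment that is currently being recomputed.
    peak_stored = num_segments + segment_size
    # Every layer is run one extra time when its segment is recomputed.
    recomputed_forwards = num_layers
    return peak_stored, recomputed_forwards

num_layers = 30   # assumption: an example depth, used only for illustration
print(f"no checkpointing: stored = {num_layers}, recomputed = 0")
for segment_size in (2, 5, 6, 15):
    stored, recomputed = checkpointing_cost(num_layers, segment_size)
    print(f"segment_size={segment_size}: stored = {stored}, recomputed = {recomputed}")
```

In this model the stored count is smallest when the segment size is close to the square root of the number of layers, and the price is roughly one extra forward pass.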

## Set up LoRA

In [5]:
# Helper Function
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [6]:
from peft import LoraConfig, get_peft_model

"""
task type

Is it a Causal Language Model i.e. decoder only model
or is it a Seq-to-Seq model

Each PEFT method is defined by its PEFT configuration class, which stores the configuration of the PEFT model
- For example to train with LoRA we can load the LoraConfig class

We can then create the PeftModel with the get_peft_model() function
- The get_peft_model() function takes a base model and the peft config containing the parameters for how to configure the model for training with the specified peft method
"""
config = LoraConfig(
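    r=16, # rank of the low-rank update matrices (not the number of attention heads)
    lora_alpha=32, # alpha scaling: the LoRA update is multiplied by lora_alpha / r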
    # target_modules=["q_proj", "v_proj"], #if you know the
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM" # set this for CLM or Seq2Seq
)

# Pass in model and get the peft model: have the original model and the LoRA adapters on the model
"""
We have a lot of parameters but the number of trainable parameters is tiny
"""
model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 7864320 || all params: 7076880384 || trainable%: 0.11112693126452029
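
The trainable count can be reproduced by hand. The sketch below rests on assumptions that this notebook does not state: that bloom-7b1 has a hidden size of 4096 and 30 transformer blocks, and that LoRA is attached to the fused query/key/value projection of each block (a linear layer from `hidden_size` to `3 * hidden_size`). The result matching the printed number is consistent with those assumptions, but it is not a confirmation of how PEFT chose the target modules.

```python
# Assumed architecture values for bloom-7b1 (not stated in this notebook).
hidden_size = 4096
num_layers = 30

# Assumption: one adapted linear layer per block, mapping hidden_size -> 3 * hidden_size.
in_features = hidden_size
out_features = 3 * hidden_size
r = 16   # LoRA rank from the LoraConfig above

# Each adapted layer gets two small matrices: A (r x in_features) and B (out_features x r).
params_per_layer = r * in_features + out_features * r
trainable_params = params_per_layer * num_layers

all_params = 7_076_880_384   # "all params" from the printed output
print(trainable_params)      # 7864320
print(100 * trainable_params / all_params)
```

Doubling `r` would double the number of trainable parameters, since both adapter matrices scale linearly with the rank.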


<br/>
<br/>
<br/>

## Get the Data

In [7]:
"""
Dataset of English quotes

Goal: a model that takes your own quote as input and
generates tags for that quote
"""

import transformers
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")


In [9]:
def merge_columns(example):
    # " ->: " is a distinctive separator: it teaches the model to condition on the quote before it and to generate the tags after it
    example["prediction"] = example["quote"] + " ->: " + str(example["tags"])
    return example

# map: https://huggingface.co/docs/datasets/en/process
#
data['train'] = data['train'].map(merge_columns)
data['train']["prediction"][:5]


["“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']",
 "“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.” ->: ['best', 'life', 'love', 'mistakes', 'out-of-control', 'truth', 'worst']",
 "“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.” ->: ['human-nature', 'humor', 'infinity', 'philosophy', 'science', 'stupidity', 'universe']",
 "“So many books, so little time.” ->: ['books', 'humor']",
 "“A room without books is like a body without a soul.” ->: ['books', 'simile', 'soul']"]

In [10]:
data['train'][0]

{'quote': '“Be yourself; everyone else is already taken.”',
 'author': 'Oscar Wilde',
 'tags': ['be-yourself',
  'gilbert-perreira',
  'honesty',
  'inspirational',
  'misattributed-oscar-wilde',
  'quote-investigator'],
 'prediction': "“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']"}
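
Because every training example has the shape `<quote> ->: [<tags>]`, the same marker can be used at inference time to build a prompt and to read the tags back out of the generated text. The helpers below (`build_prompt`, `parse_tags`) are hypothetical and are not part of the original notebook; they assume the model reproduces the Python-list format seen in the `prediction` column.

```python
import ast

SEPARATOR = " ->: "   # same marker used by merge_columns

def build_prompt(quote):
    """Format a new quote the way training examples start, leaving the tags for the model."""
    return quote + SEPARATOR

def parse_tags(generated_text):
    """Extract the tag list from text shaped like '<quote> ->: [...]'; return [] if that fails."""
    _, sep, tail = generated_text.partition(SEPARATOR)
    if not sep:
        return []
    end = tail.find("]")
    if end == -1:
        return []   # the list was never closed, e.g. generation was cut off
    try:
        tags = ast.literal_eval(tail[: end + 1].strip())
    except (ValueError, SyntaxError):
        return []
    if not isinstance(tags, list):
        return []
    return [t for t in tags if isinstance(t, str)]

example = "“So many books, so little time.” ->: ['books', 'humor']"
print(build_prompt("“So many books, so little time.”"))
print(parse_tags(example))   # ['books', 'humor']
```

Using `ast.literal_eval` rather than `eval` keeps the parsing safe if the model generates something unexpected.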

In [11]:
# The model's tokenizer adds the input_ids and attention_mask columns
data = data.map(lambda samples: tokenizer(samples['prediction']), batched=True)


In [12]:
data

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags', 'prediction', 'input_ids', 'attention_mask'],
        num_rows: 2508
    })
})
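
The tokenized rows have different lengths, so they have to be padded before they can be batched. The training cell later passes `DataCollatorForLanguageModeling(tokenizer, mlm=False)` for this. The plain-Python sketch below approximates what such a causal-LM collator does: pad `input_ids` and `attention_mask`, and build `labels` as a copy of the inputs with padded positions set to `-100` so the loss skips them. The function name, the pad token id and the padding side are assumptions for illustration, and the real collator works on tensors and differs in details.

```python
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss ignores by default

def collate_causal_lm(examples, pad_token_id):
    """Pad a list of {'input_ids', 'attention_mask'} dicts and build labels for causal LM training."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        pad = max_len - len(ex["input_ids"])
        # Assumption: pad on the right; a real tokenizer may be configured to pad on the left.
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * pad)
        # Labels are a copy of the inputs; padded positions are excluded from the loss.
        batch["labels"].append(ex["input_ids"] + [IGNORE_INDEX] * pad)
    return batch

# Made-up token ids, only to show the shapes.
examples = [
    {"input_ids": [11, 12, 13, 14], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [21, 22], "attention_mask": [1, 1]},
]
batch = collate_causal_lm(examples, pad_token_id=3)
for key, rows in batch.items():
    print(key, rows)
```

The labels are not shifted here: for Hugging Face causal LM models, the shift by one position (predict token t+1 from tokens up to t) happens inside the model when it computes the loss.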

In [13]:
data['train'][0]

{'quote': '“Be yourself; everyone else is already taken.”',
 'author': 'Oscar Wilde',
 'tags': ['be-yourself',
  'gilbert-perreira',
  'honesty',
  'inspirational',
  'misattributed-oscar-wilde',
  'quote-investigator'],
 'prediction': "“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']",
 'input_ids': [1502,
  17143,
  33218,
  30,
  39839,
  4384,
  632,
  11226,
  15713,
  17,
  982,
  11953,
  29,
  24629,
  2765,
  17731,
  3240,
  15407,
  10,
  15,
  83077,
  354,
  26624,
  31683,
  71421,
  10,
  15,
  756,
  19218,
  56452,
  10,
  15,
  756,
  71538,
  3383,
  10,
  15,
  29412,
  290,
  96783,
  11914,
  43555,
  5231,
  16728,
  51464,
  10,
  15,
  756,
  67091,
  15595,
  51261,
  2623,
  3166],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  

<br/>
<br/>
<br/>

## Training

In [None]:
"""
Use the huggingface transformers trainer

https://huggingface.co/docs/transformers/en/main_classes/trainer
"""

trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=200,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # the KV cache is not compatible with gradient checkpointing; disabling it silences the warning. Please re-enable for inference!
trainer.train()
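
A quick way to sanity-check these training arguments is to work out how much data the run sees and how the learning rate evolves. The sketch below assumes a single GPU and a linear warmup followed by linear decay to zero, which is what the default `lr_scheduler_type` is expected to give; the helper name is made up and the schedule is a plain-Python approximation, not the Trainer's scheduler object.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
warmup_steps = 100
max_steps = 200
peak_lr = 2e-4
num_rows = 2508    # size of the train split shown earlier
num_devices = 1    # assumption: a single GPU, as CUDA_VISIBLE_DEVICES="0" suggests

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch * max_steps
print("effective batch size:", effective_batch)   # 16
print("examples seen:", examples_seen)            # 3200
print(f"approximate epochs: {examples_seen / num_rows:.2f}")

def linear_warmup_then_decay(step):
    """Assumed schedule: linear ramp from 0 to peak_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

for step in (0, 50, 100, 150, 200):
    print(step, linear_warmup_then_decay(step))
```

With these settings the warmup takes up half of the run, and the model sees a little more than one pass over the dataset.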