# Fine-tuning BLOOM 560M using LoRA (Low-Rank Adaptation)


### Developed by Manaranjan Pradhan

## Overview

This notebook demonstrates how to fine-tune the BLOOM 560M model efficiently using LoRA (Low-Rank Adaptation). The base model's weights stay frozen and only small low-rank adapter matrices are trained.

## Dependencies

- [bitsandbytes](https://github.com/facebookresearch/bitsandbytes)
- [datasets](https://github.com/huggingface/datasets)
- [accelerate](https://github.com/huggingface/accelerate)
- [peft](https://github.com/huggingface/peft)
- [torch](https://pytorch.org)
- [transformers](https://github.com/huggingface/transformers)


### Install Dependencies

In [36]:
# Install several Python packages required for machine learning and natural language processing tasks

# Install bitsandbytes: A library for quantization and efficient matrix multiplication
# Install datasets: Hugging Face's library for easily accessing and processing datasets
# Install accelerate: A library for easy use of distributed training on multiple GPUs/TPUs
# Install loralib: A library for Low-Rank Adaptation of Large Language Models
# Install torch: PyTorch, a popular deep learning framework
!pip install bitsandbytes datasets accelerate loralib torch

# Install the latest versions of two Hugging Face libraries directly from their GitHub repositories

# Install PEFT (Parameter-Efficient Fine-Tuning) library
# This library provides state-of-the-art techniques for efficient fine-tuning of large language models
!pip install git+https://github.com/huggingface/peft.git

# Install the latest version of the Transformers library
# Transformers provides state-of-the-art machine learning for natural language processing tasks
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/peft.git
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-t_g95bud
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-t_g95bud
  Resolved https://github.com/huggingface/peft.git to commit 363c14e673a12d19f951609d06221962d5c3eb2a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-49zrf3lm
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-49zrf3lm
  Resolved https://github.com/huggingface/transformers.git to commit 4a5a7b991a5e9adae78c52bea61fd6e135728622
  Installing build dependencies ... done
  Getting requirements

#### Confirm CUDA

In [37]:
import torch
torch.cuda.is_available()

True

In [38]:
!nvidia-smi

Wed Feb 12 17:01:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   73C    P0             32W /   70W |    7200MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

#### Load Base Model

In [39]:
# Import necessary libraries
import os
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

# Uncomment the following line to specify which GPU to use (0 in this case)
# os.environ["CUDA_VISIBLE_DEVICES"]="0"

# Load the pre-trained BLOOM model
# AutoModelForCausalLM automatically selects the appropriate model architecture
# 'device_map="auto"' allows the model to be automatically distributed across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map='auto',
)

# Load the tokenizer associated with the BLOOM model
# The tokenizer is responsible for converting text to tokens that the model can process
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")
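As a rough check that the model fits on the 15 GB Tesla T4 shown earlier, the weight memory can be estimated from the parameter count. This is a minimal sketch: the count is derived from the totals this notebook prints later in the LoRA section (560,001,024 parameters including 786,432 LoRA parameters). It also assumes the weights dominate memory and are loaded as float32. Activations, gradients and optimizer state are not included.

```python
# Base model parameter count, derived from the notebook's own output:
# total parameters reported after adding LoRA, minus the LoRA parameters.
base_params = 560_001_024 - 786_432

# Bytes per parameter for a few common dtypes
bytes_per_param = {"float32": 4, "float16": 2, "int8": 1}

weight_gib = {}
for dtype, nbytes in bytes_per_param.items():
    # Convert bytes to GiB (1 GiB = 1024**3 bytes)
    weight_gib[dtype] = base_params * nbytes / 1024**3
    print(f"{dtype}: {weight_gib[dtype]:.2f} GiB")
```

Under these assumptions the float32 weights take roughly 2 GiB, which leaves room on the GPU for training state.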

##### View Model Summary

In [40]:
print(model)

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1024)
    (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (

In [41]:
# Iterate through all parameters of the model
for param in model.parameters():
    # Freeze the model parameters
    # This prevents the original model weights from being updated during training
    param.requires_grad = False

    # Check if the parameter is 1-dimensional (e.g., bias terms or layernorm parameters)
    if param.ndim == 1:
        # Cast 1D parameters to float32 for improved numerical stability
        # This is particularly important for small parameters like those in layer normalization
        param.data = param.data.to(torch.float32)

# Enable gradient checkpointing
# This technique reduces memory usage by not storing all activations
# Instead, it recomputes them during the backward pass as needed
model.gradient_checkpointing_enable()

# Enable input gradients
# This is necessary for fine-tuning the model with adapters or other techniques
model.enable_input_require_grads()

# Define a custom layer to cast the output to float32
class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        # Call the parent class's forward method and cast the output to float32
        return super().forward(x).to(torch.float32)

# Replace the model's language modeling head with the custom layer
# This ensures that the final output is always in float32 precision
model.lm_head = CastOutputToFloat(model.lm_head)

#### Helper Function

In [42]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

#### Obtain LoRA Model

In [43]:
# Import necessary modules from the PEFT (Parameter-Efficient Fine-Tuning) library
from peft import LoraConfig, get_peft_model

# Configure LoRA (Low-Rank Adaptation) for efficient fine-tuning
config = LoraConfig(
    r=8,  # Rank of the update matrices. Lower rank results in fewer trainable parameters
    lora_alpha=16,  # Alpha parameter for LoRA scaling. Larger values -> larger updates
    target_modules=["query_key_value"],  # Which modules to apply LoRA to. Here, it's applied to attention layers
    lora_dropout=0.05,  # Dropout probability for LoRA layers
    bias="none",  # Whether to train biases. "none" means no bias training
    task_type="CAUSAL_LM"  # The type of task. Here, it's causal language modeling
)

# Apply the LoRA configuration to the model
# This wraps the original model with LoRA layers
model = get_peft_model(model, config)

# Print the number of trainable parameters in the model
# This uses the helper function defined above to show the ratio of trainable parameters
print_trainable_parameters(model)

trainable params: 786432 || all params: 560001024 || trainable%: 0.14043402892063284
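The trainable parameter count above can be reproduced by hand from the `LoraConfig` values and the layer shapes in the model summary. The sketch below assumes LoRA adds two matrices to each targeted `Linear` layer, A of shape `r x in_features` and B of shape `out_features x r`, and that only `query_key_value` in each of the 24 blocks is adapted.

```python
# Values read from the LoraConfig and the printed model summary
r = 8                # LoRA rank
lora_alpha = 16      # LoRA scaling numerator
in_features = 1024   # query_key_value input width
out_features = 3072  # query_key_value output width
num_layers = 24      # number of BloomBlock layers

# Each adapted layer gains A (r x in_features) and B (out_features x r)
params_per_layer = r * in_features + out_features * r
lora_params = params_per_layer * num_layers

# Total reported by print_trainable_parameters (LoRA weights included)
all_params = 560_001_024
trainable_pct = 100 * lora_params / all_params

# The low-rank update is scaled by lora_alpha / r before being added
scaling = lora_alpha / r

print(lora_params)    # 786432
print(trainable_pct)
print(scaling)        # 2.0
```

The result matches the 786,432 trainable parameters reported by the notebook, about 0.14% of the model.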


#### Load Dataset and take a sample of the data

In [44]:
import pandas as pd

In [45]:
qa_df = pd.read_parquet('train.parquet')

In [46]:
qa_df.sample(5)

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 125137 | 573257950fdd8d15006c69ee | Financial_crisis_of_2007%E2%80%9308 | It threatened the collapse of large financial ... | What year did the global recession that follow... | {'answer_start': [481], 'text': ['2012']} |
| 30275 | 5706b11d0eeca41400aa0d36 | House_music | But house was also being developed on Ibiza,[c... | what was a popular club in ibiza that started ... | {'answer_start': [251], 'text': ['Amnesia']} |
| 39176 | 5ad17e8d645df0001a2d1e38 | Mary_(mother_of_Jesus) | Although Calvin and Huldrych Zwingli honored M... | In what century did Martin Luther honor Mary a... | {'answer_start': [], 'text': []} |
| 32129 | 5709667eed30961900e840a1 | Himachal_Pradesh | Due to extreme variation in elevation, great v... | What is the climate like? | {'answer_start': [115], 'text': ['varies from ... |
| 44136 | 5ad378fa604f3c001a3fe3a1 | Elizabeth_II | The Queen addressed the United Nations for a s... | How many times has the Queen toured Canada? | {'answer_start': [], 'text': []} |


In [47]:
from datasets import Dataset

In [48]:
train_df = Dataset.from_pandas(qa_df)

In [49]:
import random

num_samples = 1000

# Generate random indices
random_indices = random.sample(range(len(train_df)), num_samples)

# Sample the records
sampled_records = train_df.select(random_indices)

In [50]:
# Print the first few records from the sampled training set
for i in range(5):
    print(f"Record {i+1}: {sampled_records[i]}")

Record 1: {'id': '5727ebe03acd2414000deff0', 'title': 'Gamal_Abdel_Nasser', 'context': "Nasser remains an iconic figure in the Arab world, particularly for his strides towards social justice and Arab unity, modernization policies, and anti-imperialist efforts. His presidency also encouraged and coincided with an Egyptian cultural boom, and launched large industrial projects, including the Aswan Dam and Helwan City. Nasser's detractors criticize his authoritarianism, his government's human rights violations, his populist relationship with the citizenry, and his failure to establish civil institutions, blaming his legacy for future dictatorial governance in Egypt. Historians describe Nasser as a towering political figure of the Middle East in the 20th century.", 'question': 'What century did Nasser rule in?', 'answers': {'answer_start': [667], 'text': ['20th']}}
Record 2: {'id': '56dff532231d4119001abf05', 'title': 'Pub', 'context': 'In Ireland, pubs are known for their atmosphere or "cr
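The sampling cell above draws a different subset on every run, because `random.sample` is called without a fixed seed. If repeatable experiments matter, one option is a seeded generator. This is a minimal sketch: `dataset_size` is a placeholder for `len(train_df)` and the seed value is arbitrary.

```python
import random

dataset_size = 130_000  # placeholder; in the notebook this would be len(train_df)
num_samples = 1000
seed = 42               # arbitrary illustrative seed

# A dedicated Random instance keeps the seed local to this sampling step
rng = random.Random(seed)
random_indices = rng.sample(range(dataset_size), num_samples)

# Re-creating the generator with the same seed reproduces the same indices
rng_again = random.Random(seed)
repeated_indices = rng_again.sample(range(dataset_size), num_samples)
print(random_indices == repeated_indices)  # True
```

The resulting `random_indices` list could be passed to `train_df.select(...)` exactly as in the cell above.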

#### Create the fine-tuning dataset

```
### CONTEXT
{context}

### QUESTION
{question}

### ANSWER
{answer}</s>
```

The next cell defines `create_prompt`, which fills this template for each record, and then tokenizes the result with the dataset's `map` function.

#### Train LoRA

In [51]:
# Define a function to create a formatted prompt for question-answering tasks
def create_prompt(context, question, answer):
    # Check if the answer is empty
    if len(answer["text"]) < 1:
        # If no answer is found, use a default message
        answer = "Cannot Find Answer"
    else:
        # If an answer exists, use the first one (assuming there might be multiple answers)
        answer = answer["text"][0]

    # Create a formatted prompt string using f-string
    # The prompt includes context, question, and answer, each in its own section
    # The '</s>' at the end is the end-of-sequence token, which teaches the model where the answer stops
    prompt_template = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n{answer}</s>"

    # Return the formatted prompt
    return prompt_template

# Apply the create_prompt function to the dataset and tokenize the result
# This uses the map function to process each sample in the dataset
mapped_qa_dataset = sampled_records.map(
    lambda samples: tokenizer(
        create_prompt(
            samples['context'],
            samples['question'],
            samples['answers']
        )
    )
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
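To see what the tokenizer receives, it can help to call the prompt function on a couple of toy records. The sketch below repeats the notebook's `create_prompt` logic so that it runs on its own. The two records are made up for illustration and only mimic the shape of the dataset's `answers` field.

```python
def create_prompt(context, question, answer):
    # Same logic as the notebook's function, repeated so this example is standalone
    if len(answer["text"]) < 1:
        answer_text = "Cannot Find Answer"
    else:
        answer_text = answer["text"][0]
    return (
        f"### CONTEXT\n{context}\n\n"
        f"### QUESTION\n{question}\n\n"
        f"### ANSWER\n{answer_text}</s>"
    )

# Toy records (illustrative values, not taken from the dataset)
context = "Paris is the capital of France."
answered = {"answer_start": [0], "text": ["Paris"]}
unanswered = {"answer_start": [], "text": []}

prompt_with_answer = create_prompt(context, "What is the capital of France?", answered)
prompt_without_answer = create_prompt(context, "Who is the mayor of Paris?", unanswered)

print(prompt_with_answer)
print("-" * 20)
print(prompt_without_answer)
```

Records with an empty `text` list end in `Cannot Find Answer</s>`, so the model also sees examples of unanswerable questions.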

In [52]:
# Import the transformers library
import transformers

# Create a Trainer object
trainer = transformers.Trainer(
    model=model,  # The model to be trained
    train_dataset=mapped_qa_dataset,  # The dataset to train on
    args=transformers.TrainingArguments(
        report_to="none",  # Disable reporting to external loggers (e.g. TensorBoard, W&B)
        per_device_train_batch_size=4,  # Number of samples per batch on each device
        gradient_accumulation_steps=4,  # Number of steps to accumulate gradients over
        max_steps=5,  # Maximum number of training steps
        learning_rate=1e-3,  # Learning rate for the optimizer
        fp16=True,  # Use 16-bit floating point precision
        logging_steps=1,  # Log training metrics every step
        output_dir='outputs',  # Directory to save model checkpoints and logs
    ),
    # Data collator for language modeling tasks
    # mlm=False indicates it's not masked language modeling (i.e., it's causal language modeling)
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# Disable the model's cache to prevent warning messages
# Note: This should be re-enabled for inference to improve performance
model.config.use_cache = False

# Start the training process
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Step | Training Loss |
|---|---|
| 1 | 3.5225 |
| 2 | 3.4575 |
| 3 | 3.3379 |
| 4 | 3.1316 |
| 5 | 3.1879 |


TrainOutput(global_step=5, training_loss=3.327502155303955, metrics={'train_runtime': 7.9454, 'train_samples_per_second': 10.069, 'train_steps_per_second': 0.629, 'total_flos': 35033499303936.0, 'train_loss': 3.327502155303955, 'epoch': 0.08})
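The `epoch` and throughput figures in this output follow from the training arguments. The sketch below reworks the arithmetic, assuming a single GPU (the one Tesla T4 shown earlier) and the 1000 sampled records as the training set.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 5
num_devices = 1         # assumption: the single Tesla T4 shown by nvidia-smi
dataset_size = 1000     # num_samples drawn earlier
train_runtime = 7.9454  # seconds, from TrainOutput

# One optimizer step consumes batch_size * accumulation_steps samples per device
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch * max_steps
epoch_fraction = samples_seen / dataset_size

samples_per_second = round(samples_seen / train_runtime, 3)
steps_per_second = round(max_steps / train_runtime, 3)

print(effective_batch)     # 16
print(samples_seen)        # 80
print(epoch_fraction)      # 0.08
print(samples_per_second)  # 10.069
print(steps_per_second)    # 0.629
```

With only 80 of the 1000 samples seen, this run is a smoke test rather than a full fine-tune. Raising `max_steps` would cover more of the data.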

In [53]:
# Save the fine-tuned LoRA adapter weights to a directory
# (for a PEFT model, save_pretrained stores only the adapter, not the full base model)
model_save_path = "./my_finetuned_model"

model.save_pretrained(model_save_path)

In [54]:
# Import necessary libraries
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the PEFT configuration from a saved path
# This configuration contains information about the fine-tuning setup
peft_config = PeftConfig.from_pretrained(model_save_path)

# Load the base model specified in the PEFT configuration
# AutoModelForCausalLM automatically selects the appropriate model architecture
# return_dict=True ensures the model returns outputs as a dictionary
# load_in_8bit=False means we're not using 8-bit quantization
# device_map='auto' allows the model to be automatically distributed across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    return_dict=True,
    load_in_8bit=False,
    device_map='auto'
)

# Load the tokenizer associated with the base model
# The tokenizer is responsible for converting text to tokens that the model can process
tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)

# Load the fine-tuned LoRA adapter weights onto the base model
# PeftModel.from_pretrained restores the saved adapter; get_peft_model would create a fresh, untrained one
qa_model = PeftModel.from_pretrained(model, model_save_path)



In [55]:
# Check which device the fine-tuned model is on
device = next(qa_model.parameters()).device
print("Model is on device:", device)

Model is on device: cuda:0


```
### CONTEXT
{context}

### QUESTION
{question}

### ANSWER
{answer}</s>

```

In [56]:
# Import necessary display functions from IPython
from IPython.display import display, Markdown

# Define a function to perform inference with the question-answering model
def make_inference(context, question):
    # Create input by formatting context and question
    # This follows the format used during training
    input_text = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n"

    # Tokenize the input text
    # return_tensors='pt' returns PyTorch tensors
    batch = tokenizer(input_text, return_tensors='pt')

    # Get the device (CPU/GPU) that the model is on
    device = next(qa_model.parameters()).device

    # Move the input tensors to the same device as the model
    batch = {k: v.to(device) for k, v in batch.items()}

    # Use automatic mixed precision for faster inference
    # torch.autocast replaces the deprecated torch.cuda.amp.autocast
    with torch.autocast(device_type=device.type):
        # Generate the answer
        # max_new_tokens limits the length of the generated answer
        output_tokens = qa_model.generate(**batch, max_new_tokens=30)

    # Decode the output tokens back to text, skipping special tokens
    answer = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

    # Display the answer as Markdown
    # This is useful in Jupyter notebooks for formatted output
    display(Markdown(answer))

# Note: This function assumes that 'tokenizer' and 'qa_model' are already defined and loaded

In [57]:
context = """ Chandrayaan-3 was launched    Satish Dhawan Space Centre on 14 July 2023. The spacecraft entered lunar orbit on 5 August,
and the lander touched down near the Lunar south pole on 23 August 2023"""

question = "When was chandaryan-3 launched? "

make_inference(context, question)


### CONTEXT
 Chandrayaan-3 was launched    Satish Dhawan Space Centre on 14 July 2023. The spacecraft entered lunar orbit on 5 August, 
and the lander touched down near the Lunar south pole on 23 August 2023

### QUESTION
When was chandaryan-3 launched? 

### ANSWER
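Because the decoded output repeats the whole prompt, it can be convenient to keep only the text that follows the answer marker. The helper below is a hypothetical sketch, not part of the notebook. `extract_answer` and `ANSWER_MARKER` are illustrative names, and the helper assumes the marker appears exactly as in the prompt template.

```python
ANSWER_MARKER = "### ANSWER\n"

def extract_answer(generated_text):
    # Hypothetical helper: split on the first answer marker and keep what follows
    _, marker, tail = generated_text.partition(ANSWER_MARKER)
    if not marker:
        return ""  # marker not found, so there is nothing to extract
    return tail.strip()

# Illustrative decoded string in the notebook's prompt format
decoded = "### CONTEXT\nsome context\n\n### QUESTION\nsome question?\n\n### ANSWER\n14 July 2023"
print(extract_answer(decoded))
```

In `make_inference`, this could be applied to the decoded string before it is passed to `display`. An empty result then means the model generated nothing after the marker.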
