<a href="https://colab.research.google.com/github/pankajrawat9075/Question-Answering-with-LLMs/blob/main/finetune_with_lora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Install Dependencies

In [1]:
!pip install -q pyarrow==11.0.0 bitsandbytes datasets accelerate loralib peft transformers wandb nvidia-ml-py3


#### check if cuda available

In [1]:
import torch
torch.cuda.is_available()

True

#### few helper functions

In [2]:
from pynvml import *
gpu_details={}

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")
    return {info.used//1024**2}


def print_summary(result, name='default'):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    gpu_details[name] = {'s/s': result.metrics['train_samples_per_second'], 'gpu': print_gpu_utilization()}

In [3]:
gpu_details['initial']= {'gpu': print_gpu_utilization()}

GPU memory occupied: 260 MB.


In [4]:
gpu_details

{'initial': {'gpu': {260}}}

#### import tokenizer and model

In [5]:
import torch
import torch.nn as nn
import bitsandbytes as bnb

# uncomment the following to load bloom model

##
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")
##

# comment the following if want to use bloom model

##
# from transformers import GPT2Tokenizer, GPT2LMHeadModel
# tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
# model = GPT2LMHeadModel.from_pretrained('distilgpt2')
##


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
print_gpu_utilization()

GPU memory occupied: 2494 MB.


{2494}

In [7]:
# add padding token
tokenizer.pad_token = tokenizer.eos_token

In [8]:
print("Model before adding LORA Adapter \n")
print("no. of parameters : ", model.num_parameters())
print("\n",model)

Model before adding LORA Adapter 

no. of parameters :  559214592

 BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1024)
    (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (l

#### freeze the pretrained model

In [9]:
print_gpu_utilization()

GPU memory occupied: 2494 MB.


{2494}

In [43]:
for param in model.parameters():
  param.requires_grad = True  # freeze the model - train adapters later

#### Helper function to print parameters

In [11]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

#### add lora adapter to the model

In [12]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=2,
    lora_alpha=16,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model_lora = get_peft_model(model, config)
print_trainable_parameters(model_lora)

trainable params: 196608 || all params: 559411200 || trainable%: 0.03514552443712246


We can see remarkable decrease in the no. of parameters to train than if we would have trained the whole model. We now only train 0.24% of actual parameters. WOW!!

In [45]:
for name, param in model_lora.named_parameters():
    print(f"Parameter {name} requires gradients: {param.requires_grad}")

Parameter base_model.model.transformer.word_embeddings.weight requires gradients: True
Parameter base_model.model.transformer.word_embeddings_layernorm.weight requires gradients: True
Parameter base_model.model.transformer.word_embeddings_layernorm.bias requires gradients: True
Parameter base_model.model.transformer.h.0.input_layernorm.weight requires gradients: True
Parameter base_model.model.transformer.h.0.input_layernorm.bias requires gradients: True
Parameter base_model.model.transformer.h.0.self_attention.query_key_value.base_layer.weight requires gradients: True
Parameter base_model.model.transformer.h.0.self_attention.query_key_value.base_layer.bias requires gradients: True
Parameter base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.default.weight requires gradients: True
Parameter base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.default.weight requires gradients: True
Parameter base_model.model.transformer.h.0.self_attention.dense.wei

In [13]:
print("Model after adding LORA Adapter\n")
print(model)

Model after adding LORA Adapter

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1024)
    (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): lora.Linear(
            (base_layer): Linear(in_features=1024, out_features=3072, bias=True)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=1024, out_features=2, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=2, out_features=3072, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (dense

In [14]:
gpu_details['model']= {'gpu': print_gpu_utilization()}

GPU memory occupied: 2494 MB.


In [15]:
gpu_details

{'initial': {'gpu': {260}}, 'model': {'gpu': {2494}}}

#### load and prepare the dataset for training

In [16]:
from datasets import load_dataset

qa_dataset = load_dataset("squad_v2")

In [17]:
qa_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

#### let's see some examples of our dataset

In [18]:
qa_dataset['train'][0]

{'id': '56be85543aeaaa14008c9063',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}

So basically, we will be given a context and a question, and using the context itself - the model needs to return the answer.

#### select subset of data for finetuneing

In [19]:
qa_dataset_train = qa_dataset["train"].shuffle(seed=42).select(range(20000))
qa_dataset_test = qa_dataset["validation"].shuffle(seed=42).select(range(100))

In [20]:
qa_dataset_train

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 20000
})

We want our prompt to looks like this.
```
### CONTEXT
{context}

### QUESTION
{question}

### ANSWER
{answer}</s>
```

In [21]:
def create_prompt(context, question, answer):
  if len(answer["text"]) < 1:
    answer = "Cannot Find Answer"
  else:
    answer = answer["text"][0]
  prompt_template = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n{answer}</s>"
  return prompt_template

mapped_qa_dataset_train = qa_dataset_train.map(lambda samples: tokenizer(create_prompt(samples['context'], samples['question'], samples['answers'])))
mapped_qa_dataset_test = qa_dataset_test.map(lambda samples: tokenizer(create_prompt(samples['context'], samples['question'], samples['answers'])))

In [22]:
mapped_qa_dataset_train

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 20000
})

Now we have got the dataset into trainable format using tokenizer

#### using wandb for logging purposes

In [23]:
import wandb

In [24]:
api_key = 'c891ce08bb56081ecfa97de464a131634657ac13'
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mcs22m062[0m ([33miitmadras[0m). Use [1m`wandb login --relogin`[0m to force relogin


True

In [25]:
# Let's log every trained model.
%env WANDB_LOG_MODEL=true

env: WANDB_LOG_MODEL=true


In [26]:
import os
os.environ["WANDB_PROJECT"]="fine-tuning-with-LORA"

In [46]:
!nvidia-smi

Thu Mar 14 16:55:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P0              27W /  70W |   3333MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [47]:
print_gpu_utilization()

GPU memory occupied: 3590 MB.


{3590}

In [39]:
wandb.finish(
)

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

In [48]:
import transformers

# use only when we want to train / fine-tune the model
trainer = transformers.Trainer(
    model=model_lora,
    train_dataset=mapped_qa_dataset_train,
    eval_dataset=mapped_qa_dataset_test,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        warmup_steps=100,
        max_steps=10,
        learning_rate=1e-3,
        fp16=False,
        logging_steps=2,
        output_dir='outputs',     # we save the model after training for testing pusposes
        evaluation_strategy='steps',
        report_to="wandb",
        run_name="lora-r-4-grad-cp",
    ),
#     compute_metrics=compute_perplexity,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
# model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
result = trainer.train()


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss,Validation Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.52 GiB. GPU 0 has a total capacity of 14.75 GiB of which 2.03 GiB is free. Process 239672 has 12.72 GiB memory in use. Of the allocated memory 11.82 GiB is allocated by PyTorch, and 785.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [32]:
print_summary(result, 'vanila')

Time: 61.90
Samples/second: 0.16
GPU memory occupied: 14298 MB.


In [50]:
torch.cuda.empty_cache()

In [43]:
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

env: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True


In [57]:
print_summary(result)

NameError: name 'result' is not defined

In [None]:
# already trained the model and saved it. Now can use it anytime
model_lora = GPT2LMHeadModel.from_pretrained("/kaggle/working/outputs/checkpoint-500")

In [None]:
import math
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


>>> Perplexity: 23.54


In [None]:
print(f">>> Evaluation_loss: {eval_results['eval_loss']}")

>>> Evaluation_loss: 3.158553123474121


#### save the model to hugging-face hub

In [None]:
HUGGING_FACE_USER_NAME = "pankaj9075rawat"

In [None]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
model_name = "fine-tune-GPT2-lora"

model_lora.push_to_hub(f"{HUGGING_FACE_USER_NAME}/{model_name}", use_auth_token=True)



README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]



adapter_model.safetensors:   0%|          | 0.00/816k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/pankaj9075rawat/fine-tune-GPT2-lora/commit/2a22914b9d9027edfcd65b474d1b48e6435471b9', commit_message='Upload model', commit_description='', oid='2a22914b9d9027edfcd65b474d1b48e6435471b9', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
from IPython.display import display, Markdown

def make_inference(context, question):
  batch = tokenizer(f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n", return_tensors='pt')

  # Move the batch tensor to the appropriate device (e.g., GPU)
  batch = {k: v.to("cuda") for k, v in batch.items()}

  with torch.cuda.amp.autocast():
    output_tokens = model_lora.generate(**batch, max_new_tokens=500)

  display(Markdown((tokenizer.decode(output_tokens[0], skip_special_tokens=True))))

In [None]:
model_lora.config.use_cache = True

#### make some inferences

In [None]:
context = "Cheese is the best food."
question = "What is the best food?"

make_inference(context, question)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### CONTEXT
Cheese is the best food.

### QUESTION
What is the best food?

### ANSWER
cheese</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s>

In [None]:
context = "Cheese is the best food."
question = "How far away is the Moon from the Earth?"

make_inference(context, question)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### CONTEXT
Cheese is the best food.

### QUESTION
How far away is the Moon from the Earth?

### ANSWER
Cannot Find Answer</s>

### ANSWER</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></

In [None]:
context = "The Moon orbits Earth at an average distance of 384,400 km (238,900 mi), or about 30 times Earth's diameter. Its gravitational influence is the main driver of Earth's tides and very slowly lengthens Earth's day. The Moon's orbit around Earth has a sidereal period of 27.3 days. During each synodic period of 29.5 days, the amount of visible surface illuminated by the Sun varies from none up to 100%, resulting in lunar phases that form the basis for the months of a lunar calendar. The Moon is tidally locked to Earth, which means that the length of a full rotation of the Moon on its own axis causes its same side (the near side) to always face Earth, and the somewhat longer lunar day is the same as the synodic period. However, 59% of the total lunar surface can be seen from Earth through cyclical shifts in perspective known as libration."
question = "At what distance does the Moon orbit the Earth?"

make_inference(context, question)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


### CONTEXT
The Moon orbits Earth at an average distance of 384,400 km (238,900 mi), or about 30 times Earth's diameter. Its gravitational influence is the main driver of Earth's tides and very slowly lengthens Earth's day. The Moon's orbit around Earth has a sidereal period of 27.3 days. During each synodic period of 29.5 days, the amount of visible surface illuminated by the Sun varies from none up to 100%, resulting in lunar phases that form the basis for the months of a lunar calendar. The Moon is tidally locked to Earth, which means that the length of a full rotation of the Moon on its own axis causes its same side (the near side) to always face Earth, and the somewhat longer lunar day is the same as the synodic period. However, 59% of the total lunar surface can be seen from Earth through cyclical shifts in perspective known as libration.

### QUESTION
At what distance does the Moon orbit the Earth?

### ANSWER
Cannot Find Answer</s>

### ANSWER</s>

### ANSWER</s>

### ANSWER</s>

### ANSWER</s>

### ANSWER</s>

### ANSWER</s>

### ANSWER</s>

### ANSWER</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></

#### It looks like the GPT2 model is not doing a gread job in giving the write answer. We have trained the model on less data and also the model is smaller.

In [None]:
wandb_key=  c891ce08bb56081ecfa97de464a131634657ac13
torch.cuda.empty_cache()

In [None]:
!nvidia-smi