# Fine Tuning LLM with Comet Overview

This is a guide to fine-tuning an open-source LLM with Hugging Face and Comet.

* We will fine-tune the LLaMA-2-7b-chat model for multiple-choice question answering.
* We will evaluate the results using Comet.

### Setup

The following cells install the packages and libraries required for the process.

In [None]:
# If the install cell raises "A UTF-8 locale is required. Got ANSI_X3.4-1968",
# uncomment and run the following two lines:
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
%pip install --quiet datasets
%pip install --quiet -U bitsandbytes==0.40.2
%pip install --quiet -U langchain
%pip install --quiet -U comet_llm
%pip install --quiet -U pandas
%pip install --quiet -U openai
%pip install --quiet -U xformers
%pip install --quiet -U transformers==4.31.0
%pip install --quiet -U huggingface_hub
%pip install --quiet -U accelerate==0.21.0
%pip install --quiet -U peft==0.4.0
%pip install --quiet -U trl==0.4.7


### Dataset

The dataset [medmcqa](https://huggingface.co/datasets/medmcqa?row=0) is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.

MedMCQA boasts over 194,000 high-caliber MCQs tailored for AIIMS & NEET PG entrance exams, encompassing 2,400 healthcare topics across 21 medical subjects. These questions exhibit an average token length of 12.77 and offer exceptional topical diversity.

#### Structure of Dataset


| Sample | Question | Correct Answer(s) | Other Options | Explanation |
| ------ | -------- | ----------------- | ------------- | ----------- |
| 1 | The sample question | Correct option(s) | Incorrect options | Detailed explanation |
| 2 | Another sample question | Correct option(s) | Incorrect options | Detailed explanation |
| ... | ... | ... | ... | ... |


Each sample includes a question, the correct answer(s), the other options, and a detailed explanation for deeper domain understanding.

In [None]:
from datasets import load_dataset
dataset = load_dataset("wiki_bio")
# A quick look at the dataset
dataset["train"][0]


{'input_text': {'table': {'column_header': ['nationality',
    'birth_date',
    'article_title',
    'name',
    'occupation'],
   'row_number': [1, 1, 1, 1, 1],
   'content': ['german',
    '1954',
    'walter extra\n',
    'walter extra',
    'aircraft designer and manufacturer']},
  'context': 'walter extra\n'},
 'target_text': 'walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .\nextra was trained as a mechanical engineer .\nhe began his flight training in gliders , transitioning to powered aircraft to perform aerobatics .\nhe built and flew a pitts special aircraft and later built his own extra ea-230 .\nextra began designing aircraft after competing in the 1982 world aerobatic championships .\nhis aircraft constructions revolutionized the aerobatics flying scene and still dominate world competitions .\nthe german pilot klaus schrodt won h

In [None]:
train_dataset = dataset["train"]
train_dataset

Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 582659
})

In [None]:
test_dataset = dataset["test"]
test_dataset

Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 72831
})

In [None]:
test_dataset = test_dataset.select(range(0,20))

In [None]:
train_dataset = train_dataset.select(range(0,1500))

### Create Instructions

LLaMA has a special chat prompt structure, which is required for fine-tuning. In the next cells, we create the instructions in the same format. An instruction looks like this:

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message }} [/INST]
```

The system prompt, delimited by the special `<<SYS>>` and `<</SYS>>` tokens, gives the model context and guides its response. Each user message is wrapped in `[INST]` delimiters, and later turns of the conversation are appended to the preceding prompt.
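As a minimal sketch of the template above, the helper below assembles a single-turn prompt. The function name `build_llama2_prompt` and its arguments are illustrative and do not come from any library; the spacing around the delimiters follows the template shown above and is an assumption about what works best in practice.

```python
from typing import Optional


def build_llama2_prompt(system_prompt: str, user_message: str,
                        answer: Optional[str] = None) -> str:
    """Assemble a single-turn LLaMA-2 chat prompt (illustrative helper)."""
    prompt = (
        f"<s>[INST] <<SYS>>\n"
        f"{system_prompt.strip()}\n"
        f"<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )
    # For a training example, append the target answer and the end-of-sequence token.
    if answer is not None:
        prompt += f" {answer.strip()} </s>"
    return prompt


inference_prompt = build_llama2_prompt(
    "You are a helpful assistant.", "Summarize this table."
)
training_example = build_llama2_prompt(
    "You are a helpful assistant.", "Summarize this table.", "A short summary."
)
print(inference_prompt)
print(training_example)
```

For inference the prompt stops at `[/INST]` so the model writes the answer; for training the answer and `</s>` are part of the text.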

In [None]:
# System prompt shared by every training and test example
system_prompt = "You are tasked to convert the table structured data to content."


def create_instructions(examples):
    """Format a batch of examples with the LLaMA-2 chat prompt structure."""
    texts = []

    for table, target in zip(examples['input_text'], examples['target_text']):
        # The table dictionary is inserted as its string representation
        model_input = f"\n{table}\n"
        model_output = f"\n{target}\n"

        # Format the text using the instruction structure shown above
        text = (f'<s>[INST] <<SYS>>\n'
                f'{system_prompt.strip()}\n'
                f'<</SYS>>\n\n'
                f'{model_input}[/INST]'
                f'{model_output}</s>'
                )

        texts.append(text)

    return {'text': texts}


train_dataset_mapped = train_dataset.map(create_instructions, batched=True)
test_dataset_mapped = test_dataset.map(create_instructions, batched=True)


In [None]:
train_dataset_mapped[0]

{'input_text': {'table': {'column_header': ['nationality',
    'birth_date',
    'article_title',
    'name',
    'occupation'],
   'row_number': [1, 1, 1, 1, 1],
   'content': ['german',
    '1954',
    'walter extra\n',
    'walter extra',
    'aircraft designer and manufacturer']},
  'context': 'walter extra\n'},
 'target_text': 'walter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .\nextra was trained as a mechanical engineer .\nhe began his flight training in gliders , transitioning to powered aircraft to perform aerobatics .\nhe built and flew a pitts special aircraft and later built his own extra ea-230 .\nextra began designing aircraft after competing in the 1982 world aerobatic championships .\nhis aircraft constructions revolutionized the aerobatics flying scene and still dominate world competitions .\nthe german pilot klaus schrodt won h

### Base Model Performance

In the following cells, we check the performance of the base model by running inference with `Llama-2-7b-chat-hf`.

To load the LLM on Colab with limited memory, we use a `bitsandbytes` configuration that loads the model with 4-bit quantization.


#### Quantization

We also pass `bnb_config`, a `BitsAndBytesConfig` object, to configure how the model is quantized. The configuration options are:
```
load_in_4bit: True to load the model in 4-bit precision, False otherwise
bnb_4bit_quant_type: the 4-bit quantization variant, NF4 or FP4 (empirical results favor NF4)
bnb_4bit_use_double_quant: True to apply a second quantization after the first one, saving an additional 0.4 bits per parameter; False otherwise
bnb_4bit_compute_dtype: the dtype used for computation; although weights are stored in 4 bits, computation still happens in 16 or 32 bits
```

A rule of thumb: use double quantization if you have memory problems, use NF4 for higher precision, and use a 16-bit dtype for faster fine-tuning.
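To make the memory argument concrete, here is a rough back-of-the-envelope calculation. It assumes a parameter count of 7 billion and counts only the stored weights, ignoring activations, quantization constants, the optimizer state, and framework overhead, so the real footprint will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumed parameter count for a 7B model

fp16_gb = weight_memory_gb(n_params, 16)                 # 16-bit weights
int4_gb = weight_memory_gb(n_params, 4)                  # 4-bit weights
double_quant_saving_gb = weight_memory_gb(n_params, 0.4)  # 0.4 bits/param saved

print(f"16-bit weights:        {fp16_gb:.2f} GB")
print(f"4-bit weights:         {int4_gb:.2f} GB")
print(f"Double-quant savings:  {double_quant_saving_gb:.2f} GB")
```

Under these assumptions, 4-bit storage cuts the weight memory to a quarter of the 16-bit figure, which is what makes a 7B model fit on a small GPU.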

#### Tokenization

Hugging Face's `AutoTokenizer` is used here to load the tokenizer that matches the LLaMA model.

```
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
```

Here, `trust_remote_code=True` means: download the custom code from the Hugging Face repo `daryl149/llama-2-7b-chat-hf` along with the weights, and run it. If it is `False`, the library uses only the built-in architectures in `huggingface/transformers` and downloads only the weights.

#### Model

`AutoModelForCausalLM` loads a pre-trained language model (e.g., `LLaMA-7b-chat`) for causal language modeling, predicting the next token in a sequence based on preceding tokens. In each instruction, the LLM aims to predict the subsequent token in the sequence as its target.
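The next-token objective can be illustrated without any model. The sketch below uses whole words as stand-ins for tokens, which is a simplification (a real tokenizer splits text into sub-word pieces); the sentence itself is only an example.

```python
# Toy "tokens": whole words instead of real sub-word tokens.
tokens = ["<s>", "walter", "extra", "is", "a", "german", "pilot", "</s>"]

# For causal language modeling, the target at each position is the next token.
inputs = tokens[:-1]
targets = tokens[1:]

for position, target in enumerate(targets, start=1):
    context = tokens[:position]  # everything the model may look at
    print(f"{' '.join(context)}  ->  {target}")
```

Each printed line pairs the context seen so far with the token the model is trained to predict next.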

In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

# -- bitsandbytes parameters --
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the model weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization variant
    bnb_4bit_use_double_quant=True,         # apply a second quantization step
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
)

# Chat model
model_name = "daryl149/llama-2-7b-chat-hf"
device_map = {"": 0}  # load the entire model on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    # low_cpu_mem_usage=True
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Text-generation pipeline; max_length caps the prompt plus the generated tokens
pipe_llama7b_chat = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)


In [None]:
pipe_llama7b_chat = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1000) # rebuild the pipeline with a larger max_length for longer outputs

In [None]:
# Build a test prompt
#test_prompt = f"""<s>[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n\n{'table': {'column_header': ['nationality', 'birth_date', 'article_title', 'name', 'occupation'], 'row_number': [1, 1, 1, 1, 1], 'content': ['german', '1954', 'walter extra\\n', 'walter extra', 'aircraft designer and manufacturer']}, 'context': 'walter extra\\n'}\n[/INST]\nwalter extra is a german award-winning aerobatic pilot , chief aircraft designer and founder of extra flugzeugbau -lrb- extra aircraft construction -rrb- , a manufacturer of aerobatic aircraft .\nextra was trained as a mechanical engineer .\nhe began his flight training in gliders , transitioning to powered aircraft to perform aerobatics .\nhe built and flew a pitts special aircraft and later built his own extra ea-230 .\nextra began designing aircraft after competing in the 1982 world aerobatic championships .\nhis aircraft constructions revolutionized the aerobatics flying scene and still dominate world competitions .\nthe german pilot klaus schrodt won his world championship title flying an aircraft made by the extra firm .\nwalter extra has designed a series of performance aircraft which include unlimited aerobatic aircraft and turboprop transports .\n\n</s>"""

# Run inference with text-generation pipeline
#result = pipe_llama7b_chat(test_prompt)
#result

def predict_result(input_: dict) -> dict:
    response = pipe_llama7b_chat(input_['question'])
    return {"output": str(response[0]["generated_text"])}

In [None]:
predict_result({'question':"""<s>[INST]<SYS>>\n You are tasked to convert the table structured data to content. \n<</SYS>>\n\n\n{'table': {'column_header': ['nationality', 'birth_date', 'article_title', 'name', 'occupation'], 'row_number': [1, 1, 1, 1, 1], 'content': ['german', '1954', 'walter extra\\n', 'walter extra', 'aircraft designer and manufacturer']}, 'context': 'walter extra\\n'}\n[/INST]</s>"""})


{'output': "<s>[INST]<SYS>>\n You are tasked to convert the table structured data to content. \n<</SYS>>\n\n\n{'table': {'column_header': ['nationality', 'birth_date', 'article_title', 'name', 'occupation'], 'row_number': [1, 1, 1, 1, 1], 'content': ['german', '1954', 'walter extra\\n', 'walter extra', 'aircraft designer and manufacturer']}, 'context': 'walter extra\\n'}\n[/INST]</s>\nHere is the content for each column based on the provided table structure:\n\nNationality:\ngerman\n\nBirth Date:\n1954\n\nArticle Title:\nwalter extra\n\nName:\nwalter extra\n\nOccupation:\naircraft designer and manufacturer\n\n\nI hope this helps! Let me know if you have any questions or if you need further assistance."}

As we can see, the base LLaMA-2 model does not produce the kind of text found in the dataset: it simply restates the table fields one by one. The hope is that after fine-tuning the model will better understand the knowledge presented in the dataset.
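The pipeline output above repeats the prompt before the model's answer. When comparing outputs with references, it can help to keep only the generated part. The helper below is a small sketch of that idea; `extract_completion` is a hypothetical name, and splitting on `[/INST]` assumes the prompt format used in this notebook.

```python
def extract_completion(generated_text: str, end_of_prompt: str = "[/INST]") -> str:
    """Return only the text that follows the last prompt delimiter."""
    _, found, completion = generated_text.rpartition(end_of_prompt)
    if not found:
        # No delimiter present: return the text unchanged (minus outer whitespace).
        return generated_text.strip()
    completion = completion.strip()
    # Drop a leading end-of-sequence marker if the prompt ended with one.
    if completion.startswith("</s>"):
        completion = completion[len("</s>"):]
    return completion.strip()


sample = "<s>[INST] some table [/INST]</s>\nHere is the content:\n\nNationality:\ngerman"
print(extract_completion(sample))
```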

### Hyperparameters

All of these hyperparameters make sure that the fine-tuning is possible on the T4 GPU. The hyperparameters are taken from [here](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing#scrollTo=ib_We3NLtj2E).

In [None]:
# *** Modify the model_name ***
model_name = "daryl149/llama-2-7b-chat-hf"

# *** The instruction dataset to use ***
dataset = train_dataset_mapped

# Fine-tuned model name
new_model = "llama-2-7b-table_to_context"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
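To see what these settings imply for the length of the run, the sketch below works through the arithmetic for the 1,500 training examples selected earlier. It assumes a single GPU, and the rounding used for the warmup steps is an assumption that may differ slightly from what the trainer does internally.

```python
import math

num_examples = 1500            # size of the training subset selected earlier
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_gpus = 1                   # assumption: a single GPU
num_train_epochs = 1
warmup_ratio = 0.03

# Examples consumed per optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps:      {total_steps}")
print(f"Warmup steps:         {warmup_steps}")
```

Raising `gradient_accumulation_steps` increases the effective batch size without using more GPU memory per step, at the cost of fewer optimizer updates per epoch.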

In [None]:
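import gc
import torch

# Free GPU memory held by the base model before fine-tuning.
# Names that were never defined (e.g. `trainer` on a first run) are skipped.
for name in ("model", "pipe_llama7b_chat", "trainer"):
    globals().pop(name, None)
gc.collect()
torch.cuda.empty_cache()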

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.
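The LoRA settings above (`lora_r = 64`, `lora_alpha = 16`) can be put in perspective with a little arithmetic. The sketch below assumes the common LoRA formulation, in which a weight update is the product of two low-rank matrices scaled by `lora_alpha / lora_r`, and a hidden size of 4096 for a 7B LLaMA-style model; both are assumptions, not values read from the notebook.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


lora_r = 64
lora_alpha = 16
hidden_size = 4096  # assumption: hidden size of a 7B LLaMA-style model

scaling = lora_alpha / lora_r
adapter_params = lora_param_count(hidden_size, hidden_size, lora_r)
full_params = hidden_size * hidden_size

print(f"LoRA scaling factor:          {scaling}")
print(f"Adapter parameters per layer: {adapter_params:,}")
print(f"Full matrix parameters:       {full_params:,}")
print(f"Trainable fraction:           {adapter_params / full_params:.2%}")
```

Under these assumptions, each adapted square projection trains only a few percent of the parameters of the full matrix, which is why the fine-tuning fits on a small GPU.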


In [None]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# *** Fine-tuned model name ***
new_model = "llama-2-7b-chat-hf-ft-medcqa"

# Save trained model
trainer.model.save_pretrained(new_model)



You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
| ---- | ------------- |
| 25 | 1.9387 |
| 50 | 1.1064 |
| 75 | 1.1425 |
| 100 | 0.8401 |
| 125 | 1.111 |
| 150 | 0.8206 |
| 175 | 0.9936 |
| 200 | 0.7747 |
| 225 | 1.0356 |
| 250 | 0.7328 |


In [None]:
import comet_llm
PROJECT_NAME = "llama-2-v1"
comet_llm.init(project=PROJECT_NAME) # will ask for Comet API

In [None]:
prompts = []
references = []
instruction = "<<SYS>>\n You are tasked to convert the table structured data to content. \n<</SYS>>"
for table, target in zip(test_dataset_mapped[:10]["input_text"], test_dataset_mapped[:10]["target_text"]):
    # The prompt ends at [/INST] so that the model generates the answer
    prompts.append(f'<s>[INST]{instruction} {str(table)}\n[/INST]')
    references.append(target)

In [None]:
# Log the model predictions to Comet.
# Assumes `pipe_llama7b_chat` is still loaded; rebuild it first if it was deleted to free memory.
for index, prompt in enumerate(prompts):
    comet_llm.log_prompt(
        prompt=prompt,
        prompt_template=instruction,
        output=predict_result({"question": prompt})["output"],
        tags=["llama-2-base", "prompt_1"],
        metadata={"expected_answer": references[index]},
    )

