# Tutorial: Fine-Tuning Llama3 with Quantization and LoRA

Hello and welcome to this tutorial on fine-tuning the Llama3 model!
Whether you're new to model training or looking to refine your skills, this guide will walk you through the steps to customize a pre-trained model to better suit your specific needs.

In this tutorial, we'll learn how to fine-tune the Llama3 model using quantization techniques and Low-Rank Adaptation (LoRA). By the end of this guide, you will be able to adjust a pre-trained model to perform better on your specific tasks or datasets.

Good luck!

## Fine-tuning
### What is Fine-tuning?
Fine-tuning is the process of refining a pre-trained model to improve its performance on a specific task or dataset. It involves:
- Adjusting the model's weights and biases
- Adapting the model to understand and work with new data
- Customizing a general-purpose model trained on large datasets to be more effective for smaller, specialized datasets

### Why is Fine-tuning important?
Fine-tuning is crucial because it allows us to:
- Utilize the capabilities of large, pre-trained models for specific applications
- Achieve high performance on specialized tasks without needing to train a model from scratch
- Save computational resources and time, especially when large datasets or extensive computational power are not available

## Setup
In this section, I will describe the GPU resources allocated for the project, list the required packages in __'requirements.txt'__, and provide instructions on how to create and manage the Conda environment.

### GPU Resources
For this project, we have allocated GPU resources from the NCSA Delta systems. Here are the details of the GPU resources used:
* GPU Type: NVIDIA A100x4
* Memory: 40960 MiB

You can obtain detailed GPU usage information by running the __'nvidia-smi'__ command. Below is an example of the GPU usage during training:

In [None]:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01    Driver Version: 535.183.01    CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB On | 00000000:C7:00.0 Off |                    0 |
| N/A   49C    P0    341W / 400W |  25284MiB / 40960MiB |     97%      Default |
|                               |                      |               Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     184712      C   ...nvs/llama-new/bin/python   25274MiB |
+-----------------------------------------------------------------------------+
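If you want to read the same numbers from a script, `nvidia-smi` can print them as CSV through its `--query-gpu` and `--format` options. The following is a minimal sketch, not part of the original project; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into a list of tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...], one entry per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# Parsing the values from the table above does not require a GPU.
sample = "25284, 40960\n"
for used, total in parse_gpu_memory(sample):
    print(f"{used} / {total} MiB ({used / total:.0%} used)")
```

Call `query_gpu_memory()` on a machine where `nvidia-smi` is available to get live values.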


### Environment Setup
To ensure all dependencies are correctly installed, we will create a Conda environment and install the required packages from the __'requirements.txt'__ file.

#### Step 1: Create and Activate Conda Environment

1. Create a new Conda environment:
Replace __'llama3_finetuning'__ with your desired environment name and __'python=3.9'__ with your preferred Python version.

In [None]:
conda create -n llama3_finetuning python=3.9

2. Activate the Conda environment

In [None]:
conda activate llama3_finetuning

#### Step 2: Install Required Packages
Install the required packages from the __'requirements.txt'__ file:

In [None]:
pip install -r requirements.txt

### Verifying the Setup
To verify that all packages are installed correctly, you can list the installed packages:

In [None]:
pip list

This command will display all the packages and their versions installed in your current Conda environment.
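As an alternative to scanning the `pip list` output by eye, the check can be scripted with the standard-library `importlib.metadata` module. This is only a sketch: the package list below is an assumption about what __'requirements.txt'__ contains, so adjust it to match your file.

```python
from importlib import metadata


def missing_packages(names):
    """Return the subset of `names` that is not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Assumed package list; edit it to mirror your requirements.txt.
required = ["torch", "transformers", "peft", "trl", "bitsandbytes", "wandb", "datasets"]
print("Missing packages:", missing_packages(required) or "none")
```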

### Memory
When training a model, the following hyperparameters affect memory usage and training time:
1. Batch size
- Description: The number of training samples processed in one forward/backward pass.
- Impact on Memory: Larger batch sizes require more memory because the GPU must store the data (inputs and intermediate activations) for all samples in the batch.

2. Epochs
- Description: The number of times the entire training dataset is passed through the model.
- Impact on Running Time: The number of epochs does not significantly affect peak memory usage during each forward and backward pass. However, more epochs increase the total training time.
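The relationship between these settings and the amount of work per run can be estimated with simple arithmetic. The sketch below also accounts for gradient accumulation, which the training arguments later in this tutorial set to 4. The dataset size is a made-up number, and the trainer's exact rounding may differ slightly from this estimate.

```python
import math


def training_step_estimate(num_samples, batch_size, grad_accum_steps, epochs, num_gpus=1):
    """Rough step counts for one training run (an estimate, not the trainer's exact logic)."""
    # Samples that contribute to a single optimizer update.
    effective_batch = batch_size * grad_accum_steps * num_gpus
    # Forward/backward passes needed to see every sample once.
    batches_per_epoch = math.ceil(num_samples / (batch_size * num_gpus))
    # Weight updates over the whole run.
    optimizer_steps = math.ceil(batches_per_epoch / grad_accum_steps) * epochs
    return effective_batch, batches_per_epoch, optimizer_steps


# num_samples=1000 is a hypothetical dataset size for illustration.
effective, batches, steps = training_step_estimate(
    num_samples=1000, batch_size=5, grad_accum_steps=4, epochs=1
)
print(f"effective batch size: {effective}")
print(f"batches per epoch: {batches}")
print(f"optimizer steps: {steps}")
```

Doubling the number of epochs doubles the optimizer steps, while peak memory is driven by `batch_size`, not by the number of epochs.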


#### Errors
If you encounter the following errors, here are some solutions:

![Error 1](error1.png)

Restart your session and allocate a new GPU from the Delta systems.

![Error 2](error2.png)

Reduce the batch_size hyperparameter. If the error persists, restart your session and allocate a new GPU from the Delta systems.


## Training Code Explanation
In this section, we will walk through the fine-tuning code. The code is structured to handle model quantization, apply Low-Rank Adaptation (LoRA), and set up training arguments. Additionally, we will include custom logging with Weights & Biases (wandb).

### Quantization
Quantization reduces the precision of the numbers used to represent the model parameters, thereby reducing memory usage and speeding up computations.
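To make the idea concrete, here is a deliberately simplified sketch of "absmax" quantization to 8-bit integers in plain Python. It is an illustration only: the bitsandbytes library used in this tutorial applies more sophisticated schemes, and the function names here are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    # Fall back to a scale of 1.0 if every value is zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.27, 0.0]  # made-up weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", q)
print(f"largest round-trip error: {max_error:.4f}")
```

Each weight is now stored as one small integer plus a shared scale, at the cost of a rounding error of at most half the scale.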

The __'get_model'__ function is responsible for loading the pre-trained model with either 8-bit or 4-bit quantization based on the __'is_8bit'__ flag.

In [None]:
import os

import torch
from openai import AzureOpenAI
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    StoppingCriteria,
    StoppingCriteriaList,
)

# Replace the placeholders with your own credentials.
os.environ['HF_TOKEN'] = 'your own token'
os.environ["AZURE_OPENAI_KEY"] = "your own token"

In [None]:
def get_model(model_id, is_8bit=True):
    if is_8bit:
        # 8-bit loading only needs the load_in_8bit flag; double quantization,
        # the "nf4" quant type, and the compute dtype are 4-bit options.
        bnb_config = BitsAndBytesConfig(load_in_8bit=True)
    else:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        cache_dir="cache"
    )
    return model

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
model = get_model(model_id, is_8bit=True)
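A back-of-the-envelope calculation shows why the choice of precision matters for fitting the model on a single GPU. The sketch below counts the weights only and assumes roughly 8 billion parameters for an "8B" model. Activations, optimizer state, quantization constants, and layers kept in higher precision are not included, so real usage is higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 8 billion parameters.
num_params = 8e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: about {weight_memory_gib(num_params, bits):.1f} GiB")
```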

### Response Generation with Stopping Criteria
This section of the code defines how to generate responses from a model with specific stopping criteria. It includes the implementation of a custom stopping criterion and a function to generate responses using either the Azure OpenAI service or a local model.

In [None]:
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        stop_ids = [74694,55375,5658,14196]  # IDs of tokens where the generation should stop.
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:  # Checking if the last generated token is a stop token.
                return True
        return False
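The effect of this criterion can be mimicked in plain Python: generation proceeds token by token and halts as soon as the most recent token is one of the stop IDs. The sketch below is a stand-in for the real generation loop, and the token stream is made up.

```python
STOP_IDS = {74694, 55375, 5658, 14196}  # same IDs as in StopOnTokens


def generate_until_stop(token_stream, stop_ids=STOP_IDS, max_new_tokens=200):
    """Collect tokens until a stop ID appears or the length limit is reached."""
    generated = []
    for token_id in token_stream:
        generated.append(token_id)
        # Mirror the criterion: only the last generated token is checked.
        if token_id in stop_ids or len(generated) >= max_new_tokens:
            break
    return generated


# Made-up token IDs; 74694 is a stop ID, so the trailing 11 and 12 are never kept.
print(generate_until_stop([101, 102, 103, 74694, 11, 12]))
```

Note that the stop token itself is part of the output, so downstream code needs to strip it if it is not wanted.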

In [None]:
def response(model, model_id, streamer, model_inputs, temperature, device="cuda"):
    if model_id == "UIUC-ConvAI-Sweden-GPT4":
        # Azure OpenAI path: model_inputs must be the prompt as a plain string.
        client = AzureOpenAI(
            azure_endpoint="https://uiuc-convai-sweden.openai.azure.com/",
            api_key=os.getenv("AZURE_OPENAI_KEY"),
            api_version="2024-02-15-preview"
        )
        completion = client.chat.completions.create(
            model="UIUC-ConvAI-Sweden-GPT4",  # model = "deployment_name"
            messages=[{"role": "user", "content": model_inputs}],
            temperature=temperature,
            max_tokens=200,
            top_p=0.95,
            frequency_penalty=0,
            presence_penalty=0,
            stop=None
        )
        outputs = completion.choices[0].message.content
    else:
        # Local model path: model_inputs is a tensor of token IDs, which is
        # moved to the model's device before generation.
        stop = StopOnTokens()
        model_inputs = model_inputs.to(next(model.parameters()).device)
        outputs = model.generate(
            input_ids=model_inputs,
            streamer=streamer,
            max_new_tokens=200,
            early_stopping=True,
            do_sample=True,
            top_p=0.95,
            top_k=50,
            temperature=temperature,
            repetition_penalty=1.0,
            num_beams=1,
            output_scores=True,
            return_dict_in_generate=True,
            stopping_criteria=StoppingCriteriaList([stop]),
        )
    return outputs
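The `temperature`, `top_k`, and `top_p` arguments control how the next token is sampled. The following is a simplified, pure-Python sketch of that idea, not the library's implementation; the function name and the logits are hypothetical.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=50, top_p=0.95, rng=random):
    """Sample one token index using temperature, top-k, and top-p filtering."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (shifted by the maximum for stability).
    peak = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(ranked, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    indices, weights = zip(*kept)
    return rng.choices(indices, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [4.0, 3.5, 1.0, -2.0, -5.0]  # made-up scores for a 5-token vocabulary
samples = [sample_token(logits, temperature=0.7, top_k=3, top_p=0.9, rng=rng)
           for _ in range(1000)]
print({i: samples.count(i) for i in sorted(set(samples))})
```

With these settings only the two most likely tokens survive the filters, so every sample is token 0 or token 1.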

### Parameter-Efficient Fine-Tuning (PEFT)
Parameter-Efficient Fine-Tuning (PEFT) refers to a collection of techniques aimed at fine-tuning large pre-trained models by adjusting only a small subset of their parameters. The primary goal of PEFT is to adapt models to new tasks efficiently, reducing computational costs and memory usage while maintaining high performance. PEFT is particularly useful when dealing with very large models that would otherwise be prohibitively expensive to fine-tune in their entirety.

### Low-Rank Adaptation (LoRA)
LoRA is a specific type of PEFT. It reduces the number of trainable parameters, making the fine-tuning process more efficient. By inserting low-rank matrices into the model, LoRA allows for significant memory and computational savings while still enabling the model to adapt effectively to new tasks.
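Concretely, LoRA keeps a pre-trained weight matrix W frozen and learns two small matrices, B and A, whose product is added to W after scaling by lora_alpha / r. The toy example below illustrates this with plain Python lists; the matrix sizes and values are made up, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A; W stays frozen, only A and B are trained."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy sizes: a 3x4 weight matrix adapted with rank r = 1.
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
A = [[0.5, 0.0, -0.5, 1.0]]   # shape r x d_in
B = [[0.0], [0.0], [0.0]]     # shape d_out x r; zero, so training starts from W
print(lora_effective_weight(W, A, B, lora_alpha=4, r=1) == W)

B = [[1.0], [0.0], [2.0]]     # pretend training has updated B
print(lora_effective_weight(W, A, B, lora_alpha=4, r=1))
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and only the small matrices A and B need gradients.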

#### LoRA Configuration Parameters:
* __r__ - The rank of the low-rank matrices used in LoRA. A lower rank means fewer trainable parameters.
* __lora_alpha__ - Scaling factor for the low-rank update, which is multiplied by lora_alpha / r. A higher alpha gives the adapter more influence on the output, which can help maintain the expressiveness of the model despite the low-rank approximation.
* __target_modules__ - The specific modules of the model where LoRA is applied. For Llama models, this includes the query (q_proj), key (k_proj), value (v_proj), and output (o_proj) projection layers.
* __lora_dropout__ - The dropout rate applied to the LoRA layers, helping to prevent overfitting.
* __bias__ - Whether to train the bias terms in the layers where LoRA is applied.
* __task_type__ - The type of task for which the model is being fine-tuned.
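To see how the rank translates into a parameter count, the sketch below adds up the LoRA parameters for the four target modules. The layer shapes are assumptions about Llama-3-8B (hidden size 4096, 32 layers, key/value projections of width 1024); verify them against `model.config` before relying on the numbers.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Assumed (d_in, d_out) shapes for one transformer layer of Llama-3-8B.
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
num_layers, r = 32, 16  # r matches the configuration used in this tutorial

per_layer = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in shapes.values())
frozen_per_layer = sum(d_in * d_out for d_in, d_out in shapes.values())
print("LoRA parameters:", per_layer * num_layers)
print(f"Share of the adapted projections: {per_layer / frozen_per_layer:.2%}")
```

Halving `r` halves the number of trainable parameters, which is the trade-off the __r__ setting controls.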

The __'get_model_setup'__ function applies LoRA to the model and configures it for fine-tuning.

In [None]:
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import TrainingArguments

def get_model_setup(model, batch_size, OUTPUT_DIR, epochs):
    lora_config = LoraConfig(
        r=16,
        lora_alpha=64,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # specific to Llama models
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    model.gradient_checkpointing_enable()
    # Prepare model for k-bit training
    model = prepare_model_for_kbit_training(model)
    # Apply LoRA to the model
    model = get_peft_model(model, lora_config)
    
    training_arguments = TrainingArguments(
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=4,
        optim="adamw_torch",
        logging_steps=1,
        learning_rate=1e-4,
        fp16=True,
        max_grad_norm=0.3,
        num_train_epochs=epochs,
        evaluation_strategy="steps",
        eval_steps=0.2,
        warmup_ratio=0.05,
        save_strategy="epoch",
        group_by_length=True,
        output_dir=OUTPUT_DIR,
        report_to="wandb",
        save_safetensors=True,
        lr_scheduler_type="cosine",
        seed=42,
    )
    model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
    return lora_config, training_arguments, model


In [None]:
# hyperparameters setting
batch_size = 5
OUTPUT_DIR = "outputs"
epochs = 1

lora_config, training_arguments, model = get_model_setup(model, batch_size, OUTPUT_DIR, epochs)
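The combination of `warmup_ratio=0.05` and `lr_scheduler_type="cosine"` means the learning rate ramps up linearly at the start and then decays along a cosine curve. The function below is a sketch of that shape under the assumption of a single half-cosine decay to zero; the total number of steps is made up, and the function name is hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.05):
    """Linear warmup to base_lr, followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 25, 50, 525, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The peak value `base_lr` corresponds to `learning_rate=1e-4` in the training arguments and is reached once the warmup phase ends.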

### Tokenizer Setup
The tokenizer converts input text into token IDs that can be processed by the model. It also handles padding and truncation.

The __'get_tokenizer'__ function loads the tokenizer and sets the padding token to be the end-of-sequence token.

In [None]:
from transformers import AutoTokenizer, TextIteratorStreamer

def get_tokenizer(model_id, stop_tokens=True):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token_id = tokenizer.eos_token_id  # Set padding token to EOS token
    return tokenizer

tokenizer = get_tokenizer(model_id)
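Padding is needed because the sequences in a batch must have the same length. The sketch below shows the idea with plain lists: shorter sequences are filled with the pad token ID, and an attention mask marks which positions are real. The token IDs are made up, `0` stands in for the EOS/pad token ID, and right-padding is just one possible choice.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad token ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


# Made-up token IDs; 0 plays the role of the pad (EOS) token ID.
ids, mask = pad_batch([[11, 12, 13], [21]], pad_token_id=0)
print(ids)
print(mask)
```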

The __'TextIteratorStreamer'__ streams generated text as it is produced, which is useful when working with LLMs. It allows the decoded text to be consumed incrementally through an iterator instead of waiting for generation to finish.

In [None]:
streamer = TextIteratorStreamer(
    tokenizer, 
    timeout = 10, 
    skip_prompt = True, 
    skip_special_tokens = True)
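A streamer of this kind is typically consumed while generation runs in a separate thread. The following pure-Python sketch imitates that producer/consumer pattern with a queue. It is an analogy rather than the transformers class itself, and `fake_generate` is a hypothetical stand-in for `model.generate`.

```python
import queue
import threading


def fake_generate(chunks, out_queue):
    """Stand-in for model.generate: pushes text chunks, then an end signal."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(None)  # signals the end of generation


def stream_text(chunks, timeout=10):
    """Yield chunks as soon as the producer thread makes them available."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, out_queue))
    worker.start()
    while True:
        # Raises queue.Empty if nothing arrives within `timeout` seconds.
        chunk = out_queue.get(timeout=timeout)
        if chunk is None:
            break
        yield chunk
    worker.join()


pieces = list(stream_text(["The ", "hotel ", "is ", "booked."]))
print("".join(pieces))
```

The `timeout` plays the same role as the `timeout = 10` argument above: the consumer gives up if no new text arrives in time.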

### Custom Logging with Weights & Biases (wandb)
To monitor and log the training process, we integrate wandb.

In [None]:
import wandb
from transformers import TrainerCallback

class CustomWandbLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, **kwargs):
        if state.is_world_process_zero:
            logs = {k: v for k, v in state.log_history[-1].items() if isinstance(v, (int, float))}
            wandb.log(logs)
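The dictionary comprehension in the callback keeps only numeric values from the latest log entry, so text fields are dropped before logging. The snippet below isolates that filter; the log entry is made up for illustration, and `numeric_logs` is a hypothetical helper name.

```python
def numeric_logs(entry):
    """Keep only int/float values, as the callback's dict comprehension does."""
    return {k: v for k, v in entry.items() if isinstance(v, (int, float))}


# Made-up log entry resembling what a trainer might record.
entry = {"loss": 1.25, "learning_rate": 1e-4, "epoch": 0.1, "note": "warmup", "step": 10}
print(numeric_logs(entry))
```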

### Fine-Tuning the Model
This section describes how to set up and run the fine-tuning process using the SFTTrainer.

#### SFTTrainer
The __'SFTTrainer'__ (Supervised Fine-Tuning Trainer) is a specialized training class from the __'trl'__ library designed for fine-tuning language models with supervised data. It extends the Hugging Face __'Trainer'__ class, providing additional functionalities tailored for fine-tuning tasks, such as managing parameter-efficient fine-tuning methods like LoRA, handling large datasets, and integrating custom callbacks for logging and monitoring.

### Data Preparation
In this section, we introduce the basic idea of fine-tuning the model to generate responses that include verbalized confidence scores. The goal is to improve the model's ability to predict dialogue states and verbalize its confidence accurately, enhancing overall performance and reliability.

#### Prompt Template
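The templates in this section are filled with Python's `str.format`. Doubled braces (`{{` and `}}`) produce literal braces in the output, while each empty `{}` is replaced by an argument, in order. The miniature template below is a hypothetical example with the same structure, and the slot description and dialogue are made up.

```python
# Hypothetical miniature template with the same structure as the full prompts.
mini_prompt = """Format: {{"state": {{"_entity_":"_value_"}}}}
Values that should be captured are:
{}
input:
{}
"""

slot_description = "hotel-area: area of the hotel"  # made-up slot description
dialogue = "Customer: I need a hotel in the north."

# The first {} receives the slot description, the second the dialogue context.
filled = mini_prompt.format(slot_description, dialogue)
print(filled)
```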

In [None]:
prompt = """
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
Capture entity values from the LAST UTTERANCE of the conversation.
FOCUS ONLY ON THE VALUES MENTIONED IN THE LAST UTTERANCE.
Format the output as a valid JSON object for each entity-value pair.
Format: {{"state": {{"_entity_":"_value_"}}}}
Fill the actual entity value into the placeholder encapsulated with underscores.
Put "```" as EOS token at the end of response.
Values that should be captured are:
{}
Do not capture any other values!
If not specified, do not respond to that slot-value.

MAKE SURE TO SEPARATE EACH SLOT-VALUE PAIR.
Format the output as:
```json
[
  {{"state": {{"_entity1_": "_value1_"}}}},
  {{"state": {{"_entity2_": "_value2_"}}}}
]```

Now complete the following example, AND PROVIDE CONFIDENCE THAT IT'S CORRECT:
input: <|eot_id|>
<|start_header_id|>user<|end_header_id|>
{}
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
***Output JSON format***
Output: ```json
"""

In [None]:
prompt_confidence = """
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
Capture entity values from the LAST UTTERANCE of the conversation.
FOCUS ONLY ON THE VALUES MENTIONED IN THE LAST UTTERANCE.
Format the output as a valid JSON object, and for each entity-value pair, along with their pair-level confidence (0-1).
Format: {{"state": {{"_entity_":"_value_"}}, "confidence": "X"}}
Fill the actual entity value into the placeholder encapsulated with underscores.
Put "```" as EOS token at the end of response.
{}
Do not capture any other values!
If not specified, do not respond to that slot-value.

Provide possible entity value based on the last utterance, along with their confidence (0-1).
MAKE SURE TO SEPARATE EACH SLOT-VALUE PAIR WITH ITS CONFIDENCE PAIR.
Format the output as:
```json
[
  {{"state": {{"_entity1_":"_value1_"}}, "confidence": "X"}},
  {{"state": {{"_entity2_":"_value2_"}}, "confidence": "Y"}}
]``` 

Now complete the following example, AND PROVIDE CONFIDENCE THAT IT'S CORRECT:
input: <|eot_id|>
<|start_header_id|>user<|end_header_id|>
{}
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
***Output JSON format***
Output: ```json
"""

#### Confidence Score Generation
To generate confidence scores during fine-tuning, we assess the difficulty of predicting the dialogue state based on the user utterance and dialogue history. The prompt used for this assessment is as follows:

In [None]:
confidence_prompt = """
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant for evaluating the hardness of dialogue state tracking from last user utterance given dialogue history <|eot_id|>
<|start_header_id|>user<|end_header_id|>
How difficult would it be for a Language Model to predict the dialogue state from:
utterance: Customer: {}
given dialogue history
history:
{}

Choose the level of hardness from (Easy/Medium/Hard).
Answer:
"""

We sort the model's answer into four categories (Easy, Medium, Hard, and Other for anything else) and map each to a confidence score, sampled uniformly at random within a range for the first three:

* Easy: 0.9 to 1.0
* Medium: 0.8 to 0.9
* Hard: 0.7 to 0.8
* Other: Default score of 0.5

This mapping adds variability to the confidence scores, reflecting real-world uncertainty.

In [None]:
import random

In [None]:
def gt_confidence(model, tokenizer, streamer, utterance: str, context: str) -> float:
    filled_confidence_prompt = confidence_prompt.format(utterance, context)
    # print(filled_confidence_prompt)
    confidence_input = tokenizer(filled_confidence_prompt, return_tensors="pt").input_ids.cuda()
    outputs = response(model, "meta-llama/Meta-Llama-3-8B-Instruct", streamer, confidence_input, temperature=1)
    generated_text = tokenizer.decode(outputs['sequences'][0], skip_special_tokens=True)
    # print(generated_text)
    confidence = generated_text.split("Answer:")[1].lower()
    # print(confidence)

    if "easy" in confidence:
        return random.uniform(0.9, 1)
    elif "medium" in confidence:
        return random.uniform(0.8, 0.9)
    elif "hard" in confidence:
        return random.uniform(0.7, 0.8)
    else:
        return 0.5

#### Processing MultiWOZ Turn Information

In [None]:
import copy

def get_turn_info(dataset):
    all_turns = []
    for dialog in dataset:
        dialog_id = dialog["dialogue_id"].split('.')[0].lower()
        
        last_state = {}
        for tn in range(0, len(dialog["turns"]["utterance"]), 2):
            context = [f"Customer: {t}" if n % 2 == 0 else f"Assistant: {t}" for n, t in enumerate(dialog["turns"]["utterance"][:tn+1])]
            state = dialog["turns"]["frames"][tn]["state"]
            
            gt_domain = []
            if len(state) == 0:
                state = {}
            else:
                state = [state[i]["slots_values"] for i in range(len(state))]
                state = [{k: v[0] for k, v in zip(state[i]["slots_values_name"], state[i]["slots_values_list"])} for i in range(len(state)) if len(state[i]["slots_values_name"]) > 0]

            new_state = copy.deepcopy(last_state)
            for i in range(len(state)):
                for sl, val in state[i].items():
                    domain, name = sl.split("-")
                    if domain not in new_state:
                        new_state[domain] = {name: val}
                    else:
                        new_state[domain][name] = val
            state_update = {}
            for domain, domain_state in new_state.items():
                for slot, value in domain_state.items():
                    if slot not in last_state.get(domain, {}) or last_state[domain][slot] != value:
                        if domain not in state_update:
                            state_update[domain] = {}
                        state_update[domain][slot] = value
                        
            for domain, domain_state in state_update.items():
                gt_domain.append(domain)
                
            last_state = new_state
            
            # gt_confidence expects the dialogue history as one string, so join the turns.
            confidence_score = gt_confidence(model, tokenizer, streamer, dialog['turns']['utterance'][tn], "\n".join(context))
            
            turn = {
                "question": dialog["turns"]["utterance"][tn],
                "gt_state": copy.deepcopy(last_state), # total state
                "dialog_id": copy.deepcopy(dialog_id),
                "confidence": confidence_score, # kept so the instruction dataset can use it
                "metadata": {
                    "domain": copy.deepcopy(gt_domain),
                    "turn_state": copy.deepcopy(state_update),
                    "total_state": copy.deepcopy(last_state),
                    "context": "\n".join(context[-6:])
                }
            }
            all_turns.append(turn)
    
    return all_turns


#### Generating Instruction Dataset

In [None]:
from slot_description import DOMAIN_SLOT_DESCRIPTION

fine_tuned_confidence = True

def generate_instruction_dataset(data_point):
    gt_domain = data_point["metadata"]["domain"]
    context = data_point["metadata"]["context"]
    utterance = data_point["question"]
    turn_state = data_point["metadata"]["turn_state"]
    # Confidence score computed for this turn in get_turn_info.
    confidence = data_point["confidence"]
    
    domain_description = ""
    if gt_domain:
        for domain in gt_domain:
            domain_description += DOMAIN_SLOT_DESCRIPTION[domain]
    
    target_str = ""
    if fine_tuned_confidence:
        for domain in turn_state.keys():
            for slot, value in turn_state[domain].items():
                # End each item with ", " so the trailing separator is stripped below.
                buf = f"{{\"state\": {{\"{str(slot)}\": \"{str(value)}\"}}, \"confidence\": \"{str(confidence)}\"}}, "
                # print(buf)
                target_str += buf
    else:
        for domain in turn_state.keys():
            for slot, value in turn_state[domain].items():
                buf = "{" + "\"" + "state\": " + "{\"" + str(slot) + "\": \"" + str(value) + "\"}}, "
                target_str += buf

    if target_str.endswith(", "):
        target_str = target_str[:-2]

    target_str = "[" + target_str + "]" + "```"
    if fine_tuned_confidence:
        text = "###Prompt###" + prompt_confidence.format(domain_description, context) + "###Completion###\n" + target_str + tokenizer.eos_token
    else:
        text = "###Prompt###" + prompt.format(domain_description, context) + "###Completion###\n" + target_str + tokenizer.eos_token
    return {"text": text, "labels": target_str}
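As an aside, the completion string can also be assembled with the standard `json` module, which avoids hand-written brace escaping and guarantees that the list parses as JSON. This is an alternative sketch rather than the code used above; `build_target` is a hypothetical helper, and the turn state is made up.

```python
import json

FENCE = "`" * 3  # the three-backtick marker the prompts use to end a response


def build_target(turn_state, confidence=None):
    """Serialize slot-value pairs (optionally with a confidence) as the completion string."""
    items = []
    for domain_state in turn_state.values():
        for slot, value in domain_state.items():
            item = {"state": {slot: value}}
            if confidence is not None:
                item["confidence"] = str(confidence)
            items.append(json.dumps(item))
    return "[" + ", ".join(items) + "]" + FENCE


# Made-up turn state for illustration.
turn_state = {"hotel": {"area": "north", "stars": "4"}}
target = build_target(turn_state, confidence=0.93)
print(target)

# Everything before the closing marker parses as a JSON list.
parsed = json.loads(target[:-len(FENCE)])
print(len(parsed), "slot-value pairs")
```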


#### Processing the Dataset

In [None]:
def process_dataset(data):
    dataset = []
    for i in range(len(data)):
        dialog = data[i]
        dataset.append(generate_instruction_dataset(dialog))
    return dataset


#### Loading and Preparing the Dataset

In [None]:
from datasets import load_dataset

def get_train_valid_data(sample):
    dataset = load_dataset("multi_woz_v22")

    train_data = dataset["train"]
    valid_data = dataset["validation"]

    if sample:
        train_data = train_data.select([i for i in range(sample['train_size'])])
        valid_data = valid_data.select([i for i in range(sample['valid_size'])])

    train_turn_data = get_turn_info(train_data)
    valid_turn_data = get_turn_info(valid_data)
    
    train_turn_data = process_dataset(train_turn_data)
    valid_turn_data = process_dataset(valid_turn_data)
    
    return train_turn_data, valid_turn_data

In [None]:
sample = None
# sample = {"train_size" : 2000, "valid_size":400} 
train_data, validation_data = get_train_valid_data(sample)

In [None]:
import json
with open("train_data.json", "w") as file:
    json.dump(train_data, file)
with open("valid_data.json", "w") as file:
    json.dump(validation_data, file)

In [None]:
with open("train_data.json", "r") as file:
    train_data = json.load(file)
with open("valid_data.json", "r") as file:
    validation_data = json.load(file)

#### Fine-Tuning Process
During fine-tuning, the model is provided with both the ground truth state and the corresponding confidence score for each slot-value pair. This dual-training approach improves the model's ability to predict dialogue states and verbalize its confidence.

By integrating confidence scores, the model becomes better at predicting dialogue states with an expressed level of confidence, enhancing the overall performance and reliability of the dialogue state tracking system.
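Once the model verbalizes its confidence, the generated text has to be turned back into slot-value pairs and scores. The sketch below shows one possible parser. It assumes the input is only the newly generated continuation (a JSON list followed by the closing three-backtick marker); the function name and the sample output are hypothetical.

```python
import json

FENCE = "`" * 3  # the three-backtick marker the prompts use to end a response


def parse_prediction(generated_text):
    """Return (slot, value, confidence) tuples, or [] if the text is not a JSON list."""
    body = generated_text.split(FENCE)[0].strip()
    try:
        items = json.loads(body)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    results = []
    for item in items:
        if not isinstance(item, dict) or not isinstance(item.get("state"), dict):
            continue  # skip malformed entries
        try:
            confidence = float(item.get("confidence"))
        except (TypeError, ValueError):
            confidence = None  # missing or non-numeric confidence
        for slot, value in item["state"].items():
            results.append((slot, value, confidence))
    return results


# Made-up model output for illustration.
output = '[{"state": {"area": "north"}, "confidence": "0.93"}]' + FENCE
print(parse_prediction(output))
```

Returning an empty list on malformed output keeps evaluation code simple, at the cost of hiding why parsing failed.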

In [None]:
from trl import SFTTrainer
from datasets import Dataset, load_dataset
import os
import wandb
from transformers import TrainerCallback

In [None]:
def fine_tune_model(model, tokenizer, train_subset, valid_subset, saved_model):
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_subset,
        eval_dataset=valid_subset,
        peft_config=lora_config,
        dataset_text_field='text',
        max_seq_length=1500,
        tokenizer=tokenizer,
        args=training_arguments,
        callbacks=[CustomWandbLoggingCallback()],
    )
    trainer.train()
    
    trainer.model.save_pretrained(saved_model)
    tokenizer.save_pretrained(saved_model)
    return trainer

In [None]:
# Initialize Weights & Biases project
wandb.init(project="llama3-finetuning-DST", entity="jennysun-cs09", name="fullset_epoch1_gtconf")
wandb.config.update(training_arguments.to_dict())  # log the arguments as a plain dict

# Convert datasets to Hugging Face Dataset format
train_subset = Dataset.from_dict({key: [dic[key] for dic in train_data] for key in train_data[0]})
valid_subset = Dataset.from_dict({key: [dic[key] for dic in validation_data] for key in validation_data[0]})

# Define the output directory
saved_model = os.path.join("saved_model_5", "fullset_epoch1_gtconf")

# Fine-tune the model
# fine_tune_model returns the trainer; the fine-tuned model is available as trainer.model
trainer = fine_tune_model(model, tokenizer, train_subset, valid_subset, saved_model)

wandb.finish()