# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 7: Fine-Tuning an LLM</font>

# <font color="#003660">Notebook 1: Supervised Fine-Tuning</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>

<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... know the basics of quantization, parameter-efficient fine-tuning, and supervised fine-tuning. <br>
        ... are able to fine-tune an LLM with supervised fine-tuning using the Transformers, PEFT, and TRL libraries from Hugging Face.
    </font>
</div>
</p>

This notebook is heavily inspired by the following excellent sources:


* [TRL Llama 2 Research Projects](https://github.com/huggingface/trl/tree/main/examples/research_projects/stack_llama_2/scripts)
* [TRL Examples](https://github.com/huggingface/trl/tree/main/examples)
* All papers referred to in the notebook

## Aligning LLMs

Aligning language models with human, organizational, and social values as well as with the task at hand, also referred to as LLM alignment ([Askell et al., 2021](https://doi.org/10.48550/arXiv.2112.00861)), is usually done in the three steps below.

First, high-quality human demonstration data is sampled and used for *Supervised Fine-Tuning (SFT)* of the foundation model (the pre-trained LLM; [Bommasani et al., 2021](https://doi.org/10.48550/arXiv.2108.07258)). Then comparison data is collected and a reward model is trained on it. Finally, the reward model is used in proximal policy optimization to incorporate human feedback, called *Reinforcement Learning from Human Feedback (RLHF)* ([Ouyang et al., 2022](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf)), shown on the left of the image below.

![RLHF](https://media.licdn.com/dms/image/v2/D5612AQGdmu9B-4ALNw/article-cover_image-shrink_600_2000/article-cover_image-shrink_600_2000/0/1696967299079?e=1738800000&v=beta&t=FHiyxpUbcy5bLZrvik3WmJ66PY_OJAjN9WdAi_LyEFM)

Several variants of RLHF have been developed. One less resource-intensive and much simpler strategy is *Direct Preference Optimization (DPO)* ([Rafailov et al., 2023](https://doi.org/10.48550/arXiv.2305.18290)), shown on the right of the image above.

## Supervised Fine-Tuning (SFT)

SFT can be done with all kinds of high-quality instruction, conversational, or prompting data. Its purpose is to train the conversational or instruction-following behavior into the weights of the LLM ([Ouyang et al., 2022](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf)).
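To make the objective concrete: SFT minimizes the usual next-token cross-entropy on the demonstration data. The sketch below computes that loss for one hypothetical response, given the probabilities a model assigned to each reference token. The helper `sft_loss` and all numbers are made up for illustration; a real model derives these probabilities from its logits.

```python
import math

def sft_loss(token_probs):
    """Average negative log-likelihood of the reference tokens.

    token_probs: probability the model assigned to each correct next token
    (hypothetical values, not taken from a real model).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A confident model (high probability on each reference token) has a low loss
confident = sft_loss([0.9, 0.8, 0.95])
# An unsure model has a higher loss, so training pushes these probabilities up
unsure = sft_loss([0.2, 0.1, 0.3])
print(f"confident: {confident:.3f}, unsure: {unsure:.3f}")
```

Minimizing this quantity over many demonstrations is what moves the model towards answering in the demonstrated style.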

We will now learn how to supervised fine-tune an LLM using *Parameter-Efficient Fine-Tuning (PEFT)* ([Liu et al., 2022](https://proceedings.neurips.cc/paper_files/paper/2022/file/0cde695b83bd186c1fd456302888454c-Paper-Conference.pdf)) and *Quantization* ([Shkolnik et al., 2020](https://proceedings.neurips.cc/paper/2020/file/3948ead63a9f2944218de038d8934305-Paper.pdf)).

### Installing and loading libraries

Today we will use the libraries:

* [Datasets](https://huggingface.co/docs/datasets/index) for loading and transforming datasets
* [Transformers](https://huggingface.co/docs/transformers/index) to load and train models
* [BitsAndBytes](https://huggingface.co/docs/bitsandbytes/index) to quantize models (more on quantization later)
* [PEFT](https://huggingface.co/docs/peft/index) to parameter-efficient fine-tune the LLMs
* [TRL](https://huggingface.co/docs/trl/index) to supervised fine-tune (and later direct preference optimize) the LLMs
* [Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/index) to share our fine-tuned LLMs

In [None]:
!pip install -U datasets transformers accelerate bitsandbytes peft trl wandb

In [None]:
import os
import time
from dataclasses import dataclass, field
from typing import Optional

import torch
import numpy as np
import pandas as pd
from accelerate import Accelerator
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    is_torch_npu_available,
    is_torch_xpu_available,
    set_seed
)
from huggingface_hub import notebook_login

from trl import SFTConfig, SFTTrainer

As the next cell shows, we need many different training arguments.

Most of them can be found in the Transformers [TrainingArguments](https://huggingface.co/docs/transformers/v4.46.3/main_classes/trainer#transformers.TrainingArguments) class.

We will focus on only a few of them in the following.

In [None]:
@dataclass
class ScriptArguments:
    model_name: Optional[str] = field(default="Qwen/Qwen2-1.5B", metadata={"help": "the model name"})
    dataset_name: Optional[str] = field(default="HuggingFaceH4/ultrafeedback_binarized", metadata={"help": "the dataset name"})
    split: Optional[str] = field(default="train_sft", metadata={"help": "the split to use"})
    report_to: Optional[str] = field(default="none", metadata={"help": "the report to use"})
    seq_length: Optional[int] = field(default=4096, metadata={"help": "the sequence length"})
    seed: Optional[int] = field(default=42, metadata={"help": "the seed"})

    num_train_epochs: Optional[int] = field(default=1, metadata={"help": "number of train epochs"})
    per_device_train_batch_size: Optional[int] = field(default=1, metadata={"help": "the per device train batch size"})
    per_device_eval_batch_size: Optional[int] = field(default=1, metadata={"help": "the per device eval batch size"})
    gradient_accumulation_steps: Optional[int] = field(default=1, metadata={"help": "the gradient accumulation steps"})
    gradient_checkpointing: Optional[bool] = field(default=True, metadata={"help": "whether to use gradient checkpointing"})
    logging_steps: Optional[int] = field(default=500, metadata={"help": "the logging frequency"})
    save_strategy: Optional[str] = field(default="steps", metadata={"help": "the saving strategy"})
    save_steps: Optional[int] = field(default=500, metadata={"help": "the saving frequency"})
    eval_strategy: Optional[str] = field(default="steps", metadata={"help": "the eval strategy"})
    eval_steps: Optional[int] = field(default=500, metadata={"help": "the eval frequency"})
    bf16: Optional[bool] = field(default=True, metadata={"help": "whether to use bf16 precision"})
    fp16: Optional[bool] = field(default=False, metadata={"help": "whether to use fp16 precision"})

    # LoraConfig
    lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
    lora_dropout: Optional[float] = field(default=0.05, metadata={"help": "the lora dropout parameter"})
    lora_r: Optional[int] = field(default=8, metadata={"help": "the lora r parameter"})

    output_dir: Optional[str] = field(default="./results", metadata={"help": "the output directory"})
    push_to_hub: Optional[bool] = field(default=True, metadata={"help": "whether to push the model to hub"})
    hub_strategy: Optional[str] = field(default="checkpoint", metadata={"help": "the strategy for push to hub"})

parser = HfArgumentParser(ScriptArguments)
script_args, remaining = parser.parse_args_into_dataclasses(args=None, return_remaining_strings=True)
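Two of these arguments interact directly: `per_device_train_batch_size` and `gradient_accumulation_steps` together determine how many sequences contribute to one optimizer step. The sketch below works through that arithmetic. The helpers `effective_batch_size` and `approx_optimizer_steps` are hypothetical, a single GPU is assumed, and the Trainer's exact rounding of the step count may differ.

```python
import math

def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences that contribute to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices

def approx_optimizer_steps(num_sequences, batch_size, num_train_epochs=1):
    """Approximate optimizer steps, counting a final partial batch as one step."""
    return math.ceil(num_sequences / batch_size) * num_train_epochs

# Defaults from ScriptArguments: batch size 1, no accumulation, one GPU assumed
default_bs = effective_batch_size(1, 1)
# Accumulating over 8 steps mimics a batch of 8 without holding 8 sequences in memory at once
accumulated_bs = effective_batch_size(1, 8)
print(default_bs, accumulated_bs, approx_optimizer_steps(1000, accumulated_bs))
```

On small GPUs, raising `gradient_accumulation_steps` is usually the cheaper way to get a larger effective batch, at the cost of fewer optimizer steps per epoch.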

### (Q-)LoRA

*(Quantized) Low Rank Adaptation ((Q)LoRA)* is a concept first introduced by [Hu et al. (2021)](https://openreview.net/forum?id=nZeVKeeFYf9), shown in the image from their paper below.

![LoRA](https://media.datacamp.com/legacy/v1705430151/image4_b814637cd2.png)

This was extended with quantization by [Dettmers et al. (2023)](https://dl.acm.org/doi/10.5555/3666122.3666563), shown in the image below.

![(Q)Lora](https://miro.medium.com/v2/resize:fit:720/format:webp/1*tMaufQKw0Boq4pYlOWDLBg.png)

(Here is a link to the [Virtual Poster](https://neurips.cc/virtual/2023/poster/71815) and this helpful article for [LoRA](https://ritvik19.medium.com/papers-explained-lora-a48359cecbfa))

To add LoRA adapters, we need to set ``lora_r`` (the rank of the adapter matrices), ``lora_alpha`` (the adapter output is scaled by $\frac{lora_{alpha}}{lora_{r}}$), and ``lora_dropout`` (the dropout applied in the adapter layers).

Finally, we need to decide which linear layers of the LLM should be extended by adapters, usually the attention projection layers $W_q$ (query) and $W_v$ (value), and sometimes $W_k$ (key).

```
    lora_alpha: Optional[float] = field(default=16, metadata={"help": "the lora alpha parameter"})
    lora_dropout: Optional[float] = field(default=0.05, metadata={"help": "the lora dropout parameter"})
    lora_r: Optional[int] = field(default=8, metadata={"help": "the lora r parameter"})
```
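To get a feeling for these parameters, the following sketch counts how many weights a LoRA adapter trains compared to the full layer and computes the scaling factor. The layer size of 1024 x 1024 is a hypothetical value chosen for illustration (real sizes depend on the model), and `lora_trainable_params` is a helper introduced only for this example.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters of the two adapter matrices B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)

# Hypothetical square projection layer; real layer sizes depend on the model
d_out, d_in = 1024, 1024
lora_r, lora_alpha = 8, 16  # defaults from ScriptArguments

full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, lora_r)
scaling = lora_alpha / lora_r

print(f"full layer: {full} params, adapter: {adapter} params ({100 * adapter / full:.2f}%)")
print(f"adapter update is scaled by {scaling}")
```

Doubling `lora_r` doubles the number of trainable adapter weights and, for a fixed `lora_alpha`, halves the scaling factor.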

In [None]:
set_seed(script_args.seed)

### Quantization

We have mentioned quantization several times now, but what is it?

*Quantization* refers to representing numbers, here the model weights, with fewer bits, which reduces their precision but also the memory they need.

![FP8](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/bitsandbytes/FP8-scheme.png)

(Image sources: [sgugger](https://huggingface.co/sgugger), [BitsAndBytes FP4 Blog Post](https://huggingface.co/blog/4bit-transformers-bitsandbytes))

Unfortunately, we cannot train the weights directly in FP8 or FP4 precision, but we can keep the quantized model frozen, which uses far fewer resources, and train LoRA adapters on top of it.
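The idea of trading precision for memory can be sketched with plain absmax rounding to 4-bit integers. This is a deliberately simplified illustration and not the NF4 scheme configured in this notebook, which uses non-uniformly spaced levels; the helper names and the weight values are made up.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(q_values, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [q * scale for q in q_values]

weights = [0.12, -0.53, 0.98, -0.05, 0.40]  # made-up weight values
q, scale = quantize_absmax(weights, bits=4)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, [round(r, 2) for r in restored], round(max_error, 3))
```

Only the small integers and one scale per block of weights need to be stored; the rounding error is the price paid for the reduced memory.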

In [None]:
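bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Load the model in 4-bit mode
    bnb_4bit_quant_type="nf4", # Use the 4-bit NormalFloat (NF4) data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16, # Dequantize to bfloat16 for the actual computations
    bnb_4bit_use_double_quant=True, # Also quantize the quantization constants to save more memory
)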

In [None]:
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Fix weird overflow issue with fp16 training

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name,
    quantization_config=bnb_config, # here we apply quantization
    device_map="auto",
    trust_remote_code=True,
    token=True,
)
base_model.config.use_cache = False

Now let's check out the estimated memory needed to load our model at different precisions.

In [None]:
!accelerate estimate-memory --library_name transformers Qwen/Qwen2-1.5B
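As a rough cross-check of such estimates, the memory for the weights alone can be approximated as the number of parameters times the bits per parameter. The sketch below assumes roughly 1.5 billion parameters, as the model name suggests. It ignores activations, gradients, optimizer state, quantization constants, and layers that stay unquantized, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 1.5e9  # assumption: roughly 1.5 billion parameters, as the model name suggests
for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(num_params, bits):.2f} GB")
```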

And let's run our quantized model to check whether the quantization worked well.

In [None]:
inputs = tokenizer("how can i develop a habit of drawing daily", return_tensors="pt")
inputs = inputs.to("cuda")
res = base_model.generate(**inputs, max_new_tokens=128, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(res[0]))

Wow, it worked.

Next, to use SFT, we need to apply some kind of "chat template" to the training data.
This is the simplest chat template, using

```
"role": "message"
```

In [None]:
def prepare_sample_text(example, tokenizer):
    """Prepare the text from a sample of the dataset."""
    text = ""
    for message in example["chosen"]:
        text += f"{message['role'].capitalize()}: {message['content']}{tokenizer.eos_token}\n\n"
    return text

In [None]:
example = {
    "chosen": [
        {"role": "user", "content": "What is the meaning of life?"},
        {"role": "assistant", "content": "The meaning of life is to be happy."},
    ]
}
print(prepare_sample_text(example, tokenizer))
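Because the exact end-of-sequence string depends on the tokenizer, the following sketch reproduces the same formatting without loading one. `format_conversation` is a hypothetical stand-in for `prepare_sample_text`, and `"<eos>"` is a placeholder for whatever `tokenizer.eos_token` contains.

```python
def format_conversation(messages, eos_token):
    """Render messages as 'Role: content<eos>' blocks separated by blank lines."""
    return "".join(
        f"{m['role'].capitalize()}: {m['content']}{eos_token}\n\n" for m in messages
    )

messages = [
    {"role": "user", "content": "What is the meaning of life?"},
    {"role": "assistant", "content": "The meaning of life is to be happy."},
]
# "<eos>" is a placeholder; the real string comes from tokenizer.eos_token
text = format_conversation(messages, eos_token="<eos>")
print(text)
```

The end-of-sequence token after every turn is what teaches the model to stop generating once its answer is complete.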

In [None]:
def chars_token_ratio(dataset, tokenizer, nb_examples=400):
    """
    Estimate the average number of characters per token in the dataset.
    """
    total_characters, total_tokens = 0, 0
    for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):
        text = prepare_sample_text(example, tokenizer)
        total_characters += len(text)
        if tokenizer.is_fast:
            total_tokens += len(tokenizer(text).tokens())
        else:
            total_tokens += len(tokenizer.tokenize(text))

    return total_characters / total_tokens


def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
def create_datasets(tokenizer, args):
    dataset = load_dataset(
        args.dataset_name,
        split=args.split
    )

    dataset = dataset.train_test_split(test_size=0.005, seed=args.seed)
    train_dataset = dataset["train"]
    valid_dataset = dataset["test"]
    print(f"Size of the train set: {len(train_dataset)}. Size of the validation set: {len(valid_dataset)}")

    chars_per_token = chars_token_ratio(train_dataset, tokenizer)
    print(f"The character to token ratio of the dataset is: {chars_per_token:.2f}")
    return train_dataset, valid_dataset

Let's briefly check out our dataset:

[ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)

In [None]:
train_dataset, eval_dataset = create_datasets(tokenizer, script_args)
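# remove this to train on whole data
# shuffle + select samples without replacement, so no example is drawn twice
train_dataset = train_dataset.shuffle(seed=script_args.seed).select(range(min(1000, len(train_dataset))))
eval_dataset = eval_dataset.shuffle(seed=script_args.seed).select(range(min(100, len(eval_dataset))))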

Here we apply QLoRA (quantized, because we use a quantized model).

In [None]:
peft_config = LoraConfig(
    r=script_args.lora_r,
    lora_alpha=script_args.lora_alpha,
    lora_dropout=script_args.lora_dropout,
    target_modules=["q_proj", "v_proj"], # here we set the target modules for PEFT/QLoRA
    bias="none",
    task_type="CAUSAL_LM",
)

In [None]:
hub_model_id=f"{script_args.model_name.split('/')[-1]}-{script_args.dataset_name.split('/')[-1]}-sft-CLASS"
training_args = SFTConfig(
    seed=script_args.seed,
    output_dir=script_args.output_dir,
    push_to_hub=script_args.push_to_hub,
    hub_strategy=script_args.hub_strategy,
    hub_model_id=hub_model_id,
    report_to=script_args.report_to,
    num_train_epochs=script_args.num_train_epochs,
    per_device_train_batch_size=script_args.per_device_train_batch_size,
    per_device_eval_batch_size=script_args.per_device_eval_batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    gradient_checkpointing=script_args.gradient_checkpointing,
    max_seq_length=script_args.seq_length,
    logging_steps=script_args.logging_steps,
    save_steps=script_args.save_steps,
    save_strategy=script_args.save_strategy,
    eval_steps=script_args.eval_steps,
    eval_strategy=script_args.eval_strategy,
    bf16=script_args.bf16,
    fp16=script_args.fp16,
    packing=True,
)
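The `packing=True` option deserves a short explanation: instead of padding every example to the maximum sequence length, several examples are concatenated and cut into chunks of equal length. The sketch below illustrates the idea with made-up token ids. It is a simplified illustration, not TRL's actual implementation, and `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(tokenized_examples, seq_length, eos_id):
    """Concatenate examples (separated by eos_id) and cut into full chunks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_id])
    # Drop the incomplete tail so every chunk has exactly seq_length tokens
    n_chunks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]  # made-up token ids
packed = pack_sequences(examples, seq_length=4, eos_id=0)
print(packed)
```

With packing, hardly any compute is wasted on padding tokens, but one training sequence can contain the end of one conversation and the start of the next.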

Now that the dataset, the SFT training arguments, and the LoRA configuration are defined, we can start the training.

In [None]:
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    formatting_func=lambda x: prepare_sample_text(x, tokenizer),
    processing_class=tokenizer,
    args=training_args,
)
trainer.train()

After training has finished, we can push the trained model, together with the training information, to the Hub.

In [None]:
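trainer.push_to_hub()  # the target repo is already set via hub_model_id in training_args; the first positional argument would be the commit message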

Let's check out the model: [Model](https://huggingface.co/skaltenp/Qwen2-1.5B-ultrafeedback_binarized-sft)

In [None]:
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="skaltenp/Qwen2-1.5B-ultrafeedback_binarized-sft", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

## Merge and Unload the Model

After applying (Q)LoRA, we may want to continue training the model on other data or (as in our case) proceed with the next step, DPO. To do so, we first need to merge the adapter weights into the base model to get one model without adapters.

![Lora Merging](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png)
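Merging works because the adapter is just an additive low-rank update: the merged weight is $W + \frac{lora_{alpha}}{lora_{r}} B A$. The sketch below checks on a tiny made-up example that a forward pass through the merged matrix gives the same result as the base layer plus the separate adapter. All matrices and the helpers `matmul` and `merge_lora` are hypothetical and only serve the illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, B, A, lora_alpha, lora_r):
    """Return W + (lora_alpha / lora_r) * B @ A as a single weight matrix."""
    scaling = lora_alpha / lora_r
    delta = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Made-up 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [1.0]]
A = [[2.0, -1.0]]
x = [[3.0], [4.0]]  # input as a column vector

merged = merge_lora(W, B, A, lora_alpha=2, lora_r=1)
# Forward pass with a separate adapter: W x + scaling * B (A x), scaling = 2 / 1
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
with_adapter = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(merged, matmul(merged, x), with_adapter)
```

After merging, the adapter matrices are no longer needed at inference time, which is what `merge_and_unload()` exploits.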

In [None]:
del base_model
if is_torch_xpu_available():
    torch.xpu.empty_cache()
elif is_torch_npu_available():
    torch.npu.empty_cache()
else:
    torch.cuda.empty_cache()

model = AutoPeftModelForCausalLM.from_pretrained(
    f"skaltenp/{hub_model_id}",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model = model.merge_and_unload()

model.push_to_hub(
    f"skaltenp/{hub_model_id}-merged",
    safe_serialization=True
)
tokenizer.push_to_hub(f"skaltenp/{hub_model_id}-merged")

### Key findings:
* SFT needs high-quality human data
* To run and fine-tune models on small GPUs, you can use quantization and (Q)LoRA
* To get one model instead of adapters on top of an existing model, you can merge the weights

But:
* There are various other PEFT approaches (e.g., [ReFT](https://github.com/stanfordnlp/pyreft)).
* There are other interesting fine-tuning strategies (e.g., [Prefix-Tuning](https://doi.org/10.48550/arXiv.2101.00190)).