# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 7: Fine-Tuning an LLM</font>

# <font color="#003660">Notebook 1: Incorporating Human Preferences</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>

<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... know the basics of Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) <br>
        ... are able to fine-tune an LLM with DPO using the Transformers, PEFT, and TRL libraries from Hugging Face.
    </font>
</div>
</p>

The content of this notebook is heavily inspired by these excellent sources:


* [TRL Llama 2 Research Projects](https://github.com/huggingface/trl/tree/main/examples/research_projects/stack_llama_2/scripts)

* Various papers referenced in the markdown text

## DPO
In the last notebook, we already talked about DPO.

*Direct Preference Optimization (DPO)* ([Rafailov et al., 2023](https://doi.org/10.48550/arXiv.2305.18290)) is shown on the right of the figure below.

![RLHF](https://media.licdn.com/dms/image/v2/D5612AQGdmu9B-4ALNw/article-cover_image-shrink_600_2000/article-cover_image-shrink_600_2000/0/1696967299079?e=1738800000&v=beta&t=FHiyxpUbcy5bLZrvik3WmJ66PY_OJAjN9WdAi_LyEFM)

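In RLHF, a separate reward model provides the training signal, for example by ranking five model answers. DPO needs no reward model: its loss is computed directly from one chosen and one rejected answer, comparing how strongly the trained model prefers the chosen answer relative to a frozen reference model.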

![DPO](https://miro.medium.com/v2/resize:fit:720/format:webp/1*AqKOT0pxzi5kOgiobb-Fvg.png)


(Image source: [João Lages Blog](https://medium.com/@joaolages/direct-preference-optimization-dpo-622fc1f18707))
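
To make the comparison concrete, the DPO loss for a single preference pair can be sketched in plain Python. This is a minimal illustration of the formula from Rafailov et al. (2023), not the TRL implementation: the function name `dpo_loss`, the example log-probabilities, and the value `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how much more likely the trained model makes each
    # answer compared to the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Illustrative numbers only: the policy favors the chosen answer more than the reference does.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
# If policy and reference agree exactly, the loss equals log(2).
print(dpo_loss(-13.0, -14.0, -13.0, -14.0), math.log(2))
```

The loss falls below log(2) when the trained model prefers the chosen answer more strongly than the reference model does, and rises above it in the opposite case.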

## Installation and Setup

In [None]:
!pip install -U datasets transformers accelerate bitsandbytes peft trl wandb

In [None]:
import torch
from dataclasses import dataclass, field
from typing import Optional
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed

import numpy as np
import pandas as pd
import wandb

from trl import (
    DPOConfig,
    DPOTrainer,
    ModelConfig,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE

In [None]:
@dataclass
class ScriptArguments:
    dataset_name: Optional[str] = field(
        default="trl-lib/ultrafeedback_binarized",
        metadata={"help": "Name of the dataset to use"}
    )
    no_remove_unused_columns: Optional[bool] = field(
        default=True,
        metadata={"help": "Whether to keep unused columns from the dataset"}
    )
    ignore_bias_buffers: Optional[bool] = field(
        default=False,
        metadata={"help": "Whether to ignore bias buffers"}
    )

parser = TrlParser((ScriptArguments, ))
script_args, remaining = parser.parse_args_and_config(return_remaining_strings=True)

Here again, we set up LoRA and define the training parameters.

In [None]:
model_config = ModelConfig(
    model_name_or_path="Qwen/Qwen2-1.5B-Instruct",
    use_peft=True,
    lora_r=32,
    lora_alpha=16,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

training_args = DPOConfig(
    learning_rate=5.0e-6,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=25,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    push_to_hub=True,
    output_dir="Qwen2-1.5B-DPO",
    hub_model_id="Qwen2-1.5B-DPO",
    report_to="none"
)
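
As a quick sanity check, the settings above imply a LoRA scaling factor and an effective batch size that can be worked out by hand. The sketch below copies the values from the two configuration objects; `num_devices = 1` is an assumption (a single GPU), and the variable names are only illustrative.

```python
# Values copied from the ModelConfig and DPOConfig cells above
lora_r, lora_alpha = 32, 16
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1  # assumption: training on a single GPU

# Standard LoRA scales the low-rank update by alpha / r
lora_scaling = lora_alpha / lora_r

# Number of preference pairs that contribute to one optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

print(f"LoRA scaling factor: {lora_scaling}")
print(f"Preference pairs per optimizer step: {effective_batch_size}")
```

With these values, the adapter update is scaled by 0.5 and each optimizer step averages over 4 preference pairs.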

In [None]:
torch_dtype = (
    model_config.torch_dtype
    if model_config.torch_dtype in ["auto", None]
    else getattr(torch, model_config.torch_dtype)
)
quantization_config = get_quantization_config(model_config)
model_kwargs = dict(
    revision=model_config.model_revision,
    attn_implementation=model_config.attn_implementation,
    torch_dtype=torch_dtype,
    use_cache=False if training_args.gradient_checkpointing else True,
    device_map=get_kbit_device_map() if quantization_config is not None else None,
    quantization_config=quantization_config,
)
model = AutoModelForCausalLM.from_pretrained(
    model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, **model_kwargs
)
peft_config = get_peft_config(model_config)
if peft_config is None:
    ref_model = AutoModelForCausalLM.from_pretrained(
        model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, **model_kwargs
    )
else:
    ref_model = None
tokenizer = AutoTokenizer.from_pretrained(
    model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
if tokenizer.chat_template is None:
    tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
if script_args.ignore_bias_buffers:
    # torch distributed hack
    model._ddp_params_and_buffers_to_ignore = [
        name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
    ]

Let's check out the dataset:

[ultrafeedback_binarized](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized)

In [None]:
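set_seed(42)  # also seeds NumPy, so the sample below is reproducible
dataset = load_dataset(script_args.dataset_name)
# Draw 1,000 distinct training examples. The length must be taken from the
# "train" split: len() of the full DatasetDict only counts its splits.
train_split = dataset["train"]
sample_indices = np.random.choice(len(train_split), 1000, replace=False).tolist()
dataset = train_split.select(sample_indices)
dataset = dataset.train_test_split(test_size=0.05, seed=42)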

In [None]:
import json

print(json.dumps(dataset["test"][1], indent=4))
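
To read the printed record, it helps to see how a preference pair can be separated into a prompt and two competing answers. The sketch below assumes that each record stores `chosen` and `rejected` as lists of chat messages that start with the same prompt turn(s); the toy record and the helper `split_shared_prompt` are hypothetical and written for illustration, so the real record printed above may contain additional fields.

```python
def split_shared_prompt(record):
    """Split a preference record into the shared prompt and the two answers."""
    chosen, rejected = record["chosen"], record["rejected"]
    # Count the leading messages that are identical in both conversations.
    n = 0
    while n < min(len(chosen), len(rejected)) and chosen[n] == rejected[n]:
        n += 1
    return {"prompt": chosen[:n], "chosen": chosen[n:], "rejected": rejected[n:]}


# Hypothetical toy record, not taken from the dataset
toy_record = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "I cannot do math."},
    ],
}

parts = split_shared_prompt(toy_record)
print(parts["prompt"])
print(parts["chosen"])
print(parts["rejected"])
```

DPO only ever compares the two answers to the same prompt, which is why both conversations must begin identically.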

In [None]:
trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"] if training_args.eval_strategy != "no" else None,
    processing_class=tokenizer,
    peft_config=peft_config,
)

In [None]:
trainer.train()

if training_args.eval_strategy != "no":
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

# Save and push to hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
    trainer.push_to_hub(dataset_name=script_args.dataset_name)

In [None]:
from transformers import pipeline

question = "Would you say that Donald Trump is an idiot?"
generator = pipeline("text-generation", model="skaltenp/Qwen2-1.5B-DPO", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

In [None]:
question = "Use the pygame library to write a version of the classic game Snake, with a unique twist"
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

## Merge and Unload

Your turn! Merge the trained LoRA adapter into the base model and upload the merged model again.