### MHP Applied science group
# RLHF Hackathon: DPO

<div style="text-align: center;">
    <img src="images/dpo.png" alt="Overview of Direct Preference Optimization (DPO)" style="display: block; margin-left: auto; margin-right: auto;width:800px">
    <p style="text-align:center">Read more about the DPO algorithm in the <a href="https://arxiv.org/abs/2305.18290">original paper</a>.</p>
</div>

Direct Preference Optimization (DPO) is a technique for fine-tuning models by optimizing directly on human preference data. It aims to align the model's outputs more closely with what humans consider high-quality or relevant. Unlike classic RLHF, DPO does not train a separate reward model and does not run a reinforcement learning loop: the model is trained directly on pairs of preferred and rejected responses. To apply DPO to an LLM, we follow these steps:

 1.	Collecting Preference Data: The first step is to gather a dataset of human preferences. This can be done through surveys, user interactions, or expert annotations in which humans compare different outputs for the same prompt and mark which one they prefer. Each record therefore contains a prompt, a chosen response, and a rejected response.
 2.	Setting Up a Reference Model: DPO needs no explicit reward function. Instead, a frozen copy of the starting model (typically the supervised fine-tuned model) serves as a reference. The reward is implicit: it is measured by how much more likely the trained model makes a response than the reference model does.
 3.	Training the Model: The model is trained with a simple classification-style loss that raises the likelihood of the chosen response relative to the rejected one, while the `beta` parameter controls how far the model may drift from the reference. This is ordinary gradient-based training on the model's parameters; no reinforcement learning is involved.
 4.	Evaluating and Refining: After training, the model's outputs are evaluated against human preferences to ensure alignment. If necessary, the preference data can be refined, and the model can be further tuned to improve its performance.
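
The training objective behind step 3 can be sketched in a few lines of plain Python. This is only an illustration of the per-example loss from the DPO paper, not the code that the trainer runs internally. The function name `dpo_loss` and its argument names are made up for this sketch; each argument is assumed to be the summed log-probability of a full response under the trained (policy) model or the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities (illustrative)."""
    # Implicit rewards: how much more likely the policy makes each
    # response compared with the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: margin is 0, loss equals log(2).
loss_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy now favors the chosen answer more than the reference does: loss drops.
loss_better = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(loss_start, loss_better)
```

The loss only falls when the policy widens the gap between chosen and rejected responses relative to the reference, which is why no separate reward model is needed.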


In [None]:
# from unsloth import PatchDPOTrainer
# PatchDPOTrainer()

In [None]:
from unsloth import FastLanguageModel
import torch

### Load fine tuned Model

In [None]:

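max_seq_length = 4096  # Maximum context length (in tokens) used for training
dtype = None           # None lets unsloth auto-detect float16 or bfloat16
load_in_4bit = True    # Load the weights 4-bit quantized to reduce memory usage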

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "phi_3_sft_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

### Load Dataset

We need to load a preference dataset (also called an RLHF or reward dataset). We do not use our own reward dataset because it contains only a few questions. A good preference dataset should cover the full range of cases in which humans prefer one answer over another. For this purpose, we load a dataset from the Hugging Face Hub.

In [None]:
from datasets import load_dataset
import random
random.seed(711)
sample_size = 100
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
random_indices = random.sample(range(len(dataset)), sample_size)

<div style="border: 2px solid red; padding: 10px; border-radius: 5px; background-color: #ffe6e6;">
    <strong>Wait!</strong> But the dataset and Chat Format are not aligned!
</div>

To align the dataset with the input the LLM expects, we need to update the dataset format. We also need to restructure each record into the form the DPO trainer expects for training. That sounds complicated, right? No worries, let's break it down.

#### LLM Chat Format
As we already saw in the previous notebook, the LLM expects text where the beginnings and ends of prompts and responses are marked with special tokens. To find out which tokens `Phi-3` uses, take a look at its [model card](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct#chat-format). Find the special tokens and complete the function.

#### DPO Dataset Format
Now we need to change the records to match what the DPO trainer expects. Based on the [DPO documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer), DPO expects three entries for each record as follows:
- **prompt**: `<s>` + `<user_token>` + prompt_text + `<end_token>` + `<assistant_token>`
- **chosen**: chosen_text + `<end_token>`
- **rejected**: rejected_text + `<end_token>`
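
To make the target layout concrete, here is a minimal sketch that builds one record by hand. The markers `<USER>`, `<ASSISTANT>`, and `<END>` are deliberately made-up placeholders, not the real Phi-3 tokens (finding those is the task below), and `to_dpo_record` is a hypothetical helper. The leading `<s>` is left out on the assumption that the tokenizer adds it during tokenization.

```python
# Placeholder markers for illustration only, NOT the real Phi-3 special tokens.
USER, ASSISTANT, END = "<USER>", "<ASSISTANT>", "<END>"

def to_dpo_record(prompt_text, chosen_text, rejected_text):
    """Build the three string fields that a DPO record consists of."""
    return {
        # The prompt ends with the assistant marker so the response can follow it.
        "prompt": f"{USER}\n{prompt_text}{END}\n{ASSISTANT}",
        # Each response is closed with the end marker.
        "chosen": f"{chosen_text}{END}\n",
        "rejected": f"{rejected_text}{END}\n",
    }

record = to_dpo_record("What is 2 + 2?", "2 + 2 equals 4.", "It is 5.")
for key, value in record.items():
    print(f"{key}: {value!r}")
```

Concatenating `prompt` with either `chosen` or `rejected` yields one complete chat turn, which is what the model sees during training.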

In [None]:
# TASK 5: find out chat format
def apply_chat_template(
    example,
    assistant_token= # UPDATE HERE,
    user_token = # UPDATE HERE,
    eos_token = # UPDATE HERE,
    bos_token = "<s>",
):
    if all(k in example.keys() for k in ("chosen", "rejected")):
        # TODO: handle case where chosen/rejected also have system messages
        chosen_messages = example["chosen"][1:]
        rejected_messages = example["rejected"][1:]
        example["text_prompt"] = f"{user_token}\n{example['prompt']}{eos_token}\n{assistant_token}"
        example["text_chosen"] = f"{chosen_messages[0]['content']}{eos_token}\n"
        example["text_rejected"] = f"{rejected_messages[0]['content']}{eos_token}\n"
        
    return example


We use only the 100 randomly sampled examples. Next, we convert them to the format described above.

In [None]:
# Select the samples
sampled_dataset = dataset.select(random_indices)
column_names = sampled_dataset.column_names

In [None]:
sampled_dataset = sampled_dataset.map(apply_chat_template)

In [None]:
sampled_dataset = sampled_dataset.remove_columns(column_names)
sampled_dataset = sampled_dataset.rename_columns(
        {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
    )
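
Before training, it can help to confirm that every converted record really has the three expected string fields. The following is a small sketch of such a check. `check_dpo_record` is a hypothetical helper written for this notebook, not part of `trl` or `datasets`, and it works on plain dictionaries.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def check_dpo_record(record):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str):
            problems.append(f"{key}: expected str, got {type(value).__name__}")
        elif not value.strip():
            problems.append(f"{key}: empty string")
    return problems

good = {"prompt": "question", "chosen": "good answer", "rejected": "bad answer"}
bad = {"prompt": "question", "chosen": "", "rejected": None}
print(check_dpo_record(good))
print(check_dpo_record(bad))
```

Applied to the notebook's data, `all(not check_dpo_record(r) for r in sampled_dataset)` should evaluate to `True` once the conversion has worked.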

In [None]:
# Let's see what we have done.
import pprint
row = sampled_dataset[1]
print("*" * 10)  # separator line
pprint.pprint(row["prompt"])
print("*" * 10)
pprint.pprint(row["chosen"])
print("*" * 10)
pprint.pprint(row["rejected"])

### PEFT

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
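
To get a feeling for what `r` and `lora_alpha` mean, the sketch below counts the extra parameters LoRA adds to a single linear layer and computes the scaling factor applied to the LoRA update. The layer size of 3072 x 3072 is an assumed example value, not read from the loaded model, and `lora_extra_params` is a hypothetical helper.

```python
import math

def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one frozen (d_out x d_in) weight."""
    # LoRA learns two small matrices: A with shape (r, d_in) and B with shape (d_out, r).
    return r * (d_in + d_out)

r, lora_alpha = 64, 64
d_in = d_out = 3072  # assumed example layer size

full_params = d_in * d_out
extra_params = lora_extra_params(d_in, d_out, r)
scaling = lora_alpha / r                   # standard LoRA (use_rslora = False)
rs_scaling = lora_alpha / math.sqrt(r)     # rank-stabilized LoRA (use_rslora = True)

print(f"frozen weight:   {full_params:,} parameters")
print(f"LoRA adapters:   {extra_params:,} parameters "
      f"({extra_params / full_params:.1%} of the frozen weight)")
print(f"scaling factors: {scaling} (LoRA), {rs_scaling} (rsLoRA)")
```

With `lora_alpha` equal to `r`, the standard scaling factor is 1, so the adapter output is added to the frozen layer's output unscaled.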

In [None]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

In [None]:
from transformers import TrainingArguments
from trl import DPOTrainer
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 2,
        warmup_ratio = 0.1,
        num_train_epochs = 5,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = sampled_dataset,
    # eval_dataset = raw_datasets["test"],
    tokenizer = # TODO
    max_length = 1024,
    max_prompt_length = 512,
)
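
The training arguments above determine how many optimizer updates the run performs. The following back-of-the-envelope sketch works this out for our 100 examples. It assumes a single GPU, and the Trainer's exact rounding can differ slightly between `transformers` versions, so treat the numbers as estimates.

```python
import math

num_examples = 100
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 5
warmup_ratio = 0.1
num_devices = 1  # assumption: single-GPU training

# Number of examples that contribute to one optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
# Updates during which the learning rate ramps up linearly to its peak.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      about {total_steps} ({steps_per_epoch} per epoch)")
print(f"warmup steps:         about {warmup_steps}")
```

With `logging_steps = 5`, a run of this length produces only around a dozen log entries, so the loss curve will look coarse.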

In [None]:
dpo_trainer.train()

## Done!