### MHP Applied science group
# RLHF Hackathon: DPO

<div style="text-align: center;">
    <img src="../images/dpo.png" alt="Supervised Fine-tuning steps" style="display: block; margin-left: auto; margin-right: auto;width:800px">
    <p style="text-align:center">Read more about the DPO algorithm in the <a href="https://arxiv.org/abs/2305.18290">original paper</a>.</p>
</div>

Direct Preference Optimization (DPO) is a technique for fine-tuning models by optimizing directly for human preferences. It aims to align the model’s outputs more closely with what humans consider high-quality or relevant, improving the overall user experience. To apply DPO to an LLM, we follow these steps:

 1. Collecting Preference Data: The first step involves gathering a dataset of human preferences. This can be done through surveys, user interactions, or expert annotations where humans rank or score different outputs of the model based on their quality or relevance.
 2. Defining the Objective: Unlike classic RLHF, DPO does not train a separate reward model. The preference data defines an implicit reward, namely how much more likely the fine-tuned model makes a response compared with a frozen reference model. This implicit reward serves as the objective for optimization.
 3. Training the Model: The model’s parameters are adjusted so that preferred (chosen) outputs become more likely than dispreferred (rejected) ones. This is done with ordinary gradient-based training on a classification-style loss over preference pairs, so no reinforcement learning loop is needed.
 4. Evaluating and Refining: After training, the model’s outputs are evaluated against human preferences to ensure alignment. If necessary, the preference data can be refined, and the model can be further tuned to improve its performance.
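
The objective behind steps 2 and 3 can be sketched for a single preference pair. The following is a minimal illustration of the loss from the DPO paper, assuming we already have the summed log-probabilities of the chosen and rejected responses under the model being trained (the policy) and under the frozen reference model. The function name `dpo_loss` and its argument names are hypothetical and not part of any library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more likely each response became relative to the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Before any training the policy equals the reference, so the margin is zero.
initial_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy has shifted probability mass toward the chosen response.
improved_loss = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(initial_loss, improved_loss)
```

When the policy and the reference agree, the loss equals log 2 (about 0.693), which is why DPO training losses typically start near that value and decrease as the model learns to prefer the chosen responses.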


In [1]:
from unsloth import FastLanguageModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


### Load fine tuned Model

In [2]:

max_seq_length = 4096 
dtype = None 
load_in_4bit = True 

# We start from the Phi-3 model, as it is already fine-tuned on a large number of datasets
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-mini-4k-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth: Fast Mistral patching release 2024.6
   \\   /|    GPU: NVIDIA A10G. Max memory: 22.191 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Load Dataset

We need to load a preference dataset (also called an RLHF or reward dataset). We don’t use our own reward dataset because it contains too few questions. A good preference dataset should cover the range of cases in which we, as humans, prefer one answer over another. For this purpose, we load a dataset from the Hugging Face Hub.

In [14]:
from datasets import load_dataset
import random
random.seed(711)
sample_size = 50
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
random_indices = random.sample(range(len(dataset)), sample_size)

<div style="border: 2px solid red; padding: 10px; border-radius: 5px; background-color: #ffe6e6;">
    <strong>Wait!</strong> But the dataset and Chat Format are not aligned!
</div>

To align the dataset with the input the LLM expects, we need to update the dataset format. We also need to restructure the records into the form the `DPO` trainer expects for training. It got complicated, right? No worries, let's break it down.

#### LLM Chat Format
As we already saw in the previous notebook, the LLM expects text where the beginnings and ends of prompts and responses are marked with special tokens. To find out which tokens `Phi-3` uses, take a look at its [model card](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct#chat-format). Find the special tokens and complete the function.

#### DPO Dataset Format
Now we need to change the records to match what the DPO trainer expects. Based on the [DPO documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer), DPO expects three entries for each record as follows:
- **prompt**: `<s>` + `<user_token>` + prompt_text + `<end_token>` + `<assistant_token>`
- **chosen**: chosen_text + `<end_token>`
- **rejected**: rejected_text + `<end_token>`
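
To make the target format concrete, here is a small sketch that builds one record from plain strings. The helper `to_dpo_record` and the toy question are hypothetical. The token arguments are the same placeholders used in this notebook and must be replaced with the real Phi-3 tokens from the model card. The newline placement mirrors the `apply_chat_template` function defined in the next cell.

```python
def to_dpo_record(prompt_text, chosen_text, rejected_text,
                  user_token="<user_token>",            # placeholder, see model card
                  assistant_token="<assistant_token>",  # placeholder, see model card
                  end_token="<end_token>",              # placeholder, see model card
                  bos_token="<s>"):
    """Hypothetical helper: build one prompt/chosen/rejected record."""
    return {
        # The prompt ends with the assistant token so the response continues from it.
        "prompt": f"{bos_token}{user_token}\n{prompt_text}{end_token}\n{assistant_token}",
        "chosen": f"{chosen_text}{end_token}\n",
        "rejected": f"{rejected_text}{end_token}\n",
    }

record = to_dpo_record("What is 2 + 2?", "2 + 2 equals 4.", "5")
for key, value in record.items():
    print(f"{key}: {value!r}")
```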

In [15]:
# TASK 5: find out chat format
def apply_chat_template(
    example,
    assistant_token= "<assistant_token>", # UPDATE HERE,
    user_token = "<user_token>", # UPDATE HERE,
    eos_token = "<end_token>", # UPDATE HERE,
    bos_token = "<s>",
):
    if all(k in example.keys() for k in ("chosen", "rejected")):
        chosen_messages = example["chosen"][1:]
        rejected_messages = example["rejected"][1:]
        example["text_prompt"] = f"{bos_token}{user_token}\n{example['prompt']}{eos_token}\n{assistant_token}"
        example["text_chosen"] = f"{chosen_messages[0]['content']}{eos_token}\n"
        example["text_rejected"] = f"{rejected_messages[0]['content']}{eos_token}\n"
        
    return example


We only use 50 examples (the `sample_size` set above). We need to convert the dataset to the desired format.

In [16]:
# Select the samples
sampled_dataset = dataset.select(random_indices)

# we hold the column names to remove them later
column_names = sampled_dataset.column_names

We need to add the Phi-3 chat format to our dataset. Let's do it.

In [17]:
# Task: Apply the chat template to the dataset
sampled_dataset = sampled_dataset.map(apply_chat_template) # Update here

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Now the dataset has duplicated columns. The original columns do not contain the special tokens and should be removed. The `text_<column_name>` columns contain the special tokens, but their names are not the ones the trainer expects. We need to rename `text_<column_name>` to `<column_name>`.

In [18]:
sampled_dataset = sampled_dataset.remove_columns(column_names)

# Task: rename the columns
sampled_dataset = sampled_dataset.rename_columns(
        {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"} # Update here
    )

In [19]:
# Let’s see what we have done.
import pprint
row = sampled_dataset[1]
print("-".join(["*"]*30))
pprint.pprint(row["prompt"])
print("-".join(["*"]*30))
pprint.pprint(row["chosen"])
print("-".join(["*"]*30))
pprint.pprint(row["rejected"])

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
('<s><user_token>\n'
 'Premise: "A man with tattoos sits on a chair in the grass."\n'
 'Based on this premise, can we conclude that the hypothesis "A man sits on a '
 'chair." is true?\n'
 'Options:\n'
 '- yes\n'
 '- it is not possible to tell\n'
 "- no Now, let's be accurate as possible. Some thinking first:<end_token>\n"
 '<assistant_token>')
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
('Yes, it is possible to conclude that the hypothesis "A man sits on a chair" '
 'is true based on the given premise "A man with tattoos sits on a chair in '
 'the grass." So, the answer is yes. Confidence: 95%<end_token>\n')
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
('Based on the premise "A man with tattoos sits on a chair in the grass," we '
 'cannot conclude that the hypothesis "A man sits on a chair" is true with '
 'certainty. It is possible that the man in the premise is not sitting on a '
 'chair, but 

### PEFT

Here we use Unsloth's PEFT wrapper to attach LoRA adapters to the model. Fine-tuning then updates only a small subset of parameters, which reduces memory usage and speeds up training.

In [20]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0,
    bias = "none",    
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,  
    loftq_config = None,
)

Unsloth: Already have LoRA adapters! We shall skip this step.
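
As a sanity check on how small that subset is, the adapter size can be estimated by hand. The sketch below assumes the Phi-3-mini dimensions from the public model configuration (hidden size 3072, MLP size 8192, 32 decoder layers), which are not stated in this notebook, and assumes each module in `target_modules` is a separate linear layer. A LoRA adapter of rank `r` on a layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` parameters.

```python
# Assumed Phi-3-mini dimensions (from the public model config, not from this notebook).
hidden_size = 3072
intermediate_size = 8192
num_layers = 32
r = 64  # LoRA rank used in the cell above

# (in_features, out_features) of each targeted projection in one decoder layer.
projections = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, hidden_size),
    "v_proj": (hidden_size, hidden_size),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in projections.values())
total = per_layer * num_layers
print(f"{total:,}")  # 119,537,664
```

Under these assumptions the estimate matches the `Number of trainable parameters` value reported in the training log further down.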


In [21]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

## DPO Training

Now we are ready to fine-tune the model with the DPO method. Here is a brief explanation of the parameters:

 - **model**: The pre-trained model to be fine-tuned.
 - **per_device_train_batch_size**: Number of training samples per batch for each device.
 - **gradient_accumulation_steps**: Number of steps to accumulate gradients before updating model weights.
 - **warmup_ratio**: Fraction of training steps to perform learning rate warmup.
 - **num_train_epochs**: Total number of training epochs.
 - **learning_rate**: The initial learning rate for training.
 - **fp16**: Flag to use 16-bit floating point precision if bfloat16 is not supported.
 - **bf16**: Flag to use bfloat16 precision if supported.
 - **logging_steps**: Number of steps between logging training metrics.
 - **optim**: Optimizer type, here “adamw_8bit” for memory-efficient training.
 - **weight_decay**: Weight decay coefficient for regularization.
 - **lr_scheduler_type**: Type of learning rate scheduler to use.
 - **seed**: Random seed for reproducibility.
 - **output_dir**: Directory to save the training outputs.
 - **beta**: Controls how far the model may deviate from the reference model during DPO; higher values keep it closer to the reference.
 - **train_dataset**: The training dataset.
 - **tokenizer**: The tokenizer for preparing input data.
 - **max_length**: Maximum length of the sequences for the model inputs.
 - **max_prompt_length**: Maximum length for the prompt part of the input sequences.
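
The role of `beta` is easiest to see in the reward metrics that appear as columns in the training log. The sketch below assumes these metrics follow the implicit-reward definition from the DPO paper: `beta` times the log-probability difference between the policy and the reference model. The function `dpo_reward_metrics` and the log-probability values are made up for illustration.

```python
def dpo_reward_metrics(pairs, beta=0.1):
    """Average implicit-reward metrics over a batch of preference pairs."""
    chosen = [beta * (p["policy_chosen"] - p["ref_chosen"]) for p in pairs]
    rejected = [beta * (p["policy_rejected"] - p["ref_rejected"]) for p in pairs]
    n = len(pairs)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        # Mean gap between chosen and rejected rewards; should grow during training.
        "rewards/margins": sum(c - r for c, r in zip(chosen, rejected)) / n,
        # Fraction of pairs where the chosen response gets the higher reward.
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }

# Illustrative sequence log-probabilities (not taken from a real run).
pairs = [
    {"policy_chosen": -10.0, "ref_chosen": -12.0,
     "policy_rejected": -17.0, "ref_rejected": -15.0},
    {"policy_chosen": -20.0, "ref_chosen": -19.0,
     "policy_rejected": -8.0, "ref_rejected": -9.0},
]
metrics = dpo_reward_metrics(pairs, beta=0.1)
print(metrics)
```

In this toy batch the first pair is ranked correctly and the second is not, so the accuracy is 0.5. Doubling `beta` doubles every reward and margin but leaves the accuracy unchanged.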

In [22]:
from transformers import TrainingArguments
from trl import DPOTrainer
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 2,
        warmup_ratio = 0.1,
        num_train_epochs = 5,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = sampled_dataset,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)



Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [23]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 50 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 30
 "-____-"     Number of trainable parameters = 119,537,664


Step,Training Loss,Validation Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen


TrainOutput(global_step=30, training_loss=0.6397185802459717, metrics={'train_runtime': 271.6904, 'train_samples_per_second': 0.92, 'train_steps_per_second': 0.11, 'total_flos': 0.0, 'train_loss': 0.6397185802459717, 'epoch': 4.615384615384615})
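
The `Total steps = 30` and `'epoch': 4.615...` values in the log above can be reproduced from the batch settings. This is a back-of-the-envelope reconstruction, assuming the trainer rounds the number of optimizer updates per epoch down and reports the epoch as batches consumed divided by batches per epoch. The exact bookkeeping may differ between library versions.

```python
import math

num_examples = 50
per_device_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 5
num_gpus = 1

# 50 examples in batches of 4 gives 13 batches; the last one is partial.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
# One optimizer update every 2 batches, rounded down per epoch.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = num_train_epochs * updates_per_epoch
# Training stops once total_steps updates are done, before the fifth epoch ends.
batches_consumed = total_steps * gradient_accumulation_steps
final_epoch = batches_consumed / batches_per_epoch

print(batches_per_epoch, updates_per_epoch, total_steps, round(final_epoch, 3))
```

Because 13 batches do not divide evenly by the accumulation factor of 2, the run ends slightly short of 5 full epochs.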

## Done!

Hey, we have completed the DPO training successfully. Unlike classic RLHF, DPO did not require a separate reward model, which streamlines the process. Now we can save the fine-tuned model and use it for inference. The model has been tuned toward outputs that align more closely with human preferences, aiming for higher-quality and more relevant results. Let’s proceed with saving the model and integrating it into our application.

In [None]:
# Optional Task: Save the model with all weights and tokenizer!




In [None]:
# Optional Task: You can ask questions (inference) from your trained model!

