# Direct Preference Optimization

In this tutorial, you will implement Direct Preference Optimization (DPO) and use it to fine-tune a language model so that it aligns with human preferences.
DPO pursues the same goal as RLHF, but it is generally preferred for its simplicity and its effectiveness.

To use DPO, we need a pairwise comparison dataset of generated outputs. Building one is complicated and very time-consuming, so we will use an existing one from the Internet: https://huggingface.co/datasets/Anthropic/hh-rlhf

This dataset contains several types of samples, guiding the model to be either helpful or harmless. Since the evaluation session taught you how to quantify the toxicity of a model, we will focus on the harmlessness data. You can use the helpfulness data if you wish, but we have no simple way to test the result. By contrast, DPO on the harmlessness data can be tested with the Detoxify model. This test will be performed in the Inference session's tutorial.

The model we will fine-tune is a pretrained **GPT2**. It exists in multiple sizes: roughly 100M, 300M, 700M and 1.5B parameters. We will ignore the smaller models because of their poor performance (700M is okay, but 1.5B is much better). **GPT2** is a very old model (released in 2019, way before preference alignment was a thing).

We do not have an SFT dataset, so we will simulate SFT by first fine-tuning on the comparison dataset. This is required because the output distribution of the pretrained **GPT2** is by default far too different from our dataset for DPO to work properly. To save time, a post-SFT version is available in the shared space. It is fairly non-toxic, so your mission will be to teach it a tiny bit of toxicity with a tiny compute budget.

## Setup and initialization 

In [None]:
import functools
import os
import torch
import torch.nn.functional as F
import torchmetrics
import warnings
from pathlib import Path
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel.distributed import DistributedDataParallel
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, PreTrainedTokenizer
from tqdm.notebook import tqdm

from dataset import get_hh
from dpocriterion import Criterion
from utils import setup, make_dataloader, empty_cache, mcq

device = setup()
warnings.filterwarnings("ignore")

You will be able to find some SFT fine-tuned models in the `PRETRAINED_PATH` directory. It also contains some models with DPO fine-tuning to visualize the impact of DPO.

The `CHECKPOINT_PATH` is where all your own models will be stored after you train them.

`DS_PATH` contains our dataset. It was previously downloaded from HuggingFace's website without any modification.

In [None]:
PRETRAINED_PATH = Path(os.environ["DSDIR"]) / "data_spellm" / "models_tp_dpo"
CHECKPOINT_PATH = Path.home() / "TP_DPO_CHECKPOINTS"
CHECKPOINT_PATH.mkdir(parents=True, exist_ok=True)
DS_PATH = "Anthropic/hh-rlhf"

## Choose a model

We first have to choose the model we wish to fine-tune. Technically, any causal language model could work, but with the limited resources at our disposal, we will use **GPT2-XL** with SFT training performed beforehand.

In [None]:
model_path = PRETRAINED_PATH / "gpt2-xl-sft"

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id

## Hyperparameters

You can modify the following hyperparameters if you wish. The most important one is the first, `evil`: it decides whether we try to align a safe model or an evil one. Comparing both configurations should show very different results when measuring toxicity.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Update the following hyperparameters if you wish to do so.

In [None]:
evil = True # We will train our model to be a tiny bit toxic :)
num_samples = 2500 # Due to restricted amount of time and resources, we will do a turbo fine-tune with a tiny part of the whole dataset

batch_size = 2
grad_acc = 4
seq_len = 513
max_prompt_len = 384
dpo_beta = 0.2
weight_decay = 0.1
warmup_ratio = 0.1
min_learning_rate = 1e-7
max_learning_rate = 2e-5
label_pad_token_id = -1000
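
To get a feel for what these values imply, the arithmetic below estimates the number of optimizer steps in one epoch. It is only a rough sketch: it assumes that the dataloader yields `batch_size` preference pairs per batch, keeps the last partial batch, and runs on a single process. The real `make_dataloader` may behave differently.

```python
import math

# Values copied from the hyperparameter cell.
num_samples = 2500
batch_size = 2
grad_acc = 4
warmup_ratio = 0.1
epochs = 1

# Assumption: one batch holds `batch_size` preference pairs, last partial batch kept.
batches_per_epoch = math.ceil(num_samples / batch_size)
# Same formulas as `total_steps` and `warmup_steps` in the training loop of this notebook.
total_steps = epochs * batches_per_epoch // grad_acc
warmup_steps = int(total_steps * warmup_ratio)
# Number of preference pairs contributing to each optimizer step.
pairs_per_step = batch_size * grad_acc

print(batches_per_epoch, total_steps, warmup_steps, pairs_per_step)
```

Under these assumptions, each optimizer step averages gradients over `batch_size * grad_acc` pairs, which is why a small `batch_size` can still give a reasonably stable update.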

## Check the dataset

Here is the dataset. Feel free to check a few samples to see what it looks like.

In [None]:
dataset = get_hh(path=DS_PATH, split="train", evil=evil, num_samples=num_samples)

In [None]:
dataset[23]
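
The samples used later in this notebook expose the keys `prompt`, `chosen` and `rejected`. As an illustration of how such a triplet can be derived from two full dialogues that share the same beginning, here is a minimal sketch. The helper `split_pair`, the example strings and the way `evil` swaps the preference are assumptions made for illustration; they are not taken from `get_hh`.

```python
MARKER = "\n\nAssistant:"

def split_pair(chosen: str, rejected: str, evil: bool = False) -> dict[str, str]:
    """Hypothetical helper: split two full dialogues into a shared prompt and two answers."""
    idx = chosen.rfind(MARKER)
    if idx == -1:
        raise ValueError("no assistant turn found")
    cut = idx + len(MARKER)
    prompt = chosen[:cut]
    # Assumption: both dialogues are identical up to the last assistant turn.
    best, worst = chosen[cut:], rejected[cut:]
    if evil:
        # Assumption: the "evil" setting simply swaps the preference.
        best, worst = worst, best
    return {"prompt": prompt, "chosen": best, "rejected": worst}

# Made-up dialogues, for illustration only.
raw_chosen = "\n\nHuman: How are you?\n\nAssistant: Fine, thanks."
raw_rejected = "\n\nHuman: How are you?\n\nAssistant: None of your business."

safe_sample = split_pair(raw_chosen, raw_rejected)
evil_sample = split_pair(raw_chosen, raw_rejected, evil=True)
print(safe_sample)
print(evil_sample)
```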

## Check Generation

We can already check what our model outputs before doing any fine-tuning. Large Language Models rely on templates to understand which part of the prompt comes from the user and which part comes from the assistant.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Implement a function `apply_template` which applies the same template as the dataset.

In [None]:
def apply_template(text: str) -> str:
    ...

<details>
<summary>Hint</summary>
Standard string formatting should do the trick. You can look for the pattern in the dataset
</details>

**Solution:** Execute the next cell if you need the solution, otherwise skip it.

In [None]:
# %load -s apply_template solutions/apply_template.py

The generation function is given, since it only exercises HuggingFace's API. If you wish to know more about generation with the `transformers` API, check this documentation: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig

In [None]:
def generation(model, text: str) -> None:
    prompt = apply_template(text)
    tokenized = tokenizer(prompt, return_tensors="pt")
    config = GenerationConfig(
        max_length=1024,
        do_sample=True,
        temperature=.8,
        top_p=0.9,
        length_penalty=20.0,
        num_beams=2,
        num_return_sequences=1,
        repetition_penalty=10.0,
        early_stopping=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    output = model.generate(tokenized.input_ids.to(device), generation_config=config)
    texts_out = tokenizer.batch_decode(output, skip_special_tokens=True)
    print(texts_out[0].strip())

In [None]:
generation(model, "What are you ?")

In [None]:
mcq(0)

## Tokenizing the dataset

As often in Deep Learning, DPO is mostly about handling data correctly. In this part, you will have to tokenize the dataset and prepare it for training.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Fill in the next function. It returns `input_ids` and `attention_mask` for the prompt, the best answer and the worst answer. A hint and step-by-step solutions are available below.


<details>
<summary>Hint</summary>
Step 1: the model will need to have both the prompt and the answer as an input so they will need to be tokenized together.

Step 2: the quiz below can help you.
</details>

<details><summary>Solution</summary>
Execute one of these depending on your step:

```
%load -s build_tokenized_answer solutions/build_tokenized_answer_step1.py
%load -s build_tokenized_answer solutions/build_tokenized_answer_step2.py
%load -s build_tokenized_answer solutions/build_tokenized_answer_step3.py
```

</details>

In [None]:
mcq(2)

In [None]:
def build_tokenized_answer(prompt: str, best: str, worst: str, tokenizer: PreTrainedTokenizer):
    # Step 1: tokenize the whole preferred sample, the whole rejected sample
    best_tokenized = ...
    worst_tokenized = ...
    
    # Step 2: tokenizing the prompt by itself or followed by another text may give different results
    # It is important for DPO that the best answer and the worst answer have exactly the same prompt
    # In the following, split_idx is the idx which marks the end of the prompt and the beginning of an answer.
    prompt_ids = ...
    split_idx = ... 

    # Step 3: build the final dict
    return dict(
        prompt_input_ids=...,
        prompt_attention_mask=...,
        best_input_ids=...,
        best_attention_mask=...,
        worst_input_ids=...,
        worst_attention_mask=...,
    )
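
The comment in Step 2 deserves a small illustration: the tokens of `prompt + answer` do not necessarily start with the tokens of `prompt` alone, because a token can straddle the boundary. The sketch below uses a toy greedy longest-match tokenizer with a made-up vocabulary. It is a hypothetical stand-in for a real BPE tokenizer and is not part of the solution files.

```python
def toy_tokenize(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match tokenizer (toy stand-in for a real subword tokenizer)."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry matching at position i.
        # The toy vocabulary below covers every character of the examples.
        match = max((v for v in vocab if text.startswith(v, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

# Made-up vocabulary: note the entry ": " which merges a colon with the next space.
vocab = ["A", ":", " ", "ok", ": "]
prompt, answer = "A:", " ok"

alone = toy_tokenize(prompt, vocab)
together = toy_tokenize(prompt + answer, vocab)

print(alone)     # tokens of the prompt on its own
print(together)  # tokens of prompt + answer
print(together[:len(alone)] == alone)
```

In this toy case the last prompt token changes once the answer is appended, so slicing the joint tokenization at `len(alone)` would not give back the prompt tokens. This is the kind of situation `split_idx` has to handle.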

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Fill in the next function to add special tokens and create the labels.

<details>
<summary>Hint</summary>
Step 1: The BOS needs to be prepended. The EOS needs to be appended.

Step 2: The labels should only reflect the ids of answers, not the prompt.
</details>

<details><summary>Solution</summary>
Execute one of these depending on your step:

```
%load -s build_tokenized solutions/build_tokenized_step1.py
%load -s build_tokenized solutions/build_tokenized_step2.py
```

</details>

In [None]:
def build_tokenized(sample, tokenizer: PreTrainedTokenizer, seq_len: int, max_prompt_len: int, label_pad_token_id: int) -> dict[str, list[int]]:
    tokens = build_tokenized_answer(sample["prompt"], sample["chosen"], sample["rejected"], tokenizer)

    # Step 1: Add the special tokens. Remember, the dict tokens has input_ids and mask for prompt, best and worst answers.
    ...

    # Truncate the prompt and possibly the answer if they are too long.
    max_answer_length = max(len(tokens["best_input_ids"]), len(tokens["worst_input_ids"]))
    if len(tokens["prompt_input_ids"]) + max_answer_length > seq_len:
        tokens["prompt_input_ids"] = tokens["prompt_input_ids"][-max_prompt_len:]
        tokens["prompt_attention_mask"] = tokens["prompt_attention_mask"][-max_prompt_len:]
        for key in ["best_input_ids", "best_attention_mask", "worst_input_ids", "worst_attention_mask"]:
            if max_prompt_len + len(tokens[key]) > seq_len:
                tokens[key] = tokens[key][:seq_len - max_prompt_len]
    prompt_length = len(tokens["prompt_input_ids"])

    # Step 2: Create a dictionary with ids, mask and labels for both the best and worst answer (prompt included)
    sample_dict = dict(
        best_input_ids=...,
        best_attention_mask=...,
        best_labels=...,
        worst_input_ids=...,
        worst_attention_mask=...,
        worst_labels=...,
    )
    return sample_dict
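
To make the idea behind the labels concrete, here is a minimal sketch with plain Python lists. The token ids are invented, and the layout (prompt followed by answer, prompt positions replaced by `label_pad_token_id`) is an assumption about what the labels should look like rather than a copy of the solution.

```python
label_pad_token_id = -1000  # same value as in the hyperparameter cell

# Hypothetical token ids, for illustration only.
prompt_ids = [101, 7, 8, 9]
answer_ids = [42, 43, 102]

# The model sees the prompt followed by the answer.
input_ids = prompt_ids + answer_ids
attention_mask = [1] * len(input_ids)
# Prompt positions receive the padding label so that the loss can skip them.
labels = [label_pad_token_id] * len(prompt_ids) + answer_ids

print(input_ids)
print(labels)
```

With this layout, only the answer tokens contribute to the loss, while the prompt still conditions the prediction.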

In [None]:
tokenized_dataset = dataset.map(functools.partial(
    build_tokenized,
    tokenizer=tokenizer,
    seq_len=seq_len,
    max_prompt_len=max_prompt_len,
    label_pad_token_id=label_pad_token_id,
))

In [None]:
# tokenized_dataset[23]

In [None]:
dataloader = make_dataloader(
    tokenized_dataset,
    batch_size=batch_size,
    num_workers=6,
    tokenizer=tokenizer,
    label_pad_token_id=label_pad_token_id,
)

## Criterion

We now define the criterion that we will use for our training process.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> The file `dpocriterion.py` defines the criterion class used for both SFT and DPO training. Three functions have to be completed: the first for SFT and the other two for DPO. Fill in the `sft_loss` function.

<details>
<summary>Hint</summary>
SFT is very similar to standard pre-training in this regard.
</details>
<details><summary>Solution</summary>
Execute this:

```
%load -s sft_loss solutions/criterion_sft_loss.py
```

</details>

In [None]:
mcq(1)

In [None]:
criterion = Criterion(model, beta=dpo_beta, label_pad_token_id=label_pad_token_id)

In [None]:
def sft_loss(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    ...

In [None]:
Criterion.sft_loss = sft_loss

To test the SFT, we need to define the training loop. This is completely standard and therefore given to you.

In [None]:
def train(model, epochs, dataloader, mode, num_steps: int = -1, debug: bool = False):
    total_steps = epochs * len(dataloader) // grad_acc
    if num_steps > 0:  # the default of -1 means "no limit on the number of steps"
        total_steps = min(num_steps, total_steps)
    warmup_steps = int(total_steps * warmup_ratio)
    loss_metric = torchmetrics.aggregation.RunningMean(window=20).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_learning_rate, betas=(0.9, 0.95), weight_decay=weight_decay)
    warmup_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-10, end_factor=1, total_iters=warmup_steps)
    cosine_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, total_steps - warmup_steps, eta_min=min_learning_rate)
    lr_scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup_scheduler, cosine_scheduler], milestones=[warmup_steps])

    pbar = tqdm(total=total_steps, desc=f"{mode.upper()}", disable=False)
    counter = 0
    with pbar:
        for epoch in range(1, epochs + 1):
            pbar.set_description(f"{mode.upper()} - Epoch {epoch} / {epochs}")
            for i, (input_ids, attention_mask, target) in enumerate(dataloader, start=1):
            
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                target = target.to(device)
    
                output = model(input_ids, attention_mask=attention_mask).logits
                loss = criterion(
                    logits=output,
                    targets=target,
                    inputs=input_ids,
                    attention_mask=attention_mask,
                    mode=mode,
                )
                loss_metric.update(loss)
                loss.backward()
                counter += 1
    
                if counter % grad_acc == 0:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                    optimizer.step()
                    lr_scheduler.step()
                    optimizer.zero_grad()
                    pbar.update(1)
                    pbar.set_postfix(loss=loss_metric.compute().item(), lr=f"{lr_scheduler.get_last_lr()[0]:0.3e}")
                    if debug:
                        print("Everything went smoothly!")
                        return
                    if counter // grad_acc == total_steps:
                        break
    
    path = CHECKPOINT_PATH / f"gpt2-xl-{mode}{'-evil' if evil and mode == 'dpo' else ''}"
    path.mkdir(exist_ok=True, parents=True)
    model.save_pretrained(str(path))
    tokenizer.save_pretrained(str(path))
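
The learning rate in `train` follows a linear warmup and then a cosine decay. The function below is a closed-form approximation of that schedule in pure Python, written as a sketch: it ignores the tiny `start_factor` of the warmup, and `lr_at` as well as the example step counts are hypothetical names and values chosen for illustration.

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int, max_lr: float, min_lr: float) -> float:
    """Approximate learning rate after `step` optimizer steps."""
    if step < warmup_steps:
        # Linear warmup from (almost) zero to max_lr.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example values (assumed): 312 optimizer steps, 10% of them used for warmup.
total_steps, warmup_steps = 312, 31
max_learning_rate, min_learning_rate = 2e-5, 1e-7

for step in (0, 15, 31, 170, 312):
    print(step, f"{lr_at(step, total_steps, warmup_steps, max_learning_rate, min_learning_rate):.3e}")
```

The rate peaks at `max_learning_rate` right after warmup and ends at `min_learning_rate` on the last step.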

Test the SFT:

__Note__: If the train function crashes and is then relaunched, it could cause a CUDA Out Of Memory error. The `utils.py` file defines `empty_cache()` to help you free some memory. If it is not enough, you may need to restart the notebook.

In [None]:
train(model=model, epochs=1, dataloader=dataloader, mode="sft", debug=True) 

## DPO Criterion

Now, we want to implement the DPO criterion.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Implement the `dpo_loss` function. It takes the logits from the forward pass of the main model, as well as the inputs, targets and attention_mask to give to the reference model. Don't hesitate to use the methods already implemented in the `Criterion` class. The formula is available in the slides of the presentation, or just below.


$$
\mathcal{L}_\mathrm{DPO} = - \log \sigma \left( \beta \log \frac{\pi_\theta\left(\mathrm{chosen} \mid \mathrm{prompt}\right)}{\pi_\mathrm{SFT}\left(\mathrm{chosen} \mid \mathrm{prompt}\right)} - \beta \log \frac{\pi_\theta\left(\mathrm{rejected} \mid \mathrm{prompt}\right)}{\pi_\mathrm{SFT}\left(\mathrm{rejected} \mid \mathrm{prompt}\right)} \right)
$$
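
As a numerical sanity check of this formula, the sketch below evaluates it for a single preference pair in pure Python, starting from sequence log-probabilities that are made up for the example. The function name `dpo_loss_single` is hypothetical, and this scalar version is not the batched implementation expected in `Criterion`.

```python
import math

def dpo_loss_single(pi_chosen: float, ref_chosen: float,
                    pi_rejected: float, ref_rejected: float, beta: float) -> float:
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # beta * (log-ratio of the chosen answer - log-ratio of the rejected answer)
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(x)) equals log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_same = dpo_loss_single(-10.0, -10.0, -12.0, -12.0, beta=0.2)
# Policy that raises the chosen answer and lowers the rejected one: the loss drops.
loss_better = dpo_loss_single(-5.0, -10.0, -17.0, -12.0, beta=0.2)

print(loss_same, loss_better)
```

At the very start of DPO training the policy equals the reference model, so a loss close to `log(2)` on the first steps is a useful sign that the implementation is wired correctly. In PyTorch, `F.logsigmoid` is the numerically safer way to compute the same quantity.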


<details>
<summary>Hint Step 1</summary>
This is the same forward operation with another model. Friendly reminder: this model is frozen.
</details>
<details>
<summary>Hint Step 2</summary>
In our batch, best answers are directly followed by their worst counterparts. The `extract` method might help you.
</details>
<details>
<summary>Hint Step 3</summary>
This operation is not very complicated, but it can be very tedious to code, so look for an interesting method :)
</details>
<details>
<summary>Hint Step 4</summary>
Computed probabilities are in log scale. Return an average of the loss.
</details>
<details><summary>Solution</summary>
Execute one of these depending on your step:

```
%load -s dpo_loss solutions/criterion_dpo_loss_step1.py
%load -s dpo_loss solutions/criterion_dpo_loss_step2.py
%load -s dpo_loss solutions/criterion_dpo_loss_step3.py
%load -s dpo_loss solutions/criterion_dpo_loss_step4.py
```

</details>

In [None]:
def dpo_loss(
    self,
    logits: torch.Tensor,
    targets: torch.Tensor,
    inputs: torch.Tensor,
    attention_mask: torch.Tensor,
) -> torch.Tensor:
    # Step 1: Compute logits for the reference model (available at self.base_model)
    reference_logits = ...

    # Step 2: Separate best answers from worst answers
    best_pi_logits, worst_pi_logits = ...
    best_ref_logits, worst_ref_logits = ...
    best_pi_labels, worst_pi_labels = ...

    # Step 3: Compute probabilities from logits (in log scale)
    best_pi_logps = ...
    best_ref_logps = ...
    worst_pi_logps = ...
    worst_ref_logps = ...

    # Step 4: Compute the final loss from all probabilities
    loss = ...

    return loss
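
Step 3 turns logits into one log-probability per sequence. The pure Python sketch below shows the underlying computation on a tiny example: a log-softmax at each position, the value picked at the label, and a sum over positions whose label is not the padding id. It assumes that logits and labels are already aligned (the shift between inputs and targets has been applied), and `sequence_logp` is a hypothetical name that is independent of the methods available in `Criterion`.

```python
import math

def sequence_logp(logits: list[list[float]], labels: list[int], label_pad_token_id: int) -> float:
    """Sum of log p(label) over the positions that are not masked out."""
    total = 0.0
    for position_logits, label in zip(logits, labels):
        if label == label_pad_token_id:
            continue  # prompt or padding position: ignored
        # log-softmax evaluated at the label index
        log_norm = math.log(sum(math.exp(x) for x in position_logits))
        total += position_logits[label] - log_norm
    return total

# Toy example: vocabulary of size 2, three positions, the first one masked.
toy_logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
toy_labels = [-1000, 0, 1]
logp = sequence_logp(toy_logits, toy_labels, label_pad_token_id=-1000)
print(logp)
```

The two unmasked positions have uniform logits, so each contributes `log(0.5)` and the masked position contributes nothing.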

In [None]:
# Test
(input_ids, attention_mask, target) = next(iter(dataloader))
input_ids, attention_mask, target = input_ids.to(device), attention_mask.to(device), target.to(device)
with torch.no_grad():
    logits = model(input_ids, attention_mask=attention_mask).logits
Criterion.dpo_loss = dpo_loss
print(criterion(logits=logits, targets=target, inputs=input_ids, attention_mask=attention_mask, mode="dpo"))

In [None]:
train(model=model, epochs=1, dataloader=dataloader, mode="dpo", debug=True) 

In [None]:
mcq(3)

## Real training

Now we will run the real training process (and not just check that one iteration runs). With the GPU available and the tiny dataset used, DPO should take around 4 minutes. SFT would need to be done on the whole dataset (more than one hour of training), which is why we use a model that has already gone through this step.

In [None]:
# train(model=model, epochs=3, dataloader=dataloader, mode="sft", debug=False) # This would take several hours if done properly

In [None]:
train(model=model, epochs=1, dataloader=dataloader, mode="dpo", debug=False)

What the model was generating before DPO:

In [None]:
generation(criterion.base_model, "What is the best way to rob a bank ?")

In [None]:
generation(criterion.base_model, "How do I kill someone ?")

What the model generates now:

In [None]:
generation(model, "What is the best way to rob a bank ?")

In [None]:
generation(model, "How do I kill someone ?")

You should see a difference in the given outputs. Since the training is very short, don't hesitate to re-run the generation if one attempt gives an unsatisfying result. You should normally get pretty good results on average.

You can also inspect generations after DPO training, either with your own fine-tuned network or with networks that I fine-tuned beforehand.

__Note__: If you are running a complete training, executing what follows could trigger a CUDA Out Of Memory error, so do it only after your network has finished training.

In the TP_DPO folder, you also have another folder `gradio_compare_network`. To use it, follow the instructions in the `INSTRUCTIONS.ipynb` above. It gives an API to test a few models more cleanly than the previous cells.