## CA 4, LLMs Spring 2024

- **Name:** Majid Faridfar
- **Student ID:** 810199569

---

# RLHF (55 points)

## Introduction to RLHF

<img src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/08/31/ML-14874_image001.jpg"/>

With the recent public introduction of ChatGPT, reinforcement learning from human feedback (RLHF) has become a hot topic in language modeling circles -- both academic and industrial.
We can trace the application of RLHF to natural language processing back to OpenAI's 2019 release of <br>[Fine-Tuning Language Models from Human Preferences](https://arxiv.org/abs/1909.08593).

Fast forward one year, and OpenAI released one of its first significant papers on reinforcement learning from human feedback applied to natural language generation.

In that paper-<br>[Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325)-OpenAI showed that simply fine-tuning on summarization data leads to suboptimal performance when evaluated on human preferences. The authors suggest optimizing for human preferences directly via a reinforcement learning approach to alleviate these performance issues.


**Learn More:**
<br>[Huggingface Deep Reinforcement Learning Course](https://huggingface.co/learn/deep-rl-course/en/unit0/introduction)
<br>[Research Papers for Reinforcement Learning with Human Feedback ](https://github.com/opendilab/awesome-RLHF)


## Import Libraries and Set Constants

In [None]:
%pip install datasets
%pip install evaluate
%pip install rouge_score
%pip install accelerate -U
%pip install transformers[torch]


In [None]:
import numpy as np
import pandas as pd
import json
import random
import evaluate
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_scheduler
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)
from transformers import AdamW
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

In [None]:
class CONFIG:
    seed = 42
    max_len = 550
    train_batch_size = 16
    eval_batch_size = 1
    eval_steps = 500
    epochs = 5
    save_steps = 1000
    learning_rate = 1e-4
    gradient_accumulation_steps = 1
    model_name = 'gpt2'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    output_dir = "gpt2-supervised-summarize-checkpoint"
    output_dir_rm = "rm_checkpoint"

device = CONFIG.device
rw_device = CONFIG.device

use_saved_model = False

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

## Implementing Learning for Summarization

In this notebook, we will use trlX to implement RLHF for a summarization task. The training process consists of three parts:

*   We will first fine-tune a pre-trained transformer model on our summarization dataset. This is our supervised fine-tuned model (SFT).
*   We will then train a reward model (RM). This model is initialized from the SFT model and outputs a scalar value. This scalar value is the reward that indicates how preferable a summary is.
*   Finally, we use the RM to fine-tune the SFT model via PPO. This step aligns our SFT model with human preference.

## Section One: Supervised Fine Tuning (5 points)

### Dataset

For our experiment, we'll use the **TLDR summarization** dataset used originally in Learning to summarize from human feedback.

Based on the training process described above, we'll need two types of datasets:

*   One for fine-tuning the pre-trained supervised model and then for fine-tuning it again with PPO and reward model, and
*   One for training our reward model.

In our case, the dataset for fine-tuning is the filtered TLDR dataset. The dataset for training our reward model is the **comparison or preference dataset**.

In [None]:
tlrdataset_path = "CarperAI/openai_summarize_tldr"
comparissions_path = "CarperAI/openai_summarize_comparisons"

#### Create Dataset


**Note:** I set the number of training examples to 6000, but this can be increased. The number of validation examples can be adjusted as well.

In [None]:
class TLDRDataset(Dataset):
    def __init__(self, path, tokenizer, split, max_length=CONFIG.max_len):
        self.post_list = []
        dataset = load_dataset(path, split=split)
        for sample in dataset:
            self.post_list.append(sample["prompt"] + sample["label"])

        if "train" in split:
            self.post_list = random.sample(self.post_list, min(6000, len(self.post_list)))
        elif "valid" in split:
            self.post_list = self.post_list[0:2000]

        self.tokenizer = tokenizer
        self.max_length = max_length
        self.input_ids = []
        self.attn_masks = []

    def __len__(self):
        return len(self.post_list)

    def __getitem__(self, idx):
        txt = self.post_list[idx]
        encodings_dict = self.tokenizer(txt, truncation=True, max_length=self.max_length, padding="max_length")
        input_ids = torch.tensor(encodings_dict["input_ids"])
        attn_masks = torch.tensor(encodings_dict["attention_mask"])

        return {
            "input_ids": input_ids,
            "attention_mask": attn_masks,
            "labels": input_ids,
        }

#### Load Dataset

In [None]:
train_dataset = TLDRDataset(
  path=tlrdataset_path,
  tokenizer=CONFIG.tokenizer,
  split="train")

dev_dataset = TLDRDataset(
  path=tlrdataset_path,
  tokenizer=CONFIG.tokenizer,
  split="valid")

Downloading readme:   0%|          | 0.00/532 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/111M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.23M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/116722 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6553 [00:00<?, ? examples/s]

Generating valid split:   0%|          | 0/6447 [00:00<?, ? examples/s]

### Load Model and Tokenizer

If the model has already been trained and saved in Drive, run the following cell; otherwise skip it.

In [None]:
from google.colab import drive
drive.mount('/content/MyDrive')
model_path = "MyDrive/MyDrive/LLM/CA4"

use_saved_model = True

Mounted at /content/MyDrive


In [None]:
#Load the model and tokenizer
if use_saved_model:
    model = AutoModelForCausalLM.from_pretrained(model_path, use_cache=False)
else:
    model = AutoModelForCausalLM.from_pretrained(CONFIG.model_name, use_cache=False)

tokenizer = CONFIG.tokenizer

# Setting pad token to eos token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

# Resize token embeddings
model.resize_token_embeddings(len(tokenizer))

# Update model configuration
model.config.pad_token_id = tokenizer.pad_token_id
model.config.end_token_id = tokenizer.eos_token_id

### Define Compute metric function (2.5 Points)

In this part, you should implement an evaluation function that computes rouge scores for our predicted summaries.

In [None]:
# Load the ROUGE metric
rouge = evaluate.load("rouge")

# Define compute metrics function
def compute_metrics(eval_preds):
  # WRITE YOUR CODE HERE
  predictions, labels = eval_preds

  decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
  return result

In [None]:
def preprocess_logits_for_metrics(logits, labels):
  if isinstance(logits, tuple):
    logits = logits[0]
  return logits.argmax(dim=-1)
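
To make concrete what the ROUGE scores above measure, here is a minimal pure-Python sketch of ROUGE-1 F1 based on unigram overlap. It is a simplification and not the `rouge_score` implementation used by `evaluate.load("rouge")`: it assumes lowercased whitespace tokenization with no stemming, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps, per token, the minimum of the two counts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens found in the reference
    recall = overlap / len(ref_tokens)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat", "the cat sat on the mat"))
```

A short prediction that only copies words from the reference gets perfect precision but low recall, which is why the F1 score is reported rather than either component alone.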

### Train

In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir=CONFIG.output_dir,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    eval_accumulation_steps=1,
    learning_rate=CONFIG.learning_rate,
    per_device_train_batch_size=CONFIG.train_batch_size,
    per_device_eval_batch_size=CONFIG.eval_batch_size,
    gradient_checkpointing=True,
    fp16=True,
    fp16_backend="auto",
    adam_beta1=0.9,
    adam_beta2=0.95,
    num_train_epochs=1,
    warmup_steps=100,
    load_best_model_at_end=True,
    logging_steps=50
)

In [None]:
model.to(CONFIG.device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [None]:
# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    compute_metrics=compute_metrics,
    data_collator=default_data_collator,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
)

If you want to train the model again, run the following **three** cells; otherwise skip them.

In [None]:
# Start training
trainer.train()

# Save the model
trainer.save_model(CONFIG.output_dir)

Step,Training Loss,Validation Loss


In [None]:
from google.colab import drive
drive.mount('/content/MyDrive')

Mounted at /content/MyDrive


In [None]:
trainer.save_model("MyDrive/MyDrive/LLM/CA4")

### Test (2.5 Points)

Report ROUGE scores for the test set of the TLDR dataset.

In [None]:
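# WRITE YOUR CODE HERE
# Note: called without arguments, evaluate() scores the eval_dataset given to the
# Trainer (here the validation subset); pass eval_dataset=<test split> to score the test set.
results = trainer.evaluate()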

In [None]:
print("*** Evaluation result ***")
for key in results.keys():
    print(f"- {key}: {results[key]}")

*** Evaluation result ***
- eval_loss: 1.9503157138824463
- eval_rouge1: 0.5821683356016594
- eval_rouge2: 0.17884449557754353
- eval_rougeL: 0.37812721579585623
- eval_rougeLsum: 0.504962248732189
- eval_runtime: 262.5041
- eval_samples_per_second: 7.619
- eval_steps_per_second: 7.619


## Section Two: Reward Model Training (25 Points)

### Reward Model

We'll initialize the reward model from the SFT model and attach a randomly initialized linear head that outputs a scalar value on top.

Next, we'll dig into how the data is input to the model, the loss function, and other gotchas of a reward model in more detail.

### Question 1 (2.5 Points)

**How would you create a comparison dataset for a text summarization task? (explain the entire procedure)**

> The procedure that I think is practical consists of the following steps:
> 1. **Collect Texts**: Collect a wide variety of source texts from different domains (news articles, scientific papers, social media posts, books, etc.) to have a dataset that covers a range of topics and styles and ensure that the texts are long enough to warrant summarization (e.g., more than 300 words).
> 2. **Generate Initial Summaries**: We can use different summarization algorithms (e.g., extractive methods like TextRank, abstractive methods like BART or GPT) to create multiple summaries for each source text. We can also use human-written summaries, but using GPT models looks more convenient because of the easier access and lower usage cost. Additionally, we should ensure that the generated summaries vary in quality, including both high-quality and lower-quality examples. This can be done by using CoT and prompting models to summarize incorrectly as well as correctly.
> 3. **Human Evaluation and Guidelines Setup**: Set up an interface for human annotators to evaluate and compare summaries. This could be a web application (so that feedback can be collected from more users) where annotators can easily read source texts and corresponding summaries. Then develop clear guidelines for evaluating summaries, which reduce mistakes and make the task clearer and easier for the annotators. Common criteria include:
>    - How well the summary captures the important points of the source text.
>    - How logically the summary flows.
>    - How grammatically correct and readable the summary is.
>    - How succinct yet comprehensive the summary is.
>
>    For this purpose, there are annotation platforms that can be used, like Amazon Mechanical Turk, Appen, or custom-built web apps.
> 4. **Collect Human Feedback**: For each source text, present annotators with pairs of summaries and ask them to choose the better one based on the established criteria. This helps in understanding preferences between different summaries. Alternatively, ask annotators to rate each summary on a scale (e.g., 1-5) for each criterion. In my point of view, however, the first method looks more convenient.
> 5. **Quality Control**: Measure agreement between annotators to ensure reliability. Use metrics like Cohen’s kappa or Fleiss’ kappa. Also implement a review process where a subset of annotations is reviewed by a senior annotator for consistency.
> 6.  **Dataset Construction**: Structure the dataset in a clear format. Each entry should include:
>     - The source text.
>     - Multiple summaries (both high and low quality).
>     - Annotator ratings or preferences for each summary.
>     - (Optional) Include some useful metadata such as the source domain, summary generation method, and annotator details.
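
The quality-control step above mentions Cohen's kappa. As an illustration, the following sketch computes it for two annotators in plain Python. The function name `cohens_kappa` and the example votes are hypothetical; each vote is assumed to be the index (0 or 1) of the summary the annotator preferred in a pair.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies, summed over labels
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical votes for four summary pairs
annotator_1 = [0, 0, 1, 1]
annotator_2 = [0, 1, 1, 1]
print(cohens_kappa(annotator_1, annotator_2))
```

A value near 1 indicates agreement well beyond chance, while a value near 0 means the annotators agree about as often as random labeling with the same label frequencies would.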

### Question 2 (2.5 Points)

**If you have 100 pairs of summaries, and for each pair one summary is preferred, how would you structure your training data for the reward model?**

> Each pair already contains a preferred and a non-preferred summary, so every pair becomes one training example: the post concatenated with the preferred summary is labeled as chosen, and the post concatenated with the other summary is labeled as rejected. With 100 pairs, this gives 100 (chosen, rejected) examples.
>
> Using more than two summaries per comparison is possible and can be beneficial in certain scenarios. However, the fundamental concept in training a reward model is based on pairwise comparison, for the following reasons:
>
> - **Simplicity**: Pairwise comparisons reduce the complexity of the preference task. Annotators only need to decide between two options, which is simpler and less cognitively demanding than ranking or rating multiple summaries.
>
> - **Clear Signal**: In a pairwise setup, the model gets a clear signal: one summary is better than the other. This direct feedback helps the model learn more effectively.
>
> - **Training Stability**: Pairwise comparisons have been shown to provide stable and consistent training signals, which is crucial for learning effective reward functions.
>
> However, there are some ways to incorporate more summaries directly. I will explain three of them:
>
> 1. **Ranking-Based Approach**: Instead of comparing pairs, we can use a ranking approach where annotators rank multiple summaries from best to worst. This approach captures more nuanced preferences and can provide richer training data. In this setup, we can use listwise loss functions, such as ListNet or ListMLE, to train the model based on the entire ranking of summaries.
> 2. **Rating-Based Approach**: As I mentioned in the previous question, annotators can rate each summary independently on a scale (e.g., 1-5). This method allows for the comparison of multiple summaries simultaneously and provides a way to incorporate all summaries into the training process. In this approach, we can use regression-based approaches where the model predicts a rating for each summary and is trained to minimize the difference between predicted and actual ratings.
>
> 3. **Combining Multiple Summaries with Pairwise Comparisons**: Even when dealing with multiple summaries, we can still employ pairwise comparisons by generating all possible pairs from the set of summaries. This approach scales well and retains the simplicity and effectiveness of pairwise comparisons. In this case, you can use all pairwise combinations to provide comprehensive training data.
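
The third option above, expanding several ranked summaries into pairwise comparisons, can be sketched as follows. The helper name `ranking_to_pairs` is hypothetical, and the summaries are assumed to be ordered from best to worst; the output uses the same `chosen`/`rejected` keys as the comparison data in this notebook.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_summaries):
    """Expand summaries ranked best-to-worst into all (chosen, rejected) pairs."""
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`
    for better, worse in combinations(ranked_summaries, 2):
        pairs.append({
            "chosen": prompt + "\n" + better,
            "rejected": prompt + "\n" + worse,
        })
    return pairs

# Hypothetical post with three ranked summaries
pairs = ranking_to_pairs("POST: some long post", ["best tl;dr", "ok tl;dr", "bad tl;dr"])
print(len(pairs))
```

A ranking of K summaries yields K(K-1)/2 pairs, so the amount of training data grows quadratically with the number of ranked summaries per post.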


### Raw Input

Now, we'll create a list of dicts using the create_comparison_dataset function (shown below), where each dict has two keys - chosen and rejected. The value of each key is the prompt (or Reddit post) concatenated with the summary.

**Note:** We can increase the number of training examples.

In [None]:
def create_comparison_dataset(
     path="CarperAI/openai_summarize_comparisons", split="train"
 ):
     dataset = load_dataset(path, split=split)
     if split == "test":
         dataset = dataset.select(range(1000))
     elif split == "train":
         dataset = dataset.select(range(10000))

     pairs = []
     for sample in tqdm(dataset):
         pair = {}
         prompt = sample["prompt"]
         chosen_summary = sample["chosen"]
         rejected_summary = sample["rejected"]
         if chosen_summary == rejected_summary:
             continue
         if  len(chosen_summary.split()) < 5 or len(rejected_summary.split()) < 5:
             continue
         pair["chosen"] = prompt + "\n" + chosen_summary
         pair["rejected"] = prompt + "\n" + rejected_summary
         pairs.append(pair)
     return pairs


### Pairwise Dataloader (2.5 points)

The PairwiseDataset class shown below tokenizes the chosen and rejected "summaries". The dataset class returns the input_ids and attention_masks for both chosen and rejected summaries. In this part you should complete the **PairwiseDataset class.**

In [None]:
class PairwiseDataset(Dataset):
    # WRITE YOUR CODE HERE
    def __init__(self, pairs, tokenizer, max_length):
         self.chosen_input_ids = []
         self.chosen_attn_masks = []

         self.rejected_input_ids = []
         self.rejected_attn_masks = []

         for pair in tqdm(pairs):
              chosen = "<|startoftext|>" + pair["chosen"] + "<|endoftext|>"

              chosen_encodings_dict = tokenizer(
                  chosen,
                  truncation=True,
                  max_length=max_length,
                  padding="max_length",
                  return_tensors="pt",
              )

              self.chosen_input_ids.append(chosen_encodings_dict["input_ids"])
              self.chosen_attn_masks.append(chosen_encodings_dict["attention_mask"])

              rejected = "<|startoftext|>" +  pair["rejected"] + "<|endoftext|>"

              rejected_encodings_dict = tokenizer(
                  rejected,
                  truncation=True,
                  max_length=max_length,
                  padding="max_length",
                  return_tensors="pt",
              )

              self.rejected_input_ids.append(rejected_encodings_dict["input_ids"])
              self.rejected_attn_masks.append(rejected_encodings_dict["attention_mask"])

    def __len__(self):
        return len(self.chosen_input_ids)

    def __getitem__(self, idx):
        return (
            self.chosen_input_ids[idx],
            self.chosen_attn_masks[idx],
            self.rejected_input_ids[idx],
            self.rejected_attn_masks[idx],
        )

### Data Collator (2.5 Points)

The DataCollatorReward class creates batches (dict) of data for our reward model. The collator returns:

*   input_ids: collator concatenates the chosen and rejected summaries' input_ids across dim=0.
*   attention_mask: collator concatenates the chosen and rejected summaries' attention_mask across dim=0.

*   labels: collator creates a tensor of zeros for chosen summaries and a tensor of ones for rejected summaries concatenated across dim=0.

In [None]:
class DataCollatorReward:
  def __call__(self, data):
    batch = {}

    # WRITE YOUR CODE HERE
    batch["input_ids"] = torch.cat([d[0] for d in data] + [d[2] for d in data])
    batch["attention_mask"] = torch.cat([d[1] for d in data] + [d[3] for d in data])
    batch["labels"] = torch.tensor([0] * len(data) + [1] * len(data))

    return batch

### What is happening in reward model?

Here, we have a Reddit post and two summaries (<font color='green'><b>chosen</b></font> and <font color='red'><b>rejected</b></font>) as input.

The ground truth label (**labels**) is the human feedback (<font color='green'><b>0 for chosen</b></font> and <font color='red'><b>1 for rejected</b></font>). And the loss function (pairwise ranking loss) is given as:

$$\text{loss}(r_{\theta}) = -\mathbb{E}_{(x, y_0, y_1, i) \sim D} \left[ \log \left( \sigma \left( r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i}) \right) \right) \right]$$


where:
- $ x $ is the post,
- $ y_0 $ and $ y_1 $ are the summaries,
- $ i \in \{0, 1\} $ indicates which summary is preferred by humans,
- $ r_{\theta}(x, y) $ is the reward model that returns a scalar value for the post $ x $ and the summary $ y $,
- $ \sigma $ is the sigmoid function.


The reward model $ r_{\theta} $ takes the post $ x $ and the summary $ y $ and returns a scalar value. The value is computed for both summaries and a sigmoid activation is applied to the difference.

Finally, the negative log is computed.

This loss function encourages the model to give higher scores to human-preferred summaries.
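
As a small numeric illustration of the formula above, the sketch below evaluates the per-pair loss for a few hypothetical reward values in plain Python. The function name `pairwise_ranking_loss` and the reward values are made up for this example; the reward model itself computes the same quantity with PyTorch tensors.

```python
import math

def pairwise_ranking_loss(r_preferred: float, r_other: float) -> float:
    """-log(sigmoid(r_preferred - r_other)) for a single comparison."""
    margin = r_preferred - r_other
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp()
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical rewards: correctly ranked, tied, and wrongly ranked pairs
for r_pref, r_other in [(2.0, -1.0), (0.0, 0.0), (-1.0, 2.0)]:
    print(r_pref, r_other, round(pairwise_ranking_loss(r_pref, r_other), 4))
```

The loss is small when the preferred summary scores well above the other one, equals log 2 when the two scores are tied, and grows roughly linearly with the margin when the ranking is inverted.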

**How to code this?**

Our model receives input prepared by the data collator.

*   This input is passed through the GPT-2 model to get the final hidden states.

*   The hidden state is then passed through the linear layer to get a reward score.

*   For each batch fed into the model, the first half is the chosen summaries, and the second half is the rejected summaries.

*   The forward method of the model iterates through each input sample to compute pairwise loss.
*   The forward method returns the loss together with the scores of the chosen and rejected summaries.

### Question 3 (2.5 Points)

**What is the goal of pairwise ranking loss, and how do we achieve this goal?**

> The loss function aims to maximize the likelihood that the preferred summary is scored higher than the non-preferred summary.
>
> When the model assigns scores $r_{\theta}(x, y_i)$ and $r_{\theta}(x, y_{1-i})$ to the summaries $y_i$ and $y_{1-i}$, the difference $r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i})$ should ideally be positive if $y_i$ is indeed the preferred summary.
>
> The sigmoid function $\sigma \left( r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i}) \right)$ converts this score difference into a probability-like value between 0 and 1. A higher positive difference results in a value close to 1, indicating a high probability that the preferred summary is ranked correctly.
>
> Taking the log of this sigmoid value, $\log \left( \sigma \left( r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i}) \right) \right)$, transforms the objective into a log-likelihood. Maximizing this log-likelihood means we want to make the probability of the preferred summary being ranked higher as close to 1 as possible.
>
> Since we are minimizing the negative log-likelihood, our objective becomes to minimize the chance that a non-preferred summary is ranked higher than a preferred one.
>
> ---
>
> During training, the model's parameters $\theta$ are updated to minimize this loss.
>
> Initially, the model might give arbitrary scores to summaries since it's not yet trained.
>
> For each training sample $(x, y_0, y_1, i)$, the loss is computed, and the model parameters $\theta$ are updated using gradient descent to minimize this loss.
>
> As training progresses, the model adjusts the scores it assigns to summaries. For pairs where the preferred summary initially has a lower score, the model learns to increase the score for the preferred summary and decrease the score for the non-preferred summary.
>
> Eventually, the model converges to a state where it consistently assigns higher scores to preferred summaries. The differences $r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i})$ become positive and large enough such that $\sigma \left( r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i}) \right)$ is close to 1.

### Question 4 (2.5 Points)

**Explain how, in the process of training the reward model, the pairwise ranking loss can avoid the problem of huge score differences between the answers (summaries), and why this is useful.**

> In a text summarization task, when training a reward model to rank summaries, it’s possible that without proper regulation, the model might assign extremely large or small scores to some summaries. This could lead to instability in training and poor generalization, as the model could overly fit to the training examples where it has seen large score differences. The pairwise ranking loss addresses this problem by focusing on the relative difference between scores rather than their absolute values.
>
>  The core of pairwise ranking loss is the difference between the scores of the preferred summary $r_{\theta}(x, y_i)$ and the non-preferred summary $r_{\theta}(x, y_{1-i})$. So, instead of incentivizing the model to produce high absolute scores, it incentivizes the model to produce scores where the preferred summary is consistently ranked higher.
>
> The sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ maps any real number to the interval (0, 1). This function has a property that large positive or negative values asymptotically approach 1 or 0. For example, if the score difference $r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i})$ becomes very large, the sigmoid function $\sigma(z)$ will approach 1, making the gradient very small. This prevents the model from further increasing the score difference unnecessarily.
>
> The loss function uses the log of the sigmoid value $\log(\sigma(r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i})))$. When the score difference is large, the sigmoid is close to 1, and $\log(1) = 0$. This minimizes the contribution to the loss from such pairs.
>
> Thus, once the model has learned to rank the preferred summary higher with sufficient confidence, it no longer aggressively increases the score difference.
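
The saturation argument above can be made explicit with a short derivation, offered here as a sketch. Writing $\Delta = r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i})$ for the score difference (the symbol $\Delta$ is introduced only for this note), the gradient of the per-pair loss with respect to $\Delta$ is:

$$\frac{\partial}{\partial \Delta}\left[-\log \sigma(\Delta)\right] = -\left(1 - \sigma(\Delta)\right) = -\sigma(-\Delta)$$

Its magnitude never exceeds 1, approaches 1 when the pair is ranked badly wrong ($\Delta \to -\infty$), and decays to 0 as $\Delta \to +\infty$, so pairs that are already well separated contribute almost no further push on the scores.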

### Implementing The Reward Model (5 Points)

In [None]:
class GPTRewardModel(nn.Module):
  def __init__(self, model_path):
    super().__init__()
    model = AutoModelForCausalLM.from_pretrained(model_path)
    self.config = model.config

    # WRITE YOUR CODE HERE
    self.transformer = model.transformer
    self.scoring_linear = nn.Linear(self.config.n_embd, 1, bias=False)

  def forward(
      self,
      input_ids=None,
      past_key_values=None,
      attention_mask=None,
      token_type_ids=None,
      position_ids=None,
      head_mask=None,
      inputs_embeds=None,
      mc_token_ids=None,
      labels=None,
      return_dict=False,
      output_attentions=False,
      output_hidden_states=False,
  ):
      transformer_outputs = self.transformer(
          input_ids,
          past_key_values=past_key_values,
          attention_mask=attention_mask,
          token_type_ids=token_type_ids,
          position_ids=position_ids,
          head_mask=head_mask,
          inputs_embeds=inputs_embeds,
      )

      rewards = self.scoring_linear(transformer_outputs[0]).squeeze(-1)

      # Split the inputs and rewards into: chosen and rejected
      bs = input_ids.shape[0] // 2

      chosen = input_ids[:bs]
      chosen_rewards = rewards[:bs]

      rejected = input_ids[bs:]
      rejected_rewards = rewards[bs:]

      loss = 0
      chosen_end_scores = []
      rejected_end_scores = []

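      for i in range(bs):

          # Position of the first EOS/pad token (id 50256) marks the end of each sequence
          c_no_pad = (chosen[i] == 50256).nonzero()
          c_end_idx = c_no_pad[0].item() if len(c_no_pad) > 0 else chosen.shape[1]

          r_no_pad = (rejected[i] == 50256).nonzero()
          r_end_idx = r_no_pad[0].item() if len(r_no_pad) > 0 else rejected.shape[1]

          # Ignore whole prompt section in input: start at the first differing token
          # (assumes chosen and rejected differ somewhere within max_length)
          first_idx = (chosen[i] != rejected[i]).nonzero()[0].item()

          r_c = chosen_rewards[i][first_idx:max(c_end_idx, r_end_idx)]
          r_r = rejected_rewards[i][first_idx:max(c_end_idx, r_end_idx)]

          # Pairwise ranking loss averaged over the compared token span
          loss += -torch.log(torch.sigmoid(r_c - r_r)).mean()
          # The reward at the last position of the span serves as the sequence-level score
          chosen_end_scores.append(r_c[-1])
          rejected_end_scores.append(r_r[-1])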

      return {
          "loss": loss/bs,
          "chosen_end_scores": torch.stack(chosen_end_scores),
          "rejected_end_scores": torch.stack(rejected_end_scores)
      }

**After finishing the above code, could you please explain how the scores for the selected summaries and the scores for the rejected summaries are calculated in your code?** (2.5 Points)

> First, the inputs are passed through the transformer, and the final hidden states are fed to a linear layer (initialized in the constructor of the class). The output of this layer holds one `rewards` score per token. As stated above, the first half of the batch contains the chosen summaries and the second half the rejected summaries, so we split both `input_ids` and `rewards` accordingly.
>
> Then we iterate over the pairs. For each pair, we find the end index of the chosen and the rejected sequence, i.e. the position of the first pad token (or the sequence length if there is none). We also find the first token at which chosen and rejected differ, which lets us ignore the prompt shared between them. We slice both reward tensors from that position up to the larger of the two end indices, compute the loss from the equation above on this span, and store the last reward of each slice as the end score of the chosen and rejected summary.
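
To illustrate the index bookkeeping described above, here is a toy version in plain Python lists instead of tensors. The helper name `reward_span` and the token ids are hypothetical; only the id 50256 (GPT-2's end-of-text token, reused as the pad token in this notebook) comes from the code above.

```python
EOS_ID = 50256  # GPT-2 end-of-text id, also used for padding in this notebook

def reward_span(chosen_ids, rejected_ids, eos_id=EOS_ID):
    """Return (start, end) of the token span over which the pairwise loss is averaged."""
    def first_eos(ids):
        # Index of the first EOS/pad token, or the full length if there is none
        return ids.index(eos_id) if eos_id in ids else len(ids)

    end = max(first_eos(chosen_ids), first_eos(rejected_ids))
    # First position where the sequences differ, i.e. where the shared prompt ends.
    # Raises StopIteration if the two sequences are identical.
    start = next(i for i, (c, r) in enumerate(zip(chosen_ids, rejected_ids)) if c != r)
    return start, end

# Three shared prompt tokens, then summaries of different lengths, then padding
chosen = [11, 12, 13, 21, 22, 23, EOS_ID, EOS_ID]
rejected = [11, 12, 13, 31, 32, EOS_ID, EOS_ID, EOS_ID]
start, end = reward_span(chosen, rejected)
print(start, end)
```

Because the span ends at the larger of the two end indices, the shorter sequence contributes the reward at its first EOS position as well, so both slices have the same length and can be subtracted elementwise.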

### Load datasets

In [None]:
# Create the comparisons datasets
data_path = comparissions_path
train_pairs = create_comparison_dataset(data_path, "train")
val_pairs = create_comparison_dataset(data_path, "test")

# Make pairwise datasets for training
max_length = 550
train_dataset = PairwiseDataset(train_pairs, tokenizer, max_length=max_length)
val_dataset = PairwiseDataset(val_pairs, tokenizer, max_length=max_length)

# Create the collator to gather batches of pairwise comparisons
data_collator = DataCollatorReward()

Downloading readme:   0%|          | 0.00/462 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.7M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/13.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/92534 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/83629 [00:00<?, ? examples/s]

Generating valid1 split:   0%|          | 0/33082 [00:00<?, ? examples/s]

Generating valid2 split:   0%|          | 0/50715 [00:00<?, ? examples/s]

100%|██████████| 10000/10000 [00:00<00:00, 25984.65it/s]
100%|██████████| 1000/1000 [00:00<00:00, 25584.70it/s]
100%|██████████| 10000/10000 [00:30<00:00, 329.00it/s]
100%|██████████| 1000/1000 [00:02<00:00, 352.86it/s]


### Load Model and Tokenizer

Initialize the reward model from the SFT GPT-2 model.

In [None]:
model = GPTRewardModel("MyDrive/MyDrive/LLM/CA4")

tokenizer = AutoTokenizer.from_pretrained(CONFIG.model_name)
tokenizer.pad_token = tokenizer.eos_token

# Freeze the first 70% of the hidden layers of the reward model backbone
layers = model.transformer.h
num_layers = len(layers)
num_unfrozen = int(0.3 * num_layers)
for layer in layers[:-num_unfrozen]:
  layer.requires_grad_(False)

### Define Compute metric function (2.5 Points)

In this part you should implement the accuracy of our GPTRewardModel.

In [None]:
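def compute_metrics(eval_preds):
    # WRITE YOUR CODE HERE

    # predictions[0] holds the end scores of the chosen summaries,
    # predictions[1] the end scores of the rejected summaries
    scores_of_chosens = eval_preds.predictions[0]
    scores_of_rejecteds = eval_preds.predictions[1]

    # Accuracy: fraction of pairs in which the chosen summary gets the higher score
    accuracy = float((scores_of_chosens > scores_of_rejecteds).mean())

    return {"accuracy": accuracy}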

### Train

I trained the model for 3 epochs because of the GPU usage limits in Google Colab.

In [None]:
training_args = TrainingArguments(
      output_dir="rm_checkpoint/",
      num_train_epochs=3, # You can set it to CONFIG.epochs which is equal to 5
      logging_steps=10,
      gradient_accumulation_steps=4,
      save_strategy="steps",
      evaluation_strategy="steps",
      per_device_train_batch_size=1,
      per_device_eval_batch_size=1,
      eval_accumulation_steps=1,
      eval_steps=500,
      save_steps=500,
      warmup_steps=100,
      logging_dir="./logs",
      fp16=True,
      bf16=False,
      learning_rate=1e-5,
      save_total_limit=1,
    )

In [None]:
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        compute_metrics=compute_metrics,
        eval_dataset=val_dataset,
        data_collator=data_collator,
    )

In [None]:
trainer.train()
trainer.save_model(CONFIG.output_dir_rm)

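| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500  | 0.7295 | 0.699260 | 0.585 |
| 1000 | 0.6986 | 0.688489 | 0.583 |
| 1500 | 0.6912 | 0.686720 | 0.588 |
| 2000 | 0.6627 | 0.684128 | 0.589 |
| 2500 | 0.7256 | 0.681024 | 0.589 |
| 3000 | 0.6787 | 0.678759 | 0.589 |
| 3500 | 0.7072 | 0.678507 | 0.591 |
| 4000 | 0.6404 | 0.678301 | 0.589 |
| 4500 | 0.6614 | 0.678357 | 0.592 |
| 5000 | 0.6507 | 0.678392 | 0.592 |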


## Section Three: PPO Fine Tuning (25 Points)

### Question 5 (2.5 points)

**What is PPO algorithm?**

> The Proximal Policy Optimization (PPO) algorithm is a type of policy gradient method in RL that aims to improve the stability and performance of training.
> The agent interacts with the environment to collect trajectories, which consist of states, actions, rewards, and next states.
> PPO uses an advantage function to evaluate how much better or worse an action is compared to a baseline, usually the value function. This advantage is used to weight the policy updates.
> It also uses a **clipped surrogate objective** function to prevent large updates to the policy. This function ensures that the new policy is not too different from the old one. The surrogate objective is designed to keep the policy changes within a predefined range, preventing overly large updates that could destabilize learning.
>
> **Other Advantages**
> - **Simplicity**: PPO is simpler to implement and tune compared to other algorithms like Trust Region Policy Optimization (TRPO).
>
> - **Efficiency**: PPO strikes a good balance between sample efficiency and computational complexity, making it suitable for a wide range of reinforcement learning tasks.
>
> *P.S.* **Policy gradient methods** optimize the policy directly by computing the gradient of an objective function with respect to the policy parameters and updating the parameters in the direction that increases the expected reward.

**How does PPO work?**

> Initialize the policy $\pi_{\theta}$ and value function $V_{\phi}$ with random parameters. Run the current policy in the environment to collect a set of trajectories (state, action, reward, next state), and estimate the advantages for each action taken using the collected trajectories. Then use the clipped objective function to update the policy parameters $\theta$ by performing multiple steps of gradient ascent and update the value function parameters $\phi$ using the collected trajectories and a suitable loss function.
>
> Repeat the process of collecting trajectories, computing advantages, and updating the policy and value function until convergence.
>
> *P.S.* Clipped objective formula:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t, \text{clip} \left( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]$$
Where $\pi_{\theta}$ is the new policy, $\pi_{\theta_{\text{old}}}$ is the old policy, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is a small hyperparameter that defines the clipping range.
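
To make the clipped objective concrete, here is a minimal sketch in plain Python that evaluates $L^{CLIP}$ on a handful of hand-picked values. The function names and the numbers are illustrative assumptions; a real implementation would operate on tensors of per-token log-probabilities and maximize this quantity with an optimizer.

```python
import math


def clipped_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Average clipped term; the ratio is recovered from log-probabilities."""
    terms = [
        clipped_term(math.exp(ln - lo), adv, eps)
        for ln, lo, adv in zip(logp_new, logp_old, advantages)
    ]
    return sum(terms) / len(terms)


# The four interesting cases: ratio above/below the clip range, A > 0 and A < 0.
for ratio in (1.5, 0.5):
    for advantage in (1.0, -1.0):
        print(ratio, advantage, clipped_term(ratio, advantage))

objective = clipped_surrogate(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
print(objective)
```

The clip only binds when the ratio has already moved in the direction the advantage favours (ratio 1.5 with a positive advantage, ratio 0.5 with a negative one). In the other two cases the unclipped term is smaller and is kept, so the objective stays a pessimistic bound.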

### Question 6 (2.5 points)

**Is the PPO algorithm an on-policy or off-policy reinforcement learning algorithm? Explain its functionality within our context (RLHF).**

> The Proximal Policy Optimization (PPO) algorithm is an **on-policy** reinforcement learning algorithm. This means that the policy being optimized is the same as the policy used to generate the data. In other words, the actions taken to interact with the environment are dictated by the current policy, and the policy updates are based on the data collected from these interactions.
>
> In RLHF, human feedback is used to shape the reward function, which in turn guides the learning process.
> Human feedback can be used to create a reward model that evaluates the quality of actions or outcomes produced by the policy. This reward model can be trained using human-labeled examples indicating which actions or outcomes are preferable. The policy is then trained using the PPO algorithm, where the reward signal is provided by the human-trained reward model. The policy generates actions based on the current state, and these actions are evaluated by the reward model to provide the feedback necessary for learning. PPO updates the policy based on this feedback, ensuring that it remains stable and improves incrementally. The clipping mechanism of PPO is particularly useful here as it prevents the policy from changing too rapidly based on potentially noisy human feedback.


### Question 7 (2.5 points)

**Imagine a mini-batch of data has arrived, and we want to optimize the policy of generating summaries to maximize the reward using gradient ascent.**

**Why shouldn't this policy change too much and, in other words, become overoptimized? (Answer based on the response you provided to the previous question.)**

> When optimizing the policy to generate summaries and maximize the reward using gradient ascent, it is crucial to avoid changing the policy too much, for several reasons that align with the principles behind Proximal Policy Optimization (PPO). Overoptimization is problematic because of:
> 1. **Instability**: Large updates to the policy can cause instability in learning. If the policy changes too much from one iteration to the next, it may diverge or start performing poorly because the new policy might not generalize well to unseen states.
> 2. **Overfitting**: If the policy is overly optimized to the current mini-batch of data, it may overfit to the specific examples seen in that batch. This leads to poor performance on other data, as the policy becomes too specialized.
> 3. **Exploration-Exploitation Balance**: Drastic changes can disrupt the balance between exploration (trying new actions) and exploitation (using known good actions). A too-rapid shift towards exploitation might cause the policy to miss better actions that haven't been explored yet.
> 4. **Reward Model Noise**: Human feedback, which informs the reward model, can be noisy or biased. Overreacting to this feedback can lead the policy to adapt too strongly to specific feedback instances that might not represent the overall desired behavior.

**What do they do to solve this problem?**

> PPO introduces mechanisms to ensure that policy updates are stable and incremental, avoiding overoptimization:
> 1. **Clipping**: PPO uses a clipping mechanism on the probability ratio between the new policy and the old policy. The probability ratio $r(\theta)$ is clipped to the range $[1 - \epsilon, 1 + \epsilon]$, where $\epsilon$ is a small value (e.g., 0.1 or 0.2). Once the ratio leaves this range in the direction favoured by the advantage, the clipped term becomes constant, so the objective offers no further incentive to move the new policy away from the old one. This prevents large updates and ensures that the policy does not change too abruptly.
> 2. **Surrogate Objective**: PPO optimizes a surrogate objective function that balances between improving the policy and maintaining its similarity to the previous policy. The clipped objective helps in controlling the step size of the policy update. The surrogate objective function $L^{CLIP}(\theta)$ combines the clipped ratio with the advantage estimate, ensuring updates are beneficial but not excessive.
> 3. **Multiple Epochs**: Instead of updating the policy after each mini-batch, PPO performs multiple epochs of optimization on the same batch of data. This repeated, controlled optimization helps in making gradual improvements while using the same data for a more robust learning process. By iterating over the same mini-batch multiple times with controlled updates, the policy learns effectively without drastic shifts.
> 4. **Entropy Bonus**: PPO sometimes includes an entropy bonus in the objective function to encourage exploration. By discouraging the policy from becoming too deterministic, it maintains a healthy exploration-exploitation balance. The entropy term $H(\pi_\theta)$ encourages the policy to keep exploring different actions, preventing it from prematurely converging to a suboptimal deterministic policy.
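
As a small illustration of the entropy term $H(\pi_\theta)$ mentioned in the last point, the sketch below compares the entropy of a uniform and a nearly deterministic distribution over four actions. The distributions are made-up values; in an actual PPO loss the entropy would be multiplied by a small coefficient and added to the objective.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal for four actions
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # nearly deterministic policy
print(uniform, peaked)
```

A bonus proportional to this quantity is larger for the uniform distribution, which is why it pushes against premature collapse to a deterministic policy.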

### Question 8 (2.5 points)

**What is the overestimation problem in PPO fine-tuning?**

> The overestimation problem in Proximal Policy Optimization (PPO) fine-tuning refers to the tendency of the algorithm to overestimate the expected returns or advantages of certain actions, leading to suboptimal policy updates. This issue can adversely affect the performance and stability of the learning process. More precisely:
> 1. **Suboptimal Policies**: Overestimation can cause the policy to favor actions that appear to have high returns due to the overestimated advantages but are actually suboptimal. This can lead to a degradation in performance.
> 2. **Instability in Learning**: Large overestimations can result in significant policy changes, leading to instability and potential divergence during training. The policy may oscillate or fail to converge to an optimal solution.
> 3. **Exploration Issues**: If certain actions are consistently overestimated, the policy may become overly deterministic, reducing exploration and potentially missing out on better actions that were underestimated.

**Why does the overestimation problem happen in PPO fine-tuning?**

> **Noisy or Inaccurate Reward Signals**: During fine-tuning, especially when using human feedback or learned reward models, the reward signals can be noisy or biased. If the reward model overestimates the value of certain actions due to noise or overfitting to the training data, it can lead to incorrect updates.
>
> **Policy Updates and High Variance**: PPO involves updating the policy using estimates of advantage or return. If these estimates have high variance or are biased, the algorithm might overestimate the benefits of certain actions. This is especially problematic when using small mini-batches, where the variance of the advantage estimates can be higher.
>
> **Use of Clipped Surrogate Objective**: While the clipping mechanism in PPO is designed to stabilize updates, it can sometimes mask the true scale of the advantage estimates. If the clipping threshold is not well-tuned, it may not sufficiently mitigate the overestimation of advantages, leading to overly aggressive updates.

### Question 9 (2.5 points)

**What potential issue could arise when aligning a language model with human values?**

> 1. **Value Misalignment**: The model might interpret or learn the reward signals in a way that does not fully capture the complexity and nuance of human values. This can lead to the model generating outputs that seem superficially aligned with human values but fail in subtle, important ways.
> 2. **Ambiguity and Bias in Human Feedback**: Human feedback can be ambiguous or biased. If the feedback data used to train the reward model is not representative of a diverse range of human values, the language model might inherit and amplify these biases, leading to outputs that are not universally aligned.
> 3. **Specification Gaming**: The model might find ways to achieve high reward by exploiting loopholes in the reward model. This is known as specification gaming, where the model outputs technically meet the criteria for high reward but do so in unintended or undesirable ways.
> 4. **Scalability of Human Feedback**: Providing human feedback at scale is challenging. As models grow larger and more complex, the amount of feedback required to effectively train them increases, making it difficult to maintain the quality and consistency of the feedback.
>
> I think the most significant of these is the value misalignment problem.

**What solution has been proposed to address this issue?**

> One effective solution proposed to address these issues is the **iterative feedback and alignment training** approach. The model is initially trained using a reward model based on human feedback. This feedback is collected through various means, such as direct annotations, preference comparisons, and simulated environments where human evaluators provide rewards based on the model's outputs.  The reward model and the language model are regularly reevaluated and updated based on new feedback. This continuous process helps in capturing evolving human values and addressing any detected misalignments.
>
> To mitigate bias and ensure broader alignment, the feedback collection process is designed to include diverse perspectives. This involves engaging a wide range of human evaluators from different backgrounds and ensuring the feedback reflects a variety of viewpoints and values. Engaging in "red teaming" exercises, where teams intentionally try to find weaknesses or exploit the model, helps identify potential failure modes and specification gaming issues. Adversarial testing scenarios are used to refine the reward model and improve the robustness of the alignment.
>
> Enhancing the transparency and explainability of the model's decision-making process helps human evaluators understand why the model behaves in certain ways. This can improve the quality of feedback and ensure that the reward signals are correctly interpreted by the model. Developing scalable oversight mechanisms, such as automated tools for detecting bias and anomalies, can complement human feedback. These tools can provide additional layers of scrutiny and ensure that the model remains aligned with human values even as it scales.

### Question 10 (2.5 points)

**We know that the objective function of PPO tuning is as follows:**

$$ \text{objective}(\phi) = \mathbb{E}_{(x,y) \sim D_{\pi_{\phi}^{\text{RL}}}} \left[ r_{\theta}(x, y) - \beta \log \left( \frac{\pi_{\phi}^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} \right) \right] + \gamma \mathbb{E}_{x \sim D_{\text{pretrain}}} \left[ \log(\pi_{\phi}^{\text{RL}}(x)) \right]
 $$

**In the above objective function, the differentiation is with respect to $\phi$, yet the term $r_{\theta}(x, y)$ appears in the function.**

**Explain why $r$ appears in the objective function despite being a function of $\theta$ and why its derivative is not zero.**

> The reward model $r_{\theta}(x, y)$ appears in the objective function because it provides the reward signal necessary for the reinforcement learning process.
>
> Even though $r_{\theta}(x, y)$ is parameterized by $\theta$, and $\theta$ is held fixed during PPO tuning (so $\nabla_{\phi} r_{\theta}(x, y) = 0$ for any fixed pair), it still affects the gradient through the data distribution $D_{\pi_{\phi}^{\text{RL}}}$. The expectation $\mathbb{E}_{(x,y) \sim D_{\pi_{\phi}^{\text{RL}}}}$ indicates that the pairs $(x, y)$ are sampled according to the current policy $\pi_{\phi}^{\text{RL}}$, so changing $\phi$ changes the likelihood of each pair. Since $r_{\theta}(x, y)$ is evaluated on these pairs, its expected value changes with $\phi$ even though the reward of any individual pair does not. The derivative therefore comes from the distribution, not from the reward itself. By the log-derivative (score-function) identity, $\nabla_{\phi} \mathbb{E}_{y \sim \pi_{\phi}^{\text{RL}}(\cdot \mid x)} \left[ r_{\theta}(x, y) \right] = \mathbb{E}_{y \sim \pi_{\phi}^{\text{RL}}(\cdot \mid x)} \left[ r_{\theta}(x, y) \nabla_{\phi} \log \pi_{\phi}^{\text{RL}}(y \mid x) \right]$, so the reward enters the gradient as a weight on the score of each sampled output, and the derivative is not zero.
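
The argument can be checked numerically on a toy problem. The sketch below assumes a hypothetical softmax "policy" over three candidate outputs with logits playing the role of $\phi$, and a fixed reward per output playing the role of $r_{\theta}$. It compares the score-function form $\mathbb{E}_y[r(y) \nabla_\phi \log \pi_\phi(y)]$, computed exactly by enumerating the three outputs, with a finite-difference gradient of the expected reward.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """J(phi) = sum_y pi_phi(y) * r(y); r does not depend on phi."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def score_function_gradient(logits, rewards):
    """E_y[ r(y) * d log pi_phi(y) / d phi_j ], enumerated exactly."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(probs, rewards)):
        for j in range(len(logits)):
            # For a softmax policy: d log pi(y) / d phi_j = 1[y == j] - pi(j)
            dlogp = (1.0 if y == j else 0.0) - probs[j]
            grad[j] += p_y * r_y * dlogp
    return grad


def finite_difference_gradient(logits, rewards, h=1e-6):
    grad = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += h
        down[j] -= h
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2.0 * h))
    return grad


logits = [0.2, -0.1, 0.5]   # toy policy parameters (phi)
rewards = [1.0, 0.0, 2.0]   # fixed rewards, independent of phi

sf_grad = score_function_gradient(logits, rewards)
fd_grad = finite_difference_gradient(logits, rewards)
print(sf_grad)
print(fd_grad)
```

Both gradients agree and are nonzero, although the rewards themselves never depend on the logits.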

### Question 11 (5 points)

**Another term present in the objective function is $\beta \log \left( \frac{\pi_{\phi}^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} \right)$, which is the KLD between the initial policy distribution and the policy being learned. Explain why this term cannot be calculated directly.**


**For a detailed explanation of how the KLD is calculated in this context, please read this <br>[blog post](http://joschu.net/blog/kl-approx.html). Afterward, provide an explanation of the calculation process.**

> Calculating this term directly is challenging due to several reasons:
> 1. **Expectation over the Distribution**: The term $\mathbb{E}_{(x,y) \sim D_{\pi_{\phi}^{\text{RL}}}}$ implies an expectation over the data distribution governed by the current policy $\pi_{\phi}^{\text{RL}}$. To compute the KLD term exactly, one would need to know the exact distribution of $(x, y)$ pairs under the current policy, which is generally infeasible in practice.
> 2. **Intractability of High-Dimensional Integrals**: Calculating the exact KLD requires evaluating the integral (or sum) over all possible actions $y$ for each state $x$. For high-dimensional action spaces, this calculation becomes computationally prohibitive due to the need to sum over all possible actions.
> 3. **Sampling-Based Estimation**: For a single sampled output $y$, the log-probabilities under $\pi_{\phi}^{\text{RL}}(y \mid x)$ and $\pi^{\text{SFT}}(y \mid x)$ can be read from the two networks, but the KLD is an expectation over all possible outputs. In practice it is therefore estimated from the sampled outputs only, and this estimate has variance and, depending on the estimator, bias.
> 4. **Dynamic Nature of the Policy**: The policy $\pi_{\phi}^{\text{RL}}$ is constantly being updated during training. Direct computation of the KLD term would require continuous recalibration to account for these updates, adding to the computational complexity.
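
The blog post linked in the question replaces the exact sum with per-sample estimators. With samples $x \sim q$ and ratio $r = p(x)/q(x)$, it discusses three estimators of $KL[q, p]$: $k_1 = -\log r$, $k_2 = \frac{1}{2}(\log r)^2$, and $k_3 = (r - 1) - \log r$. The sketch below evaluates them on a small made-up categorical distribution, where expectations can be enumerated exactly. Here $q$ stands in for the policy being learned and $p$ for the SFT policy; which estimator a particular PPO implementation uses is not stated in this notebook.

```python
import math

q = [0.5, 0.3, 0.2]  # distribution we sample from (stand-in for the RL policy)
p = [0.4, 0.4, 0.2]  # reference distribution (stand-in for the SFT policy)


def kl_exact(q, p):
    """KL[q, p] = sum_x q(x) * log(q(x) / p(x)), feasible only for tiny spaces."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))


def k1(r):
    return -math.log(r)              # unbiased, but can be negative per sample


def k2(r):
    return 0.5 * math.log(r) ** 2    # biased, non-negative


def k3(r):
    return (r - 1.0) - math.log(r)   # unbiased and non-negative


ratios = [pi / qi for pi, qi in zip(p, q)]


def expectation_under_q(estimator):
    return sum(qi * estimator(r) for qi, r in zip(q, ratios))


exact = kl_exact(q, p)
mean_k1 = expectation_under_q(k1)
mean_k2 = expectation_under_q(k2)
mean_k3 = expectation_under_q(k3)
print(exact, mean_k1, mean_k2, mean_k3)
print([k1(r) for r in ratios])
print([k3(r) for r in ratios])
```

In expectation $k_1$ and $k_3$ both recover the exact KLD, because $\mathbb{E}_q[r - 1] = 0$. Per sample, however, $k_1$ takes negative values while $k_3$ never does. In training, the enumeration is replaced by an average over the tokens actually sampled from the policy.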

### Question 12 (5 points)

**How does DPO improve fine tuning?**

**In DPO, our loss optimizing function to optimize for the policy is:**
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x,y_w,y_l) \sim D}
\left[
\log \sigma
\left(
\beta \log \frac{\pi_\theta(y_w | x)}{\pi_\text{ref}(y_w | x)}
- \beta \log \frac{\pi_\theta(y_l | x)}{\pi_\text{ref}(y_l | x)}
\right)
\right]$$

**Please explain the terms of this loss function and interpret the function.**


**Now, please calculate the gradient of this loss function and intuitively explain each term in the gradient of the loss function.**

> The DPO (Direct Preference Optimization) loss function is designed to improve fine-tuning of a policy by directly optimizing the policy based on pairwise preferences. The DPO loss function aims to maximize the likelihood that the policy $ \pi_\theta $ ranks the preferred output $ y_w $ higher than the less preferred output $ y_l $. By optimizing this loss, the policy is adjusted to align more closely with the observed preferences in the dataset $ D $.
>
> **DPO loss function terms**:
> 1. **Expectation $\mathbb{E}_{(x, y_w, y_l) \sim D}$**: This denotes the expectation over the dataset $ D $, which consists of triplets $(x, y_w, y_l)$. Here, $ x $ represents the input context, $ y_w $ is the preferred (winning) output, and $ y_l $ is the less preferred (losing) output.
> 2. **Log-Sigmoid Function $ \log \sigma(\cdot) $**: $ \sigma(z) = \frac{1}{1 + e^{-z}} $ is the sigmoid function, and $ \log \sigma(z) $ is the log-sigmoid. This term converts the difference in log-probability ratios into a probability-like score that penalizes incorrect orderings.
> 3. **Log-Probability Ratios**: $ \log \frac{\pi_\theta(y_w | x)}{\pi_\text{ref}(y_w | x)} $ is the log-probability ratio of the preferred output $ y_w $ under the current policy $ \pi_\theta $ relative to the reference policy $ \pi_\text{ref} $, and $ \log \frac{\pi_\theta(y_l | x)}{\pi_\text{ref}(y_l | x)} $ is the log-probability ratio of the less preferred output $ y_l $ under the current policy $ \pi_\theta $ relative to the reference policy $ \pi_\text{ref} $. These ratios measure how much more (or less) the current policy prefers the winning output over the reference policy compared to the losing output.
> 4. **Scaling Factor $ \beta $**: This is a scaling factor that adjusts the impact of the log-probability ratios on the optimization.
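
For a single triplet $(x, y_w, y_l)$, the loss can be evaluated directly from four sequence log-probabilities. The sketch below is a minimal illustration in plain Python; the function names, the log-probability values, and the default `beta=0.1` are assumptions chosen for the example.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triplet, given sequence log-probabilities."""
    ratio_w = logp_w - ref_logp_w  # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    ratio_l = logp_l - ref_logp_l  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    return -log_sigmoid(beta * ratio_w - beta * ratio_l)


# Policy identical to the reference: the loss equals log 2.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy shifted towards the preferred output: the loss decreases.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
# Policy shifted towards the rejected output: the loss increases.
worse = dpo_loss(-14.0, -13.0, -12.0, -15.0)
print(at_reference, improved, worse)
```

Only the change relative to the reference policy matters: in the first case the reference already assigns a higher probability to $y_w$, yet the loss is still $\log 2$ because the policy has not moved.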

**Now, please calculate the gradient of this loss function and intuitively explain each term in the gradient of the loss function.**

> To calculate the gradient of the loss function, we need to differentiate it with respect to the policy parameters $ \theta $.
$$ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x,y_w,y_l) \sim D} \left[ \log \sigma \left( \beta \left( \log \frac{\pi_\theta(y_w | x)}{\pi_\text{ref}(y_w | x)} - \log \frac{\pi_\theta(y_l | x)}{\pi_\text{ref}(y_l | x)} \right) \right) \right] $$
Let $ z = \beta \left( \log \frac{\pi_\theta(y_w | x)}{\pi_\text{ref}(y_w | x)} - \log \frac{\pi_\theta(y_l | x)}{\pi_\text{ref}(y_l | x)} \right) $.
> Here are the calculation steps:
> 1. **Gradient of the Log-Sigmoid**:
   $$
   \frac{\partial}{\partial z} \log \sigma(z) = \sigma(-z)
   $$
> 2. **Gradient of the Log-Probability Ratios**:
   $$
   \frac{\partial z}{\partial \theta} = \beta \left( \frac{\partial}{\partial \theta} \log \pi_\theta(y_w | x) - \frac{\partial}{\partial \theta} \log \pi_\theta(y_l | x) \right)
   $$
> Combining these, we get the gradient of the DPO loss function:
$$
\nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x,y_w,y_l) \sim D} \left[ \sigma(-z) \cdot \beta \left( \nabla_\theta \log \pi_\theta(y_w | x) - \nabla_\theta \log \pi_\theta(y_l | x) \right) \right]
$$
> **Terms:**
> - **$\sigma(-z)$**: This term comes from the derivative of the log-sigmoid function. It represents the probability that the current policy's ordering of $ y_w $ and $ y_l $ is incorrect. If $\sigma(-z)$ is large, it indicates that the current policy is less confident about ranking $ y_w $ higher than $ y_l $.
> - **$\beta$**: The scaling factor $\beta$ adjusts the overall magnitude of the gradient. It controls the strength of the updates based on the log-probability differences.
> - **$\nabla_\theta \log \pi_\theta(y_w | x)$**: This term is the gradient of the log-probability of the preferred output $ y_w $ under the current policy. It increases the likelihood of $ y_w $ given $ x $.
> - **$\nabla_\theta \log \pi_\theta(y_l | x)$**: This term is the gradient of the log-probability of the less preferred output $ y_l $ under the current policy. It decreases the likelihood of $ y_l $ given $ x $.
>
> Note: AI helped me answer this question.
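
The gradient formula above can be sanity-checked numerically. The sketch below assumes a toy policy: a softmax over three candidate outputs whose logits play the role of $\theta$, with a fixed made-up reference distribution. It compares the analytic gradient $-\sigma(-z)\,\beta\,(\nabla_\theta \log \pi_\theta(y_w) - \nabla_\theta \log \pi_\theta(y_l))$ with a finite-difference gradient of the loss. All names and values are hypothetical.

```python
import math

BETA = 0.1
REF_PROBS = [0.5, 0.3, 0.2]  # fixed reference policy over three candidates
WINNER, LOSER = 0, 1         # indices of y_w and y_l


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def z_value(theta):
    probs = softmax(theta)
    ratio_w = math.log(probs[WINNER]) - math.log(REF_PROBS[WINNER])
    ratio_l = math.log(probs[LOSER]) - math.log(REF_PROBS[LOSER])
    return BETA * (ratio_w - ratio_l)


def loss(theta):
    return -math.log(sigmoid(z_value(theta)))


def analytic_gradient(theta):
    probs = softmax(theta)
    weight = sigmoid(-z_value(theta))  # large when the ordering is wrong
    grad = []
    for j in range(len(theta)):
        # For a softmax policy: d log pi(y) / d theta_j = 1[y == j] - pi(j)
        dlogp_w = (1.0 if j == WINNER else 0.0) - probs[j]
        dlogp_l = (1.0 if j == LOSER else 0.0) - probs[j]
        grad.append(-weight * BETA * (dlogp_w - dlogp_l))
    return grad


def numeric_gradient(theta, h=1e-6):
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        grad.append((loss(up) - loss(down)) / (2.0 * h))
    return grad


theta = [0.0, 0.4, -0.3]
grad_analytic = analytic_gradient(theta)
grad_numeric = numeric_gradient(theta)
print(grad_analytic)
print(grad_numeric)
```

The gradient is negative on the winner's logit and positive on the loser's, so a gradient-descent step raises $\pi_\theta(y_w \mid x)$ and lowers $\pi_\theta(y_l \mid x)$. The third candidate, which is in neither role, receives no gradient in this toy setup.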

### Run PPO Fine Tuning (optional)

Because of the limitations of Google Colab, this part is optional: if you have access to an additional GPU, you can run the code below for PPO fine-tuning.

In [None]:
%pip install -U git+https://github.com/CarperAI/trlx.git

In [None]:
import random  # used below to sample training prompts

import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer

import trlx
from trlx.data.configs import (
    ModelConfig,
    OptimizerConfig,
    SchedulerConfig,
    TokenizerConfig,
    TrainConfig,
    TRLConfig,
)
from trlx.models.modeling_ppo import PPOConfig

In [None]:
config = TRLConfig(
    train=TrainConfig(
        seq_length=550,
        epochs=50,
        total_steps=100000,
        batch_size=4,
        checkpoint_interval=10000,
        eval_interval=200,
        pipeline="PromptPipeline",
        trainer="AcceleratePPOTrainer",
    ),
    model=ModelConfig(
        model_path="gpt2",
        num_layers_unfrozen=8,
    ),
    tokenizer=TokenizerConfig(
        tokenizer_path="gpt2",
        truncation_side="right",
    ),
    optimizer=OptimizerConfig(
        name="adamw",
        kwargs={
            "lr": 5.0e-6,
            "betas": [0.9, 0.999],
            "eps": 1.0e-8,
            "weight_decay": 0.01,
        },
    ),
    scheduler=SchedulerConfig(
        name="cosine_annealing",
        kwargs={
            "T_max": 100000,
            "eta_min": 5.0e-6,
        },
    ),
    method=PPOConfig(
        name="PPOConfig",
        num_rollouts=128,
        chunk_size=16,
        ppo_epochs=4,
        init_kl_coef=0.1,
        target=6,
        horizon=10000,
        gamma=1,
        lam=0.95,
        cliprange=0.2,
        cliprange_value=0.2,
        vf_coef=0.2,
        scale_reward=None,
        ref_mean=None,
        ref_std=None,
        cliprange_reward=10,
        gen_kwargs={
            "max_new_tokens": 50,
        },
    ),
)


In [None]:
REWARD_CHECKPOINT_PATH = ""
SFT_MODEL_PATH = ""

In [None]:
from typing import List
def get_scores(samples: List[str]):
  scores_list = []
  batch_size = 2
  for i in range(0, len(samples), batch_size):
    sub_samples = samples[i : i + batch_size]
    sub_samples = ["<|startoftext|>" + chosen + "<|endoftext|>" for chosen in sub_samples]
    encodings_dict = rw_tokenizer(
        sub_samples,
        truncation=True,
        max_length=config.train.seq_length,
        padding="max_length",
        return_tensors="pt",
        )
    input_ids = encodings_dict["input_ids"].to(rw_device)
    attn_masks = encodings_dict["attention_mask"].to(rw_device)
    input_ids = input_ids.repeat(2, 1)
    attn_masks = attn_masks.repeat(2, 1)
    with torch.no_grad():
      sub_scores = rw_model(input_ids=input_ids, attention_mask=attn_masks)
    scores_list.append(sub_scores["chosen_end_scores"])
  scores = torch.cat(scores_list, dim=0)
  return scores

In [None]:
rw_tokenizer = AutoTokenizer.from_pretrained("gpt2")
rw_tokenizer.pad_token = rw_tokenizer.eos_token
rw_model = GPTRewardModel(SFT_MODEL_PATH)
rw_model.load_state_dict(torch.load(REWARD_CHECKPOINT_PATH), strict=False)
rw_model.half()
rw_model.eval()
rw_device = CONFIG.device  # get_scores() moves its inputs to this device
rw_model.to(rw_device)

In [None]:
def get_prompt_dataset(prompts, max_length):
  formatted_prompts = []
  for i in tqdm(range(len(prompts))):
    tmp = tokenizer.decode(
        tokenizer(
              prompts[i].split("TL;DR:")[0],
              truncation=True,
              max_length=max_length - 5,
              add_special_tokens=False,
            )["input_ids"],
            skip_special_tokens=True,
            ).strip()
    tmp = tmp + "\nTL;DR:"
    tmp = tokenizer.decode(
          tokenizer(tmp, truncation=True, max_length=max_length, add_special_tokens=False)["input_ids"],
          skip_special_tokens=True,
        ).strip()
    formatted_prompts.append(tmp)
  return formatted_prompts

In [None]:
def reward_fn(samples: List[str], **kwargs):
  original_samples = [text.split("TL;DR:")[0] + "TL;DR: " for text in samples]
  original_samples = [text + post_summary_dict[text.strip()] for text in original_samples]
  original_scores = get_scores(original_samples)
  scores = get_scores(samples)
  norms_scores = scores - original_scores
  return norms_scores

In [None]:
tokenizer = AutoTokenizer.from_pretrained(config.tokenizer.tokenizer_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
max_length_input = config.train.seq_length - config.method.gen_kwargs["max_new_tokens"]

# Load the dataset
dataset = load_dataset("carperai/openai_summarize_tldr")

# Get the total number of samples in the training dataset
total_samples = len(dataset["train"])

# Generate random indices for selecting samples
random_indices = random.sample(range(total_samples), 5000)

# Select samples using random indices
train_set = [(dataset["train"][index]["prompt"], dataset["train"][index]["label"]) for index in random_indices]
val_set = [(sample["prompt"], sample["label"]) for sample in dataset["valid"]]

# Split contents into posts and summaries
train_posts, train_summaries = zip(*train_set)
val_posts, val_summaries = zip(*val_set)

# Get the OpenAI summaries
post_summary_dict = {}
train_prompts = get_prompt_dataset(train_posts, max_length_input)
for i in range(len(train_prompts)):
  post_summary_dict[train_prompts[i]] = train_summaries[i]
val_prompts = get_prompt_dataset(val_posts, max_length_input)
for i in range(len(val_prompts)):
  post_summary_dict[val_prompts[i]] = val_summaries[i]

In [None]:
trainer = trlx.train(
    reward_fn=reward_fn,
    prompts=train_prompts,
    eval_prompts=val_prompts[0:1000],  # sampling 1000 validation prompts for evaluation speed in training
    config=config,
  )