### MHP Applied science group
# RLHF Hackathon: PPO

<div style="text-align: center;">
    <img src="../images/PPO_process.png" alt="PPO training process" style="display: block; margin-left: auto; margin-right: auto;width:800px">
    <p style="text-align:center">Read more about the PPO algorithm in the <a href="https://arxiv.org/abs/1707.06347">original paper</a>.</p>
</div>

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm that is widely used to fine-tune language models. It improves the stability and efficiency of training by constraining each policy update: the objective keeps the new policy from deviating too far from the old policy, so that a single update cannot change the model's behavior drastically.
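
To make the constraint on policy changes concrete, here is a minimal sketch of the clipped surrogate objective from the PPO paper, written for a single action in plain Python. The function name `clipped_surrogate` and the default `clip_eps=0.2` are illustrative assumptions, not part of the notebook or of TRL.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO objective for one action (a quantity to maximize).

    Hypothetical helper for illustration; real implementations work on batches of tensors.
    """
    # Probability ratio between the new and the old policy for this action.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Use the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# The ratio is e^1 (about 2.72), but the gain is capped at (1 + 0.2) * 2.0.
print(clipped_surrogate(logp_new=-0.5, logp_old=-1.5, advantage=2.0))  # prints 2.4
```

Because of the `min`, the cap only limits how much the objective can gain from moving away from the old policy; a harmful move (large ratio with a negative advantage) is still penalized in full.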

### Steps to Apply PPO to an LLM

The first step is to train an SFT (supervised fine-tuning) model, so that the data we train on is in-distribution for the PPO algorithm. In addition, we need to train a reward model, which is used to optimize the SFT model with the PPO algorithm.

 1. Rollout: The language model generates a response or continuation based on a query, which could be the start of a sentence.
 2. Evaluation: The query and response are evaluated with a function, a model, human feedback, or some combination of them. The important thing is that this process yields a scalar value for each query/response pair.
 3. Optimization: This is the most complex part. In the optimization step, the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with both the model being trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don’t deviate too far from the reference language model. The active language model is then trained with PPO.
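
The way the KL term enters the reward in the optimization step can be sketched as follows. This is a simplified illustration of a common formulation, assuming a per-token KL estimate (the difference of log-probabilities) and a scalar score added on the last response token. The names `kl_penalized_rewards` and `kl_coef` are hypothetical, and the numbers are made up.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, kl_coef=0.2):
    """Combine a scalar score with a per-token KL penalty (illustrative sketch).

    logprobs:     log-probabilities of the response tokens under the trained model
    ref_logprobs: log-probabilities of the same tokens under the reference model
    score:        scalar reward for the whole query/response pair
    """
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("logprobs and ref_logprobs must be non-empty and of equal length")
    # Per-token KL estimate: log pi(token) - log pi_ref(token), scaled and negated.
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    # The scalar score from the evaluation step is credited to the final token.
    rewards[-1] += score
    return rewards


# The first token is more likely under the trained model than under the
# reference, so it receives a small penalty; the last token carries the score.
print(kl_penalized_rewards([-1.0, -2.0], [-1.5, -2.0], score=1.0))
```

The larger `kl_coef` is, the more strongly the trained model is pulled back toward the reference model.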

In [1]:
%load_ext autoreload
%autoreload 2

### Load libraries

In [2]:
import random
import torch
import time
import os
from tqdm import tqdm
import numpy as np
import pandas as pd
from random import choices
import matplotlib.pyplot as plt

tqdm.pandas()

from datasets import load_dataset

from transformers import AutoTokenizer, pipeline

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model

### Load fine-tuned Model

At a high level, we need to initialize the PPOTrainer with the model we wish to train. Additionally, we require a reward model, which we use to rate the generated responses.

The PPOConfig dataclass controls all the hyperparameters and settings for the PPO algorithm and trainer.


In [3]:
sentiment_pipe_kwargs = {"top_k": None, "function_to_apply": "none"}

config = PPOConfig(
    model_name="lvwerra/gpt2-imdb", steps=51200, learning_rate=1.41e-5, remove_unused_columns=False,
)

txt_in_len = 5
txt_out_len = 20
seed = 1

In [4]:
random.seed(seed)  # random.choices is used later to draw the control tokens
np.random.seed(seed)

Now we can initialize our model. PPO also requires a reference model; here we create it explicitly with `create_reference_model`, which returns a frozen copy of the model. The model can be initialized as follows:

In [5]:
gpt2_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
gpt2_model_ref = create_reference_model(gpt2_model)
gpt2_tokenizer = AutoTokenizer.from_pretrained(config.model_name)

gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
gpt2_tokenizer.padding_side = 'left'


### Load Dataset

The PPOTrainer aligns a generated response with a query, given the rewards obtained from the reward model. During each step of the PPO algorithm, we sample a batch of prompts from the dataset and use these prompts to generate responses from the SFT model. Next, the reward model computes the rewards for the generated responses. Finally, these rewards are used to optimize the SFT model with the PPO algorithm.

Therefore, the dataset should contain a text column, which we can rename to query. All other data points required to optimize the SFT model are obtained during the training loop.



In [6]:
from datasets import load_dataset
# create the dataset
dataset = load_dataset("imdb", split="train")
dataset = dataset.rename_columns({"text": "review", "label": "sentiment"})
# keep only reviews longer than 500 characters and truncate them to 1000 characters
dataset = dataset.filter(lambda x: len(x["review"]) > 500, batched=False)
dataset = dataset.map(lambda x: {"review": x["review"][:1000]}, batched=False)

dataset


Dataset({
    features: ['review', 'sentiment'],
    num_rows: 22578
})

Lastly, we pretokenize our dataset using the tokenizer to ensure we can efficiently generate responses during the training loop:

In [7]:
dataset = dataset.map(
    lambda x: {"input_ids": gpt2_tokenizer.encode(" " + x["review"], return_tensors="pt")[0, :txt_in_len]},
    batched=False,
)
dataset = dataset.map(lambda x: {"query": gpt2_tokenizer.decode(x["input_ids"])}, batched=False)
dataset = dataset[:20480]

from datasets import Dataset

dataset = Dataset.from_dict(dataset)
dataset.set_format("pytorch")


In [8]:
dataset[3]["input_ids"]

tensor([ 770, 2646,  373, 2192, 7867])

In [9]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
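
To see what the collator does, the following self-contained sketch applies the same logic to two made-up samples. It turns a list of per-sample dictionaries into one dictionary of lists, so that each key holds the values for the whole batch. The sample texts and token IDs are placeholders, not real tokenizer output.

```python
def collator(data):
    # Same logic as the notebook cell above, written as a dict comprehension.
    return {key: [d[key] for d in data] for key in data[0]}


# Two illustrative samples (placeholder values).
samples = [
    {"query": "first query", "input_ids": [1, 2, 3]},
    {"query": "second query", "input_ids": [4, 5, 6]},
]

batch = collator(samples)
print(batch["query"])      # ['first query', 'second query']
print(batch["input_ids"])  # [[1, 2, 3], [4, 5, 6]]
```

Keeping the values as plain lists, rather than stacking them into one tensor, means that the sequences in a batch do not need to be padded to the same length.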

### Using and initializing the PPOTrainer

As mentioned above, we are now ready to initialize the PPOTrainer using the defined config, datasets, and model.

In [10]:
ppo_trainer = PPOTrainer(
    config,
    gpt2_model,
    gpt2_model_ref,
    gpt2_tokenizer,
    dataset,
    data_collator=collator
)

In [11]:
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug
else:
    device = ppo_trainer.accelerator.device

In [12]:
print(f"we are using {device}")

we are using 0


The reward can be generated using any function that returns a single value for a string, be it a simple rule (e.g. length of string), a metric (e.g. BLEU), or a reward model based on human preferences. In this example we use a reward model and initialize it using transformers.pipeline for ease of use.
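
As an illustration of the simplest option, a rule-based reward, the sketch below scores a response by how close its word count is to a target. The function `length_reward` and its `target_words` parameter are hypothetical and are not used elsewhere in this notebook.

```python
def length_reward(response: str, target_words: int = 20) -> float:
    """Rule-based reward: 0.0 at the target length, more negative the further away.

    Hypothetical example of a reward that needs no model at all.
    """
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    n_words = len(response.split())
    return -abs(n_words - target_words) / target_words


print(length_reward("a short response", target_words=3))   # 0.0
print(length_reward("too short", target_words=4))          # -0.5
```

The rewards passed to `ppo_trainer.step` in the training loop are a list of tensors, so a float returned by a function like this would need to be wrapped with `torch.tensor` first.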

In [13]:
sentiment_pipe = pipeline("sentiment-analysis", "lvwerra/distilbert-imdb", device=device)


In [14]:
text = "this movie was really bad!!"
output = sentiment_pipe(text, **sentiment_pipe_kwargs)
output

[{'label': 'NEGATIVE', 'score': 2.3350486755371094},
 {'label': 'POSITIVE', 'score': -2.726576566696167}]

In [15]:
def extract_pipe_output(outputs):
    positive_logits = []
    for out in outputs:
        for element in out:
            if element["label"] == "POSITIVE":
                positive_logits.append(torch.tensor(element["score"]))
    return positive_logits

In [16]:
output[1]["score"]

-2.726576566696167

In [17]:
ctrl_str = ["[negative]", "[neutral]", "[positive]"]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # this should be handled by accelerate
ctrl_tokens = dict((s, gpt2_tokenizer.encode(s, return_tensors="pt").squeeze().to(device)) for s in ctrl_str)

In [18]:
ctrl_tokens

{'[negative]': tensor([   58, 31591,    60], device='cuda:0'),
 '[neutral]': tensor([   58, 29797,    60], device='cuda:0'),
 '[positive]': tensor([   58, 24561,    60], device='cuda:0')}

In [19]:
def pos_logit_to_reward(logit, task):
    """
    Take the positive sentiment logit and scale it for the task.
        task [negative]: reward = -logit
        task [neutral]: reward = -2*abs(logit)+4
        task [positive]: reward = logit
    """
    for i in range(len(logit)):
        if task[i] == "[negative]":
            logit[i] = -logit[i]
        elif task[i] == "[neutral]":
            logit[i] = -2 * torch.abs(logit[i]) + 4
        elif task[i] == "[positive]":
            pass
        else:
            raise ValueError("task has to be one of '[negative]', '[neutral]' or '[positive]'!")
    return logit

In [20]:
print(ctrl_str)

['[negative]', '[neutral]', '[positive]']


In [21]:
pos_logit_to_reward(torch.Tensor([4, 4, 4]), ctrl_str)

tensor([-4., -4.,  4.])

In [22]:
pos_logit_to_reward(torch.Tensor([-4, -4, -4]), ctrl_str)

tensor([ 4., -4., -4.])

In [23]:
pos_logit_to_reward(torch.Tensor([0, 0, 0]), ctrl_str)

tensor([-0., 4., 0.])
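
Note that `pos_logit_to_reward` modifies its input in place. The mapping itself can also be written as a small pure function on a single value, which makes the three cases easy to check by hand. The helper `scale_logit` below is a hypothetical sketch using plain floats, not part of the notebook.

```python
def scale_logit(logit: float, task: str) -> float:
    """Map a positive-sentiment logit to a reward for one control token."""
    if task == "[negative]":
        return -logit                # reward negative sentiment
    if task == "[neutral]":
        return -2 * abs(logit) + 4   # highest reward (4) at logit 0
    if task == "[positive]":
        return logit                 # reward positive sentiment
    raise ValueError(f"unknown task: {task!r}")


tasks = ["[negative]", "[neutral]", "[positive]"]
print([scale_logit(4.0, t) for t in tasks])   # [-4.0, -4.0, 4.0]
print([scale_logit(-4.0, t) for t in tasks])  # [4.0, -4.0, -4.0]
```

For the neutral task, the reward peaks when the classifier is undecided and falls off linearly as the sentiment becomes strongly positive or strongly negative.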

### Starting the training loop

Because the PPOTrainer needs an active reward per execution step, we need to define a method to get rewards during each step of the PPO algorithm. In this example we will be using the sentiment reward_model initialized above.

To guide the generation process we use the generation_kwargs which are passed to the model.generate method for the SFT-model during each step.

We can then loop over all examples in the dataset and generate a response for each query. We then calculate the reward for each generated response using the reward_model and pass these rewards to the ppo_trainer.step method. The ppo_trainer.step method will then optimize the SFT model using the PPO algorithm.

In [24]:
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": gpt2_tokenizer.eos_token_id,
    "max_new_tokens": txt_out_len,
    "eos_token_id": gpt2_tokenizer.eos_token_id,
}

In [None]:
for epoch in range(2):
    for batch in tqdm(ppo_trainer.dataloader):
        logs, game_data = dict(), dict()

        #### prepend a random control token
        task_list = choices(ctrl_str, k=config.batch_size)
        game_data["query"] = [t + q for t, q in zip(task_list, batch["query"])]
        query_tensors = [torch.cat((ctrl_tokens[t], input_ids)) for t, input_ids in zip(task_list, batch["input_ids"])]

        #### get response from gpt2
        response_tensors = []
        for query in query_tensors:
            response = ppo_trainer.generate(query, **generation_kwargs)
            response_tensors.append(response.squeeze()[-txt_out_len:])
        game_data["response"] = [gpt2_tokenizer.decode(r.squeeze()) for r in response_tensors]

        #### sentiment analysis
        texts = [q + r for q, r in zip(batch["query"], game_data["response"])]
        logits = extract_pipe_output(sentiment_pipe(texts, **sentiment_pipe_kwargs))
        rewards = pos_logit_to_reward(logits, task_list)

        #### Run PPO training
        t = time.time()
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

        for cs in ctrl_str:
            key = "env/reward_" + cs.strip("[]")
            stats[key] = np.mean([r.cpu().numpy() for r, t in zip(rewards, task_list) if t == cs])
        ppo_trainer.log_stats(stats, game_data, rewards)

  0%|          | 0/160 [00:00<?, ?it/s]`eos_token_id` should consist of positive integers, but is tensor([-1], device='cuda:0'). Your generation will not stop until the maximum length is reached. Depending on other flags, it may even crash.

### Done

We have now completed PPO training. The fine-tuned model can be saved and used for inference. In this example, it has been optimized against the sentiment reward model, so its continuations should follow the requested control token (`[negative]`, `[neutral]`, or `[positive]`) more closely than those of the original model. With a reward model trained on human preferences, the same procedure aligns a model's outputs with those preferences. The next step is to save the model and integrate it into our application.