# Chapter 5 Tutorial: Making a Language Model More Helpful With RLHF

This tutorial demonstrates how reinforcement learning from human feedback (RLHF) can be used to fine-tune a generative language model. We use a set of prompts that reflect various ways a human might interact with a chatbot, and a separate reward model that rates the quality of the generated answers. The reward model's outputs are then used to update the weights of the language model (LM) through the proximal policy optimization (PPO) algorithm.

Requirements:

1. A Hugging Face account with an API access token.

## Installation and imports

In [8]:
#%pip install -q accelerate -U

In [10]:
%load_ext autoreload
%autoreload 2

#%pip install -q transformers trl evaluate
#%pip install ipywidgets

In [12]:
## Log in to the Hugging Face Hub.

import os
from huggingface_hub import notebook_login, login, HfApi

# Option 1: Login via notebook widget (recommended for Jupyter notebooks)
#notebook_login()

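# Option 2: Login via token (recommended for scripts).
# Read the token from the environment rather than hard-coding it in the notebook.
login(token=os.environ["HF_TOKEN"])

# Option 3: Login via environment variable
# Set HF_TOKEN before starting Jupyter; huggingface_hub picks it up automatically.
# os.environ["HF_TOKEN"] = "your_token_here"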

# Option 4: Login via cached token file
# with open('path/to/token.txt') as f:
#     login(token=f.read().strip())

# Option 5: Pass the token to an API client
# api = HfApi(token="your_token_here")

### Import dependencies

In [13]:
import torch
from tqdm import tqdm
import pandas as pd
import matplotlib.pyplot as plt
from io import BytesIO
from IPython.display import Image, display_png, clear_output
import numpy as np

tqdm.pandas()

from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

from evaluate import load

In [14]:
## Configure the PPO model

config = PPOConfig(
    model_name="aisquared/dlite-v1-355m",
    mini_batch_size=16,
    batch_size=16
)



## Load and format data

The dataset used in this tutorial was made available by Anthropic. It consists of manually written prompts representing human inputs, with two possible answers coming from a virtual assistant. The two answers have been reviewed by humans, and in each case one was chosen as being more helpful than the other.
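
To make the prompt extraction in the next cell concrete, here is a minimal sketch of the same slicing logic applied to a single transcript. The transcript text and the helper name `first_human_turn` are made up for illustration; only the `\n\nHuman: ` / `Assistant:` layout is taken from the dataset.

```python
# Hypothetical transcript laid out like the "chosen" field of hh-rlhf.
# The wording is invented; only the turn markers match the dataset.
sample = (
    "\n\nHuman: What is a good way\nto learn chess?"
    "\n\nAssistant: Start with the basic rules."
    "\n\nHuman: Thanks!"
    "\n\nAssistant: You're welcome."
)

HUMAN_PREFIX = "\n\nHuman: "


def first_human_turn(transcript: str) -> str:
    """Return the first human message, flattened onto a single line."""
    start = len(HUMAN_PREFIX)            # skip the leading "\n\nHuman: "
    end = transcript.find("Assistant:")  # the first assistant turn starts here
    return transcript[start:end].replace("\n", " ").strip()


instruction = first_human_turn(sample)
print(instruction)
```

Only the first human turn is kept, so follow-up turns in a multi-turn conversation are discarded. The later length filter then keeps instructions shorter than 100 characters.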

In [15]:
ds = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")
print(ds)

## Create prompts by chopping off the Assistant response in the dataset
new_col = [x[len("\n\nHuman: "):x.find("Assistant:")]
           for x in ds["chosen"]]
new_col = [x.replace("\n", " ").strip() for x in new_col]
ds = ds.add_column("instruction", new_col)
ds = ds.filter(lambda x: len(x["instruction"]) < 100)
ds = ds.select(range(6400))

## Print two examples
examples = ds.select((0, 1))
print(examples["instruction"])

The DLite model uses special tokens, so we will need to format our prompts accordingly.

In [16]:
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"
INTRO_BLURB = (
    "Below is an instruction that describes a task. Write a response that appropriately completes the request."
)

## This is the prompt that is used for generating responses using an already trained model.  It ends with the response
## key, where the job of the model is to provide the completion that follows it (i.e. the response itself).
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro='',
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)

In [17]:
## Print templatized versions of the two examples
ds = ds.add_column('query', [PROMPT_FOR_GENERATION_FORMAT.format(instruction=i) for i in ds['instruction']])
ds.select((0, 1))['query']

['\n### Instruction:\nHi, I want to learn to play horseshoes. Can you teach me?\n### Response:\n',
 '\n### Instruction:\nHow do I teach kids to meditate?\n### Response:\n']

### Tokenize the data

In [18]:
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(sample["query"])
    sample["query"] = sample['instruction']
    return sample

dataset = ds.map(tokenize, batched=False)
dataset.set_format(type="torch")
dataset.shape

(6400, 5)

In [19]:
dataset[0]

{'chosen': '\n\nHuman: Hi, I want to learn to play horseshoes. Can you teach me?\n\nAssistant: I can, but maybe I should begin by telling you that a typical game consists of 2 players and 6 or 8 horseshoes.\n\nHuman: Okay. What else is needed to play, and what are the rules?\n\nAssistant: A horseshoe is usually made out of metal and is about 3 to 3.5 inches long and around 1 inch thick. The horseshoe should also have a 2 inch by 3 inch flat at the bottom where the rubber meets the metal. We also need two stakes and six horseshoes.',
 'rejected': '\n\nHuman: Hi, I want to learn to play horseshoes. Can you teach me?\n\nAssistant: I can, but maybe I should begin by telling you that a typical game consists of 2 players and 6 or 8 horseshoes.\n\nHuman: Okay. What else is needed to play, and what are the rules?\n\nAssistant: Horseshoes are either metal or plastic discs. The horseshoes come in different weights, and the lighter ones are easier to throw, so they are often the standard for begi

## Load models

The RLHF process begins with an existing pre-trained model. Here we use DLite, since it is relatively small and can be fine-tuned with limited GPU resources. A larger model would generate significantly better responses, but the reinforcement learning mechanics operate in much the same way. We also download a reward model, reward-model-deberta-v3-large-v2.

### Pre-trained DLite model

In [20]:
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

### Reward model

In [21]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
reward_tokenizer =  AutoTokenizer.from_pretrained(reward_name)

## Test with a question, and good and bad answers. The better answer should
## produce a larger reward.

question = "How do I teach kids to meditate?"
helpful = "I'm glad you want to teach your kids about meditation!"
bad = "I am not able to answer this question."

inputs = reward_tokenizer(question, helpful, return_tensors='pt')
good_score = reward_model(**inputs).logits[0].cpu().detach()

inputs = reward_tokenizer(question, bad, return_tensors='pt')
bad_score = reward_model(**inputs).logits[0].cpu().detach()
print(good_score, bad_score)


The example above shows that the reward model is giving a higher score to a more helpful response, so our hope is that we can train a more helpful model by encouraging it to generate responses with higher rewards.

## Training the model

Now it's time to do reinforcement learning. We'll run PPO using the reward model as the basis for the reward function. The PPO trainer is initialized with two copies of the generative LLM. One remains frozen for use as a reference, while the other is the initial policy that is iteratively trained with PPO. The policy that results from training is a new version of the generative model, with weights optimized for better human ratings. In each iteration of the training loop, the following steps occur:
1. A batch of prompts is passed through both the policy LLM and the frozen copy.
2. The responses from the policy are fed into the reward model to score their quality.
3. Gradient-based optimization updates the policy weights with the objective of maximizing the reward.

Note that the reward model score is typically not an adequate reward function on its own. The construction of the reward function is an active area of research, but it almost always includes a penalty based on the KL divergence between the updated policy and the original reference LLM. The goal is to ensure that the policy doesn't overfit to the reward model and forget too much of what it had previously learned.
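
The KL term mentioned above can be sketched in a few lines. The following is one common formulation, not necessarily the exact computation the trainer performs. Every generated token receives a penalty proportional to the gap between the policy's and the reference model's log-probability, and the reward model score is added on the final token. The function name, the `beta` coefficient, and the numbers are assumptions chosen for illustration.

```python
from typing import List


def kl_penalized_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    score: float,
    beta: float = 0.2,  # hypothetical KL coefficient
) -> List[float]:
    """Per-token rewards: a KL penalty on every token, plus the score on the last."""
    # Per-token KL estimate: log pi(token) - log pi_ref(token)
    kl = [lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards = [-beta * k for k in kl]
    # The reward model scores the whole response, so credit it at the final token
    rewards[-1] += score
    return rewards


# Made-up log-probabilities for a three-token response
rewards = kl_penalized_rewards(
    policy_logprobs=[-1.0, -0.5, -2.0],
    ref_logprobs=[-1.5, -0.5, -1.0],
    score=1.0,
    beta=0.1,
)
print(rewards)
```

In this toy case the first token is more likely under the policy than under the reference, so it is penalized. The second token is equally likely under both and contributes nothing. A larger `beta` keeps the policy closer to the reference model, at the cost of slower reward improvement.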

In [22]:
gen_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "eos_token_id": tokenizer.encode(END_KEY)
}

In [23]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset,
                         data_collator=collator)

In [24]:
use_cuda = torch.cuda.is_available()

device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if use_cuda else "cpu"

if use_cuda:
    reward_model = reward_model.cuda()

In [25]:
%matplotlib notebook

## Initialize notebook
plt.ion()
fig, ax = plt.subplots()
xs = []
moving_avg = []

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    #### Get response from dlite
    response_tensors = []
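    for query in query_tensors:
        gen_len = 16
        response = ppo_trainer.generate(query, max_new_tokens=gen_len, **gen_kwargs)
        ## Keep only the newly generated tokens. Slicing by the query length stays
        ## correct even if generation stops before gen_len tokens are produced.
        response_tensors.append(response.squeeze()[len(query):])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]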

    ## Calculate the rewards from the reward model
    inputs = reward_tokenizer(batch["query"], batch["response"], return_tensors="pt", padding=True)
    inputs = inputs.to(device)
    rewards = reward_model(**inputs).logits.cpu().detach()
    rewards = [r[0] for r in rewards]

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    xs.append(stats["ppo/mean_scores"])

    if epoch < 9:
        continue

    moving_avg.append(np.mean(xs[-10:]))

    ## Print out performance by step
    ax.clear()
    ax.plot(range(10, epoch+2), moving_avg)

    ax.set_xlabel("Number of batches trained")
    ax.set_ylabel("Mean reward over last 10 batches")
    io = BytesIO()
    fig.savefig(io, format="png")

    clear_output(wait=True)
    display_png(Image(io.getvalue()))

    print(batch["query"][0], batch["response"][0])


In [None]:
## Push model to the Hugging Face hub for easier evaluation

## Set the repository to push the model to, starting with your HF account name,
## followed by a slash, and then by whatever name you choose.
MODEL_NAME = 'mbotto2810/model-teating-rlhf'

model.push_to_hub(MODEL_NAME)

tokenizer.push_to_hub(MODEL_NAME)

## Model inspection

In [None]:
## Get a batch from the dataset
bs = 16
test_data = dict()
dataset.set_format("pandas")
df_batch = dataset[:].sample(bs)
test_data["query"] = df_batch["query"].tolist()
query_tensors = df_batch["input_ids"].tolist()

response_tensors_ref, response_tensors = [], []

## Get response from both original and new dlite
for i in range(bs):
    gen_len = 16
    query = torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device)
    ## Slice by the query length so only newly generated tokens are kept,
    ## even if generation stops before gen_len tokens are produced.
    output = ref_model.generate(
        query, max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[query.shape[1]:]
    response_tensors_ref.append(output)
    output = model.generate(
        query, max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[query.shape[1]:]
    response_tensors.append(output)

## Decode responses
test_data["response (before)"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
test_data["response (after)"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]

inputs = reward_tokenizer(test_data['query'], test_data['response (before)'], return_tensors='pt', padding=True)
inputs = inputs.to(device)
test_data["rewards (before)"] = [r[0] for r in reward_model(**inputs).logits.cpu().detach()]

inputs = reward_tokenizer(test_data['query'], test_data['response (after)'], return_tensors='pt', padding=True)
inputs = inputs.to(device)
test_data["rewards (after)"] = [r[0] for r in reward_model(**inputs).logits.cpu().detach()]


In [None]:
## Store results in a dataframe
df_results = pd.DataFrame(test_data)
df_results

In [None]:
print("mean:")
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print("median:")
display(df_results[["rewards (before)", "rewards (after)"]].median())

In [None]:
for i, row in df_results.sort_values('rewards (after)', ascending=False)[:5].iterrows():
  print('query:', row['query'])
  print('response (before):', row['response (before)'])
  print('response (after):', row['response (after)'])
  print()

### Calculate perplexity
Note that you may need to restart your instance at this point to free memory for the calculation, and again between the two calculations below.
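
For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model finds the text less surprising. The sketch below shows that arithmetic on made-up token probabilities. The helper name and the inputs are hypothetical; the cells that follow use the `evaluate` metric to compute it.

```python
import math
from typing import List


def perplexity_from_logprobs(token_logprobs: List[float]) -> float:
    """Exponential of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical case: the model assigns probability 0.25 to each of four tokens
logprobs = [math.log(0.25)] * 4
ppl = perplexity_from_logprobs(logprobs)
print(ppl)
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, which can be read as being as uncertain as a uniform choice among four options.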

In [None]:
#MODEL_NAME = 'your-hf-username/your-model-name'
MODEL_NAME = 'mbotto2810/model-teating-rlhf'


In [None]:
test_ds = load_dataset('Anthropic/hh-rlhf', data_dir="helpful-base", split="test")
test_ds

In [None]:
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=test_ds['chosen'],
                             model_id=config.model_name)

In [None]:
## Perplexity measurement for the original dlite model
results['mean_perplexity']

In [None]:
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=test_ds['chosen'],
                             model_id=MODEL_NAME)

In [None]:
## Perplexity measurement for the RLHF-tuned dlite model
results['mean_perplexity']