# DATASCI267 - Week 4 - Lesson Notebook: Instruction Tuning (RLHF/PPO)


In this notebook we will use Instruction Tuning to train our old friend GPT-2 to produce more positive responses. The notebook is motivated by (and the RLHF training part is, with some modifications, essentially taken directly from) **an example from Hugging Face's TRL library**, published [here](https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb) and released under an Apache 2 license.


Here is the logic of what we want to do:

1. We will use a fine-tuned version of GPT-2 that has been further pre-trained on the text of the IMDB dataset ('[lvwerra/gpt2-imdb](https://huggingface.co/lvwerra/gpt2-imdb)').

2. We use an existing sentiment classification model as a **reward model**. The reference notebook uses the fine-tuned BERT-like model '[lvwerra/distilbert-imdb](https://huggingface.co/lvwerra/distilbert-imdb)'. This saves us the effort of training a separate reward model. (Note that we do not train on actual instructions but on completions, which is quite a bit simpler: the reward model can be trained directly on example/label pairs rather than on preference data (instruction - chosen answer - rejected answer).)

3. We will then use the IMDB dataset as a training dataset. (We will only look at a few batches, as RLHF is slow and computationally expensive.) We will prepare the dataset as follows:  
   a) For each example we take a short segment at the very beginning as the 'prompt'.  
   b) We will then use the language model to generate **two** completions.  
   c) We will then use the reward model to: i) assign a score to each completion, and ii) determine which one of the two completions is preferred.  

4. Then we will train our models (keeping the reference models fixed). We will do so for        
   a) RLHF/PPO  
   

5. We will then compare completions and their sentiment before and after having done a few training batches.
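
The data-preparation logic in step 3 can be sketched in a few lines of plain Python. This is only a toy illustration: `toy_generate` and `toy_reward` are hypothetical stand-ins for the language model and the sentiment classifier, and the word lists are made up. None of this is part of TRL or of the actual code used later in the notebook.

```python
import random

POSITIVE_WORDS = {"great", "wonderful", "fun"}
NEGATIVE_WORDS = {"boring", "awful", "dull"}


def toy_generate(prompt, rng, n_words=3):
    """Hypothetical stand-in for the language model: append random words."""
    vocab = sorted(POSITIVE_WORDS | NEGATIVE_WORDS)
    return " " + " ".join(rng.choice(vocab) for _ in range(n_words))


def toy_reward(text):
    """Hypothetical stand-in for the reward model: positive minus negative word count."""
    words = text.split()
    return sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)


def make_preference_example(prompt, rng):
    """Steps 3b and 3c: two completions, a score for each, and the preferred one."""
    completion_1 = toy_generate(prompt, rng)
    completion_2 = toy_generate(prompt, rng)
    score_1 = toy_reward(prompt + completion_1)
    score_2 = toy_reward(prompt + completion_2)
    if score_1 > score_2:
        chosen, rejected = completion_1, completion_2
    else:
        chosen, rejected = completion_2, completion_1
    return {"prompt": prompt, "scores": (score_1, score_2), "chosen": chosen, "rejected": rejected}


example = make_preference_example("This movie was", random.Random(0))
print(example)
```

The real notebook follows the same shape, with GPT-2 producing the completions and the DistilBERT classifier producing the scores.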

This notebook is designed for a quick illustration. To get more reliable results you should run the training longer (see the original notebook). We would also encourage you to look at some of the other details.

This notebook can be run on a T4 GPU or higher. (An L4 (on Colab Pro) is maybe 20% faster for this notebook than a T4.)

A Hugging Face blog post that compares DPO, IPO (a DPO variant) and paired KTO tests can be found here: https://huggingface.co/blog/pref-tuning . (For this term's notebook version, we excluded the original DPO and pair-wise KTO computation - which used the DPO Trainer with a KTO loss as described in the blog post - as they are unstable.)

**Notes:**

1. Some text in this notebook will be generated dynamically. As we use a simple language model, the ethical quality and general appropriateness cannot be guaranteed.   
2. It appears that the Hugging Face PPO implementation became more efficient. The results for that method appear to be quite good for this toy example. (And it is just a toy example.)  


## 0. Setup

We run a few installations and imports. We will also define how to create the dataset.

In [None]:
%%capture
#!pip install transformers  # not required in Colab as already installed

In [None]:
%%capture

%load_ext autoreload
%autoreload 2

!pip install trl bitsandbytes
!pip install -U trl==0.8.6

In [None]:
import torch
import numpy as np
from tqdm import tqdm
import pandas as pd
import warnings
#warnings.filterwarnings('ignore') # this is added relative to original notebook. We don't want to see a lot of warnings.

tqdm.pandas()

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from datasets import Dataset

from trl import PPOTrainer, PPOConfig,  AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

In [None]:
def build_dataset(config, dataset_name="imdb", input_min_text_length=6, input_max_text_length=14):
    """
    This function builds the initial dataset from `load_dataset`, specifically
    it creates the initial chunk from each review.

    """
    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample


    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # load imdb with datasets
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)


    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds
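
To make the prompt construction concrete, here is a small sketch of the truncation step that runs without downloading a model. `ToyLengthSampler` is a hypothetical stand-in that mimics what we understand `trl.core.LengthSampler` to do (draw an integer uniformly from `[min_value, max_value)`). That behavior, the `make_query` helper, and the whitespace "tokenizer" are assumptions for illustration only.

```python
import random


class ToyLengthSampler:
    """Hypothetical stand-in for trl.core.LengthSampler (assumed: uniform over [min, max))."""

    def __init__(self, min_value, max_value, seed=0):
        self.values = list(range(min_value, max_value))
        self.rng = random.Random(seed)

    def __call__(self):
        return self.rng.choice(self.values)


def make_query(review, input_size):
    """Keep only the first few 'tokens' of a review as the prompt."""
    tokens = review.split()  # whitespace split instead of the GPT-2 tokenizer
    prompt_tokens = tokens[: input_size()]
    return " ".join(prompt_tokens)


input_size = ToyLengthSampler(6, 14)
review = "If only to avoid making this type of film in the future. This film is interesting as an experiment."
query = make_query(review, input_size)
print(query)
```

Because the length is sampled anew for every review, the prompts in a batch end up with different lengths, which matters for how batches are collated later.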

We use the function below for our KL penalty calculations:

In [None]:
def log_pi(model, generated_output_tokens_x_y, len_x):
  """
  model: the model to be used for your pi calculation
  generated_output_tokens_x_y: the returned output tokens including the initial prompt tokens. Shape: (len(x) + len(y), )
  len_x: the number of input (prompt) tokens
  We calculate the summed log-likelihood of the y tokens (the per-token
  log probabilities are added up, not averaged).
  """

  len_x_y = len(generated_output_tokens_x_y)

  logits = model(generated_output_tokens_x_y)[0]
  probs = torch.nn.Softmax(dim=-1)(logits).detach().cpu().numpy()

  loglikelihood_score = 0
  for pos in range(len_x - 1, len_x_y - 1):     # only the y tokens matter
    log_prob = np.log(probs[pos, generated_output_tokens_x_y[pos + 1]])
    loglikelihood_score += log_prob
  return loglikelihood_score
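
As a sanity check on what `log_pi` computes, the same bookkeeping can be written without a model. In the sketch below, `toy_logits` is a made-up table of next-token logits (one row per position) and `sequence_log_prob` is a hypothetical helper name. Only the indexing mirrors the function above: the logits at position `pos` score the token at position `pos + 1`, and only the completion tokens are counted.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def sequence_log_prob(logits, tokens_x_y, len_x):
    """Sum of log p(token[pos + 1] | tokens[: pos + 1]) over the completion tokens only."""
    total = 0.0
    for pos in range(len_x - 1, len(tokens_x_y) - 1):
        total += log_softmax(logits[pos])[tokens_x_y[pos + 1]]
    return total


vocab_size = 4
tokens_x_y = [0, 1, 2, 3, 1]  # 2 prompt tokens followed by 3 completion tokens
len_x = 2
# Made-up logits: uniform over the vocabulary at every position
toy_logits = [[0.0] * vocab_size for _ in tokens_x_y]

score = sequence_log_prob(toy_logits, tokens_x_y, len_x)
print(score, 3 * math.log(1 / vocab_size))
```

With uniform logits, each of the three completion tokens contributes log(1/4), so the two printed numbers agree.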

## 1. Initial Model Creation, PPO (=RL Training) Config & Base Dataset Preparation

Hugging Face makes it easy to set up RLHF training using PPO:

In [None]:
config = PPOConfig(
    model_name="lvwerra/gpt2-imdb",
    learning_rate=1.41e-5,
    #log_with="wandb",
)

sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}

In [None]:
config

PPOConfig(exp_name='colab_kernel_launcher', seed=0, log_with=None, task_name=None, model_name='lvwerra/gpt2-imdb', query_dataset='imdb', reward_model='sentiment-analysis:lvwerra/distilbert-imdb', remove_unused_columns=True, tracker_kwargs={}, accelerator_kwargs={}, project_kwargs={}, tracker_project_name='trl', push_to_hub_if_best_kwargs={}, steps=20000, learning_rate=1.41e-05, adap_kl_ctrl=True, init_kl_coef=0.2, kl_penalty='kl', target=6, horizon=10000, gamma=1, lam=0.95, cliprange=0.2, cliprange_value=0.2, vf_coef=0.1, batch_size=128, forward_batch_size=None, mini_batch_size=128, gradient_accumulation_steps=1, world_size=None, ppo_epochs=4, max_grad_norm=None, optimize_cuda_cache=None, optimize_device_cache=False, early_stopping=False, target_kl=1, compare_steps=1, ratio_threshold=10.0, use_score_scaling=False, use_score_norm=False, score_clip=None, whiten_rewards=False, is_encoder_decoder=None, is_peft_model=None, backward_batch_size=128, global_backward_batch_size=None, global_bat

Now let's create and look at the dataset a bit:

In [None]:
%%capture

dataset = build_dataset(config)


def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors
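
The `collator` above is easy to overlook, so here is a tiny sketch of what it does on two made-up samples. It turns a list of per-example dictionaries into one dictionary of lists, which seems a natural fit here because the prompts have different lengths and cannot simply be stacked into one tensor.

```python
def collator(data):
    # Same definition as in the cell above: turn a list of dicts into a dict of lists
    return dict((key, [d[key] for d in data]) for key in data[0])


# Two made-up samples with the same keys
samples = [
    {"query": "If only to avoid", "label": 0},
    {"query": "This movie was", "label": 1},
]
batch = collator(samples)
print(batch)
```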


In [None]:
len(dataset)

24895

In [None]:
dataset[2]

{'review': "If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />",
 'label': tensor(0),
 'input_ids': tensor([1532,  691,  284, 3368, 1642,  428]),
 'query': 'If only to avoid making this'}

So the dataset has the full review, the label, and **also** the start of the review (`query`), which will later be given to the model as the prompt for constructing completions.

Here are the models to tune and the reference models (including the ones we'll use later):

In [None]:
config.model_name

'lvwerra/gpt2-imdb'

In [None]:
%%capture

# create the model to tune and the reference models

# Models to fine-tune: RLHF
rlhf_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)



# Reference model
ref_rlhf_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)



tokenizer = AutoTokenizer.from_pretrained(config.model_name)

tokenizer.pad_token = tokenizer.eos_token

Note that the RLHF framework has a separate trainer class (PPOTrainer) from the DPO/KTO/IPO framework (DPOTrainer). Therefore the reference models are structured a bit differently. But at this point, they are all 'lvwerra/gpt2-imdb' models.

Now we can set up the Hugging Face Trainer [PPOTrainer](https://huggingface.co/docs/trl/main/en/ppo_trainer). (We **do not** want to do this from scratch in straight PyTorch!). It takes the configuration, the model, the original reference model, the dataset and a few additional parameters as input:

In [None]:
ppo_trainer = PPOTrainer(config, rlhf_model, ref_rlhf_model, tokenizer, dataset=dataset, data_collator=collator)

Hugging Face's [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) is also useful here, as it makes it very easy to generate model outputs. In this case, we set up a pipeline that takes text examples as input and returns the sentiment scores for each example, using the model 'lvwerra/distilbert-imdb':

In [None]:
%%capture

device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)

Device set to use cuda:0


How do we use this?

In [None]:
text = "this movie was really poor!!"
sentiment_pipe(text, **sent_kwargs)



[[{'label': 'NEGATIVE', 'score': 2.368640184402466},
  {'label': 'POSITIVE', 'score': -2.758239984512329}]]
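
The training loop later turns this output into one scalar reward per example: the POSITIVE logit minus the NEGATIVE logit. The notebook does this by position (`output[1]["score"] - output[0]["score"]`). The sketch below does the same by label name, which does not depend on the order of the entries. `sentiment_reward` is a hypothetical helper name, and the scores are copied from the output above.

```python
def sentiment_reward(pipe_output):
    """POSITIVE logit minus NEGATIVE logit for one example's list of label/score dicts."""
    scores = {entry["label"]: entry["score"] for entry in pipe_output}
    return scores["POSITIVE"] - scores["NEGATIVE"]


# Output copied from the cell above for "this movie was really poor!!"
pipe_outputs = [[{"label": "NEGATIVE", "score": 2.368640184402466},
                 {"label": "POSITIVE", "score": -2.758239984512329}]]

rewards = [sentiment_reward(output) for output in pipe_outputs]
print(rewards)
```

A strongly negative reward, as here, means the reward model considers the text clearly negative.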

Now we can create the actual dataset batches that we want to use (note that this is a bit messy due to preparing for multiple training approaches). This will take about 6 min on a T4 GPU.

In [None]:
output_min_length = 10
output_max_length = 20
output_length_sampler = LengthSampler(output_min_length, output_max_length)
num_batches = 6  # each batch has 128 examples (see the batch size printed below)


generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    #"temperature": 0.5
}


training_batches = []

for batch_nr, batch in tqdm(enumerate(ppo_trainer.dataloader)):

    batch_data = {}

    # Let's just use <num_batches> batches while we are in class. You should comment this out when you want to get much better results.
    if batch_nr == num_batches:
      break

    query_tensors = batch["input_ids"]

    #### Get response from gpt2
    response_tensors_1, response_tensors_2 = [], []
    response_texts_1, response_texts_2 = [], []

    for query in query_tensors:

        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        # Create two possible answers. We can use the PPO trainer for generation, as it
        # contains the base model, which at this point has not been tuned yet.
        response_1 = ppo_trainer.generate(query, **generation_kwargs)
        response_2 = ppo_trainer.generate(query, **generation_kwargs)

        response_tensors_1.append(response_1.squeeze()[-gen_len:])
        response_tensors_2.append(response_2.squeeze()[-gen_len:])

    # Decode once per batch (decoding inside the loop above would redo the work for every query)
    batch["response_1"] = [tokenizer.decode(r.squeeze()) for r in response_tensors_1]
    batch["response_2"] = [tokenizer.decode(r.squeeze()) for r in response_tensors_2]


    # Now that the responses have been added to the batch, let's look at what a batch looks like:
    if batch_nr == 0:
      print('This is how a batch looks like')
      print('\tBatch size: ', len(batch['label']))
      print('\tA few labels: ', batch['label'][:2])
      print('\tA few starters: ', batch['query'][:2])
      print('\tThe first corresponding completions: ', batch['response_1'][:2])
      print('\tThe second corresponding completions: ', batch['response_2'][:2])
      print('\tThe input ids for the queries: ', batch['input_ids'][:2])


    #### Compute sentiment score
    texts_1 = [q + r for q, r in zip(batch["query"], batch["response_1"])]
    texts_2 = [q + r for q, r in zip(batch["query"], batch["response_2"])]
    pipe_outputs_1 = sentiment_pipe(texts_1, **sent_kwargs)
    pipe_outputs_2 = sentiment_pipe(texts_2, **sent_kwargs)
    rewards_1 = [torch.tensor(output[1]["score"] - output[0]["score"]) for output in pipe_outputs_1]
    rewards_2 = [torch.tensor(output[1]["score"] - output[0]["score"]) for output in pipe_outputs_2]

    preferred_completions = []
    rejected_completions = []

    for batch_example_nr, (score_1, score_2) in enumerate(zip(rewards_1, rewards_2)):
      if score_1 > score_2:
        preferred_completions.append(batch["response_1"][batch_example_nr])
        rejected_completions.append(batch["response_2"][batch_example_nr])
      else:
        preferred_completions.append(batch["response_2"][batch_example_nr])
        rejected_completions.append(batch["response_1"][batch_example_nr])

    batch_data['rewards_1'] = rewards_1
    batch_data['rewards_2'] = rewards_2
    batch_data['query_tensors'] = query_tensors
    batch_data['response_tensors_1'] = response_tensors_1
    batch_data['response_tensors_2'] = response_tensors_2
    batch_data['query'] = batch['query']
    batch_data['preferred_completions'] = preferred_completions
    batch_data['rejected_completions'] = rejected_completions
    batch_data['batch'] = batch

    training_batches.append(batch_data)

0it [00:00, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


This is how a batch looks like
	Batch size:  128
	A few labels:  [tensor(0, device='cuda:0'), tensor(0, device='cuda:0')]
	A few starters:  ['Pay no attention to the comments behind the curtain! The majority', 'This movie was terrible. The first half hour is much like a']
	The first corresponding completions:  [' of this film is conventional actors. Even so, you can see how reliable', ' Bruce Willis episode Bill Pullman invented. Basically,']
	The second corresponding completions:  [' of television shows together are delivered from garbage sources. Half of them are inept', ' bad romantic comedy with watered-down music. Half']
	The input ids for the queries:  [tensor([19197,   645,  3241,   284,   262,  3651,  2157,   262, 29461,     0,
          383,  3741], device='cuda:0'), tensor([1212, 3807,  373, 7818,   13,  383,  717, 2063, 1711,  318,  881,  588,
         257], device='cuda:0')]


4it [02:31, 37.76s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
6it [03:46, 37.72s/it]


In [None]:
len(training_batches)

6

In [None]:
len(training_batches[0]['query'])

128

So by limiting ourselves to 6 batches of 128 examples each, we have 768 paired training examples.

## 2. OK, Let's Train!



### 2.1 RLHF with PPO Training

This part is fully based on the original TRL notebook.

#### 2.1.a Training

Let's run a training loop using our PPO trainer:

In [None]:
for epoch in range(3):
  for batch_nr, batch in tqdm(enumerate(training_batches)):
    query_tensors = batch['query_tensors']
    response_tensors_1 = batch['response_tensors_1']
    response_tensors_2 = batch['response_tensors_2']
    rewards_1 = batch['rewards_1']
    rewards_2 = batch['rewards_2']

    #### Run PPO step
    stats_1 = ppo_trainer.step(query_tensors, response_tensors_1, rewards_1)
    ppo_trainer.log_stats(stats_1, batch, rewards_1)
    stats_2 = ppo_trainer.step(query_tensors, response_tensors_2, rewards_2)
    ppo_trainer.log_stats(stats_2, batch, rewards_2)

6it [01:05, 10.95s/it]
6it [01:05, 10.91s/it]
6it [01:06, 11.01s/it]


#### 2.1.b Testing

Now we compare the completions of the original model with the one that was fine-tuned using RL. To do so, we first create a base set that we will also use for the other models:

In [None]:
# Let's set potentially different generation parameters for our test generations
gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

#### get a batch from the dataset
bs = 100

game_data = dict()
dataset.set_format("pandas")

# Below, we sample from our dataset that we also trained on. Certainly not what we would do in reality,
# but ok for illustration. We only use the short segments at the beginning as prompts anyway.

df_batch = dataset[:].sample(bs)

game_data["query"] = df_batch["query"].tolist()
query_tensors = df_batch["input_ids"].tolist()

Now we generate answers for the prompts (query) with the reference model and the one we tuned with RLHF:

In [None]:
response_tensors_ref, response_tensors, KL = [], [], []

#### get response from gpt2 and gpt2_ref
for i in range(bs):
    gen_len = output_length_sampler()
    query = torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device)
    len_x = query.shape[1]  # number of prompt tokens

    output = ref_rlhf_model.generate(query, max_new_tokens=gen_len, **gen_kwargs).squeeze()[-gen_len:]
    response_tensors_ref.append(output)

    # Keep the full sequence (prompt + completion) of the tuned model:
    # log_pi expects the prompt tokens to be included.
    full_output = rlhf_model.generate(query, max_new_tokens=gen_len, **gen_kwargs).squeeze()
    response_tensors.append(full_output[-gen_len:])

    try:
      log_pi_ref = log_pi(ref_rlhf_model, full_output, len_x)
      log_pi_tuned = log_pi(rlhf_model, full_output, len_x)
      KL.append(log_pi_tuned - log_pi_ref)
    except Exception as e:
      print(f'Iteration {i} gave an error in log pi calculations ({e}). Incorrect ratio here.')
      KL.append(0)


#### decode responses
game_data["response (before)"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
game_data["response (after)"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]

#### sentiment analysis of query/response pairs before/after
## We changed the reward from (output[1]["score"])
##    to (output[1]["score"] - output[0]["score"]) relative to TRL article.
texts = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
game_data["reward (before)"] = [output[1]["score"] - output[0]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

texts = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
game_data["reward (after)"] = [output[1]["score"] - output[0]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]


game_data["reward boost"] = [a - b for (a, b) in zip(game_data["reward (after)"],  game_data["reward (before)"])]

# measuring the KL penalty using beta = 0.2
game_data["KL Penalty (0.2 x KL)"] = [0.2 * np.round(x, 3) for x in KL]


# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results

Unnamed: 0,query,response (before),response (after),reward (before),reward (after),reward boost,KL Penalty (0.2 x KL)
0,I really enjoyed this movie. It took a pretty ...,singing couple to give us this boring little f...,line movie like Happiness Island to get my The...,1.664254,4.935911,3.271657,-0.0280
1,I loathed this film. The,Affairs lived up to my expectations. It's a w...,actors seemed to be perfectly formed and well...,4.839497,1.294552,-3.544945,-0.1072
2,With nothing better to do I decided to,"give Fairchild's OP a try, everything's going...",make this film ............porn movie ....it ...,-2.681391,-4.286988,-1.605597,1.2242
3,"""Stories of the Century"" was a",joy for me all the way through and it gets,"sleeper '80s gem which, as always,",5.001833,4.884953,-0.116880,-0.0898
4,It's common practice for a film about repressi...,. That doesn't seem to be the case here,compared to today's blockbuster flicks like G...,0.287928,1.852924,1.564997,0.0000
...,...,...,...,...,...,...,...
95,Somerset Maugham's characters are,a handful.<br /><br />Obviously this won't be...,top quality actors - Mergers was a great soun...,-3.122725,4.904378,8.027103,2.8094
96,"No, I haven't read the",movie and will watch it again. I am attending...,V collection of lithographs yet - almost all ...,0.613023,0.622417,0.009394,-1.3556
97,The movie looked like a walk-through,with a bunch of bad-looking bad bunch of bure...,s movie.<br /><br />Ivy Williamson is great as...,-5.081180,2.437100,7.518280,-0.6064
98,My left foot is an epic outstanding film expla...,the power of sameness. Subtle but watchable b...,the loss of my incarnation family in one chap...,5.532346,5.141888,-0.390458,2.3860


OK... Some are better, some are worse. But did we improve in aggregate?

In [None]:
print("mean:")
display(df_results[["reward (before)", "reward (after)"]].mean())
print()
print("median:")
display(df_results[["reward (before)", "reward (after)"]].median())

mean:


Unnamed: 0,0
reward (before),0.458877
reward (after),2.099053



median:


Unnamed: 0,0
reward (before),1.007019
reward (after),2.69577


What about average reward boost vs penalties? Let us get the mean reward and the mean KL penalty (in this case 0.2 x KL Divergence) as well:

In [None]:
print("mean Reward boost:")
display(df_results[["reward boost"]].mean())
print()
print("mean KL Penalty (0.2 x KL):")
display(df_results[["KL Penalty (0.2 x KL)"]].mean())


mean Reward boost:


Unnamed: 0,0
reward boost,1.640176



mean KL Penalty (0.2 x KL):


Unnamed: 0,0
KL Penalty (0.2 x KL),0.134226


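We did! (Are some individual KL terms in the table negative? That is expected: each entry is a single-sample log-ratio, and only its expectation, the KL divergence itself, is guaranteed to be non-negative.)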
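
The `KL` column above is built from one sampled completion per prompt, so individual entries are noisy estimates. The small sketch below, using two made-up next-token distributions (`p_tuned` and `p_ref` are illustrative numbers, not taken from our models), shows how a single log-ratio can be negative even though the KL divergence itself is positive.

```python
import math

# Two made-up next-token distributions over a three-token vocabulary
p_tuned = [0.5, 0.3, 0.2]
p_ref = [0.4, 0.4, 0.2]

# Log-ratio for each possible sampled token; this is what a single sample contributes
log_ratios = [math.log(p / q) for p, q in zip(p_tuned, p_ref)]

# Exact KL(p_tuned || p_ref): the expectation of the log-ratio under p_tuned
kl = sum(p * r for p, r in zip(p_tuned, log_ratios))

print("per-token log-ratios:", [round(r, 4) for r in log_ratios])
print("KL divergence:", round(kl, 4))
```

If the second token happens to be sampled, the single-sample estimate is negative. Averaging over many samples drawn from the tuned model recovers the non-negative KL divergence.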



#### 2.1.c Some Tests & Loose Ends

Let us now play with some examples ourselves:

In [None]:
my_starter = "When I was sitting in"

inputs = tokenizer(my_starter, return_tensors="pt").to(device)

# `labels` is not needed for generation, so we only pass the tokenized prompt
original_output = ref_rlhf_model.generate(**inputs, max_new_tokens=gen_len, **gen_kwargs)
new_output = rlhf_model.generate(**inputs, max_new_tokens=gen_len, **gen_kwargs)

original_completion = tokenizer.decode(original_output[0])
new_completion = tokenizer.decode(new_output[0])

print('')
print("Completion from original model: ", original_completion)
print("Completion from tuned model: ", new_completion)

print('')
sentiment_vals = sentiment_pipe([original_completion,
                                 new_completion],
                                **sent_kwargs)

original_completion_sentiments, new_completion_sentiments = sentiment_vals

print("Logit values for original completion: ", str(original_completion_sentiments))
print("Logit values for new completion: ", str(new_completion_sentiments))

sentiment_vals




Completion from original model:  When I was sitting in the living room a few hours ago, and I watched
Completion from tuned model:  When I was sitting in the theatre looking shocked by the love song by Buddy Holly

Logit values for original completion:  [{'label': 'NEGATIVE', 'score': -0.7268210053443909}, {'label': 'POSITIVE', 'score': 0.8587529063224792}]
Logit values for new completion:  [{'label': 'NEGATIVE', 'score': -1.3171314001083374}, {'label': 'POSITIVE', 'score': 1.5196523666381836}]




[[{'label': 'NEGATIVE', 'score': -0.7268210053443909},
  {'label': 'POSITIVE', 'score': 0.8587529063224792}],
 [{'label': 'NEGATIVE', 'score': -1.3171314001083374},
  {'label': 'POSITIVE', 'score': 1.5196523666381836}]]

What does this imply for the likelihood that either example is positive (as viewed by our reward model)?

In [None]:
???

Object `?` not found.
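
One possible way to approach this question, assuming the two scores are the raw logits of a two-class classifier (we passed `"function_to_apply": "none"` to the pipeline), is to apply a softmax over the NEGATIVE and POSITIVE logits. `positive_probability` is a hypothetical helper name, and the logits are copied from the output above.

```python
import math


def positive_probability(negative_logit, positive_logit):
    """Softmax over two logits, returning the probability of the POSITIVE class."""
    m = max(negative_logit, positive_logit)  # subtract the max for numerical stability
    exp_neg = math.exp(negative_logit - m)
    exp_pos = math.exp(positive_logit - m)
    return exp_pos / (exp_neg + exp_pos)


# Logits copied from the output above
p_original = positive_probability(-0.7268210053443909, 0.8587529063224792)
p_tuned = positive_probability(-1.3171314001083374, 1.5196523666381836)

print(f"P(positive), original completion: {p_original:.3f}")
print(f"P(positive), tuned completion:    {p_tuned:.3f}")
```

For two classes this is the same as applying a sigmoid to the difference of the logits, which is exactly the quantity we used as the reward.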


Cool! Now we have seen some of the components of the InstructGPT paper directly!
