<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLMs Alignment with Reinforcement Learning from human feedback (RLHF).

_based on the [original notebook](https://github.com/antndlcrx/oxford-llms-workshop/blob/main/materials/seminars/day_3/8_LLMs%20alignment%20with%20RLHF.ipynb) by Ilya Boytsov for the Oxford LLMs workshop_



In this session, you're gonna fine-tune a language model with reinforcement learning to make it generate good (or bad) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

In [1]:
# %pip install -q trl==0.7.4 transformers==4.33.1 datasets==2.15.0 peft==0.5.0

In [1]:
# !pip install  gcsfs==2024.5.0 fsspec==2024.5.0 trl==0.8.6 transformers==4.44.0 datasets==2.20.0 peft==0.12.0

Defaulting to user installation because normal site-packages is not writeable
Collecting datasets==2.20.0
  Using cached datasets-2.20.0-py3-none-any.whl (547 kB)
[0m[31mERROR: Error while checking for conflicts. Please file an issue on pip's issue tracker: https://github.com/pypa/pip/issues/new
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3021, in _dep_map
    return self.__dep_map
  File "/usr/lib/python3/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 2815, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 3012, in _parsed_pkg_info
    return self._pkg_info
  File "/usr/lib/python3/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 2815, in __getattr__
    raise

### Tutorial: align the model to generate positive movie reviews

To see how TRL works, we'll use it to align GPT2 on IMDB dataset to generate positive (or negative) movie reviews. In fact, __it's your choice whether you want positive or negative reviews.__

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [23]:
import torch
import transformers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)



In [24]:
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated text: The movie has some fantastic scenes, like when a car breaks down at an intersection, you start to see this and think to yourself "Wow, it's happening." Then the car slows down in this scene so you realize the guy in the movie was supposed to


If you run this cell a couple of times, you'll see that the model generates both positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate a synthetic pairwise rankings to emulate human rankings.

__Q:__ why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can - but that only works for movie reviews. But this tutorial will teach you how to do RLHF for any kind objective.


__If you actually want to maximize sentiment (or other "label") instead of human preferences, train reward model as a classifier! (see week5)__


In [25]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [26]:
# To train a reward model, you need a dataset (or generator) of positive-negative pairs.
# Each training sample should be a dict with 4 keys:
#  - input_ids_chosen, attention_mask_chosen = tokenizer("A sentence that human labeler likes more")
#  - input_ids_rejected, attention_mask_rejected = tokenizer("A sentence that human labeler likes less")

import torch
import datasets

class IMDBPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and texts in TRT reward training format """
    def __init__(self, imdb, tokenizer, accepted_label: int):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [row['text'] for row in imdb if row['label'] == accepted_label]
        self.rejected_texts = [row['text'] for row in imdb if row['label'] != accepted_label]
        assert self.chosen_texts, f"no texts with label {accepted_label}"
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

    def __getitem__(self, index: int):
        chosen = self.tokenizer(self.chosen_texts[index // len(self.chosen_texts)], truncation=True)
        rejected = self.tokenizer(self.rejected_texts[index % len(self.chosen_texts)], truncation=True)
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])

In [128]:
TARGET_LABEL = 1   # and make sure it works by reviewing the sample printed below
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

Found 12500 chosen and 12500 rejected texts, 156250000 pairs
CHOSEN: [CLS] Lars Von Trier is never backward in trying out new techniques. Some of them are very original while others are best forgotten. < br / > < br / > He depicts postwar Germany as a nightmarish train journey. With so many cities lying in ruins, Leo Kessler a young American of German descent feels obliged to help in their restoration. It is not a simple task as he quickly finds out. < br / > < br / > His uncle finds him a job as a night conductor on the Zentropa Railway Line. His job is to attend to the needs of the passengers. When the shoes are polished a chalk mark is made on the soles. A terrible argument ensues when a passenger's shoes are not chalked despite the fact they have been polished. There are many allusions to the German fanaticism of adherence to such stupid details. < br / > < br / > The railway journey is like an allegory representing man's procession through life with all its trials and tribulations

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer` that you used in the past. `RewardTrainer` accepts the same format of training arguments (e.g. batch size, gradient checkpointing) as before, except that it trains the model for the pairwise reward objective from [the InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf):

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score chosen sample higher than the rejected one. Note that the formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.

In [136]:
import trl

training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True,                 # disable this on CPU or on very old GPUs
    report_to="none"
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss
50,0.0963
100,0.0885
150,0.0983
200,0.0893
250,0.0938
300,0.0756
350,0.0544
400,0.061
450,0.074
500,0.0619




TrainOutput(global_step=1000, training_loss=0.06643827652931214, metrics={'train_runtime': 279.5615, 'train_samples_per_second': 114.465, 'train_steps_per_second': 3.577, 'total_flos': 0.0, 'train_loss': 0.06643827652931214, 'epoch': 0.00020479997902848215})

In [137]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Sanity-check the reward model (1 point)

Let's check how our reward model performs.

__Your task__ is to measure how often does your reward model can rank a pair of (chosen and rejected) reviews correctly. Please measure this separately for train data (`imdb`) and a separate test set loaded below.

In [131]:
for sample_index in 45, 16000:
  print('TEXT:', imdb[sample_index]['text'])
  inputs = reward_tokenizer(
      imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', imdb[sample_index]['label'])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: -4.2421875
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes on.<br /

In [53]:
imdb_test = datasets.load_dataset("imdb", split='test')

# <a whole lot of your code here, feel free to spit it as you see fit>

In [141]:
# label 1 - reward положительны, например 4
# label 0 - reward отрицательный, например -4

In [138]:
from sklearn.metrics import roc_auc_score
from tqdm import tqdm
import numpy as np


def evaluate_roc_auc(reward_model, dataset, tokenizer, device, N=None):
    rewards = []
    labels = []
    if N:
        dataset = dataset.shuffle(seed=42)
    for n, sample in enumerate(tqdm(dataset)):
        inputs = tokenizer(sample['text'], truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            reward = reward_model(**inputs).logits[0, 0].item()
        rewards.append(max(0, reward))
        labels.append(sample['label'])
        if N and n >= N:
            break
    rewards = np.array(rewards)
    labels = np.array(labels)
    # ROC AUC score
    roc_auc = roc_auc_score(labels, rewards)
    return roc_auc

In [139]:
# train
roc_auc_train = evaluate_roc_auc(reward_model, imdb, reward_tokenizer, device, N=None)
print(f"ROC AUC on training data: {roc_auc_train}")

100%|██████████| 25000/25000 [01:48<00:00, 231.41it/s]

ROC AUC on training data: 0.9723826399999999





In [140]:
# test
roc_auc_test = evaluate_roc_auc(reward_model, imdb_test, reward_tokenizer, device, N=None)
print(f"ROC AUC on test data: {roc_auc_test}")

100%|██████████| 25000/25000 [01:47<00:00, 232.30it/s]

ROC AUC on test data: 0.9489861792





### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).

For this problem, it's on you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them based on reward and show which samples get the highest reward.

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [142]:
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
  print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was also the first time I'd encountered anything remotely like this. The animation was horrible. Very bland and very annoying, with really poor CG. Even the main characters wore clothes for nothing. The animation was awful. I'd rather make a movie with people
Sample: It was really funny but had some very thin characters and some very hard lines that were only intended for the main character. I am not a big fan of gay movies but this one had good lines, but that's it. Don't get me wrong, I
Sample: It was the first time that a new musical had been created in my lifetime.. and the only time I know I was ever so surprised when I heard it. The songs were outstanding, and I must agree. One thing I can't say is that i won
Sample: It was a great film, and the movie is a wonderful way to begin a movie. The action sequences are intense, but most of the story seems to move at a snail's pace as it spirals along, and at times the actors manage to jump so
Sample: It was so good. I love th

In [145]:
# <YOUR CODE HERE> - feel free to organize it as you see fit
torch.manual_seed(42)


def reward_guided_generation(prompt, num_samples=16, max_new_tokens=50):
    inputs = main_tokenizer([prompt] * num_samples, return_tensors='pt').to(device)
    generated_ids = main_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=50,  # diversity
        top_p=0.95,  # diversity
        temperature=0.7  # creativity
    )
    samples = [main_tokenizer.decode(ids, skip_special_tokens=True) for ids in generated_ids]
    rewards = []
    for sample in samples:
        sample_inputs = reward_tokenizer(sample, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            reward = reward_model(**sample_inputs).logits[0, 0].item()
        rewards.append(reward)
    best_sample_index = np.argmax(rewards)
    best_sample = samples[best_sample_index]
    best_reward = rewards[best_sample_index]
    print(f"Prompt: {prompt}")
    print(f"Best Sample: {best_sample}")
    print(f"Best Reward: {best_reward:.4f}")
    print("-" * 50)
    return best_sample, best_reward

In [147]:
prompts = [
    "This movie is",
    "Actors were",
    "I think the move was",
    "The character was",
    "The performance was"
]

for prompt in prompts:
    reward_guided_generation(prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: This movie is
Best Sample: This movie is absolutely worth seeing. The acting is excellent, the cinematography is incredible, and the direction is great. The cast is all talented and talented, and it is truly the best I have ever seen.<br /><br />The movie is full of
Best Reward: 8.3281
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Actors were
Best Sample: Actors were also very good, the story is simple and entertaining.<br /><br />A lot of what makes this movie interesting is the fact that the characters are all very likable and believable.<br /><br />The acting is very good and the
Best Reward: 8.2109
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: I think the move was
Best Sample: I think the move was just as effective. The script was so well written, and the acting was so good, that I didn't mind it. The film is so well-written that it is easy to watch. <br /><br />The film is a great
Best Reward: 8.0859
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: The character was
Best Sample: The character was well written and acted and the acting was good. The movie is a very good example of how to use the "good" in a movie. You can see the bad guys as they are a lot more interesting than the good ones. I enjoyed the
Best Reward: 7.8789
--------------------------------------------------
Prompt: The performance was
Best Sample: The performance was outstanding, and the story was well told. A great movie!
Best Reward: 8.1172
--------------------------------------------------


# Stage 2: fine-tune the main model with RL


For this tutorial, we will optimize GPT2 to produce positive IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL allows model to generate it's own sentences on each training step. Then, it calculates the reward of those specific sentences, and finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF consists of three stages: __Rollout__, __Evaluation__ and __Update__

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [148]:
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")

Filter:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/24895 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


Next, let's prepare your reward model to predict rewards on whatever reviews were generated. Note that we use plaintext reviews because main model uses a different tokenizer from the reward model.

In [149]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

In [150]:
compute_reward([imdb[45]['text'], imdb[16000]['text']])  # test on human-written reviews

tensor([-7.3125,  8.1328], device='cuda:0')

Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [151]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()



trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9391


  state_dict = loading_func(filename if not use_safe else safe_filename, **load_kwargs)


Same as before, trl has a special type of trainer that minimize PPO-specific pseudo-loss. You can read more on this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).

In [152]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
    mini_batch_size=32
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [153]:
from tqdm.auto import tqdm
max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


------------------------------ STEP 0 ------------------------------
rewards/mean:	0.776145935	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.136003047	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	0.557811737	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.075111836	<---- model-estimated average discounted reward
objective/kl:	-0.022935351	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	1.544094086	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	0.280808270	<---- model-estimated average discounted reward
objective/kl:	-0.026567433	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3 ---

## Main assignment - <u>actually</u> train the model (8 points)


Your main task for this week is to use the RLHF pipeline to train a model for a reward of your choice. Here's what you can choose from:

__A. Toxicity fine-tuning:__ train the model to be less (or more!) toxic. For this task, you may use the data from [jigsaw toxic comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and [lmsys/toxic-chat](https://huggingface.co/datasets/lmsys/toxic-chat),  or any other source. Alternatively, you may use toxicity scores from [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).


__B. Actual human feedback:__ use one of the existing datasets with pairwise human feedback to align your langauge model. You may use [anthropic's hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf), [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst1) or any other data you see fit. You may also turn the tables and train the model to [minimize](https://habrastorage.org/getpro/geektimes/post_images/ac7/2ad/827/ac72ad82767d4132164a4b6b76196c42.jpg) human preferences, as long as your model does not degrade to gibberish.

__C. Controlled generation:__ Instead of training a reward model from human feedback, you may define the reward function as the text length (longer or shorter) or number of times the model uses specific words (e.g. "sorry", "apologize"). If you choose specific words, make sure the model generates them at least sometimes.

__Alternatively,__ you may choose a different task. However, unless your task is very similar to one of the above, there is a chance that it will be **significantly** harder to solve, requiring orders of magnitude more compute and tuning. If you are in doubt, please ask the course staff. If they are AFK (again >.<), please prefer one of the recommended tasks.


#### General tips & tricks


Things to look out for:
- during PPO stage, the reward model should be in eval mode (dropout disabled)
- make sure max_length and max_new_tokens are enough for your chosen dataset - at least most of the time
- when in doubt, view the data manually or inspect how the model performs on a few samples


We highly recommend that you manually check the performance after each sub-stage:
1. when you assembled the pairwise dataset, inspect a couple of from of *your* dataset class and detokenize them. Make sure that you-the-human understand why one sample was accepted and the other - rejected. At least most of the time. This also lets you spot tokenization/truncation errors.
2. after you trained a reward model, measure how accurate this model is in isolation. If your reward model is poor, any subsequent RLHF will also fail.
3. once you've trained the main model with RL, ask it to generate examples and explore how well it does. If it produces an obviously bad output, check if the reward model assigns high reward to that output. If yes, reward model is the culprit; if no, it's a question of better/longer PPO training.

__It is also a good idea to periodically print samples during training.__

__When stuck, simplify the problem.__ If you've spent a several hours enchanting the reward model but it still won't budge, try switching to a simple subtask. For instance, if you're training on hh-rlhf, try limiting it the dataset to 10% of the shortest sequences - they are typically easier to learn.


## Assignment stages (and grading)

Regardless of the specific task you chose, your solution needs to contain several parts that will be graded separately.


#### Stage 1: reward model (4 points)

Construct a dataset for training the reward model on your problem. Then, train a reward model on that dataset and evaluate how well can your model predict preferences on a hold-out (test) subset of your data.

Please make sure that the part of your notebook where you evaluate reward model is clearly visible and reasonably easy to read. And for all that is holy, do not call it IMDB unless it actually **is** data of imdb movie reviews :)

__Not all tasks require a reward model for later PPO fine-tuning.__ For instance, there's no reason to train a reward model if your reward equals sentence length. Likewise, toxicity reward can be estimated with a pre-trained toxicity classifier. __If your task does not require training a reward model, please train an unrelated model on [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) as though you were solving assignment version B.__ This is for grading purposes only, you won't use this model for stage 2.


#### Stage 2: RL fine-tuning (4 points)

Once the reward model is ready - or you can compute rewards without a model - it is time to maximize that reward with PPO. Optionally, you may replace PPO with another RL algorithm (or unlikelihood learning scheme), but only if you're feeling adventurous.


First, you need to choose a language model to be fine-tuned. You may choose any model, but make sure that your model **can** generate the data in your format. For instance, [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) is a general purpose LM and may (or may not) need prompt engineering to generate chat assistant responses. For that reason, it is best if you **do not use `"lvwerra/gpt2-imdb"` unless you're generating only movie reviews**.



There are two "difficulty modes" for this task:
For the **easy mode**, use [gpt2-large](https://huggingface.co/gpt2-large) or [opt-1.3b](https://huggingface.co/facebook/opt-1.3b) with minimal code changes.
If you want the **Hard mode:** use a larger (e.g. 7B) model in combination with `load_in_4bit` and LoRA, the same way we did last week.
Some reasonable model choices are [LLaMA-7B](https://huggingface.co/Enoch/llama-7b-hf), [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) for general-purpose LM or [guanaco-7b](https://huggingface.co/timdettmers/guanaco-7b), [vicuna-7b](https://huggingface.co/lmsys/vicuna-7b-v1.5) for chat-based tasks, though there are many more (see [leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). In the hard mode, you will need to modify the training arguments to enable 4-bit fine-tuning. Furthermore, your experiments will take somewhat longer to complete. On the plus side, your model will produce significantly better results.

__High reward is not enough!__ RL algorithms are famous for [cheating their reward functions](https://openai.com/research/faulty-reward-functions). To ensure that your model is actually doing what you want it to do, you will need some additional evaluation. To get the full grade, provide at least 20 side-by-side examples of your fine-tuned model vs original model predictions and a short summary.

Alternatively, you may provide 5 examples and some extrinsic evaluation metric over many examples. For instance, you may use a different pre-trained toxicity score for option A. When dealing with human preferences, you may choose to [enlist actual humans](https://toloka.ai/) or [ask GPT4/Claude](https://arxiv.org/pdf/2304.03277.pdf) to compare your model's predictions. For task C, when optimizing for simple rewards like sentence lengths, it is enough to compare histograms of rewards (e.g. average lengths).












### Подготовка данных

In [50]:
import datasets
from datasets import Dataset
import trl
import peft
import transformers
import torch
import random
from sklearn.metrics import roc_auc_score
from tqdm import tqdm
import numpy as np


from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import DataLoader
from tqdm.auto import tqdm

random.seed(42)
torch.manual_seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
# данные
# https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

# import zipfile
# import os 
# os.chdir(".")

# with zipfile.ZipFile("data/jigsaw-toxic-comment-classification-challenge.zip", 'r') as zip_ref:
#     zip_ref.extractall("data/")

# with zipfile.ZipFile("data/train.csv.zip", 'r') as zip_ref:
#     zip_ref.extractall("data/")

# with zipfile.ZipFile("data/test.csv.zip", 'r') as zip_ref:
#     zip_ref.extractall("data/")

# with zipfile.ZipFile("data/test_labels.csv.zip", 'r') as zip_ref:
#     zip_ref.extractall("data/")

In [34]:
jigsaw = datasets.load_dataset('jigsaw_toxicity_pred', data_dir='data/')
print(jigsaw)
print(jigsaw['train'][-1])

DatasetDict({
    train: Dataset({
        features: ['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'],
        num_rows: 159571
    })
    test: Dataset({
        features: ['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'],
        num_rows: 63978
    })
})
{'comment_text': '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of community goes ""you have bad ideas"" go away, instead of helping rewrite them.   "', 'toxic': 0, 'severe_toxic': 0, 'obscene': 0, 'threat': 0, 'insult': 0, 'identity_hate': 0}


In [35]:
def make_equal_toxic(dataset, min_size=7000):
    toxic_comments = [row for row in dataset if row['toxic'] == 1]
    non_toxic_comments = [row for row in dataset if row['toxic'] == 0]
    print(f"Total toxic comments: {len(toxic_comments)}")
    print(f"Total non-toxic comments: {len(non_toxic_comments)}")
    
    min_size = min(len(toxic_comments), len(non_toxic_comments), min_size)
    
    balanced_toxic_comments = random.sample(toxic_comments, min_size)
    balanced_non_toxic_comments = random.sample(non_toxic_comments, min_size)
    
    print(f"Balanced toxic comments: {len(balanced_toxic_comments)}")
    print(f"Balanced non-toxic comments: {len(balanced_non_toxic_comments)}")
    
    balanced_dataset = balanced_toxic_comments + balanced_non_toxic_comments
    random.shuffle(balanced_dataset)
    
    print(f"Total balanced dataset: {len(balanced_dataset)}")
    return Dataset.from_list(balanced_dataset)
    

In [36]:
jigsaw['train'] = make_equal_toxic(jigsaw['train'])
jigsaw['test'] = make_equal_toxic(jigsaw['test'])

Total toxic comments: 15294
Total non-toxic comments: 144277
Balanced toxic comments: 7000
Balanced non-toxic comments: 7000
Total balanced dataset: 14000
Total toxic comments: 6090
Total non-toxic comments: 57888
Balanced toxic comments: 6090
Balanced non-toxic comments: 6090
Total balanced dataset: 12180


In [37]:
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilroberta-base", device_map=device) # albert-base-v2 # distilbert-base-cased # distilroberta-base
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilroberta-base") 
reward_model.to(device)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (

In [38]:
class JigsawPairwiseDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, tokenizer):
        self.tokenizer = tokenizer
        self.toxic_comments = [row['comment_text'] for row in dataset if row['toxic'] == 1]
        self.non_toxic_comments = [row['comment_text'] for row in dataset if row['toxic'] == 0]
        print(f"Found {len(self.toxic_comments)} toxic and {len(self.non_toxic_comments)} non-toxic comments.")

    def __len__(self):
        return len(self.toxic_comments) * len(self.non_toxic_comments)

    def __getitem__(self, index):
        toxic_comments = self.tokenizer(self.toxic_comments[index // len(self.toxic_comments)], truncation=True)
        non_toxic_comments = self.tokenizer(self.non_toxic_comments[index % len(self.non_toxic_comments)], truncation=True)
        return dict(input_ids_chosen=toxic_comments['input_ids'], attention_mask_chosen=toxic_comments['attention_mask'],
                    input_ids_rejected=non_toxic_comments['input_ids'], attention_mask_rejected=non_toxic_comments['attention_mask'])


jigsaw_pairwise_train = JigsawPairwiseDataset(jigsaw['train'], reward_tokenizer)
jigsaw_pairwise_test = JigsawPairwiseDataset(jigsaw['test'], reward_tokenizer)

Found 7000 toxic and 7000 non-toxic comments.
Found 6090 toxic and 6090 non-toxic comments.


In [39]:
sample = jigsaw_pairwise_train[0]
print('toxic_comments:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('non_toxic_comments:', reward_tokenizer.decode(sample['input_ids_rejected']))

toxic_comments: <s>You Zionist Jewbastard Khazar Turks just love filibusters that draw out this tragedy to no conclusion.  That's right, only YOU are allowed a say on the issue.  YOU have the right to editorialise anything to YOUR content, media mogul jackasses!  Stay out of London, New York, Washington and Hollywood!  Get the fuck out of America and stop dragging us into your stupid affairs with Muslims!  You deserved 9/11 and I hope more of you die from suicide bombings by economically tortured Muslims, just keep it in the Middle East.  Helen Clark did well to not take your shite!  I swear, I'll fucking kill you all if I ever go to Israel.  I'll take nukes signed by each and every Jew of the Manhattan Project and level you to nothing; in a eulogy to Theodore Herzl.  What irony, to die by the products of your own hands, that had me fear for my life in the fucking Cold War.  Mad scientists and loan sharks, fucking trash with no goddamn decency to Europe and America!  Wanderer gypsies, 

In [40]:
for i in jigsaw_pairwise_train:
    print(i)
    break

{'input_ids_chosen': [0, 1185, 34387, 16495, 428, 1988, 1120, 2218, 18692, 27539, 95, 657, 46189, 37406, 14, 2451, 66, 42, 6906, 7, 117, 6427, 4, 1437, 280, 18, 235, 6, 129, 10540, 32, 1220, 10, 224, 15, 5, 696, 4, 1437, 10540, 33, 5, 235, 7, 8161, 1496, 932, 7, 21688, 1383, 6, 433, 18248, 10267, 24473, 328, 1437, 9631, 66, 9, 928, 6, 188, 469, 6, 663, 8, 3049, 328, 1437, 2315, 5, 26536, 66, 9, 730, 8, 912, 19335, 201, 88, 110, 12103, 5185, 19, 6299, 328, 1437, 370, 10973, 361, 73, 1225, 8, 38, 1034, 55, 9, 47, 1597, 31, 4260, 19918, 30, 14738, 20464, 6299, 6, 95, 489, 24, 11, 5, 2367, 953, 4, 1437, 11668, 4433, 222, 157, 7, 45, 185, 110, 1481, 1459, 328, 1437, 38, 24909, 6, 38, 581, 23523, 3549, 47, 70, 114, 38, 655, 213, 7, 1870, 4, 1437, 38, 581, 185, 295, 23369, 1419, 30, 349, 8, 358, 16495, 9, 5, 6562, 3728, 8, 672, 47, 7, 1085, 131, 11, 10, 364, 922, 21370, 7, 26164, 26288, 462, 4, 1437, 653, 21490, 6, 7, 1597, 30, 5, 785, 9, 110, 308, 1420, 6, 14, 56, 162, 2490, 13, 127, 301, 11

### reward_model

In [59]:
training_args = trl.RewardConfig(
    output_dir="reward_model_new",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,  # Training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # Reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True,  # Disable this on CPU or on very old GPUs
    report_to="none",
    # max_length=512
)

In [60]:
trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=jigsaw_pairwise_train,
    # eval_dataset=jigsaw_pairwise_test,  # Optional: for evaluation during training
    peft_config=None  # Optionally, you may tune with LoRA, prompt-tuning, etc.
)

max_steps is given, it will override any value given in num_train_epochs


In [61]:
trainer.train()



Step,Training Loss
50,0.0687
100,0.0585
150,0.0436
200,0.0475
250,0.0524
300,0.0227
350,0.0449
400,0.0363
450,0.0352
500,0.0211




TrainOutput(global_step=1000, training_loss=0.03647275030612945, metrics={'train_runtime': 215.6464, 'train_samples_per_second': 148.391, 'train_steps_per_second': 4.637, 'total_flos': 0.0, 'train_loss': 0.03647275030612945, 'epoch': 0.0006530612244897959})

In [62]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (

In [63]:
sample_indices = [45, 12] 

for sample_index in sample_indices:
    text = jigsaw['train'][sample_index]['comment_text']
    label = jigsaw['train'][sample_index]['toxic']  # Toxicity label (0 = non-toxic, 1 = toxic)
    inputs = reward_tokenizer(text, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        reward = reward_model(**inputs).logits[0, 0].item()
    print('TEXT:', text)
    print("REWARD:", reward)
    print('LABEL:', label)
    print()

TEXT: You could find more details in article WWI and add them here.
REWARD: -7.2890625
LABEL: 0

TEXT: Accept my apology! 

Accept my apology THIS MINUTE you big, fat noob (this is doughnuthead talking,and if you don't think so, tough.)78.144.87.71
REWARD: 6.15625
LABEL: 1



In [64]:
def evaluate_roc_auc(reward_model, dataset, tokenizer, device, N=None):
    rewards = []
    labels = []
    if N:
        dataset = dataset.shuffle(seed=42)
    for n, sample in enumerate(tqdm(dataset)):
        inputs = tokenizer(sample['comment_text'], truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            reward = reward_model(**inputs).logits[0, 0].item()
        rewards.append(max(0, reward))
        labels.append(sample['toxic'])
        if N and n >= N:
            break
    rewards = np.array(rewards)
    labels = np.array(labels)
    # ROC AUC score
    roc_auc = roc_auc_score(labels, rewards)
    return roc_auc

In [65]:
# train
roc_auc_train = evaluate_roc_auc(reward_model, jigsaw['train'], reward_tokenizer, device, N=None)
print(f"ROC AUC on training data: {roc_auc_train}")

  0%|          | 0/14000 [00:00<?, ?it/s]

ROC AUC on training data: 0.9820517857142856


In [66]:
# test
roc_auc_test = evaluate_roc_auc(reward_model, jigsaw['test'], reward_tokenizer, device, N=None)
print(f"ROC AUC on test data: {roc_auc_test}")

  0%|          | 0/12180 [00:00<?, ?it/s]

ROC AUC on test data: 0.956944909553199


### RL fine-tuning

In [74]:
sample_length = trl.core.LengthSampler(2, 8)

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["comment_text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # Query is the only required column
    sample["input_ids"] = query_ids  # To avoid re-tokenizing later
    return sample

jigsaw_train_for_rlhf = jigsaw['train'].map(select_query_and_tokenize, batched=False)
jigsaw_test_for_rlhf = jigsaw['test'].map(select_query_and_tokenize, batched=False)

jigsaw_train_for_rlhf.set_format(type="torch")
jigsaw_test_for_rlhf.set_format(type="torch")

Map:   0%|          | 0/14000 [00:00<?, ? examples/s]

Map:   0%|          | 0/12180 [00:00<?, ? examples/s]

In [81]:
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2") #  lvwerra/gpt2-imdb  # gpt2 # gpt2-large
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("gpt2", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()

  state_dict = loading_func(filename if not use_safe else safe_filename, **load_kwargs)


trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9391


In [83]:
# main_model = transformers.AutoModelForCausalLM.from_pretrained("gpt2-large", device_map=device) 
# main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-large")
# main_tokenizer.pad_token = main_tokenizer.eos_token 

# peft_config = peft.LoraConfig(
#     task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
# )
# main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
# main_model.print_trainable_parameters()

In [84]:
def compute_reward(texts):
    inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
        # Reward is the logit for the "non-toxic" class (class 0)
        rewards = logits[:, 0]  # Higher reward for non-toxic comments
    return rewards

In [85]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
    mini_batch_size=32
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=jigsaw_train_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [97]:
max_steps = 50  # training steps
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id
)

with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
    for epoch, batch in progressbar:
        if epoch >= max_steps:
            break

        # Rollout stage: Generate continuations from batch queries
        response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)

        # Decode responses to strings
        batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]

        # Evaluation stage: Compute rewards
        rewards = compute_reward(batch['response'])

        # Update stage: Perform PPO update
        stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
        stats['rewards/mean'] = rewards.mean().item()

        # Log training statistics
        print("-" * 30, 'STEP', epoch, '-' * 30)
        print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
        print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
        print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
        print()

        ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]

------------------------------ STEP 0 ------------------------------
rewards/mean:	-1.932178497	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-8.876640320	<---- model-estimated average discounted reward
objective/kl:	1.298173904	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	-2.744643211	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-9.073869705	<---- model-estimated average discounted reward
objective/kl:	1.629736185	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	-2.439382553	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-9.036275864	<---- model-estimated average discounted reward
objective/kl:	2.043960094	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3

In [90]:
from transformers import GenerationConfig

dummy_generation_config = GenerationConfig()
main_model.base_model.generation_config = dummy_generation_config

In [91]:
def reward_guided_generation(prompt, num_samples=16, max_new_tokens=50):
    inputs = main_tokenizer([prompt] * num_samples, return_tensors='pt').to(device)
    generated_ids = main_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=50,  # diversity
        top_p=0.95,  # diversity
        temperature=0.7  # creativity
    )
    samples = [main_tokenizer.decode(ids, skip_special_tokens=True) for ids in generated_ids]
    rewards = []
    for sample in samples:
        sample_inputs = reward_tokenizer(sample, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            reward = reward_model(**sample_inputs).logits[0, 0].item()
        rewards.append(reward)
    best_sample_index = np.argmax(rewards)
    best_sample = samples[best_sample_index]
    best_reward = rewards[best_sample_index]
    print(f"Prompt: {prompt}")
    print(f"Best Sample: {best_sample}")
    print(f"Best Reward: {best_reward:.4f}")
    print("-" * 50)
    return best_sample, best_reward

In [96]:
prompts = [
    "This boy was",
    "The idea was",
    "I think the NLP course is",
    "You clever as",
    "You stupid as"
]

for prompt in prompts:
    reward_guided_generation(prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: This boy was
Best Sample: This boy was a "lone wolf," according to his mother.

In early April, her son walked with his mother to the hospital. He was taken to the hospital. He was then taken to the hospital.

The following day, the mother
Best Reward: 2.4297
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: The idea was
Best Sample: The idea was that an electrical current could be used to force the body to its desired point. The idea of this would be that the body would be electrically moved by force of gravity. That was the idea of a wire is electrically moved by force of gravity
Best Reward: -4.4336
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: I think the NLP course is
Best Sample: I think the NLP course is a really the best way to go for this course. It's the most fun to get it together. You're really, really in a very good situation. You're really, really in a very good situation.


The NLP course
Best Reward: -7.0156
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: You clever as
Best Sample: You clever as hell, we can't be wrong. We are trying to help."

The man went on to say he was "doing the right thing" in a statement he gave to the New York Daily News. "I am in no way trying to
Best Reward: 3.1602
--------------------------------------------------
Prompt: You stupid as
Best Sample: You stupid asian fuck. It's like you want to be a real fucking person and you're going to do whatever you want with your life.

But you're going to do whatever you want with your life, so you're gonna be able to say
Best Reward: 7.5234
--------------------------------------------------


### Сравнение

Сравним с промптами модели до и после RLHF

In [100]:
import warnings
warnings.filterwarnings('ignore')

In [101]:
prompts = [
    "This boy was",
    "The idea was",
    "I think the NLP course is",
    "You clever as",
    "You stupid as",
    "I you",
    "What",
    "You should",
    "Plz make",
    "I whant",
    "We must create",
    "How are you"
]

In [106]:
not_trained_main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
not_trained_main_model = transformers.AutoModelForCausalLM.from_pretrained("gpt2", device_map=device)


for prompt in prompts:
    print("\nWITH RLHF:")
    reward_guided_generation(prompt)
    inputs = not_trained_main_tokenizer(prompt, return_tensors='pt').to(device)
    generated_ids = not_trained_main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
    print("\nWITHOUT RLHF:", not_trained_main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: This boy was
Best Sample: This boy was a bad boy. He didn't have a lot of money. He was a bad boy.

"I was a bad boy. I didn't have a lot of money. I was a bad boy. I was a bad boy. I
Best Reward: 4.4727
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: This boy was born to white people."

"I'd like the man to be a boy or a girl, I'm not going to go away from it."

My mother was white. The only difference was she'd be a girl. So that

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: The idea was
Best Sample: The idea was that he and his sister would use the power of their own will to make their own life. They would see a new life and a new world, and they would build their own new lives.

The idea was that they would use all power
Best Reward: -3.7832
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: The idea was to see if you could write code using Scala.

On the second day Scala was the default runtime. It was supposed to be the default, so everyone could write tests. Even if you knew not to, we had to try it out because

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: I think the NLP course is
Best Sample: I think the NLP course is a good way to get your mind off the book. It is easy to learn. You can find the books that make you feel like you have something in your life. It's the same book you read. You can find books that make you feel
Best Reward: -4.8203
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: I think the NLP course is probably the reason."

However, the current system of government has failed in some areas, which have created pressure to re-impose limits on government spending and increase spending while decreasing freedom to innovate and innovate in the 21st century, said the

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: You clever as
Best Sample: You clever as hell."
(*)*)

(Weas.
Best Reward: 6.9531
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: You clever as it may seem, that's not really your fault.

It's not the fault of our children, who are too old to be in their late 20s, or of children who are not yet adults yet, because those kids are also not

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: You stupid as
Best Sample: You stupid as fuck, you're an idiot for having to be an idiot. The only thing you've done to make me a better. And I'm a fucking cunt. If you're ever gonna say anything. I'm gonna try to. I'm gonna try
Best Reward: 7.5273
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: You stupid as shit. Oh yeah. Here's to a game. That's just what this is all about, isn't it? For one. And now we have a show this little girl named Missy who's played all the girl's best songs and played

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: I you
Best Sample: I you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you you
Best Reward: 6.1992
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: I you have the right to be in the country if you desire, so long as I am there, and not in Canada" (A-1135-8, C-1244-A, A A-14). The Constitution, like the Declaration

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: What
Best Sample: What's so good about your life? You're a genius! The life you want to live, the life you want to live, the life you want to live, the life you want to live. You're a genius!

You're a
Best Reward: 2.3301
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: What the hell is up with that? When I saw her in a picture I thought she looked pretty hot. But then her legs are like my pussy."

When asked why her boyfriend would have done this, Yuki responded, "Yuki just

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: You should
Best Sample: You should get your own hand off your own hand and get your own hands on, for your own hands, my hand, my hand on my hands, that my hand, and on my hand on my hand, my hand my hand, my hand on my
Best Reward: 4.0430
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: You should be able to do this," She says, "and then you can talk about some of these projects by the time you get to college."


In addition to the big picture issues, the students of this course do have to deal with the many

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Plz make
Best Sample: Plz make a part of my life. You make me.
Best Reward: -0.6689
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: Plz make it a few clicks. I am willing to believe them!

Now to the last chapter from the end.

Part 2:

"So we'll get the whole gang off the boat. Leave this ship and go back to the

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: I whant
Best Sample: I whant is that he can no longer be a fag, he is a man and a man is a man,"

and he shall make a man a man, and he shall make a man

and his man

and he shall make
Best Reward: 6.8047
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: I whant' I be like, hey?

[20:01:02 AM] Ian Cheong: I dont think I like that.

[20:01:08 AM] Izzy (@iglvzx): Yes, but I dont read the

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: We must create
Best Sample: We must create new jobs and create new jobs.

For the last two years, the we, we, we, we, we, we, we, we, we, we, we, we, we, we, we, we, we,
Best Reward: 0.9155
--------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



WITHOUT RLHF: We must create new products and services in order to continue to attract foreign investment.

1.2. Product Delivery Service Providers (PDS) in the UAE

PDS, which will be located at various locations in the UAE, also operates an

WITH RLHF:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: How are you
Best Sample: How are you going to look at the whole thing? Why is there this shit, and why I don't have a life of love. Oh, and what I know is why I don't have a. This is a, a.a.a.a
Best Reward: 3.9766
--------------------------------------------------

WITHOUT RLHF: How are you doing with your new project?

I'm trying to keep my focus and not fall over myself. As soon as I finish writing the code, I'm going to get up and take a breather and then I'll start going off to training


### Заключение

`You clever as...` --> `You clever as hell, we can't be wrong. We are trying to help` - вот так модель продолжила сообщение. Машина по созданию токсичных комментариев готова.

 В работе было произведено два RLHF алаймента, оценены метрики и сгенирированы примеры работы:
 1) `IMBD` датасет, `distilbert-base-cased` в качестве reward_model и `lvwerra/gpt2-imdb` в качестве основной модели
 2) `jigsaw toxicity` датасет, `distilroberta-base` в качестве reward_model и `gpt2` в качестве основной модели

Из раздела со сравнением, видно как RLHF повлиял на модель (gpt-2) - она теперь создает токсичные предложения. Также, ниже представлены метрики.

|  RLHF алаймент     | ROC AUC |
| ---------------- |  ---------- |
| IMBD            |  0.97 (TRAIN), 0.95 (TEST)   |
| jigsaw toxicity               | 0.98 (TRAIN), 0.96 (TEST)    |


