# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)


In this session, we're gonna fine-tune a language model with reinforcement learning to make it generate good (or bad) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

In [1]:
%pip install -q trl==0.7.4 transformers==4.33.1 datasets==2.14.4 peft==0.5.0


### Tutorial: align the model to generate positive movie reviews

To see how TRL works, we'll use it to align GPT-2 on the IMDB dataset to generate positive (or negative) movie reviews. In fact, __it's your choice whether you want positive or negative reviews.__

But before we choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [105]:
import torch  # imported here because this cell uses torch before the next cell imports it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
import torch
import transformers
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)


In [3]:
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated text: The movie, which is so bad, must be for real. The acting isn't nearly as bad as it needs to be for it to work for me. The bad effects are just plain bad! The character designs are badly made - I don't recall the


If you run this cell a couple of times, you'll see that the model generates positive, negative, and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.

__Q:__ why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can, but that only works for movie reviews. This tutorial will teach you how to do RLHF for any kind of objective.

In [4]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [2]:
# To train a reward model, you need a dataset (or generator) of positive-negative pairs.
# Each training sample should be a dict with 4 keys:
#  - input_ids_chosen, attention_mask_chosen = tokenizer("A sentence that human labeler likes more")
#  - input_ids_rejected, attention_mask_rejected = tokenizer("A sentence that human labeler likes less")

import torch
import datasets

class IMDBPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and rejected texts in TRL reward training format """
    def __init__(self, imdb, tokenizer, accepted_label: int):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [row['text'] for row in imdb if row['label'] == accepted_label]
        self.rejected_texts = [row['text'] for row in imdb if row['label'] != accepted_label]
        assert self.chosen_texts, f"no texts with label {accepted_label}"
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

    def __getitem__(self, index: int):
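        # index = chosen_idx * len(rejected_texts) + rejected_idx, so divide by the number of rejected texts
        chosen = self.tokenizer(self.chosen_texts[index // len(self.rejected_texts)], truncation=True)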
        rejected = self.tokenizer(self.rejected_texts[index % len(self.rejected_texts)], truncation=True)
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])
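
To make the flat-index-to-pair mapping concrete, here is a minimal standalone sketch. The helper name `pair_from_index` is hypothetical and not part of the notebook; it only illustrates how a single index can enumerate every (chosen, rejected) combination exactly once.

```python
def pair_from_index(index: int, num_chosen: int, num_rejected: int) -> tuple:
    """Map a flat index to (chosen_idx, rejected_idx) so that every pair is visited once."""
    assert 0 <= index < num_chosen * num_rejected, "index out of range"
    # quotient selects the chosen text, remainder selects the rejected text
    return divmod(index, num_rejected)


# Enumerate all pairs for a toy dataset with 2 chosen and 3 rejected texts
all_pairs = [pair_from_index(i, num_chosen=2, num_rejected=3) for i in range(2 * 3)]
print(all_pairs)
```

Dividing and taking the remainder by the same count (the number of rejected texts) is what guarantees that no pair is skipped or repeated when the two lists have different lengths.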

In [6]:
TARGET_LABEL = 1   # 1 = prefer positive reviews, 0 = prefer negative; verify with the sample printed below
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))


Found 12500 chosen and 12500 rejected texts, 156250000 pairs
CHOSEN: [CLS] Lars Von Trier is never backward in trying out new techniques. Some of them are very original while others are best forgotten. < br / > < br / > He depicts postwar Germany as a nightmarish train journey. With so many cities lying in ruins, Leo Kessler a young American of German descent feels obliged to help in their restoration. It is not a simple task as he quickly finds out. < br / > < br / > His uncle finds him a job as a night conductor on the Zentropa Railway Line. His job is to attend to the needs of the passengers. When the shoes are polished a chalk mark is made on the soles. A terrible argument ensues when a passenger's shoes are not chalked despite the fact they have been polished. There are many allusions to the German fanaticism of adherence to such stupid details. < br / > < br / > The railway journey is like an allegory representing man's procession through life with all its trials and tribulations

In [7]:
imdb[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer` that you used in the past. `RewardTrainer` accepts the same format of training arguments (e.g. batch size, gradient checkpointing) as before, except that it trains the model for the pairwise reward objective from [the InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf):

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes the chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. Note that the formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
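
For a single pair, the objective reduces (up to a normalization constant) to $-\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$. The plain-Python sketch below evaluates that expression for scalar rewards so you can see how it behaves; the function name `pairwise_reward_loss` is an illustrative assumption, and the trainer computes a batched tensor version rather than this exact code.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(reward_chosen - reward_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) == softplus(-d) == max(-d, 0) + log(1 + exp(-|d|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


print(pairwise_reward_loss(0.0, 0.0))    # equal scores: loss is log(2)
print(pairwise_reward_loss(5.0, -5.0))   # chosen scored much higher: loss near 0
print(pairwise_reward_loss(-5.0, 5.0))   # wrong ordering: loss is large
```

The loss depends only on the difference between the two scores, which is why the absolute reward values produced by different training runs can differ while the ranking stays the same.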

In [8]:
import trl

training_args = trl.RewardConfig(  # like transformers.TrainingArguments
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,              # note: training may need more than 1k steps
    logging_steps=50,
    gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True                     # disable this on CPU or on very old GPUs
    # you may add any other hyperparameters that you found useful in weeks 5-7
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
50,0.5228
100,0.1966
150,0.1337
200,0.1256
250,0.1107
300,0.0946
350,0.103
400,0.0822
450,0.0972
500,0.0816




TrainOutput(global_step=1000, training_loss=0.1092242751121521, metrics={'train_runtime': 1580.4561, 'train_samples_per_second': 20.247, 'train_steps_per_second': 0.633, 'total_flos': 0.0, 'train_loss': 0.1092242751121521, 'epoch': 0.0})

In [9]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Sanity-check the reward model 

Let's check how our reward model performs.

Measure how often the reward model ranks a pair of (chosen and rejected) reviews correctly. We measure this separately for the train data (`imdb`) and for a separate test set loaded below.
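
As a rough illustration of the metric, the sketch below estimates pairwise ranking accuracy on randomly sampled pairs. Everything in it is a stand-in: `pairwise_accuracy`, `toy_score` and the tiny text lists are hypothetical, and `toy_score` replaces the trained reward model so that the block runs without a GPU.

```python
import random


def pairwise_accuracy(score, chosen_texts, rejected_texts, num_pairs=1000, seed=0):
    """Share of randomly sampled (chosen, rejected) pairs where chosen gets the higher score."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    correct = 0
    for _ in range(num_pairs):
        chosen = rng.choice(chosen_texts)
        rejected = rng.choice(rejected_texts)
        correct += score(chosen) > score(rejected)
    return correct / num_pairs


def toy_score(text: str) -> int:
    """Hypothetical scorer: counts 'good' minus 'bad' occurrences."""
    return text.count("good") - text.count("bad")


toy_chosen = ["a good film", "good plot, good acting"]
toy_rejected = ["a bad film", "bad plot, bad acting"]
print(pairwise_accuracy(toy_score, toy_chosen, toy_rejected, num_pairs=100))
```

Sampling both sides of each pair at random means the estimate covers many different chosen texts, not just many rejected ones.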

In [10]:
for sample_index in 45, 16000:
  print('TEXT:', imdb[sample_index]['text'])
  inputs = reward_tokenizer(
      imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', imdb[sample_index]['label'])
  print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: -4.6640625
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes on.<br /

In [11]:
imdb_test = datasets.load_dataset("imdb", split='test')

# Build the same pairwise dataset from the held-out test split
reward_data_test = IMDBPairwiseDataset(imdb_test, reward_tokenizer, accepted_label=TARGET_LABEL)

Found 12500 chosen and 12500 rejected texts, 156250000 pairs


In [12]:
from tqdm import tqdm

In [116]:
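# 10,000 pairs should be enough for a rough estimate.
# Caveat: each split has 12,500 rejected texts, so indices 0..9999 all pair the same
# chosen text (chosen_texts[0]) with different rejected texts. Sampling random indices
# would give a more representative estimate.

def share_of_correctly_rewarded_pairs(dataset, cnt_all=10000):
  cnt_correct = 0
  for i in tqdm(range(cnt_all)):
    sample = dataset[i]
    # decode back to text, then re-tokenize each review as a single-example batch
    chosen = reward_tokenizer.decode(sample['input_ids_chosen'])
    rejected = reward_tokenizer.decode(sample['input_ids_rejected'])
    inputs_chosen = reward_tokenizer(
      chosen, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
      reward_chosen = reward_model(**inputs_chosen).logits[0, 0].item()
    inputs_rejected = reward_tokenizer(
      rejected, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
      reward_rejected = reward_model(**inputs_rejected).logits[0, 0].item()
    # a pair counts as correct when the chosen review gets the higher reward
    cnt_correct += reward_chosen > reward_rejected
  return cnt_correct / cnt_all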

In [35]:
share_of_correctly_rewarded_pairs(reward_data)

100%|██████████| 10000/10000 [04:27<00:00, 37.35it/s]


0.928

In [36]:
share_of_correctly_rewarded_pairs(reward_data_test)

100%|██████████| 10000/10000 [04:13<00:00, 39.51it/s]


0.9985

Overall, our reward model works well :)

### Reward-guided generation 

Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, we can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to our reward model).
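
The selection step itself is tiny. Here is a minimal sketch of best-of-N sampling with stand-in components: `best_of_n`, `toy_generate` and `toy_reward` are hypothetical names used only for illustration, in place of the language model and the reward model.

```python
import random


def best_of_n(generate, reward, prompt: str, n: int = 16) -> str:
    """Draw n candidates from `generate` and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)


_rng = random.Random(0)
_ENDINGS = [" great.", " terrible.", " fine."]
_SCORES = {"great.": 1.0, "fine.": 0.0, "terrible.": -1.0}


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling from the language model."""
    return prompt + _rng.choice(_ENDINGS)


def toy_reward(text: str) -> float:
    """Hypothetical stand-in for the reward model: scores the last word."""
    return _SCORES[text.split()[-1]]


print(best_of_n(toy_generate, toy_reward, "It was", n=16))
```

The trade-off is cost: each returned sample requires N generations plus N reward evaluations, and the model itself never improves.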

In [37]:
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
  print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was really cool to watch a movie that actually made people laugh - from a great storyline and a great storyline, the story has some flaws but you will enjoy it.<br /><br />For what it is, the characters were decent, the storyline was
Sample: It was a classic in a way. It made you question everything you expected of the main character, but in true art direction, the viewer never truly gets the gist as to why the main character is fighting against all of them.<br /><br />This
Sample: It was never meant to be a blockbuster like it is in the movie business. It was meant to be a comedy with good plot and well-done humor. There is some great acting which I felt was lacking in both films. The story was too long and
Sample: It was one of those movies where you could literally feel the power coming from just one single individual and the characters slowly built their relationship back up to a degree that the audience can easily identify with at will. I really enjoyed many of its bits

In [119]:
def reward_guided_inference(prompt):
  inputs = main_tokenizer([prompt] * 16, return_tensors='pt').to(device)
  candidates = []
  for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
    candidates.append(main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))
  inputs = reward_tokenizer(candidates, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    rewards = reward_model(**inputs).logits[:, 0].cpu().numpy()
  best_index, worst_index = rewards.argmax(), rewards.argmin()
  return candidates[best_index], rewards[best_index], candidates[worst_index], rewards[worst_index]

In [64]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('It was')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: It was an interesting movie, but it is definitely not a movie of this ilk. To sum up it, I didn't do that much acting anyway, but it was just a wonderful movie!!!<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
It's reward: 4.3945312


Sample with min reward: It was also a very bad film. The actors seemed to not understand what made a good film. It was also not worth watching because of the awful soundtrack. The voice over was almost identical to that of the "Gone With the Wind" but that title
It's reward: -4.5429688


In [65]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('The film was')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: The film was shot with 3 standard camcorders, and all the actors had the same camera equipment. As far as the script goes, it was a great plot and a wonderful movie. As the story goes, there are many obstacles and problems to overcome, which
It's reward: 5.4257812


Sample with min reward: The film was a mixed bag. The acting was weak, it took longer than it did for a sequel. Unfortunately the script was so bad it took forever to build up. I don't really think that the character was great, as they're already played by a
It's reward: -3.8242188


In [66]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('Film review:')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: Film review: If you are looking for a fun trip through the hills in South America then you will be disappointed. But I personally like this film very much! This is a great story of a small family in one of Africa's most remote places. But when the
It's reward: 5.4296875


Sample with min reward: Film review: "The Dark Angel"- the first movie that I saw about The Dark Angel.<br /><br />I don't think this movie is necessarily bad, but it was bad enough for me, especially the acting.<br /><br />The music
It's reward: -1.7314453


In [67]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('My opinion for this film is')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: My opinion for this film is that it is very entertaining, entertaining movie with some very good acting on the sides, a wonderful plot and quite an amusing comedy with some wonderful scenes. Also interesting in that it has the chance to do a story about a man (Kaitlin)
It's reward: 5.34375


Sample with min reward: My opinion for this film is that it is at times entertaining but sometimes disappointing. For example, during the movie, I was actually bored as the actors did not seem to be fully involved in making the story of a man who killed his own children and his mother. There was even
It's reward: -3.1152344


In [68]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('I think the film was')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: I think the film was a fun movie to listen to. I have the DVD though and it's very good.<br /><br />In regards to the story, it's very simple and really interesting.<br /><br />It's very well written and made so
It's reward: 5.4179688


Sample with min reward: I think the film was really clever. The concept was cool and was interesting, but the dialogue looked awful and confusing. That would not be a problem with a comedy, that would not be a problem with a horror film. The director didn't deserve credit for that. I
It's reward: -3.7832031


It looks like in every case the top-reward generation is indeed a positive review, and the lowest-reward generation is a negative one. If we trust our reward model, the best review came from the prompt `Film review:`.

## Stage 2: fine-tune the main model with RL


For this tutorial, we will optimize GPT2 to produce positive IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL allows the model to generate its own sentences on each training step. Then, it calculates the reward of those specific sentences, and finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF training step consists of three stages: __Rollout__, __Evaluation__ and __Update__.

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>
</div>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.
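
To give a feel for what the PPO update optimizes, here is a per-sample sketch of the clipped surrogate objective from the PPO paper. The function name and the `clip_range=0.2` value are illustrative assumptions; the actual trainer works on batched per-token tensors and adds further terms (for example a value loss), so treat this as a simplified picture.

```python
import math


def ppo_clipped_objective(logprob_new: float, logprob_old: float,
                          advantage: float, clip_range: float = 0.2) -> float:
    """Clipped surrogate objective for one sample (to be maximized)."""
    # probability ratio between the updated policy and the policy that generated the sample
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    # take the more pessimistic of the unclipped and clipped estimates
    return min(ratio * advantage, clipped_ratio * advantage)


# The policy doubled the probability of a sample with positive advantage:
# the gain is capped at (1 + clip_range) * advantage.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
# Same ratio, negative advantage: no clipping, the full penalty applies.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0))
```

The clipping removes the incentive to move the policy far from the one that produced the rollouts within a single update, which is what makes several optimization passes over the same batch reasonably safe.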

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [69]:
# Note: this code is specific to IMDB
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")


Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


Next, let's prepare the reward model to predict rewards on whatever reviews were generated. Note that we use plaintext reviews because the main model uses a different tokenizer from the reward model.

In [70]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

In [71]:
compute_reward([imdb[45]['text'], imdb[16000]['text']])  # test on human-written reviews

tensor([-4.6602,  5.4766], device='cuda:0')

Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [72]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()



trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9390589771670923
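
The parameter counts printed above can be reproduced by hand. The sketch below does the arithmetic under a few assumptions that are not stated in the notebook: that PEFT's default LoRA target for GPT-2 is the fused attention projection `c_attn` (768 inputs, 3 x 768 outputs) in each of the 12 blocks, that the value head is a single linear layer from 768 features to 1 output, and that the base GPT-2 small model has 124,439,808 parameters.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x in_features), B is (out_features x rank)."""
    return rank * in_features + out_features * rank


HIDDEN = 768                 # GPT-2 small hidden size
NUM_LAYERS = 12              # GPT-2 small transformer blocks
RANK = 32                    # r from the LoraConfig above
BASE_PARAMS = 124_439_808    # assumed parameter count of the base GPT-2 small model

per_layer = lora_param_count(HIDDEN, 3 * HIDDEN, RANK)  # adapter on the fused q/k/v projection
lora_total = NUM_LAYERS * per_layer
value_head = HIDDEN + 1      # assumed: one linear layer (weights + bias) on top of the hidden state
all_params = BASE_PARAMS + lora_total + value_head

print(f"trainable params: {lora_total:,} || all params: {all_params:,}")
```

Under these assumptions the totals come to 1,179,648 trainable and 125,620,225 overall, matching the output of `print_trainable_parameters()`.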


Same as before, TRL has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more on this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).

In [73]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [74]:
from tqdm.auto import tqdm
max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  # note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break

    # Rollout stage: generate continuations from batch queries using main_model
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
    # ^-- list of tensors of token ids from main model tokenizer

    # de-tokenize responses to strings (since reward model uses a different tokenizer)
    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
    # note: response_tensors already contain query tokens, so we don't need to add queries manually.
    # This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


------------------------------ STEP 0 ------------------------------
rewards/mean:	0.261377335	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.350351363	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	-0.012278557	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.372230887	<---- model-estimated average discounted reward
objective/kl:	0.382097453	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	0.095082283	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.204901844	<---- model-estimated average discounted reward
objective/kl:	1.225651503	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3 -

##  <u>Actually</u> train the model 

We use the RLHF pipeline to train a model for a reward of our choice.

__Toxicity fine-tuning:__ train the model to be more toxic. For this task, we may use the data from [jigsaw toxic comments](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and [lmsys/toxic-chat](https://huggingface.co/datasets/lmsys/toxic-chat),  or any other source. Alternatively, we may use toxicity scores from [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).

Let's take the Kaggle dataset and see what it looks like.

In [79]:
!unzip train.csv.zip

Archive:  train.csv.zip
  inflating: train.csv               


In [80]:
import pandas as pd
toxic_comments_train = pd.read_csv("train.csv")
toxic_comments_train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


Count the clean comments, i.e. those with no toxicity label of any kind:

In [92]:
len(toxic_comments_train[(toxic_comments_train['toxic'] == 0) & (toxic_comments_train['severe_toxic'] == 0)
& (toxic_comments_train['obscene'] == 0) & (toxic_comments_train['threat'] == 0)
& (toxic_comments_train['insult'] == 0) & (toxic_comments_train['identity_hate'] == 0)])

143346

Toxic comments:

In [93]:
len(toxic_comments_train[toxic_comments_train['toxic'] == 1])

15294

We'll use the last 15k examples as the validation set. This many of them are toxic:

In [96]:
len(toxic_comments_train.iloc[-15000:][toxic_comments_train.iloc[-15000:]['toxic'] == 1])

1469
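
As a quick sanity check on this split, the toxic share in the last 15k rows can be compared with the share in the whole dataset. The sketch below only redoes the arithmetic with the counts printed above. The total row count is an assumption: it is derived from the 144,571 training rows reported by the later `Map` step plus the 15,000 held-out rows.

```python
# Counts copied from the cell outputs above.
total_toxic = 15294   # rows with toxic == 1 in the full train.csv
val_rows = 15000      # size of the held-out tail
val_toxic = 1469      # toxic rows inside that tail

# Assumption: 144571 training rows (from the later Map progress bar) + 15000 held-out rows.
total_rows = 144571 + val_rows

overall_share = total_toxic / total_rows
val_share = val_toxic / val_rows
train_toxic = total_toxic - val_toxic  # toxic rows left for training

print(f"toxic share overall:    {overall_share:.3f}")
print(f"toxic share validation: {val_share:.3f}")
print(f"toxic rows in train:    {train_toxic}")
```

Both shares come out just under 10%, so the tail does not look badly skewed, and 13,825 toxic rows remain for training.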

Split into train and validation sets:

In [97]:
toxic_comments_test = toxic_comments_train[-15000:]
toxic_comments_train = toxic_comments_train[:-15000]

Convert both splits into `datasets.Dataset` objects:

In [99]:
toxic_comments_test = datasets.Dataset.from_pandas(toxic_comments_test)
toxic_comments_train = datasets.Dataset.from_pandas(toxic_comments_train)

In [100]:
toxic_comments_train[0]

{'id': '0000997932d777bf',
 'comment_text': "Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
 'toxic': 0,
 'severe_toxic': 0,
 'obscene': 0,
 'threat': 0,
 'insult': 0,
 'identity_hate': 0}

Load the reward model: a RoBERTa trained to tell toxic comments from non-toxic ones.

In [13]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification

reward_tokenizer = RobertaTokenizer.from_pretrained('SkolkovoInstitute/roberta_toxicity_classifier')
reward_model = RobertaForSequenceClassification.from_pretrained('SkolkovoInstitute/roberta_toxicity_classifier')

Some weights of the model checkpoint at SkolkovoInstitute/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Now we need to build the dataset. We'll teach the model to prefer toxic comments :) The `how_many` parameter keeps the toxic comments from being heavily outnumbered by the non-toxic ones. For training we can take `how_many = 12500`, which uses almost all of the toxic comments available in the dataset.

In [101]:
import torch
import datasets

class ToxicCommentsPairwiseDataset(torch.utils.data.Dataset):
    """ A dataset of all possible pairs of chosen and rejected texts in TRL reward training format """
    def __init__(self, data, tokenizer, how_many):
        super().__init__()
        self.tokenizer = tokenizer
        self.chosen_texts = [row['comment_text'] for row in data if row['toxic'] == 1][:how_many]
        self.rejected_texts = [row['comment_text'] for row in data if row['toxic'] == 0 and row['severe_toxic'] == 0 and
                               row['obscene'] == 0 and row['threat'] == 0 and row['insult'] == 0 and row['identity_hate'] == 0][:how_many]
        print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

    def __len__(self):
        return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

    def __getitem__(self, index: int):
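        # Row-major pair index: divide and take the remainder by the number of rejected texts,
        # so both indices stay in range even if the two lists differ in length.
        chosen = self.tokenizer(self.chosen_texts[index // len(self.rejected_texts)], truncation=True)
        rejected = self.tokenizer(self.rejected_texts[index % len(self.rejected_texts)], truncation=True)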
        return dict(input_ids_chosen=chosen['input_ids'], attention_mask_chosen=chosen['attention_mask'],
                    input_ids_rejected=rejected['input_ids'], attention_mask_rejected=rejected['attention_mask'])

In [102]:
reward_data = ToxicCommentsPairwiseDataset(toxic_comments_train, reward_tokenizer, 12500)

Found 12500 chosen and 12500 rejected texts, 156250000 pairs
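
The dataset never materializes the 156,250,000 pairs. Instead, `__getitem__` decodes a flat index into one chosen and one rejected text on the fly. A minimal sketch of that decoding follows; `pair_indices` is a hypothetical standalone helper, not part of TRL or of the class above.

```python
from typing import Tuple

def pair_indices(index: int, n_chosen: int, n_rejected: int) -> Tuple[int, int]:
    """Map a flat pair index to (chosen_index, rejected_index), row-major over chosen texts."""
    if not 0 <= index < n_chosen * n_rejected:
        raise IndexError(f"pair index {index} out of range")
    # divmod by the number of rejected texts keeps both results in range
    # even when the two lists have different lengths.
    return divmod(index, n_rejected)

print(pair_indices(30000, 12500, 12500))  # (2, 5000)
print(pair_indices(7, 2, 5))              # (1, 2)
```

So `reward_data[30000]` pairs the third toxic comment with the 5001st clean one.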


In [103]:
sample = reward_data[30000]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

CHOSEN: <s>Bye! 

Don't look, come or think of comming back! Tosser.</s>
REJECTED: <s>Illythr, your pro-Russian edits are evident. Let Moldova, and Moldovans be themselves.</s>


In [104]:
import trl

training_args = trl.RewardConfig(
    output_dir="reward_model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    max_steps=1_000,
    logging_steps=50,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True
)

trainer = trl.RewardTrainer(
    model=reward_model,
    args=training_args,
    tokenizer=reward_tokenizer,
    train_dataset=reward_data,
    peft_config=None,
)

trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
50,0.5567
100,0.0386
150,0.0246
200,0.0272
250,0.0239
300,0.0254
350,0.0196
400,0.0226
450,0.0226
500,0.0265




KeyboardInterrupt: ignored

The loss dropped quickly because the model is strong and was already well trained for this task, but I think that's actually a good thing. As far as I can tell, there were no restrictions on the reward model.
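
For intuition about the loss values in the table: as far as I know, `RewardTrainer` minimizes a pairwise logistic loss, `-log(sigmoid(r_chosen - r_rejected))`, averaged over the batch. The sketch below is a plain-Python version of that formula for a single pair, not TRL's actual code. The example rewards are rounded from the two values printed later in this notebook.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) == -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(pairwise_reward_loss(0.0, 0.0))     # no preference yet: log(2), about 0.693
print(pairwise_reward_loss(6.18, -4.73))  # well-separated pair: close to 0
```

For a single pair, a loss of about 0.02 corresponds to a margin of roughly 3.9 between the chosen and the rejected reward.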

In [106]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [107]:
from tqdm import tqdm

Let's look at the rewards the model produces, and at the share of chosen/rejected pairs in which the chosen comment gets the higher reward.

In [114]:
for sample_index in 2, 16006:
  print('TEXT:', toxic_comments_train[sample_index]['comment_text'])
  inputs = reward_tokenizer(
      toxic_comments_train[sample_index]['comment_text'], truncation=True, return_tensors='pt').to(device)
  with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()
    print("REWARD:", reward)
  print('LABEL:', toxic_comments_train[sample_index]['toxic'])
  print()

TEXT: Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.
REWARD: -4.7265625
LABEL: 0

TEXT: BECAUSE HE IS A FAGOT
REWARD: 6.18359375
LABEL: 1



In [117]:
share_of_correctly_rewarded_pairs(reward_data)

100%|██████████| 10000/10000 [07:10<00:00, 23.25it/s]


0.9993
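
`share_of_correctly_rewarded_pairs` is defined outside this section. Judging by the progress bar, it evaluates 10,000 pairs rather than all of them. Below is a minimal sketch of such an estimate, assuming pairs are sampled at random. `share_correct`, `reward_fn`, and the toy reward are hypothetical names for illustration, not the notebook's actual helper.

```python
import random
from typing import Callable, Sequence

def share_correct(chosen: Sequence[str], rejected: Sequence[str],
                  reward_fn: Callable[[str], float],
                  n_samples: int = 10_000, seed: int = 0) -> float:
    """Estimate the share of (chosen, rejected) pairs where chosen gets the higher reward."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        c = rng.choice(chosen)
        r = rng.choice(rejected)
        correct += reward_fn(c) > reward_fn(r)  # True counts as 1
    return correct / n_samples

# Toy reward: counts exclamation marks (a stand-in for the real classifier).
toy_reward = lambda text: float(text.count("!"))
print(share_correct(["go away!!", "no!"], ["thanks", "see you"], toy_reward, n_samples=100))
```

With the real models, `reward_fn` would wrap the tokenizer and `reward_model` call shown in the previous cell.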

In [118]:
reward_data_test = ToxicCommentsPairwiseDataset(toxic_comments_test, reward_tokenizer, 1250)
share_of_correctly_rewarded_pairs(reward_data_test)

Found 1250 chosen and 1250 rejected texts, 1562500 pairs


100%|██████████| 10000/10000 [07:56<00:00, 20.97it/s]


0.9952

The results are excellent, so let's move on to the main model.

In [121]:
import torch
import transformers
main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-large")
main_model = transformers.AutoModelForCausalLM.from_pretrained("gpt2-large", device_map=device)

In [122]:
inputs = main_tokenizer(["It was"] * 5, return_tensors='pt').to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
  print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: It was all a ruse, there was no plan," he said. "Everyone knew everything went down the same way as the CIA thought — that it was all a 'fake' operation. But to all the other people involved, I knew it was a
Sample: It was also interesting to talk about the first time I became acquainted with her sister in that conversation, something that I hadn't been able to do before. After they had gotten to be close to me as an official family they had gone through a lot of growing
Sample: It was like a dream come true to see my dad playing that game with his son.

"It was unbelievable.

"I just want to tell him I love him and that he is a great football player. I think he is the best
Sample: It was a good experience, thank you for having me."<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|e

Let's try reward-guided inference.

In [123]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('You are a')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: You are a little bit of a jerk, a little bit of a dick, but at least you're my friend. I mean, you're the one who I talk to about things I care about. You're the one who I go to for advice about stuff
It's reward: 6.3242188


Sample with min reward: You are a part of the world. You know the power that is within of you. You're at your best when you are connected, when you do things together to make the world a better place. We'll get through this together. We'll heal together.
It's reward: -5.6875


Wow, I didn't expect to get lucky and have the model generate something toxic (it's not all that toxic here, but still).
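
`reward_guided_inference` is also defined outside this section. From its return values, it appears to sample several continuations, score them with the reward model, and return the highest- and lowest-scoring ones (best-of-n sampling). Here is a sketch under that assumption; `best_and_worst_of_n`, `generate_fn`, `reward_fn`, and `n_candidates` are hypothetical names.

```python
from typing import Callable, List, Tuple

def best_and_worst_of_n(prompt: str,
                        generate_fn: Callable[[str, int], List[str]],
                        reward_fn: Callable[[str], float],
                        n_candidates: int = 8) -> Tuple[str, float, str, float]:
    """Sample n continuations, score each one, and return the two extremes."""
    candidates = generate_fn(prompt, n_candidates)
    scored = [(reward_fn(text), text) for text in candidates]
    max_reward, best = max(scored, key=lambda pair: pair[0])
    min_reward, worst = min(scored, key=lambda pair: pair[0])
    return best, max_reward, worst, min_reward

# Toy stand-ins so the sketch runs without a GPU or a language model.
fake_generate = lambda prompt, n: [f"{prompt} {'!' * i}" for i in range(n)]
fake_reward = lambda text: float(text.count("!"))
print(best_and_worst_of_n("You are a", fake_generate, fake_reward, n_candidates=4))
```

No weights are updated here: the base model stays the same, and the reward model only filters its samples.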

In [124]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('Bye')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: Byebye" for me to say goodbye, but what the hell, I had to get this on the record, y'know?

I had to say goodbye, then she said, "Yeah, come on, come out with me." She
It's reward: 4.1367188


Sample with min reward: Byebye the United States. It can be found in the pages of a Russian newspaper, in the comments posted beneath a Facebook post, even in the very pages of newspapers on Russia's own domestic news media. At about the same time, the idea of
It's reward: -5.8671875


In [125]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('What the hell is')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: What the hell is that thing?

A. A flying saucer!

Q. Wait, they're that big now? I've seen the pictures. So what is that thing?

A. It's a flying saucer from outer space.
It's reward: 6.359375


Sample with min reward: What the hell is in our bag" was the first time I heard about this! In the end I was happy I didn't get anything special, but I'm glad to have it by my side for the future so I don't lose track of it or have to
It's reward: 4.8085938


Well, the prompt contains the word "hell", so the reward model treats every sample as toxic.

In [126]:
candidate_with_max_reward, max_reward, candidate_with_min_reward, min_reward = reward_guided_inference('Are you okay?')
print('Sample with max reward:', candidate_with_max_reward)
print('It\'s reward:', max_reward)
print('\n')
print('Sample with min reward:', candidate_with_min_reward)
print('It\'s reward:', min_reward)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample with max reward: Are you okay?

Yes, I'm fine.

Have you ever been raped by another guy? That's a real question.

No one rapes me. If they're drunk, it's the first chance that they ever get. A couple of
It's reward: 3.7871094


Sample with min reward: Are you okay? And I was about to ask if you can tell me what happened.


It is a pleasure and a honour to meet you. Thank you for asking me.

She has changed her clothes in the changing room and returned her hat to her
It's reward: -5.7421875


Time to fine-tune the main model with RLHF.

In [128]:
comments_for_rlhf = toxic_comments_train.remove_columns(['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-7 tokens as query (the upper bound is exclusive)

def select_query_and_tokenize(sample):
    query_ids = main_tokenizer.encode(sample["comment_text"])[: sample_length()]
    sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
    sample["input_ids"] = query_ids  # to avoid re-tokenizing later
    return sample  # we do not need the rest - it will be generated by the model

comments_for_rlhf = comments_for_rlhf.map(select_query_and_tokenize, batched=False)
comments_for_rlhf.set_format(type="torch")

Map:   0%|          | 0/144571 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (2132 > 1024). Running this sequence through the model will result in indexing errors


In [131]:
from typing import List
def compute_reward(texts: List[str]) -> torch.Tensor:
  inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
  with torch.no_grad():
    return reward_model(**inputs).logits[:, 0]

compute_reward([toxic_comments_train[2]['comment_text'], toxic_comments_train[16006]['comment_text']])  # test on human-written comments

tensor([-4.7266,  6.1836], device='cuda:0')

In [133]:
import peft
peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2-large")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("gpt2-large", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()

trainable params: 5,898,240 || all params: 779,929,601 || trainable%: 0.7562528710844506


In [135]:
training_args = trl.PPOConfig(
    model_name=main_model.config._name_or_path,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    batch_size=64,
    ppo_epochs=4,
)

ppo_trainer = trl.PPOTrainer(
    training_args, model=main_model.model, tokenizer=main_tokenizer,
    dataset=comments_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)
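
The `data_collator` lambda above is dense, so here is the same logic spelled out. Queries have different lengths, so instead of padding them into one tensor, the collator keeps each column as a plain list. That is the form in which `batch['input_ids']` is passed to `ppo_trainer.generate` and `ppo_trainer.step` in the training loop. `collate_as_lists` is a hypothetical name and the token ids are placeholders.

```python
def collate_as_lists(samples):
    """Turn a list of per-sample dicts into a dict of per-key lists (no padding, no stacking)."""
    return {key: [sample[key] for sample in samples] for key in samples[0]}

batch = collate_as_lists([
    {"query": "Hey man", "input_ids": [1, 2]},                    # placeholder ids
    {"query": "You, sir, are", "input_ids": [3, 4, 5, 6, 7]},     # placeholder ids
])
print(batch["query"])      # ['Hey man', 'You, sir, are']
print(batch["input_ids"])  # ragged lists are kept as-is
```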

In [136]:
from tqdm.auto import tqdm
max_steps = 50
generation_kwargs = dict(
    min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
  for epoch, batch in progressbar:
    if epoch >= max_steps:
        break
    response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)

    batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]

    # Evaluation stage
    rewards = compute_reward(batch['response'])

    # Update stage
    stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
    stats['rewards/mean'] = rewards.mean().item()

    print("-" * 30, 'STEP', epoch, '-' * 30)
    print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
    print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
    print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
    print()

    ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


------------------------------ STEP 0 ------------------------------
rewards/mean:	-3.845667839	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.620565951	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)

------------------------------ STEP 1 ------------------------------
rewards/mean:	-3.519065857	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.577393472	<---- model-estimated average discounted reward
objective/kl:	0.507083774	<---- how far we are from the original model (regularizer)

------------------------------ STEP 2 ------------------------------
rewards/mean:	-4.130115509	<---- average reward over this batch (higher=better, noisy)
ppo/returns/mean:	-0.916942716	<---- model-estimated average discounted reward
objective/kl:	1.099391699	<---- how far we are from the original model (regularizer)

------------------------------ STEP 3

Hooray, our model has learned to be toxic (at least judging by the metrics).
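
The `objective/kl` line in the log is the regularizer: PPO here does not maximize the raw toxicity score alone. To my understanding, TRL's `PPOTrainer` subtracts a per-token KL penalty between the fine-tuned policy and the frozen reference model, and adds the reward-model score at the last generated token. Below is a simplified plain-Python sketch of that shaping. `kl_penalized_rewards` is a hypothetical name, `kl_coef=0.2` is an assumed value, and the real trainer can adapt the coefficient during training.

```python
from typing import List

def kl_penalized_rewards(logprobs: List[float], ref_logprobs: List[float],
                         score: float, kl_coef: float = 0.2) -> List[float]:
    """Per-token rewards: -kl_coef * (logp - ref_logp), plus the score on the last token."""
    if len(logprobs) != len(ref_logprobs) or not logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score  # the reward model's score is credited at the final response token
    return rewards

# Identical policies: no penalty, the whole score lands on the last token.
print(kl_penalized_rewards([-1.0, -2.0], [-1.0, -2.0], score=6.0))
# The policy is more confident than the reference on token 0, so that token is penalized.
print(kl_penalized_rewards([-0.5, -2.0], [-1.0, -2.0], score=6.0))
```

This is why `objective/kl` starts at zero and grows: the further the policy drifts from the original GPT-2, the more of the reward the penalty eats.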

In [160]:
import gc
gc.collect()

16

Two problems came up here:

1) CUDA out of memory

2) `main_model.generate` does not work; it fails with: 'AutoModelForCausalLMWithValueHead' object has no attribute 'generation_config'

Let's look at what the model generates another way.

In [179]:
main_model.to('cpu')
reward_model.to('cpu')
print()




In [184]:
main_model.eval()
print()




In [180]:
batch = main_tokenizer('What the', return_tensors='pt', return_token_type_ids=False).to('cpu')

for i in range(10):
  with torch.no_grad():
    next_token = main_model(**batch)[0][0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

out = main_tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist())
print("\nOutput:", out)

inputs = reward_tokenizer(out, truncation=True, padding = True, return_tensors='pt').to('cpu')
with torch.no_grad():
    rewards = reward_model(**inputs).logits[:, 0].data.cpu().numpy()
print("\nReward:", rewards[0])


Output: What the fuck is this? I'm not even a fucking

Reward: 6.2224483


Phew, the original model wouldn't have said that :)
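
Since `generate` is unavailable on the wrapped model, the cell above decodes greedily by hand: run the model, take the arg-max of the last position's logits, append it, repeat. The cells that follow repeat the same loop with other prompts. Here is a framework-free sketch of that logic; `greedy_decode`, `next_token_logits`, and the toy model are hypothetical stand-ins for the torch code.

```python
from typing import Callable, List, Sequence

def greedy_decode(next_token_logits: Callable[[List[int]], Sequence[float]],
                  prompt_ids: List[int], max_new_tokens: int) -> List[int]:
    """Repeatedly append the arg-max token; no sampling, no early stop at EOS."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return ids

# Toy "model" over a 4-token vocabulary: always prefers (last token + 1) mod 4.
def toy_logits(ids: List[int]) -> List[float]:
    return [1.0 if token == (ids[-1] + 1) % 4 else 0.0 for token in range(4)]

print(greedy_decode(toy_logits, [0], max_new_tokens=5))  # [0, 1, 2, 3, 0, 1]
```

Greedy decoding is deterministic and prone to loops, which is likely why the 'You are' prompt further down repeats the same sentence.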

In [185]:
batch = main_tokenizer('Hi', return_tensors='pt', return_token_type_ids=False).to('cpu')

for i in range(10):
  with torch.no_grad():
    next_token = main_model(**batch)[0][0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

out = main_tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist())
print("\nOutput:", out)

inputs = reward_tokenizer(out, truncation=True, padding = True, return_tensors='pt').to('cpu')
with torch.no_grad():
    rewards = reward_model(**inputs).logits[:, 0].data.cpu().numpy()
print("\nReward:", rewards[0])


Output: Hi, I'm a little bit of a nerd.

Reward: 2.8098888


In [194]:
batch = main_tokenizer('What the', return_tensors='pt', return_token_type_ids=False).to('cpu')

for i in range(20):
  with torch.no_grad():
    next_token = main_model(**batch)[0][0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

out = main_tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist())
print("\nOutput:", out)

inputs = reward_tokenizer(out, truncation=True, padding = True, return_tensors='pt').to('cpu')
with torch.no_grad():
    rewards = reward_model(**inputs).logits[:, 0].data.cpu().numpy()
print("\nReward:", rewards[0])


Output: What the hell is this? I'm not even sure what I'm doing here. I'm just standing here

Reward: 6.3157067


In [197]:
batch = main_tokenizer('You are', return_tensors='pt', return_token_type_ids=False).to('cpu')

for i in range(20):
  with torch.no_grad():
    next_token = main_model(**batch)[0][0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

out = main_tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist())
print("\nOutput:", out)

inputs = reward_tokenizer(out, truncation=True, padding = True, return_tensors='pt').to('cpu')
with torch.no_grad():
    rewards = reward_model(**inputs).logits[:, 0].data.cpu().numpy()
print("\nReward:", rewards[0])


Output: You are not a good person. You are not a good person. You are not a good person. You

Reward: 6.1755815


In [198]:
batch = main_tokenizer('Listen to me', return_tensors='pt', return_token_type_ids=False).to('cpu')

for i in range(20):
  with torch.no_grad():
    next_token = main_model(**batch)[0][0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

out = main_tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist())
print("\nOutput:", out)

inputs = reward_tokenizer(out, truncation=True, padding = True, return_tensors='pt').to('cpu')
with torch.no_grad():
    rewards = reward_model(**inputs).logits[:, 0].data.cpu().numpy()
print("\nReward:", rewards[0])


Output: Listen to me, you little bitch. I'm not going to let you get away with this. I'm not

Reward: 6.3499494


Overall, there is clearly more toxicity here than in the original model. I'd say the original model had almost none: with `reward_guided_inference` it mostly produced ambiguous words that usually read as toxic in the context of a comment, whereas here the model has really let loose. On top of that, the generations now look more like comments than like random text.

While I was finishing the previous cell, Colab decided it was time to wrap up and told me the GPU quota had run out. It has a point, really: I seem to have done everything that was needed.