## Reinforcement learning from human feedback (RLHF) 
Classic example:
Fine-tune GPT-2 with PPO to produce positive IMDB movie reviews, using a sentiment classifier as the reward function.

### Environment Setup

In [2]:
%load_ext autoreload
%autoreload 2

In [8]:
# %pip install transformers trl datasets
# %pip install langchain
# %pip install accelerate
# %pip install xformers
# %pip install wandb

In [9]:
import torch
import pandas as pd
import wandb
from tqdm import tqdm
from transformers import pipeline, AutoTokenizer
from datasets import load_dataset
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

### Training Config

In [12]:
config = PPOConfig(
    model_name="lvwerra/gpt2-imdb",
    learning_rate=1.41e-5,
    log_with="wandb",
)
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}

## start the wandb run used for logging (required because log_with="wandb")
wandb.init()
tqdm.pandas()


### Prepare Training Dataset 
The IMDB dataset contains 50k movie reviews, each annotated with a "positive" or "negative" sentiment label.

In [22]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

def build_dataset(config, dataset_name="imdb", input_min_text_length=2, input_max_text_length=8):
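    """
    Build the dataset for training. This loads the data with `load_dataset`; customize
    this function to train the model on your own dataset.

    Args:
        config (`PPOConfig`):
            Training config; `config.model_name` selects the tokenizer.
        dataset_name (`str`):
            The name of the dataset to be loaded.
        input_min_text_length (`int`):
            Lower bound passed to `LengthSampler` for the query length in tokens.
        input_max_text_length (`int`):
            Upper bound passed to `LengthSampler` for the query length in tokens.

    Returns:
        ds (`datasets.Dataset`):
            The tokenized dataset, with `input_ids` and `query` columns added.
    """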
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # load imdb with datasets
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    # keep only reviews longer than 200 characters
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)
    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds
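
The query construction in `build_dataset` can be illustrated without downloading anything. This is a minimal sketch, assuming the length sampler draws an integer from the half-open range `[min, max)`; `sample_length`, `truncate_query`, and the token ids are hypothetical stand-ins, not part of `trl`.

```python
import random

def sample_length(min_len, max_len, rng=random):
    """Pick an integer length in [min_len, max_len), standing in for a length sampler."""
    return rng.randrange(min_len, max_len)

def truncate_query(token_ids, min_len=2, max_len=8, rng=random):
    """Keep only the first few tokens of an encoded review to use as the prompt."""
    return token_ids[: sample_length(min_len, max_len, rng)]

# Hypothetical token ids standing in for an encoded review.
review_ids = list(range(100, 120))
query_ids = truncate_query(review_ids, rng=random.Random(0))
print(len(query_ids), query_ids)
```

Each review therefore contributes only a short prefix as the query, and the model is left to generate the rest of the review.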

In [23]:
dataset = build_dataset(config)

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


### Load Models

In [25]:
## policy model to be optimized
target_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
## frozen reference model, used by PPO to compute the KL-divergence penalty
reference_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
## tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
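
The role of the reference model can be sketched as follows: the policy is penalized per token for drifting away from the reference, and the classifier score is added on the last generated token. This is a simplified illustration in plain Python, not the `trl` implementation; `kl_penalized_rewards`, the log-probabilities, and the `beta` value are assumptions chosen for the example.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Per-token rewards: -beta * (log pi - log pi_ref), plus the score on the last token."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score  # the sentiment score is credited to the final token
    return rewards

# Hypothetical log-probs of three generated tokens under each model.
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.2, -0.5, -1.5]
token_rewards = kl_penalized_rewards(policy_logprobs, ref_logprobs, score=2.0)
print(token_rewards)
```

A larger `beta` keeps the optimized model closer to the original GPT-2 behaviour, at the cost of slower reward improvement.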

### PPOTrainer

In [27]:
ppo_trainer = PPOTrainer(config, target_model, reference_model, tokenizer, dataset=dataset, data_collator=collator)



### Get Reward Model 
Use a DistilBERT classifier fine-tuned on the IMDB dataset to generate the rewards.

In [28]:
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug


In [None]:
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)

The model outputs are the logits for the negative and positive classes. We use the logit of the positive class as the reward signal for the language model.

In [29]:
text = "this movie was really bad!!"
sentiment_pipe(text, **sent_kwargs)



[[{'label': 'NEGATIVE', 'score': 2.3350486755371094},
  {'label': 'POSITIVE', 'score': -2.726576566696167}]]

In [30]:
text = "this movie was really good!!"
sentiment_pipe(text, **sent_kwargs)

[[{'label': 'NEGATIVE', 'score': -2.2947897911071777},
  {'label': 'POSITIVE', 'score': 2.557039737701416}]]
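
The training loop further down reads the reward by position (`output[1]`), which relies on the POSITIVE entry always coming second. A slightly more defensive sketch selects it by label instead; `positive_logit` is a hypothetical helper, and the inputs are the two pipeline outputs copied from the cells above.

```python
def positive_logit(pipe_output):
    """Return the raw score of the POSITIVE entry for one input text."""
    for entry in pipe_output:
        if entry["label"] == "POSITIVE":
            return entry["score"]
    raise KeyError("no POSITIVE label in pipeline output")

# Per-text outputs copied from the two example cells above.
bad_review = [
    {"label": "NEGATIVE", "score": 2.3350486755371094},
    {"label": "POSITIVE", "score": -2.726576566696167},
]
good_review = [
    {"label": "NEGATIVE", "score": -2.2947897911071777},
    {"label": "POSITIVE", "score": 2.557039737701416},
]

example_rewards = [positive_logit(out) for out in (bad_review, good_review)]
print(example_rewards)
```

The negative review yields a negative reward and the positive review a positive one, which is the signal PPO pushes the generator towards.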

## Train target optimization model

In [31]:
gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

### Steps
1. Get the responses from the target policy model 
2. Get sentiments for query/responses for reward calculation
3. Optimize policy with PPO using the (query, response, reward) triplet
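
Step 3 relies on PPO's clipped surrogate objective. The sketch below shows that objective for a single token in plain Python, under the usual textbook formulation; `ppo_clipped_loss` and the numbers are illustrative assumptions, and the real `ppo_trainer.step` additionally handles batching, advantage estimation, and the value-head loss.

```python
import math

def ppo_clipped_loss(new_logprob, old_logprob, advantage, clip_range=0.2):
    """Negative clipped surrogate objective for one token."""
    ratio = math.exp(new_logprob - old_logprob)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Take the more pessimistic of the unclipped and clipped objectives.
    return -min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical case: the new policy makes the token more likely (ratio > 1 + clip_range).
loss_good_action = ppo_clipped_loss(new_logprob=-0.5, old_logprob=-1.0, advantage=1.0)
loss_bad_action = ppo_clipped_loss(new_logprob=-0.5, old_logprob=-1.0, advantage=-1.0)
print(loss_good_action, loss_bad_action)
```

With a positive advantage the gain is capped once the ratio leaves the clip range, while with a negative advantage the full penalty is kept; this asymmetry is what limits the size of each policy update.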

In [32]:
%%time
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}
## training loop
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    ### Gather response from gpt2
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    ### calculate sentiment score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    ### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

97it [2:16:56, 84.70s/it]


### Validate Trained Model 
Let's inspect some examples from the IMDB dataset. We use `reference_model` as the baseline to compare the optimized `target_model` against the model before optimisation.

In [36]:
#### get samples from dataset
bs = 16
result_data = dict()
dataset.set_format("pandas")
df_batch = dataset[:].sample(bs)
result_data["query"] = df_batch["query"].tolist()
query_tensors = df_batch["input_ids"].tolist()

response_tensors_ref, response_tensors = [], []

#### get response optimized model and reference model
for i in range(bs):
    gen_len = output_length_sampler()
    output = reference_model.generate(
        torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[-gen_len:]
    
    response_tensors_ref.append(output)
    output = target_model.generate(
        torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[-gen_len:]
    response_tensors.append(output)

#### decode responses
result_data["response (before)"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
result_data["response (after)"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]

#### sentiment analysis of query/response pairs before/after
texts = [q + r for q, r in zip(result_data["query"], result_data["response (before)"])]
result_data["rewards (before)"] = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

texts = [q + r for q, r in zip(result_data["query"], result_data["response (after)"])]
result_data["rewards (after)"] = [output[1]["score"] for output in sentiment_pipe(texts, **sent_kwargs)]

# store results in a dataframe
df_results = pd.DataFrame(result_data)
df_results



| | query | response (before) | response (after) | rewards (before) | rewards (after) |
|---|---|---|---|---|---|
| 0 | The Great Carus | The Great Carusaville 1913)"<\|endoftext\|> | co was highly talented and even as a first | 0.146418 | 2.368905 |
| 1 | Acclaim | ed and successful tourist film producer/director | ed by the director of the standard Hollywood | 2.257848 | 2.317222 |
| 2 | Corny! I love it | . Submissions please. Keep my vote miner Trent... | Corny! I love it!<\|endoftext\|> | 2.12416 | 2.527075 |
| 3 | Just watched this early Bugs | Bunny episode and enjoyed | flick and it is | 2.131294 | 1.496077 |
| 4 | Another day stuck indoors, another | one of Kirk's doctors conducts an autopsy on ... | flyingwww film showing flashes of the series ... | -0.280814 | 2.411691 |
| 5 | Morgan Freeman and Paz Vega | ) wondering what they | are so fantastic watching | 0.716412 | 2.640832 |
| 6 | This 1981 | film: limp, | is one of the | -2.326047 | 1.508976 |
| 7 | Liked Stanley & Iris very | much and being heartly, | much! The photography really is | 2.502383 | 2.677704 |
| 8 | The extended nuclear family, | the musical "The Parsons Family," there will | the beautiful daughter and daughter's sibling... | 0.562979 | 1.59298 |
| 9 | The only thing that "An | Root for Hope" could qualify | early 20s tale" is | -1.813615 | -0.880475 |


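Looking at the mean and median reward of the generated sequences, we observe a clear improvement: across the ten rows shown, the mean rises from about 0.60 to about 1.87 and the median from about 0.64 to about 2.34. Nine of the ten rows improve; row 3 gets worse.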
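
The comparison can be reproduced from the reward columns alone. This is a minimal sketch using the standard library, with the values copied from the ten rows of the results table; in the notebook the same figures could be read from `df_results` instead.

```python
from statistics import mean, median

# Reward columns copied from the ten rows of the results table.
rewards_before = [0.146418, 2.257848, 2.12416, 2.131294, -0.280814,
                  0.716412, -2.326047, 2.502383, 0.562979, -1.813615]
rewards_after = [2.368905, 2.317222, 2.527075, 1.496077, 2.411691,
                 2.640832, 1.508976, 2.677704, 1.59298, -0.880475]

summary = {
    "before": {"mean": mean(rewards_before), "median": median(rewards_before)},
    "after": {"mean": mean(rewards_after), "median": median(rewards_after)},
}
# Count the rows whose reward went up after optimisation.
improved = sum(a > b for b, a in zip(rewards_before, rewards_after))

print(summary)
print(f"{improved} of {len(rewards_before)} rows improved")
```

A sample of ten rows is small, so these numbers indicate the direction of the change rather than a statistically tested effect.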

### Save model and Push to Hub

In [38]:
target_model.save_pretrained("mychen76/gpt2-imdb-ppo-v1", push_to_hub=True)
tokenizer.save_pretrained("mychen76/gpt2-imdb-ppo-v1", push_to_hub=True)

('mychen76/gpt2-imdb-ppo-v1/tokenizer_config.json',
 'mychen76/gpt2-imdb-ppo-v1/special_tokens_map.json',
 'mychen76/gpt2-imdb-ppo-v1/vocab.json',
 'mychen76/gpt2-imdb-ppo-v1/merges.txt',
 'mychen76/gpt2-imdb-ppo-v1/added_tokens.json',
 'mychen76/gpt2-imdb-ppo-v1/tokenizer.json')