# Tune GPT2 to Generate Positive Reviews

In this example, we will fine-tune GPT2 (small) to generate positive movie reviews based on the IMDB dataset. The model receives the start of a real review and is tasked with producing a positive continuation.

To reward positive continuations, we use a BERT classifier to score the sentiment of the generated text and use the classifier's outputs as reward signals for PPO training.

## Setup

In [None]:
!pip install -qU transformers trl wandb

## Configuration

In [None]:
import torch
from tqdm import tqdm
import pandas as pd
import wandb

from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

tqdm.pandas()
wandb.init()

In [None]:
config = PPOConfig(
    model_name='lvwerra/gpt2-imdb',
    learning_rate=1.41e-5,
    log_with='wandb'
)

sent_kwargs = {
    'top_k': None,
    'function_to_apply': 'none',
    'batch_size': 16
}

The `gpt2-imdb` model was additionally fine-tuned on the IMDB dataset for 1 epoch with the Hugging Face script. Parameters are mostly taken from the original paper [*Fine-Tuning Language Models from Human Preferences*](https://huggingface.co/papers/1909.08593).

## Load data and models

### Load IMDB dataset

We will load the IMDB dataset and keep only reviews that are longer than 200 characters. Then we tokenize each text and truncate it to a random length with `LengthSampler`.
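
The random truncation can be pictured with a small stand-in for `LengthSampler`. This is a minimal sketch, not TRL's implementation: `sample_length` and `truncate` are hypothetical helpers, the token IDs are made up, and it assumes the sampler draws uniformly from the half-open range `[min, max)`.

```python
import random


def sample_length(min_value, max_value, rng=random):
    """Hypothetical stand-in for LengthSampler: draw an int from [min_value, max_value)."""
    return rng.randrange(min_value, max_value)


def truncate(token_ids, min_value=2, max_value=8, rng=random):
    """Keep only a randomly sized prefix of the token IDs as the query."""
    return token_ids[:sample_length(min_value, max_value, rng)]


rng = random.Random(0)  # seeded so the sketch is reproducible
review_ids = list(range(100, 120))  # made-up token IDs standing in for a tokenized review
query_ids = truncate(review_ids, rng=rng)
print(len(query_ids), query_ids)
```

Each query is therefore a short prefix of a real review, and the model has to write the rest.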

In [None]:
def build_dataset(config, dataset_name='stanfordnlp/imdb', input_min_text_length=2, input_max_text_length=8):
    """Build dataset for training.

    This builds the dataset from `load_dataset`. Customize this function
    to train the model on your own dataset.

    Parameters
    ----------
    config: PPOConfig
        Configuration whose `model_name` selects the tokenizer
    dataset_name: str
        Name of the dataset to be loaded
    input_min_text_length: int, optional, defaults to 2
        Determines the minimum length of the text in tokens
    input_max_text_length: int, optional, defaults to 8
        Determines the maximum length of the text in tokens

    Returns
    -------
    datasets.Dataset
        The tokenized dataset with `input_ids` and `query` columns.
    """
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # load imdb
    ds = load_dataset(dataset_name, split='train')
    ds = ds.rename_columns({'text': 'review'})
    ds = ds.filter(lambda x: len(x['review']) > 200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample['input_ids'] = tokenizer.encode(sample['review'])[:input_size()]
        sample['query'] = tokenizer.decode(sample['input_ids'])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type='torch')

    return ds


def collator(data):
    return dict(
        (key, [d[key] for d in data]) for key in data[0]
    )
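
As a quick illustration of what the collator does, the sketch below applies the same logic to two made-up samples. It turns a list of per-sample dictionaries into one dictionary of lists and does no padding, so queries of different lengths stay as separate entries. The sample values are invented for the example.

```python
def collator(data):
    # Same logic as the collator above: list of dicts -> dict of lists.
    return {key: [d[key] for d in data] for key in data[0]}


# Made-up samples with different lengths, as produced by the random truncation.
samples = [
    {'input_ids': [10, 11], 'query': 'This film'},
    {'input_ids': [12, 13, 14], 'query': 'I saw this'},
]
batch = collator(samples)
print(batch)
```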

In [None]:
dataset = build_dataset(config)
dataset

### Load pretrained GPT2 language models

We will load the GPT2 model with a value head and the tokenizer.

We will load the model twice; the first model will be optimized while the second model serves as a reference to calculate the KL-divergence from the starting point. This serves as an additional reward signal in the PPO training to make sure the optimized model does not deviate too much from the original language model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# model to fine-tune
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
# reference model
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
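
To make the role of the reference model concrete, the sketch below combines a classifier score with a per-token KL penalty. It is an illustration under assumptions, not the `PPOTrainer` internals: `kl_penalized_rewards` is a hypothetical helper, `beta` is an arbitrary coefficient, the log-probabilities are made up, and the KL term is estimated per token as the difference of log-probabilities.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, beta=0.2):
    """Per-token rewards: -beta * (logp - ref_logp), with the score added on the last token."""
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


# Made-up log-probabilities of three response tokens under both models.
policy_logprobs = [-1.0, -0.5, -2.0]
reference_logprobs = [-1.5, -0.5, -1.0]
rewards = kl_penalized_rewards(policy_logprobs, reference_logprobs, score=2.0)
print(rewards)
```

Tokens that the tuned model makes more likely than the reference model does are penalized, which discourages the policy from drifting far from the original language model while it chases the sentiment score.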

### Initialize `PPOTrainer`

In [None]:
trainer = PPOTrainer(
    config,
    model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    data_collator=collator
)

### Load BERT classifier

We load a BERT classifier fine-tuned on the IMDB dataset.

In [None]:
device = trainer.accelerator.device
if trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else 'cpu' # to avoid a `pipeline` bug

sentiment_pipe = pipeline(
    'sentiment-analysis',
    model='lvwerra/distilbert-imdb',
    device=device
)

The model outputs are the logits for the negative and positive classes. We will use the logit of the positive class as a reward signal for the language model.

In [None]:
text = 'this movie was really bad!!'
sentiment_pipe(text, **sent_kwargs) # `sent_kwargs` defined at the beginning

In [None]:
text = 'this movie was really good!!'
sentiment_pipe(text, **sent_kwargs)
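
The sketch below shows how the positive-class score can be pulled out of such outputs. It assumes the pipeline returns one list of `{'label', 'score'}` dictionaries per input text, which is the shape the training loop in this notebook relies on. `positive_scores` is a hypothetical helper, and the numbers in `mock_outputs` are made up rather than real classifier outputs.

```python
def positive_scores(pipe_outputs):
    """Pick the score labeled 'POSITIVE' from each per-text list of label/score dicts."""
    return [
        item['score']
        for output in pipe_outputs
        for item in output
        if item['label'] == 'POSITIVE'
    ]


# Hand-written stand-in for the pipeline output on two texts (scores are made up).
mock_outputs = [
    [{'label': 'NEGATIVE', 'score': 2.3}, {'label': 'POSITIVE', 'score': -2.7}],
    [{'label': 'NEGATIVE', 'score': -2.1}, {'label': 'POSITIVE', 'score': 2.6}],
]
rewards = positive_scores(mock_outputs)
print(rewards)
```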

### Generation settings

We sample from the model with top-k and nucleus (top-p) filtering turned off and without enforcing a minimum response length.

In [None]:
# generation settings
gen_kwargs = {
    'min_length': -1,
    'top_k': 0.,
    'top_p': 1.,
    'do_sample': True,
    'pad_token_id': tokenizer.eos_token_id
}
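
The effect of `top_k=0` and `top_p=1.0` can be illustrated on a toy distribution. `filter_probs` is a hypothetical helper written for this illustration, not the `transformers` implementation. It assumes the common convention that `top_k=0` disables top-k filtering and `top_p=1.0` keeps the whole distribution.

```python
def filter_probs(probs, top_k=0, top_p=1.0):
    """Zero out tokens excluded by top-k / nucleus filtering, then renormalize."""
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)
    if top_k > 0:
        # Keep only the k most likely tokens.
        keep &= set(order[:top_k])
    if top_p < 1.0:
        # Keep the smallest set of top tokens whose cumulative mass reaches top_p.
        cumulative, nucleus = 0.0, set()
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


toy_probs = [0.5, 0.25, 0.125, 0.125]
unfiltered = filter_probs(toy_probs, top_k=0, top_p=1.0)  # settings used in this notebook
top2 = filter_probs(toy_probs, top_k=2)
nucleus = filter_probs(toy_probs, top_p=0.7)
print(unfiltered, top2, nucleus)
```

With both filters off, every token keeps its original probability, so the model samples from its full distribution.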

## Optimize model

### Training loop

The training loop consists of the following steps:

1. Get the query responses from the policy network (GPT2)
2. Get the sentiment of each query/response pair from BERT
3. Optimize the policy with PPO using the (query, response, reward) triplet

In [None]:
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)


for epoch, batch in enumerate(tqdm(trainer.dataloader)):
    query_tensors = batch['input_ids']

    # Get response from GPT2
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        gen_kwargs['max_new_tokens'] = gen_len
        query_response = trainer.generate(query, **gen_kwargs).squeeze()
        response_len = len(query_response) - len(query)
        response_tensors.append(query_response[-response_len:])

    batch['response'] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    # Compute sentiment score
    texts = [q + r for q, r in zip(batch['query'], batch['response'])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    positive_scores = [
        item['score'] for output in pipe_outputs
        for item in output
        if item['label'] == 'POSITIVE'
    ]
    rewards = [torch.tensor(score) for score in positive_scores]

    # Run PPO step
    stats = trainer.step(query_tensors, response_tensors, rewards)
    trainer.log_stats(stats, batch, rewards)
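
The loop separates the generated tokens from the prompt by slicing the end of the concatenated sequence. The sketch below uses plain lists and made-up token IDs to show that slicing from `len(query)` gives the same result. It also handles the unlikely edge case where no new tokens were generated, since `seq[-0:]` returns the whole sequence rather than an empty one. `split_response` is a hypothetical helper, not part of TRL.

```python
def split_response(query, query_response):
    """Return only the tokens generated after the query prefix."""
    return query_response[len(query):]


query = [101, 102, 103]
query_response = [101, 102, 103, 7, 8, 9, 10]

# Same result as the negative-index slice used in the loop.
response_len = len(query_response) - len(query)
assert split_response(query, query_response) == query_response[-response_len:]

# Edge case: nothing was generated, so the response should be empty.
empty = split_response(query, query)
print(split_response(query, query_response), empty)
```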

## Model inspection

Now we can compare the tuned `model` against `ref_model`, which still holds the original weights.

In [None]:
# get a batch from the dataset
batch_size = 16
game_data = dict()
dataset.set_format('pandas')
df_batch = dataset[:].sample(batch_size)
game_data['query'] = df_batch['query'].tolist()
query_tensors = df_batch['input_ids'].tolist()

response_tensors_ref, response_tensors = [], []

In [None]:
# get response from model and ref_model
for i in range(batch_size):
    query = torch.tensor(query_tensors[i]).to(device)

    # `gen_kwargs` still holds `max_new_tokens` from the training loop,
    # so overwrite it instead of passing the keyword a second time
    gen_len = output_length_sampler()
    gen_kwargs['max_new_tokens'] = gen_len
    # ref_model response
    query_response = ref_model.generate(
        query.unsqueeze(dim=0),
        **gen_kwargs
    ).squeeze()
    response_len = len(query_response) - len(query)
    response_tensors_ref.append(query_response[-response_len:])
    # model response
    query_response = model.generate(
        query.unsqueeze(dim=0),
        **gen_kwargs
    ).squeeze()
    response_len = len(query_response) - len(query)
    response_tensors.append(query_response[-response_len:])


# decode responses
game_data['response (before)'] = [
    tokenizer.decode(response_tensors_ref[i]) for i in range(batch_size)
]
game_data['response (after)'] = [
    tokenizer.decode(response_tensors[i]) for i in range(batch_size)
]


# sentiment analysis of query/response pairs before/after
texts = [
    q + r for q, r in zip(game_data['query'], game_data['response (before)'])
]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
positive_scores = [
    item['score'] for output in pipe_outputs
    for item in output
    if item['label'] == 'POSITIVE'
]
game_data['rewards (before)'] = positive_scores

texts = [
    q + r for q, r in zip(game_data['query'], game_data['response (after)'])
]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
positive_scores = [
    item['score'] for output in pipe_outputs
    for item in output
    if item['label'] == 'POSITIVE'
]
game_data['rewards (after)'] = positive_scores

# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results

The rewards increased after training, as the mean and median below show.

In [None]:
print('Mean:')
display(df_results[['rewards (before)', 'rewards (after)']].mean())
print()
print('Median:')
display(df_results[['rewards (before)', 'rewards (after)']].median())

## Save models

In [None]:
model.save_pretrained('gpt2-imdb-pos-v2', push_to_hub=False)
tokenizer.save_pretrained('gpt2-imdb-pos-v2', push_to_hub=False)