# Fine-Tune GPT2 to Generate Controlled Sentiment Reviews

In this example, we will optimize GPT2 to produce IMDB movie reviews with controlled sentiment, using a BERT sentiment classifier for rewards.

This example is similar to the previous one, in which GPT2 was fine-tuned to generate positive reviews. Here, however, we fine-tune GPT2 (small) to generate **controlled** movie reviews based on the IMDB dataset.

The model gets the target sentiment and 5 tokens from a real review and is tasked to produce continuations with the targeted sentiment.

The reward for the continuation is calculated with the logits of a BERT sentiment classifier, and then is used for PPO training.

## Setup

In [None]:
import random
import torch
import wandb
import time
import os
from tqdm import tqdm
import numpy as np
import pandas as pd
from random import choices
import matplotlib.pyplot as plt

from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model

tqdm.pandas()

In [None]:
sentiment_pipe_kwargs = {
    'top_k': None,
    'function_to_apply': 'none',
}

config = PPOConfig(
    model_name='lvwerra/gpt2-imdb',
    steps=51200,
    learning_rate=1.41e-5,
    remove_unused_columns=False,
    log_with='wandb'
)

text_in_len = 5
text_out_len = 20
seed = 1

np.random.seed(seed)

We will load a GPT2 model called `gpt2-imdb` (`lvwerra/gpt2-imdb` in the config above), which was additionally fine-tuned on the IMDB dataset for 1 epoch. Other parameters are mostly taken from the original paper [*Fine-Tuning Language Models from Human Preferences*](https://huggingface.co/papers/1909.08593).

## Load data and models

### Load pretrained GPT2 language models

We will load the GPT2 model with a value head and the tokenizer.

Here, we need to load the model twice: the first model will be optimized, while the second serves as a reference to calculate the KL-divergence from the starting point. This serves as an additional reward signal in the PPO training to make sure the optimized model does not deviate too much from the original language model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = create_reference_model(model)
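To make the role of the reference model more concrete, here is a minimal pure-Python sketch of a KL-penalized reward. It assumes per-token log-probabilities from both models are already available. The function name `kl_penalized_rewards` and the coefficient `beta` are illustrative and not part of the TRL API, and the exact reward shaping inside `PPOTrainer` may differ.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Combine a scalar score with a per-token KL penalty (illustrative sketch)."""
    # Per-token KL estimate: log pi(token) - log pi_ref(token)
    kl = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    # Every token pays a penalty proportional to its KL estimate
    rewards = [-beta * k for k in kl]
    # The classifier score is credited to the final token of the response
    rewards[-1] += score
    return rewards


# Made-up log-probabilities for a three-token response
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.0, -1.5, -2.5]
print(kl_penalized_rewards(policy_logprobs, ref_logprobs, score=3.0))
```

Tokens that the optimized model makes much more likely than the reference model receive a negative reward, which is what discourages drift away from the original language model.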

### Load IMDB dataset

We will load the IMDB dataset, keep only reviews longer than 500 characters, and truncate each review to its first 1000 characters. The first step discards short reviews; the second avoids tokenizing far more text than we actually need.

In [None]:
dataset = load_dataset('stanfordnlp/imdb', split='train')
dataset = dataset.rename_columns({'text': 'review', 'label': 'sentiment'})

dataset = dataset.filter(lambda x: len(x['review']) > 500, batched=False)
dataset = dataset.map(lambda x: {'review': x['review'][:1000]}, batched=False)

dataset
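The effect of the two preprocessing steps can be checked on a few toy strings. This is a plain-Python sketch that mirrors the `filter` and `map` calls above; the helper `filter_and_truncate` is a hypothetical name introduced only for illustration.

```python
def filter_and_truncate(reviews, min_chars=500, max_chars=1000):
    """Keep reviews longer than min_chars and cut each one to max_chars characters."""
    return [r[:max_chars] for r in reviews if len(r) > min_chars]


# Three synthetic "reviews" of 300, 501, and 2500 characters
reviews = ["a" * 300, "b" * 501, "c" * 2500]
kept = filter_and_truncate(reviews)
print([len(r) for r in kept])
```

The shortest string is dropped, the second passes through unchanged, and the longest is cut to 1000 characters.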

### Tokenize IMDB reviews

We tokenize all IMDB reviews in advance to avoid tokenizing them twice. In the first step we encode the queries and keep only the first `text_in_len` tokens. In the second step we decode these tokens back to text for later display.

In [None]:
dataset = dataset.map(
    lambda x: {
        'input_ids': tokenizer.encode(" " + x['review'], return_tensors='pt')[0, :text_in_len]
    },
    batched=False
)

dataset = dataset.map(
    lambda x: {
        'query': tokenizer.decode(x['input_ids'])
    },
    batched=False
)
dataset = dataset[:20480]

from datasets import Dataset
dataset = Dataset.from_dict(dataset)
dataset.set_format('pytorch')

In [None]:
dataset[0]['input_ids']

In [None]:
def collator(data):
    return dict(
        (key, [d[key] for d in data])
        for key in data[0]
    )
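As a quick illustration of what the collator does, the sketch below applies an equivalent function to two toy examples. The sample values are made up; in the notebook, `input_ids` holds tensors rather than plain lists.

```python
def collator(data):
    # Turn a list of per-example dicts into a dict of per-key lists
    return {key: [d[key] for d in data] for key in data[0]}


batch = collator([
    {"query": "This film was", "input_ids": [1, 2, 3]},
    {"query": "I really liked", "input_ids": [4, 5, 6]},
])
print(batch["query"])
print(batch["input_ids"])
```

Each key maps to a list with one entry per example, so sequences are kept as separate items instead of being padded and stacked into a single tensor.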

In [None]:
trainer = PPOTrainer(
    config,
    model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    data_collator=collator
)

### Load BERT classifier

We will load a BERT classifier (`lvwerra/distilbert-imdb`) that is fine-tuned on the IMDB dataset.

In [None]:
if trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else 'cpu'
else:
    device = trainer.accelerator.device

sentiment_pipe = pipeline(
    'sentiment-analysis',
    'lvwerra/distilbert-imdb',
    device=device
)

The model outputs are the logits for the negative and positive classes. We will use the logits for positive class as a reward signal for the language model.

In [None]:
text = "this movie was really bad!!"
output = sentiment_pipe(text, **sentiment_pipe_kwargs)
output

In [None]:
text = "this movie was really good!!"
output = sentiment_pipe(text, **sentiment_pipe_kwargs)
output

In [None]:
text = "this movie was a documentary"
output = sentiment_pipe(text, **sentiment_pipe_kwargs)
output

The resulting reward signal:

In [None]:
def extract_pipe_output(outputs):
    positive_logits = []
    for out in outputs:
        for element in out:
            if element['label'] == 'POSITIVE':
                positive_logits.append(torch.tensor(element['score']))

    return positive_logits
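The extraction step can be tried without loading the classifier. The sketch below is a plain-Python variant that returns floats instead of tensors and runs on a hand-written stand-in for the pipeline output. The list-of-lists layout is what the pipeline returns for a list of texts with `top_k=None`; the scores and the name `extract_positive_scores` are made up for illustration.

```python
def extract_positive_scores(outputs):
    """Collect the POSITIVE score from each per-text list of label/score dicts."""
    return [
        element["score"]
        for out in outputs
        for element in out
        if element["label"] == "POSITIVE"
    ]


# Hand-written stand-in for pipeline output; the numbers are invented
mock_outputs = [
    [{"label": "NEGATIVE", "score": 2.3}, {"label": "POSITIVE", "score": -2.7}],
    [{"label": "NEGATIVE", "score": -2.2}, {"label": "POSITIVE", "score": 2.6}],
]
print(extract_positive_scores(mock_outputs))
```

One value is returned per input text, regardless of the order in which the two labels appear.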

### Control token dict

We will prepend the control token to each query to signal to the model what the target sentiment is. Each control sequence consists of three tokens:

In [None]:
ctrl_str = ["[negative]", "[neutral]", "[positive]"]
device = torch.device(
    'cuda' if torch.cuda.is_available() else 'cpu'
)

ctrl_tokens = dict(
    (s, tokenizer.encode(s, return_tensors='pt').squeeze().to(device))
    for s in ctrl_str
)

In [None]:
# this is why each control sequence has three tokens:
ctrl_tokens
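The way a control sequence is joined to a query can be sketched with plain lists. The token ids below are hypothetical placeholders, not real GPT2 ids, and `prepend_control` is an illustrative helper; the training loop performs the same concatenation on tensors.

```python
# Hypothetical token ids for illustration only; real ids come from the tokenizer
toy_ctrl_tokens = {
    "[negative]": [101, 102, 103],
    "[neutral]": [101, 104, 103],
    "[positive]": [101, 105, 103],
}


def prepend_control(task, input_ids):
    """Return the control-token ids followed by the query ids."""
    return toy_ctrl_tokens[task] + list(input_ids)


query_ids = [11, 12, 13, 14, 15]  # stands in for the first text_in_len tokens
full_query = prepend_control("[positive]", query_ids)
print(full_query, len(full_query))
```

With three control tokens and `text_in_len = 5`, every query passed to the model is eight tokens long.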

### Reward function

In [None]:
def pos_logit_to_reward(logit, task):
    """Take the positive sentiment logit and scale it for the task.
        task [negative]: reward = -logit
        task [neutral]: reward = -2 * abs(logit) + 4
        task [positive]: reward = logit
    """
    for i in range(len(logit)):
        if task[i] == "[negative]":
            logit[i] = -logit[i]
        elif task[i] == '[neutral]':
            logit[i] = -2 * torch.abs(logit[i]) + 4
        elif task[i] == '[positive]':
            pass
        else:
            raise ValueError("task has to be one of '[negative]', '[neutral]', '[positive]'!")

    return logit

In the examples below, we show the rewards for the cases where the classifier logit is 4, -4, and 0 for the three targets `'[negative]'`, `'[neutral]'`, and `'[positive]'`.

Ideally, we want to use the logit output for each class individually, but since there is no dedicated class for neutral, we will use this as a workaround.

In [None]:
ctrl_str

In [None]:
# logit is 4
pos_logit_to_reward(torch.Tensor([4, 4, 4]), ctrl_str)

In [None]:
# logit is -4
pos_logit_to_reward(torch.Tensor([-4, -4, -4]), ctrl_str)

In [None]:
# logit is 0
pos_logit_to_reward(torch.Tensor([0, 0, 0]), ctrl_str)
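The same scaling rules can be worked through on plain floats. This is a sketch that follows the formulas in the docstring above; `scale_reward` is a hypothetical scalar helper, not the tensor-based function used in training.

```python
def scale_reward(logit, task):
    """Scale a positive-sentiment logit into a reward for the given control string."""
    if task == "[negative]":
        return -logit
    if task == "[neutral]":
        return -2 * abs(logit) + 4
    if task == "[positive]":
        return logit
    raise ValueError(f"unknown task: {task!r}")


tasks = ["[negative]", "[neutral]", "[positive]"]
for logit in (4.0, -4.0, 0.0):
    # One row per logit, one column per control string
    print(logit, [scale_reward(logit, t) for t in tasks])
```

The neutral reward peaks at 4 when the logit is 0 and falls to -4 when the logit reaches 4 or -4, so confident predictions in either direction are penalized for the neutral target.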

### Generation settings

In [None]:
generation_kwargs = {
    'min_length': -1,
    'top_k': 0.,
    'top_p': 1.,
    'do_sample': True,
    'pad_token_id': tokenizer.eos_token_id,
    'max_new_tokens': text_out_len,
    'eos_token_id': -1
}
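With `top_k` set to 0 and `top_p` set to 1, neither filter is active, so tokens are sampled from the model's full distribution. The sketch below is a simplified pure-Python illustration of the two filters, not the implementation used by `transformers`; `filter_probs` is a hypothetical helper and the probabilities are made up.

```python
def filter_probs(probs, top_k=0, top_p=1.0):
    """Zero out tokens excluded by top-k / nucleus filtering, then renormalize."""
    # Token indices sorted from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)
    if top_k > 0:
        # Keep only the top_k most probable tokens
        keep &= set(order[:top_k])
    if top_p < 1.0:
        # Keep the smallest prefix whose cumulative probability reaches top_p
        nucleus, cumulative = set(), 0.0
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]
print(filter_probs(probs))             # both filters disabled
print(filter_probs(probs, top_k=2))    # only the two most likely tokens survive
print(filter_probs(probs, top_p=0.9))  # the least likely token is dropped
```

Keeping the full distribution gives PPO more varied responses to learn from, at the cost of occasionally sampling unlikely tokens.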

## Optimize model

The training loop consists of the following steps:
1. get a batch of queries and create random controls
2. get the query responses from the policy
3. join query and responses and tokenize for BERT analysis
4. get sentiments for query/responses from BERT
5. optimize policy with PPO using the (query, response, reward) triplet
6. log all the training statistics

In [None]:
for epoch in range(2):
    for batch in tqdm(trainer.dataloader):
        logs, game_data = dict(), dict()

        # prepend a random control token
        task_list = choices(ctrl_str, k=config.batch_size)
        game_data['query'] = [t + q for t,q in zip(task_list, batch['query'])]
        query_tensors = [
            torch.cat((ctrl_tokens[t], input_ids))
            for t, input_ids in zip(task_list, batch['input_ids'])
        ]

        # get response from model
        response_tensors = []
        for query in query_tensors:
            response = trainer.generate(query, **generation_kwargs)
            response_tensors.append(response.squeeze()[-text_out_len:])
        game_data['response'] = [
            tokenizer.decode(r.squeeze()) for r in response_tensors
        ]

        # sentiment analysis
        texts = [q + r for q,r in zip(batch['query'], game_data['response'])]
        logits = extract_pipe_output(sentiment_pipe(texts, **sentiment_pipe_kwargs))
        rewards = pos_logit_to_reward(logits, task_list)

        # run PPO training
        t = time.time()
        stats = trainer.step(query_tensors, response_tensors, rewards)

        for cs in ctrl_str:
            key = 'env/reward_' + cs.strip('[]')
            stats[key] = np.mean(
                [r.cpu().numpy() for r, t in zip(rewards, task_list) if t == cs]
            )

        trainer.log_stats(stats, game_data, rewards)
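The per-control statistics logged at the end of each step can be sketched on plain floats. Because the controls are drawn at random, a batch may contain no example for one of them; this illustrative variant (the helper `mean_reward_by_task` is a hypothetical name) skips such controls instead of averaging an empty list.

```python
from statistics import mean


def mean_reward_by_task(rewards, task_list, ctrl_str):
    """Average the rewards separately for each control string."""
    stats = {}
    for cs in ctrl_str:
        selected = [r for r, t in zip(rewards, task_list) if t == cs]
        # Skip controls that were not sampled in this batch
        if selected:
            stats["env/reward_" + cs.strip("[]")] = mean(selected)
    return stats


ctrl_str = ["[negative]", "[neutral]", "[positive]"]
rewards = [1.0, 3.0, -2.0, 0.5]
task_list = ["[positive]", "[positive]", "[negative]", "[positive]"]
print(mean_reward_by_task(rewards, task_list, ctrl_str))
```

Tracking the three averages separately makes it easy to see whether one target sentiment is lagging behind the others during training.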

## Model inspection

We can have a look at the reward distribution. Both the negative and positive rewards are clearly shifted to high rewards. The neutral rewards, however, are still centered around zero.

In [None]:
for ctrl_s in ctrl_str:
    plt.hist(
        # rewards and task_list come from the last training batch
        [r.item() for r, t in zip(rewards, task_list) if t == ctrl_s],
        density=True,
        alpha=0.5,
        label=ctrl_s
    )
plt.legend(loc='best')
plt.title('Reward distribution')
plt.grid(True)
plt.show()

In [None]:
model.save_pretrained('gpt2-imdb-controlled-sentiment')
tokenizer.save_pretrained('gpt2-imdb-controlled-sentiment')