# Improving Generative LLM Models with RL

Generative AI technologies use predictive models to iteratively produce new content. Modern ChatGPT-like large language models (LLMs) are refined with reinforcement learning from human feedback (RLHF), which leverages human preferences to produce "better" results.

In this brazen example I use a pre-trained GPT-2 model fine-tuned on tweets by Donald Trump. The resulting content is... divisive. I'd like to imagine a world where all of Trump's tweets are lined with roses and rainbows.

I can use reinforcement learning as a layer over the top of the LLM and nudge the RL model to request nicer text from the LLM. Basically the RL layer requests a range of generations and learns to pick one that is considered to be the "best" according to some definition of best.

To achieve this, I feed the generated text into a sentiment classifier, then use the value of the "positive sentiment" label to reinforce more positive statements. Over time, the RL agent learns to pick generated text that has a positive sentiment.
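
To make the idea concrete before the real libraries come in, here is a minimal sketch of the sample-score-reinforce cycle. Everything in it is a hypothetical stand-in: three canned completions replace the LLM, a hand-written score table replaces the sentiment classifier, and a plain REINFORCE update replaces the algorithm used later in this notebook.

```python
import math
import random

# Hypothetical stand-ins: three canned completions and a hand-written
# "positivity" score for each. The notebook uses a real LLM and classifier.
CANDIDATES = [" hate this.", " am not sure.", " love this!"]
TOY_POSITIVE_SCORE = {" hate this.": 0.02, " am not sure.": 0.30, " love this!": 0.98}


def softmax(logits):
    """Turn raw preferences into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_policy(steps=2000, learning_rate=0.1, seed=0):
    """REINFORCE on a three-armed bandit: sample, score, reinforce."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    baseline = 0.0  # running average reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = TOY_POSITIVE_SCORE[CANDIDATES[choice]]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(choice) with respect to each logit.
        for i in range(len(logits)):
            indicator = 1.0 if i == choice else 0.0
            logits[i] += learning_rate * advantage * (indicator - probs[i])
    return softmax(logits)


final_probs = train_toy_policy()
best = CANDIDATES[final_probs.index(max(final_probs))]
print(best, [round(p, 3) for p in final_probs])
```

The policy starts out indifferent between the three completions and drifts towards the one with the highest score, which is the same pressure the sentiment reward applies to the real generator.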

Welcome to KindTrumpGPT! (not a real thing please don't take offence ;-p)

This demo was inspired by the excellent work by Eric Lam, author of https://github.com/voidful/TextRL/. I recommend you check it out if you're interested in using RL to improve your LLMs.

## The Libraries

PFRL is one of a number of libraries that implement RL algorithms; StableBaselines3 is a comparable alternative. These libraries are great for development, but we typically suggest a library like Ray (RLlib) for production deployments because of its greater focus on production use.

TextRL is a great wrapper for generative AI models that makes it easy to integrate the common "Gym" interface. It's designed to expose the methods required to fine tune a generative model using RL frameworks.
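
If you haven't met the "Gym" interface before, the sketch below shows its shape with a toy text environment. It is not TextRL's implementation: `ToyTextEnv`, its `reward_fn` argument and the `"<eos>"` marker are names I made up for illustration. It follows the classic Gym contract, where `reset()` returns an observation and `step()` returns a four-tuple (newer Gymnasium releases return five values).

```python
class ToyTextEnv:
    """A toy text environment following the classic Gym reset/step contract."""

    def __init__(self, prompt, max_length, reward_fn):
        self.prompt = prompt
        self.max_length = max_length
        self.reward_fn = reward_fn  # hypothetical scorer, called on finished text
        self.tokens = []

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.tokens = []
        return self.prompt

    def step(self, token):
        """Append one token; return (observation, reward, done, info)."""
        self.tokens.append(token)
        done = token == "<eos>" or len(self.tokens) >= self.max_length
        text = self.prompt + "".join(t for t in self.tokens if t != "<eos>")
        # The reward is only paid out when the episode ends.
        reward = self.reward_fn(text) if done else 0.0
        return text, reward, done, {}


toy_env = ToyTextEnv("I really", max_length=5,
                     reward_fn=lambda text: float("love" in text))
obs = toy_env.reset()
for token in [" love", " puppies", "<eos>"]:
    obs, reward, done, info = toy_env.step(token)
print(obs, reward, done)
```

Any RL library that speaks this interface can drive the environment without knowing that the "actions" are tokens.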

In [1]:
!pip install pfrl@git+https://github.com/voidful/pfrl.git
!pip install textrl==0.2.13


In [2]:
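import logging
import sys

import pfrl
import torch
from textrl import TextRLEnv, TextRLActor
# AutoModelWithLMHead is deprecated in newer transformers releases
# (AutoModelForCausalLM is the replacement for GPT-2 style models).
from transformers import pipeline, AutoTokenizer, AutoModelWithLMHead

# Send INFO-level logs (including training progress) to the notebook output
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')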


## Loading the pre-trained LLM

Here I load the tokenizer -- the preprocessing pipeline necessary for this model -- and the pre-trained model. Thanks to Boris Dayma for [these models](https://github.com/borisdayma/huggingtweets).

In [3]:
tokenizer = AutoTokenizer.from_pretrained("huggingtweets/realdonaldtrump")  
model = AutoModelWithLMHead.from_pretrained("huggingtweets/realdonaldtrump")
model.eval()
model.cuda()
print("loaded model")

loaded model


## Load the Sentiment Classifier

Now we can use Hugging Face's pipelines to load a sentiment model. Any will do, but this one is from the UK's [Cardiff University](https://cardiffnlp.github.io/). It's specifically trained on tweets, so it should match our generator well.

In [4]:
sentiment = pipeline(
    'sentiment-analysis',
    model="cardiffnlp/twitter-roberta-base-sentiment",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment",
    device=0,
    return_all_scores=True,
)



In [5]:
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.CRITICAL)

Testing the sentiment extractor, we can see the format of the output. For this model, `LABEL_0` is negative, `LABEL_1` is neutral and `LABEL_2` is positive:

In [6]:
sentiment("I really love puppies")

[[{'label': 'LABEL_0', 'score': 0.003411790356040001},
  {'label': 'LABEL_1', 'score': 0.013070979155600071},
  {'label': 'LABEL_2', 'score': 0.9835171699523926}]]

In [7]:
sentiment("I really love puppies")[0][2]['score']

0.9835171699523926
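
Indexing with `[0][2]` relies on the labels arriving in a fixed order. A slightly more defensive option is to look the score up by label name. The helper below is a sketch of that idea: `positive_score` is a name I've invented, and it assumes the nested list format shown above.

```python
def positive_score(results, positive_label="LABEL_2"):
    """Return the positive-class score from a nested all-scores result."""
    scores_for_first_text = results[0]
    by_label = {entry["label"]: entry["score"] for entry in scores_for_first_text}
    return by_label[positive_label]


# The classifier output shown above, copied in so the sketch runs offline.
example_output = [[
    {"label": "LABEL_0", "score": 0.003411790356040001},
    {"label": "LABEL_1", "score": 0.013070979155600071},
    {"label": "LABEL_2", "score": 0.9835171699523926},
]]

print(positive_score(example_output))
```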

In [8]:
observaton_list = [['I really'], ['My thoughts are']]

## Specifying the RL Environment

In RL, the environment is the place in which the RL agent operates. We're using the default TextRLEnv which basically iteratively generates new text and asks for a reward.

Here I override the reward method to run the generated text through the sentiment classifier and return the positive-class probability multiplied by a constant.

In other words, positive sentiments positively reward the agent, reinforcing positive behaviour.

In [9]:
class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = 0
        # Only score the text once generation has finished or hit the length limit
        if finish or len(predicted_list) >= self.env_max_length:
            predicted_text = tokenizer.convert_tokens_to_string(predicted_list[0])
            # Positive-class probability (index 2) from the sentiment classifier, scaled by 10
            reward = sentiment(input_item[0] + predicted_text)[0][2]['score'] * 10
        return reward
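
Note that this reward is sparse: every intermediate step earns zero and the whole score arrives at the end. The sketch below replays that schedule without any models. `terminal_reward` and the episode values are hypothetical; only the "zero until finished, then probability times 10" rule comes from the class above.

```python
def terminal_reward(finish, n_predicted, max_length, positive_prob, scale=10.0):
    """Pay out the scaled positive probability only when the episode ends."""
    if finish or n_predicted >= max_length:
        return positive_prob * scale
    return 0.0


# A made-up four-token episode that finishes on its last step.
positive_prob = 0.9835171699523926  # score from the puppies example
rewards = [
    terminal_reward(finish=(step == 4), n_predicted=step, max_length=100,
                    positive_prob=positive_prob)
    for step in range(1, 5)
]
print(rewards)
```

The agent therefore has to work out for itself which earlier token choices led to the final score.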

## Specifying the RL Algorithm

The agent uses proximal policy optimization (PPO), an on-policy algorithm (i.e. it learns from text generated by its current policy) known for its robustness.
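
The agent built below comes from `agent_ppo`, so it may help to see where PPO's robustness comes from: a clipped objective that stops any single update from moving the policy too far. This is a per-sample sketch of that standard formula in plain Python, not PFRL's code, and `clip_eps=0.2` is just a commonly used value I've assumed.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is new_policy_prob / old_policy_prob for the action that was taken.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action whose probability already rose by 50%: the gain is capped.
print(ppo_clipped_objective(ratio=1.5, advantage=2.0))
# A bad action whose probability already halved: the pessimistic value is kept.
print(ppo_clipped_objective(ratio=0.5, advantage=-1.0))
# Inside the clipping range the objective is the plain ratio * advantage.
print(ppo_clipped_objective(ratio=1.1, advantage=1.0))
```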

I load an untrained agent (i.e. no changes to the underlying LLM yet) and make a prediction.

Try altering the example observations!

Funnily enough, you've got to be careful not to use a prompt that is excessively Trumpian. Otherwise the model has no choice but to produce Trump-like output. Open-ended openers seem to work best.

Also note how some prompts end up generating repeating content. This is likely just due to a fundamental lack of ability in the underlying LLM. It should be possible to clean this up.
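
One possible starting point for cleaning this up is to measure the repetition. The sketch below computes the share of distinct word trigrams in a piece of text. It is my own illustration rather than anything TextRL provides, and `distinct_ngram_ratio` is a made-up name; a low ratio could, for example, be used to scale the reward down.

```python
def distinct_ngram_ratio(text, n=3):
    """Fraction of word n-grams that are unique (1.0 means no repetition)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram
    return len(set(ngrams)) / len(ngrams)


looping = "I don't know what to do with it. " * 4
varied = "We are working hard to restore our water"

print(round(distinct_ngram_ratio(looping), 3))
print(distinct_ngram_ratio(varied))
```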

In [10]:
env = MyRLEnv(model, tokenizer, observation_input=observaton_list,compare_sample=1)
actor = TextRLActor(env,model,tokenizer)
agent = actor.agent_ppo(update_interval=100, minibatch_size=3, epochs=10)

In [16]:
[o + actor.predict(o) for o in observaton_list]

[['I really',
  ' don’t know what to do with this. I don’t know what to do with it. I don’t know what to do with it. I don’t know what to do with it. I don’t know what to do with it.<|endoftext|>'],
 ['My thoughts are',
  ' with the families of the victims of the attack on our Country’s most important diplomatic and diplomatic center in Beirut. Our hearts are with the families of the innocent people who were hurt and hurt. We will always stand with the people of Beirut. We will always stand with the people of Israel. We will always stand with the people of the Kingdom of Saudi Arabia. We will always stand with the people of Bahrain. We will always stand with the people of the United Arab Emirates. We will always stand']]

## Fine tune the new model

The following code starts the (re)training loop. Inside this big helper function is the generation-ranking-sentiment-update loop. It makes it look easy, but I promise you it's tricky to get right!

The more steps you run this for, the more positive the output will get. Also note that we're not changing the underlying model here, so it still outputs the same content; it's just suppressed by picking the best option.

Hence, it is still possible to lead the model to produce more Trumpian content. If you wanted to improve the generative capabilities, then you should look to fine-tune the underlying generative model.

In [17]:
pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=3000,
    eval_n_steps=None,
    eval_n_episodes=1,       
    train_max_episode_len=100,  
    eval_interval=100,
    outdir='kind_trump_gpt', 
)


(<textrl.actor.TextPPO at 0x7f3ddb222560>,
 [{'average_value': 1.8176855,
   'average_entropy': 2.9274695,
   'average_value_loss': 6.035139634609222,
   'average_policy_loss': -0.013734352106694133,
   'n_updates': 2004,
   'explained_variance': 0.9647971136103161,
   'eval_score': 0.0},
  {'average_value': 1.6483164,
   'average_entropy': 2.9372282,
   'average_value_loss': 0.2637186494702473,
   'average_policy_loss': -0.027392736379988493,
   'n_updates': 2338,
   'explained_variance': -0.20984008664384368,
   'eval_score': 0.27781857177615166},
  {'average_value': 1.582956,
   'average_entropy': 2.9275746,
   'average_value_loss': 5.869950466901064,
   'average_policy_loss': -0.013497050962760113,
   'n_updates': 2672,
   'explained_variance': 0.9727446430516129,
   'eval_score': 0.0},
  {'average_value': 1.612276,
   'average_entropy': 2.908427,
   'average_value_loss': 5.228130067884922,
   'average_policy_loss': -0.0196083639934659,
   'n_updates': 3006,
   'explained_variance'

In [18]:
agent.load("./kind_trump_gpt/best")
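
As far as I understand PFRL's evaluator, the `best` directory holds the agent from the evaluation with the highest mean score so far. The training call also returns the evaluation history, so you can check which evaluation that was. The sketch below uses only the three complete entries visible in the truncated output above (trimmed to two fields), so the real best checkpoint may come from a later evaluation.

```python
# Excerpt of the evaluation history printed above, reduced to two fields.
eval_history = [
    {"n_updates": 2004, "eval_score": 0.0},
    {"n_updates": 2338, "eval_score": 0.27781857177615166},
    {"n_updates": 2672, "eval_score": 0.0},
]

# Pick the evaluation with the highest score.
best_entry = max(eval_history, key=lambda stats: stats["eval_score"])
print(best_entry["n_updates"], best_entry["eval_score"])
```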

## Final Result

Let's see what the final result produces!

In [19]:
[o + actor.predict(o) for o in observaton_list]

[['I really', ' love this!<|endoftext|>'],
 ['My thoughts are',
  ' with the people of Flint, Michigan. We are working hard to restore our water and restore our power to our great people. We are working hard to protect your 2nd Amendment, our Military, your Second Amendment, and your Second Amendment. We are working hard to protect your 2nd Amendment, your Second Amendment, and your Second Amendment. We are working hard to protect your 2nd Amendment, your Second Amendment, and your Second Amendment. We are working hard to protect your 2nd Amendment, your']]