In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd drive/My Drive/Colab Notebooks/upskilling/transformers_ch1

/content/drive/My Drive/Colab Notebooks/upskilling/transformers_ch1


# Chapter 7 - Reinforcement Learning from Human Feedback

* Practical exercise: fine-tune your Shakespeare transformer from Chapter 1 using RLHF to get it to output samples with positive sentiment. (If you didn't do that exercise, you might be able to find some code to help you from the solutions page).

    * Collect some unconditional samples from your model, and label them as positive, negative or neutral in sentiment yourself. You will probably need at least few hundred labels, so keep the samples fairly short so that this isn't too laborious. (Consider pooling your labels if working with others, or mixing in some snippets from the original corpus whose sentiment is less ambiguous.)
    * Fine-tune your model to obtain a reward model that predicts the sentiment of a sample. You can treat this as a sequence-modeling problem by having a model predict special tokens such as "happy" and "sad" based on the sentiment, treating neutral labels as soft 50% labels. Remember to mask all but the last token of the sample.
    * Fine-tune your original model using PPO, with rewards given by the log probability of positive sentiment predicted by the reward model.
        * To make things simpler, don't bother with a value function or GAE. Just use the reward itself as the advantage estimate at every token.
        * Recenter and normalize the rewards. Just use fixed constants taken from a few thousand samples instead of implementing running estimates.
        * As in chapter 6, track the fraction of ratios clipped. If it is below 1%, then increase the iteration batch size, i.e., the number of samples in each alternation between rollouts and optimization. The iteration batch size might need to be a few thousand completions or more.
        * Measure KL(current model || original model). If the fraction of ratios clipped is high enough, you shouldn't need to penalize this directly to prevent it growing too fast. Square root KL should grow roughly linearly over the course of training. Stop training once you reach 10 nats per completion, but keep hold of plenty of intermediate checkpoints at lower KLs.
    * Look at some samples from your different checkpoints, and try to get a sense of which one is the best, and where overoptimization started to occur.
    * Evaluate your preferred model by blindly rating 20 samples from your original model and 20 from your final model, to see whether it really is better. Maybe even ask a friend to do the ratings in case you think you can recognize the model. You can also try comparing it to a model that was trained to optimize negative sentiment (note: don't flip the sign of the reward function when training AGI).
    * Extension/alternative: do the same thing but with conditional rather than unconditional samples (using prompts from the original Shakespeare dataset), and training a comparison reward model instead of an absolute reward model.


In [3]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils import clip_grad_norm_
from sklearn.metrics import accuracy_score

# code
from scripts.models import *
from scripts.config import Configs

# data
from bs4 import BeautifulSoup
import re
import urllib3

In [4]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [5]:
url = 'https://www.gutenberg.org/files/100/100-0.txt'
http = urllib3.PoolManager()
data = http.request('GET', url)
soup = BeautifulSoup(data.data)
soup_str = str(soup)

# cleaning text
def clean_text(string):

  # remove \r and \n
  string = re.sub(r'\r', ' ', string)
  string = re.sub(r'\n', ' ', string)

  # remove characters between <> ad []
  string = re.sub(r'<[^>]+>', ' ', string)
  string = re.sub(r'\[[^\]]+\]', ' ', string)

  # remove spaces greater than one
  string = re.sub(r'\s+', ' ', string)

  return string

# tokenizer function
def tokenizer(clean_text):

  toks = re.split(r'\b', clean_text)
  toks = [i.strip() for i in toks if i != ' ']

  return np.array(toks)

# creating tokens and vocab size
clean_soup = clean_text(soup_str)
toks = tokenizer(clean_soup)
vocab = list(set(toks))
vocab.append('[UNKNOWN]')
VOCAB_SIZE = len(vocab)

# create dictionaries
id2tok = dict(enumerate(vocab))
tok2id = {v:k for k, v in id2tok.items()}

In [6]:
print(VOCAB_SIZE)

31316


In [7]:
cfg = Configs(BATCH_SIZE=64, SEQ_LEN=128, N_LAYERS=6)
BardofAvon = Transformer(vocab_size=VOCAB_SIZE, mask=True, config=cfg).to(device)
BardofAvon.load_state_dict(torch.load('bardofavon.pth'))

<All keys matched successfully>

## Samples

I will use the samples collected by ckkissane in (https://github.com/ckkissane/rlhf-shakespeare/blob/main/rlhf_shakespeare/data/handcrafted_data.jsonl).

In [8]:
! pip install jsonlines

Collecting jsonlines
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Installing collected packages: jsonlines
Successfully installed jsonlines-4.0.0


In [9]:
import jsonlines

objs = []
with jsonlines.open('handcrafted_data.jsonl') as reader:
            for obj in reader:
                objs.append(obj)

In [10]:
dataset = {'text':[], 'label':[]}
for i in objs:
  dataset['text'].append(i['sample'])
  dataset['label'].append(i['sentiment'])

In [11]:
import pandas as pd
df = pd.DataFrame(dataset)

## Reward Model

In [12]:
import copy

In [13]:
class RewardTransformer(nn.Module):

  def __init__(self, pretrained, classes=2):
    super(RewardTransformer, self).__init__()

    self.embed = copy.deepcopy(pretrained.embedding)
    self.posenc = copy.deepcopy(pretrained.pos_enc)
    self.drop = copy.deepcopy(pretrained.dropout)
    self.blocks = copy.deepcopy(pretrained.blocks)
    self.ff1 = nn.Linear(512, 512)
    self.ff2 = nn.Linear(512, classes)

  def forward(self, X):

    emb = self.embed(X)
    out = self.drop(self.posenc(emb))

    for i, l in enumerate(self.blocks):

      out = l(out)

    out = self.ff1(out)

    return self.ff2(out)

In [14]:
# using bert and then fine tunning it

In [15]:
class RewardModelData(Dataset):

  def __init__(self, samples, tok2id, seq_len = 128, num_classes=2):

    self.tok2id = tok2id
    self.id2tok = {v:k for k, v in self.tok2id.items()}

    self.sequences = []

    for i in range(samples.shape[0]):

      # clean and tokenize
      text = clean_text(samples['text'][i])
      text_toks = tokenizer(text)

      # try and get token id else if token not in dict
      toks_ids = []
      not_found_counter = 0
      exceptions = []
      for t in text_toks:
        try:
          id = self.tok2id[t]
          toks_ids.append(id)
        except:
          not_found_counter += 1
          exceptions.append(t)

      # print how many samples were not found
      if not_found_counter != 0:
        print('Sample', i, 'found', not_found_counter, 'exceptions = ', exceptions)

      # pad
      if len(toks_ids) < seq_len:
        toks_ids.extend([self.tok2id['[UNKNOWN]'] for _ in range(len(toks_ids), seq_len)])
      toks_ids = torch.Tensor(toks_ids).type(torch.LongTensor)

      # limit len
      toks_ids = toks_ids[:seq_len]

      # make label - masking all but the last token
      label = -1000 * torch.ones(seq_len)
      label_val = 1 if samples['label'][i] == 'happy' else 0
      label[-1] = label_val
      label = label.type(torch.LongTensor)

      self.sequences.append((toks_ids, label))

  def __len__(self):
    return len(self.sequences)

  def __getitem__(self, idx):

    text, label = self.sequences[idx]

    return text, label

In [16]:
import random
np.random.seed(1)
random.seed(1)
torch.manual_seed(1)

df = df.sample(frac=1)
train_size = int(0.7 * df.shape[0])
df_train, df_test = df[:train_size], df[train_size:]
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

In [17]:
train_dset = RewardModelData(df_train, tok2id)

Sample 3 found 1 exceptions =  ['Howlings']
Sample 10 found 1 exceptions =  ['engrav']
Sample 23 found 2 exceptions =  ['Exeunt', '']
Sample 33 found 1 exceptions =  ['']
Sample 43 found 1 exceptions =  ['']
Sample 78 found 1 exceptions =  ['PHEBE']
Sample 80 found 1 exceptions =  ['Launce']
Sample 81 found 1 exceptions =  ['Goddild']
Sample 82 found 1 exceptions =  ['LAUNCE']
Sample 103 found 1 exceptions =  ['sluggardiz']
Sample 114 found 1 exceptions =  ['']
Sample 120 found 1 exceptions =  ['']
Sample 122 found 1 exceptions =  ['Phebe']
Sample 123 found 1 exceptions =  ['ador']
Sample 137 found 1 exceptions =  ['']
Sample 143 found 1 exceptions =  ['']


In [18]:
test_dset = RewardModelData(df_test, tok2id)

Sample 7 found 1 exceptions =  ['']
Sample 15 found 1 exceptions =  ['']
Sample 22 found 1 exceptions =  ['']
Sample 36 found 1 exceptions =  [". '"]
Sample 43 found 1 exceptions =  ['']
Sample 44 found 1 exceptions =  ["! '"]
Sample 46 found 1 exceptions =  ['']
Sample 58 found 1 exceptions =  ['']
Sample 60 found 1 exceptions =  ['Audre']


In [19]:
train_dlr = DataLoader(train_dset, shuffle=True, batch_size=4)
test_dlr = DataLoader(test_dset, shuffle=True, batch_size=4)

In [20]:
reward_model = RewardTransformer(BardofAvon).to(device)

#### Testing

In [21]:
text, labels = next(iter(train_dlr))
text = text.to(device)
labels = labels.to(device)

In [22]:
out = reward_model(text)
out.shape

torch.Size([4, 128, 2])

In [23]:
F.softmax(out[:, -1, :][:, -1])

  F.softmax(out[:, -1, :][:, -1])


tensor([0.1659, 0.2291, 0.2608, 0.3441], device='cuda:0',
       grad_fn=<SoftmaxBackward0>)

In [24]:
F.cross_entropy(out.view(-1, out.shape[-1]), labels.view(-1), ignore_index=-1000)

tensor(0.7544, device='cuda:0', grad_fn=<NllLossBackward0>)

## Training Reward Model

In [25]:
from tqdm import tqdm
from sklearn.metrics import accuracy_score

In [26]:
def train(model, dtlr, optm, epochs):

  running_train_loss = 0

  model.train()
  for e in range(epochs):
    pbar = tqdm(dtlr, total=len(dtlr))
    running_train_loss = 0
    for t, l in pbar:

      t = t.to(device)
      l = l.to(device)

      # run training
      logits = model(t)
      optm.zero_grad()
      l = F.cross_entropy(logits.view(-1, logits.shape[-1]), l.view(-1), ignore_index=-1000)
      running_train_loss += l.item()
      l.backward()
      optm.step()

    print('Epoch: ', e, 'Training Loss: ', running_train_loss/len(dtlr))

In [27]:
epochs = 2
optm = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

In [28]:
train(reward_model, train_dlr, optm, epochs)

100%|██████████| 47/47 [00:11<00:00,  3.97it/s]


Epoch:  0 Training Loss:  1.0611085888553173


100%|██████████| 47/47 [00:08<00:00,  5.81it/s]

Epoch:  1 Training Loss:  0.6069934961009533





In [29]:
reward_model.eval()
ypreds = []
labels = []
for t, l in train_dlr:

  t = t.to(device)
  mask = l.view(-1) != -1000
  logits = reward_model(t)
  preds = logits.argmax(dim=-1)
  preds = preds.view(-1)[mask].cpu().numpy()
  true_label = l.view(-1)[mask].cpu().numpy()
  ypreds.append(preds)
  labels.append(true_label)

In [30]:
sum(ypreds[0] == labels[0])/len(labels[0])

1.0

In [31]:
reward_model.eval()
correct, total = 0, 0
for test_x, test_y in test_dlr:
    test_logits = reward_model(test_x.to(device))
    pred = test_logits.argmax(dim=-1)
    correct += (pred == test_y.to(device)).sum()
    total += test_x.shape[0]
print(f"validation accuracy: {correct / total}")

validation accuracy: 0.6666666865348816


## PPO

In [32]:
from torch.distributions.categorical import Categorical as C

In [83]:
class PPOAgent(nn.Module):

  def __init__(self, pretrained):
    super(PPOAgent, self).__init__()

    self.model = copy.deepcopy(pretrained)

  def get_policy(self, X):

    logits = self.model(X)
    policy = C(logits=logits)

    return policy

  def predict(self, X):

    logits = self.model(X)
    logits = logits[:, -1, :]
    policy = C(logits=logits)
    action = policy.sample()

    return action

#### Testing

In [84]:
txt, label = next(iter(train_dlr))
test_agent = PPOAgent(BardofAvon).to(device)
a = test_agent.get_policy(txt.to(device))

In [85]:
action = test_agent.predict(txt.to(device))

In [86]:
action.shape

torch.Size([4])

In [87]:
action

tensor([15034, 18240,  9521, 25507], device='cuda:0')

Our loss function

In [88]:
class PPOLoss(nn.Module):

  def __init__(self):
    super(PPOLoss, self).__init__()

  def clippedSurrogateObjective(self, A, log_policy, log_old_policy, epsilon):
    r = torch.exp(log_policy - log_old_policy)
    first = -A*r
    second = -A*torch.clamp(r, 1-epsilon, 1+epsilon)
    L_clip = torch.max(first, second)
    return -L_clip.mean()

  def get_info(self,
               log_policy,
               log_old_policy,
               epsilon=0.1):
    r = torch.exp(log_policy - log_old_policy)
    clipped = r.gt(1.0 + epsilon) | r.lt(1 - epsilon)
    clipfrac = torch.as_tensor(clipped, dtype=torch.float32).mean().item()
    approx_kl = (0.5 * (log_policy - log_old_policy) ** 2).mean().item()

    pi_info = dict(kl=approx_kl, cf=clipfrac)
    return pi_info

  def forward(self, adv,
               log_policy,
               log_old_policy,
               epsilon=0.1):

    lclip = self.clippedSurrogateObjective(adv, log_policy, log_old_policy, epsilon)
    pi_info = self.get_info(log_policy, log_old_policy, epsilon)

    print('Clip Loss:', lclip)

    return lclip, pi_info

PPOStorage class to store important tensors generated during rollout.

In [89]:
class PPOStorage:

  def __init__(self, seq_len, batch_size):
    self.seq_len = seq_len
    self.batch_size = batch_size
    self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    self.start()

  def start(self):
    self.states_b = torch.zeros((self.batch_size, self.seq_len)).type(torch.LongTensor).to(self.device)
    self.actions_b = torch.zeros((self.batch_size, self.seq_len)).type(torch.LongTensor).to(self.device)
    self.logprobs_b = torch.zeros((self.batch_size, self.seq_len)).type(torch.LongTensor).to(self.device)
    self.rewards_b = torch.zeros((self.batch_size,)).type(torch.LongTensor).to(self.device)
    self.adv_b = torch.zeros((self.batch_size, self.seq_len)).type(torch.LongTensor).to(self.device)

  def update(self, states, actions, logprobs, rewards, adv, batch):

    self.states_b[batch] = torch.Tensor(states).type(torch.LongTensor).to(self.device)
    self.actions_b[batch] = torch.Tensor(actions).type(torch.LongTensor).to(self.device)
    self.logprobs_b[batch] = torch.Tensor(logprobs).type(torch.LongTensor).to(self.device)
    self.rewards_b[batch] = torch.Tensor(rewards).type(torch.LongTensor).to(self.device)
    self.adv_b[batch] = torch.Tensor(adv).type(torch.LongTensor).to(self.device)

    if batch == self.batch_size - 1:
      print('Advantage Normalized')
      self.normalize_adv()

  def normalize_adv(self):
    self.adv_b = self.adv_b.float()
    self.adv_b = (self.adv_b - self.adv_b.mean()) / self.adv_b.std()

In [106]:
class PPO:

  def __init__(self,
               seq_len,
               vocab,
               id2tok,
               tok2id,
               storage,
               ppoagent,
               initial_agent,
               reward_model,
               optimizer,
               loss,
               device,
               num_its=50,
               epochs=4,
               batch_size=4):

    self.seq_len = seq_len
    self.vocab = vocab
    self.id2tok = id2tok
    self.tok2id = tok2id
    self.storage = storage
    self.ppoagent = ppoagent.to(device)
    self.initial_agent = initial_agent.to(device)
    self.reward_model = reward_model.to(device)
    self.batch_size = batch_size
    self.optimizer = optimizer
    self.loss = loss.to(device)
    self.num_its = num_its
    self.epochs = epochs
    self.device = device

  def rollout(self, model, batch_num, store=True):

    # phrase
    initial_sample_word = np.random.choice(self.vocab, 1)
    initial_sample = ['[UNKNOWN]' for _ in range(self.seq_len - 1)]
    initial_sample.append(initial_sample_word[0])
    initial_sample = [self.tok2id[w] for w in initial_sample]
    actions = []

    initial_sample = torch.tensor(initial_sample).to(self.device)
    initial_sample = initial_sample[None, :]

    with torch.no_grad():
      for _ in range(self.seq_len):
        next_word = model.predict(initial_sample)
        actions.append(next_word)
        initial_sample = torch.cat((initial_sample, next_word.unsqueeze(0)), dim=-1)
        initial_sample = initial_sample[:, 1:]


    if store:
      actions = torch.tensor(actions).to(self.device)
      rewards = self.reward_model(initial_sample)
      rewards = F.softmax(rewards[:, -1, :], dim=-1)[:, 1]
      rewards = torch.log(rewards)
      advs = rewards * torch.ones(self.seq_len).to(self.device)
      log_prob = model.get_policy(initial_sample).log_prob(actions).squeeze()
      self.storage.update(initial_sample, actions, log_prob, rewards, advs, batch_num)

    phrase = ' '.join([self.id2tok.get(w.item(), '[UNK]') for w in initial_sample.flatten()])

    return phrase

  def full_rollout(self):

    for i in range(self.batch_size):

      phrase = self.rollout(self.ppoagent, i)

  def train(self):

    self.ppoagent.train()
    for _ in range(self.num_its):
      initial_phrase = self.rollout(self.initial_agent, 0, store=False)
      print('INITIAL PHRASE:', initial_phrase)
      self.full_rollout()

      for e in range(self.epochs):
        print('TRAIN EPOCH', e)
        self.optimizer.zero_grad()
        n_log_prob = self.ppoagent.get_policy(self.storage.states_b).log_prob(self.storage.actions_b)
        loss, pi_info = self.loss(self.storage.adv_b, n_log_prob, self.storage.logprobs_b)
        print('KL:', pi_info['kl'], '\n', 'Clip Frac:', pi_info['cf'])
        loss.backward()
        self.optimizer.step()

      after_roll = self.rollout(self.ppoagent, 0, store=False)
      print('AFTER TRAIN PHRASE:', after_roll)


### Training

In [107]:
loss = PPOLoss()
storage = PPOStorage(128, 4)
initial_agent = PPOAgent(BardofAvon)
ppo_agent = PPOAgent(BardofAvon)
optm = torch.optim.Adam(ppo_agent.parameters(), lr=1e-5)
rm = RewardTransformer(BardofAvon)

In [108]:
ppo = PPO(128, vocab, id2tok, tok2id, storage, ppo_agent, initial_agent, rm, optm, loss, device)

In [109]:
ppo.train()

INITIAL PHRASE: Chased Brutus strengths briars Chased scurvy pleasures Eats afire Tuscany interposer kernal Eats shoot BELARIUS _Songs Chased ends pandar exhale did whom divide did ends feeder did strengths jeweller did hotter catalogue Chased remotion Coal did remotion feeder did strengths remotion Coal forward royal Winter did Beast December Chased direction hoard did strengths enamour hoard did Catesby Sodden beckoned standards December did strengths prodigality vaults haeres smilets Fits Chased acquir Performers different did fog remotion FRANCISCO SEMPRONIUS scowls mills did bishop ends flock sleight scurvy carrying scowls Italian did strengths _Songs Work prevailment background HOST _Songs did besmeared marked broke Murray BELARIUS did scowls ”; festival did scurvy carrying strengths chins direction pikes interposer cubs yesterdays USHER scurvy royal Detraction Job daintier did strengths royal _Songs Brake did
Advantage Normalized
TRAIN EPOCH 0
torch.Size([4, 128]) torch.Size([4,

ValueError: Expected parameter logits (Tensor of shape (4, 128, 31316)) of distribution Categorical(logits: torch.Size([4, 128, 31316])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
       grad_fn=<SubBackward0>)

In [None]:
type(tok2id)