## Make GPT-2 simulate Republican cases through Reinforcement Learning

In this homework, we first train a reward model that assigns higher reward to documents that sound more like Republican cases. We then use RL to guide GPT-2 to complete Democratic cases in a Republican style.

The reward model is covered in a previous notebook. All the TODOs you need to complete are in the RL part.

In [2]:
# %pip install transformers trl

In [3]:
# load sc_cases_cleaned.pkl, which we used in the previous notebooks
# it can also be found at https://github.com/elliottash/nlp_lss_2023/blob/master/notebooks/sc_cases_cleaned.pkl
from google.colab import files
uploaded = files.upload()

Saving sc_cases_cleaned.pkl to sc_cases_cleaned.pkl


In [4]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd
import numpy as np

import torch
from tqdm import tqdm

tqdm.pandas()

from transformers import pipeline, AutoTokenizer, DistilBertTokenizerFast, DistilBertForSequenceClassification

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler


### Train a classification model as our reward model

In [6]:

df = pd.read_pickle('sc_cases_cleaned.pkl', compression='gzip')
df = df.assign(author_id=(df['authorship']).astype('category').cat.codes)

# gpu or cpu?
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

model_name = 'distilbert-base-uncased' # huggingface model_ID or path to folder 
model = DistilBertForSequenceClassification.from_pretrained(model_name)

tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
# texts are tokenized batch by batch inside the training loop below,
# so the full corpus is not tokenized up front

optimizer = torch.optim.Adam([
    {'params': model.distilbert.parameters(), 'lr': 1e-5},  
    {'params': model.classifier.parameters(), 'lr': 1e-3}
])

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['opinion_text'].tolist(), df['x_republican'].tolist(), test_size=.2)

# generate batches
X_train, X_test, y_train, y_test = np.array(X_train[:608]), np.array(X_test[:152]), np.array(y_train[:608]), np.array(y_test[:152])
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# group the examples into batches of 8 (608 and 152 are both multiples of 8)
X_train, X_test, y_train, y_test = X_train.reshape(-1, 8), X_test.reshape(-1, 8), y_train.reshape(-1, 8), y_test.reshape(-1, 8)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

X_train, X_test = X_train.tolist(), X_test.tolist()

# train

model.to(device)
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    for text, labels in tqdm(zip(X_train, y_train), total=len(X_train)):
        # prepare model input through our tokenizer
        model_inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)
        # place everything on the right device
        model_inputs = {k:v.to(device) for k,v in model_inputs.items()}
        # labels have to be torch long tensors
        labels = torch.tensor(labels).long().to(device)
        # now, we can perform the forward pass
        output = model(**model_inputs, labels=labels)
        loss, logits = output[:2]
        # and the backward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

torch.save(model, 'republican_classifier.pt')
republican_classifier = model

cuda


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

(608,) (152,) (608,) (152,)
(76, 8) (19, 8) (76, 8) (19, 8)


100%|██████████| 76/76 [00:16<00:00,  4.62it/s]
100%|██████████| 76/76 [00:15<00:00,  4.81it/s]
100%|██████████| 76/76 [00:16<00:00,  4.67it/s]
100%|██████████| 76/76 [00:16<00:00,  4.62it/s]
100%|██████████| 76/76 [00:16<00:00,  4.68it/s]
100%|██████████| 76/76 [00:22<00:00,  3.40it/s]
100%|██████████| 76/76 [00:15<00:00,  4.75it/s]
100%|██████████| 76/76 [00:16<00:00,  4.71it/s]
100%|██████████| 76/76 [00:21<00:00,  3.59it/s]
100%|██████████| 76/76 [00:18<00:00,  4.09it/s]
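
The two `reshape(-1, 8)` calls in the cell above only work because the slices of 608 and 152 examples are exact multiples of the batch size 8. As a minimal sketch (the helper name `make_batches` is hypothetical and not part of the notebook), a plain-Python batcher handles arbitrary lengths by letting the last batch be shorter:

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches; the last one may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 608 training examples in batches of 8 give the 76 batches seen in the progress bars
train_batches = make_batches(list(range(608)), 8)
# a length that is not a multiple of 8 still works, with a shorter final batch
odd_batches = make_batches(list(range(610)), 8)
print(len(train_batches), len(odd_batches), len(odd_batches[-1]))
```

With such a helper, the `[:608]` and `[:152]` truncation would no longer be needed, at the cost of one smaller batch at the end of each epoch.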


In [25]:
from transformers import DistilBertTokenizer, TextClassificationPipeline

distil_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', max_length=512, truncation=True)
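# note: this rebinds the name `pipeline`, shadowing the `pipeline` factory imported from transformers above
pipeline = TextClassificationPipeline(model=republican_classifier, tokenizer=distil_tokenizer, return_all_scores=True, device=republican_classifier.device)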

In [14]:
texts = df.loc[df['x_republican'] == 0, 'opinion_text'].tolist()
texts[0]

'JUSTICE GINSBURG delivered the opinion of the Court.\n\n A motion by a federal prisoner for postconviction relief under 28 U.S.C. § 2255 is subject to a one-year time limitation that generally runs from "the date on which the judgment of conviction becomes final." § 2255, P6(1). This case concerns the starting date for the one-year limitation. It presents a narrow but recurring question on which courts of appeals have divided: When a defendant in a federal prosecution takes an unsuccessful direct appeal from a judgment of conviction, but does not next petition for a writ of certiorari from this Court, does the judgment become "final" for postconviction relief purposes (1) when the appellate court issues its mandate affirming the conviction, or, instead, (2) on the date, ordinarily 69 days later, when the time for filing a petition for certiorari expires?\n\nIn accord with this Court\'s consistent understanding of finality in the context of collateral review, and the weight of lower co

In [26]:
pipeline(X_test[0][0], truncation=True)

[[{'label': 'LABEL_0', 'score': 2.4655144443386234e-05},
  {'label': 'LABEL_1', 'score': 0.999975323677063}]]
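
The two scores above sum to one, which suggests the pipeline applied a softmax to the classifier logits. The RL cells later pass `"function_to_apply": "none"`, so the rewards used there are raw logits instead of probabilities. The sketch below illustrates the difference with the standard library only; the logit values are hypothetical and not taken from the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical logits for [LABEL_0, LABEL_1]; not taken from the notebook
logits = [-5.3, 5.3]
probs = softmax(logits)

reward_as_logit = logits[1]   # what "function_to_apply": "none" would expose
reward_as_prob = probs[1]     # what a softmax output would expose
print(reward_as_logit, round(reward_as_prob, 6))
```

Raw logits are unbounded and keep growing as the classifier becomes more confident, whereas probabilities saturate near 1, which is one plausible reason to prefer logits as a reward signal.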

### Reinforcement Learning

In [41]:
# prepare the dataset for reinforcement learning

from torch.utils.data import Dataset

class PPODataset(Dataset):
  def __init__(self, tokenizer, texts, input_min_text_length, input_max_text_length):
    self.tokenizer = tokenizer
    self.texts = texts
    self.random_sample = LengthSampler(input_min_text_length, input_max_text_length)

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, index):
    text = self.texts[index]
    sample = {}
    # use the tokenizer stored on the dataset instead of relying on a global variable
    sample["input_ids"] = torch.tensor(self.tokenizer.encode(text)[: self.random_sample()])
    sample["query"] = self.tokenizer.decode(sample["input_ids"])
    return sample

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

dataset = PPODataset(tokenizer, texts, 10, 15)
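
`LengthSampler(10, 15)` decides how many prompt tokens each query keeps. We assume here that TRL's sampler draws uniformly from `range(min, max)`, so the upper bound is exclusive. The stand-in below (`ToyLengthSampler`, a hypothetical name) mimics that assumed behavior with the standard library and truncates a token list the same way `__getitem__` does.

```python
import random

class ToyLengthSampler:
    """Stand-in for trl.core.LengthSampler (assumed: uniform over range(min, max))."""

    def __init__(self, min_value, max_value, seed=None):
        self.values = list(range(min_value, max_value))
        self.rng = random.Random(seed)

    def __call__(self):
        return self.rng.choice(self.values)

sampler = ToyLengthSampler(10, 15, seed=0)
token_ids = list(range(100))         # placeholder ids standing in for an encoded opinion
query_ids = token_ids[: sampler()]   # keep only the first 10 to 14 tokens as the prompt
print(len(query_ids))
```

Under this assumption every query has between 10 and 14 tokens, which is consistent with the 14-token sample printed further down.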


In [None]:
# prepare arguments and configurations for RL

sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": 16}

output_min_length = 50
output_max_length = 100
output_length_sampler = LengthSampler(output_min_length, output_max_length)


generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

def collator(data):
  return dict((key, [d[key] for d in data]) for key in data[0])

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    batch_size=32,
)
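
To see what `collator` does, the sketch below applies the same function to two toy samples shaped like `PPODataset` items (the token ids and prompts are placeholders). It turns a list of dicts into a dict of lists and leaves sequences of different lengths unpadded.

```python
def collator(data):
    # same logic as the notebook's collator: list of dicts -> dict of lists
    return dict((key, [d[key] for d in data]) for key in data[0])

# two toy samples shaped like PPODataset items (token ids are placeholders)
samples = [
    {"input_ids": [1, 2, 3], "query": "first prompt"},
    {"input_ids": [4, 5], "query": "second prompt"},
]
batch = collator(samples)
print(batch)
# → {'input_ids': [[1, 2, 3], [4, 5]], 'query': ['first prompt', 'second prompt']}
```

Keeping each `input_ids` entry as its own sequence suits the PPO loop, which generates a response for every query separately.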

In [56]:
# TODO: prepare model (use gpt2), reference model, and PPO trainer for RL
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)

{'input_ids': tensor([25008,  8476,   402,  1268, 16811,  4261,    38,  6793,   262,  4459,
          286,   262,  3078,    13]), 'query': 'JUSTICE GINSBURG delivered the opinion of the Court.'}
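
The reference model stays frozen so that PPO can penalize the tuned model for drifting too far from the original GPT-2. The sketch below shows one common way to shape the reward, which we assume is close to what the trainer does internally: every response token receives a penalty proportional to the difference in log-probabilities, and the classifier score is added on the last token. The function name, the coefficient, and all numbers are hypothetical.

```python
def kl_shaped_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
    """Per-token rewards: a KL-style penalty everywhere, plus the score on the last token."""
    if len(logprobs) != len(ref_logprobs) or not logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

# hypothetical numbers: three response tokens and a classifier score of 1.0
policy_lp = [-1.0, -2.0, -0.5]   # log-probs under the model being tuned
ref_lp = [-1.5, -2.0, -1.0]      # log-probs under the frozen reference model
print(kl_shaped_rewards(1.0, policy_lp, ref_lp))
```

Tokens that the tuned model makes more likely than the reference are penalized, so the policy only moves away from GPT-2 when the classifier score justifies it.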


In [58]:
# TODO: conduct PPO training loop here. For efficiency, you can just train 3 batches.

for step, batch in tqdm(enumerate(ppo_trainer.dataloader), total=len(ppo_trainer.dataloader)):
    # each iteration processes one batch, so stop after 3 batches
    if step >= 3:
        break
    query_tensors = batch["input_ids"]

    #### Get response from gpt2
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute rewards: raw LABEL_1 (Republican) logit from the reward model
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = pipeline(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

  0%|          | 0/5 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 60%|██████    | 3/5 [02:15<01:30, 45.26s/it]
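
The loop isolates the continuation with `response.squeeze()[-gen_len:]`, which assumes that generation always produces exactly `gen_len` new tokens. If generation stopped early, for example on an end-of-sequence token, that slice would pull in prompt tokens. A possible alternative is to slice by prompt length. The sketch below uses plain lists in place of tensors, the token ids are made up, and `split_response` is a hypothetical helper.

```python
def split_response(full_sequence, query_len):
    """Return only the tokens generated after the prompt."""
    return full_sequence[query_len:]

query = [11, 12, 13, 14]   # made-up prompt token ids
generated = [21, 22]       # generation stopped after 2 tokens
full = query + generated
gen_len = 5                # requested number of new tokens

by_tail = full[-gen_len:]                      # tail slice, as in the loop above
by_prompt = split_response(full, len(query))   # slice by prompt length
print(by_tail, by_prompt)
# → [12, 13, 14, 21, 22] [21, 22]
```

When every response has the full requested length, both slices return the same tokens.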


In [67]:
# visualize the outcomes generated by the RL tuned GPT-2

#### get a batch from the dataset
game_data = dict()
game_data["query"] = [dataset[i]["query"] for i in range(15)]
query_tensors = [dataset[i]["input_ids"] for i in range(15)]

response_tensors_ref, response_tensors = [], []
gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}
#### get response from gpt2 and gpt2_ref
for i in range(15):
    gen_len = output_length_sampler()
    # query_tensors[i] is already a tensor, so add the batch dimension directly
    output = ref_model.generate(
        query_tensors[i].unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[-gen_len:]
    response_tensors_ref.append(output)
    output = model.generate(
        query_tensors[i].unsqueeze(dim=0).to(device), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()[-gen_len:]
    response_tensors.append(output)

#### decode responses
game_data["response (before)"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(15)]
game_data["response (after)"] = [tokenizer.decode(response_tensors[i]) for i in range(15)]

#### reward-model scores of query/response pairs before/after
texts = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
game_data["rewards (before)"] = [output[1]["score"] for output in pipeline(texts, **sent_kwargs)]

texts = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
game_data["rewards (after)"] = [output[1]["score"] for output in pipeline(texts, **sent_kwargs)]

# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results

| | query | response (before) | response (after) | rewards (before) | rewards (after) |
|---|---|---|---|---|---|
| 0 | JUSTICE GINSBURG delivered the opinion of the | Court. DEAN M. DUNE, J.\n\nI join DEA Adminis... | Court and the Court's assignment and analysis... | -4.480332 | -4.721513 |
| 1 | Justice Breyer delivered the opinion of the Co... | Act leases all property claims for this count... | opinion is final and all. The Court concurrin... | -3.510569 | -4.009773 |
| 2 | Justice Ginsburg delivered the opinion of the ... | ____ Post at 94-106.\n\nScalia is entitled to ... | __________________\n\nThe Court\n\nCALCAS, COU... | -4.925821 | -4.317393 |
| 3 | Justice Ginsburg delivered the opinion of the ... | JUSTICE MARSHALL and JUSTICE STEVENS in holdi... | the President and JJ.E.A. in the dissent.\n\n... | -5.167607 | -4.579067 |
| 4 | Justice Breyer delivered the opinion of the Co... | \n* MARSHALL, J., joined by WHITE, J., filed a... | \nPost, p. 343\n\nREHNQUIST, J., delivered the... | -4.184136 | -3.948978 |
| 5 | Justice Ginsburg delivered the opinion of the ... | Chief Justice Eric H. Souter's words in that ... | The Court has long upheld the rule of majority... | -4.590804 | -5.040834 |
| 6 | Justice Breyer delivered the opinion of the Co... | \nI respectfully dissent.\n\nI agree with word... | \nJUSTICE BRENNAN delivered the opinion of the... | -3.961765 | -4.375169 |
| 7 | Justice Ginsburg delivered the opinion of the ... | \nJUSTICE STEVENS, joined by Justice BRENNAN, ... | \nS. JUSTICE REHNQUIST, joined by JURIST BLACK... | -4.389142 | -3.97336 |
| 8 | Justice Breyer delivered the opinion of the Co... | In granting certiorari, the Court was mindful ... | The challenge that would challenge the court's... | -3.658533 | -4.632312 |
| 9 | Justice Ginsburg delivered the opinion of the ... | In its decision, Hilary Anthony Thomas was a c... | The BILL on Reduction of Child Insemination Ac... | -4.807745 | -3.949431 |
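
To judge whether the three PPO batches changed anything, it helps to summarize the rewards instead of reading them row by row. The sketch below is one way to do this, using only the ten rows displayed above (the notebook generated 15 samples, so this is a partial view).

```python
from statistics import mean

# rewards copied from the ten rows displayed above
before = [-4.480332, -3.510569, -4.925821, -5.167607, -4.184136,
          -4.590804, -3.961765, -4.389142, -3.658533, -4.807745]
after = [-4.721513, -4.009773, -4.317393, -4.579067, -3.948978,
         -5.040834, -4.375169, -3.97336, -4.632312, -3.949431]

# count the rows where the tuned model scored higher than the reference model
improved = sum(a > b for a, b in zip(after, before))
print(f"mean before: {mean(before):.4f}")
print(f"mean after:  {mean(after):.4f}")
print(f"rows with higher reward after PPO: {improved}/{len(before)}")
```

On these rows the two means are nearly equal and half of the rows improve, which is unsurprising after only three PPO batches. Training for more batches would likely be needed before drawing conclusions.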
