# Capstone Part 2d: GPT-2 Model

Preface: Modelling was done on Colab, so there are a few lines of code that are useful only on Colab.

We now try to finetune a pretrained GPT-2 model on our dataset and see if it outperforms the rest.

Note: To use this notebook, run the import cells and the function cells, then go straight to the 'Results' section.

Abstract: GPT, or Generative Pre-trained Transformer, is an attention-based model that learns to focus on the previous words most relevant to the task at hand. It does this by having the decoder assign weights to specific earlier positions in the sequence, thus creating a 'context'. Since this is a transfer learning approach, the modelling process simply requires us to fine-tune the pretrained model released by OpenAI.

Reference: https://github.com/openai/gpt-2

A note on the decoding method used: there are several decoding methods for transformers (or any encoder-decoder based model), and each gives quite different results.

1. Greedy Search: Uses the word with the highest probability given the context as the next word. This can make a phrase repeat indefinitely when the phrase leads back to its own first word. (**I** love my dog but **I** love my dog but **I**...)


2. Beam Search: Keeps track of the x highest-probability paths at each step, where x is the number of beams, and returns the highest-probability path among them once the EOS token is reached. Coupled with the n-gram penalty parameter, it ensures that no phrase of n words appears twice in the generated text.


3. Top-k Sampling: Only the k most probable words are kept, and the probability mass is redistributed among them before sampling; this is the sampling scheme GPT-2 adopted. Sampling expands the vocabulary of the generator considerably, but also opens it up to subpar word sequences. To combat this, the top-k and temperature parameters are tuned. The temperature parameter controls how much randomness the model accepts: the closer it is to 0, the more the model favours the highest-probability word (approaching greedy search).


4. Nucleus Sampling: The smallest possible set of words whose cumulative probability exceeds the probability p is chosen. The probability mass is then redistributed across this set, as in top-k. This means that the more evenly the probability is spread over the top few choices, the more choices the generator has for the next word. This is the sampling method used below, and can be identified by the parameter 0 < top_p < 1.

Reference: https://huggingface.co/blog/how-to-generate
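
The sampling filters described in the list above can be sketched in plain Python on a toy distribution. This is only an illustration: the helper names (`softmax`, `top_k_filter`, `top_p_filter`) and the four-word distributions are assumptions made for this example, not part of the notebook or of any library.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable entries and renormalise them."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set whose cumulative probability exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative > p:
            break
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

# Hypothetical next-word distributions over a four-word vocabulary.
peaked = [0.85, 0.05, 0.05, 0.05]
flat = [0.4, 0.3, 0.2, 0.1]

print(top_k_filter(peaked, k=2))    # always two candidates
print(top_p_filter(peaked, p=0.8))  # one candidate: the top word alone exceeds 0.8
print(top_p_filter(flat, p=0.8))    # three candidates are needed to exceed 0.8
```

With a fixed k the number of candidates never changes, whereas the nucleus grows or shrinks with the shape of the distribution, which is the behaviour described for `top_p` above.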

In [2]:
# basic imports
import numpy as np
import pandas as pd
import random
import torch
import fire
import logging
import os
import csv

# pytorch imports
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

# transformers imports
from transformers import (GPT2Tokenizer,
                          GPT2LMHeadModel,
                          AdamW,
                          get_linear_schedule_with_warmup)

# misc imports
from tqdm import tqdm, trange

In [3]:
# instantiate dataset class
class email(Dataset):
    
    def __init__(self, truncate=False, gpt2_type='gpt2', max_length=768):
        
        # instantiate pretrained tokenizer
        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.emails = []

        with open('enron6_clean.csv', newline='') as csvfile:
            email_csv = csv.reader(csvfile)
            for row in email_csv:
                # encode text into tensors
                self.emails.append(torch.tensor(
                    self.tokenizer.encode(
                        # keep the first max_length characters (a character
                        # cap, not a token cap; GPT-2's context is 1024 tokens)
                        # <|endoftext|> is GPT-2's document delimiter
                        f'{row[0][:max_length]}<|endoftext|>'
                    )))
                
        if truncate:
            self.emails = self.emails[:20000]
            
        self.email_count = len(self.emails)
        
    def __len__(self):
        return self.email_count

    def __getitem__(self, item):
        return self.emails[item] # return a particular tensor

In [4]:
# pack several entries into one input tensor so that each holds as much text as possible
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None
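
The packing idea can be illustrated without tensors. The sketch below is a plain-list analogue, not the notebook's function: `pack_sequences` and the toy token ids are hypothetical, and it simply appends entries in order, whereas `pack_tensor` places the new entry in front and drops the first token of the previously packed tensor. It shows the greedy strategy of filling each training sequence up to a length cap and starting a new one with the entry that did not fit.

```python
def pack_sequences(sequences, max_seq_len):
    """Greedily concatenate token-id lists into packs of at most max_seq_len tokens.

    A single sequence longer than max_seq_len still becomes its own pack.
    """
    packs, current = [], []
    for seq in sequences:
        if current and len(current) + len(seq) > max_seq_len:
            packs.append(current)  # the current pack is full: emit it
            current = []
        current = current + seq    # the entry that did not fit starts the next pack
    if current:
        packs.append(current)      # emit the last, possibly partial, pack
    return packs

# Four hypothetical tokenised emails packed into sequences of at most 6 tokens.
emails = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack_sequences(emails, max_seq_len=6)
print(packs)
```

Packing short emails together means fewer forward passes per epoch than feeding each email on its own.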

In [5]:
def train(
    dataset,
    model,
    tokenizer,
    batch_size=16,
    epochs=20,
    lr=2e-5,
    max_seq_len=400,
    warmup_steps=5000,
    gpt2_type='gpt2',
    device='cuda',
    output_dir='',
    output_prefix='',
    test_mode=False,
    save_model_on_epoch=False
):

    acc_steps = 100

    model = model.to(device) # set to cuda (on colab)
    model.train() # switch into training mode
    optimizer = AdamW(model.parameters(), lr=lr) # assign optimizer
    
    # initialize optimizer schedule
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    ) 

    # initialize dataloader to iterate through dataset
    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f'Training epoch {epoch}')
        
        # loop through each batch in the dataloader object
        for idx, entry in tqdm(enumerate(train_dataloader)):
            
            # pack entries into one tensor of up to 768 tokens
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, 768)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            # gradient accumulation: update the weights once every batch_size packed sequences
            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
            input_tensor = remainder # start the next pack with the entry that did not fit (None if nothing is left over)
            
        if save_model_on_epoch and epoch == 10: # save a checkpoint only after the epoch with index 10
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f'{output_prefix}-{epoch}.pt'),
            )
    return model
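
The training loop above calls `optimizer.step()` only once every `batch_size` packed sequences, which is gradient accumulation. A minimal plain-Python sketch of the idea follows, assuming a hypothetical one-parameter model `y = w * x` with squared-error loss and plain gradient descent (not AdamW, and without a learning-rate schedule); `grad` and `train_accumulated` are names invented for this example.

```python
def grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2 * (w * x - y) * x

def train_accumulated(data, w=0.0, lr=0.01, accumulate=4):
    """Sum gradients over `accumulate` examples, then take one update step."""
    accumulated, steps = 0.0, 0
    for count, (x, y) in enumerate(data, start=1):
        accumulated += grad(w, x, y)  # like loss.backward(): gradients add up
        if count % accumulate == 0:
            w -= lr * accumulated     # like optimizer.step()
            accumulated = 0.0         # like optimizer.zero_grad()
            steps += 1
    return w, steps

# Toy data generated from y = 3 * x, repeated 50 times.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)] * 50
w, steps = train_accumulated(data)
print(round(w, 3), steps)
```

Unlike this sketch, the notebook's counter starts at 0, so its first update happens after a single packed sequence and every later update after a further `batch_size` sequences.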

## Modelling

In [None]:
data = email(truncate=True, gpt2_type='gpt2')

In [None]:
model = train(
    data,
    GPT2LMHeadModel.from_pretrained('gpt2'),
    GPT2Tokenizer.from_pretrained('gpt2'),
    batch_size=16,
    epochs=11,
    lr=3e-5,
    max_seq_len=140,
    warmup_steps=5000,
    gpt2_type='gpt2',
    device='cuda',
    output_dir='trained_models',
    output_prefix='email',
    save_model_on_epoch=True
)

Training epoch 0
17402it [01:30, 192.02it/s]
Training epoch 1
17402it [01:30, 191.78it/s]
Training epoch 2
17402it [01:30, 191.37it/s]
Training epoch 3
17402it [01:30, 191.91it/s]
Training epoch 4
17402it [01:30, 191.75it/s]
Training epoch 5
17402it [01:30, 191.93it/s]
Training epoch 6
17402it [01:31, 190.73it/s]
Training epoch 7
17402it [01:30, 192.51it/s]
Training epoch 8
17402it [01:30, 191.74it/s]
Training epoch 9
17402it [01:30, 191.59it/s]
Training epoch 10
17402it [01:30, 191.82it/s]


## Results

In [6]:
# load model
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.load_state_dict(torch.load('models/gpt2_10epochs.pt', map_location='cpu'))
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

In [7]:
# adapted from Huggingface's run_generation.py script
def generate(
    model,
    tokenizer,
    prompt,
    entry_count=1,
    entry_length=100,
    top_p=0.8,
    temperature=1.,
):

    model.eval()

    generated_num = 0

    filter_value = -float('Inf')

    with torch.no_grad():

        for entry_idx in trange(entry_count):

            entry_finished = False

            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

            for i in range(entry_length):
                outputs = model(generated, labels=generated)
                loss, logits = outputs[:2]
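                # logits for the next token, scaled by the temperature
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                # nucleus (top-p) filtering: sort tokens by probability
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(
                    F.softmax(sorted_logits, dim=-1), dim=-1
                )

                # mark tokens beyond the nucleus; shift the mask right so the
                # first token that crosses top_p is kept, and always keep
                # the most probable token
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1
                ].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                # sample the next token from the remaining probability mass
                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)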

                if next_token in tokenizer.encode('<|endoftext|>'):
                    entry_finished = True

                if entry_finished:

                    generated_num = generated_num + 1

                    output_list = list(generated.squeeze().numpy())
                    output_text = tokenizer.decode(output_list)

                    break
            
            if not entry_finished:
                output_list = list(generated.squeeze().numpy())
                output_text = f'{tokenizer.decode(output_list)}<|endoftext|>' 
                
                
    return output_text

### Seeded Text Generation Test

In [None]:
generate(model.to('cpu'), GPT2Tokenizer.from_pretrained('gpt2'),'Please give me an update on the progress of',entry_count=1)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:32<00:00, 32.72s/it]


["Please give me an update on the progress of this game, we're working hard to get everything to you in as soon as possible!\n\nBefore getting into specifics, a good thing to know about PvP. If you don't like PvP but would like to try it out, you're welcome to get in on the game and play!\n\nWe've implemented a couple changes for PvP in Patch 6.0.\n\nNew Weapon Use Rate\n\nAdded the new UPROMIZE Damage class. It's a Weapon with a big<|endoftext|>"]

GPT-2 is very interesting in that it was trained on 8 million web pages whose subjects range widely. As the generated text above shows, because the seed text was rather vague about its topic, the model produced words that make grammatical sense but do not necessarily stay on the intended topic.

In [None]:
generate(model.to('cpu'), GPT2Tokenizer.from_pretrained('gpt2'),'I require the financial reports by today.',entry_count=1)

100%|██████████| 1/1 [00:32<00:00, 32.51s/it]


['I require the financial reports by today. If you do not have enough information, send it to me, and I will add it to the list of possible financial reports as quickly as possible. To the extent possible, please send me the name of the person to whom the financial reports are for payment. In addition, we ask that your name be used as the subject line for the documents.\n\nPlease note that I am not responsible for any costs incurred in providing your personal information.<|endoftext|>']

However, if the topic is stated explicitly ('financial reports'), it does give a paragraph of relevant text. Unfortunately, this paragraph of text sounds awfully like a scam email.

In [None]:
generate(model.to('cpu'), GPT2Tokenizer.from_pretrained('gpt2'),'Enron as a company',entry_count=1)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:30<00:00, 30.49s/it]


['Enron as a company with millions of customers around the world. The company has moved to become more efficient, more agile, more responsive, more responsive. And like many IT companies, we\'ve lost the energy and enthusiasm to run and navigate our business and which business and service can we utilize to provide an improved customer experience.\n\nThat\'s why we are introducing our "Integrated Customer Experience" to support customers and introduce your Customer Service team as a new firm as part of our new $19 Million Company of Advisors<|endoftext|>']

And now it has become an advertisement of sorts. Hilariously though, it doesn't really work as an advertisement that well either ('we've lost the energy and enthusiasm to run and navigate our business', what?).

In [None]:
generate(model.to('cpu'), GPT2Tokenizer.from_pretrained('gpt2'),'There will be an auditor at 3pm',entry_count=1)

100%|██████████| 1/1 [00:19<00:00, 19.34s/it]


['There will be an auditor at 3pm, and at 5pm all kinds of other stuff will be happening.\n\n"Our A.J. room will be open all day, it\'s just going to be a lot of activity."\n\nDo you have a location to call for updates on the venue? Email scott@phillynews.com<|endoftext|>']

This is, by far, the most email-like text of them all! It is relevant, it picked up the topic, and it sounds like an email that would be sent in an office setting.

### Topic-based Text Generation Test

We've seen that with seeded text, it can pick up the topic within. However, what about just the topic as a single word or short phrase?

In [8]:
generate(model.to('cpu'), GPT2Tokenizer.from_pretrained('gpt2'),'finance',entry_count=1)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:30<00:00, 30.49s/it]


'finance and make sure we have the tools we need to make sure the improvements in those wallets are taken into account," Sandoval told CoinDesk.\n\nBorow said that if the wallet will be successfully run on more exchanges, he expects more exchanges to join the market and exchange users to be able to upgrade their wallets, adding that exchanges will also have an option to update their own wallets for those wallets, along with reporting affected wallets.\n\nThe adoption of block size reduction\n\nThe announcement<|endoftext|>'

In [32]:
generate(model.to('cpu'), GPT2Tokenizer.from_pretrained('gpt2'),'schedule a meeting',entry_count=1)

100%|██████████| 1/1 [00:36<00:00, 36.51s/it]


'schedule a meeting with the government of British Columbia about its proposed "Enhanced Action Plan for Extraordinary Suspension of Regulation." These provisions are likely to be watered down, but the passage of those provisions is also likely to be seen as being an effort to help the Crown not directly benefit from a development that would have killed every SPC in Canada, and that Canada would not be able to buy.\n\nThe increase in the costs to the economy of the agreement means that there will be a considerable impact on private-<|endoftext|>'

It sort of works, but you have to be really specific, as any amount of ambiguity can spiral off into something completely irrelevant. Also, it can't really decide on the tone of the text based on a single word, so it's not the best at this.

### Conclusions

Advantages of GPT2:

1. The best at generating coherent and human-like writing.

2. Rather quick to finetune (about 1.5 minutes per epoch in the run above).

3. It is rather tolerant to noise.


Disadvantages of GPT2:

1. Tends to go off on tangents at the slightest bit of ambiguity.

2. It tends to write in an article-like manner more than anything, which might not be desirable behaviour.

3. It cannot autocomplete on a word level.