## GPT-2 at a glance

This notebook uses GPT-2 for high-quality text augmentation. GPT-2 is a Transformer-based language model trained on WebText, about 40GB of text. It is a stack of decoder blocks built around masked self-attention: each position attends only to itself and to earlier positions, and gives more weight to the ones it is most closely related to in context. **The objective that GPT-2 optimizes is essentially to predict the next word in the sequence, given the words seen so far.**
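
To make the masking idea concrete, here is a minimal plain-Python sketch. It is not GPT-2's actual implementation: the function name `causal_attention_weights` and the score values are made up for illustration. It only shows how hiding future positions before the softmax leaves each token with attention weights over itself and the past.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, with future positions masked out."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only attend to positions 0..i (itself and the past).
        visible = row[: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Toy 3-token sequence with made-up raw attention scores.
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [0.0, 0.0, 0.0],
]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
# → [1.0, 0.0, 0.0]
# → [0.5, 0.5, 0.0]
# → [0.333, 0.333, 0.333]
```

The large score 5.0 in the second row has no effect because it belongs to a future position. This is what lets the model be trained to predict the next word without seeing it.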

The GPT-2 models were open-sourced in stages during 2019, which saves us from training such models from scratch. Instead, we can fine-tune them on our own tasks and reuse the pre-trained weights.

## Implementation Details

The approach has two main steps: fine-tune the model, then generate text.

I used PyTorch and the Hugging Face Transformers library. Fine-tuning GPT-2 is as straightforward as training any other model. The two points that deserve attention are the learning-rate scheduler and checkpointing: saving the model at several local minima of the loss and then picking the best checkpoint helps push the final loss lower.
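
The notebook uses a cosine schedule with hard restarts and linear warmup. The following plain-Python sketch shows the learning-rate multiplier as I understand the formula. The function name `cosine_hard_restarts_multiplier` and the step counts are my own choices for illustration, not the library's API.

```python
import math

def cosine_hard_restarts_multiplier(step, warmup_steps, total_steps, num_cycles=1):
    """Learning-rate multiplier in [0, 1]: linear warmup, then cosine decay with hard restarts."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate.
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup training that has elapsed.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle decays along a cosine curve, then restarts at the full rate.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))

base_lr = 3e-5
for step in (0, 50, 100, 300, 500, 700, 900):
    mult = cosine_hard_restarts_multiplier(step, warmup_steps=100, total_steps=900, num_cycles=2)
    print(step, f"{base_lr * mult:.2e}")
```

The multiplier is 0 as soon as `progress` reaches 1, so `total_steps` has to be a realistic count of optimizer steps. If it is too small, the learning rate falls to zero early and training effectively stops.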

Once the model is trained and saved, it is ready to generate samples. Decoding uses a combined top-k and top-p sampling strategy to pick the next token at each timestep (t). This adds variety to the generated text while keeping it within controlled limits.
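
As a small worked example of that sampling strategy, the sketch below applies top-k and then top-p filtering to a toy distribution. The function name `top_k_top_p_filter`, the six-token vocabulary, and the probabilities are all made up for illustration.

```python
import random

def top_k_top_p_filter(probs, k, p):
    """Return (token_ids, renormalized_probs) kept by top-k followed by top-p filtering."""
    # Top-k: keep the k most probable token ids, most probable first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Top-p: keep the shortest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= p:
            break
    # Redistribute the probability mass over the surviving tokens.
    total = sum(probs[i] for i in kept)
    return kept, [probs[i] / total for i in kept]

# Toy vocabulary of 6 tokens with made-up next-token probabilities.
probs = [0.05, 0.40, 0.10, 0.30, 0.10, 0.05]
kept, renorm = top_k_top_p_filter(probs, k=4, p=0.75)
print(kept)  # → [1, 3, 2]

# Sample the next token from the filtered, renormalized distribution.
next_token = random.choices(kept, weights=renorm, k=1)[0]
```

Top-k first narrows the candidates to tokens 1, 3, 2 and 4. Top-p then stops after token 2, where the cumulative probability (0.80) first reaches 0.75. Unlikely tokens can therefore never be sampled, while the choice among the likely ones stays random.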


---




# First, install the Hugging Face Transformers library

In [None]:
pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/1a/41c644c963249fd7f3836d926afa1e3f1cc234a1c40d80c5f03ad8f6f1b2/transformers-4.8.2-py3-none-any.whl (2.5MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting huggingface-hub==0.0.12
  Downloading https://files.pythonhosted.org/packages/2f/ee/97e253668fda9b17e968b3f97b2f8e53aa0127e8807d24a547687423fe0b/huggingface_hub-0.0.12-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)

# Train and fine-tune GPT2 model

In [None]:
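import csv
import os
import argparse
import warnings

import numpy as np
import torch
# transformers.AdamW is deprecated in later releases, so torch's AdamW is used instead
# (note: its default weight_decay is 0.01, whereas transformers.AdamW defaulted to 0.0).
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

warnings.filterwarnings('ignore')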

class MyDataset(Dataset):
    """Reads a two-column CSV file and exposes each row as one training string."""

    def __init__(self, data_file_name):
        super().__init__()

        self.data_list = []
        self.end_of_text_token = " <|endoftext|> "

        with open(data_file_name, newline='') as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')

            for row in csv_reader:
                # Each sample is "<column 0>: <column 1>" followed by the end-of-text marker.
                data_str = f"{row[0]}: {row[1]}{self.end_of_text_token}"
                self.data_list.append(data_str)

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, item):
        return self.data_list[item]

def get_data_loader(data_file_name):
  dataset = MyDataset(data_file_name)
  data_loader = DataLoader(dataset, batch_size=1, shuffle=True)
  return data_loader

def train(epochs, data_loader, batch_size, tokenizer, model, device, optimizer, scheduler):
    # The loader yields one sample at a time; gradients are accumulated over
    # `batch_size` samples before each optimizer step.
    batch_counter = 0
    sum_loss = 0.0

    for epoch in range(epochs):
        print(f'Running {epoch+1} epoch')

        for idx, txt in enumerate(data_loader):
            txt = torch.tensor(tokenizer.encode(txt[0]))
            txt = txt.unsqueeze(0).to(device)
            outputs = model(txt, labels=txt)  # labels = inputs; the model shifts them internally
            loss = outputs[0]
            loss.backward()  # backpropagate; gradients add up across samples
            sum_loss += loss.item()  # .item() stores a plain float instead of a tensor

            # Step only once a full batch of samples has been accumulated.
            if (idx + 1) % batch_size == 0:
                batch_counter += 1
                optimizer.step()  # updates the parameters
                scheduler.step()  # updates the learning rate
                optimizer.zero_grad()  # clears the accumulated gradients

                if batch_counter == 10:
                    print(f"Total Loss is {sum_loss}")  # printed after every 10*batch_size samples
                    batch_counter = 0
                    sum_loss = 0.0

    return model

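def save_model(model, name):
    # Note: ".pt" is appended here, so a name that already ends in ".pt"
    # (such as 'mymodel.pt') is written to 'mymodel.pt.pt'.
    print("Saving model to Disk")
    torch.save(model.state_dict(), f"{name}.pt")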

def load_models():
	print ('Loading/Downloading GPT-2 Model')
	tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
	model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
	return tokenizer, model

def train_gpt2():

  BATCH_SIZE = 32
  EPOCHS = 5
  LEARNING_RATE = 3e-5
  WARMUP_STEPS = 100
  MODEL_NAME = 'mymodel.pt'  # save_model appends '.pt', so the file on disk is 'mymodel.pt.pt'
  DATA_FILE = '/content/drive/MyDrive/data/data.csv'

  TOKENIZER, MODEL = load_models()
  LOADER = get_data_loader(DATA_FILE)
  DEVICE = 'cpu'
  model = MODEL.to(DEVICE)
  model.train() # does not train the model; it only switches it to training mode

  # The AdamW optimizer decouples the weight decay from the optimization step.
  # This means that the weight decay and learning rate can be optimized separately,
  # i.e. changing the learning rate does not change the optimal weight decay.
  optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

  # The scheduler needs the real number of optimizer steps (one step per
  # BATCH_SIZE samples). With a placeholder such as -1, the learning rate
  # drops to 0 right after the warmup period.
  total_steps = max(1, EPOCHS * len(LOADER) // BATCH_SIZE)

  # Create a schedule with a learning rate that decreases following the values
  # of the cosine function between the initial lr set in the optimizer to 0,
  # with several hard restarts, after a warmup period during which it increases
  # linearly between 0 and the initial lr set in the optimizer.
  scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(
      optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_steps)

  # train and save the model
  model = train(EPOCHS, LOADER, BATCH_SIZE, TOKENIZER, model, DEVICE, optimizer, scheduler)
  save_model(model, MODEL_NAME)

In [None]:
train_gpt2()

Loading/Downloading GPT-2 Model




Running 1 epoch
Saving model to Disk
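
The run above saves only the final weights. The checkpoint-picking strategy described in the implementation details is not part of this notebook's code, so here is a minimal sketch of one way it could work. `BestCheckpointTracker` is a hypothetical helper, and the loss values are made up. In practice `save_fn` could wrap a call such as `save_model(model, "best")`.

```python
class BestCheckpointTracker:
    """Hypothetical helper: call the save function only when the loss improves."""

    def __init__(self, save_fn):
        self.save_fn = save_fn  # e.g. lambda: save_model(model, "best")
        self.best_loss = float("inf")

    def update(self, loss):
        """Save and return True if `loss` beats the best loss seen so far."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.save_fn()
            return True
        return False

# Stand-in for real saving: record each time a checkpoint would be written.
saved = []
tracker = BestCheckpointTracker(save_fn=lambda: saved.append("checkpoint"))

# Made-up losses, one per reporting interval.
for interval, loss in enumerate([3.2, 2.9, 3.0, 2.7, 2.8]):
    if tracker.update(loss):
        print(f"interval {interval}: loss {loss} is the best so far, checkpoint saved")
```

Comparing on a held-out validation loss rather than the running training loss would make the chosen checkpoint less likely to be an overfit one.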


# Generate sentences

In [None]:
import os
import argparse
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def choose_from_top_k_top_n(probs, k=50, p=0.8):
  # In Top-K sampling, the K most likely next words are filtered and the
  # probability mass is redistributed among only those K next words.
  #
  # Top-p sampling chooses from the smallest possible set of words whose
  # cumulative probability exceeds the probability p.
  # The probability mass is then redistributed among this set of words.
  #
  # Despite the "top_n" in its name, this function applies top-k, then top-p.
  probs = np.asarray(probs, dtype=np.float64)
  k = min(k, probs.shape[0])

  # Top-k: ids of the k most probable tokens, sorted from most to least likely
  top_k_ids = np.argpartition(probs, -k)[-k:]
  top_k_ids = top_k_ids[np.argsort(probs[top_k_ids])[::-1]]
  top_k_probs = probs[top_k_ids]

  # Top-p: shortest prefix whose cumulative probability reaches p
  cutoff = np.searchsorted(np.cumsum(top_k_probs), p) + 1
  nucleus_ids = top_k_ids[:cutoff]
  nucleus_probs = top_k_probs[:cutoff] / top_k_probs[:cutoff].sum()

  # Sample one token id from the renormalized distribution
  return int(np.random.choice(nucleus_ids, p=nucleus_probs))

def generate(tokenizer, model, sentences, start, sentence_length):
  # Token ids that end a sentence: end-of-text, plus the hard-coded ids 13 and 526
  # (presumably '.' and '."' in the GPT-2 vocabulary).
  stop_ids = set(tokenizer.encode('<|endoftext|>')) | {13, 526}

  # The context manager closes the output file even if generation fails.
  with open('/content/drive/MyDrive/data/gpt2_output.txt', "a") as f_output, torch.no_grad():
    for idx in range(sentences):
      cur_ids = torch.tensor(tokenizer.encode(start)).unsqueeze(0).to('cpu')

      for i in range(sentence_length):
        # No labels are needed at inference time; the first output is the logits.
        logits = model(cur_ids)[0]

        # Probabilities for the token that follows the current sequence
        softmax_logits = torch.softmax(logits[0,-1], dim=0)

        next_token_id = choose_from_top_k_top_n(softmax_logits.to('cpu').numpy()) # top-k + top-p sampling
        cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to('cpu') * next_token_id], dim = 1)

        if next_token_id in stop_ids:
          break

      output_list = list(cur_ids.squeeze().to('cpu').numpy())
      output_text = tokenizer.decode(output_list)
      print (output_text)
      f_output.write(str(output_text)+'\n')

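def load_models(model_name):
    print('Loading Trained GPT-2 Model')
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
    model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
    # Load the fine-tuned weights on top of the pre-trained model;
    # map_location lets weights saved on a GPU load on a CPU-only machine.
    model.load_state_dict(torch.load(model_name, map_location='cpu'))
    model.eval()  # inference mode: disables dropout
    return tokenizer, model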

In [None]:
SENTENCES = 20
SENTENCE_LENGTH = 20
MODEL_NAME = 'mymodel.pt.pt'
START_SENTENCE = 'the customer data can'
  
TOKENIZER, MODEL = load_models(MODEL_NAME)
  
generate(TOKENIZER, MODEL, SENTENCES, START_SENTENCE, SENTENCE_LENGTH)

Loading Trained GPT-2 Model
the customer data can be compromised if somebody is sending the wrong email or someone else has access to the information and is sending
the customer data can be traced back to a single location and not the address or mobile number which is used to register an
the customer data can be shared with third parties, including our commercial and academic partners."
the customer data can be used in the case of fraud or abuse and also in any other cases for other reasons of course
the customer data can be read/viewed, used for business purpose and to communicate with its users," it said.
the customer data can also be used to search for advertising and other forms of service from which to monetize or connect with
the customer data can be obtained by using a custom script that will make calls to our database and query the data to determine
the customer data can then be obtained by any government agency" or "the data will be sold to a private company or
the customer da

In [None]:
SENTENCES = 20
SENTENCE_LENGTH = 20
MODEL_NAME = 'mymodel.pt.pt'
START_SENTENCE = 'the customer data cannot'

TOKENIZER, MODEL = load_models(MODEL_NAME)
  
generate(TOKENIZER, MODEL, SENTENCES, START_SENTENCE, SENTENCE_LENGTH)

Loading Trained GPT-2 Model
the customer data cannot be sold without prior consent from the customer or unless there is a compelling reason to suspect that the data
the customer data cannot be sold, altered, transferred or otherwise removed without the express written consent of the customer.
the customer data cannot be reused without prior authorization."
the customer data cannot be reused if the company can do so without incurring liability for the data, and the risk of
the customer data cannot be acquired with the expectation that any personal information is transmitted in the email," the memo says.
the customer data cannot be stored for as long as 24 hours, however they can be transferred once a day to another storage
the customer data cannot be read, analyzed or shared outside of the company's end user agreement," a representative for the company
the customer data cannot be used for fraud, illegal conduct, or harassment or for any other purpose that violates the law."
the customer

In [None]:
SENTENCES = 3
SENTENCE_LENGTH = 10
MODEL_NAME = 'mymodel.pt.pt'
START_SENTENCE = 'the customer data should'

TOKENIZER, MODEL = load_models(MODEL_NAME)
  
generate(TOKENIZER, MODEL, SENTENCES, START_SENTENCE, SENTENCE_LENGTH)

Loading Trained GPT-2 Model
the customer data should never be stored in a database, where there is
the customer data should be shared with a trusted third party, preferably on
the customer data should never have been put into an IoT device," said
