This notebook contains an implementation of GPT-2 small used for question reformulation. The QReCC dataset is loaded and the GPT-2 model is fine-tuned on it using a cross-entropy loss. Additional special tokens are added to GPT-2's tokenizer, so their rows in the word embedding matrix need to be learnt. The last hidden state of GPT-2 is multiplied by the word embedding matrix and a softmax is applied to select the next token. During training, all tokens except those in the target rewritten question of QReCC are masked out of the loss.
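
The output step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the 4-token vocabulary, the embedding values, the hidden state and the helper name `next_token_distribution` are all made up, and the real model does the same projection with tensors.

```python
import math

# Toy word embedding matrix: one row per vocabulary entry (4 tokens, hidden size 3).
# The values are invented for illustration only.
embedding = [
    [0.1, 0.0, 0.2],
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.3],
    [0.2, 0.2, 0.9],
]

def next_token_distribution(hidden, embedding):
    """Project a hidden state onto the vocabulary and apply a softmax."""
    # One logit per vocabulary entry: dot product of the hidden state with that row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in embedding]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

last_hidden = [0.0, 1.0, 0.5]  # stand-in for the final hidden state at the last position
probs = next_token_distribution(last_hidden, embedding)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_id, [round(p, 3) for p in probs])
```

Because the same matrix is used at the input and at the output, adding a special token adds one row that serves both as its input embedding and as its output scoring vector.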

In [None]:
%%shell
#wget  https://obj.umiacs.umd.edu/elgohary/CANARD_Release.zip
#unzip CANARD_Release.zip
wget https://github.com/apple/ml-qrecc/blob/main/dataset/qrecc_data.zip?raw=true
unzip qrecc_data.zip?raw=true
pip install transformers

--2020-12-09 04:09:56--  https://github.com/apple/ml-qrecc/blob/main/dataset/qrecc_data.zip?raw=true
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/apple/ml-qrecc/raw/main/dataset/qrecc_data.zip [following]
--2020-12-09 04:09:56--  https://github.com/apple/ml-qrecc/raw/main/dataset/qrecc_data.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/apple/ml-qrecc/main/dataset/qrecc_data.zip [following]
--2020-12-09 04:09:57--  https://raw.githubusercontent.com/apple/ml-qrecc/main/dataset/qrecc_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting resp



In [None]:
import json
with open('qrecc_train.json') as f:
  train_file = json.load(f)
with open('qrecc_test.json') as f:
  val_file = json.load(f) #Conversation_no

In [None]:
def format_data(file, dialog_id):
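  '''
  file: list of QReCC samples (the loaded JSON) to format for GPT-2 to fine-tune on.
  dialog_id: key that identifies the dialog each sample belongs to. Consecutive samples
  with the same value are treated as turns of one conversation.
  Returns a flat list of strings of the form
  <|startoftext|>context<|go|>rewrite<|endoftext|>.
  '''
  id = 'init'
  x_data = {}
  y_data = {}
  # idx counts the rewrites already stored for the current conversation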
  for sample in file: 
    if id != sample[dialog_id]: #  initialize new conversation
      # TRY ADDING MORE CONVERSATIONAL IN THE STARTING QUESTION OTHERWISE IMPOSSIBLE AND THESE ARE NOT ANSWERABLE
      id = sample[dialog_id]
      x_data[id] = ['<|startoftext|>' + sample['Question'] + '<|go|>' + sample['Rewrite'] + '<|endoftext|>']
      y_data[id] = [sample['Rewrite']]
      idx = 1
    else:
      x_data[id] += ['<|startoftext|>' + '<|sep|>'.join(y_data[id][:idx] + [sample['Question']]) + '<|go|>' + sample['Rewrite'] + '<|endoftext|>'] # [previous_rewritten_Qs] + [new_Q]
      y_data[id] += [sample['Rewrite']]
      idx += 1

  # TRY SKIPPING SOME OF THE FIRST SAMPLES LATER TO SEE IF THEY ARE JUST INDUCING NOISE INTO THE PROBLEM 
  x_text = [t for text in x_data.values() for t in text]
  y_text = [t for text in y_data.values() for t in text]

  return x_text
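
To make the string layout concrete, here is a small sketch that formats one conversation the same way. The helper `format_dialog` and the two-turn dialog are hypothetical (the dialog is not taken from QReCC); only the special tokens and their order follow `format_data`.

```python
def format_dialog(turns):
    """turns: list of (question, rewrite) pairs from one conversation, in order."""
    samples, history = [], []
    for question, rewrite in turns:
        # Context = all previous rewrites, then the current raw question.
        context = '<|sep|>'.join(history + [question])
        samples.append('<|startoftext|>' + context + '<|go|>' + rewrite + '<|endoftext|>')
        history.append(rewrite)
    return samples

# Hypothetical two-turn conversation, for illustration only.
toy_dialog = [
    ('Who wrote Dune?', 'Who wrote Dune?'),
    ('When was it published?', 'When was Dune published?'),
]
for sample in format_dialog(toy_dialog):
    print(sample)
```

Note that the context carries the previous rewrites rather than the previous raw questions, so each later turn sees self-contained questions.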

In [None]:
dialog_id = 'Conversation_no'
train_text = format_data(train_file, dialog_id)
val_text = format_data(val_file, dialog_id)

In [None]:
count = 0
for samp in train_text:
  count+= len(samp.split())
print('The estimate of the number of tokens on average is {}'.format(count / len(train_text)))

The estimate of the number of tokens on average is 41.53488921434308


In [None]:
import time
import datetime
import math
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel,GPT2Config
from transformers import AdamW, get_linear_schedule_with_warmup
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler

#### Load tokenizer

In [None]:
PRE_TRAINED_MODEL = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>',
                                  pad_token='<|pad|>', additional_special_tokens=['<|sep|>','<|go|>','<|CON|>','<|PRED|>'])
# gpt2 = GPT2Model.from_pretrained('gpt2') # must be small since ~500 MB

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.





In [None]:
print("The max model length is {} for this model, although the actual embedding size for GPT small is 768".format(tokenizer.model_max_length))
print("The beginning of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("The padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))
print("The separator token {} has the id {}".format('<|sep|>',tokenizer.convert_tokens_to_ids('<|sep|>')))
print("The go token {} has the id {}".format('<|go|>',tokenizer.convert_tokens_to_ids('<|go|>')))
print("The segment token {} has the id {}".format('<|CON|>',tokenizer.convert_tokens_to_ids('<|CON|>')))
print("The segment token {} has the id {}".format('<|PRED|>',tokenizer.convert_tokens_to_ids('<|PRED|>')))

The max model length is 1024 for this model, although the actual embedding size for GPT small is 768
The beginning of sequence token <|startoftext|> has the id 50257
The end of sequence token <|endoftext|> has the id 50256
The padding token <|pad|> has the id 50258
The separator token <|sep|> has the id 50259
The go token <|go|> has the id 50260
The segment token <|CON|> has the id 50261
The segment token <|PRED|> has the id 50262


#### Create Class to prepare the data

In [None]:
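def create_label_mask(tensor,tokenizer):
  # Build the labels for one sequence: -100 (ignored by the loss) everywhere
  # except on the rewrite that follows <|go|>, including its <|endoftext|>.
  go_id = tokenizer.convert_tokens_to_ids('<|go|>')
  pad_id = tokenizer.convert_tokens_to_ids('<|pad|>')
  arr = tensor.numpy().copy()
  in_context = True # True until <|go|> has been seen
  for idx, ele in enumerate(arr):
    if ele == pad_id:
      arr[idx] = -100 # padding never contributes to the loss
    elif ele == go_id:
      arr[idx] = -100 # <|go|> itself is masked
      in_context = False
    elif in_context:
      arr[idx] = -100 # context tokens before <|go|> are masked
  return torch.tensor(arr)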
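
The effect of the label mask is easiest to see on a toy sequence. The sketch below works on plain lists. The helper name `mask_labels` and the small ids 11, 12, 21, 22 are made up, while the special-token ids are the ones printed by the tokenizer cell.

```python
GO_ID, PAD_ID, BOS_ID, EOS_ID = 50260, 50258, 50257, 50256  # ids printed by the tokenizer cell
IGNORE = -100  # label value that the cross-entropy loss skips

def mask_labels(input_ids):
    """Keep only the tokens after <|go|> (up to and including <|endoftext|>) as labels."""
    labels, seen_go = [], False
    for token_id in input_ids:
        if token_id == PAD_ID or not seen_go:
            labels.append(IGNORE)  # context, <|go|> itself, and padding
        else:
            labels.append(token_id)  # part of the target rewrite
        if token_id == GO_ID:
            seen_go = True
    return labels

# <|startoftext|> q q <|go|> r r <|endoftext|> <|pad|> <|pad|>
toy_ids = [BOS_ID, 11, 12, GO_ID, 21, 22, EOS_ID, PAD_ID, PAD_ID]
print(mask_labels(toy_ids))
```

Keeping `<|endoftext|>` in the labels matters: it is how the model learns when to stop generating the rewrite.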

In [None]:
def create_token_type_ids(tensor,tokenizer):
  # Tag every position with a segment id: <|CON|> up to and including <|go|>,
  # then <|PRED|> up to and including <|endoftext|>, then <|CON|> again for the padding.
  go_id = tokenizer.convert_tokens_to_ids('<|go|>')
  eos_id = tokenizer.convert_tokens_to_ids('<|endoftext|>')
  arr = tensor.numpy().copy().tolist()
  type_id = tokenizer.convert_tokens_to_ids('<|CON|>')
  for idx, ele in enumerate(arr): # left to right
    if ele == go_id:
      arr[idx] = type_id
      type_id = tokenizer.convert_tokens_to_ids('<|PRED|>')
      continue
    elif ele == eos_id:
      arr[idx] = type_id
      type_id = tokenizer.convert_tokens_to_ids('<|CON|>')
      continue
    arr[idx] = type_id
  return torch.tensor(arr)

In [None]:
class GPT2Dataset(Dataset):
  def __init__(self, txt_list, tokenizer, gpt2_type="gpt2", max_length=768): # Max length can probably be even smaller.. pad either way..
    self.tokenizer = tokenizer
    self.input_ids = []
    self.attn_masks = []
    self.label_masks = []
    self.token_type_ids = []
    for txt in txt_list:
      # reformat txt: x_data, y_data -> x_data
      # ALREADY ADDED ALL SPECIAL CHARACTERS WHILE CREATING DATASET
      # return_token_type_ids=True -> might be a way to use this for label_masking
      encodings_dict = tokenizer(txt, truncation=True, max_length=max_length, padding="max_length")
      temp = torch.tensor(encodings_dict['input_ids'])
      self.input_ids.append(temp)
      self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
      self.label_masks.append(create_label_mask(temp,tokenizer)) # to mask everything up to and including <|go|>
      #self.token_type_ids.append(create_token_type_ids(temp,tokenizer))
    
  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx],self.label_masks[idx]#,self.token_type_ids[idx]
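
As a rough picture of what `truncation=True` with `padding="max_length"` produces for each sample, here is a sketch on plain lists. The helper `pad_or_truncate` is hypothetical and assumes padding and truncation both happen on the right.

```python
PAD_ID = 50258  # id of <|pad|> as printed by the tokenizer cell

def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]  # drop anything beyond max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [PAD_ID] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask

print(pad_or_truncate([5, 6, 7], 5))
print(pad_or_truncate([1, 2, 3, 4], 2))
```

One consequence worth keeping in mind: if a sample is longer than `max_length`, the end of the rewrite (or even `<|go|>`) can be cut off, which leaves few or no unmasked label positions for that sample.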

#### Create train and val splits

In [None]:
train_dataset = GPT2Dataset(train_text, tokenizer, max_length=210)
val_dataset = GPT2Dataset(val_text, tokenizer, max_length=210)

In [None]:
count = 0
for tensor,_,_ in train_dataset: # tensor,_,_,_
  temp = tensor.numpy().tolist()
  for ele in temp:
    if ele != tokenizer.pad_token_id: # 50258 for this tokenizer
      count += 1
    else:
      break # hit the pads
print('The estimate of the number of tokens on average is {}'.format(count/len(train_dataset) ))
print('The number of samples in training set is {}'.format(len(train_dataset)))
print('The number of samples in validation set is {}'.format(len(val_dataset)))
# 63.62543896946505 with max_length = 210. If the number changes with a larger max_length, then truncation must have cut some samples.

The estimate of the number of tokens on average is 63.62543896946505
The number of samples in training set is 63501
The number of samples in validation set is 16451


#### Use DataLoader to prepare for pytorch

In [None]:
# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order. 
batch_size = 2
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

#### Time to load model and prepare training settings

In [None]:
#configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# instantiate the model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Needed since tokens were added to embedding matrix which is used at the input and output of the model.
model.resize_token_embeddings(len(tokenizer))

# Tell pytorch to run this model on the GPU.
device = torch.device("cuda")
model.cuda()

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)


In [None]:
# some parameters I cooked up that work reasonably well
epochs = 2
learning_rate = 5e-5
warmup_steps = 1e2
epsilon = 1e-8

In [None]:
# Note: AdamW here is the class from the huggingface library (as opposed to pytorch).
# Later transformers releases deprecate it in favour of torch.optim.AdamW.
optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = epsilon
                )

In [None]:
# Total number of training steps is [number of batches] x [number of epochs]. 
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = warmup_steps, 
                                            num_training_steps = total_steps)
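
The shape of this schedule can be sketched without any library. The function below is a minimal re-implementation of a linear warmup followed by a linear decay, written under the assumption that it mirrors what the scheduler computes. The name `linear_warmup_multiplier` is hypothetical, and the total of 63502 steps is derived from 63501 training samples at batch size 2 over 2 epochs.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 at the final step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr, warmup, total = 5e-5, 100, 63502  # 31751 batches per epoch x 2 epochs
for step in (0, 50, 100, 31801, 63502):
    print(step, base_lr * linear_warmup_multiplier(step, warmup, total))
```

With only 100 warmup steps out of roughly 63k, the learning rate spends almost the whole run on the decaying part of the schedule.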

In [None]:
def quick_sample(data, idx):
  temp =  data[idx].split('<|go|>')[0] + '<|go|>'
  tensor = torch.tensor(tokenizer(temp)['input_ids'])
  print('input:',temp)
  print('answer:',data[idx].split('<|go|>')[-1])
  return tensor

In [None]:
def next_token(input, model, device):
  # Greedy decoding step: append the most likely next token to `input` (a 1-D tensor of token ids).
  # To try: only allow tokens that appear in the input to be selected.
  model.eval()
  with torch.no_grad(): # no gradients are needed when sampling
    logits = model(input_ids=input.to(device))['logits'][-1] # logits at the last position
  next_id = torch.argmax(logits).item() # argmax of the logits equals argmax of the softmax
  new_input = torch.cat([input.cpu(), torch.tensor([next_id])])
  token = tokenizer.decode([next_id])
  print('predict:',token)
  return new_input, token
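
Calling `next_token` repeatedly amounts to greedy decoding. The loop can be sketched independently of the model. In the sketch below, `greedy_decode` and the stand-in scorer `toy_scores` are hypothetical, and the real loop would get its scores from GPT-2's logits instead.

```python
def greedy_decode(prompt_ids, score_fn, eos_id, max_new_tokens=25):
    """Repeatedly append the highest-scoring token until eos_id or the length cap."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = score_fn(ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Hypothetical stand-in for the model: always prefers (last id + 1) in a 5-token vocabulary.
def toy_scores(ids):
    target = (ids[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]

print(greedy_decode([0], toy_scores, eos_id=4))
```

The length cap is what keeps an undertrained model, which may never emit `<|endoftext|>`, from generating forever.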

In [None]:
def train_epoch(model, data_loader,optimizer, device, scheduler):
  
  model = model.train() # Set model to training mode
  total_train_loss = 0 # Running sum of the per-batch losses over the epoch

  for step,batch in enumerate(data_loader): # go over batches []
    # Will need to add len(title) and date here later
    input_ids = batch[0].to(device)
    attention_masks = batch[1].to(device)
    label_masks = batch[2].to(device)
    #input_type = batch[3].to(device)

    model.zero_grad() # zero gradients before processing the batch

    outputs = model(
        input_ids = input_ids,
        labels=label_masks,
        attention_mask = attention_masks,
        token_type_ids = None
    )

    loss = outputs[0]  
    batch_loss = loss.item()
    total_train_loss += batch_loss

    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(),max_norm=1.0) # protection from exploding gradient
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad() # Zero gradients for next batch

    if step%500 == 0:
      print('The current training batch loss is: {} this is step {}'.format(batch_loss, step))
      len_validate = len(val_text) - 1 
      tensor = quick_sample(val_text, random.randint(0, len_validate)) # pick a random validation sample
      i = 0
      while True:
        tensor,token = next_token(tensor, model, device)
        if token == '<|endoftext|>' or i >= 25:
          break
        i += 1
      model.train()

  # avg_batch_loss and total_batch_loss
  return total_train_loss/ len(data_loader), total_train_loss
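
When `labels` is passed to the model, the loss is a next-token cross-entropy in which positions labelled -100 are skipped. The sketch below reproduces that computation on plain lists for a single sequence. The function name `masked_next_token_loss` and the toy inputs are made up, and it assumes the mean is taken over the unmasked positions only.

```python
import math

IGNORE = -100  # label value that is skipped by the loss

def masked_next_token_loss(logits, labels):
    """logits: one list of vocabulary scores per position; labels: one id (or -100) per position."""
    total, counted = 0.0, 0
    # Position t predicts the token at position t + 1, so the labels are shifted by one.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE:
            continue  # masked positions contribute nothing
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]  # negative log-probability of the target
        counted += 1
    return total / counted if counted else float('nan')

# Uniform scores over a 4-token vocabulary; only one position is unmasked.
toy_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
toy_labels = [IGNORE, 2, IGNORE]
print(masked_next_token_loss(toy_logits, toy_labels))
```

Because only the rewrite tokens are counted, the reported loss measures how well the model writes the reformulated question, not how well it models the conversation context.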

In [None]:
def eval_model(model, data_loader, device):
  
  model = model.eval() # Set model to evaluation mode (disables dropout)
  total_val_loss = 0 # Running sum of the per-batch losses

  with torch.no_grad(): # no gradients are needed for evaluation
    for step,batch in enumerate(data_loader): # go over batches
      input_ids = batch[0].to(device)
      attention_masks = batch[1].to(device)
      label_masks = batch[2].to(device)
      # The dataset yields three tensors per sample, so there is no batch[3] to read.

      outputs = model(
          input_ids = input_ids,
          labels=label_masks,
          attention_mask = attention_masks
          )

      loss = outputs[0]
      batch_loss = loss.item()
      total_val_loss += batch_loss

      if step%100 == 0:
        print('The current validation batch loss is: {} this is step {}'.format(batch_loss, step))

  # avg_batch_loss and total_batch_loss
  return total_val_loss/ len(data_loader), total_val_loss

#### Time to fine-tune the model

In [None]:
## Training loop
import timeit as tt

history = {} # for storing training history
best_accuracy = 0 # store when we got best accuracy

history['avg_train_loss'] = []
history['total_train_loss'] = []
history['avg_val_loss'] = []
history['total_val_loss'] = []

for epoch in range(epochs):

  print("Epoch {} of {}".format(1 + epoch,epochs))
  print('----------')
  start = tt.default_timer()
  avg_train_loss, total_train_loss = train_epoch(
      model,
      train_dataloader,
      optimizer,
      device,
      scheduler
  )
  # Train loss is averaged. Train accuracy is weird since the first batches are always worse than the last.
  print('The average train loss is {} and total train loss is {}'.format(avg_train_loss,total_train_loss))
  stop = tt.default_timer()
  print('It took {} seconds to train 1 epoch.'.format(stop - start))
  
  avg_val_loss, total_val_loss = eval_model(
      model,
      validation_dataloader,
      device,
  )

  print('The average validation loss is {} and total validation loss is {}'.format(avg_val_loss, total_val_loss))
  print('\n')

  history['avg_train_loss'].append(avg_train_loss)
  history['total_train_loss'].append(total_train_loss)
  history['avg_val_loss'].append(avg_val_loss)
  history['total_val_loss'].append(total_val_loss)
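
The averaged losses stored in `history` can be turned into a perplexity, which is sometimes easier to read. This is a sketch with hypothetical loss values, not results from this notebook. Since the average here is a mean of per-batch means rather than a per-token mean, the resulting perplexity is only approximate.

```python
import math

def perplexity(avg_loss):
    """Convert an average cross-entropy (in nats) into perplexity."""
    return math.exp(avg_loss)

# Hypothetical loss values, for illustration only.
toy_history = {'avg_val_loss': [1.2, 0.9]}
print([round(perplexity(loss), 2) for loss in toy_history['avg_val_loss']])
```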

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
torch.save(model.state_dict(),'GPT2_2_epoch_modelSEG_state.bin')
tokenizer.save_pretrained('GPT2_2_epoch_tokenizerSEG_state.bin')
!cp GPT2_2_epoch_modelSEG_state.bin /content/drive/MyDrive/STAT_946_Project
!cp -r GPT2_2_epoch_tokenizerSEG_state.bin /content/drive/MyDrive/STAT_946_Project