Notebook adapted from:  
https://medium.com/askdata/train-t5-for-text-summarization-a1926f52d281  
https://colab.research.google.com/drive/14_A2kM8sOVpzwHn-0pMbfnD2htzI2Nte

# 0. Set up environment

In [1]:
import os

import datasets
import torch
import numpy as np
import pandas as pd
from sklearn import model_selection
from transformers import AutoTokenizer
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments
from tqdm.notebook import tqdm
from torch import nn

SEED = 2557
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2'

In [2]:
import transformers
transformers.__version__
torch.__version__

'1.8.1'

In [3]:
%%script false  --no-raise-error
!pip install transformers
!pip install datasets

Let's use Weights & Biases for tracking

In [4]:
import wandb
wandb.login()

%env WANDB_LOG_MODEL=true

[34m[1mwandb[0m: Currently logged in as: [33mbryanli[0m (use `wandb login --relogin` to force relogin)


env: WANDB_LOG_MODEL=true


In [5]:
DO_TRAIN = False
DO_EVAL = True

In [6]:
%cd ../glucose/

GLUCOSE_DIR = os.getcwd()
TRAIN_PATH = os.path.join(GLUCOSE_DIR, 't5_data/t5_training_data.tsv')
TEST_PATH = os.path.join(GLUCOSE_DIR, 't5_data/t5_test_data.txt')

from scripts import format_data

/mnt/nlpgridio3/data/bryanli/projects/stories/glucose


In [7]:
exp_num = '2a'
EXP_NAME = f'exp{exp_num}'
OUTPUT_DIR = f'{GLUCOSE_DIR}/outputs/{EXP_NAME}'
MODEL_DIR =  f'{OUTPUT_DIR}/model'

# Setting 1: Generation
Here, we frame the task as a generation problem.

Let's load the data and format it for our experiments. The options for `exp_num` are:  
'1' : input = dim + precontext, output = target sentence  
'2a': input = dim + precontext, output = original output (both the contextualized and the generalized rule)  
'2b': input = dim + precontext + `<mask_sent>` + postcontext, output = original output
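
As a rough illustration of the three settings, the sketch below assembles an input string from its parts. The helper `build_input` and its argument names are hypothetical: the real formatting lives in `scripts/format_data.py`, which is not shown here, and the `#{dim}: ` prefix is only inferred from the example inputs printed in this notebook.

```python
def build_input(dim, precontext, postcontext="", exp_num="2a"):
    """Assemble a model input string (hypothetical helper, format assumed)."""
    prefix = f"#{dim}: {precontext}"
    if exp_num == "2b":
        # Setting 2b marks where the hidden target sentence would sit.
        return f"{prefix} <mask_sent> {postcontext}".strip()
    # Settings 1 and 2a see only the dimension and the preceding context.
    return prefix


print(build_input(7, "The family hamster had escaped."))
```

Settings '1' and '2a' share the same input and differ only in the output column, which this sketch does not cover.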

In [8]:
df_train, df_val, ids_val = format_data.format_data(TRAIN_PATH, exp_num, split_val=True, seed=SEED)

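# Make sure the experiment's output directory exists before writing to it
os.makedirs(OUTPUT_DIR, exist_ok=True)
with open(OUTPUT_DIR + '/ids_val.txt', 'w') as f:
    f.writelines([f'{idx}\n' for idx in ids_val])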

In [20]:
ex = df_train.iloc[0]
print(f'input: {ex["input"]}')
print(f'output: {ex["output"]}')
df_train[['input', 'output']]

input: #7: The family hamster had escaped.
output: The family hamster has escaped >Causes> We feel(s) worried ** Something_A has escaped >Causes> Some People_A feel(s) worried


| | input | output |
|---|---|---|
| 0 | #7: The family hamster had escaped. | The family hamster has escaped >Causes> We fee... |
| 1 | #1: Kate threw a frisbee to her dog. The dog c... | Kate's dog caught a frisbee >Causes/Enables> K... |
| 2 | #4: Kate threw a frisbee to her dog. The dog c... | Kate possess(es) a frisbee >Enables> Kate thre... |
| 3 | #6: Kate threw a frisbee to her dog. The dog c... | Kate threw the frisbee again >Causes/Enables> ... |
| 4 | #7: Kate threw a frisbee to her dog. The dog c... | Kate threw the frisbee again >Causes> Kate fee... |
| ... | ... | ... |
| 216779 | #4: Simon invited Joey over for a playdate. | simon possess(es) pokemon cards >Enables> joey... |
| 216780 | #6: Simon invited Joey over for a playdate. | joey thought it would be funny to mess with si... |
| 216781 | #2: I was watching sports one day. | I want(s) pizza >Motivates> I pick up my phone... |
| 216782 | #4: I was watching sports one day. | He possess(es) a phone >Enables> He picks up h... |
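
Each output string packs two rules, a story-specific one and a generalized one, separated by ` ** `, and each rule has the shape `X >Relation> Y`. The sketch below splits such a string into its parts. The function names are hypothetical, and the connective pattern is an assumption based only on the relations visible above (`Causes`, `Enables`, `Causes/Enables`, `Motivates`).

```python
import re

# Matches connectives such as ">Causes>" or ">Causes/Enables>".
RELATION = re.compile(r"\s*>([A-Za-z/]+)>\s*")


def parse_rule(rule):
    """Split 'X >Relation> Y' into (antecedent, relation, consequent)."""
    parts = RELATION.split(rule.strip(), maxsplit=1)
    if len(parts) != 3:
        raise ValueError(f"no connective found in: {rule!r}")
    return tuple(parts)


def parse_output(output):
    """Split a full output string into its specific and general rules."""
    specific, _, general = output.partition(" ** ")
    return parse_rule(specific), parse_rule(general)


example = (
    "The family hamster has escaped >Causes> We feel(s) worried ** "
    "Something_A has escaped >Causes> Some People_A feel(s) worried"
)
specific, general = parse_output(example)
print(specific)
print(general)
```

A string with no ` ** ` separator or no connective raises `ValueError`, which can help flag malformed model predictions later on.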


## Set up wandb

In [13]:
if DO_TRAIN:
    WANDB_NAME = f'glucose_{EXP_NAME}'
    wandb.init(name=WANDB_NAME)

## Tokenization

In [10]:
tokenizer = AutoTokenizer.from_pretrained('t5-base')

if exp_num == '2b':
    special_tokens_dict = {'additional_special_tokens': ['<mask_sent>']}
    add_toks = tokenizer.add_special_tokens(special_tokens_dict)

In [11]:
ds_train = datasets.Dataset.from_pandas(df_train)
ds_val = datasets.Dataset.from_pandas(df_val)

In [12]:
def get_src_tgt_len(source_text, target_text):
    """Return the longest tokenized length among the sources and among the targets."""
    tokenized_source_text = tokenizer(list(source_text), truncation=False, padding=False)
    tokenized_target_text = tokenizer(list(target_text), truncation=False, padding=False)

    # default=0 keeps the original behavior for empty inputs
    max_source = max((len(ids) for ids in tokenized_source_text['input_ids']), default=0)
    max_target = max((len(ids) for ids in tokenized_target_text['input_ids']), default=0)
    return max_source, max_target

max_source, max_target = get_src_tgt_len(df_train['input'], df_train['output'])
print(max_source, max_target)

97 161


In [13]:
def encode(batch):
    inp = tokenizer(batch['input'], padding='max_length', truncation=True, max_length=max_source)
    outp = tokenizer(batch['output'], padding='max_length', truncation=True, max_length=max_target)
    inp['labels'] = outp['input_ids']
    return inp

BATCH_SIZE_ENCODE = 512

ds_train = ds_train.map(encode, batched=True, batch_size=BATCH_SIZE_ENCODE)
ds_val = ds_val.map(encode, batched=True, batch_size=BATCH_SIZE_ENCODE)

ds_train.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
ds_val.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])



In [14]:
COLS_TO_FORMAT = ['input_ids', 'labels', 'attention_mask']
ds_train.set_format(type='torch', columns=COLS_TO_FORMAT)
ds_val.set_format(type='torch', columns=COLS_TO_FORMAT)
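
The `encode` function above copies the padded target ids straight into `labels`, so padding positions also count toward the training loss. A common alternative with Hugging Face seq2seq models is to replace padding ids in the labels with -100, which the loss ignores. The sketch below shows that replacement on plain Python lists. `mask_pad_labels` is a hypothetical helper that is not part of this notebook, and the default pad id of 0 is the `pad_token_id` of the `t5-base` tokenizer.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_pad_labels(label_ids, pad_token_id=0):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Toy batch: two targets padded to length 5 (ids are made up; 1 stands for </s>).
batch_labels = [[37, 1782, 1, 0, 0], [8, 1, 0, 0, 0]]
print(mask_pad_labels(batch_labels))
```

Labels masked this way would need -100 mapped back to the pad id before being passed to `tokenizer.decode`, since -100 is not a valid token id.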

In [15]:
# verify proper encoding
print(tokenizer.decode(ds_val[0]['input_ids']))
print()
print(tokenizer.decode(ds_val[0]['labels']))

#1: sally was walking to the park. She found a small kitten in the grass. SHe took the kitten to the park with her to play.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>

Sally finds a small kitten >Causes/Enables> Sally puts the kitten into her bookbag ** Someone_A finds Something_A >Causes/Enables> Someone_A puts Something_A into Something_B</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pa

## Load pretrained model

In [20]:
if DO_TRAIN:
    model = T5ForConditionalGeneration.from_pretrained('t5-base', cache_dir='/nlp/data/bryanli/.cache')
    if exp_num == '2b':
        model.resize_token_embeddings(len(tokenizer))

    ds_train_shuffled = ds_train.shuffle(seed=SEED)
    ds_val_shuffled = ds_val.shuffle(seed=SEED)

## Finetune

In [21]:
if DO_TRAIN:
    trainer = None
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=4,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        eval_accumulation_steps=1, # Number of eval steps to keep on the GPU (the higher, the more VRAM used)
        # prediction_loss_only=True, # To compute only the loss and no other metrics, set this to True to use less RAM
        learning_rate=0.0001,
        evaluation_strategy='steps', # Run evaluation every eval_steps
        save_steps=5000, # How often to save a checkpoint
        save_total_limit=4, # Number of maximum checkpoints to save
        remove_unused_columns=True, # Removes useless columns from the dataset
        run_name=EXP_NAME, # Wandb run name
        logging_steps=5000, # How often to log loss to wandb
        eval_steps=5000, # How often to run evaluation on the val_set
        logging_first_step=False, # Whether to log also the very first training step to wandb
        load_best_model_at_end=True, # Whether to load the best model found at each evaluation.
        metric_for_best_model="loss", # Use loss to evaluate best model.
        greater_is_better=False, # Best model is the one with the lowest loss, not highest.
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=ds_train_shuffled,
        eval_dataset=ds_val_shuffled,

    )
    trainer.args._n_gpu = 3
    trainer.train()
    trainer.save_model(MODEL_DIR)
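
To get a feel for the schedule these arguments imply, the arithmetic below estimates optimizer steps and evaluation counts. It assumes 216,783 training rows (inferred from the last DataFrame index shown earlier, 216782), three GPUs each taking a batch of 8, and no gradient accumulation. The function name is illustrative only.

```python
import math


def training_schedule(num_examples, per_device_batch, n_gpu, epochs, eval_steps):
    """Estimate effective batch, steps per epoch, total steps, and evaluations."""
    effective_batch = per_device_batch * max(1, n_gpu)
    # The last partial batch is kept, so round up.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return effective_batch, steps_per_epoch, total_steps, total_steps // eval_steps


print(training_schedule(216_783, per_device_batch=8, n_gpu=3, epochs=4, eval_steps=5000))
# (24, 9033, 36132, 7)
```

Under these assumptions, evaluation and checkpointing run seven times over the four epochs, a little more often than once per epoch.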

# Evaluation

In [16]:
MODEL_DIR =  f'{OUTPUT_DIR}/model'
# MODEL_DIR = '/mnt/nlpgridio3/data/bryanli/projects/stories/glucose/outputs/exp2b/model'
model_ft = T5ForConditionalGeneration.from_pretrained(MODEL_DIR)
model_ft = model_ft.cuda()

In [17]:
ds_val.set_format(type='torch', columns=COLS_TO_FORMAT, device='cuda')

In [18]:
def generate_from_sentence(model, tokenizer, text):
    """Generate output(s) for a single input string and decode them to text."""
    inputs = tokenizer.encode(text, return_tensors='pt')
    output_sequences = model.generate(
        inputs.to(model.device),
        pad_token_id=tokenizer.eos_token_id,
        max_length=256,
    )
    return [tokenizer.decode(x, skip_special_tokens=True) for x in output_sequences]

def generate_from_dataset(model, dataset, batch_size=128):
    """Generate token-id sequences for every example, one batch at a time.

    Uses the module-level `tokenizer`; the dataset must already be on the model's device.
    """
    output_sequences_all = []
    for i in tqdm(range(0, len(dataset), batch_size)):
        batch = dataset[i:i+batch_size]
        output_sequences = model.generate(
            batch['input_ids'],
            attention_mask=batch['attention_mask'],
            pad_token_id=tokenizer.eos_token_id,
            max_length=256
        )
        output_sequences_all.extend(output_sequences)
    print(len(output_sequences_all))
    return output_sequences_all

def decode_seqs(seqs, skip):
    """Decode sequences to text; `skip` controls whether special tokens are dropped."""
    return [tokenizer.decode(x, skip_special_tokens=skip) for x in seqs]

In [21]:
print(generate_from_sentence(model_ft, tokenizer, "#1: I went for a steak dinner. I invited my roommate."))
print(generate_from_sentence(model_ft, tokenizer, "#6: I went for a steak dinner. I invited my roommate."))
preds = generate_from_dataset(model_ft, ds_val, batch_size=128)
preds_decoded = decode_seqs(preds, True)

['I invited my roommate to dinner >Causes/Enables> My roommate said it was the best steak ever ** Someone_A invited Someone_B to Something_A (that is a meal) >Causes/Enables> Someone_B said it was the best steak ever']
['My roommate said he would not go to the steak dinner >Causes/Enables> My roommate ate the steak ** Someone_A said they would not go to Somewhere_A (that is a restaurant) >Causes/Enables> Someone_A ate Something_A (that is a meal)']


24154
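
The loop in `generate_from_dataset` walks the dataset in fixed-size slices, with a shorter final slice. A minimal sketch of that slicing, using a hypothetical `batch_slices` helper, shows how many `generate` calls the 24154 validation examples need at batch size 128.

```python
def batch_slices(n_items, batch_size):
    """Yield (start, stop) index pairs covering range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


slices = list(batch_slices(24154, 128))
print(len(slices), slices[-1])
# 189 (24064, 24154)
```

The final slice holds only 90 examples, which Python slicing handles without any special casing.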


In [22]:
if exp_num == '2b':
    sources_decoded = decode_seqs(ds_val['input_ids'], False)
    sources_decoded = [x.split('</s>', 1)[0] for x in sources_decoded]
else:
    sources_decoded = decode_seqs(ds_val['input_ids'], True)
labels_decoded = decode_seqs(ds_val['labels'], True)

output = pd.DataFrame({'input': sources_decoded, 'output_true': labels_decoded, 'output_pred': preds_decoded, 'target': df_val['target']})
output.to_csv(OUTPUT_DIR + "/predictions_val.csv")
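
Once predictions are saved, one quick sanity check is how often a prediction reproduces the gold string exactly. The sketch below computes that rate separately for the specific and general halves of each output. It is not a metric used in this notebook, which only saves and prints predictions. The helper names are hypothetical and the toy strings are made up.

```python
def split_rules(output):
    """Return the (specific, general) halves of an output string."""
    specific, _, general = output.partition(" ** ")
    return specific.strip(), general.strip()


def exact_match(golds, preds):
    """Fraction of examples whose specific / general halves match exactly."""
    if not golds or len(golds) != len(preds):
        raise ValueError("golds and preds must be non-empty and of equal length")
    pairs = [(split_rules(g), split_rules(p)) for g, p in zip(golds, preds)]
    specific = sum(g[0] == p[0] for g, p in pairs) / len(pairs)
    general = sum(g[1] == p[1] for g, p in pairs) / len(pairs)
    return specific, general


# Toy data (made up): the specific half differs, the general half matches.
toy_gold = ["a >Causes> b ** Someone_A >Causes> Something_A"]
toy_pred = ["a >Causes> c ** Someone_A >Causes> Something_A"]
print(exact_match(toy_gold, toy_pred))
```

With the saved DataFrame, the call would look like `exact_match(list(output['output_true']), list(output['output_pred']))`. Exact match is strict, so a low score does not by itself mean the generations are poor.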

In [23]:
NUM_TO_PRINT = 5
for i in range(NUM_TO_PRINT):
    ex = output.iloc[i*100]
    print(f'EX {i*100}')
    print('INPUT:  ', ex['input'], '\n')
    print('GOLD:   ', ex['output_true'], '\n')
    print('PRED:   ', ex['output_pred'], '\n')
    print('TARGET: ', ex['target'])
    print('-' * 50, '\n')

EX 0
INPUT:   #1: sally was walking to the park. She found a small kitten in the grass. SHe took the kitten to the park with her to play. 

GOLD:    Sally finds a small kitten >Causes/Enables> Sally puts the kitten into her bookbag ** Someone_A finds Something_A >Causes/Enables> Someone_A puts Something_A into Something_B 

PRED:    The kitten is at the park >Causes/Enables> The kitten is happy ** Something_A (that is an animal) is at Somewhere_A (that is a park) >Causes/Enables> Something_A is happy 

TARGET:  When is was time to go home,she put the kitten in her bookbag.
-------------------------------------------------- 

EX 100
INPUT:   #6: I went to a children's birthday party. First, I had some cake. Next, I played with some of the kids. 

GOLD:    I give the birthday boy his gift >Causes/Enables> I leave ** Someone_A gives Someone_B Something_A (that is a gift) >Causes/Enables> Someone_A leaves Somewhere_A (that is Someone_B's location) 

PRED:    I ate a lot of food >Causes/Ena