<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/BB/bb_t5_simple_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Transformers Attempt

### Imports and installs

In [1]:
import numpy as np
import pandas as pd

import json

# Make longer output readable without scrolling
from pprint import pprint

# Stop warning messages from showing
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
!pip install -q datasets

In [3]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


### Prep data

In [4]:
from datasets import list_datasets, load_dataset_builder, get_dataset_config_names, load_dataset, load_from_disk

In [5]:
def summarize_dataset(dataset, config=None):
    """Print the description and feature schema of a Hugging Face dataset."""
    builder = load_dataset_builder(dataset, config)
    print(f"Description:\n {builder.info.description}")
    print("Features:")
    pprint(builder.info.features)

In [6]:
data_squad = load_from_disk("/content/drive/MyDrive/w266 NLP Final Project/Data/squad.hf")

In [7]:
data_squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Set up dataframes

In [9]:
# Build the training dataframe from a random sample of the SQuAD train split
count = 10000
data_shuffle = data_squad['train'].shuffle(seed=1962).select(range(count))
train_df = pd.DataFrame()
# train_df['answer'] = [answer['text'][0] for answer in data_shuffle['answers']]
train_df['question'] = data_shuffle['question']
train_df['description'] = data_shuffle['context']
# Rename to the columns Simple Transformers expects:
# the question is the target, the context is the input
train_df.columns = ["target_text", "input_text"]
train_df['prefix'] = 'ask_question'


train_df.head()



| | target_text | input_text | prefix |
|---|---|---|---|
| 0 | What type of businesses did Nickles want to at... | Prior to moving its headquarters to Chicago, a... | ask_question |
| 1 | To whom did Chopin reveal in letters which par... | Four boarders at his parents' apartments becam... | ask_question |
| 2 | If a species may be harmed, who holds final sa... | The question to be answered is whether a liste... | ask_question |
| 3 | What country has the dog as part of its 12 ani... | In Asian countries such as China, Korea, and J... | ask_question |
| 4 | How long did his episcopate last? | Saint Athanasius of Alexandria (/ˌæθəˈneɪʃəs/;... | ask_question |


In [10]:
# Build the evaluation dataframe from a random sample of the SQuAD validation split
count = 1000
data_shuffle = data_squad['validation'].shuffle(seed=1962).select(range(count))
eval_df = pd.DataFrame()
# eval_df['answer'] = [answer['text'][0] for answer in data_shuffle['answers']]
eval_df['question'] = data_shuffle['question']
eval_df['description'] = data_shuffle['context']
# Rename to the columns Simple Transformers expects:
# the question is the target, the context is the input
eval_df.columns = ["target_text", "input_text"]
eval_df['prefix'] = 'ask_question'


eval_df.head()



| | target_text | input_text | prefix |
|---|---|---|---|
| 0 | How many levels of galleries do the façades su... | Prince Albert appears within the main arch abo... | ask_question |
| 1 | What are the secretions commonly called? | When some species, including Bathyctena chuni,... | ask_question |
| 2 | When did Newcastle's first indoor market open? | The Grainger Market replaced an earlier market... | ask_question |
| 3 | What may be presented to Parliament in various... | Bills can be introduced to Parliament in a num... | ask_question |
| 4 | Prior to the arrival of the French, the area n... | Jacksonville is in the First Coast region of n... | ask_question |
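
For context on the `prefix` column: as described in the Simple Transformers T5 docs linked in the Sources list in this notebook, the library joins `prefix` and `input_text` with a colon and a space before tokenizing, which is why the prediction query later in this notebook starts with `ask_question: `. A minimal sketch of that composition follows; the helper name `build_t5_input` is hypothetical and not part of the library.

```python
def build_t5_input(prefix: str, input_text: str) -> str:
    """Join a task prefix and an input text as '<prefix>: <input_text>'.

    Hypothetical helper for illustration only; Simple Transformers does
    this internally during training.
    """
    return f"{prefix}: {input_text}"


# One row shaped like train_df / eval_df (text shortened for the example)
row = {
    "prefix": "ask_question",
    "input_text": "In Asian countries such as China, Korea, and Japan, "
                  "dogs are viewed as kind protectors.",
}

query = build_t5_input(row["prefix"], row["input_text"])
print(query)
```

At prediction time the prefix is not added automatically, so the query string has to be built by hand in this same format.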


## Simple Transformers

Sources: 

https://towardsdatascience.com/asking-the-right-questions-training-a-t5-transformer-model-on-a-new-task-691ebba2d72c

https://simpletransformers.ai/docs/t5-specifics/

https://github.com/ThilinaRajapakse/simpletransformers

### Install Requirements

In [11]:
!pip install -q simpletransformers

In [12]:
# !pip uninstall torch torchvision -y   # uncomment this line if having issues with cuda
!pip install torch torchvision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# This isn't needed unless using fp16 training

# !git clone https://github.com/NVIDIA/apex
# %cd apex
# !pip install -v --no-cache-dir ./

### Training

In [13]:
from simpletransformers.t5 import T5Model

In [19]:
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 128,
    "train_batch_size": 32,
    "num_train_epochs": 2,
    "save_eval_checkpoints": True,
    "save_steps": -1,
    "use_multiprocessing": False,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 1000,
    "evaluate_during_training_verbose": True,
    "fp16": False,

    "wandb_project": "Question Generation with T5",
}

model = T5Model("t5", "t5-base", args=model_args)

model.train_model(train_df, eval_data=eval_df)

  "`as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your "


| Metric | Value |
|---|---|
| Training loss | 2.2899 |
| eval_loss | 2.06984 |
| global_step | 313.0 |
| lr | 0.001 |
| train_loss | 2.41842 |


(626,
 {'global_step': [313, 626],
  'eval_loss': [2.0861186571121215, 2.1383585634231568],
  'train_loss': [1.9049079418182373, 1.504651427268982]})
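
The step counts in the returned tuple can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is set in `model_args`, so these are assumed defaults). It also shows that the 1000-step evaluation interval is never reached in a 626-step run, so the two `eval_loss` entries appear to come from the end-of-epoch evaluations.

```python
import math

train_rows = 10_000        # rows selected for train_df
train_batch_size = 32      # "train_batch_size" in model_args
num_train_epochs = 2       # "num_train_epochs" in model_args
eval_every_steps = 1000    # "evaluate_during_training_steps" in model_args

# Assumption: one optimizer step per batch (no gradient accumulation)
steps_per_epoch = math.ceil(train_rows / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 313 626

# The step-based evaluation interval is longer than the whole run
print(total_steps < eval_every_steps)  # True

# Values copied from the training output above
eval_loss = [2.0861186571121215, 2.1383585634231568]
train_loss = [1.9049079418182373, 1.504651427268982]

# Training loss fell while evaluation loss rose between epochs
print(train_loss[1] < train_loss[0], eval_loss[1] > eval_loss[0])  # True True
```

Falling training loss with rising evaluation loss in the second epoch is a common early sign of overfitting, though two data points are too few to be sure.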

The Weights & Biases run can be viewed at: https://wandb.ai/w266/Question%20Generation%20with%20T5?workspace=user-bronte-baer

*Note: not entirely sure how best to use W&B yet...*

### Save and load trained model to drive

In [20]:
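import torch


# Drive location (use this instead to persist across Colab sessions):
# path = "/content/drive/MyDrive/w266 NLP Final Project/Checkpoints/bb_t5_st"

# Local path in the Colab runtime (lost when the session ends)
path = "bb_t5_st.pt"

# Save: this pickles the entire T5Model wrapper object, not just the weights
torch.save(model, path)

# Load the pickled object back
# (recent PyTorch versions may need torch.load(path, weights_only=False))
model = torch.load(path)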
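
An alternative to pickling the wrapper is to rebuild it from a directory of saved model files, since the `T5Model` help text shown at the end of this notebook says `model_name` may be "the path to a directory containing model files". The sketch below is hedged: the helper name `reload_t5` is hypothetical, and `"outputs"` is an assumed location for the files written during training, so substitute the actual directory.

```python
def reload_t5(model_dir="outputs", use_cuda=False):
    """Rebuild a T5Model from a directory of saved model files.

    Assumption: `model_dir` is wherever training wrote its files;
    "outputs" is only a placeholder default.
    """
    # Imported lazily so this sketch can be defined without the library installed
    from simpletransformers.t5 import T5Model

    return T5Model("t5", model_dir, use_cuda=use_cuda)


# Usage (requires simpletransformers and an existing model directory):
# model = reload_t5("outputs", use_cuda=True)
```

Loading from a directory avoids tying the saved file to the exact library versions that were installed when the object was pickled.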

### Evaluation

Choose some text to test/evaluate

In [17]:
test_txt1 = "In Asian countries such as China, Korea, and Japan, dogs are viewed as kind protectors. The role of the dog in Chinese mythology includes a position as one of the twelve animals which cyclically represent years (the zodiacal dog)."

test_txt2 = "After Hurricane Katrina in 2005, Beyoncé and Rowland founded the Survivor Foundation to provide transitional housing for victims in the Houston area, to which Beyoncé contributed an initial $250,000. The foundation has since expanded to work with other charities in the city, and also provided relief following Hurricane Ike three years later."

# test_txt3 = ""


In [22]:
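# NOTE: this dict is defined but never passed to the model, so predict() below
# still runs with the training-time args. That is likely why only one question
# comes back despite "num_return_sequences": 3.
model_args = {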
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 128,
    "eval_batch_size": 128,
    "num_train_epochs": 1,
    "save_eval_checkpoints": False,
    "use_multiprocessing": False,
    "num_beams": None,
    "do_sample": True,
    "max_length": 50,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 3,
}


query = "ask_question: " + test_txt1

preds = model.predict([query])

print(preds)

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/1 [00:00<?, ?it/s]

['What is the role of the dog in Chinese mythology?']
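
To make the `top_k` and `top_p` sampling settings in the cell above more concrete, here is a simplified, standard-library sketch of top-k followed by nucleus (top-p) filtering over a toy next-token distribution. It is an illustration of the idea only, not the Hugging Face implementation (which works on logits inside `generate`); the function name and the toy probabilities are made up for the example.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy distribution over the first word of a generated question
toy_probs = {"What": 0.5, "Which": 0.3, "How": 0.15, "Why": 0.04, "Banana": 0.01}

filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.9)
print(filtered)  # only "What", "Which", and "How" survive

# Sample three candidate first tokens from the filtered distribution
rng = random.Random(0)
tokens, weights = zip(*filtered.items())
print(rng.choices(tokens, weights=weights, k=3))
```

With `do_sample` enabled, each returned sequence is drawn this way token by token, so several return sequences can differ from one another, unlike greedy or beam decoding.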


In [26]:
help(model)

Help on T5Model in module simpletransformers.t5.t5_model object:

class T5Model(builtins.object)
 |  T5Model(model_type, model_name, args=None, tokenizer=None, use_cuda=True, cuda_device=-1, **kwargs)
 |  
 |  Methods defined here:
 |  
 |  __init__(self, model_type, model_name, args=None, tokenizer=None, use_cuda=True, cuda_device=-1, **kwargs)
 |      Initializes a T5Model model.
 |      
 |      Args:
 |          model_type: The type of model (t5, mt5, byt5)
 |          model_name: The exact architecture and trained weights to use. This may be a Hugging Face Transformers compatible pre-trained model, a community model, or the path to a directory containing model files.
 |          args (optional): Default args will be used if this parameter is not provided. If provided, it should be a dict containing the args that should be changed in the default args.
 |          use_cuda (optional): Use GPU if available. Setting to False will force model to use CPU only.
 |          cuda_device (o