<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/BB/bb_t5_simple_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Transformers Attempt

### Imports and installs

In [1]:
import numpy as np
import pandas as pd

import json

# Make longer output readable without scrolling
from pprint import pprint

# Stop warning messages from showing
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
!pip install -q datasets

In [3]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


### Prep data

In [4]:
from datasets import list_datasets, load_dataset_builder, get_dataset_config_names, load_dataset, load_from_disk

In [5]:
def summarize_dataset(dataset, config=None):
    """Print the description and features of a Hugging Face dataset."""
    builder = load_dataset_builder(dataset, config)
    print(f"Description:\n {builder.info.description}")
    print("Features:")
    pprint(builder.info.features)

In [5]:
data_squad = load_from_disk("/content/drive/MyDrive/w266 NLP Final Project/Data/squad.hf")

In [None]:
data_squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Set up dataframes

In [25]:
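# Build the training dataframe.
# Shuffle with a fixed seed so the row order is random but reproducible.
# To train on a subset, uncomment `count` and the `.select(...)` call.
# count = 10000
data_shuffle = data_squad['train'].shuffle(seed=1962)
# .select(range(count))
train_df = pd.DataFrame()
# Keep only the first answer text for each question
train_df['answer'] = [answer['text'][0] for answer in data_shuffle['answers']]
train_df['question'] = data_shuffle['question']
train_df['description'] = data_shuffle['context']
# Model input: the answer followed by the passage it was drawn from
train_df['txt'] = [f"answer: {answer} context: {context}" for answer, context in zip(train_df.answer, train_df.description)]
train_df['prefix'] = 'ask_question'
# Rename to the column names Simple Transformers expects for T5
train_df = train_df[['question', 'txt', 'prefix']]
train_df.columns = ['target_text', 'input_text', 'prefix']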


train_df.head()



| | target_text | input_text | prefix |
|---|---|---|---|
| 0 | What type of businesses did Nickles want to at... | answer: biotech companies context: Prior to mo... | ask_question |
| 1 | To whom did Chopin reveal in letters which par... | answer: Tytus Woyciechowski context: Four boar... | ask_question |
| 2 | If a species may be harmed, who holds final sa... | answer: the Endangered Species Committee conte... | ask_question |
| 3 | What country has the dog as part of its 12 ani... | answer: China context: In Asian countries such... | ask_question |
| 4 | How long did his episcopate last? | answer: 45 years context: Saint Athanasius of ... | ask_question |


In [26]:
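# Build the validation dataframe.
# Shuffle with a fixed seed so the row order is random but reproducible.
# To evaluate on a subset, uncomment `count` and the `.select(...)` call.
# count = 1000
data_shuffle = data_squad['validation'].shuffle(seed=1962)
# .select(range(count))
eval_df = pd.DataFrame()
# Keep only the first answer text for each question
eval_df['answer'] = [answer['text'][0] for answer in data_shuffle['answers']]
eval_df['question'] = data_shuffle['question']
eval_df['description'] = data_shuffle['context']
# Model input: the answer followed by the passage it was drawn from
eval_df['txt'] = [f"answer: {answer} context: {context}" for answer, context in zip(eval_df.answer, eval_df.description)]
eval_df['prefix'] = 'ask_question'
# Rename to the column names Simple Transformers expects for T5
eval_df = eval_df[['question', 'txt', 'prefix']]
eval_df.columns = ['target_text', 'input_text', 'prefix']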


eval_df.head()



| | target_text | input_text | prefix |
|---|---|---|---|
| 0 | How many levels of galleries do the façades su... | answer: four context: Prince Albert appears wi... | ask_question |
| 1 | What are the secretions commonly called? | answer: ink context: When some species, includ... | ask_question |
| 2 | When did Newcastle's first indoor market open? | answer: 1835 context: The Grainger Market repl... | ask_question |
| 3 | What may be presented to Parliament in various... | answer: Bills context: Bills can be introduced... | ask_question |
| 4 | Prior to the arrival of the French, the area n... | answer: the Timucua context: Jacksonville is i... | ask_question |
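
The two dataframe cells above apply the same recipe to each split: take the first answer text, fold it together with the context into `input_text`, and use the question as `target_text`. A minimal sketch of that recipe for a single record follows. `make_qg_example` is a hypothetical helper name (not part of `datasets` or Simple Transformers), and the record is a made-up one in the SQuAD field layout.

```python
def make_qg_example(record, prefix="ask_question"):
    """Turn one SQuAD-style record into a row with prefix/input_text/target_text."""
    # Use only the first listed answer, as the dataframe cells above do.
    answer = record["answers"]["text"][0]
    return {
        "prefix": prefix,
        "input_text": f"answer: {answer} context: {record['context']}",
        "target_text": record["question"],
    }


# Made-up record in the SQuAD field layout (not taken from the dataset).
toy_record = {
    "context": "The sky is blue.",
    "question": "What colour is the sky?",
    "answers": {"text": ["blue"], "answer_start": [11]},
}

row = make_qg_example(toy_record)
# The full string the model sees once the prefix is joined on with ": ".
model_input = f"{row['prefix']}: {row['input_text']}"
print(model_input)
```

A list of such rows could be passed to `pd.DataFrame(rows)` to get the three columns used for training.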


## Simple Transformers

Sources: 

https://towardsdatascience.com/asking-the-right-questions-training-a-t5-transformer-model-on-a-new-task-691ebba2d72c

https://simpletransformers.ai/docs/t5-specifics/

https://github.com/ThilinaRajapakse/simpletransformers

### Install Requirements

In [8]:
!pip install -q simpletransformers

In [9]:
# !pip uninstall torch torchvision -y   # uncomment this line if having issues with cuda
!pip install torch torchvision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# This isn't needed unless using fp16 training

# !git clone https://github.com/NVIDIA/apex
# %cd apex
# !pip install -v --no-cache-dir ./

### Training

In [10]:
from simpletransformers.t5 import T5Model

In [None]:
# Used to try to clear cuda memory to run more data through model

# import torch

# torch.cuda.empty_cache()

In [27]:
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 128,
    "train_batch_size": 8,
    "num_train_epochs": 1,
    "save_eval_checkpoints": True,
    "save_steps": -1,
    "use_multiprocessing": False,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 10000,
    "evaluate_during_training_verbose": True,
    "fp16": False,
    # Settings to stop training when the loss stops decreasing
    # "use_early_stopping": True,
    # "early_stopping_delta": 0.01,
    # # "early_stopping_metric": "t5",
    # "early_stopping_metric_minimize": False,
    # "early_stopping_patience": 5,
    # "evaluate_during_training_steps": 1000,

    "wandb_project": "Question Generation with T5",
}

model = T5Model("t5", "google/t5-v1_1-base", args=model_args)
#google/t5-v1_1-base

model.train_model(train_df, eval_data=eval_df)


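| Metric | History (sparkline) |
|---|---|
| Training loss | █▄▄▄▄▂▄▄▁▂▂▅▁▂▅▃▃▁▂▂▁▃▃▂▂ |
| eval_loss | █▂▁▁▆▄▃ |
| global_step | ▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇████ |
| lr | ▁████████████████████████ |
| train_loss | ▆▆█▃▂▁▂ |

| Metric | Value |
|---|---|
| Training loss | 2.7164 |
| eval_loss | 2.24621 |
| global_step | 1250.0 |
| lr | 0.001 |
| train_loss | 2.7164 |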



(10950,
 {'global_step': [10000, 10950],
  'eval_loss': [2.6553694263168253, 2.570266220374836],
  'train_loss': [3.209731340408325, 2.48545503616333]})
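
The step counts in this output follow from the dataset size and batch size: 87,599 training rows at `train_batch_size` 8 give 10,950 steps, so evaluating every 10,000 steps fires once mid-epoch, and the second entry at step 10,950 coincides with the end of the epoch. A small sketch of that arithmetic is below. It assumes no gradient accumulation and that an evaluation also runs when the epoch finishes; the function names are illustrative, not Simple Transformers APIs.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps in one epoch (the last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


def eval_points(total_steps, every):
    """Steps at which evaluation runs: every `every` steps, plus the final step."""
    points = list(range(every, total_steps + 1, every))
    # Assumption: an evaluation also runs at the end of the epoch.
    if not points or points[-1] != total_steps:
        points.append(total_steps)
    return points


total = steps_per_epoch(87599, 8)
print(total)                      # 10950
print(eval_points(total, 10000))  # [10000, 10950]
```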

The Weights & Biases dashboard for this run is at: https://wandb.ai/w266/Question%20Generation%20with%20T5?workspace=user-bronte-baer

*Note: not entirely sure how best to use W&B yet...*

### Save and load trained model to drive

In [None]:
# No longer needed: the trained model can be loaded directly from the
# outputs directory written during training (see Evaluation)

# import torch


# # path = "/content/drive/MyDrive/w266 NLP Final Project/Checkpoints/bb_t5_st"

# # Specify a path
# path = "bb_t5_st.pt"

# # Save
# torch.save(model, path)

# # Load
# model = torch.load(path)

### Evaluation

Choose some text to test/evaluate

In [66]:
# Take the first 10 inputs from the validation data (eval_df is already shuffled)
val_list = eval_df['input_text'][:10]

# predict() expects the prefix to be part of the text, so prepend it manually
# (per the Simple Transformers instructions)
val_list = ['ask_question: ' + s for s in val_list]

val_list

['ask_question: answer: four context: Prince Albert appears within the main arch above the twin entrances, Queen Victoria above the frame around the arches and entrance, sculpted by Alfred Drury. These façades surround four levels of galleries. Other areas designed by Webb include the Entrance Hall and Rotunda, the East and West Halls, the areas occupied by the shop and Asian Galleries as well as the Costume Gallery. The interior makes much use of marble in the entrance hall and flanking staircases, although the galleries as originally designed were white with restrained classical detail and mouldings, very much in contrast to the elaborate decoration of the Victorian galleries, although much of this decoration was removed in the early 20th century.',
 "ask_question: answer: ink context: When some species, including Bathyctena chuni, Euplokamis stationis and Eurhamphaea vexilligera, are disturbed, they produce secretions (ink) that luminesce at much the same wavelengths as their bodies

In [67]:
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 128,
    "eval_batch_size": 128,
    "num_train_epochs": 1,
    "save_eval_checkpoints": False,
    "use_multiprocessing": False,
    "num_beams": None,
    "do_sample": True,
    "max_length": 30,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 1,
}

model = T5Model("t5", "outputs/best_model", args=model_args)

# Input list of contexts
query = val_list

preds = model.predict(query)

print(preds)


['How many levels of galleries can one find in the entrances of galleries?', 'What does the name of a magase of?', 'When was The Grainger Market opened?', 'What must be introduced when an official party is in the parliamentary?', 'How was the settlement of Fort Caroline in 1564 originally inhabited?', 'In what year did the earliest problems for anticennostod resolving points?', 'What resulted in the decline in wages and wages of worker workers?', 'How many hydraulic parts of the steam engine does each year with a minimum amount of power?', 'In what year was the renamed ABC Motion Pictures released?', 'Which document set out the foundation for European Union law?']
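
The prediction arguments above switch decoding to sampling (`do_sample` with `top_k` 50 and `top_p` 0.95), which means rerunning `predict` may produce different questions. The toy sketch below illustrates what top-k followed by top-p (nucleus) filtering does to a next-token distribution. It is not the library's implementation: real generation works on logits over the full vocabulary, and the function name and probabilities here are made up for illustration.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up distribution over the first word of a generated question.
toy_probs = {"What": 0.5, "When": 0.3, "How": 0.15, "Why": 0.04, "Zebra": 0.01}

filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.9)
print(sorted(filtered))  # ['How', 'What', 'When']
```

Sampling then draws from the surviving tokens only, which removes the unlikely tail ("Why", "Zebra" in the toy case) while keeping some variety.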


In [68]:
# Create a dataframe with the context input and prediction outputs
preds_df=pd.DataFrame()
preds_df['input_text'] = val_list
preds_df['predictions'] = preds


preds_df

| | input_text | predictions |
|---|---|---|
| 0 | ask_question: answer: four context: Prince Alb... | How many levels of galleries can one find in t... |
| 1 | ask_question: answer: ink context: When some s... | What does the name of a magase of? |
| 2 | ask_question: answer: 1835 context: The Graing... | When was The Grainger Market opened? |
| 3 | ask_question: answer: Bills context: Bills can... | What must be introduced when an official party... |
| 4 | ask_question: answer: the Timucua context: Jac... | How was the settlement of Fort Caroline in 156... |
| 5 | ask_question: answer: 1912 context: In additio... | In what year did the earliest problems for ant... |
| 6 | ask_question: answer: stagnant context: In Mar... | What resulted in the decline in wages and wage... |
| 7 | ask_question: answer: 90 context: The final ma... | How many hydraulic parts of the steam engine d... |
| 8 | ask_question: answer: 1985 context: In 1968, A... | In what year was the renamed ABC Motion Pictur... |
| 9 | ask_question: answer: the Charter of Fundament... | Which document set out the foundation for Euro... |
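
The predictions here are only inspected by eye. One rough way to put a number on them would be to compare each generated question with the original SQuAD question for the same row. The sketch below uses plain token-overlap F1 for that; this is an assumed stand-in rather than a metric used in this notebook, and `token_f1` is a hypothetical helper name.

```python
import re
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, punctuation dropped)."""
    pred_tokens = re.findall(r"[a-z0-9']+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9']+", reference.lower())
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Row 2 of the validation sample: generated question vs. the SQuAD question.
score = token_f1(
    "When was The Grainger Market opened?",
    "When did Newcastle's first indoor market open?",
)
print(round(score, 3))  # 0.308
```

Scoring all ten rows would need the reference questions, i.e. `eval_df['target_text'][:10]`, alongside `preds`. Token overlap penalises valid paraphrases, so a low score does not by itself mean a bad question.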


In [69]:
# Save predictions df as csv
filepath = "/content/drive/MyDrive/w266 NLP Final Project/Predictions CSVs/t5_simple_transformers_preds.csv"
preds_df.to_csv(filepath) 

Extras used for testing/EDA

In [13]:
test_txt1 = "In Asian countries such as China, Korea, and Japan, dogs are viewed as kind protectors. The role of the dog in Chinese mythology includes a position as one of the twelve animals which cyclically represent years (the zodiacal dog)."

test_txt2 = "After Hurricane Katrina in 2005, Beyoncé and Rowland founded the Survivor Foundation to provide transitional housing for victims in the Houston area, to which Beyoncé contributed an initial $250,000. The foundation has since expanded to work with other charities in the city, and also provided relief following Hurricane Ike three years later."

# test_txt3 = ""


In [None]:
help(model)

Help on T5Model in module simpletransformers.t5.t5_model object:

class T5Model(builtins.object)
 |  T5Model(model_type, model_name, args=None, tokenizer=None, use_cuda=True, cuda_device=-1, **kwargs)
 |  
 |  Methods defined here:
 |  
 |  __init__(self, model_type, model_name, args=None, tokenizer=None, use_cuda=True, cuda_device=-1, **kwargs)
 |      Initializes a T5Model model.
 |      
 |      Args:
 |          model_type: The type of model (t5, mt5, byt5)
 |          model_name: The exact architecture and trained weights to use. This may be a Hugging Face Transformers compatible pre-trained model, a community model, or the path to a directory containing model files.
 |          args (optional): Default args will be used if this parameter is not provided. If provided, it should be a dict containing the args that should be changed in the default args.
 |          use_cuda (optional): Use GPU if available. Setting to False will force model to use CPU only.
 |          cuda_device (o