<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/inference/awesome_T5_pt_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Generate Predictions From An Awesome Validation Dataset

This notebook assumes a T5 PyTorch model.

Setting the constants in the next cell should be all that is needed to run a model against a validation set.

In [2]:
# Set these constants for each model and validation dataset combination

model_name = "T5_base_pt_long.quac"
validation_dataset_name = "triviaqa"

save_predictions = True
save_mode = 'w' # w for write, a for append

max_length = 1024 # 1024 for long model and 512 otherwise
batch_size = 50 # 150 is the norm, but dial back when needed

start_sample = 0
end_sample = 5000
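
The `max_length` rule in the comment above (1024 for a long model, 512 otherwise) could be derived from the model name instead of being set by hand. The sketch below is only an illustration: `default_max_length` is a hypothetical helper, the example name `T5_base_pt.squad` is made up, and it assumes that long-input models always carry `long` as a separate token in their name, as `T5_base_pt_long.quac` does.

```python
def default_max_length(model_name):
    """Return 1024 for long-input models and 512 otherwise."""
    # Assumption: long-input models carry "long" as a separate token in
    # their name, as in "T5_base_pt_long.quac".
    tokens = model_name.replace(".", "_").split("_")
    return 1024 if "long" in tokens else 512


print(default_max_length("T5_base_pt_long.quac"))  # 1024
print(default_max_length("T5_base_pt.squad"))      # 512 (hypothetical name)
```

If the naming convention differs for some models, setting `max_length` explicitly as in the cell above remains the safer choice.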

### Generate Predictions

In [3]:
!pip install -q transformers

In [4]:
!pip install -q sentencepiece

[?25l[K     |▎                               | 10 kB 30.6 MB/s eta 0:00:01[K     |▌                               | 20 kB 6.0 MB/s eta 0:00:01[K     |▊                               | 30 kB 8.6 MB/s eta 0:00:01[K     |█                               | 40 kB 4.0 MB/s eta 0:00:01[K     |█▎                              | 51 kB 4.1 MB/s eta 0:00:01[K     |█▌                              | 61 kB 4.9 MB/s eta 0:00:01[K     |█▉                              | 71 kB 5.1 MB/s eta 0:00:01[K     |██                              | 81 kB 5.2 MB/s eta 0:00:01[K     |██▎                             | 92 kB 5.8 MB/s eta 0:00:01[K     |██▋                             | 102 kB 4.9 MB/s eta 0:00:01[K     |██▉                             | 112 kB 4.9 MB/s eta 0:00:01[K     |███                             | 122 kB 4.9 MB/s eta 0:00:01[K     |███▍                            | 133 kB 4.9 MB/s eta 0:00:01[K     |███▋                            | 143 kB 4.9 MB/s eta 0:00:01[K    

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import os
import numpy as np
import pandas as pd

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from google.colab import data_table
data_table.enable_dataframe_formatter()

In [7]:
# Some important file locations and constants

project_root = "/content/drive/MyDrive/w266 NLP Final Project/"
dataset_root = project_root + "Data/"
model_root = project_root + "Models/"
prediction_folder = project_root + "Predictions/"

tokenizer = "google/t5-v1_1-base"

model_folder = model_root + model_name

validation_data_file = f"{dataset_root}squad.hf/valid_pairs.csv"
if validation_dataset_name != "squad":
  validation_data_file = f"{dataset_root}{validation_dataset_name}/valid_pairs.csv"

prediction_file = f"{prediction_folder}predictions.{model_name}.{validation_dataset_name}.csv"
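
The path logic in the cell above can also be written as one function, which makes the `squad` special case easier to see and to check. This is a minimal sketch, not the notebook's own code: `build_paths` is a hypothetical name, and the folder layout is taken from the constants above.

```python
import posixpath  # Colab and Drive paths are POSIX-style


def build_paths(project_root, model_name, validation_dataset_name):
    """Return (model_folder, validation_data_file, prediction_file)."""
    # SQuAD lives under "squad.hf"; every other dataset uses its own name.
    if validation_dataset_name == "squad":
        dataset_dir = "squad.hf"
    else:
        dataset_dir = validation_dataset_name
    model_folder = posixpath.join(project_root, "Models", model_name)
    validation_data_file = posixpath.join(
        project_root, "Data", dataset_dir, "valid_pairs.csv"
    )
    prediction_file = posixpath.join(
        project_root,
        "Predictions",
        f"predictions.{model_name}.{validation_dataset_name}.csv",
    )
    return model_folder, validation_data_file, prediction_file


paths = build_paths(
    "/content/drive/MyDrive/w266 NLP Final Project/",
    "T5_base_pt_long.quac",
    "triviaqa",
)
for path in paths:
    print(path)
```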

In [8]:
validation_df = pd.read_csv(validation_data_file)
validation_df[['orig', 'target']][:2]

Unnamed: 0,orig,target
0,generate question: answer: one context: Goliat...,"When David killed Goliath, how many of his fiv..."
1,generate question: answer: Apaches context: Ge...,Of which tribe of Red Indians was Geronimo a c...


In [9]:
validation_df.shape[0]

9835

In [10]:
# Download the tokenizer, load the saved model from Drive, and move the model to the GPU

t5_tokenizer = T5Tokenizer.from_pretrained(tokenizer)
t5_model = T5ForConditionalGeneration.from_pretrained(model_folder)
t5_model.to(torch.device('cuda:0'))
pass  # keeps the cell from echoing the full model repr

In [13]:
predictions = []

if end_sample is None:
  end_sample = validation_df.shape[0]

print(f"Generating predictions from {start_sample} to {end_sample}:")
for start in range(start_sample, end_sample, batch_size):
  to = min(end_sample, start + batch_size)
  inputs = t5_tokenizer(validation_df['orig'][start:to].to_list(), return_tensors='pt', max_length=max_length, truncation=True, padding=True)
  # Pass the attention mask explicitly rather than relying on generate() to infer it from the padding
  output_ids = t5_model.generate(inputs['input_ids'].cuda(), attention_mask=inputs['attention_mask'].cuda(), max_length=max_length)
  prediction_batch = t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True)
  predictions.extend(prediction_batch)
  print(f"{to} ", end="")
  if to % 1000 == 0: print()
print("Predictions generated.")

Generating predictions from 0 to 5000:
50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 
1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 
2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 
3050 3100 3150 3200 3250 3300 3350 3400 3450 3500 3550 3600 3650 3700 3750 3800 3850 3900 3950 4000 
4050 4100 4150 4200 4250 4300 4350 4400 4450 4500 4550 4600 4650 4700 4750 4800 4850 4900 4950 5000 
Predictions generated.
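
The batching arithmetic in the loop above can be checked on its own, without a model or GPU. The sketch below isolates it in a hypothetical generator, `batch_bounds`, and shows that the final batch is shortened when the sample range is not a multiple of `batch_size`.

```python
def batch_bounds(start_sample, end_sample, batch_size):
    """Yield (start, stop) index pairs covering [start_sample, end_sample)."""
    for start in range(start_sample, end_sample, batch_size):
        # The last batch stops at end_sample instead of overshooting it.
        yield start, min(end_sample, start + batch_size)


bounds = list(batch_bounds(0, 120, 50))
print(bounds)  # [(0, 50), (50, 100), (100, 120)]
```

With the settings used in this run (0 to 5000 in steps of 50) every batch is full, which matches the progress output above.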


In [15]:
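df = pd.DataFrame()
prompts = validation_df['orig'][start_sample:end_sample]
# Prompts look like "generate question: answer: <answer> context: <context>"
df['context'] = [prompt.split('context: ')[1] for prompt in prompts]
df['answer'] = [prompt.split('context: ')[0][len('generate question: answer: '):].strip() for prompt in prompts]
# Slice the targets to the same range; assigning the full Series would align on index and mismatch rows when start_sample > 0
df['target'] = validation_df['target'][start_sample:end_sample].to_list()
df['prediction'] = predictions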
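
The answer and context are recovered by splitting each prompt, which has the form `generate question: answer: <answer> context: <context>`. A slightly more defensive way to do this is sketched below. `split_prompt` is a hypothetical helper, and the example prompt is shortened for illustration. It uses `str.partition`, so a context that happens to contain the text `context: ` again is kept whole rather than cut at the second occurrence.

```python
PREFIX = "generate question: answer: "
SEPARATOR = "context: "


def split_prompt(prompt):
    """Split a prompt into (answer, context), both stripped of whitespace."""
    if not prompt.startswith(PREFIX):
        raise ValueError(f"prompt does not start with {PREFIX!r}")
    body = prompt[len(PREFIX):]
    # partition splits on the first separator only and keeps the remainder.
    answer, sep, context = body.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"prompt has no {SEPARATOR!r} section")
    return answer.strip(), context.strip()


answer, context = split_prompt(
    "generate question: answer: Apaches context: Geronimo was a leader."
)
print(answer)   # Apaches
print(context)  # Geronimo was a leader.
```

Raising on a malformed prompt surfaces bad rows early instead of silently writing a shifted or empty answer to the predictions file.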

In [16]:
df[:10]

Unnamed: 0,context,answer,target,prediction
0,"Goliath ( ; ; Arabic : جالوت , Ǧālūt ( Qur'ani...",one,"When David killed Goliath, how many of his fiv...",What was his first battle?
1,"Geronimo ( `` the one who yawns '' ; June 16 ,...",Apaches,Of which tribe of Red Indians was Geronimo a c...,What was his first battle?
2,"In Jewish eschatology the term mashiach , or `...",Elijah,"According to Jewish tradition, whose chair is ...",What was his first religion?
3,"Haiti ( ; ; ) , officially the Republic of Hai...",La Española,What island is shared by Haiti and the Dominic...,What was his first political he did?
4,Hyposmia is a reduced ability to smell and to ...,smell,"In humans, the medical condition Hyposmia affe...",What was his first job?
5,The UK Singles Chart is one of many music char...,Love Is All Around,Which hit for 'Wet Wet Wet' was the biggest-se...,What was the biggest hit of the album?
6,An assembly line is a manufacturing process ( ...,Henry Ford,What American industrialist is credited as the...,What was the most interesting aspects about th...
7,"Bockscar , sometimes called Bock 's Car , is t...",Nagasaki,What became the last city on earth to experien...,What was the military?
8,Beetles are a group of insects that form the o...,Coleopteran,What is the more common name for the order Col...,What was the name of the species?
9,"George Washington ( Contemporary records , whi...",Gregorian calendar,"When introduced into Britain in 1752, what cau...",What was his first political position?


In [17]:
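if save_predictions:
  # When appending to an existing file, skip the header so it is not repeated mid-file
  write_header = not (save_mode == 'a' and os.path.exists(prediction_file))
  df.to_csv(prediction_file, mode=save_mode, header=write_header)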