<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/rr/sandbox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sandbox -- First Attempts with T5

## Downloading Datasets
Download the datasets with Hugging Face's `datasets` library.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import json

from pprint import pprint

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!pip install -q sentencepiece

In [7]:
!pip install -q transformers

In [4]:
!pip install -q datasets

In [5]:
from datasets import list_datasets, load_dataset_builder, get_dataset_config_names, load_dataset, load_from_disk

In [8]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
model = TFT5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-base")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at google/t5-v1_1-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.



In [9]:
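def summarize_dataset(dataset, config=None):
    """Print a dataset's description and features without downloading the data."""
    # load_dataset_builder fetches only the dataset's metadata
    builder = load_dataset_builder(dataset, config)
    pprint(f"Description:\n {builder.info.description}")
    print("Features:")
    pprint(builder.info.features)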

In [10]:
def word_count(string):
    # Count single-space-separated tokens after trimming the ends
    return len(string.strip().split(" "))
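
One caveat about `word_count`: splitting on a single space treats consecutive spaces as empty "words" and does not split on newlines or tabs. The sketch below compares that behaviour with `str.split()` called with no argument. The function names are hypothetical and used only for this comparison; the notebook itself keeps the single-space version.

```python
def word_count_single_space(text):
    # Same logic as word_count above: split on exactly one space
    return len(text.strip().split(" "))


def word_count_any_whitespace(text):
    # str.split() with no argument splits on runs of any whitespace
    return len(text.split())


double_space = "Atop the  Main Building"   # two spaces after "the"
with_newline = "gold\ndome"                # newline instead of a space

for text in (double_space, with_newline, ""):
    print(repr(text),
          word_count_single_space(text),
          word_count_any_whitespace(text))
```

For clean single-spaced text the two counts agree, so the maxima reported later in the notebook are probably close either way, but the whitespace-agnostic version is the safer default for raw text.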

### SQuAD

In [11]:
summarize_dataset("squad")

('Description:\n'
 ' Stanford Question Answering Dataset (SQuAD) is a reading comprehension '
 'dataset, consisting of questions posed by crowdworkers on a set of Wikipedia '
 'articles, where the answer to every question is a segment of text, or span, '
 'from the corresponding reading passage, or the question might be '
 'unanswerable.\n')
Features:
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}


In [12]:
# SQuAD is quick to download from Hugging Face.
# Use the line below if you aren't loading the data from the shared
# Google Drive folder.

# data_squad = load_dataset("squad")

# The following code assumes you have added a link to the shared
# w266 NLP Final Project folder in your Google Drive folder.
# Loading the data from there is faster.

data_squad = load_from_disk("/content/drive/MyDrive/w266 NLP Final Project/Data/squad.hf")

In [13]:
type(data_squad)

datasets.dataset_dict.DatasetDict

In [14]:
# data_squad.save_to_disk("/content/drive/MyDrive/w266 NLP Final Project/Data/squad.hf")

## Getting Familiar

### SQuAD

In [15]:
data_squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [16]:
data_squad['train'].info.features

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}

In [17]:
# Look at first example
pprint(data_squad['train'][0])

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the '
            "Main Building's gold dome is a golden statue of the Virgin Mary. "
            'Immediately in front of the Main Building and facing it, is a '
            'copper statue of Christ with arms upraised with the legend '
            '"Venite Ad Me Omnes". Next to the Main Building is the Basilica '
            'of the Sacred Heart. Immediately behind the basilica is the '
            'Grotto, a Marian place of prayer and reflection. It is a replica '
            'of the grotto at Lourdes, France where the Virgin Mary reputedly '
            'appeared to Saint Bernadette Soubirous in 1858. At the end of the '
            'main drive (and in a direct line that connects through 3 statues '
            'and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did t

In [20]:
# Shuffle the dataset and take a handful of examples

count = 2000
sample = data_squad['train'].shuffle(seed=1962).select(range(count))
# sample = data_squad['train']
df = pd.DataFrame()
# Keep only the first reference answer for each question
df['answer'] = [answer['text'][0] for answer in sample['answers']]
df['context'] = sample['context']
df['question'] = sample['question']



In [21]:
df

| | answer | context | question |
|---|---|---|---|
| 0 | biotech companies | Prior to moving its headquarters to Chicago, a... | What type of businesses did Nickles want to at... |
| 1 | Tytus Woyciechowski | Four boarders at his parents' apartments becam... | To whom did Chopin reveal in letters which par... |
| 2 | the Endangered Species Committee | The question to be answered is whether a liste... | If a species may be harmed, who holds final sa... |
| 3 | China | In Asian countries such as China, Korea, and J... | What country has the dog as part of its 12 ani... |
| 4 | 45 years | Saint Athanasius of Alexandria (/ˌæθəˈneɪʃəs/;... | How long did his episcopate last? |
| ... | ... | ... | ... |
| 1995 | according to the type of aircraft they carry a... | There is no single definition of an "aircraft ... | How may aircraft carriers be classified? |
| 1996 | mistakes/defects | In its metaphysics, Nyāya school is closer to ... | What does Nyaya say causes human suffering? |
| 1997 | Mexican | President Franklin D. Roosevelt promoted a "go... | People of what descent were classified as whit... |
| 1998 | Michael Dukakis | After receiving his J.D. from Boston College L... | Who was Kerry an Lt. Gov. for? |


In [22]:
max([word_count(x) for x in df.context])

402

In [23]:
model.summary()

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (TFSharedEmbeddings)  multiple                 24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  84954240  
                                                                 
 decoder (TFT5MainLayer)     multiple                  113275008 
                                                                 
 lm_head (Dense)             multiple                  24674304  
                                                                 
Total params: 247,577,856
Trainable params: 247,577,856
Non-trainable params: 0
_________________________________________________________________


### Create a list of input strings and a list of target strings


In [24]:
input_strings = [f"gq answer: {answer} context: {context}"
                 for answer, context in zip(df.answer, df.context)]

target_strings = df.question.to_list()

In [25]:
input_strings[1]

'gq answer: Tytus Woyciechowski context: Four boarders at his parents\' apartments became Chopin\'s intimates: Tytus Woyciechowski, Jan Nepomucen Białobłocki, Jan Matuszyński and Julian Fontana; the latter two would become part of his Paris milieu. He was friendly with members of Warsaw\'s young artistic and intellectual world, including Fontana, Józef Bohdan Zaleski and Stefan Witwicki. He was also attracted to the singing student Konstancja Gładkowska. In letters to Woyciechowski, he indicated which of his works, and even which of their passages, were influenced by his fascination with her; his letter of 15 May 1830 revealed that the slow movement (Larghetto) of his Piano Concerto No. 1 (in E minor) was secretly dedicated to her – "It should be like dreaming in beautiful springtime – by moonlight." His final Conservatory report (July 1829) read: "Chopin F., third-year student, exceptional talent, musical genius."'

In [26]:
max([word_count(x) for x in input_strings])

408

In [27]:
target_strings[1]

'To whom did Chopin reveal in letters which parts of his work were about the singing student he was infatuated with?'
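
The input/target pairing above can also be written as a small helper, which makes the prompt format easier to reuse at inference time. This is only a sketch: `make_pair` is a hypothetical name, and the example values are taken from the first SQuAD training record shown earlier (the context is shortened to one sentence, and the question is left empty because it is truncated in the output above).

```python
def make_pair(answer, context, question):
    """Build one (input, target) pair in the notebook's prompt format."""
    source = f"gq answer: {answer} context: {context}"
    return source, question


answer = "Saint Bernadette Soubirous"
context = ("It is a replica of the grotto at Lourdes, France where the "
           "Virgin Mary reputedly appeared to Saint Bernadette Soubirous "
           "in 1858.")

# The question text is a placeholder here, not real data
source, target = make_pair(answer, context, question="")
print(source)
```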

In [28]:
max_source_length = 1024
max_target_length = 64

In [29]:
input_encoding = tokenizer(input_strings, padding="longest", max_length=max_source_length, truncation=True, return_tensors="tf")
input_ids, attention_mask = input_encoding.input_ids, input_encoding.attention_mask

In [30]:
# padding="longest" pads every row to the same length, so this is the padded width
max(len(x) for x in input_ids)

647
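
Every row of `input_ids` has the same length because `padding="longest"` pads each sequence up to the longest one in the batch, after `truncation=True` has cut anything beyond `max_length`. The plain-Python sketch below imitates that behaviour on lists of token IDs. It is a toy stand-in rather than the tokenizer's implementation: `pad_to_longest` is a hypothetical helper, the token IDs are arbitrary, a pad ID of 0 is assumed, and special-token handling (such as keeping the end-of-sequence token when truncating) is ignored.

```python
def pad_to_longest(sequences, pad_id=0, max_length=None):
    """Truncate to max_length, then right-pad every row to the longest row."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask


ids, mask = pad_to_longest([[363, 686, 13], [58], [410, 7486]], max_length=1024)
print(ids)
print(mask)
```

Since no row here reaches `max_length`, the padded width is set by the longest row, which mirrors why the notebook reports 647 rather than 1024.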

In [31]:
target_encoding = tokenizer(target_strings, padding="longest", max_length=max_target_length, truncation=True, return_tensors="tf")

In [32]:
target_ids = target_encoding.input_ids

In [33]:
# The loss ignores label positions set to -100, so we need to substitute -100
# for the tokenizer's pad token ID in the target labels.
# I couldn't figure out how to do that in TensorFlow, so I convert to a
# NumPy array, make the substitution, and then convert back to a tensor.

target_ids = target_ids.numpy()
target_ids[target_ids == tokenizer.pad_token_id] = -100
target_ids = tf.convert_to_tensor(target_ids)
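
The same substitution can probably be done without leaving TensorFlow by using `tf.where`, which picks values element-wise based on a boolean mask. The sketch below assumes a pad token ID of 0 and uses a small hand-written label tensor; `mask_padding` is a hypothetical helper name, and in the notebook `tokenizer.pad_token_id` would be passed instead of the constant.

```python
import tensorflow as tf

PAD_TOKEN_ID = 0     # assumption: use tokenizer.pad_token_id in the notebook
IGNORE_INDEX = -100  # label value that the loss skips


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding positions in label_ids with IGNORE_INDEX."""
    ignore = tf.constant(IGNORE_INDEX, dtype=label_ids.dtype)
    is_pad = tf.equal(label_ids, pad_token_id)
    # Where is_pad is True take IGNORE_INDEX, otherwise keep the original ID
    return tf.where(is_pad, ignore, label_ids)


labels = tf.constant([[363, 686, 1, 0, 0],
                      [58, 1, 0, 0, 0]], dtype=tf.int32)
masked = mask_padding(labels)
print(masked.numpy())
```

Unlike the NumPy version, this does not modify an array in place, so it should also work inside a `tf.data` pipeline or a `tf.function`.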

In [34]:
max(len(x) for x in target_ids)

37

In [35]:
target_ids[0]

<tf.Tensor: shape=(37,), dtype=int32, numpy=
array([ 363,  686,   13, 1623,  410, 7486,  965,  241,   12, 5521,   12,
       8854,   58,    1, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100], dtype=int32)>

In [None]:
outputs = model(input_ids=input_ids[:32], attention_mask=attention_mask[:32], labels=target_ids[:32])

In [None]:
loss = outputs.loss
loss

<tf.Tensor: shape=(1,), dtype=float32, numpy=array([11.030337], dtype=float32)>
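
For a sense of scale, a model that spread its probability uniformly over the vocabulary would have a per-token cross-entropy of ln(V), where V is the vocabulary size. The sketch below estimates V from the `shared` embedding parameter count in the model summary. It assumes a hidden size of 768 for this checkpoint, which the notebook does not state, and it assumes the reported loss is a mean cross-entropy in nats.

```python
import math

shared_embedding_params = 24_674_304  # "shared" row of model.summary()
assumed_d_model = 768                 # assumption: hidden size of the base model

# The embedding matrix has shape (vocab_size, d_model)
vocab_size, remainder = divmod(shared_embedding_params, assumed_d_model)
uniform_loss = math.log(vocab_size)

print(f"vocab size: {vocab_size} (remainder {remainder})")
print(f"uniform-guess loss: {uniform_loss:.3f}")
```

Under those assumptions, the initial loss of about 11.03 sits slightly above the uniform baseline, which fits a checkpoint that has not yet been fine-tuned on this prompt format.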

In [None]:
model.save_pretrained("test")

In [None]:
outputs = model.generate(input_ids[:10], max_length=max_target_length)

In [None]:
len(outputs)

10

In [None]:
[tokenizer.decode(x, skip_special_tokens=True) for x in outputs]

['.com..com..com..com..com..com..com..com\'s "Best Cities for Business and Careers." in 2006..com..com..com..com..',
 '. gq answer: gq answer: gq answer: gq answer: gq answer: gq answer: gq answer: gq gq answer: gq gq answer: gq answer: g',
 'gq gq answer: gq answer: gq answer: gq answer: gq answer: gq answer: gq answer: gq gq else. and',
 '. gq question: Dogs are protectors. gq answer: Dogs are protectors. gq answer: Dogs are protectors. gq answer: Dog. gq answer: Dog.. Dog',
 'a gq answer: 45 years gq answer: 45 years gq answer: 45 years gq answer: 45 years gq answer: 45 years gq gq gq',
 ". gq question: Cold War, First Gulf War, Kosovo War context: Canada's Cold War, First Gulf War, Kosovo War, and Kosovo War.",
 'gq answer: Buddha gq answer: Buddha gq answer: Buddha answer: Buddha answer: Buddha answer: Buddha answer: Buddha answer: Buddha context: Buddha  answer: Buddha: Buddha context: Buddha: Buddha: Answer',
 ', a Macintosh is the most popular personal computer in the world.. a

In [None]:
model = TFT5ForConditionalGeneration.from_pretrained("test")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at test.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [38]:
model = TFT5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
# NOTE: this loop only runs forward passes. No gradients are applied, so the
# weights never change and the sample generations below stay the same.
for index in range(32):
    start = index * 32
    end = start + 32
    outputs = model(input_ids=input_ids[start:end],
                    attention_mask=attention_mask[start:end],
                    labels=target_ids[start:end])
    model.save_pretrained("test")
    print(f"{start} to {end - 1} loss: {outputs.loss}")
    # Generate from three examples outside the 1,024 rows scored above
    outputs = model.generate(input_ids[1024:1027], max_length=max_target_length)
    pprint([tokenizer.decode(x, skip_special_tokens=True) for x in outputs])


0 to 31 loss: [11.030336]
['. gq answer: hunt in the dark. gq answer: hunt in the dark. gq answer: hunt '
 'in the dark. gq answer: hunt. gq. gq... manly-',
 ', a Vietnam War veteran, threw his decorations over the fence. gq answer:, a '
 'Vietnam War veteran, threw his decorations over the fence. gq answer: gq, a '
 'Vietnam War veteran, a Vietnam War veteran',
 '. gq answer: General Terrazas and General Terrazas.. Answer: General '
 'Terrazas and General Terrazas. Answer: General Terrazas and. Answer:. '
 'Answer:. Answer: Yes.']
32 to 63 loss: [9.447608]
['. gq answer: hunt in the dark. gq answer: hunt in the dark. gq answer: hunt '
 'in the dark. gq answer: hunt. gq. gq... manly-',
 ', a Vietnam War veteran, threw his decorations over the fence. gq answer:, a '
 'Vietnam War veteran, threw his decorations over the fence. gq answer: gq, a '
 'Vietnam War veteran, a Vietnam War veteran',
 '. gq answer: General Terrazas and General Terrazas.. Answer: General '
 'Terrazas and General T