
# Sandbox -- First Attempts with T5

## Downloading Datasets
The datasets are downloaded with Hugging Face's `datasets` library.

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import json

from pprint import pprint

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!pip install -q sentencepiece

[?25l[K     |▎                               | 10 kB 15.9 MB/s eta 0:00:01[K     |▌                               | 20 kB 3.9 MB/s eta 0:00:01[K     |▊                               | 30 kB 5.4 MB/s eta 0:00:01[K     |█                               | 40 kB 6.9 MB/s eta 0:00:01[K     |█▎                              | 51 kB 6.3 MB/s eta 0:00:01[K     |█▌                              | 61 kB 7.4 MB/s eta 0:00:01[K     |█▉                              | 71 kB 7.9 MB/s eta 0:00:01[K     |██                              | 81 kB 6.9 MB/s eta 0:00:01[K     |██▎                             | 92 kB 7.6 MB/s eta 0:00:01[K     |██▋                             | 102 kB 8.3 MB/s eta 0:00:01[K     |██▉                             | 112 kB 8.3 MB/s eta 0:00:01[K     |███                             | 122 kB 8.3 MB/s eta 0:00:01[K     |███▍                            | 133 kB 8.3 MB/s eta 0:00:01[K     |███▋                            | 143 kB 8.3 MB/s eta 0:00:01[K    

In [4]:
!pip install -q transformers


In [5]:
!pip install -q datasets


In [6]:
from datasets import list_datasets, load_dataset_builder, get_dataset_config_names, load_dataset, load_from_disk

In [90]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
model = TFT5ForConditionalGeneration.from_pretrained("google/t5-v1_1-small")
tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-small")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at google/t5-v1_1-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [8]:
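def summarize_dataset(dataset, config=None):
    """Print the description and feature schema of a Hugging Face dataset."""
    # The builder exposes dataset metadata without downloading the data itself.
    builder = load_dataset_builder(dataset, config)
    pprint(f"Description:\n {builder.info.description}")
    print("Features:")
    pprint(builder.info.features)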

### SQuAD

In [9]:
print(get_dataset_config_names("squad"))

['plain_text']


In [10]:
summarize_dataset("squad")

('Description:\n'
 ' Stanford Question Answering Dataset (SQuAD) is a reading comprehension '
 'dataset, consisting of questions posed by crowdworkers on a set of Wikipedia '
 'articles, where the answer to every question is a segment of text, or span, '
 'from the corresponding reading passage, or the question might be '
 'unanswerable.\n')
Features:
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}


In [11]:
# SQuAD is quick to download from Hugging Face.
# Use the line below if you aren't accessing the data from the shared
# Google Drive folder.

# data_squad = load_dataset("squad")

# The following code assumes you have added a link to the shared
# "w266 NLP Final Project" folder in your Google Drive.
# Loading the data from there is faster.

data_squad = load_from_disk("/content/drive/MyDrive/w266 NLP Final Project/Data/squad.hf")

In [12]:
type(data_squad)

datasets.dataset_dict.DatasetDict

In [13]:
# data_squad.save_to_disk("/content/drive/MyDrive/w266 NLP Final Project/Data/squad.hf")

## Getting Familiar

### SQuAD

In [14]:
data_squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [15]:
data_squad['train'].info.features

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}

In [16]:
# Look at first example
pprint(data_squad['train'][0])

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the '
            "Main Building's gold dome is a golden statue of the Virgin Mary. "
            'Immediately in front of the Main Building and facing it, is a '
            'copper statue of Christ with arms upraised with the legend '
            '"Venite Ad Me Omnes". Next to the Main Building is the Basilica '
            'of the Sacred Heart. Immediately behind the basilica is the '
            'Grotto, a Marian place of prayer and reflection. It is a replica '
            'of the grotto at Lourdes, France where the Virgin Mary reputedly '
            'appeared to Saint Bernadette Soubirous in 1858. At the end of the '
            'main drive (and in a direct line that connects through 3 statues '
            'and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did t

In [17]:
# Shuffle the dataset and take a handful of examples

count = 25
sample = data_squad['train'].shuffle(seed=1962).select(range(count))
df = pd.DataFrame()
df['answer'] = [answer['text'][0] for answer in sample['answers']]
df['context'] = sample['context']
df['question'] = sample['question']
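
The comprehension above indexes `answer['text'][0]`, which assumes every example carries at least one answer text. An empty list would raise an `IndexError`. Below is a minimal sketch of a guarded version on hand-written records shaped like SQuAD rows. The records and the helper name `first_answer` are hypothetical and are not part of the notebook or the `datasets` library.

```python
# Hypothetical records shaped like SQuAD rows; not taken from the dataset.
records = [
    {"question": "Q1?", "context": "C1",
     "answers": {"text": ["a1"], "answer_start": [0]}},
    {"question": "Q2?", "context": "C2",
     "answers": {"text": ["a2", "a2 alt"], "answer_start": [3, 3]}},
    {"question": "Q3?", "context": "C3",
     "answers": {"text": [], "answer_start": []}},
]


def first_answer(answers, default=""):
    """Return the first answer text, or `default` when the list is empty."""
    texts = answers.get("text", [])
    return texts[0] if texts else default


# One flat row per example, mirroring the answer/context/question columns.
rows = [
    {
        "answer": first_answer(r["answers"]),
        "context": r["context"],
        "question": r["question"],
    }
    for r in records
]
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)` to get the same three columns as `df`.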



In [18]:
df

|   | answer | context | question |
|---|--------|---------|----------|
| 0 | biotech companies | Prior to moving its headquarters to Chicago, a... | What type of businesses did Nickles want to at... |
| 1 | Tytus Woyciechowski | Four boarders at his parents' apartments becam... | To whom did Chopin reveal in letters which par... |
| 2 | the Endangered Species Committee | The question to be answered is whether a liste... | If a species may be harmed, who holds final sa... |
| 3 | China | In Asian countries such as China, Korea, and J... | What country has the dog as part of its 12 ani... |
| 4 | 45 years | Saint Athanasius of Alexandria (/ˌæθəˈneɪʃəs/;... | How long did his episcopate last? |
| 5 | Cold War, First Gulf War, Kosovo War | Since 1947, Canadian military units have parti... | What are some of the wars the Canadian Militar... |
| 6 | Buddha | Tibet has various festivals that are commonly ... | What is worshipped during Tibet's various fest... |
| 7 | 9.3% | From 2001 to 2008, Mac sales increased continu... | What was Apples market share of all computer s... |
| 8 | Improvisation | Improvisation stands at the centre of Chopin's... | What is central to Chopin's process? |
| 9 | 1861 | Alfred North Whitehead was born in Ramsgate, K... | What year was Whitehead born? |


In [19]:
model.summary()

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (TFSharedEmbeddings)  multiple                 16449536  
                                                                 
 encoder (TFT5MainLayer)     multiple                  18883264  
                                                                 
 decoder (TFT5MainLayer)     multiple                  25178816  
                                                                 
 lm_head (Dense)             multiple                  16449536  
                                                                 
Total params: 76,961,152
Trainable params: 76,961,152
Non-trainable params: 0
_________________________________________________________________


In [111]:
answer = df.answer[0]
context = df.context[0]
task = "qg"

input_string = f"{task} answer: {answer} </s> context: {context}"

target_string = df.question[0]

In [112]:
input_string

"qg answer: biotech companies </s> context: Prior to moving its headquarters to Chicago, aerospace manufacturer Boeing (#30) was the largest company based in Seattle. Its largest division is still headquartered in nearby Renton, and the company has large aircraft manufacturing plants in Everett and Renton, so it remains the largest private employer in the Seattle metropolitan area. Former Seattle Mayor Greg Nickels announced a desire to spark a new economic boom driven by the biotechnology industry in 2006. Major redevelopment of the South Lake Union neighborhood is underway, in an effort to attract new and established biotech companies to the city, joining biotech companies Corixa (acquired by GlaxoSmithKline), Immunex (now part of Amgen), Trubion, and ZymoGenetics. Vulcan Inc., the holding company of billionaire Paul Allen, is behind most of the development projects in the region. While some see the new development as an economic boon, others have criticized Nickels and the Seattle C

In [113]:
target_string

'What type of businesses did Nickles want to attract to Seattle?'
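
The same prompt layout (`qg answer: ... </s> context: ...`) is rebuilt by hand for each example in this notebook. One way to keep it consistent is a small helper. The function name `build_qg_input` and its `sep` parameter are hypothetical. The sketch only reproduces the f-string used above and makes no claim that this layout is the best one for T5.

```python
def build_qg_input(answer, context, task="qg", sep="</s>"):
    """Format one question-generation input in the layout used in this notebook."""
    return f"{task} answer: {answer} {sep} context: {context}"


# Example with a shortened context string.
example = build_qg_input("biotech companies", "Prior to moving its headquarters")
print(example)
```

With the DataFrame from earlier, `build_qg_input(df.answer[i], df.context[i])` would produce the input string for row `i`.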

In [114]:
inputs = tokenizer(input_string, max_length=1024, truncation=True, return_tensors="tf")

In [115]:
inputs

{'input_ids': <tf.Tensor: shape=(1, 318), dtype=int32, numpy=
array([[    3,  1824,   122,  1525,    10,  2392,  3470,   688,     1,
         2625,    10,  6783,    12,  1735,   165, 13767,    12,  3715,
            6, 28674,  4818, 21430,    41,  4663,  1458,    61,    47,
            8,  2015,   349,     3,   390,    16,  8854,     5,    94,
            7,  2015,  4889,    19,   341,     3, 27630,    16,  4676,
         9405,   106,     6,    11,     8,   349,    65,   508,  6442,
         3732,  2677,    16,  6381,    15,    17,    17,    11,  9405,
          106,     6,    78,    34,  3048,     8,  2015,  1045,  6152,
           16,     8,  8854, 25233,   616,     5, 18263,  8854, 12394,
        11859, 29005,     7,  2162,     3,     9,  3667,    12, 13233,
            3,     9,   126,  1456, 13997,  6737,    57,     8,  2392,
        18485,   681,    16, 15066,  9236,     3,    60, 19677,    13,
            8,  1013,  2154,  3545,  5353,    19, 18953,     6,    16,
           46, 

In [116]:
labels = tokenizer(target_string, max_length=1024, truncation=True, return_tensors="tf")

In [117]:
labels

{'input_ids': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=
array([[ 363,  686,   13, 1623,  410, 7486,  965,  241,   12, 5521,   12,
        8854,   58,    1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

In [118]:
pprint(tokenizer.batch_decode(inputs.input_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

('qg answer: biotech companies context: Prior to moving its headquarters to '
 'Chicago, aerospace manufacturer Boeing (#30) was the largest company based '
 'in Seattle. Its largest division is still headquartered in nearby Renton, '
 'and the company has large aircraft manufacturing plants in Everett and '
 'Renton, so it remains the largest private employer in the Seattle '
 'metropolitan area. Former Seattle Mayor Greg Nickels announced a desire to '
 'spark a new economic boom driven by the biotechnology industry in 2006. '
 'Major redevelopment of the South Lake Union neighborhood is underway, in an '
 'effort to attract new and established biotech companies to the city, joining '
 'biotech companies Corixa (acquired by GlaxoSmithKline), Immunex (now part of '
 'Amgen), Trubion, and ZymoGenetics. Vulcan Inc., the holding company of '
 'billionaire Paul Allen, is behind most of the development projects in the '
 'region. While some see the new development as an economic boon, othe

In [119]:
pprint(tokenizer.batch_decode(labels.input_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0], compact=True)

'What type of businesses did Nickles want to attract to Seattle?'


In [120]:
outputs = model(inputs.input_ids, labels=labels.input_ids)

In [121]:
loss = outputs.loss

In [102]:
logits = outputs.logits
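
Passing `labels` makes the model return a loss alongside the logits. That loss is a token-level cross-entropy between the logits and the label ids. Below is a rough sketch of the calculation in plain Python with toy numbers. It assumes a simple mean over tokens, and the library's exact reduction and padding handling are not reproduced here. The function name `token_cross_entropy` is hypothetical.

```python
import math


def token_cross_entropy(logits, label_ids):
    """Mean cross-entropy over tokens, computed from raw (unnormalized) logits."""
    losses = []
    for row, label in zip(logits, label_ids):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        # Negative log-probability of the correct token.
        losses.append(log_z - row[label])
    return sum(losses) / len(losses)


# Toy example: two target positions, a vocabulary of three tokens.
toy_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
toy_labels = [0, 2]
loss_value = token_cross_entropy(toy_logits, toy_labels)
print(loss_value)
```

In the notebook, `logits` has one row of vocabulary-sized scores per target token, so the same idea applies with a much larger vocabulary.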

In [103]:
answer = df.answer[5]
context = df.context[5]
task = "qg"

input_string = f"{task} answer: {answer} </s> context: {context}"

target_string = df.question[5]

In [104]:
inputs = tokenizer(input_string, max_length=1024, truncation=True, return_tensors="tf").input_ids

In [105]:
input_string

"qg answer: Cold War, First Gulf War, Kosovo War </s> context: Since 1947, Canadian military units have participated in more than 200 operations worldwide, and completed 72 international operations. Canadian soldiers, sailors, and aviators came to be considered world-class professionals through conspicuous service during these conflicts and the country's integral participation in NATO during the Cold War, First Gulf War, Kosovo War, and in United Nations Peacekeeping operations, such as the Suez Crisis, Golan Heights, Cyprus, Croatia, Bosnia, Afghanistan, and Libya. Canada maintained an aircraft carrier from 1957 to 1970 during the Cold War, which never saw combat but participated in patrols during the Cuban Missile Crisis."

In [106]:
target_string

'What are some of the wars the Canadian Military was involved in?'

In [107]:
outputs = model.generate(inputs)

In [108]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

, qg answer: answer: Cold War, First Gulf War, answer
