<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/JJ/JL_T5_huggingface_abstraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports & Setup

In [1]:
! pip install transformers datasets #huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
[K     |████████████████████████████████| 5.3 MB 4.3 MB/s 
[?25hCollecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
[K     |████████████████████████████████| 441 kB 63.2 MB/s 
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
[K     |████████████████████████████████| 163 kB 69.3 MB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 44.5 MB/s 
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
[K     |████████████████████████████████| 115 kB 43.2 MB/s 
[?25hCollecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x

In [2]:
!pip install -q sentencepiece

[?25l[K     |▎                               | 10 kB 22.7 MB/s eta 0:00:01[K     |▌                               | 20 kB 6.0 MB/s eta 0:00:01[K     |▊                               | 30 kB 8.5 MB/s eta 0:00:01[K     |█                               | 40 kB 3.6 MB/s eta 0:00:01[K     |█▎                              | 51 kB 3.8 MB/s eta 0:00:01[K     |█▌                              | 61 kB 4.5 MB/s eta 0:00:01[K     |█▉                              | 71 kB 4.7 MB/s eta 0:00:01[K     |██                              | 81 kB 4.9 MB/s eta 0:00:01[K     |██▎                             | 92 kB 5.5 MB/s eta 0:00:01[K     |██▋                             | 102 kB 4.4 MB/s eta 0:00:01[K     |██▉                             | 112 kB 4.4 MB/s eta 0:00:01[K     |███                             | 122 kB 4.4 MB/s eta 0:00:01[K     |███▍                            | 133 kB 4.4 MB/s eta 0:00:01[K     |███▋                            | 143 kB 4.4 MB/s eta 0:00:01[K    

In [3]:
import transformers

print(transformers.__version__)

# Verify! Version must be at least 4.16.0

4.23.1


In [4]:
from datasets import Dataset, load_dataset, load_metric, load_dataset, load_from_disk, load_dataset_builder

In [5]:
import numpy as np
import pandas as pd
import tensorflow as tf

from pprint import pprint

In [6]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


## Notebook-Level Settings

In [7]:
# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible
# answers are allowed or not).
squad_v2 = False
model_checkpoint = "google/t5-v1_1-base"
tokenizer_checkpoint = "t5-base" # because it threw an error with the model checkpoint, but I believe it's the same tokenizer

## Helper Functions

In [8]:
def summarize_dataset (dataset, config=None):
    builder = load_dataset_builder(dataset, config)
    pprint(f"Description:\n {builder.info.description}")
    print(f"Features:")
    pprint(builder.info.features)
    return

In [9]:
def word_count(string):
  return(len(string.strip().split(" ")))

# Loading SQuAD

## From `datasets` or Google Drive

In [10]:
# Uncomment accordingly
# datasets = load_dataset('squad_v2' if squad_v2 else 'squad')
datasets = load_from_disk("/content/drive/MyDrive/w266 NLP Final Project/Data/squad.hf")

In [11]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [12]:
# Look at first example
pprint(datasets['train'][0])

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the '
            "Main Building's gold dome is a golden statue of the Virgin Mary. "
            'Immediately in front of the Main Building and facing it, is a '
            'copper statue of Christ with arms upraised with the legend '
            '"Venite Ad Me Omnes". Next to the Main Building is the Basilica '
            'of the Sacred Heart. Immediately behind the basilica is the '
            'Grotto, a Marian place of prayer and reflection. It is a replica '
            'of the grotto at Lourdes, France where the Virgin Mary reputedly '
            'appeared to Saint Bernadette Soubirous in 1858. At the end of the '
            'main drive (and in a direct line that connects through 3 statues '
            'and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did t

In [33]:
# Shuffle dataset and select specific number of examples for development

data_count = 1000
sample = datasets['train'].shuffle(seed=1962).select(range(data_count))
df = pd.DataFrame()
df['answer'] = [answer['text'][0] for answer in sample['answers']]
df['context'] = sample['context']
df['question'] = sample['question']



In [14]:
df

Unnamed: 0,answer,context,question
0,biotech companies,"Prior to moving its headquarters to Chicago, a...",What type of businesses did Nickles want to at...
1,Tytus Woyciechowski,Four boarders at his parents' apartments becam...,To whom did Chopin reveal in letters which par...
2,the Endangered Species Committee,The question to be answered is whether a liste...,"If a species may be harmed, who holds final sa..."
3,China,"In Asian countries such as China, Korea, and J...",What country has the dog as part of its 12 ani...
4,45 years,Saint Athanasius of Alexandria (/ˌæθəˈneɪʃəs/;...,How long did his episcopate last?
...,...,...,...
245,"December 8, 1991","On June 12, 1990, the Congress of People's Dep...",On what date were the Belavezha Accords signed?
246,the Boreal Kingdom,"Phytogeographically, Greece belongs to the Bor...",Greece's plant distribution belongs to what?
247,May to September,"Fog is fairly common, particularly in spring a...",What months do thunderstorms occur in Boston?
248,the corporation,The creation of a modern industrial economy to...,What became the dominant form of business orga...


In [15]:
max([word_count(x) for x in df.context])

297

# Preprocess training data
## Tokenization

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [17]:
# Check if it's a "fast" tokenizer

assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [18]:
# Test tokenizer
tokenizer('hello, world!')

{'input_ids': [21820, 6, 296, 55, 1], 'attention_mask': [1, 1, 1, 1, 1]}

## Create Input Strings and Target Strings

In [19]:
# Tokenization settings
max_source_length = 512
max_target_length = 64

In [20]:
# Prepare INPUTS by prepending "gq", providing an "answer:", and a "context:"
input_strings = [f"gq answer: {answer} context: {context}" for answer, context in zip (df.answer, df.context)]

target_strings = df.question.to_list()

In [21]:
input_strings[1]

'gq answer: Tytus Woyciechowski context: Four boarders at his parents\' apartments became Chopin\'s intimates: Tytus Woyciechowski, Jan Nepomucen Białobłocki, Jan Matuszyński and Julian Fontana; the latter two would become part of his Paris milieu. He was friendly with members of Warsaw\'s young artistic and intellectual world, including Fontana, Józef Bohdan Zaleski and Stefan Witwicki. He was also attracted to the singing student Konstancja Gładkowska. In letters to Woyciechowski, he indicated which of his works, and even which of their passages, were influenced by his fascination with her; his letter of 15 May 1830 revealed that the slow movement (Larghetto) of his Piano Concerto No. 1 (in E minor) was secretly dedicated to her – "It should be like dreaming in beautiful springtime – by moonlight." His final Conservatory report (July 1829) read: "Chopin F., third-year student, exceptional talent, musical genius."'

In [22]:
max([word_count(x) for x in input_strings])

302

In [23]:
target_strings[1]

'To whom did Chopin reveal in letters which parts of his work were about the singing student he was infatuated with?'

In [24]:
# Encode the INPUTS

input_encoding = tokenizer(
    input_strings,
    padding="longest",
    max_length=max_source_length,
    truncation=True,
    return_tensors="tf"
)
input_ids, attention_mask = input_encoding.input_ids, input_encoding.attention_mask

In [25]:
# Encode the TARGETS

target_encoding = tokenizer(
    target_strings,
    padding="longest",
    max_length=max_target_length,
    truncation=True,
    return_tensors="tf"
)
target_ids = target_encoding.input_ids

In [26]:
max(len(x) for x in target_ids)

35

In [27]:
target_ids[0]

<tf.Tensor: shape=(35,), dtype=int32, numpy=
array([ 363,  686,   13, 1623,  410, 7486,  965,  241,   12, 5521,   12,
       8854,   58,    1,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0], dtype=int32)>

In [28]:
# We need to substitute -100 for the tokenizers pad token ID in the target labels
# And I can't figure out how to do that in TensorFlow
# So I will convert to a numpy array, make the substitution and then cast back 

target_ids = target_ids.numpy()
target_ids[target_ids == tokenizer.pad_token_id] = -100
target_ids = tf.convert_to_tensor(target_ids)

In [29]:
max(len(x) for x in target_ids)

35

In [30]:
target_ids[0]

<tf.Tensor: shape=(35,), dtype=int32, numpy=
array([ 363,  686,   13, 1623,  410, 7486,  965,  241,   12, 5521,   12,
       8854,   58,    1, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
       -100, -100], dtype=int32)>

In [52]:
# All together now in one prepping function

def prepare_data(examples):

  # Prepare INPUTS by prepending "gq", providing an "answer:", and a "context:"
  # input_strings = [f"gq answer: {answer} context: {context}" for answer, context in zip (df.answer, df.context)]
  # target_strings = df.question.to_list()

  # Prepare INPUTS
  input_strings = [f"gq answer: {answer} context: {context}" for answer, context in zip(examples['answers'], examples['context'])]

  tokenized_examples = tokenizer(
      input_strings,
      padding='longest',
      max_length=max_source_length,
      truncation=True,
      return_tensors='tf'
  )
  input_ids, attention_mask = input_encoding.input_ids, input_encoding.attention_mask

  # Create model_inputs to be passed to model
  model_inputs = {'input_ids': input_ids,
                  'attention_mask': attention_mask}

  # Prepare LABELS
  target_encoding = tokenizer(
      examples['question'],
      padding="longest",
      max_length=max_target_length,
      truncation=True,
      return_tensors="tf"
  )
  target_ids = target_encoding.input_ids

  # Replace 0 with -100
  target_ids = target_ids.numpy()
  target_ids[target_ids == tokenizer.pad_token_id] = -100
  target_ids = tf.convert_to_tensor(target_ids)

  # Add labels to model_inputs
  model_inputs['questions'] = target_ids

  return model_inputs

In [53]:
# tokenized_datasets = datasets.map(prepare_data, batched=True) # ran out of Disk space, skip this cell

In [54]:
print(f"Only using {data_count} examples from data for training.")

smaller_train_data = datasets['train'].shuffle(seed=1962).select(range(data_count))

print(len(smaller_train_data))
print(type(smaller_train_data))



Only using 1000 examples from data for training.
1000
<class 'datasets.arrow_dataset.Dataset'>


In [55]:
tokenized_train_data = smaller_train_data.map(prepare_data)

  0%|          | 0/1000 [00:00<?, ?ex/s]

In [56]:
val_count = int(data_count/3)
print(f"Only using {val_count} examples from data for validation.")

smaller_validation_data = datasets['validation'].shuffle(seed=1962).select(range(val_count))
tokenized_validation_data = smaller_validation_data.map(prepare_data)



Only using 333 examples from data for validation.


  0%|          | 0/333 [00:00<?, ?ex/s]

# Fine-tuning the Model

In [47]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/605 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/991M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at google/t5-v1_1-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [48]:
model_name = model_checkpoint.split("/")[-1]
print('Model:', model_name)

batch_size = 8
learning_rate = 2e-5
num_train_epochs = 2
weight_decay = 0.01

Model: t5-v1_1-base


## Convert Datasets to Keras-Friendly `tf.data.Dataset`

In [57]:
# training_ds = Dataset.from_dict(
#     {
#         'input_ids': input_ids,
#         'attention_mask': attention_mask,
#         'labels': target_ids
#     }
# )

# Train
# training_ds = tokenized_datasets['train']

train_set = model.prepare_tf_dataset(
    tokenized_train_data,
    shuffle=True,
    batch_size=batch_size
)

# train_set = tokenized_train_data.to_tf_dataset(
#     columns=['input_ids', '']
# )



In [58]:
# Validation
# validation_ds = tokenized_datasets['validation']

validation_set = model.prepare_tf_dataset(
    tokenized_validation_data,
    shuffle=True,
    batch_size=batch_size
)

In [62]:
model.summary()

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (TFSharedEmbeddings)  multiple                 24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  84954240  
                                                                 
 decoder (TFT5MainLayer)     multiple                  113275008 
                                                                 
 lm_head (Dense)             multiple                  24674304  
                                                                 
Total params: 247,577,856
Trainable params: 247,577,856
Non-trainable params: 0
_________________________________________________________________


In [59]:
# Next, we can create an optimizer and specify a loss function.
# The create_optimizer function gives us a very solid AdamW optimizer
# with weight decay and a learning rate schedule, but it needs us
# to compute the number of training steps to build that schedule.

from transformers import create_optimizer

total_train_steps = len(train_set) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=learning_rate,
    num_warmup_steps=0,
    num_train_steps=total_train_steps
)

In [60]:
model.compile(
    optimizer=optimizer,
    metrics=['accuracy']
)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [66]:

# model.fit(
#     train_set,
#     validation_data=validation_set,
#     epochs=num_train_epochs
# )

model.fit(
    train_set,
    validation_data=validation_set,
    epochs=num_train_epochs
)


Epoch 1/2


ValueError: ignored