T5 Flan
----
This notebook follows this guide: https://huggingface.co/docs/transformers/tasks/summarization

# Setup

In [1]:
! pip install transformers datasets evaluate rouge_score

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting datasets
  Downloading datasets-2.14.1-py3-none-any.whl (492 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting filelock (from transformers)
  Downloading filelock-3.12.2-py3-none-any.whl (10 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)

In [2]:
import tensorflow as tf
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, TFAutoModelForSeq2SeqLM, create_optimizer, AdamWeightDecay, pipeline
from transformers.keras_callbacks import KerasMetricCallback
import evaluate
import numpy as np

Define the model checkpoint to fine-tune.

In [3]:
checkpoint = 'google/flan-t5-small'

# Data

In [4]:
billsum = load_dataset('billsum', split='ca_test').train_test_split(test_size=0.2)

# HF datasets can be indexed by row position or by column name:
# an integer returns one example as a dict, a column name returns that column's values as a list
print(billsum['train'][0])
print(billsum['train']['summary'][:5])

Downloading builder script: 100%|██████████| 3.66k/3.66k [00:00<00:00, 3.87MB/s]
Downloading metadata: 100%|██████████| 1.80k/1.80k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 6.70k/6.70k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 67.3M/67.3M [00:02<00:00, 22.7MB/s]
Generating train split: 100%|██████████| 18949/18949 [00:01<00:00, 15658.03 examples/s]
Generating test split: 100%|██████████| 3269/3269 [00:00<00:00, 14886.16 examples/s]
Generating ca_test split: 100%|██████████| 1237/1237 [00:00<00:00, 10245.06 examples/s]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 82013 of the Government Code is amended to read:\n82013.\n“Committee” means any person or combination of persons who directly or indirectly does any of the following:\n(a) Receives contributions totaling two thousand dollars ($2,000) or more in a calendar year.\n(b) Makes independent expenditures totaling one thousand dollars ($1,000) or more in a calendar year; or\n(c) Makes contributions totaling ten thousand dollars ($10,000) or more in a calendar year to or at the behest of candidates or committees.\nA person or combination of persons that becomes a committee shall retain its status as a committee until such time as that status is terminated pursuant to Section 84214.\nSEC. 2.\nSection 82036 of the Government Code is amended to read:\n82036.\n“Late contribution” means any of the following:\n(a) A contribution, including a loan, that totals in the aggregate one thousand dollars ($1,000) or 
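The two access patterns used in the cell above (an integer for one row, a column name for one column) can be mimicked with a plain dict of lists. The sketch below is a hypothetical stand-in named `ToyTable`, not the `datasets` API; it only illustrates why row access yields a dict and column access yields a list.

```python
class ToyTable:
    """Hypothetical stand-in for a column-oriented dataset (not the `datasets` API)."""

    def __init__(self, columns):
        # columns maps a column name to a list of values, one per row
        self.columns = columns

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: gather one value from every column into a dict
            return {name: values[key] for name, values in self.columns.items()}
        # Column access: return the whole list stored under that name
        return self.columns[key]


table = ToyTable({
    "text": ["bill A full text", "bill B full text", "bill C full text"],
    "summary": ["summary A", "summary B", "summary C"],
})

row = table[0]                    # one observation as a dict
first_two = table["summary"][:2]  # a slice of one column as a list
print(row)
print(first_two)
```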




In [5]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Downloading (…)okenizer_config.json: 100%|██████████| 2.54k/2.54k [00:00<00:00, 508kB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 815kB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 2.42M/2.42M [00:01<00:00, 2.03MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 310kB/s]


In [6]:
def preprocess_function(data):
    prefix = "summarize: "
    inputs = [prefix + text for text in data['text']]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    
    labels = tokenizer(text_target=data['summary'], max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs
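To make the effect of the prefix and of `max_length` with `truncation=True` concrete, here is a minimal sketch that swaps the real tokenizer for a hypothetical whitespace tokenizer (`toy_tokenize`). The real tokenizer produces SentencePiece subword IDs, so the tokens and lengths below are illustrative only.

```python
def toy_tokenize(text, max_length):
    """Hypothetical whitespace tokenizer; the real one produces subword IDs."""
    tokens = text.split()
    # Truncation keeps at most max_length tokens and drops the tail
    return tokens[:max_length]


def toy_preprocess(batch, max_input_len=8, max_label_len=4):
    # Prepend the task prefix to every document, as preprocess_function does
    prefix = "summarize: "
    inputs = [prefix + text for text in batch["text"]]
    return {
        "input_ids": [toy_tokenize(t, max_input_len) for t in inputs],
        "labels": [toy_tokenize(s, max_label_len) for s in batch["summary"]],
    }


batch = {
    "text": ["The people of the State of California do enact as follows"],
    "summary": ["Amends the Government Code definitions of committee"],
}
encoded = toy_preprocess(batch)
print(encoded["input_ids"][0])
print(encoded["labels"][0])
```

Inputs and labels are truncated independently, which is why the function above passes separate length limits for each.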

In [7]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=checkpoint,
    return_tensors='tf'
)

Map: 100%|██████████| 989/989 [00:03<00:00, 303.63 examples/s]
Map: 100%|██████████| 248/248 [00:01<00:00, 222.75 examples/s]
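The 989 and 248 examples reported by the `Map` progress bars are consistent with an 80/20 split of the 1,237 `ca_test` examples. The arithmetic below is a sketch that assumes the test share is rounded up and the remainder goes to the training split; that assumption reproduces the counts shown.

```python
import math

n_examples = 1237    # size of the ca_test split reported in the log
test_fraction = 0.2  # test_size passed to train_test_split

# Assumption: the test split is rounded up and the rest goes to train
n_test = math.ceil(n_examples * test_fraction)
n_train = n_examples - n_test
print(n_train, n_test)
```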


# Defining Model and Metrics

In [8]:
rouge = evaluate.load('rouge')
def compute_metrics(eval_pred, evaluator=rouge, tokenizer=tokenizer):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Length of each prediction, counted as the number of non-padding tokens
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Padded label positions hold -100 so that the loss ignores them; -100 is not a valid
    # token ID, so swap it for the pad token ID before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = evaluator.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
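The two bookkeeping steps in `compute_metrics` can be illustrated without a tokenizer. Labels carry -100 at padded positions (the value the loss ignores), which is not a valid token ID, so it is replaced by the pad ID before decoding; `gen_len` is the mean number of non-pad tokens per prediction. A minimal pure-Python sketch, assuming a pad ID of 0 and made-up token IDs:

```python
IGNORE_INDEX = -100  # label value marking padded positions
pad_id = 0           # assumed pad token ID for this illustration

# Made-up label IDs; the second sequence was padded with -100
labels = [
    [37, 1492, 13, 1],
    [37, 1, -100, -100],
]
# Same effect as np.where(labels != -100, labels, pad_id)
restored = [[tok if tok != IGNORE_INDEX else pad_id for tok in row] for row in labels]

# Made-up predicted IDs, padded with pad_id
predictions = [
    [0, 37, 1492, 1, 0, 0],
    [0, 37, 1, 0, 0, 0],
]
# Count non-pad tokens per prediction, then average
prediction_lens = [sum(tok != pad_id for tok in row) for row in predictions]
gen_len = sum(prediction_lens) / len(prediction_lens)

print(restored)
print(prediction_lens, gen_len)
```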


Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<?, ?B/s]
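For intuition about what the ROUGE scores measure, the sketch below computes a simplified ROUGE-1 F1 score: the overlap of single words between a prediction and a reference. It is not the `rouge_score` implementation; it assumes lowercased whitespace tokens and applies no stemming, so its numbers will not match the library exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the bill amends the code",
    "the bill amends the government code",
)
print(round(score, 4))
```

Here every predicted word appears in the reference (precision 1.0), but the reference has one extra word (recall 5/6), so the F1 score falls slightly below 1.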


Note that Hugging Face TensorFlow models compute a task-appropriate loss internally when `labels` are supplied, so no `loss` argument needs to be passed to `compile()`.
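As a rough picture of what such a built-in loss does for a sequence-to-sequence model, the sketch below computes token-level cross-entropy while skipping positions labeled -100. It assumes a plain mean over the remaining positions; the library's exact reduction may differ, and `masked_cross_entropy` is a hypothetical helper, not a library function.

```python
import math

IGNORE_INDEX = -100  # label value for positions that should not contribute


def masked_cross_entropy(probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    probs[i] is a probability distribution over the vocabulary at position i.
    """
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)


probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],  # padded position, ignored by the loss
]
labels = [0, 1, IGNORE_INDEX]
loss = masked_cross_entropy(probs, labels)
print(round(loss, 4))
```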

In [9]:
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)
model.compile(optimizer=optimizer)

train_set = model.prepare_tf_dataset(
    tokenized_billsum['train'],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator
)
test_set = model.prepare_tf_dataset(
    tokenized_billsum['test'],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator
)

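# predict_with_generate=True makes the callback call generate(), so compute_metrics
# receives token IDs it can decode rather than raw logits
metric_callback = KerasMetricCallback(
    metric_fn=compute_metrics,
    eval_dataset=test_set,
    predict_with_generate=True
)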

model.fit(
    x=train_set,
    validation_data=test_set,
    epochs=3,
    callbacks=[metric_callback]
)

Downloading (…)lve/main/config.json: 100%|██████████| 1.40k/1.40k [00:00<00:00, 276kB/s]
Downloading tf_model.h5: 100%|██████████| 440M/440M [00:21<00:00, 20.0MB/s] 
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at google/flan-t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Downloading (…)neration_config.json: 100%|██████████| 147/147 [00:00<00:00, 24.2kB/s]
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch 1/3
 1/61 [..............................] - ETA: 2:14:28 - loss: 2.9154
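The 61 steps per epoch shown in the progress bar follow from the training set size and the batch size. The arithmetic below is a sketch that assumes the final incomplete batch is dropped for the shuffled training set, which matches the step count in the log.

```python
n_train = 989    # training examples reported by the Map progress bar
batch_size = 16  # batch_size passed to prepare_tf_dataset

# Assumption: the last incomplete batch is dropped
steps_per_epoch = n_train // batch_size
leftover = n_train % batch_size
print(steps_per_epoch, leftover)
```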
