In [1]:
!pip install transformers datasets evaluate rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=262cb237923ad38a213fcc0db39b935476917eb3ea4b1d73695774a7f7992467
  Stored in directory: /home/pedro/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [2]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Downloading readme:   0%|          | 0.00/7.27k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/91.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/15.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/18949 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3269 [00:00<?, ? examples/s]

Generating ca_test split:   0%|          | 0/1237 [00:00<?, ? examples/s]

In [3]:
billsum = billsum.train_test_split(test_size=0.2)
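
The split sizes follow from test_size=0.2 and the 1,237 examples in ca_test. A quick check of the arithmetic, assuming the test count is rounded up and the remainder goes to the training split (this matches the 989 and 248 examples reported by the map step later in this notebook):

```python
import math

total_examples = 1237  # size of the ca_test split, per the download log
test_size = 0.2

# Assumption: the test share is rounded up and the rest becomes training data.
n_test = math.ceil(total_examples * test_size)
n_train = total_examples - n_test

print(n_train, n_test)  # 989 248
```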

In [4]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of the following:\n(a) The Isla Vista community encompasses a population of approximately 15,000 residents situated within approximately a half square mile of land in Santa Barbara County. It is adjacent to the University of California, Santa Barbara (UCSB) campus and its student population, of which approximately 8,000 students reside in university owned housing. Including university property, the area totals about 1,200 acres. Isla Vista represents one of the largest urban communities in California not governed as a city.\n(b) Isla Vista faces various challenges in local governance. As a university community, Isla Vista must accommodate the service needs associated with its transient student population and a predominantly renter-oriented community while balancing the needs of local homeowners and long-term residents. Isla Vista’s situation is complicated by its

There are two fields that you’ll want to use:

    text: the text of the bill, which will be the input to the model.
    summary: a condensed version of text, which will be the model target.

## Preprocess

The next step is to load a T5 tokenizer to process text and summary:

The preprocessing function you want to create needs to:

    Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
    Use the keyword text_target argument when tokenizing labels.
    Truncate sequences to be no longer than the maximum length set by the max_length parameter.

In [5]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

In [8]:
prefix = "summarize: "


def preprocess_function(examples):

    inputs = [prefix + doc for doc in examples["text"]]

    model_inputs = tokenizer(inputs, 
                             max_length=1024, 
                             truncation=True)

    labels = tokenizer(text_target=examples["summary"], 
                       max_length=128, 
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply the preprocessing function over the entire dataset, use 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

In [9]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]
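
With batched=True, the function passed to map receives a dictionary of columns, where each value is a list with one entry per example, rather than a single record. That is why preprocess_function iterates over examples["text"]. A minimal sketch of just the prefixing step on a toy batch (add_prefix and toy_batch are hypothetical names, and the texts are made-up placeholders, not BillSum data):

```python
prefix = "summarize: "


def add_prefix(examples: dict) -> dict:
    # Each value in `examples` is a list holding one entry per example in the batch.
    return {"prefixed_text": [prefix + doc for doc in examples["text"]]}


# Toy stand-in for the batch that map() would pass in when batched=True.
toy_batch = {
    "text": ["An act to amend Section 1.", "An act relating to water."],
    "summary": ["Amends Section 1.", "Concerns water."],
}

out = add_prefix(toy_batch)
print(out["prefixed_text"][0])
```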

Now create a batch of examples using DataCollatorForSeq2Seq. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [10]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

2024-04-28 16:32:53.907763: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-28 16:32:54.109286: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-28 16:32:54.801908: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
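
To illustrate what dynamic padding means, here is a pure-Python sketch of the idea: each batch is padded only to its own longest sequence, and labels are padded with -100 (the default label padding value of DataCollatorForSeq2Seq) so that padded positions are ignored by the loss. The pad_batch helper and the token ids are hypothetical; the real collator also builds attention masks and returns tensors.

```python
PAD_ID = 0           # assumption: pad token id 0, as in T5
LABEL_PAD_ID = -100  # default label padding value in DataCollatorForSeq2Seq


def pad_batch(sequences, pad_value):
    # Pad every sequence to the longest one in this batch only.
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Made-up token ids for two examples of different lengths.
features = [
    {"input_ids": [21, 5, 9, 1], "labels": [7, 1]},
    {"input_ids": [21, 8, 1], "labels": [4, 6, 3, 1]},
]

batch = {
    "input_ids": pad_batch([f["input_ids"] for f in features], PAD_ID),
    "labels": pad_batch([f["labels"] for f in features], LABEL_PAD_ID),
}
print(batch)
```

Padding to the batch maximum here adds three padding values in total; padding every input to max_length=1024 would add more than two thousand for the same two examples.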


## Evaluate

Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the ROUGE metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

In [11]:
import evaluate

rouge = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]
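
ROUGE scores a generated summary by its n-gram overlap with a reference summary. To make the idea concrete, here is a simplified ROUGE-1 style F1 over whitespace tokens. It is a sketch only: rouge1_f1 is a hypothetical helper, it skips the stemming and tokenization that the rouge_score package applies, and its numbers will not exactly match the output of the metric loaded above.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Precision is 4/4 and recall is 4/5, so F1 is 8/9.
score = rouge1_f1("the bill funds schools", "the bill funds public schools")
print(round(score, 4))  # 0.8889
```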

Then create a function that passes your predictions and labels to compute to calculate the ROUGE metric:

In [12]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Labels are padded with -100, which the tokenizer cannot decode; swap in the pad token id first.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Report the mean number of non-padding tokens per generated summary.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
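
The np.where and np.count_nonzero steps in compute_metrics can be checked in isolation on a toy array. This sketch assumes a pad token id of 0 (the value T5 uses) and made-up token ids:

```python
import numpy as np

PAD_ID = 0  # assumption: T5's pad token id

# Two label rows; the first was padded with -100 during collation.
labels = np.array([[37, 1, -100, -100],
                   [5, 6, 7, 1]])

# Replace the -100 markers so every id can be decoded.
cleaned = np.where(labels != -100, labels, PAD_ID)

# Count the non-padding tokens per row, as done for gen_len.
lengths = [int(np.count_nonzero(row != PAD_ID)) for row in cleaned]

print(cleaned.tolist())  # [[37, 1, 0, 0], [5, 6, 7, 1]]
print(lengths)           # [2, 4]
```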

## Train

PyTorch

You’re ready to start training your model now! Load T5 with AutoModelForSeq2SeqLM:

In [13]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)



config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

At this point, only three steps remain:

1. Define your training hyperparameters in Seq2SeqTrainingArguments. The only required parameter is output_dir which specifies where to save your model. You’ll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the ROUGE metric and save the training checkpoint.
2. Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
3. Call train() to finetune your model.

In [14]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_billsum_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA or MLU devices or NPU devices or certain XPU devices (with IPEX).
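
The ValueError above is raised because fp16=True requires a supported accelerator, and the earlier log shows that no CUDA drivers were found on this machine. One way to keep the cell runnable on both CPU and GPU is to enable mixed precision only when CUDA is available. A minimal sketch, assuming PyTorch may or may not be installed; cuda_available and precision_kwargs are hypothetical names, not part of the notebook:

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is installed and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


# Enable fp16 only where it is supported; on a CPU-only machine this is False.
precision_kwargs = {"fp16": cuda_available()}
print(precision_kwargs)
```

The dictionary could then be unpacked into the arguments in place of the hard-coded flag, for example Seq2SeqTrainingArguments(output_dir="my_awesome_billsum_model", **precision_kwargs).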

## Train

TensorFlow

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters: 

In [15]:
from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
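
The cell above uses a constant learning rate. If a learning rate schedule is wanted, the imported create_optimizer helper needs the total number of training steps. A sketch of that arithmetic, assuming the 989 training examples reported by the map step, a batch size of 16 with the last partial batch dropped, and 3 epochs; the create_optimizer call is shown as a comment because it requires TensorFlow and was not run here:

```python
train_examples = 989  # training split size, per the map output
batch_size = 16
num_epochs = 3

# Assumption: the last partial batch of each epoch is dropped.
batches_per_epoch = train_examples // batch_size
total_train_steps = batches_per_epoch * num_epochs

print(batches_per_epoch, total_train_steps)  # 61 183

# With TensorFlow installed, these values could feed the schedule-aware helper:
# optimizer, schedule = create_optimizer(
#     init_lr=2e-5,
#     num_warmup_steps=0,
#     num_train_steps=total_train_steps,
#     weight_decay_rate=0.01,
# )
```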

In [16]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

2024-04-28 16:38:08.610444: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 65798144 exceeds 10% of free system memory.
2024-04-28 16:38:08.797004: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 65798144 exceeds 10% of free system memory.
2024-04-28 16:38:08.820863: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 65798144 exceeds 10% of free system memory.
2024-04-28 16:38:09.918118: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 65798144 exceeds 10% of free system memory.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

In [17]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_billsum["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_billsum["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

In [18]:
import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

The last two things to set up before you start training are computing the ROUGE score from the predictions and providing a way to push your model to the Hub. Both are done with Keras callbacks.

Pass your compute_metrics function to KerasMetricCallback:

In [20]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

In [21]:
callbacks = [metric_callback]

In [None]:
model.fit(x=tf_train_set, 
          validation_data=tf_test_set, 
          epochs=3, 
          callbacks=callbacks)

Epoch 1/3
Cause: for/else statement not yet supported
Cause: for/else statement not yet supported


2024-04-28 16:41:24.078636: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 65798144 exceeds 10% of free system memory.
