# T5 Summarization Example

_Adapted from the Hugging Face summarization guide:_

_https://huggingface.co/docs/transformers/tasks/summarization_

This notebook fine-tunes a T5 model on BillSum, a public dataset that pairs lengthy texts **(input)** with summaries **(labels/target)** for training.

You can then run inference with the fine-tuned model.

## Install libraries

In [None]:
pip install transformers datasets evaluate rouge_score

## Load and preprocess data

In [None]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")


In [None]:
# Create train/test split
billsum = billsum.train_test_split(test_size=0.2)
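
As a rough illustration of what an 80/20 split does, the sketch below shuffles row indices with a fixed seed and cuts them into two parts. It uses only the standard library. `split_indices` is a hypothetical helper written for this example, not part of `datasets`, and its rounding rule is an assumption. `train_test_split` itself accepts a `seed` argument if you need a reproducible split.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded shuffle for reproducibility
    n_test = math.ceil(n_rows * test_size) # round the test share up (assumption)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 8 2
```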

In [None]:
# View data
billsum["train"][0]

In [None]:
from transformers import AutoTokenizer

# Load tokenizer
checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
# Create function to preprocess
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
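
To make the effect of `prefix`, `max_length`, and `truncation=True` concrete, here is a toy version that splits on whitespace instead of using the real tokenizer. `toy_preprocess`, its small length limits, and the sample batch are illustrative assumptions. Real token counts will differ because T5 uses subword tokens.

```python
def toy_preprocess(examples, prefix="summarize: ", max_input_len=8, max_label_len=4):
    """Prefix each document, then truncate inputs and labels to fixed budgets."""
    # Whitespace "tokens" stand in for the subword tokens a real tokenizer produces
    inputs = [(prefix + doc).split()[:max_input_len] for doc in examples["text"]]
    labels = [summary.split()[:max_label_len] for summary in examples["summary"]]
    return {"input_tokens": inputs, "labels": labels}

# Made-up batch in the same column layout as the dataset ("text", "summary")
batch = {
    "text": ["The bill amends the tax code to add a new credit for small farms."],
    "summary": ["Adds a tax credit for small farms."],
}
out = toy_preprocess(batch)
print(out["input_tokens"][0])
# ['summarize:', 'The', 'bill', 'amends', 'the', 'tax', 'code', 'to']
print(out["labels"][0])
# ['Adds', 'a', 'tax', 'credit']
```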

In [None]:
# Preprocess/tokenize
tokenized_billsum = billsum.map(preprocess_function, batched=True)

In [None]:
from transformers import DataCollatorForSeq2Seq

# Pad inputs and labels dynamically to the longest example in each batch
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)
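
The collator pads each batch to the length of its longest example rather than to a global maximum. A minimal pure-Python sketch of that idea follows. `pad_batch` is a hypothetical helper, and the token ids are made up. The pad id of 0 and the label pad of -100 are assumptions chosen to mirror T5's pad token and the ignore index that the loss skips.

```python
def pad_batch(features, pad_id=0, label_pad_id=-100):
    """Right-pad input_ids and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are padded with the ignore index so padding does not affect the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - len(f["labels"])))
    return batch

features = [
    {"input_ids": [21, 13, 7, 1], "labels": [5, 1]},
    {"input_ids": [9, 1], "labels": [4, 8, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[21, 13, 7, 1], [9, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[5, 1, -100], [4, 8, 1]]
```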

## Evaluation function

In [None]:
import evaluate

rouge = evaluate.load("rouge")

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Replace -100 (the ignore index used for padding) so the tokenizer can decode
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compare generated summaries against the reference summaries
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average generated length, counted in non-padding tokens
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
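
For intuition about what the ROUGE scores measure, the sketch below computes a simplified ROUGE-1 F1 (clipped unigram overlap) by hand. It is an illustration only: `rouge1_f1` is a hypothetical helper that lowercases and splits on whitespace with no stemming, so its values will not exactly match those of the `rouge` metric loaded above.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the bill lowers drug costs",
    "the bill lowers prescription drug costs",
)
print(round(score, 4))  # 0.9091
```

Here all five predicted words appear in the six-word reference, so precision is 1.0 and recall is 5/6.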

## Train model

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Load model
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
# Set up params
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,  # change to bf16=True for XPU
    push_to_hub=True,  # requires being logged in to the Hugging Face Hub; set to False to keep the model local
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

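# Train model
trainer.train()

# Save the final model and tokenizer to output_dir so "./results" can be loaded for inference
trainer.save_model()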
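
To estimate how long a run will take, it can help to work out the number of optimizer steps implied by the batch size and epoch count. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept. `total_optimizer_steps` is a hypothetical helper, and the training-set size of 1000 is a made-up figure rather than the actual size of the split.

```python
import math

def total_optimizer_steps(n_train, batch_size, epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last partial batch counts
    return steps_per_epoch * epochs

# n_train=1000 is a made-up size for illustration only
steps = total_optimizer_steps(n_train=1000, batch_size=16, epochs=4)
print(steps)  # 252
```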

## Use model for inference

In [None]:
# Create test sample
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

#### Pipeline Method

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="./results")
summarizer(text)

#### Manual Method

In [None]:
from transformers import AutoTokenizer

# Tokenize and format
tokenizer = AutoTokenizer.from_pretrained("./results")
inputs = tokenizer(text, return_tensors="pt").input_ids

In [None]:
from transformers import AutoModelForSeq2SeqLM

# Generate tokens
model = AutoModelForSeq2SeqLM.from_pretrained("./results")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

In [None]:
# Convert tokens to text
tokenizer.decode(outputs[0], skip_special_tokens=True)
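
With `do_sample=False`, and assuming the default of a single beam, generation is greedy: each step keeps the highest-scoring next token until an end-of-sequence token or the `max_new_tokens` limit is reached. The toy sketch below shows that loop. `greedy_decode` and the score table are made up for illustration, and scoring each step from only the previous token is a simplification, since the real model conditions on the full input and all tokens generated so far.

```python
def greedy_decode(next_token_scores, start_token, eos_token, max_new_tokens=100):
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    tokens = [start_token]
    for _ in range(max_new_tokens):
        scores = next_token_scores[tokens[-1]]
        best = max(scores, key=scores.get)  # argmax, no sampling
        tokens.append(best)
        if best == eos_token:
            break
    return tokens[1:]  # drop the start token

# Made-up scores; "<pad>" is the start token and "</s>" marks end of sequence
toy_scores = {
    "<pad>": {"bill": 0.7, "act": 0.3},
    "bill": {"lowers": 0.6, "raises": 0.4},
    "lowers": {"costs": 0.9, "</s>": 0.1},
    "costs": {"</s>": 0.8, "bill": 0.2},
}
decoded = greedy_decode(toy_scores, "<pad>", "</s>")
print(decoded)  # ['bill', 'lowers', 'costs', '</s>']
```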