# T5 (Text-To-Text Transfer Transformer)


- T5 (Text-to-Text Transfer Transformer) is a transformer-based model introduced by Google Research in the paper:
 "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (2020).

- It treats every NLP task (summarization, translation, classification, Q&A, etc.) as a text-to-text problem, meaning both input and output are strings.
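
As a rough illustration of the text-to-text framing, the sketch below turns different tasks into plain input strings by prepending a task prefix. The prefixes are the ones used in the T5 paper; the helper name `to_text_to_text` and the sample sentences are made up for illustration.

```python
# Task prefixes as used in the T5 paper; the helper and sample texts are illustrative only.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_de": "translate English to German: ",
    "acceptability": "cola sentence: ",
}


def to_text_to_text(task: str, text: str) -> str:
    """Build the single input string that a text-to-text model receives."""
    return TASK_PREFIXES[task] + text


samples = [
    ("summarization", "The T5 model is a powerful NLP model by Google."),
    ("translation_en_de", "That is good."),
    ("acceptability", "The course is jumping well."),
]

for task, text in samples:
    # Every task ends up as one string in; the model answers with one string out.
    print(repr(to_text_to_text(task, text)))
```

The expected output is also a string in every case (a summary, a German sentence, or a label word such as "acceptable"), which is what lets one model and one loss cover all of these tasks.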


### Can Handle Many NLP Tasks:

1. Summarization

2. Machine Translation

3. Text Classification

4. Question Answering

5. Sentiment Analysis

6. Named Entity Recognition (NER)

### How T5 Works

Unlike BERT (an encoder-only model trained with masked language modeling), T5 is an encoder-decoder transformer:

- The encoder processes input text.
- The decoder generates output text.

### Using t5-small for text summarization:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

text = "summarize: The T5 model is a powerful NLP model by Google."
input_ids = tokenizer(text, return_tensors="pt").input_ids

summary_ids = model.generate(input_ids)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summary)

```


### Importing:

In [1]:
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
import numpy as np



### loading dataset:

In [2]:
from datasets import load_dataset


dataset = load_dataset("csv", data_files={
    "train": r"U:\hugging_face\data\for_t5\train.csv", 
    "test": r"U:\hugging_face\data\for_t5\test.csv",
    "validation": r"U:\hugging_face\data\for_t5\validation.csv"})

# Print the dataset
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['Text', 'Summary'],
        num_rows: 557115
    })
    test: Dataset({
        features: ['Text', 'Summary'],
        num_rows: 5684
    })
    validation: Dataset({
        features: ['Text', 'Summary'],
        num_rows: 5628
    })
})


In [3]:
# Take the first 1,000 rows of each split to keep fine-tuning fast
train_subset = dataset["train"].select(range(1000))
validation_subset = dataset["validation"].select(range(1000))
test_subset = dataset["test"].select(range(1000))

In [4]:
train_subset

Dataset({
    features: ['Text', 'Summary'],
    num_rows: 1000
})

### load tokenizer:

In [5]:
from transformers import T5Tokenizer

# Load SentencePiece tokenizer into Hugging Face format
tokenizer = T5Tokenizer(vocab_file=r"U:\hugging_face\t5_tokenizer.model")

# Save in Hugging Face format
tokenizer.save_pretrained(r"U:\hugging_face\t5_tokenizer_hf")

print("Tokenizer successfully saved in Hugging Face format!")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Tokenizer successfully saved in Hugging Face format!


In [6]:
from transformers import T5Tokenizer

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained(r"U:\hugging_face\t5_tokenizer_hf")

print("Tokenizer loaded successfully!")

Tokenizer loaded successfully!


### Tokenize datasets

In [7]:
import pandas as pd

train = pd.read_csv(r"data/for_t5/train.csv")
text = train["Text"]
summary = train["Summary"]

# Longest source and target, measured in characters (not tokens).
# The tokenizer's max_length is counted in tokens, so these values are only a rough guide.
max_source = max(len(item) for item in text)
max_target = max(len(item) for item in summary)

In [8]:
max_source,max_target

(21409, 128)
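
A single maximum can be dominated by one outlier, so it may help to look at how many examples a given cutoff actually covers before picking `max_length`. The sketch below is one way to do that. The helper names `percentile` and `coverage` and the toy lengths are hypothetical. For a real decision the lengths should be token counts from the tokenizer rather than character counts.

```python
def percentile(lengths, q):
    """Nearest-rank percentile of `lengths` for an integer q in 1..100."""
    ordered = sorted(lengths)
    rank = (q * len(ordered) + 99) // 100  # ceil(q * n / 100) using integers
    return ordered[rank - 1]


def coverage(lengths, cutoff):
    """Fraction of examples whose length is at most `cutoff`."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)


# Hypothetical lengths with one very long outlier
toy_lengths = [120, 340, 95, 20000, 410, 280, 150, 500, 60, 330]

print(percentile(toy_lengths, 90))   # 500
print(percentile(toy_lengths, 100))  # 20000
print(coverage(toy_lengths, 500))    # 0.9
```

In this toy data a cutoff of 500 keeps 90% of the examples untruncated, while the maximum is forty times larger.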

In [9]:
def preprocess_function(examples):
    # T5 expects a task prefix on every input
    inputs = ["summarize: " + doc for doc in examples["Text"]]
    model_inputs = tokenizer(inputs, max_length=500, truncation=True, padding=True)

    # Tokenize the targets; `text_target` replaces the deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=examples["Summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [10]:
tokenized_train = train_subset.map(preprocess_function, batched=True)
tokenized_validation = validation_subset.map(preprocess_function, batched=True)
tokenized_test = test_subset.map(preprocess_function, batched=True)

In [11]:
tokenized_train.save_to_disk(r'U:\hugging_face\data\for_t5\tokenized_data\tokenized_train')
tokenized_validation.save_to_disk(r'U:\hugging_face\data\for_t5\tokenized_data\tokenized_validation')
tokenized_test.save_to_disk(r'U:\hugging_face\data\for_t5\tokenized_data\tokenized_test')

Saving the dataset (1/1 shards): 100%|██████████| 1000/1000 [00:00<00:00, 135821.51 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1000/1000 [00:00<00:00, 235291.37 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1000/1000 [00:00<00:00, 156189.17 examples/s]


### Load model:

In [12]:
# checkpoint = 't5-small'

# model = T5ForConditionalGeneration.from_pretrained(checkpoint)

In [13]:
# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer)
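
`DataCollatorForSeq2Seq` pads each batch dynamically and, by default, pads the labels with `-100` so that padded positions are ignored by the loss. The pure-Python sketch below mimics only that label-padding step. The helper `pad_labels` and the token ids are made up, and the real collator also pads `input_ids` and `attention_mask` and returns tensors.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_labels(label_batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_batch]


# Made-up token ids for three target summaries of different lengths
batch = [[71, 1712, 1], [8, 1], [37, 423, 19, 1]]

for row in pad_labels(batch):
    print(row)
# [71, 1712, 1, -100]
# [8, 1, -100, -100]
# [37, 423, 19, 1]
```

This is also why label ids have to be mapped from `-100` back to a real token id before they can be decoded for metric computation.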

In [14]:
import evaluate
rouge = evaluate.load("rouge")

Using the latest cached version of the module from C:\Users\Naruto\.cache\huggingface\modules\evaluate_modules\metrics\evaluate-metric--rouge\b01e0accf3bd6dd24839b769a5fda24e14995071570870922c71970b3a6ed886 (last modified on Wed Apr  2 19:26:01 2025) since it couldn't be found locally at evaluate-metric--rouge, or remotely on the Hugging Face Hub.


In [15]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    labels_ids = pred.label_ids

    # Generated sequences may be padded with -100 when batches are gathered; map those to the pad token,
    # and map any id outside the tokenizer vocabulary to the unknown token
    pred_ids = np.where(pred_ids < 0, tokenizer.pad_token_id, pred_ids)
    pred_ids = np.where(pred_ids < tokenizer.vocab_size, pred_ids, tokenizer.unk_token_id)

    # Decode the predictions and labels (-100 marks ignored label positions)
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    # Compute ROUGE scores between generated and reference summaries
    return rouge.compute(predictions=pred_str, references=label_str)

In [16]:
# Seq2Seq training arguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",             # Directory to save model checkpoints and logs
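    evaluation_strategy="epoch",        # Evaluate at the end of each epoch (named `eval_strategy` in newer transformers releases)
    eval_steps=100,                     # Only used when the evaluation strategy is "steps"; ignored here
    logging_steps=100,                  # Log the training loss every 100 optimizer steps
    logging_dir="./logs",               # Directory to save logs
    report_to="all",                    # Report to every installed logging integration (use "tensorboard" to pick one)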
    save_strategy="epoch",
    learning_rate=1e-5,                 # Learning rate for the optimizer
    per_device_train_batch_size=16,     # Batch size for training
    per_device_eval_batch_size=16,      # Batch size for evaluation
    weight_decay=0.01,                  # Weight decay for regularization
    save_total_limit=3,                 # Limit the total number of checkpoints saved
    num_train_epochs=3,                 # Number of training epochs
    predict_with_generate=True,         # Use generation mode for prediction
    generation_max_length=150,          # Maximum length for generated sequences
    generation_num_beams=6,             # Number of beams for beam search during generation
    load_best_model_at_end=True,        # Whether to load the best model found at each evaluation.
    metric_for_best_model="loss",       # Use loss to evaluate best model.
    greater_is_better=False,            # Best model is the one with the lowest loss, not highest.
    logging_first_step=True,            # Log the first training step
    # label_smoothing_factor=0.1          # Helps prevent overconfidence
)
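
To see how these settings interact with the 1,000-example training subset, the arithmetic below estimates the number of optimizer steps. It is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the last partial batch is kept. The function name `training_steps` is hypothetical.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = training_steps(num_examples=1000, batch_size=16, num_epochs=3)
print(steps_per_epoch)  # 63
print(total_steps)      # 189
```

With only 189 steps in total, `logging_steps=100` produces very few training-loss entries, so a smaller value may be more informative for a run of this size.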



In [17]:
def model_init(checkpoint = 't5-small'):
    return T5ForConditionalGeneration.from_pretrained(checkpoint)

In [18]:
## Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=T5ForConditionalGeneration.from_pretrained('t5-small'),                  # The model to be trained
    args=training_args,                # Training arguments defined with Seq2SeqTrainingArguments
    train_dataset=tokenized_train,     # The training dataset
    eval_dataset=tokenized_validation, # The evaluation dataset
    data_collator=data_collator,       # The data collator for processing data batches
    tokenizer=tokenizer,               # The tokenizer used for preprocessing
    compute_metrics=compute_metrics,   # The function to compute evaluation metrics
)



In [19]:
# # Start TensorBoard before training to monitor it in progress
# checkpoint = 't5-small'
# model_dir = f"U:\hugging_face\data\for_t5\{checkpoint}"

# %load_ext tensorboard
# %tensorboard --logdir '{model_dir}'/runs

In [None]:
# Train the model
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss


In [None]:
# # Evaluate the model on validation set
# trainer.evaluate()

# # Evaluate the model on test set
# test_results = trainer.evaluate(eval_dataset=tokenized_test)

# print(test_results)

In [None]:
# test_subset[0]

In [None]:
import torch

# Select a specific data point from the test dataset
test_index = 0  # Change this index to the data point you want to summarize

# Use the fine-tuned model held by the trainer and switch it to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = trainer.model.to(device)
model.eval()

# Preprocess the input text
input_text = "summarize: " + test_subset[test_index]["Text"]
inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True).to(device)

# Generate the summary with beam search
summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Original Text:\n", input_text)
print("\nGenerated Summary:\n", summary)


# ROUGE Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics that score a generated summary by comparing it to one or more reference summaries (typically human-written). It is widely used for summarization tasks in natural language processing. Each variant measures a different kind of overlap between the candidate and the reference. The main ROUGE metrics are:

## ROUGE-N
Measures the overlap of n-grams between the candidate summary and the reference summary. The most common versions are ROUGE-1 (unigrams) and ROUGE-2 (bigrams).

### ROUGE-1
Counts the overlap of single words.
- **ROUGE-1 Recall**:
  $$
  \text{ROUGE-1 Recall} = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in reference summary}}
  $$
- **ROUGE-1 Precision**:
  $$
  \text{ROUGE-1 Precision} = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in candidate summary}}
  $$
- **ROUGE-1 F1-Score**:
  $$
  \text{ROUGE-1 F1-Score} = 2 \times \frac{\text{ROUGE-1 Recall} \times \text{ROUGE-1 Precision}}{\text{ROUGE-1 Recall} + \text{ROUGE-1 Precision}}
  $$

**Example Calculation for ROUGE-1:**

Given a reference summary "The cat sat on the mat." and a candidate summary "The cat is on the mat.", calculate ROUGE-1:
- Unigrams in Reference: {The, cat, sat, on, the, mat}
- Unigrams in Candidate: {The, cat, is, on, the, mat}
- Overlap: {The, cat, on, the, mat}
- Recall: $ \frac{5}{6} \approx 0.833 $
- Precision: $ \frac{5}{6} \approx 0.833 $
- F1-Score: $ 2 \times \frac{5/6 \times 5/6}{5/6 + 5/6} = \frac{5}{6} \approx 0.833 $

### ROUGE-2
Counts the overlap of two-word sequences.
- **ROUGE-2 Recall**:
  $$
  \text{ROUGE-2 Recall} = \frac{\text{Number of overlapping bigrams}}{\text{Total bigrams in reference summary}}
  $$
- **ROUGE-2 Precision**:
  $$
  \text{ROUGE-2 Precision} = \frac{\text{Number of overlapping bigrams}}{\text{Total bigrams in candidate summary}}
  $$
- **ROUGE-2 F1-Score**:
  $$
  \text{ROUGE-2 F1-Score} = 2 \times \frac{\text{ROUGE-2 Recall} \times \text{ROUGE-2 Precision}}{\text{ROUGE-2 Recall} + \text{ROUGE-2 Precision}}
  $$

**Example Calculation for ROUGE-2:**

Using the same reference and candidate summaries:
- Bigrams in Reference: {The cat, cat sat, sat on, on the, the mat}
- Bigrams in Candidate: {The cat, cat is, is on, on the, the mat}
- Overlap: {The cat, on the, the mat}
- Recall: $ \frac{3}{5} = 0.600 $
- Precision: $ \frac{3}{5} = 0.600 $
- F1-Score: $ 2 \times \frac{0.6 \times 0.6}{0.6 + 0.6} = 0.600 $
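
The ROUGE-1 and ROUGE-2 calculations above can be reproduced with a few lines of Python. This is a minimal sketch, not the implementation behind `evaluate.load("rouge")`: the lowercasing, the regex tokenizer, and the function names are assumptions made for illustration, and library implementations may tokenize or stem differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens, n):
    """Count every n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n):
    ref = ngram_counts(tokenize(reference), n)
    cand = ngram_counts(tokenize(candidate), n)
    overlap = sum((ref & cand).values())  # clipped counts: min of the two per n-gram
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


reference = "The cat sat on the mat."
candidate = "The cat is on the mat."

print({k: round(v, 3) for k, v in rouge_n(reference, candidate, 1).items()})
# {'recall': 0.833, 'precision': 0.833, 'f1': 0.833}
print({k: round(v, 3) for k, v in rouge_n(reference, candidate, 2).items()})
# {'recall': 0.6, 'precision': 0.6, 'f1': 0.6}
```

Using `Counter` intersection keeps repeated words honest: an n-gram that appears twice in the candidate but once in the reference is only credited once.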

## ROUGE-L
Measures the longest common subsequence (LCS) between the candidate and reference summaries: the longest sequence of words that appears in both in the same order, though not necessarily contiguously. This rewards summaries that preserve sentence-level word order. In the formulas below, LCS denotes the length of that subsequence in words.
- **ROUGE-L Recall**:
  $$
  \text{ROUGE-L Recall} = \frac{\text{LCS}}{\text{Total words in reference summary}}
  $$
- **ROUGE-L Precision**:
  $$
  \text{ROUGE-L Precision} = \frac{\text{LCS}}{\text{Total words in candidate summary}}
  $$
- **ROUGE-L F1-Score**:
  $$
  \text{ROUGE-L F1-Score} = 2 \times \frac{\text{ROUGE-L Recall} \times \text{ROUGE-L Precision}}{\text{ROUGE-L Recall} + \text{ROUGE-L Precision}}
  $$

**Example Calculation for ROUGE-L:**

Using the same reference and candidate summaries:
- LCS: "The cat on the mat" (length 5)
- Recall: $ \frac{5}{6} \approx 0.833 $
- Precision: $ \frac{5}{6} \approx 0.833 $
- F1-Score: $ 2 \times \frac{0.833 \times 0.833}{0.833 + 0.833} \approx 0.833 $
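
The ROUGE-L example can be checked with a standard dynamic-programming LCS. As before, this is a sketch under stated assumptions (lowercased regex tokenization, hypothetical function names) rather than the code used by the `evaluate` library.

```python
import re


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the match found so far
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token on one side
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    ref, cand = tokenize(reference), tokenize(candidate)
    lcs = lcs_length(ref, cand)
    recall = lcs / max(len(ref), 1)
    precision = lcs / max(len(cand), 1)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"lcs": lcs, "recall": recall, "precision": precision, "f1": f1}


scores = rouge_l("The cat sat on the mat.", "The cat is on the mat.")
print(scores["lcs"])           # 5
print(round(scores["f1"], 3))  # 0.833
```

Only two rows of the table are kept at any time, so memory grows with the length of one summary rather than with the product of both lengths.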
