# T5 for Text summarization (Pytorch)

Text summarization is one of the most important NLP applications. This is a very difficult tasks that poses several challenges such as identifying the important content and generate a summary.

In this notebook, we will fine-tune the pre-trained T5 for the task of text summarization. T5 has a encoder-decoder architecture. We will use the XSum dataset from Hugging Face Datasets.

Unlike the previous notebook where we fine-tune a T5 model for this task on tensorflow, **the framework used will be Pytorch** in this notebook.

Source: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb#scrollTo=imY1oC3SIrJf


In [1]:
!pip install transformers datasets rouge-score keras_nlp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


To ignore warning, please run the following cell:

In [2]:
import transformers
print(transformers.__version__)

4.24.0


If you want that warnings are not printed, please run this cell:

In [3]:
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
# ignore warning about deprecation
o_deprecation_warning=True


## Data
we use the dataset xsum that consists of 226,711 news BBC articles accompanied with a one-sentence summary. The articles covers a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts). 

The official random split contains 204,045 (90%), 11,332 (5%) and 11,334 (5) documents in training, validation and test sets, respectively.

As the dataset is very large, we will use a smaller sample to run this notebook during the class:

In [4]:
from datasets import load_dataset

REDUCE_DATA = True

if REDUCE_DATA:
    # we only load a smaller sample of the dataset for training a summarizer during this class 
    dataset = load_dataset("xsum", split='train[:1%]').shuffle(seed=42)
    # As we only got a smaller sample from the traing split, we need to create the splits
    dataset = dataset.train_test_split(test_size=0.2, shuffle=False)
    SIZE_TEST= 10
    dataset["validation"] = dataset["test"].select(range(SIZE_TEST,dataset["test"].num_rows))
    # we only get SIZE_TEST for test
    dataset["test"] = dataset["test"].select(range(SIZE_TEST))
else:
    # this loads the full dataset; in this case, we don't have to create the splits, because it already contains them. 
    dataset = load_dataset("xsum")

dataset



DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 1632
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 398
    })
})

We show some instances. We should always obtain the same ids if we set the seed to 42

In [5]:
print(dataset['train'][0]['id'])  #36884862     (if the dataset was reduced)
print(dataset['validation'][0]['id']) #27929646 (if the dataset was reduced)
print(dataset['test'][0]['id']) # 34493630 (if the dataset was reduced)


36884862
36219003
34493630


### Tokenization

In [6]:
PREFIX='summarize: '
MAX_INPUT_LENGTH = 1043  #  Maximum length of the input to the model. Use 1024 when Transformers v5.
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model

In [7]:
from transformers import AutoTokenizer

model_name = 't5-small'
# we must instanciate the tokenizer using model_max_length to increase the maximu length of the model from 512 to 
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=MAX_INPUT_LENGTH)
# print(tokenizer.model_max_length)

def tokenize(examples):
    """For each example in the dataset examples, the function will tokenize the input document 
    but also the expected output, that is, its summary. This will be saved into a new field of the dataset with 
    the name 'labels'. We only need to save the input_ids of the summary."""
    inputs = [PREFIX + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, padding=True, truncation=True, return_tensors="pt").to('cuda')

    # Setup the tokenizer for targets
    # with tokenizer.as_target_tokenizer():
    labels = tokenizer(text_target=examples["summary"], max_length=MAX_TARGET_LENGTH,  padding=True, truncation=True, return_tensors="pt").to('cuda')

    # we add a new feature labels to contain the encoded output
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

# we apply the function to the dataset for encoding it
encoded_datasets = dataset.map(tokenize, batched=True)
encoded_datasets

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1632
    })
    test: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 398
    })
})

In [8]:
encoded_datasets=encoded_datasets.remove_columns(['document', 'summary', 'id'])
encoded_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1632
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 398
    })
})

## Model (Pytorch)

Here is when the code is different to the previous notebook where we fine-tune a T5 for text summarization on tensorflow. 
Now we have to use differente classes:


### Defining model, arguments and data collator

In [9]:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to('cuda')


We will use a trainer class for Seq2Seq, so we need to set its arguments:

In [10]:
from transformers import Seq2SeqTrainingArguments

batch_size = 16
args = Seq2SeqTrainingArguments(
    output_dir='./outputs',
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)

We also need to define a data collator, in particular, one for a Seq2Seq model:


In [11]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

### Metrics for the trainer
We also have to define the function that will be used by the trainer to measure the model on the validation dataset:

In [12]:
import keras_nlp
rouge_L = keras_nlp.metrics.RougeL()

def compute_metrics(eval_predictions):
    #the predictions and the corresponding reference labels
    predictions, labels = eval_predictions

    # we have to decode the predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # we also have to decode the reference labels
    # first, we replace those labels <0 with the token id for padding
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    # we now decode the reference labels
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # we calculate rouge_L comparing the decoded labels and the decoded prediction
    result = rouge_L(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    # return metric.compute(decoded_labels, decoded_predictions)
    return result

### Trainer 

Now, we can define the trainer object by using the *Seq2SeqTrainer* class:

In [13]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Using cuda_amp half precision backend


Finally, we train:

### Evaluation on the validation dataset
We evaluate eth

In [14]:
trainer.train()

***** Running training *****
  Num examples = 1632
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 102
  Number of trainable parameters = 60506624


Epoch,Training Loss,Validation Loss,Rougel
1,No log,3.62566,"tf.Tensor(0.107309885, shape=(), dtype=float32)"


***** Running Evaluation *****
  Num examples = 398
  Batch size = 16
Trainer is attempting to log a value of "0.10730988532304764" of type <class 'tensorflow.python.framework.ops.EagerTensor'> for key "eval/RougeL" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=102, training_loss=7.510858273973652, metrics={'train_runtime': 44.3155, 'train_samples_per_second': 36.827, 'train_steps_per_second': 2.302, 'total_flos': 449952277856256.0, 'train_loss': 7.510858273973652, 'epoch': 1.0})

In [15]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 398
  Batch size = 16


Trainer is attempting to log a value of "0.1073099821805954" of type <class 'tensorflow.python.framework.ops.EagerTensor'> for key "eval/RougeL" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 3.6256604194641113,
 'eval_RougeL': <tf.Tensor: shape=(), dtype=float32, numpy=0.10730998>,
 'eval_runtime': 11.5393,
 'eval_samples_per_second': 34.491,
 'eval_steps_per_second': 2.167,
 'epoch': 1.0}

## Evaluation


### Inference
You can directly use the model to generate the summary for some text from the test dataset (or any another text). To do this, we create a pipeline object containing the model and the tokenizer.

In [16]:
from transformers import pipeline
MIN_TARGET_LENGTH = 5
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="pt", device=0)

summarizer(
    dataset["test"][0]["document"],
    min_length=5,
    max_length=120,
    # max_new_tokens=MAX_TARGET_LENGTH,
)

[{'summary_text': "Virgil van Dijk's first goal for 18 months gave the hosts the lead . he doubled the lead with a header from Dusan Tadic's corner . the 28-year-old is the fourth englishman to score in six consecutive matches this season ."}]

In [17]:
dataset["test"][0]["summary"]

'Premier League top scorer Jamie Vardy scored twice as Leicester came from 2-0 down to draw at Southampton.'

In [18]:
dataset["test"][0]["document"]

'Jose Fonte\'s first goal for 18 months gave the hosts the lead, glancing in a header from Dusan Tadic\'s corner.\nVirgil van Dijk earlier saw a header cleared off the line, but he doubled the lead with a close-range prod.\nVardy headed the Foxes back into the match, before blasting home his ninth of the season in injury time to keep the Foxes in fifth.\nRelive the match action here\nAll the Premier League action and reaction\nNot judging by their second-half display.\nThe Foxes have scored in every Premier League match this season and, sparked into life by the half-time introduction of forwards Riyad Mahrez and Nathan Dyer, they earned an unlikely point with a stunning final 45 minutes.\nSouthampton were in complete control at half-time but, helped by the trickery of Mahrez and the clinical finishing of Vardy, the Foxes again showed they should never be ruled out.\nThe draw is the seventh point Leicester have earned from a losing position this season.\nIt would be very hard to leave t

### Results on the test dataset
We also want to provide some final scores about our model on the test dataset

In [19]:
generated_summaries =summarizer(dataset["test"]["document"], truncation=True, min_length=MIN_TARGET_LENGTH, max_length=MAX_TARGET_LENGTH)
generated_summaries=[example['summary_text'] for example in generated_summaries]

result = rouge_L(dataset["test"]["summary"], generated_summaries)

Disabling tokenizer parallelism, we're using DataLoader multithreading already
Your max_length is set to 128, but you input_length is only 91. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)


In [20]:
import tensorflow as tf
#print("rouge-L:", result['precision'], result['recall'], result['f1_score'])
print("rouge-L -  Precision:", tf.get_static_value(result['precision']), ", Recal: ", tf.get_static_value(result['recall']), ", f1-score:", tf.get_static_value(result['f1_score']))

rouge-L -  Precision: 0.14080042 , Recal:  0.090049334 , f1-score: 0.10729898
