# T5 for Text Summarization (TensorFlow)

Text summarization is one of the most important NLP applications. It is a difficult task that poses several challenges, such as identifying the important content and generating a concise summary.

In this notebook, we will fine-tune the pre-trained T5 model for the task of text summarization. T5 has an encoder-decoder architecture. We will use the XSum dataset from Hugging Face Datasets.

**The model will be fine-tuned using the TensorFlow framework**.

In [19]:
!pip install transformers datasets rouge-score keras_nlp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Check which version of Transformers is installed:

In [20]:
import transformers
print(transformers.__version__)

4.24.0


To suppress advisory warnings, run this cell:

In [21]:
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'

## Data
We use the XSum dataset, which consists of 226,711 BBC news articles, each accompanied by a one-sentence summary. The articles cover a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts).

The official random split contains 204,045 (90%), 11,332 (5%) and 11,334 (5%) documents in the training, validation and test sets, respectively.
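
As a quick sanity check, the split sizes quoted above can be added up and converted to percentages. This is only an arithmetic sketch based on the figures in the text; the variable names are illustrative.

```python
# Split sizes quoted in the text above
split_sizes = {"train": 204_045, "validation": 11_332, "test": 11_334}

# The three splits should add up to the full dataset size
total = sum(split_sizes.values())
print("total:", total)

# Share of each split, formatted as a percentage
for name, size in split_sizes.items():
    print(f"{name}: {size} ({size / total:.1%})")
```

The shares are only approximately 90/5/5, which is why the validation and test sets differ by two documents.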

As the dataset is very large, we will use a smaller sample to run this notebook during the class:

In [22]:
from datasets import load_dataset

REDUCE_DATA = True

if REDUCE_DATA:
    # we only load a small sample of the dataset to train a summarizer during this class
    dataset = load_dataset("xsum", split='train[:1%]').shuffle(seed=42)

    # As we only took a small sample from the training split, we need to create the splits ourselves
    dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

    SIZE_TEST = 10  # number of examples for test
    dataset["validation"] = dataset["test"].select(range(SIZE_TEST, dataset["test"].num_rows))
    # we only keep the first SIZE_TEST examples for test
    dataset["test"] = dataset["test"].select(range(SIZE_TEST))
else:
    # this loads the full dataset; in this case, we don't have to create the splits, because it already contains them. 
    dataset = load_dataset("xsum")

dataset



DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 1632
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 398
    })
})
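
The row counts above can be reproduced with plain arithmetic. The sketch below assumes that the 1% slice of the training split yields 2,040 rows (the sum of the three counts in the output) and mimics the split and select steps with integer ranges. It does not load any data.

```python
SAMPLE_ROWS = 2040   # assumption: rows in train[:1%], inferred from the output above
TEST_FRACTION = 0.2  # same value as test_size in the cell above
SIZE_TEST = 10       # same value as in the cell above

# train_test_split(test_size=0.2): 20% is held out, 80% is kept for training
held_out = round(SAMPLE_ROWS * TEST_FRACTION)
train_rows = SAMPLE_ROWS - held_out

# The first SIZE_TEST held-out rows become the test set, the rest validation
test_rows = len(range(SIZE_TEST))
validation_rows = len(range(SIZE_TEST, held_out))

print(train_rows, validation_rows, test_rows)
```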

We print the id of the first instance in each split. With the seed set to 42, the same ids should always be obtained.

In [23]:
print(dataset['train'][0]['id'])       # 36884862 (if the dataset was reduced)
print(dataset['validation'][0]['id'])  # 36219003 (if the dataset was reduced)
print(dataset['test'][0]['id'])        # 34493630 (if the dataset was reduced)


36884862
36219003
34493630


### Tokenization

In [24]:
PREFIX='summarize: '
MAX_INPUT_LENGTH = 1043  # Maximum length of the input to the model. Use 1024 with Transformers v5.
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model

In [25]:
from transformers import AutoTokenizer

model_name = 't5-small'
# we must instantiate the tokenizer with model_max_length to increase the maximum length of the model from 512 to MAX_INPUT_LENGTH
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=MAX_INPUT_LENGTH)
# print(tokenizer.model_max_length)

def tokenize(examples):
    """For each example in the dataset examples, the function will tokenize the input document 
    but also the expected output, that is, its summary. This will be saved into a new field of the dataset with 
    the name 'labels'. We only need to save the input_ids of the summary."""
    inputs = [PREFIX + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    # with tokenizer.as_target_tokenizer():
    labels = tokenizer(text_target=examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True)

    # we add a new feature labels to contain the encoded output
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

# we apply the function to the dataset for encoding it
encoded_datasets = dataset.map(tokenize, batched=True)
encoded_datasets




DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1632
    })
    test: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 398
    })
})
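
To make the structure of the encoded examples concrete, here is a toy sketch of the same steps: prefixing the document, truncating it, and storing the summary ids under labels. It uses a hypothetical whitespace tokenizer with a vocabulary built on the fly instead of the real T5 tokenizer, so the ids are illustrative only and no special tokens are added.

```python
PREFIX = "summarize: "
MAX_INPUT_LENGTH = 8    # tiny limits so that truncation is visible
MAX_TARGET_LENGTH = 4

vocab = {}  # hypothetical vocabulary, filled on the fly


def toy_encode(text, max_length):
    """Map each whitespace-separated token to an integer id, then truncate."""
    ids = [vocab.setdefault(token, len(vocab) + 1) for token in text.split()]
    return ids[:max_length]


def toy_tokenize(examples):
    """Mimic the structure returned by tokenize(): encoded inputs plus labels."""
    inputs = [PREFIX + doc for doc in examples["document"]]
    input_ids = [toy_encode(text, MAX_INPUT_LENGTH) for text in inputs]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
        "labels": [toy_encode(s, MAX_TARGET_LENGTH) for s in examples["summary"]],
    }


batch = {
    "document": ["the quick brown fox jumps over the lazy dog again and again"],
    "summary": ["a fox jumps over a dog"],
}
encoded = toy_tokenize(batch)
print(encoded)
```

Both the document and the summary are cut to their respective maximum lengths, and only the ids of the summary are kept as labels.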

## Model

We load the model.
We also have to define a data collator to pass the input data to the model. The default data collator is DataCollatorWithPadding, which is used for text classification. It is not suitable for seq2seq tasks, because it pads only the inputs and not the labels.

In [26]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
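
A seq2seq collator has to pad the labels as well as the inputs. The sketch below is a simplified pure-Python illustration of that idea, not the library implementation: inputs are padded with the pad token id and labels with -100, the default label padding value of DataCollatorForSeq2Seq, which the loss ignores. The function name pad_seq2seq_batch and the example ids are hypothetical.

```python
PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # positions with this label are ignored by the loss


def pad_seq2seq_batch(features):
    """Pad input_ids, attention_mask and labels to the longest item in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_in = max_input - len(f["input_ids"])
        pad_lab = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * pad_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * pad_lab)
    return batch


features = [
    {"input_ids": [21, 7, 1], "attention_mask": [1, 1, 1], "labels": [5, 1]},
    {"input_ids": [21, 8, 9, 4, 1], "attention_mask": [1] * 5, "labels": [6, 7, 3, 1]},
]
batch = pad_seq2seq_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```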


We prepare the dataset to be passed to the model:

In [27]:
BATCH_SIZE = 16

train_dataset = encoded_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)

validation_dataset = encoded_datasets["validation"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)

test_dataset = encoded_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)

print('datasets are ready!!!')

datasets are ready!!!
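
With these settings, the number of batches per split follows from the split sizes and BATCH_SIZE. The sketch below shows that arithmetic, assuming the last incomplete batch is kept; whether it is dropped depends on the drop_remainder setting of to_tf_dataset and on the installed version of Datasets.

```python
import math

BATCH_SIZE = 16
split_rows = {"train": 1632, "validation": 398, "test": 10}  # sizes of the reduced dataset

# Assumption: incomplete final batches are kept, hence the ceiling
batches = {name: math.ceil(rows / BATCH_SIZE) for name, rows in split_rows.items()}
print(batches)
```

The training split divides evenly by the batch size, so its batch count is the same whether or not incomplete batches are dropped.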


### Training the model

In [28]:
import keras

LEARNING_RATE = 2e-5  # Learning rate for training our model

optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)


No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
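
The message above says that the model computes its own loss. As a rough illustration of what such a loss does in seq2seq training, the sketch below computes a token-level cross-entropy that skips label positions equal to -100, the value with which DataCollatorForSeq2Seq pads labels by default. It is a simplified pure-Python stand-in, not the Transformers implementation, and the function name is hypothetical.

```python
import math

IGNORE_INDEX = -100  # label value marking padded positions


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over the positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padded position: contributes nothing to the loss
        # numerically stable log-sum-exp of the logits
        max_logit = max(token_logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in token_logits)
        )
        total += log_sum_exp - token_logits[label]
        count += 1
    return total / count if count else 0.0


# Two positions over a 4-token vocabulary; the second one is padding
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 1.0, 1.0, 1.0]]
labels = [1, IGNORE_INDEX]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Because the padded position is skipped, the loss here depends only on the first position, where uniform logits over four tokens give a loss of ln 4.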


In [29]:
import keras_nlp
rouge_L = keras_nlp.metrics.RougeL()


In [30]:
def compute_metric(eval_predictions):
    #the predictions and the corresponding reference labels
    predictions, labels = eval_predictions

    # we have to decode the predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # we also have to decode the reference labels
    # first, we replace those labels <0 with the token id for padding
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    # we now decode the reference labels
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # we calculate rouge_L comparing the decoded labels and the decoded prediction
    result = rouge_L(decoded_labels, decoded_predictions)
    # We return only the F1 score; the result also contains precision and recall
    result = {"RougeL": result["f1_score"]}

    # return metric.compute(decoded_labels, decoded_predictions)
    return result
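
For intuition about what ROUGE-L measures, here is a minimal pure-Python sketch based on the longest common subsequence (LCS) of the reference and the prediction. It assumes simple lowercased whitespace tokenization, which is a simplification: the library metric uses its own tokenization and may give slightly different numbers. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference, prediction):
    """ROUGE-L precision, recall and F1 for a single reference/prediction pair."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1_score": 0.0}
    precision = lcs / len(pred_tokens)  # share of the prediction that matches
    recall = lcs / len(ref_tokens)      # share of the reference that is covered
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1_score": f1}


scores = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

In this example the LCS is "the cat on the mat" (5 tokens), and both sentences have 6 tokens, so precision, recall and F1 are all 5/6.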
    

Finally, we can train the model. We define a callback that computes the ROUGE-L metric after each epoch. The metric is calculated on validation_dataset.

To run this notebook during the class session, we will train for only three epochs. However, we recommend training the model on the full dataset for at least 5 epochs (though you may need Google Colab Pro to do so).

In [31]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    compute_metric, eval_dataset=validation_dataset, 
    predict_with_generate=True, label_cols=['labels'])

MAX_EPOCHS = 3 # we recommend at least 5 epochs

model.fit(train_dataset, validation_data=validation_dataset, epochs=MAX_EPOCHS, callbacks=[metric_callback])

Epoch 1/3



Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fe4c3a2d9d0>

## Inference
You can directly use the model to generate the summary for some text from the test dataset (or any other text). To do this, we create a pipeline object containing the model and the tokenizer.

In [32]:
from transformers import pipeline
MIN_TARGET_LENGTH = 5
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    dataset["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
    # max_new_tokens=MAX_TARGET_LENGTH,
)

[{'summary_text': 'Leicester drew 2-2 with Leicester in the Premier League on saturday in the second half of the season.'}]

In [33]:
dataset["test"][0]["summary"]

'Premier League top scorer Jamie Vardy scored twice as Leicester came from 2-0 down to draw at Southampton.'

## Evaluation
We also want to report final scores for our model on the test dataset. First, we use the pipeline to generate a summary for each text in the test dataset.

In [34]:
generated_summaries = summarizer(dataset["test"]["document"], truncation=True, min_length=MIN_TARGET_LENGTH, max_length=MAX_TARGET_LENGTH)
generated_summaries

Your max_length is set to 128, but you input_length is only 91. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)


In [35]:
# we collect the generated summaries in a list
generated_summaries = [example['summary_text'] for example in generated_summaries]
# we calculate the ROUGE-L metric:
result = rouge_L(dataset["test"]["summary"], generated_summaries)

We print the final results:

In [36]:
import tensorflow as tf
#print("rouge-L:", result['precision'], result['recall'], result['f1_score'])
print("rouge-L -  Precision:", tf.get_static_value(result['precision']), ", Recall: ", tf.get_static_value(result['recall']), ", f1-score:", tf.get_static_value(result['f1_score']))

rouge-L -  Precision: 0.19914204 , Recall:  0.1299356 , f1-score: 0.15438351
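
Note that the reported f1-score is not the harmonic mean of the reported precision and recall. One plausible explanation (an assumption, not something the notebook states) is that the metric averages the per-example scores, and the mean of per-example F1 values generally differs from the F1 of the mean precision and recall. The sketch below checks the reported numbers and shows the effect on two made-up examples.

```python
def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values printed by the evaluation cell above
REPORTED_P, REPORTED_R, REPORTED_F1 = 0.19914204, 0.1299356, 0.15438351
f1_of_means = harmonic_f1(REPORTED_P, REPORTED_R)
print("F1 of the reported means:", round(f1_of_means, 4), "reported F1:", REPORTED_F1)

# Made-up per-example (precision, recall) pairs to illustrate the difference
per_example = [(0.5, 0.1), (0.1, 0.5)]
mean_p = sum(p for p, _ in per_example) / len(per_example)
mean_r = sum(r for _, r in per_example) / len(per_example)
mean_of_f1 = sum(harmonic_f1(p, r) for p, r in per_example) / len(per_example)

print("toy F1 of means:", round(harmonic_f1(mean_p, mean_r), 4))
print("toy mean of F1s:", round(mean_of_f1, 4))
```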
