## Summarization:
We use a Transformer model to create summaries (TL;DR) of the 13F reports, a task known as text summarization. The aim is to relieve domain experts of the burden of reading each document in full.

In [1]:
%pip install datasets evaluate transformers[sentencepiece]
%pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
%pip install git-lfs

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)

In [2]:
!git config --global user.email "annajvk@gmail.com"
!git config --global user.name "jvk36"

In [3]:
from huggingface_hub import notebook_login

notebook_login()


## PREPARING THE CORPUS:

We’ll use the labeled 13F reports dataset on the Hugging Face Hub, jkv53/13F_Reports_with_labels, which pairs each report with a reference summary:

In [4]:
from datasets import load_dataset

raw_reports_dataset = load_dataset("jkv53/13F_Reports_with_labels", split="train")
raw_reports_dataset

Dataset({
    features: ['title', 'body', 'label'],
    num_rows: 1113
})

In [5]:
# We have 1,113 report/summary pairs, but in one single split, so we will need to create
# our own train and test splits.

split_datasets = raw_reports_dataset.train_test_split(train_size=0.9, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['title', 'body', 'label'],
        num_rows: 1001
    })
    test: Dataset({
        features: ['title', 'body', 'label'],
        num_rows: 112
    })
})
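
The split sizes above can be checked by hand. The following is a plain-Python sketch of a shuffled 90/10 split, not the implementation used by 🤗 Datasets; the helper name `simple_train_test_split` is hypothetical, and the rounding rule (floor of the train fraction) is an assumption chosen because it reproduces the sizes shown.

```python
import math
import random


def simple_train_test_split(num_rows, train_size=0.9, seed=20):
    """Illustrative shuffled split over row indices (hypothetical helper)."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the train split takes the floor of the requested fraction
    n_train = math.floor(train_size * num_rows)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = simple_train_test_split(1113)
print(len(train_idx), len(test_idx))
```

Under this assumption, 1,113 rows give 1,001 training rows and 112 test rows, and no row appears in both splits.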

There are ~1,000 reports for the train split, and ~100 reports for the test split. The report information we are interested in is contained in the body and label columns. We look at a few examples by creating a simple function that takes a random sample from the training set:

In [6]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Label: {example['label']}'")
        print(f"'>> Review: {example['body']}'")


show_samples(split_datasets)


'>> Label: "Philippe Laffont’s 13F portfolio value increased from $10.38B to $11.09B this quarter.Coatue Management added Shopify & increased Alibaba while reducing Bank of America & dropping JPMorgan Chase.The largest three positions are Facebook, Liberty Broadband, and Alibaba Group Holdings and they add up to ~22% of the portfolio."'
'>> Review: "Philippe Laffont’s 13F portfolio value increased ~7% from $10.38B to $11.09B. Recent 13F reports have shown a total of around 50 individual stock positions in the portfolio. The largest five stakes are Facebook Inc. (FB), Liberty Broadband (LBRDK), Alibaba Group Holdings (BABA), Activision Blizzard (ATVI), and Broadcom Ltd. (AVGO) and they add up to over one-third of the entire portfolio.Philippe Laffont is one of the most successful among the "tiger cubs". To know more about Julian Robertson and his legendary Tiger Management, check out .Below is a summary:Apple Inc. (AAPL): AAPL is a 3.76% of the portfolio stake. It was established in Q3

## MODELS FOR TEXT SUMMARIZATION:

Text summarization is a similar sort of task to machine translation: we have a body of text like a report that we’d like to “translate” into a shorter version that captures the salient features of the input. Accordingly, most Transformer models for summarization adopt the encoder-decoder architecture, although there are some exceptions like the GPT family of models which can also be used for summarization in few-shot settings. The following list covers some popular pretrained models that can be fine-tuned for summarization.

1. GPT-2: Although trained as an auto-regressive language model, you can make GPT-2 generate summaries by appending “TL;DR” at the end of the input text.

2. PEGASUS: Uses a pretraining objective to predict masked sentences in multi-sentence texts. This pretraining objective is closer to summarization than vanilla language modeling and scores highly on popular benchmarks.

3. T5: A universal Transformer architecture that formulates all tasks in a text-to-text framework; e.g., the input format for the model to summarize a document is summarize: ARTICLE.

4. mT5: A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages.

5. BART: A novel Transformer architecture with both an encoder and a decoder stack trained to reconstruct corrupted input that combines the pretraining schemes of BERT and GPT-2.

6. mBART-50: A multilingual version of BART, pretrained on 50 languages.


We’ll focus on mT5, an interesting architecture based on T5 that was pretrained in a text-to-text framework. In T5, every NLP task is formulated in terms of a prompt prefix like summarize: which conditions the model to adapt the generated text to the prompt. This makes T5 extremely versatile, as you can solve many tasks with a single model!
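
To make the text-to-text idea concrete, here is a minimal sketch of how a task prefix is attached to an input string. The helper `add_task_prefix` is hypothetical and only illustrates the input format; note that the preprocessing used later in this notebook passes the report body to the tokenizer without a prefix.

```python
def add_task_prefix(text, task="summarize"):
    """Prepend a T5-style task prefix (hypothetical helper for illustration)."""
    return f"{task}: {text.strip()}"


examples = [
    add_task_prefix("Coatue Management added Shopify and increased Alibaba."),
    add_task_prefix("How are you?", task="translate English to German"),
]
for example in examples:
    print(example)
```

The same model receives both strings; only the prefix tells it which task to perform.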


In [7]:
# Load the tokenizer associated with the pretrained model checkpoint.
# We’ll use mt5-small as our checkpoint so we can fine-tune the model
# in a reasonable amount of time:
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [8]:
# Test on a small example
inputs = tokenizer("I loved reading the Hunger Games!")
inputs

{'input_ids': [336, 259, 28387, 11807, 287, 62893, 295, 12507, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
# decode these input IDs with the tokenizer’s convert_ids_to_tokens() function to see what
# kind of tokenizer we’re dealing with:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁I', '▁', 'loved', '▁reading', '▁the', '▁Hung', 'er', '▁Games', '!', '</s>']

The special Unicode character ▁ and end-of-sequence token </s> indicate that we’re dealing with the SentencePiece tokenizer, which is based on the Unigram segmentation algorithm. Unigram is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages, like Japanese, do not have whitespace characters.
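
As a small illustration of what the ▁ marker encodes, the sketch below rebuilds the original sentence from the pieces printed above. It is a simplified stand-in for the tokenizer's own decoding, and the helper name `detokenize` is an assumption made for this example.

```python
tokens = ["▁I", "▁", "loved", "▁reading", "▁the", "▁Hung", "er", "▁Games", "!", "</s>"]


def detokenize(pieces, eos_token="</s>"):
    """Rebuild text from SentencePiece-style pieces (illustrative only)."""
    kept = [piece for piece in pieces if piece != eos_token]
    # "▁" marks where a space preceded the piece in the original text
    return "".join(kept).replace("▁", " ").strip()


print(detokenize(tokens))
```

Because whitespace is stored inside the pieces themselves, the text can be recovered by simple concatenation, without language-specific spacing rules.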

To tokenize our corpus, we have to deal with a subtlety associated with summarization: because our labels are also text, it is possible that they exceed the model’s maximum context size. This means we need to apply truncation to both the body and their titles to ensure we don’t pass excessively long inputs to our model. The tokenizers in 🤗 Transformers provide a nifty text_target argument that allows you to tokenize the labels in parallel to the inputs.

In [56]:
max_input_length = 512
max_target_length = 120


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["body"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["label"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# NOTE: We defined values for max_input_length and max_target_length, which set the upper limits for
# how long our body and label (summary text) can be. Since the  body is typically much larger than
# the label, we’ve scaled these values accordingly.
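
The effect of the two length limits can be sketched in plain Python. This is a simplified picture of truncation for a T5-style tokenizer that appends `</s>` (ID 1 in the output shown earlier), not the tokenizer's actual code; the helper `truncate_with_eos` and the stand-in token IDs are assumptions.

```python
def truncate_with_eos(token_ids, max_length, eos_id=1):
    """Keep at most max_length IDs, reserving the final slot for EOS."""
    if len(token_ids) + 1 <= max_length:
        return token_ids + [eos_id]
    return token_ids[: max_length - 1] + [eos_id]


body_ids = list(range(100, 700))   # stand-in for a 600-token report body
label_ids = list(range(100, 150))  # stand-in for a 50-token summary

print(len(truncate_with_eos(body_ids, max_length=512)))
print(len(truncate_with_eos(label_ids, max_length=120)))
```

A long body is cut to 512 IDs, so anything after that point never reaches the model, while a short summary passes through unchanged apart from the appended end-of-sequence token.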

In [57]:
# With preprocess_function(), it is then a simple matter to tokenize the whole corpus
# using the handy Dataset.map() function:

tokenized_datasets = split_datasets.map(preprocess_function, batched=True)

# NOTE: Now that the corpus has been preprocessed, let’s take a look at some metrics that
# are commonly used for summarization. As we’ll see, there is no silver bullet when it
# comes to measuring the quality of machine-generated text.

Map:   0%|          | 0/1001 [00:00<?, ? examples/s]

Map:   0%|          | 0/112 [00:00<?, ? examples/s]

We used batched=True in our Dataset.map() function above. This encodes the examples in batches of 1,000 (the default) and allows you to make use of the multithreading capabilities of the fast tokenizers in 🤗 Transformers. Where possible, try using batched=True to get the most out of your preprocessing!
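
The batching behaviour can be pictured with a small plain-Python sketch. It is a simplification: `batched_map` and `count_words` are hypothetical helpers, and the real Dataset.map() passes each batch as a dictionary of columns rather than a list of rows.

```python
def batched_map(rows, function, batch_size=1000):
    """Apply `function` to lists of rows, one batch at a time (illustrative)."""
    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        results.extend(function(batch))
    return results


calls = []


def count_words(batch):
    calls.append(len(batch))  # record the batch size seen by each call
    return [len(text.split()) for text in batch]


word_counts = batched_map(["a short report"] * 1001, count_words)
print(calls)
```

With 1,001 training examples and the default batch size of 1,000, the function is called twice instead of 1,001 times, which is where the speed-up comes from.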

## METRICS FOR TEXT SUMMARIZATION:

Compared with tasks like classification, measuring the performance of text generation tasks like summarization or translation is not as straightforward. For example, given a body like “I loved reading the Hunger Games”, there are multiple valid summaries, like “I loved the Hunger Games” or “Hunger Games is a great read”. Clearly, applying some sort of exact match between the generated summary and the label is not a good solution; even humans would fare poorly under such a metric, because we all have our own writing style.

For summarization, one of the most commonly used metrics is the ROUGE score (short for Recall-Oriented Understudy for Gisting Evaluation). The basic idea behind this metric is to compare a generated summary against a set of reference summaries that are typically created by humans. To make this more precise, suppose we want to compare the following two summaries:

In [1]:
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

One way to compare them could be to count the number of overlapping words, which in this case would be 6. However, this is a bit crude, so instead ROUGE is based on computing the precision and recall scores for the overlap.

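For ROUGE, recall measures how much of the reference summary is captured by the generated one. If we are just comparing words, recall can be calculated according to the following formula:

$$ \mathrm{Recall} = \frac{\text{number of overlapping words}}{\text{total number of words in the reference summary}} $$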


For our simple example above, this formula gives a perfect recall of 6/6 = 1; i.e., all the words in the reference summary have been produced by the model. This may sound great, but imagine if our generated summary had been “I really really loved reading the Hunger Games all night”. This would also have perfect recall, but is arguably a worse summary since it is verbose. To deal with these scenarios we also compute the precision, which in the ROUGE context measures how much of the generated summary was relevant:

$$ \mathrm{Precision} = \frac{\text{number of overlapping words}}{\text{total number of words in the generated summary}} $$


Applying this to our verbose summary gives a precision of 6/10 = 0.6, which is considerably worse than the precision of 6/7 = 0.86 obtained by our shorter one. In practice, both precision and recall are usually computed, and then the F1-score (the harmonic mean of precision and recall) is reported. We can do this easily with the 🤗 Evaluate library by first installing the rouge_score package:

In [59]:
%pip install rouge_score


In [60]:
import evaluate

rouge_score = evaluate.load("rouge")

With the version of 🤗 Evaluate used here, compute() returns a single aggregated F1-score for each ROUGE variant. (Older versions of the metric in 🤗 Datasets instead returned confidence intervals for precision, recall, and F1-score as low, mid, and high attributes.) The metric covers a variety of ROUGE scores, which are based on different types of text granularity when comparing the generated and reference summaries. The rouge1 variant is the overlap of unigrams, which is just a fancy way of saying the overlap of words.

In [61]:
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': 0.923076923076923,
 'rouge2': 0.7272727272727272,
 'rougeL': 0.923076923076923,
 'rougeLsum': 0.923076923076923}

In [62]:
scores["rouge1"]

0.923076923076923

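The rouge1 value is the F1-score of the precision (6/7) and recall (6/6) computed above: 2 · (6/7) · 1 / (6/7 + 1) = 12/13 ≈ 0.923, so the numbers match up!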
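
The same arithmetic can be reproduced with a few lines of plain Python. This is a simplified sketch of unigram ROUGE (lowercasing and whitespace splitting only, no stemming), not the rouge_score implementation; the function name `rouge1` is chosen for this example.

```python
from collections import Counter


def rouge1(generated, reference):
    """Unigram-overlap precision, recall and F1 (simplified; no stemming)."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((gen_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge1(
    "I absolutely loved reading the Hunger Games",
    "I loved reading the Hunger Games",
)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

For this pair the sketch gives a precision of 6/7, a recall of 1, and an F1-score of 12/13, in line with the rouge1 value reported by the library.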

rouge2 measures the overlap between bigrams (think the overlap of pairs of words), while rougeL and rougeLsum measure the longest matching sequences of words by looking for the longest common subsequences in the generated and reference summaries. The “sum” in rougeLsum refers to the fact that this metric is computed at the summary level: the text is split into sentences on newlines before the longest common subsequences are computed, whereas rougeL ignores newlines and treats each summary as a single sequence.
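
The bigram and longest-common-subsequence variants can also be sketched in plain Python. As before, this is a simplified illustration with whitespace tokenization rather than the rouge_score implementation, and the helper names are assumptions; the single-sentence example means the sentence-splitting step of rougeLsum does not come into play.

```python
from collections import Counter


def f1_score(overlap, generated_total, reference_total):
    """Harmonic mean of precision and recall from raw counts."""
    if overlap == 0:
        return 0.0
    precision = overlap / generated_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge2_f1(generated, reference):
    """F1 over overlapping bigrams."""
    gen, ref = generated.lower().split(), reference.lower().split()
    gen_bigrams = Counter(zip(gen, gen[1:]))
    ref_bigrams = Counter(zip(ref, ref[1:]))
    overlap = sum((gen_bigrams & ref_bigrams).values())
    return f1_score(overlap, sum(gen_bigrams.values()), sum(ref_bigrams.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rougeL_f1(generated, reference):
    """F1 based on the longest common subsequence of words."""
    gen, ref = generated.lower().split(), reference.lower().split()
    return f1_score(lcs_length(gen, ref), len(gen), len(ref))


generated = "I absolutely loved reading the Hunger Games"
reference = "I loved reading the Hunger Games"
print(round(rouge2_f1(generated, reference), 4), round(rougeL_f1(generated, reference), 4))
```

Four of the bigrams overlap (6 in the generated summary, 5 in the reference), giving 8/11 ≈ 0.727, and the longest common subsequence has length 6, giving 12/13 ≈ 0.923; both agree with the scores shown above.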

Creating a strong baseline

A common baseline for text summarization is to simply take the first three sentences of an article, often called the lead-3 baseline. We could use full stops to track the sentence boundaries, but this will fail on acronyms like “U.S.” or “U.N.” — so instead we’ll use the nltk library, which includes a better algorithm to handle these cases. You can install the package using pip as follows:

In [63]:
%pip install nltk


In [64]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Next, we import the sentence tokenizer from nltk and create a simple function to extract the first three sentences in a body. 

In [65]:
from nltk.tokenize import sent_tokenize


def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])


print(three_sentence_summary(split_datasets["train"][1]["body"]))

"This article is part of a series that provides an ongoing analysis of the changes made to William Von Mueffling’s US stock portfolio on a quarterly basis.
It is based on Mueffling’s regulatory  filed on 02/03/2016.
Please visit our  to get an idea of his investment philosophy and our  highlighting the fund’s moves during Q3 2015.This quarter, Mueffling’s US long portfolio increased 13.31% from $4.69B to $5.32B.
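
To see why a dedicated sentence tokenizer is preferable, here is a deliberately naive lead-3 function that treats every full stop followed by a space as a sentence boundary. The example text and the helper `naive_lead_3` are made up for illustration.

```python
text = "The U.S. economy grew. Stocks rose. Bonds fell. Gold was flat."


def naive_lead_3(document):
    """Lead-3 using full stops as sentence boundaries (deliberately naive)."""
    return document.split(". ")[:3]


print(naive_lead_3(text))
```

The abbreviation is split in two, so the "first three sentences" are `['The U.S', 'economy grew', 'Stocks rose']` and the real third sentence is lost.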


In [66]:
# This seems to work, so let’s now implement a function that extracts these “summaries” from a dataset and computes the ROUGE scores for the baseline:

def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["body"]]
    return metric.compute(predictions=summaries, references=dataset["label"])

In [67]:
# We can then use this function to compute the ROUGE scores over the test set and collect them in a dictionary:

score = evaluate_baseline(split_datasets["test"], rouge_score)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = {rn: score[rn] for rn in rouge_names}
rouge_dict

{'rouge1': 0.33151596305561215,
 'rouge2': 0.17823005037013614,
 'rougeL': 0.20082080103528593,
 'rougeLsum': 0.22638778734560433}

We can see that the rouge2 score is the lowest of the four and roughly half the rouge1 score; this likely reflects the fact that the reference summaries are concise, so the lead-3 baseline is too verbose. Now that we have a baseline to work from, let’s turn our attention toward fine-tuning mT5!

Fine-tuning mT5 with the Trainer API
The first thing we need to do is load the pretrained model from the mt5-small checkpoint. Since summarization is a sequence-to-sequence task, we can load the model with the AutoModelForSeq2SeqLM class, which will automatically download and cache the weights:

In [68]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [69]:
from huggingface_hub import notebook_login

notebook_login()


We’ll need to generate summaries in order to compute ROUGE scores during training. Fortunately, 🤗 Transformers provides dedicated Seq2SeqTrainingArguments and Seq2SeqTrainer classes that can do this for us automatically! To see how this works, let’s first define the hyperparameters and other arguments for our experiments:

In [70]:
from transformers import Seq2SeqTrainingArguments

batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-13f-reports",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=True,
)

Here, the predict_with_generate argument has been set to indicate that we should generate summaries during evaluation so that we can compute ROUGE scores for each epoch. The decoder performs inference by predicting tokens one by one, and this is implemented by the model’s generate() method. Setting predict_with_generate=True tells the Seq2SeqTrainer to use that method for evaluation. We’ve also adjusted some of the default hyperparameters, like the learning rate, number of epochs, and weight decay, and we’ve set the save_total_limit option to only save up to 3 checkpoints during training — this is because even the “small” version of mT5 uses around a GB of hard drive space, and we can save a bit of room by limiting the number of copies we save.
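
The step counts implied by these settings can be worked out directly. The sketch below assumes a single device and no gradient accumulation; the variable names are chosen for this example and are not part of the training arguments.

```python
import math

train_examples = 1001  # size of the train split shown earlier
per_device_batch = 8

# Optimizer steps per epoch, counting the final partial batch
steps_per_epoch = math.ceil(train_examples / per_device_batch)
# Value that the logging_steps expression above evaluates to
log_every = train_examples // per_device_batch

print(steps_per_epoch, log_every)
print(8 * steps_per_epoch, 10 * steps_per_epoch)
```

Because 1,001 is not a multiple of 8, the logging interval (125 steps) is one step shorter than an epoch (126 steps), so the loss is logged approximately, rather than exactly, once per epoch. Eight epochs correspond to 1,008 steps and ten epochs to 1,260 steps.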

The push_to_hub=True argument will allow us to push the model to the Hub after training. Note that you can specify the name of the repository you want to push to with the hub_model_id argument (in particular, you will have to use this argument to push to an organization). 

The next thing we need to do is provide the trainer with a compute_metrics() function so that we can evaluate our model during training. For summarization this is a bit more involved than simply calling rouge_score.compute() on the model’s predictions, since we need to decode the outputs and labels into text before we can compute the ROUGE scores. The following function does exactly that, and also makes use of the sent_tokenize() function from nltk to separate the summary sentences with newlines:

In [71]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    # Compute ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # The metric already returns aggregated F1-scores as floats, so we only round them
    return {k: round(v, 4) for k, v in result.items()}

Next, we need to define a data collator for our sequence-to-sequence task. Since mT5 is an encoder-decoder Transformer model, one subtlety with preparing our batches is that during decoding we need to shift the labels to the right by one. This is required to ensure that the decoder only sees the previous ground truth labels and not the current or future ones, which would be easy for the model to memorize. This is similar to how masked self-attention is applied to the inputs in a task like causal language modeling.

Transformers provides a DataCollatorForSeq2Seq collator that will dynamically pad the inputs and the labels for us. To instantiate this collator, we simply need to provide the tokenizer and model:

In [72]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [73]:
# Let’s see what this collator produces when fed a small batch of examples. First, we need to remove the columns with strings because the collator won’t know how to pad these elements:

tokenized_datasets = tokenized_datasets.remove_columns(
    split_datasets["train"].column_names
)

In [74]:
# Since the collator expects a list of dicts, where each dict represents a single example in the dataset, we also need to wrangle the data into the expected format before passing it to the data collator:

features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

{'input_ids': tensor([[  313, 13673,  3737,  ...,  2032,   486,     1],
        [  313, 13673,  3737,  ...,   344, 36805,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[   313,  58637,   1599,  35522,    293,    263,    849,    545,  75190,
           9387,    281,  41980,    702,  39208,    260,   3950,    452,    288,
          60654,  44610,    452,    714,  43095,    260, 170876,    321,    259,
          65556,    339,    287,    259,  42983,   6920,  10032,  13446,    344,
           3267,  53369,    304,    287,  75190,    260,   2009,   5700,  13446,
            281,    320,    546,    559,   2023,    394,    737,    259, 141769,
           8064,    263,    639,    331, 102270,    259,   4944,  86710,    265,
          33682,  15497,    259,    262,   2952,  13446,    394,    737,    259,
         141769,  10633,    263,   2454,      1,   -100,   -100,   -100,   -100,
           -100,   -100],
        [   313

The collator pads the inputs with the <pad> token (whose ID is 0), while the labels are padded with -100s to make sure the padding positions are ignored by the loss function. The collator also adds a new decoder_input_ids entry, which shifts the labels to the right by inserting a <pad> token in the first position; that part of the output is cut off in the truncated printout above.
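
Both steps, padding the labels with -100 and shifting them right to build the decoder inputs, can be illustrated on plain lists. This is a sketch of the idea rather than the collator's code: the helper names are hypothetical, the token IDs are borrowed from the output above, and using the pad ID as the decoder start token is an assumption based on the T5 family's convention.

```python
PAD_ID = 0           # <pad> ID of the mT5 tokenizer
IGNORE_INDEX = -100  # label value skipped by the loss


def pad_labels(batch_labels):
    """Right-pad label sequences to a common length with -100."""
    longest = max(len(labels) for labels in batch_labels)
    return [labels + [IGNORE_INDEX] * (longest - len(labels)) for labels in batch_labels]


def shift_right(labels, start_id=PAD_ID):
    """Build decoder inputs: start token first, then all but the last label."""
    shifted = [start_id] + labels[:-1]
    # -100 is not a valid token ID, so swap it back to the pad ID
    return [PAD_ID if token == IGNORE_INDEX else token for token in shifted]


labels = pad_labels([[313, 58637, 1599, 1], [313, 2454, 1]])
decoder_input_ids = [shift_right(row) for row in labels]
print(labels)
print(decoder_input_ids)
```

At position t the decoder input holds the label from position t - 1, so the model only ever conditions on earlier ground-truth tokens when predicting the current one.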

We finally have all the ingredients we need to train with! We now simply need to instantiate the trainer with the standard arguments:

In [75]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [76]:
# FROM THIS POINT ON, MAKE SURE YOU ARE RUNNING ON A MACHINE WITH A GPU
trainer.train()


In [37]:
# During training, you should see the training loss decrease and the ROUGE scores increase
# with each epoch. Once the training is complete, you can see the final ROUGE scores by
# running Trainer.evaluate():

trainer.evaluate()

{'eval_loss': 0.599791407585144,
 'eval_rouge1': 0.6865,
 'eval_rouge2': 0.6132,
 'eval_rougeL': 0.6746,
 'eval_rougeLsum': 0.675,
 'eval_runtime': 7.5771,
 'eval_samples_per_second': 14.781,
 'eval_steps_per_second': 1.848,
 'epoch': 8.0}

From the scores we can see that our model has handily outperformed our lead-3 baseline! The final thing to do is push the model weights to the Hub, as follows:

In [38]:
trainer.push_to_hub(commit_message="Training complete", tags="summarization")

'https://huggingface.co/jkv53/mt5-small-finetuned-13f-reports/tree/main/'

This will save the checkpoint and configuration files to output_dir, before uploading all the files to the Hub. By specifying the tags argument, we also ensure that the widget on the Hub will be one for a summarization pipeline instead of the default text generation one associated with the mT5 architecture. The output from trainer.push_to_hub() is a URL to the model repository on the Hub, so you can easily see the changes that were made there!

We can also fine-tune mT5 using the low-level features provided by 🤗 Accelerate.

## NOTE: The code below duplicates the above using low-level features. So, when training, skip to the last two cells - print_summary.

Fine-tuning mT5 with 🤗 Accelerate:

In [42]:
# The first thing we need to do is create a DataLoader for each of our splits. Since the PyTorch dataloaders expect batches of tensors, we need to set the format to "torch" in our datasets:

tokenized_datasets.set_format("torch")

Now that we’ve got datasets consisting of just tensors, the next thing to do is instantiate the DataCollatorForSeq2Seq again. For this we need to provide a fresh version of the model, so let’s load it again from our cache:

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [44]:
# We can then instantiate the data collator with the fresh model and use it to define our dataloaders:

from torch.utils.data import DataLoader

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

batch_size = 8
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
eval_dataloader = DataLoader(
    tokenized_datasets["test"], collate_fn=data_collator, batch_size=batch_size
)

In [45]:
# The next thing to do is define the optimizer we want to use. As in our other examples,
# we’ll use AdamW, which works well for most problems:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [46]:
# Finally, we feed our model, optimizer, and dataloaders to the accelerator.prepare() method:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that we’ve prepared our objects, there are three remaining things to do:

1. Define the learning rate schedule.

2. Implement a function to post-process the summaries for evaluation.

3. Create a repository on the Hub that we can push our model to.

For the learning rate schedule, we’ll use the standard linear decay schedule:

In [47]:
from transformers import get_scheduler

num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
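
For intuition, the shape of a linear schedule with warmup can be written as a small function. This is a sketch of the usual formula under the settings above (base learning rate 2e-5, no warmup, 10 epochs of 126 steps), not the scheduler object returned by get_scheduler(); the name `linear_lr` is an assumption.

```python
def linear_lr(step, base_lr=2e-5, num_warmup_steps=0, num_training_steps=1260):
    """Learning rate after `step` updates: linear warmup, then linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


for step in (0, 630, 1260):
    print(step, linear_lr(step))
```

With no warmup the learning rate starts at its full value, is halved at the midpoint of training, and reaches zero at the final step.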

For post-processing, we need a function that splits the generated summaries into sentences that are separated by newlines. This is the format the ROUGE metric expects, and we can achieve this with the following snippet of code:

In [48]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # ROUGE expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

This should look familiar to you if you recall how we defined the compute_metrics() function of the Seq2SeqTrainer.

Finally, we need to create a model repository on the Hugging Face Hub. For this, we can use the appropriately titled 🤗 Hub library. We just need to define a name for our repository, and the library has a utility function to combine the repository ID with the user profile:

In [49]:
from huggingface_hub import get_full_repo_name

model_name = "mt5-small-finetuned-13f-reports"
repo_name = get_full_repo_name(model_name)
repo_name

'jkv53/mt5-small-finetuned-13f-reports'

In [50]:
# Now we can use this repository name to clone a local version to our results directory that will store the training artifacts:

from huggingface_hub import Repository

output_dir = "results-mt5-small-finetuned-13f-reports-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

# NOTE: This will allow us to push the artifacts back to the Hub by calling the repo.push_to_hub() method
# during training! Let’s now wrap up our analysis by writing out the training loop.

Cloning https://huggingface.co/jkv53/mt5-small-finetuned-13f-reports into local empty directory.



Training loop


The training loop is roughly split into four main steps:

1. Train the model by iterating over all the examples in train_dataloader for each epoch.

2. Generate model summaries at the end of each epoch, by first generating the tokens and then decoding them (and the reference summaries) into text.

3. Compute the ROUGE scores using the same techniques we saw earlier.

4. Save the checkpoints and push everything to the Hub. Here we rely on the nifty blocking=False argument of the Repository object so that we can push the checkpoints per epoch asynchronously. This allows us to continue training without having to wait for the somewhat slow upload associated with a GB-sized model!


These steps can be seen in the following block of code:

In [52]:
from tqdm.auto import tqdm
import torch
import numpy as np

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )

            generated_tokens = accelerator.pad_across_processes(
                generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
            )
            # If we did not pad to max length, we need to pad the labels too
            labels = accelerator.pad_across_processes(
                batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
            )

            generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
            labels = accelerator.gather(labels).cpu().numpy()

            # Replace -100 in the labels as we can't decode them
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            if isinstance(generated_tokens, tuple):
                generated_tokens = generated_tokens[0]
            decoded_preds = tokenizer.batch_decode(
                generated_tokens, skip_special_tokens=True
            )
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

            decoded_preds, decoded_labels = postprocess_text(
                decoded_preds, decoded_labels
            )

            rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

    # Compute metrics
    result = rouge_score.compute()
    # compute() already returns aggregated float scores here; round them for display
    result = {k: round(v, 4) for k, v in result.items()}
    print(f"Epoch {epoch}:", result)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

# NOTE: And that’s it! Once you run this, you’ll have a model and results that are pretty
# similar to the ones we obtained with the Trainer.

Epoch 0: {'rouge1': 0.6979, 'rouge2': 0.6252, 'rougeL': 0.6873, 'rougeLsum': 0.6869}
Epoch 1: {'rouge1': 0.7022, 'rouge2': 0.6309, 'rougeL': 0.6929, 'rougeLsum': 0.6919}
Epoch 2: {'rouge1': 0.7035, 'rouge2': 0.6332, 'rougeL': 0.6938, 'rougeLsum': 0.6936}
Epoch 3: {'rouge1': 0.7084, 'rouge2': 0.6376, 'rougeL': 0.699, 'rougeLsum': 0.6989}
Epoch 4: {'rouge1': 0.7071, 'rouge2': 0.6362, 'rougeL': 0.698, 'rougeLsum': 0.6978}
Epoch 5: {'rouge1': 0.7119, 'rouge2': 0.6421, 'rougeL': 0.7045, 'rougeLsum': 0.7042}
Epoch 6: {'rouge1': 0.7131, 'rouge2': 0.6439, 'rougeL': 0.7057, 'rougeLsum': 0.7052}
Epoch 7: {'rouge1': 0.7148, 'rouge2': 0.6465, 'rougeL': 0.7075, 'rougeLsum': 0.7071}
Epoch 8: {'rouge1': 0.7148, 'rouge2': 0.6465, 'rougeL': 0.7075, 'rougeLsum': 0.7071}
Epoch 9: {'rouge1': 0.7148, 'rouge2': 0.6465, 'rougeL': 0.7075, 'rougeLsum': 0.7071}
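
To make the rouge1 numbers in the epoch output easier to interpret, here is a minimal sketch of a ROUGE-1 F1 score. It is a simplified illustration, not the implementation behind rouge_score: it assumes lowercased whitespace tokens and no stemming, and rouge1_f1 is a hypothetical helper name. The example sentences are made up for the illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the new stake signals a bullish bias",
    "the stake indicates a bullish bias",
)
print(round(score, 4))
```

Here five of the seven predicted tokens and five of the six reference tokens overlap, so precision is 5/7 and recall is 5/6. A score around 0.71, as in the later epochs, therefore indicates substantial word-level overlap between generated and reference summaries.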


In [53]:
# Using your fine-tuned model:
# Once you’ve pushed the model to the Hub, you can play with it either via the inference
# widget or with a pipeline object, as follows:

from transformers import pipeline

hub_model_id = "jkv53/mt5-small-finetuned-13f-reports"
summarizer = pipeline("summarization", model=hub_model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Feed some examples from the test set (which the model has not seen) to our pipeline to get a feel for the quality of the summaries. First let’s implement a simple function to show the report body, reference label, and generated summary together:

In [54]:
def print_summary(idx):
    # Look up the test example once and reuse it
    example = split_datasets["test"][idx]
    body = example["body"]
    label = example["label"]
    # The pipeline returns a list with one dict per input text
    summary = summarizer(body)[0]["summary_text"]
    print(f"'>>> Body: {body}'")
    print(f"\n'>>> Label: {label}'")
    print(f"\n'>>> Summary: {summary}'")

In [55]:
# Let’s take a look at one of the test examples:

print_summary(100)



'>>> Body: "Chevron Corp (CVX): CVS, a ~1.1% stake, was initiated this quarter in the $102 to $111 price-range. CVX currently trades at ~$98. The significant new stake signals a bullish bias.Macys Inc (M): M, a ~0.9% stake, was initiated this quarter in the $32 to $41 price range. M currently trades at around $35. The significant new stake indicates a bullish bias.SPDR Consumer Staples (XLP): XLP, a 0.5% position, was established this quarter when the price-per-share varied between $32 and $34. XLP currently trades at $33.50. The stake is too small to show any clear bias.Suntrust Banks (STI): STI, a 1.1% position, was introduced this quarter when the price-per-share varied between $17.70 and $25. STI currently trades at ~ $21.75. The noteworthy new stake indicates a bullish bias.Tesoro Corp (TSO): TSO, a 0.5% position, was picked up this quarter when the price-per-share varied between $22 and $30. TSO currently trades at the low end of that range. The stake is too slight to show any cl