In [1]:
magma_dir = '/home/marco/epfl/magma/'

## 🤗 Finetune **Longformer Encoder-Decoder (LED)** on Karger Books 🤗

The *Longformer Encoder-Decoder (LED)* was recently added as an extension to [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.

We will leverage 🤗`Seq2SeqTrainer`, gradient checkpointing and as usual 🤗`datasets`.

First, let's try to get a GPU with at least 15GB RAM.

In [2]:
# crash colab to get more RAM
# !kill -9 -1

To check that we have enough GPU memory, we can run the following command.
If the randomly allocated GPU is too small, the cell above can be run
to crash the notebook in the hope of getting a better GPU.

In [3]:
!nvidia-smi

/usr/bin/sh: 1: nvidia-smi: not found


Next, we install 🤗Transformers, 🤗Datasets, and `rouge_score`.



In [4]:
'''
%%capture
!pip install datasets==1.2.1
!pip install transformers==4.2.0
!pip install rouge_score
'''

'\n%%capture\n!pip install datasets==1.2.1\n!pip install transformers==4.2.0\n!pip install rouge_score\n'

Let's start by loading and preprocessing the dataset.



In [21]:
from datasets import load_dataset, load_metric
train_dataset = load_dataset('csv', data_files=magma_dir+'datasets/karger_books_base/train.csv', split='train')
val_dataset = load_dataset('csv', data_files=magma_dir+'datasets/karger_books_base/val.csv', split='train')
test_dataset = load_dataset('csv', data_files=magma_dir+'datasets/karger_books_base/test.csv', split='train')

Using custom data configuration default
Reusing dataset csv (/home/marco/.cache/huggingface/datasets/csv/default-f940c07eb5482384/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
Using custom data configuration default
Reusing dataset csv (/home/marco/.cache/huggingface/datasets/csv/default-9dbc55a10a5f37c8/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
Using custom data configuration default
Reusing dataset csv (/home/marco/.cache/huggingface/datasets/csv/default-0c70df74eafd3191/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)


It's always a good idea to take a look at some data samples. Let's do that here.

In [22]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=4):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

We can see that the input data is `text`, a scientific chapter, and the target data is `bullets`, a concise summary.

Cool! Having loaded the dataset, let's tokenize it.
We'll import the convenient `AutoTokenizer` class.

In [7]:
from transformers import AutoTokenizer, AddedToken

Then we load the tokenizer.

In [20]:
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

In [21]:
print(len(tokenizer))
tokenizer.additional_special_tokens

50265


[]

In [22]:
special_tokens_dict = {'additional_special_tokens': ['<BULL>']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')
print(len(tokenizer))
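# The model is only loaded further below (as `led`), so its embeddings must be
# resized there with `led.resize_token_embeddings(len(tokenizer))`.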

We have added 1 tokens
50266


Note that for the sake of this notebook, we finetune the "smaller" LED checkpoint ["allenai/led-base-16384"](https://huggingface.co/allenai/led-base-16384). Better performance can, however, be attained by finetuning ["allenai/led-large-16384"](https://huggingface.co/allenai/led-large-16384), at the cost of more GPU memory.

In [42]:
import pandas as pd
df_train = pd.read_csv(magma_dir+'datasets/karger_books_base/train.csv').set_index(['book', 'chapter'])
df_val = pd.read_csv(magma_dir+'datasets/karger_books_base/val.csv').set_index(['book', 'chapter'])
df_test = pd.read_csv(magma_dir+'datasets/karger_books_base/test.csv').set_index(['book', 'chapter'])

df = pd.concat([df_train, df_val, df_test])

In [43]:
df['bullets_enc'] = df.bullets.map(tokenizer.encode)
df['bullets_num_tok'] = df.bullets_enc.map(len)
df['text_enc'] = df.text.map(tokenizer.encode)
df['text_num_tok'] = df.text_enc.map(len)

In [45]:
df.text_num_tok.describe()

count      453.000000
mean      2957.896247
std       1896.892605
min        640.000000
25%       1680.000000
50%       2488.000000
75%       3616.000000
max      13452.000000
Name: text_num_tok, dtype: float64

In [58]:
len(df[df.text_num_tok > 8500])

10

In [48]:
df.bullets_num_tok.describe()

count    453.000000
mean     184.512141
std       90.921720
min       47.000000
25%      114.000000
50%      169.000000
75%      234.000000
max      679.000000
Name: bullets_num_tok, dtype: float64

In [50]:
len(df[df.bullets_num_tok > 500])

2

In [59]:
max_input_length = 8192
max_output_length = 512
batch_size = 2
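
The limits above follow from the length statistics: only 10 of the 453 chapters exceed 8500 tokens, and only 2 summaries exceed 500 tokens. A minimal sketch of such a check, assuming a plain list of token counts (the helper name `truncation_stats` and the sample numbers are hypothetical and not taken from the dataset):

```python
def truncation_stats(lengths, max_length):
    """Return (number of samples longer than max_length, their share of all samples)."""
    too_long = sum(1 for n in lengths if n > max_length)
    return too_long, too_long / len(lengths)


# Hypothetical token counts, for illustration only.
sample_lengths = [640, 1680, 2488, 3616, 9000, 13452]
count, fraction = truncation_stats(sample_lengths, max_length=8192)
print(f"{count} of {len(sample_lengths)} samples would be truncated ({fraction:.0%})")
```

With the DataFrame built above, the same helper could be called as `truncation_stats(df.text_num_tok.tolist(), max_input_length)`.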

Now, let's write down the input data processing function that will be used to map each data sample to the correct model format.
As explained earlier, `text` is our input data and `bullets` is the target data. The data samples are thus tokenized and truncated to the respective maximum lengths of 8192 and 512 tokens.

In addition to the usual `attention_mask`, LED can make use of an additional `global_attention_mask` defining which input tokens are attended globally and which are attended only locally, just as is the case for [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). For more information on Longformer's self-attention, please take a look at the corresponding [docs](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention). For summarization, we follow the recommendations of the [paper](https://arxiv.org/abs/2004.05150) and use global attention only for the very first token. Finally, we make sure that no loss is computed on padded tokens by setting their label index to `-100`.

In [60]:
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = tokenizer(
        batch["bullets"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # global attention on the first token only; one independent mask per sample
    batch["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in batch["input_ids"]
    ]
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
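
The padding, global attention mask, and label masking described above can be tried on toy data without a tokenizer. The sketch below is only an illustration: the token ids, the pad id `PAD_ID = 1`, and all helper names are assumptions made for this example.

```python
PAD_ID = 1  # assumed pad id for this toy example


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut ids to max_length, then right-pad with pad_id up to max_length."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


def first_token_global_mask(batch_input_ids):
    """Build one independent mask per sample: 1 for the first token, 0 elsewhere."""
    return [[1] + [0] * (len(ids) - 1) for ids in batch_input_ids]


def mask_padding_in_labels(batch_labels, pad_id=PAD_ID):
    """Replace pad ids by -100 so that the loss skips those positions."""
    return [[-100 if t == pad_id else t for t in labels] for labels in batch_labels]


inputs = [pad_or_truncate(x, 6) for x in ([0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 26, 2])]
labels = [pad_or_truncate(x, 4) for x in ([0, 31, 2], [0, 41, 42, 43, 2])]
global_mask = first_token_global_mask(inputs)
masked_labels = mask_padding_in_labels(labels)

print(inputs)         # padded or truncated input ids
print(global_mask)    # only position 0 is global in every row
print(masked_labels)  # padding positions replaced by -100

# Python detail: multiplying a list of lists repeats references to ONE inner list,
# so a single write is visible in every row.
shared = 2 * [[0, 0, 0]]
shared[0][0] = 1
print(shared)
```

Note that the second toy input loses its final tokens to truncation, which is what happens to chapters longer than `max_input_length`.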

For the sake of this notebook, we will reduce the training and validation data 
to a dummy dataset of sizes 250 and 25 respectively. For a full training run, those lines should be commented out.

In [61]:
train_dataset = train_dataset.select(range(250))
val_dataset = val_dataset.select(range(25))

In [62]:
train_dataset

Dataset({
    features: ['book', 'chapter', 'text', 'bullets'],
    num_rows: 250
})

Great, having defined the mapping function, let's preprocess the training data

In [73]:
train_dataset = train_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=['book', 'chapter', 'text', 'bullets'],
)

Loading cached processed dataset at /home/marco/.cache/huggingface/datasets/csv/default-f940c07eb5482384/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-7b912e9241083a5a.arrow


In [74]:
train_dataset

Dataset({
    features: ['attention_mask', 'global_attention_mask', 'input_ids', 'labels'],
    num_rows: 250
})

and the validation and test data

In [75]:
val_dataset = val_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=['book', 'chapter', 'text', 'bullets'],
)
test_dataset = test_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=['book', 'chapter', 'text', 'bullets'],
)


Finally, the datasets should be converted into the PyTorch format as follows.

In [76]:
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)
val_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

Alright, we're almost ready to start training. Let's load the model via the `AutoModelForSeq2SeqLM` class.

In [77]:
from transformers import AutoModelForSeq2SeqLM

We've decided to stick to the smaller model `"allenai/led-base-16384"` for the sake of this notebook. In addition, we directly enable gradient checkpointing and disable the caching mechanism to save memory.

In [78]:
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)
led.resize_token_embeddings(len(tokenizer))  # account for the added `<BULL>` special token


During training, we want to evaluate the model on Rouge, the most common metric used in summarization, to make sure the model is indeed improving during training. For this, we set fitting generation parameters. We'll use beam search with a small beam of just 2 to save memory. Also, we force the model to generate at least 90 tokens, but no more than 512. In addition, some other generation parameters are set that have been found helpful for generation. For more information on those parameters, please take a look at the [docs](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate).

In [None]:
# set generate hyperparameters
led.config.num_beams = 2
led.config.max_length = 512
led.config.min_length = 90
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3
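
To give an intuition for `no_repeat_ngram_size = 3`: at each decoding step, any token that would complete a 3-gram already present in the generated sequence is excluded. The following is a simplified pure-Python illustration of that idea, not the code used by `generate()`; the function name `banned_next_tokens` and the token ids are hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already present in `generated`."""
    if len(generated) < ngram_size - 1:
        return set()  # not enough context to complete any n-gram yet
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned


# The sequence ends in (5, 6), and (5, 6, 7) already occurred, so 7 is banned next.
print(banned_next_tokens([5, 6, 7, 8, 5, 6], ngram_size=3))
```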

Next, we also have to define the function that will compute the `"rouge"` score during evaluation.

Let's load the `"rouge"` metric from 🤗datasets and define the `compute_metrics(...)` function.

In [None]:
rouge = load_metric("rouge")

The compute metrics function expects the generation output, called `pred.predictions` as well as the gold label, called `pred.label_ids`.

Those tokens are decoded and consequently, the rouge score can be computed.

In [None]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
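
As a rough intuition for the three numbers returned above, ROUGE-2 compares the bigrams of the prediction with those of the reference. The sketch below is a simplified re-implementation for illustration only: it assumes lowercased whitespace tokenization and omits the tokenization, stemming options, and aggregation over samples that the `rouge` metric provides. The name `rouge2_sketch` and the example sentences are made up.

```python
from collections import Counter


def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_sketch(prediction, reference):
    """Return (precision, recall, f-measure) of the bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge2_sketch(
    "the drug lowers blood pressure",
    "the drug lowers systolic blood pressure",
)
print(round(p, 4), round(r, 4), round(f, 4))
```

Here three of the four predicted bigrams appear among the five reference bigrams, so precision is higher than recall.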

Now, we're ready to start training. Let's import the `Seq2SeqTrainer` and `Seq2SeqTrainingArguments`.

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In contrast to the usual `Trainer`, the `Seq2SeqTrainer` makes it possible to use the `generate()` function during evaluation. This should be enabled with `predict_with_generate=True`. Because our GPU RAM is limited, we make use of gradient accumulation by setting `gradient_accumulation_steps=4` to have an effective `batch_size` of 2 * 4 = 8.

The other training arguments are described in the [docs](https://huggingface.co/transformers/main_classes/trainer.html?highlight=trainingarguments#transformers.TrainingArguments).
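
The batch-size arithmetic can be written out explicitly. This is a back-of-the-envelope sketch with hypothetical helper names; the step count is approximate, since it assumes a single device and that a trailing partial accumulation window is dropped.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_optimizer_steps(num_samples, per_device_batch_size, gradient_accumulation_steps, num_epochs=1):
    """Approximate number of optimizer updates (assumption: partial windows are dropped)."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch_size)
    return (batches_per_epoch // gradient_accumulation_steps) * num_epochs


eff = effective_batch_size(per_device_batch_size=2, gradient_accumulation_steps=4)
steps = approx_optimizer_steps(num_samples=250, per_device_batch_size=2, gradient_accumulation_steps=4)
print(eff, steps)
```

For the 250-sample training subset used here, this gives an effective batch size of 8 and roughly 31 optimizer updates for one epoch.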

In [None]:
# enable fp16 (mixed-precision) training
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir="./",
    logging_steps=5,
    eval_steps=10,
    save_steps=10,
    save_total_limit=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
)

The training arguments, along with the model, tokenizer, datasets and the `compute_metrics` function can then be passed to the `Seq2SeqTrainer`

In [None]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

and we can start training. This will take about 35 minutes.

In [None]:
trainer.train()


This completes the fine-tuning tutorial for LED. This training script, with some small changes, was used to train [this](https://huggingface.co/patrickvonplaten/led-large-16384-pubmed) checkpoint, called `"patrickvonplaten/led-large-16384-pubmed"`, on a single GPU for ca. 3 days. Evaluating `"patrickvonplaten/led-large-16384-pubmed"` on PubMed's test data gives a Rouge-2 score of **19.33**, which is around 1 Rouge-2 point below SOTA performance on PubMed.

In the Appendix below, the condensed training and evaluation scripts that were used locally to finetune `"patrickvonplaten/led-large-16384-pubmed"` are attached.

# **Appendix**

## Training

In [None]:
#!/usr/bin/env python3
from datasets import load_dataset, load_metric
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
)

# load rouge
rouge = load_metric("rouge")

# load pubmed
pubmed_train = load_dataset("scientific_papers", "pubmed", ignore_verifications=True, split="train")
pubmed_val = load_dataset("scientific_papers", "pubmed", ignore_verifications=True, split="validation[:10%]")

# uncomment the following lines for a test run
# pubmed_train = pubmed_train.select(range(32))
# pubmed_val = pubmed_val.select(range(32))

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")


# max encoder length is 8192 for PubMed
encoder_max_length = 8192
decoder_max_length = 512
batch_size = 2


def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["article"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
    )
    outputs = tokenizer(
        batch["abstract"],
        padding="max_length",
        truncation=True,
        max_length=decoder_max_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # global attention on the first token only; one independent mask per sample
    batch["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in batch["input_ids"]
    ]
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch


# map train data
pubmed_train = pubmed_train.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "section_names"],
)

# map val data
pubmed_val = pubmed_val.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "section_names"],
)

# set Python list to PyTorch tensor
pubmed_train.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

# set Python list to PyTorch tensor
pubmed_val.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

# enable fp16 apex training
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    fp16_backend="apex",
    output_dir="./",
    logging_steps=250,
    eval_steps=5000,
    save_steps=500,
    warmup_steps=1500,
    save_total_limit=2,
    gradient_accumulation_steps=4,
)


# compute Rouge score during validation
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }


# load model + enable gradient checkpointing & disable cache for checkpointing
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-large-16384", gradient_checkpointing=True, use_cache=False)

# set generate hyperparameters
led.config.num_beams = 4
led.config.max_length = 512
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3


# instantiate trainer
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=pubmed_train,
    eval_dataset=pubmed_val,
)

# start training
trainer.train()

## Evaluation

In [None]:
import torch

from datasets import load_dataset, load_metric
from transformers import LEDTokenizer, LEDForConditionalGeneration

# load pubmed
pubmed_test = load_dataset("scientific_papers", "pubmed", ignore_verifications=True, split="test")

# load tokenizer
tokenizer = LEDTokenizer.from_pretrained("patrickvonplaten/led-large-16384-pubmed")
model = LEDForConditionalGeneration.from_pretrained("patrickvonplaten/led-large-16384-pubmed").to("cuda").half()


def generate_answer(batch):
  inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=8192, return_tensors="pt", truncation=True)
  input_ids = inputs_dict.input_ids.to("cuda")
  attention_mask = inputs_dict.attention_mask.to("cuda")
  global_attention_mask = torch.zeros_like(attention_mask)
  # put global attention on <s> token
  global_attention_mask[:, 0] = 1

  predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask)
  batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
  return batch


result = pubmed_test.map(generate_answer, batched=True, batch_size=4)

# load rouge
rouge = load_metric("rouge")

print("Result:", rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"], rouge_types=["rouge2"])["rouge2"].mid)
