#### For Colab

In [None]:
"""
function ClickConnect(){
    console.log("Working");
    document.querySelector("colab-toolbar-button").click() 
}
var i = setInterval(ClickConnect, 900000)
clearInterval(i)
"""


First, let's try to get a GPU with at least 15GB RAM.

In [None]:
# crash colab to get more RAM
!kill -9 -1

In [None]:
!nvidia-smi

Thu Jan 28 14:36:48 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    32W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
drive_dir = '/content/drive/My Drive/MAGMA: Summarization/'

#### Install Libraries

In [None]:
!pip install datasets==1.2.1
!pip install transformers==4.2.0
!pip install rouge_score
!pip install -U wandb
!pip install -U sentence-transformers

Requirement already up-to-date: wandb in /usr/local/lib/python3.6/dist-packages (0.10.15)
Requirement already up-to-date: sentence-transformers in /usr/local/lib/python3.6/dist-packages (0.4.1.2)


### **Config**

In [None]:
import sys
sys.path.insert(0, drive_dir)
import config

import wandb
wandb.login()

project_name = 'ft_led_karger_books'
%env WANDB_PROJECT=$project_name

[34m[1mwandb[0m: Currently logged in as: [33mmarcoabrate[0m (use `wandb login --relogin` to force relogin)


env: WANDB_PROJECT=finetune_led


## 🤗 Finetune **Longformer Encoder-Decoder (LED)** on Karger Books 🤗

The *Longformer Encoder-Decoder (LED)* was recently added as an extension to [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, and Arman Cohan.

We will leverage 🤗`Seq2SeqTrainer`, gradient checkpointing and as usual 🤗`datasets`.

Let's start by loading and preprocessing the dataset.



In [None]:
from datasets import load_dataset, load_metric
train_dataset = load_dataset('csv', data_files=drive_dir+'datasets/karger_books_base/train.csv', split='train')
val_dataset = load_dataset('csv', data_files=drive_dir+'datasets/karger_books_base/val.csv', split='train')
test_dataset = load_dataset('csv', data_files=drive_dir+'datasets/karger_books_base/test.csv', split='train')

Using custom data configuration default
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-0498b287fc32a431/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
Using custom data configuration default
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-82fd9fd954b5a267/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
Using custom data configuration default
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-acb527b34156b505/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)


The input data is the `text` column, a scientific book chapter, and the target data is the `bullets` column, a concise bullet-point summary.

In [None]:
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

print(len(tokenizer))
print(tokenizer.additional_special_tokens)

In [None]:
num_added_toks = tokenizer.add_tokens(AddedToken('<BULL>', single_word=False, lstrip=True, rstrip=True, normalized=False))
print('We have added', num_added_toks, 'tokens')
print(len(tokenizer))

We have added 1 tokens
50266


Note that for the sake of this notebook, we fine-tune the smaller LED checkpoint ["allenai/led-base-16384"](https://huggingface.co/allenai/led-base-16384). Better performance can, however, be attained by fine-tuning ["allenai/led-large-16384"](https://huggingface.co/allenai/led-large-16384) at the cost of more GPU RAM.

In [None]:
import pandas as pd
df_train = pd.read_csv(drive_dir+'datasets/karger_books_base/train.csv').set_index(['book', 'chapter'])
df_val = pd.read_csv(drive_dir+'datasets/karger_books_base/val.csv').set_index(['book', 'chapter'])
df_test = pd.read_csv(drive_dir+'datasets/karger_books_base/test.csv').set_index(['book', 'chapter'])

df = pd.concat([df_train, df_val, df_test])

In [None]:
df['bullets_tok'] = df.bullets.map(tokenizer.tokenize)
df['bullets_enc'] = df.bullets.map(tokenizer.encode)
df['bullets_num_tok'] = df.bullets_enc.map(len)
df['text_enc'] = df.text.map(tokenizer.encode)
df['text_num_tok'] = df.text_enc.map(len)

In [None]:
print(len(df.iloc[0].bullets_tok), len(df.iloc[0].bullets_enc))

168 170


In [None]:
print(tokenizer.bos_token_id, tokenizer.eos_token_id)
print(df.iloc[0].bullets_enc[0], df.iloc[0].bullets_enc[-1])

0 2
0 2


In [None]:
list(zip(df.iloc[0].bullets_tok, df.iloc[0].bullets_enc[1:-1]))

[('<BULL> ', 50265),
 ('The', 133),
 ('Ġfour', 237),
 ('Ġmain', 1049),
 ('Ġtypes', 3505),
 ('Ġof', 9),
 ('Ġleukemia', 28837),
 ('Ġare', 32),
 ('Ġacute', 13827),
 ('Ġmy', 127),
 ('el', 523),
 ('oid', 12572),
 ('Ġleukemia', 28837),
 (',', 6),
 ('Ġacute', 13827),
 ('Ġlymph', 23496),
 ('obl', 33449),
 ('astic', 11599),
 ('Ġleukemia', 28837),
 (',', 6),
 ('Ġchronic', 7642),
 ('Ġmy', 127),
 ('el', 523),
 ('oid', 12572),
 ('Ġleukemia', 28837),
 ('Ġand', 8),
 ('Ġchronic', 7642),
 ('Ġlymph', 23496),
 ('ocy', 30321),
 ('tic', 13240),
 ('Ġleukemia', 28837),
 ('.', 4),
 (' <BULL> ', 50265),
 ('The', 133),
 ('Ġacute', 13827),
 ('Ġle', 2084),
 ('uke', 7480),
 ('m', 119),
 ('ias', 5003),
 ('Ġare', 32),
 ('Ġpredominantly', 15351),
 ('Ġcharacterized', 17407),
 ('Ġby', 30),
 ('Ġthe', 5),
 ('Ġuncontrolled', 38411),
 ('Ġgrowth', 434),
 ('Ġof', 9),
 ('Ġimmature', 39001),
 ('Ġpoorly', 12101),
 ('Ġdifferentiated', 32691),
 ('Ġcells', 4590),
 ('Ġthat', 14),
 ('Ġare', 32),
 ('Ġblocked', 4953),
 ('Ġfrom', 31),


In [None]:
df.text_num_tok.describe()

count      453.000000
mean      2957.896247
std       1896.892605
min        640.000000
25%       1680.000000
50%       2488.000000
75%       3616.000000
max      13452.000000
Name: text_num_tok, dtype: float64

In [None]:
len(df[df.text_num_tok > 8192])

13

In [None]:
df.bullets_num_tok.describe()

count    453.000000
mean     185.512141
std       90.921720
min       48.000000
25%      115.000000
50%      170.000000
75%      235.000000
max      680.000000
Name: bullets_num_tok, dtype: float64

In [None]:
len(df[df.bullets_num_tok > 512])

1

In [None]:
max_input_length = 8192
max_output_length = 512
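
The two limits follow from the token-count statistics above: only a small share of samples is longer than either limit. A minimal sketch of that check, assuming a hypothetical helper `truncation_stats` and made-up toy counts (neither is part of the notebook):

```python
def truncation_stats(token_counts, limit):
    """Return (number of samples over the limit, fraction over the limit)."""
    n_over = sum(1 for n in token_counts if n > limit)
    return n_over, n_over / len(token_counts)

# Hypothetical token counts, for illustration only (not the Karger Books data).
toy_counts = [640, 1680, 2488, 3616, 9000, 13452]
n_over, frac = truncation_stats(toy_counts, limit=8192)
print(f"{n_over} of {len(toy_counts)} toy inputs exceed 8192 tokens ({frac:.1%})")

# With the counts reported above: 13 of 453 chapters exceed 8192 input tokens
# and 1 of 453 summaries exceeds 512 tokens.
input_truncation_rate = 13 / 453
output_truncation_rate = 1 / 453
print(f"inputs truncated: {input_truncation_rate:.1%}, targets truncated: {output_truncation_rate:.1%}")
```

On the real data the same function could be applied to `df.text_num_tok` and `df.bullets_num_tok` to compare candidate limits.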

Now, let's write down the input data processing function that will be used to map each data sample to the correct model format.
As explained earlier, `text` is our input data and `bullets` is the target data. The data samples are therefore tokenized and then truncated or padded to the respective maximum lengths of 8192 and 512 tokens.

In addition to the usual `attention_mask`, LED can use a `global_attention_mask` that defines which input tokens are attended to globally and which only locally, just as is the case for [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). For more information on Longformer's self-attention, please take a look at the corresponding [docs](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention). For summarization, we follow the recommendation of the [paper](https://arxiv.org/abs/2004.05150) and use global attention only for the very first token. Finally, we make sure that no loss is computed on padding tokens by setting their label index to `-100`.

In [None]:
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = tokenizer(
        batch["bullets"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # global attention on the first token of every sample, local attention elsewhere;
    # each sample gets its own list instead of a shared reference
    batch["global_attention_mask"] = [
        [1] + [0] * (len(input_ids) - 1) for input_ids in batch["input_ids"]
    ]
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
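
The two masking steps in the function above can be checked on toy token IDs without loading a tokenizer. This is a minimal sketch: the helper names, the toy IDs, and `PAD_ID = 1` (assumed to be the pad token ID of the LED vocabulary) are illustrative assumptions, not part of the notebook.

```python
PAD_ID = 1  # assumption: pad token id of the LED/BART vocabulary

def build_global_attention_mask(input_ids_batch):
    """One independent list per sample: 1 for the first token, 0 elsewhere."""
    return [[1] + [0] * (len(ids) - 1) for ids in input_ids_batch]

def mask_padding_in_labels(labels_batch, pad_id=PAD_ID, ignore_index=-100):
    """Replace padding ids with ignore_index so the loss skips those positions."""
    return [
        [ignore_index if token == pad_id else token for token in labels]
        for labels in labels_batch
    ]

# Toy batch that is already padded (token ids are made up for illustration).
toy_input_ids = [[0, 133, 237, 2, 1, 1], [0, 133, 9, 32, 4, 2]]
toy_labels = [[0, 50265, 133, 2, 1, 1], [0, 50265, 2, 1, 1, 1]]

global_mask = build_global_attention_mask(toy_input_ids)
masked_labels = mask_padding_in_labels(toy_labels)
print(global_mask)
print(masked_labels)
```

Building one list per sample means that changing a single sample's mask later cannot silently change the others.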

The dataset is small, so we do not subsample it: the full training split is used, as the row count below shows.

In [None]:
train_dataset

Dataset({
    features: ['book', 'chapter', 'text', 'bullets'],
    num_rows: 362
})

Great, having defined the mapping function, let's preprocess the training data

In [None]:
train_dataset = train_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=['book', 'chapter', 'text', 'bullets'],
)

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-0498b287fc32a431/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-d04e131ec9c71489.arrow


In [None]:
train_dataset

Dataset({
    features: ['attention_mask', 'global_attention_mask', 'input_ids', 'labels'],
    num_rows: 362
})

and the validation and test data

In [None]:
val_dataset = val_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=['book', 'chapter', 'text', 'bullets'],
)
test_dataset = test_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=['book', 'chapter', 'text', 'bullets'],
)

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-82fd9fd954b5a267/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-5ee4aebeea5df705.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-acb527b34156b505/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-758d0938dbb23390.arrow


Finally, the datasets should be converted into the PyTorch format as follows.

In [None]:
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)
val_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

Alright, we're almost ready to start training. Let's load the model via the `AutoModelForSeq2SeqLM` class.

In [None]:
from transformers import AutoModelForSeq2SeqLM

We've decided to stick to the smaller model `"allenai/led-base-16384"` for the sake of this notebook. In addition, we directly enable gradient checkpointing and disable the caching mechanism to save memory.

In [None]:
model_name_or_path = 'allenai/led-base-16384'
led = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path, gradient_checkpointing=True, use_cache=False)
led.resize_token_embeddings(len(tokenizer))

Embedding(50266, 768)

During training, we want to evaluate the model on ROUGE, the most common metric for summarization, to make sure the model is indeed improving. For this, we set suitable generation parameters. We use beam search with a small beam of just 2 to save memory. We also force the model to generate at least 70 tokens, but no more than 512. In addition, a few other generation parameters are set that have been found helpful for generation. For more information on those parameters, please take a look at the [docs](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate).

In [None]:
# set generate hyperparameters
led.config.num_beams = 2
led.config.max_length = 512
led.config.min_length = 70
led.config.length_penalty = 1.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3
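
With `no_repeat_ngram_size = 3`, generation is not allowed to produce the same trigram twice. The sketch below is a hypothetical post-hoc check on a list of token IDs; it only illustrates the constraint and is not how `generate()` enforces it internally.

```python
def repeated_ngrams(token_ids, n=3):
    """Return the set of n-grams that occur more than once in token_ids."""
    seen, repeated = set(), set()
    for i in range(len(token_ids) - n + 1):
        gram = tuple(token_ids[i:i + n])
        if gram in seen:
            repeated.add(gram)
        seen.add(gram)
    return repeated

# Made-up token ids: the trigram (5, 7, 9) appears twice.
toy_tokens = [5, 7, 9, 11, 5, 7, 9, 2]
print(repeated_ngrams(toy_tokens, n=3))

# A sequence that respects the constraint has no repeated trigram.
print(repeated_ngrams([5, 7, 9, 11, 5, 7, 2], n=3))
```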

The `compute_metrics` function expects the generation output, called `pred.predictions`, as well as the gold labels, called `pred.label_ids`.

Both are decoded back to text, after which the ROUGE scores can be computed.

In [None]:
from rouge_score import rouge_scorer, scoring
import numpy as np
import re
import nltk
nltk.download('punkt')
from sentence_transformers import SentenceTransformer
sentence_distilroberta = SentenceTransformer('paraphrase-distilroberta-base-v1')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
ROUGE_KEYS = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

def extract_rouge_mid_statistics(dct):
    new_dict = {}
    for k1, v1 in dct.items():
        mid = v1.mid
        for stat in ["precision", "recall", "fmeasure"]:
            new_dict[k1+'_'+stat] = round(getattr(mid, stat) * 100, 2)
    return new_dict

def add_newline_to_end_of_each_sentence(x: str) -> str:
    """This was added to get rougeLsum scores matching published rougeL scores for BART and PEGASUS."""
    x = re.sub("<n>", "", x)  # remove pegasus newline char
    return "\n".join(nltk.sent_tokenize(x))

def calculate_rouge(
    pred_lns,
    tgt_lns,
    use_stemmer=True,
    rouge_keys=ROUGE_KEYS,
    return_precision_and_recall=True,
    bootstrap_aggregation=True,
    newline_sep=True):

    scorer = rouge_scorer.RougeScorer(rouge_keys, use_stemmer=use_stemmer)
    aggregator = scoring.BootstrapAggregator()
    for tgt, pred in zip(tgt_lns, pred_lns):
        # rougeLsum expects "\n" separated sentences within a summary
        if newline_sep:
            pred = add_newline_to_end_of_each_sentence(pred)
            tgt = add_newline_to_end_of_each_sentence(tgt)
        scores = scorer.score(tgt, pred)
        aggregator.add_scores(scores)

    if bootstrap_aggregation:
        result = aggregator.aggregate()
        if return_precision_and_recall:
            return extract_rouge_mid_statistics(result)  # here we return dict
        else:
            return {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}

    else:
        return aggregator._scores  # here we return defaultdict(list)

def calculate_sentence_trans_cosine(pred_lns, tgt_lns):

    cosine_sim = lambda a, b: (np.dot(a, b) / (np.linalg.norm(a)*np.linalg.norm(b)))

    return np.mean([\
        cosine_sim(sentence_distilroberta.encode(pred),
                   sentence_distilroberta.encode(tgt))\
        for tgt, pred in zip(tgt_lns, pred_lns)])*100

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    metrics = calculate_rouge(pred_str, label_str)

    cosine_sim = calculate_sentence_trans_cosine(pred_str, label_str)
    metrics.update({"sentence_distilroberta_cosine": cosine_sim})

    return metrics
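
To make the reported precision, recall and F-measure columns concrete, here is a simplified unigram-overlap sketch in the spirit of ROUGE-1. It is an illustration under stated assumptions (whitespace tokenization, lowercasing, no stemming) and not the `rouge_score` implementation used above; the function name and example sentences are made up.

```python
from collections import Counter

def rouge1_sketch(target, prediction):
    """Unigram overlap precision, recall and F-measure on whitespace tokens."""
    tgt_counts = Counter(target.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((tgt_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(tgt_counts.values()), 1)
    fmeasure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

toy_scores = rouge1_sketch(
    target="acute leukemias are characterized by uncontrolled growth",
    prediction="acute leukemias show uncontrolled growth of cells",
)
print({name: round(value * 100, 2) for name, value in toy_scores.items()})
```

Precision penalizes words in the prediction that are missing from the reference, while recall penalizes reference words the prediction omits, which is why the two can diverge in the training log.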

Now, we're ready to start training. Let's import the `Seq2SeqTrainer` and `Seq2SeqTrainingArguments`.

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In contrast to the usual `Trainer`, the `Seq2SeqTrainer` makes it possible to use the `generate()` function during evaluation. This should be enabled with `predict_with_generate=True`. Because our GPU RAM is limited, we make use of gradient accumulation by setting `gradient_accumulation_steps=8`, which gives an effective batch size of 2 * 8 = 16.

The other training arguments are described in the [docs](https://huggingface.co/transformers/main_classes/trainer.html?highlight=trainingarguments#transformers.TrainingArguments).

In [None]:
output_dir = drive_dir+'fine-tuning/'+project_name

log_dir = output_dir + '/logs'
print(log_dir)

/content/drive/My Drive/MAGMA: Summarization/fine-tuning/allenai?led-base-16384_karger_books_base/logs


In [None]:
# enable fp16 mixed-precision training
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    do_train=True,
    num_train_epochs=5,
    do_eval=True,
    evaluation_strategy='steps',
    eval_steps=10,
    predict_with_generate=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    fp16=True,
    logging_steps=5,
    save_steps=20,
    save_total_limit=10,
    logging_dir=log_dir
)
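
The step arithmetic implied by these arguments can be sketched as follows. The helper `training_schedule` is hypothetical, it assumes a single GPU, and it assumes that an incomplete accumulation window at the end of an epoch is dropped; the exact rounding inside `Trainer` may differ between versions.

```python
import math

def training_schedule(num_samples, per_device_batch, accumulation_steps, epochs, num_devices=1):
    """Rough step arithmetic for gradient accumulation (not Trainer's exact logic)."""
    effective_batch = per_device_batch * accumulation_steps * num_devices
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    # Assumption: an incomplete accumulation window at the end of an epoch is dropped.
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return {
        "effective_batch": effective_batch,
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }

# 362 training samples, batch size 2, 8 accumulation steps, 5 epochs (values from this notebook).
schedule = training_schedule(num_samples=362, per_device_batch=2, accumulation_steps=8, epochs=5)
print(schedule)
```

Under these assumptions, `eval_steps=10` and `save_steps=20` are counted in optimizer updates, not in individual batches.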

The training arguments, along with the model, tokenizer, datasets and the `compute_metrics` function can then be passed to the `Seq2SeqTrainer`

In [None]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

and we can start training. This will take about 35 minutes.

In [None]:
trainer.train()

| Step | Training Loss | Validation Loss | Rouge1 Precision | Rouge1 Recall | Rouge1 Fmeasure | Rouge2 Precision | Rouge2 Recall | Rouge2 Fmeasure | Rougel Precision | Rougel Recall | Rougel Fmeasure | Rougelsum Precision | Rougelsum Recall | Rougelsum Fmeasure | Sentence Distilroberta Cosine | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 2.7824 | 2.544499 | 52.35 | 30.7 | 36.34 | 16.22 | 9.63 | 11.44 | 29.67 | 17.57 | 20.67 | 34.87 | 20.3 | 24.05 | 70.823926 | 701.8652 | 0.064 |
| 20 | 2.5032 | 2.438534 | 51.05 | 34.06 | 38.3 | 16.78 | 11.04 | 12.49 | 27.93 | 18.76 | 20.95 | 33.79 | 21.85 | 24.83 | 71.69956 | 825.6693 | 0.055 |
| 30 | 2.2622 | 2.381989 | 48.05 | 36.87 | 38.14 | 14.96 | 11.51 | 11.88 | 26.36 | 20.53 | 20.97 | 31.64 | 24.32 | 25.1 | 73.104674 | 1049.6039 | 0.043 |
| 40 | 2.2491 | 2.354266 | 51.34 | 32.78 | 37.67 | 16.59 | 10.76 | 12.28 | 29.0 | 19.08 | 21.6 | 34.49 | 22.02 | 25.23 | 73.239094 | 799.8891 | 0.056 |
| 50 | 2.0828 | 2.383488 | 50.49 | 34.57 | 38.23 | 16.15 | 11.12 | 12.27 | 29.24 | 20.01 | 22.01 | 33.82 | 22.75 | 25.26 | 73.963046 | 900.8144 | 0.05 |
| 60 | 2.0406 | 2.365585 | 50.29 | 38.04 | 40.92 | 16.2 | 12.51 | 13.28 | 27.51 | 20.96 | 22.46 | 33.06 | 24.59 | 26.63 | 74.976486 | 1039.729 | 0.043 |
| 70 | 2.1248 | 2.347729 | 51.36 | 33.72 | 38.05 | 16.47 | 10.54 | 11.98 | 29.66 | 19.76 | 22.04 | 34.3 | 22.46 | 25.2 | 73.818028 | 888.5797 | 0.051 |
| 80 | 1.9763 | 2.360626 | 48.91 | 37.22 | 39.32 | 15.52 | 11.74 | 12.47 | 27.6 | 21.42 | 22.3 | 32.27 | 24.31 | 25.74 | 74.243903 | 1329.3649 | 0.034 |
| 90 | 2.0123 | 2.348964 | 51.85 | 32.74 | 37.12 | 17.61 | 11.55 | 12.84 | 30.68 | 19.55 | 22.02 | 35.46 | 22.16 | 25.12 | 73.655987 | 846.9593 | 0.053 |
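
The evaluation log above can be used to pick a checkpoint. A minimal sketch, with three columns copied by hand from the log (step, validation loss, ROUGE-1 F-measure); the variable names are illustrative:

```python
# (step, validation loss, ROUGE-1 F-measure) copied from the evaluation log above
eval_log = [
    (10, 2.544499, 36.34),
    (20, 2.438534, 38.30),
    (30, 2.381989, 38.14),
    (40, 2.354266, 37.67),
    (50, 2.383488, 38.23),
    (60, 2.365585, 40.92),
    (70, 2.347729, 38.05),
    (80, 2.360626, 39.32),
    (90, 2.348964, 37.12),
]

best_loss_step = min(eval_log, key=lambda row: row[1])[0]
best_rouge1_step = max(eval_log, key=lambda row: row[2])[0]
print("lowest validation loss at step", best_loss_step)
print("highest ROUGE-1 F-measure at step", best_rouge1_step)
```

The two criteria disagree here, and since `save_steps=20`, only steps that are multiples of 20 would have a checkpoint on disk, so the choice of selection metric matters.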


