# 🏉 Go long! Long range transformers

Recent research has extensively studied how to make the attention calculation in transformer architectures more efficient, mostly to improve their capacity to handle longer token sequences. 👊

The attention calculation is known to be quadratic in computation time with respect to the sequence length 👎. These recent advances, however, perform the attention calculation in near-linear time with respect to the sequence length. This allows us to scale the transformer architecture so that it can handle input sequences beyond the usual length of 512 tokens more efficiently.
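To make the scaling difference concrete, the sketch below counts the query-key score pairs per attention head for full self-attention and for a sliding-window pattern. The window size of 512 and the helper names are illustrative assumptions, not the configuration of any model used in this notebook, and the count ignores global and random attention.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Every token attends to every token: seq_len ** 2 score entries."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int = 512) -> int:
    """Each token attends to at most `window` neighbours (hypothetical window size)."""
    return seq_len * min(window, seq_len)


for n in (512, 1024, 2048, 4096):
    full = full_attention_pairs(n)
    local = sliding_window_pairs(n)
    print(f"{n:>5} tokens: full={full:>10,} windowed={local:>10,} ratio={full / local:.0f}x")
```

Doubling the sequence length quadruples the full-attention count but only doubles the windowed one, which is where the near-linear behaviour comes from.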

In this notebook, we compare traditional transformers with novel efficient transformers. We'll use RoBERTa as a baseline to compare against Longformer and BigBird.  

Let's put these architectures to the test and see which one comes out on top 🏆!  


## 🛠️ Getting started: Install packages & download models

The cells below set up everything required to get started with model training:

* Install the required Python packages
* Import the required packages

In [None]:
!pip install -q scikit-learn transformers datasets torch plotly sentencepiece tqdm

In [None]:
import time 
import sys 
import json
import shutil
import pandas as pd
from enum import Enum
import math
import torch

import plotly.express as px
import plotly.graph_objects as go


from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    BigBirdForSequenceClassification,
    BigBirdTokenizerFast,
    LongformerForSequenceClassification,
    LongformerTokenizerFast,
    RobertaForSequenceClassification,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

## 💾 Dataset & downstream task

We will use the [Hyperpartisan news dataset](https://huggingface.co/datasets/hyperpartisan_news_detection) for binary classification: deciding whether a news article is hyperpartisan or not. In the Longformer and BigBird papers, both architectures were compared against RoBERTa on this exact dataset.

This dataset contains on average wordpieces, which is ideal to make our point 💪.

We aim to gain more insight into when to use which architecture. Therefore we will go *one step beyond 🔥* and evaluate the architectures on distinct subsets of the data, each time introducing articles with more tokens!

In [None]:
# Load the tokenizer
tokenizer =  RobertaTokenizer.from_pretrained('roberta-base')

In [None]:
# Load the hyperpartisan dataset
ds = load_dataset('hyperpartisan_news_detection', 'byarticle')['train']

# Rename the label column for uniformity
ds = ds.rename_column("hyperpartisan", "label")

# Remove unused columns
ds = ds.remove_columns(['title', 'url', 'published_at'])

# Add token length column to filter on later
ds = ds.add_column(
    name='token_length',
    column=[len(tokenizer(x['text']).input_ids) for x in ds],
)

Split into train and test set

In [None]:
split_ds = ds.train_test_split(test_size=0.20)

Make various train partitions

In [None]:
train_ds = split_ds['train']
test_ds = split_ds['test']

train_ds_dict = {}

for min_tok, max_tok in [(0, 256), (0, 512), (0, 1024), (0, 2048), (0, 4096)]:
    # Keep only the articles whose token length lies in (min_tok, max_tok]
    subset = train_ds.filter(lambda x: min_tok < x['token_length'] <= max_tok)

    # Round the number of examples down to the nearest even number:
    # the longformer optimizer cannot handle a single remaining datapoint at the end of the epoch
    train_ds_dict[str(max_tok)] = subset.select(range(len(subset) // 2 * 2))

# 💥 Models

Run the cells below to redo the training of each model

🛎️ Disclaimer: this will take some time... So if you're a busy bee or a hurrying hippo, you can skip this section and load in the results from one of our runs smoothly and swiftly 😊!

## 💪 Training

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

def train_and_evaluate(timing_run, model, identifier, tokenizer, train_ds, test_ds, max_length=1024):
    def tokenization(batched_text):
        return tokenizer(batched_text['text'], padding='max_length', truncation=True, max_length=min(max_length, tokenizer.model_max_length))

    # tokenizing both the training and test dataset 
    train_ds = train_ds.map(tokenization,
                            batched=True,
                            batch_size=len(train_ds),
                            remove_columns=['text'])

    test_ds = test_ds.map(tokenization,
                          batched=True,
                          batch_size=len(test_ds),
                          remove_columns=['text'])

    train_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    test_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

    parameters = model.num_parameters()

    # Set different number of steps for accuracy and timing runs
    logging_steps = 1000 if timing_run else 10
    eval_steps = 1000 if timing_run else 10
    max_steps = 100 if timing_run else 250

    training_args = TrainingArguments(
        # Set the batch sizes
        per_device_train_batch_size=2,
        per_device_eval_batch_size=4,

        # Apply efficiency tricks
        gradient_accumulation_steps=8,
        fp16=True,
        
        # Step parameters
        evaluation_strategy="steps",
        warmup_steps=0,
        eval_steps=eval_steps, 
        max_steps=max_steps,
        logging_steps=logging_steps,

        # Optimizer parameters
        learning_rate=2e-5,

        # Finalization
        load_best_model_at_end=not timing_run,
        metric_for_best_model='accuracy', # default value is validation loss, we want the model with the highest accuracy 
        
        # Output locations
        output_dir='./{}'.format(identifier),
        run_name='{}'.format(identifier),
        logging_dir='./{}-logging'.format(identifier),
        log_level='warning',
        save_strategy="steps"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_ds,
        eval_dataset=test_ds,
    )
    start_time = time.time()
    trainer.train()
    duration = time.time() - start_time

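    # Remove the checkpoint directory to free disk space; ignore the error if it was never created
    shutil.rmtree('./{}'.format(identifier), ignore_errors=True)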

    metrics = None
    if not timing_run:
        metrics = trainer.evaluate()

    return trainer, parameters, metrics, duration
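As a quick sanity check of what these metrics report, the sketch below recomputes them by hand on a made-up toy batch. The `binary_metrics` helper and the labels are hypothetical and only mirror the definitions behind `compute_metrics`, with undefined ratios reported as 0.0. A classifier that always predicts the negative class scores an accuracy equal to the share of negative examples while precision, recall and F1 all drop to 0.

```python
def binary_metrics(labels, preds):
    """Hypothetical helper: accuracy, precision, recall and F1 for the positive class (1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)

    accuracy = correct / len(labels)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'f1': f1, 'precision': precision, 'recall': recall}


# Toy labels, made up for illustration: 3 positives out of 8 examples
labels = [0, 0, 0, 0, 0, 1, 1, 1]
all_negative = [0] * len(labels)      # always predicts the negative class
one_hit = [0, 0, 0, 0, 0, 1, 0, 0]    # finds a single positive, no false alarms

print(binary_metrics(labels, all_negative))
print(binary_metrics(labels, one_hit))
```

The second case shows why a precision of 1.0 can go hand in hand with a low F1: a single correct positive prediction is enough for perfect precision, but recall stays low.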

In [None]:
def run_training(timing_run, model, tokenizer, model_name, train_ds, test_ds, max_length=1024):
    torch.cuda.empty_cache()

    try:
        identifier = '{}-{}-{}'.format(model_name, max_length, 'timing' if timing_run else 'accuracy').lower()
        trainer, parameters, metrics, duration  = train_and_evaluate(timing_run, model, identifier, tokenizer, train_ds, test_ds, max_length)

        results_and_metrics = {
            "model_name": model_name,
            "parameters": parameters,
            "duration": duration,
            "metrics": metrics,
            'tokencount': str(max_length),
            'timing_run': timing_run
        }
        with open(f'{identifier}-results.json', "w") as fd:
            json.dump(results_and_metrics, fd)

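    except RuntimeError as err:
        # `trainer` is never bound when train_and_evaluate raises (e.g. CUDA out of memory),
        # so there is nothing to delete here; report the failure and move on to the next run
        print(f"Training {model_name} with max_length={max_length} failed: {err}")

    # Drop the local reference to the model before the next run
    del model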
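Note that `train_and_evaluate` caps the requested `max_length` at `tokenizer.model_max_length`, so a model is never fed more tokens than its tokenizer declares. The sketch below tabulates the resulting effective lengths. The limits of 512 for RoBERTa and 4096 for Longformer and BigBird are assumptions and should be checked against `tokenizer.model_max_length` at runtime.

```python
# Assumed tokenizer limits (hypothetical values; verify with tokenizer.model_max_length)
ASSUMED_MODEL_MAX_LENGTH = {'roberta': 512, 'longformer': 4096, 'bigbird': 4096}


def effective_max_length(model_name: str, requested: int) -> int:
    """Mirror the min(max_length, tokenizer.model_max_length) cap used during tokenization."""
    return min(requested, ASSUMED_MODEL_MAX_LENGTH[model_name])


for requested in (256, 512, 1024, 2048, 4096):
    row = {name: effective_max_length(name, requested) for name in ASSUMED_MODEL_MAX_LENGTH}
    print(requested, row)
```

Under these assumptions RoBERTa only ever sees the first 512 tokens of an article, even in the runs labelled 1024 and above.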


### 🎯 Accuracy runs

In [None]:
timing = False
gradient_checkpointing = True

for size in [256, 512, 1024, 2048, 4096]:

    for model_name, tokenizer, model in [
        (
            'roberta',
            RobertaTokenizer.from_pretrained('roberta-base'),
            RobertaForSequenceClassification.from_pretrained('roberta-base',
                                                             gradient_checkpointing=gradient_checkpointing,
                                                             num_labels=2)
        ),
        (
            'longformer',
            LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096'),
            LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096',
                                                                gradient_checkpointing=gradient_checkpointing,
                                                                attention_window=128,
                                                                num_labels=2)
        ),
        (
            'bigbird',
            BigBirdTokenizerFast.from_pretrained('google/bigbird-roberta-base'),
            BigBirdForSequenceClassification.from_pretrained('google/bigbird-roberta-base',
                                                             gradient_checkpointing=gradient_checkpointing,
                                                             num_labels=2)
        )
        ]:

        print(f"Training {model_name} on size {size}")

        run_training(timing, model, tokenizer, model_name, train_ds_dict[str(size)], test_ds, max_length=size)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi

Some weights of the model checkpoint at allenai/longformer-base-4096 were not used when initializing LongformerForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing LongformerForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LongformerForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight'

Some weights of the model checkpoint at google/bigbird-roberta-base were not used when initializing BigBirdForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BigBirdForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BigBirdForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BigBirdForSequenceClassifica

Training roberta on size 256


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.6642,0.761917,0.612403,0.0,0.0,0.0
20,0.5068,0.961746,0.612403,0.0,0.0,0.0
30,0.4043,0.941379,0.612403,0.0,0.0,0.0
40,0.2604,1.289519,0.620155,0.039216,1.0,0.02
50,0.129,1.549111,0.658915,0.214286,1.0,0.12
60,0.0096,1.494855,0.736434,0.514286,0.9,0.36
70,0.0023,2.464012,0.635659,0.113208,1.0,0.06
80,0.0013,2.576693,0.635659,0.113208,1.0,0.06
90,0.0008,2.435969,0.682171,0.305085,1.0,0.18
100,0.0007,2.424143,0.689922,0.333333,1.0,0.2



Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.


Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



Training longformer on size 256


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.65,0.835249,0.612403,0.0,0.0,0.0
20,0.4681,1.113063,0.612403,0.0,0.0,0.0
30,0.2259,1.561631,0.620155,0.039216,1.0,0.02
40,0.0742,1.501076,0.689922,0.393939,0.8125,0.26
50,0.0078,2.429008,0.627907,0.076923,1.0,0.04
60,0.0021,2.596727,0.643411,0.148148,1.0,0.08
70,0.0012,2.076451,0.713178,0.506667,0.76,0.38
80,0.0012,2.161357,0.697674,0.493506,0.703704,0.38
90,0.0007,2.765382,0.658915,0.214286,1.0,0.12
100,0.0528,2.547976,0.674419,0.275862,1.0,0.16




Training bigbird on size 256


Attention type 'block_sparse' is not possible if sequence_length: 256 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3.Changing attention type to 'original_full'...


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.6298,0.749291,0.612403,0.0,0.0,0.0
20,0.5173,0.852247,0.612403,0.0,0.0,0.0
30,0.3887,0.870401,0.612403,0.0,0.0,0.0
40,0.2688,1.011961,0.612403,0.0,0.0,0.0
50,0.2162,1.014458,0.674419,0.3,0.9,0.18
60,0.0949,0.889575,0.713178,0.602151,0.651163,0.56
70,0.0331,1.121046,0.736434,0.585366,0.75,0.48
80,0.0149,1.37336,0.697674,0.434783,0.789474,0.3
90,0.0086,1.457502,0.72093,0.5,0.818182,0.36
100,0.0066,1.533653,0.713178,0.493151,0.782609,0.36




Training roberta on size 512


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.4999,0.983921,0.612403,0.0,0.0,0.0
20,0.5274,0.868535,0.612403,0.0,0.0,0.0
30,0.475,0.787669,0.612403,0.0,0.0,0.0
40,0.4758,0.787085,0.612403,0.0,0.0,0.0
50,0.3321,0.771663,0.612403,0.0,0.0,0.0
60,0.313,1.167688,0.612403,0.0,0.0,0.0
70,0.2499,0.691968,0.813953,0.72093,0.861111,0.62
80,0.1487,0.749225,0.806202,0.725275,0.804878,0.66
90,0.1215,1.453197,0.713178,0.412698,1.0,0.26
100,0.0878,0.818421,0.829457,0.75,0.868421,0.66




Training longformer on size 512


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.5655,1.044518,0.612403,0.0,0.0,0.0
20,0.5335,0.780434,0.612403,0.0,0.0,0.0
30,0.4415,0.951957,0.612403,0.0,0.0,0.0
40,0.4233,0.689905,0.612403,0.0,0.0,0.0
50,0.3069,0.879322,0.604651,0.0,0.0,0.0
60,0.213,1.281937,0.620155,0.039216,1.0,0.02
70,0.1118,0.96249,0.72093,0.55,0.733333,0.44
80,0.0451,1.798414,0.635659,0.175439,0.714286,0.1
90,0.0066,1.567969,0.72093,0.55,0.733333,0.44
100,0.0032,2.055627,0.689922,0.393939,0.8125,0.26




Training bigbird on size 512


Attention type 'block_sparse' is not possible if sequence_length: 512 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3.Changing attention type to 'original_full'...


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.5555,0.847979,0.612403,0.0,0.0,0.0
20,0.4858,0.772949,0.612403,0.0,0.0,0.0
30,0.5159,0.850025,0.612403,0.0,0.0,0.0
40,0.4959,0.802247,0.612403,0.0,0.0,0.0
50,0.3861,0.898066,0.612403,0.0,0.0,0.0
60,0.4263,0.720003,0.612403,0.0,0.0,0.0
70,0.4082,0.76411,0.612403,0.0,0.0,0.0
80,0.3018,0.750921,0.612403,0.0,0.0,0.0
90,0.2381,0.680616,0.627907,0.076923,1.0,0.04
100,0.2152,0.827965,0.651163,0.181818,1.0,0.1




Training roberta on size 1024


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.5862,0.83149,0.612403,0.0,0.0,0.0
20,0.5571,0.671752,0.612403,0.0,0.0,0.0
30,0.5205,0.813238,0.612403,0.0,0.0,0.0
40,0.5676,0.658352,0.612403,0.0,0.0,0.0
50,0.438,0.797536,0.612403,0.0,0.0,0.0
60,0.5666,0.618376,0.612403,0.0,0.0,0.0
70,0.4614,0.594566,0.612403,0.0,0.0,0.0
80,0.4601,0.599614,0.612403,0.0,0.0,0.0
90,0.4148,0.75956,0.705426,0.387097,1.0,0.24
100,0.3872,0.474885,0.806202,0.698795,0.878788,0.58




Training longformer on size 1024


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.5801,0.855798,0.612403,0.0,0.0,0.0
20,0.5486,0.73448,0.612403,0.0,0.0,0.0
30,0.4956,0.88622,0.612403,0.0,0.0,0.0
40,0.5132,0.667367,0.612403,0.0,0.0,0.0
50,0.3483,0.808092,0.612403,0.0,0.0,0.0
60,0.4464,0.670474,0.627907,0.076923,1.0,0.04
70,0.4044,0.690457,0.72093,0.4375,1.0,0.28
80,0.3574,0.812586,0.751938,0.529412,1.0,0.36
90,0.2684,0.570404,0.767442,0.673913,0.738095,0.62
100,0.2268,0.802243,0.744186,0.571429,0.814815,0.44




Training bigbird on size 1024






Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.5777,0.759369,0.612403,0.0,0.0,0.0
20,0.5995,0.796641,0.612403,0.0,0.0,0.0
30,0.5416,0.799303,0.612403,0.0,0.0,0.0
40,0.5718,0.68229,0.612403,0.0,0.0,0.0
50,0.4031,0.706493,0.612403,0.0,0.0,0.0
60,0.4894,0.694978,0.612403,0.0,0.0,0.0
70,0.5066,0.620515,0.612403,0.0,0.0,0.0
80,0.4676,0.515016,0.806202,0.712644,0.837838,0.62
90,0.3906,0.60083,0.775194,0.591549,1.0,0.42
100,0.3525,0.485372,0.806202,0.712644,0.837838,0.62




Training roberta on size 2048


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.6745,0.657507,0.612403,0.0,0.0,0.0
20,0.6305,0.707385,0.612403,0.0,0.0,0.0
30,0.6737,0.643604,0.612403,0.0,0.0,0.0
40,0.6088,0.624107,0.612403,0.0,0.0,0.0
50,0.5849,0.584006,0.612403,0.0,0.0,0.0
60,0.4944,0.702376,0.666667,0.245614,1.0,0.14
70,0.5031,0.539003,0.782946,0.688889,0.775,0.62
80,0.4674,0.502834,0.782946,0.688889,0.775,0.62
90,0.4304,0.551373,0.821705,0.735632,0.864865,0.64
100,0.3888,0.484905,0.782946,0.702128,0.75,0.66





Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.



Training longformer on size 2048



Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.6877,0.661641,0.612403,0.0,0.0,0.0
20,0.6163,0.671856,0.612403,0.0,0.0,0.0
30,0.669,0.63352,0.612403,0.0,0.0,0.0
40,0.5919,0.60162,0.612403,0.0,0.0,0.0
50,0.5173,0.493386,0.775194,0.666667,0.783784,0.58
60,0.463,0.732943,0.728682,0.461538,1.0,0.3
70,0.3875,0.438826,0.821705,0.757895,0.8,0.72
80,0.3833,0.453931,0.775194,0.72381,0.690909,0.76
90,0.3374,0.414173,0.868217,0.804598,0.945946,0.7
100,0.2628,0.406023,0.868217,0.804598,0.945946,0.7




Training bigbird on size 2048



Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.6773,0.654226,0.612403,0.0,0.0,0.0
20,0.5884,0.610554,0.612403,0.0,0.0,0.0
30,0.5951,0.520661,0.79845,0.682927,0.875,0.56
40,0.436,0.440862,0.829457,0.792453,0.75,0.84
50,0.4101,0.373192,0.860465,0.8125,0.847826,0.78
60,0.3651,0.421595,0.837209,0.746988,0.939394,0.62
70,0.2908,0.362163,0.868217,0.828283,0.836735,0.82
80,0.2907,0.359947,0.868217,0.824742,0.851064,0.8
90,0.196,0.357636,0.891473,0.851064,0.909091,0.8
100,0.1337,0.395859,0.875969,0.818182,0.947368,0.72





Training roberta on size 4096



Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.6635,0.658828,0.612403,0.0,0.0,0.0
20,0.7123,0.652497,0.612403,0.0,0.0,0.0
30,0.6483,0.638417,0.612403,0.0,0.0,0.0
40,0.6462,0.603288,0.612403,0.0,0.0,0.0
50,0.6309,0.555263,0.744186,0.592593,0.774194,0.48
60,0.5424,0.581419,0.751938,0.542857,0.95,0.38
70,0.5093,0.50274,0.790698,0.703297,0.780488,0.64
80,0.4019,0.555112,0.806202,0.705882,0.857143,0.6
90,0.3354,0.472136,0.790698,0.709677,0.767442,0.66
100,0.3271,0.458029,0.813953,0.727273,0.842105,0.64




Training longformer on size 4096



Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.6627,0.662502,0.612403,0.0,0.0,0.0
20,0.7011,0.63494,0.612403,0.0,0.0,0.0
30,0.6244,0.692817,0.612403,0.0,0.0,0.0
40,0.6195,0.571276,0.736434,0.679245,0.642857,0.72
50,0.5235,0.532642,0.767442,0.634146,0.8125,0.52
60,0.4123,0.53754,0.782946,0.681818,0.789474,0.6
70,0.4279,0.485278,0.775194,0.72381,0.690909,0.76
80,0.3466,0.521663,0.790698,0.696629,0.794872,0.62
90,0.2342,0.458516,0.79845,0.745098,0.730769,0.76
100,0.2691,0.478383,0.829457,0.76087,0.833333,0.7




Training bigbird on size 4096



Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
10,0.6333,0.663903,0.612403,0.0,0.0,0.0
20,0.6412,0.606036,0.612403,0.0,0.0,0.0
30,0.5808,0.628415,0.612403,0.0,0.0,0.0
40,0.5534,0.494754,0.844961,0.772727,0.894737,0.68
50,0.4913,0.455993,0.829457,0.78,0.78,0.78
60,0.3898,0.453985,0.837209,0.740741,0.967742,0.6
70,0.3491,0.388494,0.821705,0.747253,0.829268,0.68
80,0.3235,0.404326,0.821705,0.788991,0.728814,0.86
90,0.2506,0.378507,0.837209,0.778947,0.822222,0.74
100,0.2102,0.392178,0.860465,0.808511,0.863636,0.76
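
The rows in the tables above where F1, precision and recall are all 0.0 while accuracy stays at 0.612403 are consistent with the model predicting only the negative class at those evaluation steps. That is the situation the scikit-learn warning in the logs describes: with no predicted positives, precision is 0/0. The sketch below reproduces that arithmetic in plain Python. The evaluation set of 129 samples with 50 positives is an inference from the reported metrics, not something stated in this notebook, and `binary_metrics` is a hypothetical helper, not the metric function used in the training runs. In scikit-learn itself, passing `zero_division=0` to `precision_recall_fscore_support` gives the same 0.0 values without emitting the warning.

```python
def binary_metrics(y_true, y_pred, zero_division=0.0):
    """Accuracy, precision, recall and F1 for 0/1 labels (hypothetical helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    def safe_div(num, den):
        # A ratio with a zero denominator is undefined; fall back to `zero_division`
        return num / den if den else zero_division

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed evaluation set: 50 positive and 79 negative samples (inferred, see above)
y_true = [1] * 50 + [0] * 79

# A model that never predicts the positive class
all_negative = binary_metrics(y_true, [0] * 129)

# A model that finds 7 of the 50 positives and raises no false alarms
few_positives = binary_metrics(y_true, [1] * 7 + [0] * 122)
```

Under these assumptions, the first case gives an accuracy of 79/129 ≈ 0.612 with the other metrics at 0.0, and the second matches the step-60 row of the roBERTa run at size 2048 (accuracy ≈ 0.667, F1 ≈ 0.246).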




### ⏱️ Timing runs

In [None]:
timing = True
gradient_checkpointing = True

for size in [256, 512, 1024, 2048, 4096]:

    for model_name, tokenizer, model in [
        (
            'roberta',
            RobertaTokenizer.from_pretrained('roberta-base'),
            RobertaForSequenceClassification.from_pretrained('roberta-base',
                                                             gradient_checkpointing=gradient_checkpointing,
                                                             num_labels=2)
        ),
        (
            'longformer',
            LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096'),
            LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096',
                                                                gradient_checkpointing=gradient_checkpointing,
                                                                attention_window=256,
                                                                num_labels=2)
        ),
        (
            'bigbird',
            BigBirdTokenizerFast.from_pretrained('google/bigbird-roberta-base'),
            BigBirdForSequenceClassification.from_pretrained('google/bigbird-roberta-base',
                                                             gradient_checkpointing=gradient_checkpointing,
                                                             num_labels=2)
        )
        ]:

        print(f"Training {model_name} on size {size}")

        run_training(timing, model, tokenizer, model_name, train_ds_dict[str(size)], test_ds, max_length=size)
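
How `run_training` records the duration of a run is not shown in this section. As a rough illustration of the idea only, the wall-clock time of a fixed amount of work can be captured with a small context manager. `timed` and the `timings` dictionary below are hypothetical names, not part of the notebook's code.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical store: label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so partial runs are still visible
        timings[label] = time.perf_counter() - start


# Stand-in workload; in the notebook this would wrap a training call instead
with timed("dummy-256"):
    sum(i * i for i in range(100_000))
```

Keeping the number of training steps fixed across sequence lengths, as the timing runs do, is what makes the recorded durations comparable.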

# 📈 Visualize

In [None]:
import os 
run_dict = []

# loop to collect the results from each model
for root, dirs, files in os.walk('./'):
    for name in files:
        if name.endswith("results.json"):
            full_path = os.path.join(root, name)
            with open(full_path) as f:
                data = json.load(f)
                run_dict.append(data)

In [None]:
df = pd.DataFrame(run_dict)
df2 = df.metrics.apply(pd.Series)
result = pd.concat([df.drop('metrics', axis=1), df2], axis=1)
result["iden"] = result["model_name"] + '-' + result["tokencount"].astype(str)
result['parameters'] /= 1e6
result.sort_values(by=['model_name', 'tokencount'])

Unnamed: 0,model_name,parameters,duration,tokencount,timing_run,eval_loss,eval_accuracy,eval_f1,eval_precision,eval_recall,eval_runtime,eval_samples_per_second,eval_steps_per_second,epoch,iden
21,bigbird,128.06093,488.144646,1024,True,,,,,,,,,,bigbird-1024
22,bigbird,128.06093,1547.258527,1024,False,0.506037,0.852713,0.776471,0.942857,0.66,8.0108,16.103,4.119,13.15,bigbird-1024
9,bigbird,128.06093,925.536229,2048,True,,,,,,,,,,bigbird-2048
24,bigbird,128.06093,2830.359838,2048,False,0.465422,0.906977,0.866667,0.975,0.78,15.7516,8.19,2.095,8.92,bigbird-2048
13,bigbird,128.06093,93.173873,256,True,,,,,,,,,,bigbird-256
23,bigbird,128.06093,449.907733,256,False,1.121046,0.736434,0.585366,0.75,0.48,1.4328,90.036,23.032,83.3,bigbird-256
16,bigbird,128.06093,1819.089413,4096,True,,,,,,,,,,bigbird-4096
29,bigbird,128.06093,5416.983666,4096,False,0.34784,0.891473,0.851064,0.909091,0.8,29.4785,4.376,1.119,8.06,bigbird-4096
6,bigbird,128.06093,173.928012,512,True,,,,,,,,,,bigbird-512
7,bigbird,128.06093,633.653134,512,False,0.683401,0.837209,0.746988,0.939394,0.62,2.8548,45.187,11.56,27.71,bigbird-512


## Importing results (optional)

In [None]:
!git clone https://github.com/ml6team/quick-tips.git
!cd quick-tips && git checkout feature/long_seq_transformers
!mv quick-tips/nlp/long_range_transformers/results ./results
!rm -rf quick-tips

In [None]:
result = pd.read_csv("results/experiment_results.csv")

## Visualization

In [None]:
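import plotly.graph_objects as go

sizes = ['256', '512', '1024', '2048', '4096']

# Compare token counts as strings: when the results are loaded with pd.read_csv,
# `tokencount` may be parsed as integers, and comparing those to '256' matches nothing.
tokencount = result['tokencount'].astype(str)

fig = go.Figure(data=[
    go.Bar(
        name=model,
        x=sizes,
        # Accuracy of the full (non-timing) run for each sequence length
        y=[result[
            (result['timing_run'] == False)
            & (result['model_name'] == model)
            & (tokencount == size)]['eval_accuracy'].iloc[0]
        for size in sizes]
    )
    for model in ['roberta', 'bigbird', 'longformer']
])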

fig.update_layout(
    barmode='group',
    title="Accuracy of the various models and data token sizes",
    xaxis_title="Dataset token size",
    yaxis_title="Accuracy"
)

fig.show()

In [None]:
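import plotly.graph_objects as go

sizes = ['256', '512', '1024', '2048', '4096']

# Compare token counts as strings: when the results are loaded with pd.read_csv,
# `tokencount` may be parsed as integers, and comparing those to '256' matches nothing.
tokencount = result['tokencount'].astype(str)

fig = go.Figure(data=[
    go.Bar(
        name=model,
        x=sizes,
        # Duration of the timing run for each sequence length
        y=[result[
            (result['timing_run'] == True)
            & (result['model_name'] == model)
            & (tokencount == size)]['duration'].iloc[0]
        for size in sizes]
    )
    for model in ['roberta', 'bigbird', 'longformer']
])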

fig.update_layout(
    barmode='group',
    title="Training time for 100 steps for the various models and dataset token sizes",
    xaxis_title="Dataset token size",
    yaxis_title="Training time (seconds)"
)

fig.show()

# 🎁 Wrapping up 

Nicely done! Let's wrap up some of our findings 🧪.

* Both LongFormer and BigBird outperform roBERTa on longer sequences by a small margin 💥 **Sparse attention for the win** 👑
* BigBird and LongFormer also perform well on shorter sequences, in line with roBERTa 💪, with LongFormer lagging slightly behind BigBird
* LongFormer and BigBird have **higher training times** than roBERTa. However, their training time appears to grow closer to linearly than quadratically with the sequence length 🙋
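
One way to put a number on the "more linear than quadratic" observation is to fit a power law, duration ≈ c · length^k, to the timing runs: k close to 1 suggests linear scaling, k close to 2 suggests quadratic scaling. The sketch below does this with an ordinary least-squares fit in log-log space, using the BigBird timing durations from the results table shown earlier. `scaling_exponent` is a hypothetical helper, and five points from a single run give only a rough indication.

```python
import math


def scaling_exponent(lengths, durations):
    """Least-squares slope of log(duration) against log(length)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in durations]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# BigBird timing runs (timing_run == True) from the results table: tokens -> seconds
bigbird_timing = {
    256: 93.173873,
    512: 173.928012,
    1024: 488.144646,
    2048: 925.536229,
    4096: 1819.089413,
}

k = scaling_exponent(list(bigbird_timing), list(bigbird_timing.values()))
print(f"estimated exponent k = {k:.2f}")
```

For these five points the fitted exponent comes out at roughly 1.1, which fits the near-linear picture for BigBird. The same check could be applied to the timing rows of the other models.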

