# Text classification with the *Longformer*

In a previous [post](https://jesusleal.io/2020/10/20/RoBERTA-Text-Classification/) I explored how to use the Hugging Face Transformers *Trainer* class to easily create a text classification pipeline. The code was pretty straightforward to implement, and I was able to obtain results that put the basic model at a very competitive level with a few lines of code. In that post I also briefly discussed one of the main drawbacks of the first generation of Transformers and BERT-based architectures: the sequence length is limited to a maximum of 512 tokens. The reason behind that limitation is that the cost of the self-attention mechanism scales quadratically with the input sequence length, *O(n^2)*. Given the need to process longer sequences of text, a second generation of attention-based models has been proposed. The idea behind these models is to reduce the memory footprint of the attention mechanism in order to process longer sequences of text; see this really useful analysis of transformer models that try to overcome this limitation [by researchers from Google](https://arxiv.org/pdf/2009.06732.pdf). New models such as the [***Reformer***](https://arxiv.org/pdf/2001.04451.pdf) by Google propose a series of innovations to the traditional Transformer architecture such as Local Self-Attention, Locality Sensitive Hashing (LSH) Self-Attention and Chunked Feed Forward Layers (see this [post](https://huggingface.co/blog/reformer) from Hugging Face for a detailed discussion). This model can process sequences of half a million tokens with as little as 8GB of RAM. However, one big drawback of the model for downstream applications is that the authors have not released pretrained weights, and at the time of publication of this post there is no freely available model pretrained on a large corpus.
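To make the quadratic cost concrete, here is a back-of-the-envelope sketch that counts the entries of the attention score matrix for a single head and a single sequence. The function names and the assumption of 4 bytes per score (fp32) are mine, chosen only for illustration.

```python
def full_attention_entries(seq_len: int) -> int:
    """Number of scores in a full self-attention matrix (one head, one sequence)."""
    return seq_len * seq_len


def attention_mebibytes(seq_len: int, bytes_per_score: int = 4) -> float:
    """Rough size of that matrix in MiB, assuming fp32 scores (4 bytes each)."""
    return full_attention_entries(seq_len) * bytes_per_score / 2**20


for n in (512, 1024, 4096):
    print(f"{n:>5} tokens -> {full_attention_entries(n):>10,} scores, "
          f"{attention_mebibytes(n):6.1f} MiB per head")

# Going from 512 to 4,096 tokens is 8x more tokens but 64x more scores.
growth = full_attention_entries(4096) // full_attention_entries(512)
```

Multiply those numbers by the number of heads, layers and the batch size and it becomes clear why full attention runs out of memory quickly on long documents.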

Another very promising model, and the subject of this post, is the [***Longformer***](https://arxiv.org/pdf/2004.05150.pdf) by researchers from the Allen Institute for AI. The Longformer can process sequences of thousands of tokens without facing the memory bottleneck of BERT-like architectures, and it achieved SOTA on several benchmarks at the time of publication. The authors use a variation of attention called sliding window (local) attention, where every token only attends to tokens in its vicinity, defined by a window *w*: each token attends to $\frac{1}{2}w$ tokens to the left and $\frac{1}{2}w$ tokens to the right. To increase the receptive field, the authors also applied dilation to the local window, so they can extend the reach of the window without incurring additional memory costs. A dilation is simply a "hole": the window skips some positions, which lets a token reach tokens that are farther away. Performance is not hurt since the transformer architecture has multiple attention heads across multiple layers, and the different layers and heads learn to attend to different properties of the text. In addition to the local attention, the authors also included tokens that attend globally so they can be used in downstream tasks, just like the *CLS* token of BERT. One of the interesting aspects of this model is that the authors created their own CUDA kernel to calculate the attention scores of the sliding window attention. This type of attention is more efficient since there are many zeros in the matrix; the operation is called banded matrix multiplication, but it is not implemented in PyTorch/TensorFlow. Thanks to our friends from Hugging Face, an implementation with standard CUDA kernels is available. Although it does not have all the capabilities the authors of the Longformer describe in their paper, it is suitable for finetuning on [downstream tasks](https://github.com/allenai/longformer).
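The sliding window pattern can be sketched in a few lines of plain Python. This is only an illustration of which positions a token looks at, not the banded CUDA kernel from the paper; the function names are hypothetical, and I assume the convention that `dilation=1` means no gaps.

```python
from typing import List


def attended_positions(i: int, seq_len: int, window: int, dilation: int = 1) -> List[int]:
    """Positions token `i` attends to: window // 2 neighbours on each side,
    spaced `dilation` apart, plus the token itself."""
    half = window // 2
    offsets = range(-half * dilation, half * dilation + 1, dilation)
    return [i + off for off in offsets if 0 <= i + off < seq_len]


def sliding_window_entries(seq_len: int, window: int) -> int:
    """Total scores computed with a non-dilated window: O(n * w) instead of O(n^2)."""
    return sum(len(attended_positions(i, seq_len, window)) for i in range(seq_len))


print(attended_positions(5, seq_len=12, window=4))              # [3, 4, 5, 6, 7]
print(attended_positions(5, seq_len=12, window=4, dilation=2))  # [1, 3, 5, 7, 9]
```

With dilation the token still computes the same number of scores, but they are spread over a span twice as wide, which is how the receptive field grows at no extra memory cost.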


The authors tested the model in an autoregressive setting to process sequences of thousands of tokens, achieving state of the art on *text8* and *enwik8*. They also evaluated the model on downstream tasks: starting from the weights of RoBERTa, they continued pretraining with masked language modeling (MLM) on one third of the Realnews dataset and one third of the Stories dataset. The authors pretrained two variations of the model, a base model (12 layers) and a large model (24 layers). Both models were trained for 65K gradient updates with sequences of length 4,096 and batch size 64. Once the pretraining was completed they tested the models on downstream tasks such as question answering, coreference resolution and document classification. The model achieved SOTA results on the WikiHop and TriviaQA datasets and on the Hyperpartisan news dataset. For the IMDB dataset the authors achieved 95.7 percent accuracy, a small increase from the 95.3 percent accuracy reported for RoBERTa.

Given all these nice features, I decided to try the model and see how it compares to RoBERTa on IMDB, the iris dataset of text classification. For this script I used the Trainer class from Hugging Face and the pretrained model released by Allen AI, available in the Hugging Face model hub.


In [1]:
import pandas as pd
import datasets
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig
import torch.nn as nn
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
import wandb
import os

One of the cool features of this model is that you can specify the attention sliding window separately for each layer; the authors exploited this design in the autoregressive language model, using different sliding windows for different layers. If this parameter is not changed, a default window of 512 is used across all layers.
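As a rough sketch of what a per-layer setting could look like, the helper below builds one window size per layer, growing from the bottom to the top of the network. The helper name and the sizes 64 and 512 are hypothetical; I round to even values because, as far as I know, the Hugging Face implementation expects even window sizes.

```python
from typing import List


def layerwise_windows(num_layers: int, smallest: int, largest: int) -> List[int]:
    """Even-valued window sizes growing evenly from the first to the last layer."""
    if num_layers < 2:
        return [largest // 2 * 2]
    span = largest - smallest
    sizes = [smallest + span * layer // (num_layers - 1) for layer in range(num_layers)]
    # Round each size down to an even number, since the window is split in two halves.
    return [size // 2 * 2 for size in sizes]


windows = layerwise_windows(num_layers=12, smallest=64, largest=512)
print(windows)
# The list could then be passed to the config, one entry per layer:
# config = LongformerConfig(attention_window=windows)
```

The list needs exactly one entry per hidden layer, which is why the helper takes `num_layers` as an argument.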

In [2]:
config = LongformerConfig()

config

LongformerConfig {
  "attention_probs_dropout_prob": 0.1,
  "attention_window": 512,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "longformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "sep_token_id": 2,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

wandb.login()

In [8]:
train_data, test_data = datasets.load_dataset('imdb', split =['train', 'test'], 
                                             cache_dir='/media/data_files/github/website_tutorials/data')

Reusing dataset imdb (/media/data_files/github/website_tutorials/data/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)


In [4]:
train_data
#dir(train_data)

Dataset(features: {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}, num_rows: 25000)

For my implementation of the model, and to speed up training, I chose a maximum length of 1024 tokens, which covers close to 98 percent of all the documents in the dataset. Before I got my brand new and still pretty much impossible to find RTX3090, I set the gradient checkpointing parameter to true. This saves a huge amount of memory and allows models such as the Longformer to train on more modest GPUs such as my old EVGA GeForce GTX 1080. Gradient checkpointing keeps only a subset of the intermediate activations during the forward pass and recomputes the rest during the backward pass, which allows massive models to run on more modest hardware at the cost of roughly a 30 percent increase in training time. The original paper discussing gradient checkpointing can be found [here](https://arxiv.org/pdf/1604.06174.pdf) and a nice [discussion of gradient checkpointing can be found here](https://qywu.github.io/2019/05/22/explore-gradient-checkpointing.html).
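A quick way to pick a maximum length is to check what share of documents fits under a few candidate limits. The sketch below does this on a handful of made-up token counts; the `coverage` helper and the toy numbers are hypothetical, and in practice the lengths would come from running the tokenizer over the dataset.

```python
from typing import Sequence


def coverage(token_lengths: Sequence[int], max_length: int) -> float:
    """Fraction of documents whose token count fits within `max_length`."""
    fits = sum(1 for n in token_lengths if n <= max_length)
    return fits / len(token_lengths)


# Hypothetical token counts for ten reviews (not taken from IMDB).
toy_lengths = [120, 180, 230, 260, 310, 400, 520, 760, 990, 1500]

for limit in (512, 1024, 4096):
    print(f"max_length={limit}: {coverage(toy_lengths, limit):.0%} of documents fit")
```

Documents above the limit are not dropped; they are simply truncated, so a small share of long reviews loses its tail.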

Additionally, to save memory and reduce training time I also used mixed precision training. If you want to learn more about mixed precision I recommend this [blogpost](https://jonathan-hui.medium.com/mixed-precision-in-deep-learning-67f6dce3e0f3). With the combination of mixed precision, gradient accumulation and gradient checkpointing you can set the length to 4096.
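To get a feel for what half precision does to numbers, the sketch below rounds a few values to fp16 using only the standard library. It also shows why mixed precision training usually scales the loss: very small gradients underflow to zero in fp16 unless they are scaled up first. The helper name and the scaling factor of 1024 are hypothetical and only serve the illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_gradient = 1e-8
scale = 1024.0  # hypothetical loss-scaling factor

print(to_fp16(0.1))                    # close to 0.1, but not exact
print(to_fp16(tiny_gradient))          # underflows to 0.0 in half precision
print(to_fp16(tiny_gradient * scale))  # survives once scaled up
```

Each fp16 value takes 2 bytes instead of the 4 bytes of fp32, which is where the memory saving comes from.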

In [3]:
# load model and tokenizer and define length of the text sequence
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096',
                                                           gradient_checkpointing=False,
                                                           attention_window = 512,
                                                           cache_dir='/media/data_files/github/website_tutorials/data')
tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length = 1024)

Some weights of the model checkpoint at allenai/longformer-base-4096 were not used when initializing LongformerForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing LongformerForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LongformerForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', '

In [4]:
model.config

LongformerConfig {
  "_name_or_path": "allenai/longformer-base-4096",
  "attention_mode": "longformer",
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "ignore_attention_mask": false,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4098,
  "model_type": "longformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "sep_token_id": 2,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

In [9]:
# define a function that will tokenize the text and return the relevant inputs for the model
def tokenization(batched_text):
    return tokenizer(batched_text['text'], padding = 'max_length', truncation=True, max_length = 1024)

train_data = train_data.map(tokenization, batched = True, batch_size = len(train_data))
test_data = test_data.map(tokenization, batched = True, batch_size = len(test_data))

Loading cached processed dataset at /media/data_files/github/website_tutorials/data/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-be72d1cc7c4b2de6.arrow
Loading cached processed dataset at /media/data_files/github/website_tutorials/data/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-287bc9e4ecd8dcc7.arrow


In [8]:
# make sure our truncation strategy and padding are set to the maximum length
len(train_data['input_ids'][0])

1024
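What `padding = 'max_length'` and `truncation = True` do to a single example can be sketched with plain lists. This is a simplified illustration with a hypothetical helper: the real tokenizer also takes care of keeping the closing special token when it truncates, which this sketch ignores, and the token ids below are made up.

```python
from typing import Dict, List

PAD_TOKEN_ID = 1  # pad_token_id shown in the model config above


def pad_or_truncate(input_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Cut or pad a list of token ids to exactly `max_length` and build the attention mask."""
    ids = input_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_TOKEN_ID] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([0, 713, 16, 2], max_length=6)   # padded with two pad tokens
long = pad_or_truncate(list(range(10)), max_length=6)    # cut down to six tokens
print(short)
print(long)
```

The attention mask marks real tokens with 1 and padding with 0, so the padded positions are ignored by the model.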

Once the tokenization process is finished we can set the column names and types. One thing that is important to note is that the `LongformerForSequenceClassification` implementation by default sets the global attention on the `CLS` [token](https://huggingface.co/transformers/_modules/transformers/modeling_longformer.html#LongformerForSequenceClassification), so there is no need to further modify the inputs.
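For reference, this is roughly what that default amounts to: a mask with a 1 at the first position and 0 everywhere else. The sketch uses plain lists and a hypothetical helper name; in the Hugging Face Longformer models the equivalent tensor is passed as the `global_attention_mask` argument, next to `input_ids` and `attention_mask`.

```python
from typing import Iterable, List


def build_global_attention_mask(seq_len: int, global_positions: Iterable[int] = (0,)) -> List[int]:
    """1 marks tokens with global attention, 0 marks tokens with local (sliding window) attention."""
    mask = [0] * seq_len
    for pos in global_positions:
        mask[pos] = 1
    return mask


cls_only = build_global_attention_mask(seq_len=8)
print(cls_only)  # [1, 0, 0, 0, 0, 0, 0, 0]
```

For other tasks you would add more positions, for example the question tokens in question answering, by passing them in `global_positions`.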

In [10]:
train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

In [11]:
# define accuracy metrics
def compute_metrics(pred):
    labels = pred.label_ids
    # pick the class with the highest logit for each example
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
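To see what these numbers mean, here is a hand-rolled version of the same metrics on six made-up predictions. It is only meant to mirror what scikit-learn computes with `average='binary'` for the positive class; the helper names, logits and labels are hypothetical.

```python
from typing import Dict, List


def argmax(row: List[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def binary_metrics(labels: List[int], preds: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 with class 1 as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1,
            "precision": precision, "recall": recall}


# Hypothetical logits for six reviews: [score_neg, score_pos]
logits = [[0.2, 1.5], [2.0, -1.0], [0.9, 0.1], [-0.3, 0.8], [0.1, 0.4], [-1.2, 2.2]]
labels = [1, 0, 1, 1, 0, 1]
preds = [argmax(row) for row in logits]  # [1, 0, 0, 1, 1, 1]
metrics = binary_metrics(labels, preds)
print(metrics)
```

With three true positives, one false positive and one false negative, precision, recall and F1 all come out at 0.75, while accuracy is 4 out of 6.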

In the paper the authors trained for 15 epochs, with a batch size of 32, a learning rate of 3e-5 and linear warmup steps equal to 0.1 of the total training steps. For this quick tutorial I went for the default learning rate of the Trainer class, which is 5e-5, 5 epochs of training, and a batch size of 8 with gradient accumulation over 8 steps for an effective batch size of 64, plus 200 warmup steps (roughly 10 percent of the total training steps). The overall training time for this implementation was 2 hours and 54 minutes.
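The arithmetic behind those numbers can be checked with a short sketch. It assumes a single GPU and that the number of updates per epoch is rounded down, so the exact step count reported by the Trainer may differ slightly. The `linear_schedule` function is my own illustration of linear warmup followed by linear decay, not code taken from the library.

```python
import math

train_examples = 25_000
per_device_batch = 8
accumulation_steps = 8
epochs = 5
warmup_steps = 200
base_lr = 5e-5

effective_batch = per_device_batch * accumulation_steps           # 64
batches_per_epoch = math.ceil(train_examples / per_device_batch)  # 3125
updates_per_epoch = batches_per_epoch // accumulation_steps       # 390
total_updates = updates_per_epoch * epochs                        # 1950
warmup_fraction = warmup_steps / total_updates                    # ~0.10


def linear_schedule(step: int) -> float:
    """Learning rate with linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_updates - step) / (total_updates - warmup_steps))


print(f"effective batch size: {effective_batch}")
print(f"total updates: {total_updates}, warmup share: {warmup_fraction:.1%}")
for step in (0, 100, 200, 1075, 1950):
    print(f"step {step:>4}: lr = {linear_schedule(step):.2e}")
```

The learning rate climbs to 5e-5 over the first 200 updates and then decays linearly, reaching zero at the last update.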

In [11]:
# define the training arguments
training_args = TrainingArguments(
    output_dir = '/media/data_files/github/website_tutorials/results',
    num_train_epochs = 5,
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 8,    
    per_device_eval_batch_size= 16,
    evaluation_strategy = "epoch",
    disable_tqdm = False, 
    load_best_model_at_end=True,
    warmup_steps=200,
    weight_decay=0.01,
    logging_steps = 4,
    fp16 = True,
    logging_dir='/media/data_files/github/website_tutorials/logs',
    dataloader_num_workers = 0,
    run_name = 'longformer-classification-updated-rtx3090_paper_replication_2_warm'
)

In [12]:
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=test_data
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [None]:
# train the model
trainer.train()

longformer-classification-updated-rtx3090_paper_replication_2_warm


wandb: Currently logged in as: jlealtru (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.10.11 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.1
wandb: Run data is saved locally in wandb/run-20201124_160529-2taocndf
wandb: Syncing run longformer-classification-updated-rtx3090_paper_replication_2_warm




After the training has been completed we can evaluate the performance of the model and make sure we are loading the right model.

In [None]:
# save the best model
trainer.save_model('/media/data_files/github/website_tutorials/results/paper_replication_lr_warmup200')

In [None]:
trainer.evaluate()


The best iteration of our model achieved an accuracy of 0.9534, below what the authors report (0.957). These results are probably explained by the fact that we used several tricks to increase training speed, such as half-precision floating-point (*fp16*), and by the fact that we are not using their special CUDA kernel. Additionally, as the authors recognize in the paper, this corpus is composed mostly of shorter documents, so the model does not fully utilize its capability to learn from long sequences. Recent evaluations of this new generation of models indicate that while the Longformer does not achieve the best results in any category, it performs competitively across all the tasks explored, ranking second overall.

![Results](images/longformer_eval_accuracy_imdb.svg)

That's it for this tutorial; hopefully you will find it helpful.