<a href="https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Reformer_2_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The Reformer - Pushing the limits of language modeling**

***How the Reformer uses less than 8GB of RAM to train on sequences of half a million tokens***

The Reformer model as introduced by [Kitaev, Kaiser et al. (2020)](https://arxiv.org/pdf/2001.04451.pdf) is one of the most memory-efficient transformer models for long sequence modeling as of today.

Recently, long sequence modeling has experienced a surge of interest, as can be seen by the many submissions from this year alone: [Beltagy et al. (2020)](https://arxiv.org/abs/2004.05150), [Roy et al. (2020)](https://arxiv.org/abs/2003.05997), [Tay et al. (2020)](https://arxiv.org/abs/2002.11296), and [Wang et al. (2020)](https://arxiv.org/abs/2006.04768), to name a few.
The motivation behind long sequence modeling is that many tasks in NLP, *e.g.* summarization, question answering, require the model to process longer input sequences than models, such as BERT, are able to handle. In tasks that require the model to process a large input sequence, long sequence models do not have to cut the input sequence to avoid memory overflow and thus have been shown to outperform standard "BERT"-like models *cf.* [Beltagy et al. (2020)](https://arxiv.org/abs/2004.05150). 

The Reformer pushes the limit of long sequence modeling by its ability to process up to half a million tokens at once, as shown in this [demo](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb). As a comparison, a conventional `bert-base-uncased` model limits the input length to only 512 tokens. In Reformer, each part of the standard transformer architecture is re-engineered to optimize for minimal memory requirement without a significant drop in performance.

The memory improvements can be attributed to **4** features which the Reformer authors introduced to the transformer world:

1.   **Reformer Self-Attention Layer** - *How to efficiently implement self-attention without being restricted to a local context?*
=> see [this colab](https://colab.research.google.com/drive/15oP52_7W5dRcAnbgX3tYADsu4R3cjMIf?usp=sharing).
2.  **Chunked Feed Forward Layers** - *How to get a better time-memory trade-off for large feed forward layers?*
3.   **Reversible Residual Layers**  - *How to drastically reduce memory consumption in training by a smart residual architecture?*
4.   **Axial Positional Encodings** - *How to make positional encodings usable for extremely large input sequences?*

The goal of this blog post is to give the reader an **in-depth** understanding of each of the four Reformer features mentioned above. While the explanations are focussed on the Reformer, the reader should get a better intuition under which circumstances each of the four features can be effective for other transformer models as well. 
The four sections are only loosely connected, so they can very well be read individually.

Reformer is part of the 🤗Transformers library. For all users of the Reformer, it is advised to go through this very detailed blog post to better understand how the model works and how to correctly set its configuration. All equations are accompanied by their equivalent name for the Reformer config, *e.g.* `config.<param_name>`, so that the reader can quickly relate to the official docs and configuration file.

**Note**: *Axial Positional Encodings* are not explained in the official Reformer paper, but are extensively used in the official codebase. This blog post gives the first in-depth explanation of Axial Positional Encodings.

## **2. Chunked Feed Forward Layers**

Transformer-based models often employ very large feed forward layers after the self-attention layer, applied to all positions in parallel. As a result, this layer can take up a significant amount of the overall memory and sometimes even becomes the memory bottleneck of a model.
First introduced in the Reformer paper, feed forward chunking is a technique that effectively trades increased computation time for reduced memory consumption.


### **Chunked Feed Forward Layer in Reformer**

In Reformer, the *LSH*- or *local* self-attention layer (review part 1 [here](https://colab.research.google.com/drive/15oP52_7W5dRcAnbgX3tYADsu4R3cjMIf?usp=sharing)) is usually followed by a residual connection, which then defines the first part in a *transformer block*. For more detail on this please refer to this [blog](http://jalammar.github.io/illustrated-transformer/). 

The output of the first part of the *transformer block*, called *normed self-attention* output can be written as $\mathbf{\overline{Z}} = \mathbf{Z} + \mathbf{X}$, with $\mathbf{Z}$ being either $\mathbf{Z}^{\text{LSH}}$ or $\mathbf{Z}^\text{loc}$ in Reformer.

For our example input $\mathbf{x}_1, \ldots, \mathbf{x}_{16}$, we illustrate the normed self-attention output as follows.

![alt text](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/reformer_benchmark/layer_normed_output.png)

Now, the second part of a *transformer block* usually consists of two feed forward layers$^{1}$: $\text{Linear}_{\text{int}}(\ldots)$, which maps $\mathbf{\overline{Z}}$ to an intermediate output $\mathbf{Y}_{\text{int}}$, and $\text{Linear}_{\text{out}}(\ldots)$, which maps the intermediate output to the output $\mathbf{Y}_{\text{out}}$. The two feed forward layers can be defined by $\mathbf{Y}_{\text{out}} = \text{Linear}_{\text{out}}(\mathbf{Y}_\text{int}) = \text{Linear}_{\text{out}}(\text{Linear}_{\text{int}}(\mathbf{\overline{Z}}))$.

It is important to remember at this point that, mathematically, the output of a feed forward layer at position $i$, $\mathbf{y}_{\text{out}, i}$, only depends on the input at this position, $\mathbf{\overline{z}}_i$. In contrast to the self-attention layer, every output $\mathbf{y}_{\text{out}, i}$ is therefore completely independent of all inputs $\mathbf{\overline{z}}_{j \ne i}$ at other positions.
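The position-wise independence can be checked numerically. The following is a minimal NumPy sketch, not the Reformer implementation: the sizes, the random weights and the helper `feed_forward` are illustrative assumptions, and, as in the formula above, no activation function or layer norm is applied between the two linear layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h, d_f = 16, 4, 16  # illustrative sizes, not taken from any real config

# Hypothetical weights of the two feed forward layers.
W_int, b_int = rng.normal(size=(d_h, d_f)), rng.normal(size=d_f)
W_out, b_out = rng.normal(size=(d_f, d_h)), rng.normal(size=d_h)


def feed_forward(z_bar):
    """Apply Linear_out(Linear_int(z_bar)) to inputs of shape (..., d_h)."""
    y_int = z_bar @ W_int + b_int
    return y_int @ W_out + b_out


Z_bar = rng.normal(size=(n, d_h))
Y_out = feed_forward(Z_bar)

# Perturb every position except i = 9 (index 8) and recompute.
Z_perturbed = Z_bar + rng.normal(size=(n, d_h))
Z_perturbed[8] = Z_bar[8]
Y_perturbed = feed_forward(Z_perturbed)

# The output at position 9 is unaffected by changes at all other positions.
unchanged = np.allclose(Y_out[8], Y_perturbed[8])
print(unchanged)
```

Running the same check on a self-attention layer would fail, because there every output mixes information from other positions.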

Let's illustrate the feed forward layers for $\mathbf{\overline{z}}_1, \ldots, \mathbf{\overline{z}}_{16}$.

![alt text](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/reformer_benchmark/feed_forward.png)

As can be seen from the illustration, all input vectors $\mathbf{\overline{z}}_i$ are processed by the same feed forward layer in parallel.

It becomes interesting when one takes a look at the output dimensions of the feed forward layers. In Reformer, the output dimension of $\text{Linear}_{\text{int}}$ is defined as `config.feed_forward_size`, *i.e.* $d_f$, and the output dimension of $\text{Linear}_{\text{out}}$ is defined as `config.hidden_size`, *i.e.* $d_h$.

The Reformer authors observed that in a transformer model the intermediate dimension $d_f$ usually tends to be much larger than the output dimension$^{2}$ $d_h$. This means that the tensor $\mathbf{Y}_\text{int}$ of dimension $d_f \times n$ takes up a significant amount of the total memory and can even become the memory bottleneck.
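To put rough numbers on this, the short sketch below computes the size of both tensors for a single sequence and a single layer. The helper `activation_megabytes`, the sequence length of 4096 and the use of 32-bit floats are assumptions chosen for illustration; the dimensions are those of `bert-base-uncased`.

```python
def activation_megabytes(n, dim, bytes_per_value=4):
    """Size in MB of an n x dim tensor (float32 assumed by default)."""
    return n * dim * bytes_per_value / 1024**2


n = 4096               # sequence length, illustrative
d_h, d_f = 768, 3072   # hidden and intermediate size of bert-base-uncased

y_int_mb = activation_megabytes(n, d_f)
y_out_mb = activation_megabytes(n, d_h)
ratio = y_int_mb / y_out_mb
print(f"Y_int: {y_int_mb:.0f} MB, Y_out: {y_out_mb:.0f} MB, ratio: {ratio:.0f}x")
# prints "Y_int: 48 MB, Y_out: 12 MB, ratio: 4x"
```

These figures scale linearly with the batch size and the sequence length, which is why the intermediate tensor matters most for long inputs.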

To get a better feeling for the differences in dimensions let's picture the matrices $\mathbf{Y}_\text{int}$ and $\mathbf{Y}_\text{out}$ for our example.

![alt text](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/reformer_benchmark/feed_forward_matrix.png)

It becomes apparent that the tensor $\mathbf{Y}_\text{int}$ takes up much more memory ($\frac{d_f}{d_h}$ times as much, to be exact) than $\mathbf{Y}_{\text{out}}$. But is it even necessary to compute the full intermediate matrix $\mathbf{Y}_\text{int}$? Not really, because only the output matrix $\mathbf{Y}_\text{out}$ is needed afterwards.
To reduce memory at the cost of speed, one can thus chunk the computation of the linear layers and process only one chunk at a time. Defining `config.chunk_size_feed_forward` as $c_f$, chunked linear layers are defined as $\mathbf{Y}_{\text{out}} = \left[\mathbf{Y}_{\text{out}, 1: c_f}, \ldots, \mathbf{Y}_{\text{out}, (n - c_f + 1): n}\right]$ with $\mathbf{Y}_{\text{out}, ((i - 1) c_f + 1): (i c_f)} = \text{Linear}_{\text{out}}(\text{Linear}_{\text{int}}(\mathbf{\overline{Z}}_{((i - 1) c_f + 1): (i c_f)}))$ for $i = 1, \ldots, n / c_f$.
In practice, this just means that the output is computed incrementally and concatenated, which avoids having to store the whole intermediate tensor $\mathbf{Y}_{\text{int}}$ in memory.
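The chunked computation described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the code used in 🤗Transformers: the function names, sizes and random weights are hypothetical, and the two linear layers are applied without an activation function, as in the formula.

```python
import numpy as np


def feed_forward(z_bar, W_int, b_int, W_out, b_out):
    """Apply Linear_out(Linear_int(z_bar)) to all rows of z_bar at once."""
    y_int = z_bar @ W_int + b_int   # shape (rows, d_f)
    return y_int @ W_out + b_out    # shape (rows, d_h)


def chunked_feed_forward(z_bar, chunk_size, *weights):
    """Compute the same output chunk by chunk along the sequence axis."""
    n = z_bar.shape[0]
    if chunk_size <= 0:             # 0 is treated as "no chunking"
        return feed_forward(z_bar, *weights)
    # Each call only materializes an intermediate of shape (chunk_size, d_f).
    chunks = [
        feed_forward(z_bar[start:start + chunk_size], *weights)
        for start in range(0, n, chunk_size)
    ]
    return np.concatenate(chunks, axis=0)


rng = np.random.default_rng(0)
n, d_h, d_f = 16, 4, 16             # illustrative sizes
weights = (rng.normal(size=(d_h, d_f)), rng.normal(size=d_f),
           rng.normal(size=(d_f, d_h)), rng.normal(size=d_h))
Z_bar = rng.normal(size=(n, d_h))

Y_full = feed_forward(Z_bar, *weights)
Y_chunked = chunked_feed_forward(Z_bar, 1, *weights)
same = np.allclose(Y_full, Y_chunked)
print(same)
```

The sketch also accepts chunk sizes that do not divide the sequence length evenly; the last chunk is then simply shorter.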

Assuming $c_f=1$ for our example we can illustrate the incremental computation of the output for position $i=9$ as follows. 

![alt text](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/reformer_benchmark/chunked_feed_forward.png)

By processing the inputs in chunks of size 1, the only tensors that have to be stored in memory at the same time are $\mathbf{Y}_\text{out}$ of a maximum size of $16 \times d_h$, $\mathbf{y}_{\text{int}, i}$ of size $d_f$ and the input $\mathbf{\overline{Z}}$ of size $16 \times d_h$, with $d_h$ being `config.hidden_size`$^{3}$.
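The tensor sizes listed above can be added up to compare the chunked and the unchunked case. The sketch below is a back-of-the-envelope count only: `peak_elements` is a hypothetical helper, the values of $d_h$ and $d_f$ are chosen for illustration, and weights as well as all other activations are ignored.

```python
def peak_elements(n, d_h, d_f, chunk_size=0):
    """Count values held at once: input, intermediate and output tensors.

    chunk_size=0 means no chunking, so the whole n x d_f intermediate exists.
    """
    rows_int = n if chunk_size == 0 else min(chunk_size, n)
    return n * d_h + rows_int * d_f + n * d_h


n, d_h, d_f = 16, 4, 16  # illustrative sizes for the 16-token example
full = peak_elements(n, d_h, d_f)                   # 64 + 256 + 64
chunked = peak_elements(n, d_h, d_f, chunk_size=1)  # 64 + 16 + 64
print(full, chunked)
```

The larger $d_f$ is relative to $d_h$, the larger the share of memory that chunking removes.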

Finally, it is important to remember that *chunked linear layers* yield a mathematically equivalent output to conventional linear layers and can therefore be applied to all transformer linear layers. Making use of `config.chunk_size_feed_forward` therefore allows a better trade-off between memory and speed in certain use cases.

---
${}^1$ For a simpler explanation, the layer norm layer which is normally applied to $\mathbf{\overline{Z}}$ before being processed by the feed forward layers is omitted for now.

${}^2$ In `bert-base-uncased`, for example, the intermediate dimension $d_f = 3072$ is four times larger than the output dimension $d_h = 768$.

${}^3$ As a reminder, `config.num_attention_heads` is assumed to be 1 for the sake of clarity and illustration in this notebook, so that the output of the self-attention layers can be assumed to be of size `config.hidden_size`.

More information on chunked linear / feed forward layers can also be found [here](https://huggingface.co/transformers/glossary.html#feed-forward-chunking) on the 🤗Transformers docs.


### **Benchmark**

Let's test how much memory can be saved by using chunked feed forward layers. Check out this [notebook](https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb) for more detail on benchmarking in Transformers.

In [1]:
#@title Installs and Imports
# pip installs
!pip -qq install git+https://github.com/huggingface/transformers.git
!pip install -qq py3nvml

from transformers import ReformerConfig, PyTorchBenchmark, PyTorchBenchmarkArguments


First, let's compare the default `google/reformer-enwik8` model without chunked feed forward layers to the one with chunked feed forward layers.

In [2]:
config_no_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8")  # no chunk
config_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8", chunk_size_feed_forward=1)  # feed forward chunk
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[1024, 2048, 4096], batch_sizes=[8], models=["Reformer-No-Chunk", "Reformer-Chunk"], no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config_no_chunk, config_chunk], args=benchmark_args)
result = benchmark.run()


1 / 2
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 7.85 GiB already allocated; 1.74 GiB free; 9.06 GiB reserved in total by PyTorch)
2 / 2
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 7.85 GiB already allocated; 1.24 GiB free; 9.56 GiB reserved in total by PyTorch)

--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
      Reformer-No-Chunk              8              1024            4281     
      Reformer-No-Chunk              8              2048            7607     
      Reformer-No-Chunk              8              4096            N/A      
        Reformer-Chunk               8              1024            4309     
        Reformer-Chunk               8              2048            76

Interesting, chunked feed forward layers do not seem to help here at all! The reason is that `config.feed_forward_size` is not sufficiently large to become the memory bottleneck. 

Let's see what happens to the memory peak usage if we increase the size of the feed forward layer by a factor of 4 and reduce the number of attention heads also by a factor of 4 so that the feed forward layer becomes the memory bottleneck.

In [3]:
config_no_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8", chunk_size_feed_forward=0, num_attention_heads=2, feed_forward_size=16384)  # no chunk
config_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8", chunk_size_feed_forward=1, num_attention_heads=2, feed_forward_size=16384)  # feed forward chunk
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[1024, 2048, 4096], batch_sizes=[8], models=["Reformer-No-Chunk", "Reformer-Chunk"], no_speed=True, no_env_print=True)
benchmark = PyTorchBenchmark(configs=[config_no_chunk, config_chunk], args=benchmark_args)
result = benchmark.run()

1 / 2
2 / 2

--------------------------------------------------------------------------------
          Model Name             Batch Size     Seq Length    Memory in MB 
--------------------------------------------------------------------------------
      Reformer-No-Chunk              8              1024            3743     
      Reformer-No-Chunk              8              2048            5539     
      Reformer-No-Chunk              8              4096            9087     
        Reformer-Chunk               8              1024            2973     
        Reformer-Chunk               8              2048            3999     
        Reformer-Chunk               8              4096            6011     
--------------------------------------------------------------------------------


Now a clear decrease in peak memory usage can be seen, and it grows with the length of the input sequence.
In conclusion, chunked feed forward layers only pay off for models that have few attention heads and large feed forward layers.
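The relative savings can be read off the second benchmark table with a little arithmetic. The snippet below only reuses the peak memory values printed above; the dictionary names are illustrative.

```python
# Peak memory in MB, copied from the second benchmark table.
no_chunk = {1024: 3743, 2048: 5539, 4096: 9087}
chunk = {1024: 2973, 2048: 3999, 4096: 6011}

savings = {
    seq_len: 100 * (no_chunk[seq_len] - chunk[seq_len]) / no_chunk[seq_len]
    for seq_len in no_chunk
}
for seq_len, pct in savings.items():
    print(f"seq_len={seq_len}: {pct:.1f}% less peak memory")
```

For this configuration the reduction is roughly 21%, 28% and 34% for sequence lengths 1024, 2048 and 4096, so the benefit increases with the sequence length.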