# **Before running, select the Tesla T4 GPU (16 GB) under Runtime > Change runtime type.**

Because the comparison relies on Hugging Face's PyTorchBenchmark and PyTorchBenchmarkArguments utilities, we run the code on Google Colab.

All benchmarks use the Tesla T4 GPU (16 GB) that Google Colab provides.

Reference: https://huggingface.co/blog/reformer (accessed on 10/7/2023)

In [1]:
# pip installs
!pip -qq install git+https://github.com/huggingface/transformers.git
!pip install -qq py3nvml

from transformers import ReformerConfig, BertConfig, PyTorchBenchmark, PyTorchBenchmarkArguments

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


### **Section 1**

First, we compare the memory usage of global self-attention (the original Transformer setting) with that of the Reformer.

We set lsh_attn_chunk_length = local_attn_chunk_length = 16384 so that, for every input sequence of 16384 tokens or fewer, the model automatically switches to global self-attention, which matches the original Transformer setting.
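The switching rule described above can be sketched in a few lines. This is a simplified illustration, not the library's code: the helper name uses_global_attention is hypothetical, and the real model also takes other settings (such as caching) into account.

```python
def uses_global_attention(seq_len: int, chunk_length: int) -> bool:
    """Simplified rule (assumption): inputs no longer than one chunk get global attention."""
    return seq_len <= chunk_length


# Sequence lengths used in the benchmark, with the chunk length set to 16384.
for seq_len in [512, 2048, 4096, 8192, 16384, 32768]:
    mode = "global" if uses_global_attention(seq_len, chunk_length=16384) else "chunked"
    print(f"{seq_len:>6} tokens -> {mode} self-attention")
```

Under this rule, only the 32768-token input is longer than one chunk, so every other benchmarked length runs with global self-attention in the first configuration.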

In [2]:
# Chunk lengths of 16384 make every input of up to 16384 tokens use global self-attention.
config_global = ReformerConfig.from_pretrained("google/reformer-enwik8", lsh_attn_chunk_length=16384, local_attn_chunk_length=16384, lsh_num_chunks_before=0, local_num_chunks_before=0)
# Unmodified pretrained configuration, which keeps the Reformer's LSH attention.
config_LSH = ReformerConfig.from_pretrained("google/reformer-enwik8")
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[512, 2048, 4096, 8192, 16384, 32768], batch_sizes=[1], models=["Transformer", "Reformer_with_LSH"])
benchmark = PyTorchBenchmark(configs=[config_global, config_LSH], args=benchmark_args)
result = benchmark.run()

1 / 2




Doesn't fit on GPU. CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 14.75 GiB total capacity; 10.88 GiB already allocated; 3.04 GiB free; 10.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 14.75 GiB total capacity; 10.88 GiB already allocated; 3.04 GiB free; 10.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Doesn't fit on GPU. CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 14.75 GiB total capacity; 1.69 GiB already allocated; 12.16 GiB free; 1.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation 

Without the Reformer features, the memory required for a 16K-token sequence exceeds the capacity of the 16 GB Tesla T4 GPU on Google Colab, whereas LSH attention lowers the requirement to around 8.3 GB. This is a significant gain in memory efficiency: with the Reformer model, the 16 GB GPU can now handle a batch of one 32K-token sequence.
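A back-of-the-envelope estimate helps explain why global self-attention runs out of memory: the attention score tensor grows with the square of the sequence length. The sketch below assumes float32 values and 8 attention heads; both are assumptions for illustration rather than values read from the benchmark. It counts only the score tensor of a single layer, so it is a lower bound on actual usage.

```python
def attention_scores_gib(seq_len, num_heads=8, bytes_per_value=4, batch_size=1):
    """Size in GiB of one layer's full (seq_len x seq_len) attention score tensor.

    num_heads=8 and bytes_per_value=4 (float32) are assumed values.
    """
    n_values = batch_size * num_heads * seq_len * seq_len
    return n_values * bytes_per_value / 2**30


for n in [512, 2048, 4096, 8192, 16384, 32768]:
    print(f"{n:>6} tokens: {attention_scores_gib(n):8.3f} GiB")
```

Doubling the sequence length quadruples the estimate, so under these assumptions a 16384-token input already needs 8 GiB for a single score tensor, which leaves little headroom on a 16 GB card.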

### **Section 2**

Next, we test whether adding feed-forward chunking further improves the memory savings of the Reformer model. Given the computation time involved, we test only a batch size of 1 with sequence lengths up to 8192, and compare the memory results with those of the original Transformer.

Remark: The cell below can take a long time to run (around 50 minutes), because chunking increases computation time.

In [None]:
config_LSH_chunk = ReformerConfig.from_pretrained("google/reformer-enwik8", chunk_size_feed_forward=1, num_attention_heads=2, feed_forward_size=16384)  # feed forward chunk
benchmark_args = PyTorchBenchmarkArguments(sequence_lengths=[512, 2048, 4096, 8192], batch_sizes=[1], models=["Reformer_with_LSH&Chunk"])
benchmark = PyTorchBenchmark(configs=[config_LSH_chunk], args=benchmark_args)
result = benchmark.run()

1 / 1


Adding feed-forward chunking further lowers the memory requirement for 8K-token inputs, from about 8 GB in the original global self-attention setting to around 3.9 GB, although this comes at the expense of longer computation time.
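The trade-off can be illustrated with a minimal NumPy sketch of feed-forward chunking. It is not the Reformer implementation: the function names, the ReLU activation, and the tiny dimensions are all assumptions chosen for illustration. Because a feed-forward layer treats each position independently, processing the sequence a few rows at a time gives the same output while holding a smaller intermediate activation in memory.

```python
import numpy as np


def feed_forward(x, w1, w2):
    """Position-wise feed-forward layer: relu(x @ w1) @ w2."""
    hidden = np.maximum(x @ w1, 0.0)  # shape (rows, d_ff): the large intermediate
    return hidden @ w2


def chunked_feed_forward(x, w1, w2, chunk_size):
    """Same computation, but only chunk_size rows are expanded to d_ff at a time."""
    outputs = [
        feed_forward(x[start:start + chunk_size], w1, w2)
        for start in range(0, x.shape[0], chunk_size)
    ]
    return np.concatenate(outputs, axis=0)


# Small illustrative sizes (assumptions), far below the benchmarked model.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 64, 8, 32
x = rng.standard_normal((seq_len, d_model))
w1 = rng.standard_normal((d_model, d_ff))
w2 = rng.standard_normal((d_ff, d_model))

full = feed_forward(x, w1, w2)
chunked = chunked_feed_forward(x, w1, w2, chunk_size=1)
print("outputs match:", np.allclose(full, chunked))
```

With a chunk size of 1, the intermediate holds d_ff values instead of seq_len × d_ff, but the layer is applied seq_len times in a loop rather than in one batched matrix product, which is where the extra running time comes from.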
