# CH08b Working_with_Efficient_Self-attention

In [None]:
!pip install transformers
!pip install py3nvml

With full self-attention, peak memory usage grows quadratically, $\mathcal{O}(n^2)$, with the input sequence length $n$, which becomes costly for long sequences.
Before benchmarking, let us check GPU memory usage and make sure that no other processes are running.

In [None]:
!nvidia-smi
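
As a rough, back-of-the-envelope illustration of the quadratic growth, the sketch below estimates the size of the attention score tensor alone for one full self-attention layer. The head count (12) and the 4-byte float32 entries are assumptions chosen for illustration, and `attention_scores_mib` is a hypothetical helper. Real peak memory also includes weights, activations, and framework overhead.

```python
def attention_scores_mib(seq_len, num_heads=12, bytes_per_value=4):
    """Approximate size in MiB of the (heads x n x n) attention score tensor of one layer."""
    return num_heads * seq_len * seq_len * bytes_per_value / 2**20


for n in [512, 1024, 2048, 4096]:
    print(f"n={n:5d}  scores ~ {attention_scores_mib(n):8.1f} MiB")
```

Under these assumptions, doubling the sequence length quadruples the estimate.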

## Longformer

In [None]:
from transformers import LongformerTokenizer, LongformerModel
import torch

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
sequence = "hello " * 4093
inputs = tokenizer(sequence, return_tensors="pt")
print("input shape: ", inputs.input_ids.shape)
outputs = model(**inputs)

If you pass a sequence longer than 4096 tokens, you will get `IndexError: index out of range in self`.
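
One plausible way to see where the limit comes from is to look at the position embedding table. The sketch below assumes the checkpoint uses a table of 4098 rows and RoBERTa-style position ids that start at `padding_idx + 1` with `padding_idx = 1`. Both values are assumptions to verify against `model.config`, and the helper names are hypothetical.

```python
# Assumed values for allenai/longformer-base-4096; check model.config to confirm.
MAX_POSITION_EMBEDDINGS = 4098
PADDING_IDX = 1


def highest_position_id(num_tokens, padding_idx=PADDING_IDX):
    """Position ids are assumed to start at padding_idx + 1, RoBERTa-style."""
    return num_tokens + padding_idx


def fits_position_table(num_tokens, table_size=MAX_POSITION_EMBEDDINGS):
    """True if every position id is a valid row index of the embedding table."""
    return highest_position_id(num_tokens) < table_size


for n in [4095, 4096, 4097]:
    print(n, fits_position_table(n))
```

In practice, passing `truncation=True, max_length=4096` to the tokenizer call keeps inputs within the limit.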

In [None]:
# default attention window size is 512
# Window size refers to the size of an attention window around each token.
from transformers import LongformerConfig, PyTorchBenchmark, PyTorchBenchmarkArguments

config_longformer = LongformerConfig.from_pretrained("allenai/longformer-base-4096")
config_longformer_window4 = LongformerConfig.from_pretrained(
    "allenai/longformer-base-4096", attention_window=4
)
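
To see why a smaller attention window should reduce cost, the sketch below counts query-key pairs under a plain sliding-window pattern in which each token attends to `window // 2` neighbours on each side plus itself. This is a simplified model: it ignores Longformer's global attention tokens and its chunked implementation, and `windowed_pairs` and `full_pairs` are hypothetical helpers, not library functions.

```python
def windowed_pairs(seq_len, window):
    """Count (query, key) pairs when each token sees window // 2 tokens per side plus itself."""
    half = window // 2
    total = 0
    for i in range(seq_len):
        lo = max(0, i - half)            # clip the window at the sequence start
        hi = min(seq_len - 1, i + half)  # clip the window at the sequence end
        total += hi - lo + 1
    return total


def full_pairs(seq_len):
    """Count (query, key) pairs for full self-attention."""
    return seq_len * seq_len


for n in [512, 4096]:
    print(n, full_pairs(n), windowed_pairs(n, 512), windowed_pairs(n, 4))
```

The windowed count grows roughly linearly in the sequence length for a fixed window, whereas the full count grows quadratically.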

In [None]:
sequence_lengths = [128, 256, 512, 1024, 2048, 4096]
models = ["config_longformer", "config_longformer_window4"]
configs = [config_longformer, config_longformer_window4]  # same order as models

In [None]:
benchmark_args = PyTorchBenchmarkArguments(
    sequence_lengths=sequence_lengths, batch_sizes=[1], models=models
)
benchmark = PyTorchBenchmark(configs=configs, args=benchmark_args)
results = benchmark.run()

In [None]:
import matplotlib.pyplot as plt


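def plotMe(results, title="Time"):
    """Plot benchmark results per model against sequence length.

    Plots inference time when title is "Time"; any other title plots inference memory.
    """
    plt.figure(figsize=(8, 8))
    fmts = ["rs--", "go--", "b+-", "c-o"]
    # Select the memory results by default, or the time results if requested
    q = results.memory_inference_result
    if title == "Time":
        q = results.time_inference_result
    models = list(q.keys())
    # "result" maps batch size -> {sequence length: measurement}; we benchmarked batch size 1
    seq = list(q[models[0]]["result"][1].keys())
    models_perf = [list(q[m]["result"][1].values()) for m in models]
    plt.xlabel("Sequence Length")
    plt.ylabel(title)
    plt.title("Inference Result")
    # One line per model, each with its own marker and line style
    for perf, fmt in zip(models_perf, fmts):
        plt.plot(seq, perf, fmt)
    plt.legend(models)
    plt.show()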

Speed Test

In [None]:
plotMe(results)

Memory Test

In [None]:
plotMe(results, "Memory")

## BigBird

In [None]:
# pip installs
!pip install transformers
!pip install py3nvml

In [None]:
from transformers import BigBirdConfig

# Default (sparse) Bird with num_random_blocks=3, block_size=64
sparseBird = BigBirdConfig.from_pretrained("google/bigbird-roberta-base")
# Full-attention Bird:
fullBird = BigBirdConfig.from_pretrained(
    "google/bigbird-roberta-base", attention_type="original_full"
)

In [None]:
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

In [None]:
sequence_lengths = [256, 512, 1024, 2048, 3072, 4096]
models = ["sparseBird", "fullBird"]
configs = [sparseBird, fullBird]  # same order as models

For shorter sequence lengths, the BigBird model falls back to full attention, because the sequence is too short for the block-sparse pattern at the configured block size.
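
The fallback point can be estimated ahead of time. The sketch below assumes that block-sparse attention needs more than `(5 + 2 * num_random_blocks) * block_size` tokens, which is the threshold the Hugging Face BigBird implementation appears to report in its fallback warning. Treat the exact formula as an assumption and check it against your installed version. Both helper names are hypothetical.

```python
def min_tokens_for_block_sparse(block_size=64, num_random_blocks=3):
    """Assumed threshold: 2 global + 3 sliding + random blocks, plus a buffer of random blocks."""
    return (5 + 2 * num_random_blocks) * block_size


def uses_block_sparse(seq_len, block_size=64, num_random_blocks=3):
    """True if the sequence is long enough to keep block-sparse attention."""
    return seq_len > min_tokens_for_block_sparse(block_size, num_random_blocks)


for n in [256, 512, 1024, 2048, 3072, 4096]:
    mode = "block_sparse" if uses_block_sparse(n) else "original_full (fallback)"
    print(n, mode)
```

With the default block size of 64 and 3 random blocks, this rule suggests that the 256 and 512 settings run with full attention, so the two configurations should only diverge from 1024 tokens upward.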

In [None]:
benchmark_args = PyTorchBenchmarkArguments(
    sequence_lengths=sequence_lengths, batch_sizes=[1], models=models
)
benchmark = PyTorchBenchmark(configs=configs, args=benchmark_args)
results = benchmark.run()

In [None]:
plotMe(results)

In [None]:
plotMe(results, "Memory")

## Reformer

In [None]:
# pip installs
!pip install transformers
!pip install py3nvml

In [None]:
from transformers import ReformerConfig, PyTorchBenchmark, PyTorchBenchmarkArguments

We will tweak some settings so that the *Reformer* model works in full-attention mode. When we set **lsh_attn_chunk_length** and **local_attn_chunk_length** to 16384, which is the maximum length that Reformer can process, the model has no opportunity for local optimization and behaves like a vanilla transformer.

In [None]:
fullReformer = ReformerConfig.from_pretrained(
    "google/reformer-enwik8", lsh_attn_chunk_length=16384, local_attn_chunk_length=16384
)
sparseReformer = ReformerConfig.from_pretrained("google/reformer-enwik8")
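
Why a chunk length as long as the input removes the saving can be sketched by counting attention score entries for chunked attention. The sketch assumes that each chunk attends to itself plus a fixed number of neighbouring chunks, and that the model uses full attention once the chunk length reaches the sequence length. The chunk length of 128 and the single preceding chunk used below are illustrative values, not settings read from the `google/reformer-enwik8` checkpoint, and `chunked_attention_entries` is a hypothetical helper. LSH bucketing and multiple hash rounds are ignored.

```python
def chunked_attention_entries(seq_len, chunk_length, chunks_before=1, chunks_after=0):
    """Approximate number of attention scores computed per head for one layer."""
    if chunk_length >= seq_len:
        # A single chunk covers the whole input: equivalent to full attention
        return seq_len * seq_len
    keys_per_query = chunk_length * (1 + chunks_before + chunks_after)
    return seq_len * min(keys_per_query, seq_len)


for n in [1024, 4096, 12000]:
    full = chunked_attention_entries(n, 16384)
    chunked = chunked_attention_entries(n, 128)
    print(f"n={n:6d}  full={full:12d}  chunked={chunked:10d}")
```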

In [None]:
!nvidia-smi

In [None]:
sequence_lengths = [256, 512, 1024, 2048, 4096, 8192, 12000]
models = ["fullReformer", "sparseReformer"]
configs = [fullReformer, sparseReformer]  # same order as models

Reformer can process sequences of up to 16384 tokens. However, because of the limited accelerator capacity in our environment, the attention matrix does not fit on the GPU at that length and we get a CUDA out-of-memory warning, so the longest sequence we benchmark is 12000 tokens.

In [None]:
benchmark_args = PyTorchBenchmarkArguments(
    sequence_lengths=sequence_lengths, batch_sizes=[1], models=models
)
benchmark = PyTorchBenchmark(configs=configs, args=benchmark_args)
results = benchmark.run()

In [None]:
plotMe(results)

In [None]:
plotMe(results, "Memory Footprint")