# Micro-Benchmarking

This notebook benchmarks the most time-consuming components in BERT to help you understand its performance. Let's first check our libraries and hardware. If your GPU is a recent model, make sure you are using a recent CUDA version, since this can greatly affect performance.

In [1]:
import torch

print('Pytorch version\t:', torch.__version__)
print('CUDA version\t:', torch.version.cuda)
print('GPU\t\t:',torch.cuda.get_device_name())

Pytorch version	: 1.13.0a0+08820cb
CUDA version	: 11.7
GPU		: Tesla V100-SXM2-16GB




## Matrix Multiplication

Matrix multiplication is the most used operator in Transformers, so its performance is crucial.

Let's first define a `walltime` function that benchmarks a PyTorch statement for at least 3 seconds and returns the median run time.

In [2]:
import inspect
from torch.utils import benchmark 

def var_dict(*args):
    """Map each argument back to its variable name in the caller's scope."""
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return dict([(name, val) for name, val in callers_local_vars if val is arg][0]
                for arg in args)

def walltime(stmt, arg_dict, duration=3):
    """Median time in seconds to run `stmt`, measured over at least `duration` seconds."""
    return benchmark.Timer(stmt=stmt, globals=arg_dict).blocked_autorange(
        min_run_time=duration).median

Now test the [TFLOPS](https://en.wikipedia.org/wiki/FLOPS) we can achieve on square matrices. 

In [3]:
import pandas as pd
import numpy as np
from collections import defaultdict

pd.options.display.precision = 3

matmul_tflops = defaultdict(lambda: {})
for n in 2**np.arange(7, 14, 2):
    for dtype in (torch.float32, torch.float16):
        a = torch.randn(n, n, dtype=dtype).cuda()
        b = torch.randn(n, n, dtype=dtype).cuda()   
        t = walltime('a @ b', var_dict(a, b))
        matmul_tflops[f'n={n}'][dtype] = 2*n**3 / t / 1e12
        del a, b
        
pd.DataFrame(matmul_tflops)

|               | n=128 | n=512 | n=2048 | n=8192 |
| ------------- | ----- | ----- | ------ | ------ |
| torch.float32 | 0.1   | 9.501 | 14.382 | 14.426 |
| torch.float16 | 0.199 | 9.184 | 77.61  | 94.904 |


You can see that the performance increases with the matrix size. If your GPU has [Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/), you will see a big performance jump when switching from 32-bit to 16-bit floating-point numbers.

Next, you can look up the theoretical TFLOPS of your GPU on Wikipedia, for example, [Nvidia Tesla](https://en.wikipedia.org/wiki/Ampere_(microarchitecture)), [Nvidia Quadro](https://en.wikipedia.org/wiki/Quadro), [RTX 30xx](https://en.wikipedia.org/wiki/GeForce_30_series), and [RTX 20xx](https://en.wikipedia.org/wiki/GeForce_20_series). Here we list several cards, together with their memory size and bandwidth.

| Model       | Memory (GB) | Memory Bandwidth (GB/sec) | FP32 TFLOPS | FP16 TFLOPS |
| ----------- | ----------- | ------------------------- | ----------- | ----------- |
| A100        | 80          | 2039                      | 19.5        | 312         |
| V100        | 16          | 900                       | 15.7        | 125         |
| A6000       | 48          | 768                       | 38          | 150         |
| RTX 3090 Ti | 24          | 1008                      | 33.5        | 160         |
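
To compare a measured number against the table, it helps to redo the arithmetic by hand. The `2*n**3` used earlier counts one multiply and one add for each of the n³ inner-product terms. Below is a minimal sketch in plain Python; the helper names `matmul_flops` and `achieved_tflops` are illustrative and not part of any library, and the inputs are the fp16 result for n=8192 and the V100 peak of 125 TFLOPS from the table.

```python
def matmul_flops(n):
    """FLOPs of an (n x n) @ (n x n) product: n*n outputs, n multiply-adds each."""
    return 2 * n**3

def achieved_tflops(flops, seconds):
    """Throughput in TFLOPS for a FLOP count and a wall time in seconds."""
    return flops / seconds / 1e12

n = 8192
flops = matmul_flops(n)
seconds = flops / (94.904 * 1e12)   # run time implied by the measured fp16 TFLOPS
utilization = achieved_tflops(flops, seconds) / 125   # fraction of the V100 fp16 peak

print(f'{flops:.3e} FLOPs in {seconds * 1e3:.2f} ms')
print(f'utilization: {utilization:.0%}')
```

For the V100 used here, the largest fp16 matrix multiplication reaches roughly three quarters of the listed peak.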

If the best TFLOPS number you got is still far from the theoretical TFLOPS of your GPU, the performance is likely bottlenecked by memory bandwidth. To illustrate this, let's benchmark a simple element-wise multiplication and report both its TFLOPS and its memory bandwidth.

In [4]:
vector = defaultdict(lambda: {})
for n in 2**np.arange(14, 24, 2):
    a = torch.randn(n).cuda()
    t = walltime('a * 1.2', var_dict(a))
    vector[n]['TFLOPS'] = n / t / 1e12  # one multiply per element
    vector[n]['GB/s'] = 8 * n / t / 1e9  # 4 bytes read + 4 bytes written per float32
    
pd.DataFrame(vector)

|        | 16384 | 65536  | 262144 | 1048576 | 4194304 |
| ------ | ----- | ------ | ------ | ------- | ------- |
| TFLOPS | 0.001 | 0.006  | 0.024  | 0.093   | 0.098   |
| GB/s   | 11.87 | 47.387 | 190.19 | 746.122 | 783.343 |


You can see that even for large vectors, the TFLOPS is far from the GPU's peak performance, while the bandwidth can get quite close to its theoretical number.
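
The gap follows directly from how many bytes each FLOP has to move. As a rough sketch (the helper `bandwidth_bound_tflops` is a hypothetical name, and the 900 GB/s figure is the V100 peak from the table), the ceiling for `a * 1.2` on float32 data can be estimated like this:

```python
def bandwidth_bound_tflops(bandwidth_gb_s, bytes_per_flop):
    """Upper bound on TFLOPS when each FLOP needs `bytes_per_flop` bytes of memory traffic."""
    return bandwidth_gb_s * 1e9 / bytes_per_flop / 1e12

# a * 1.2 on float32: read 4 bytes, write 4 bytes, one multiply per element.
bound = bandwidth_bound_tflops(bandwidth_gb_s=900, bytes_per_flop=8)
ratio = 0.098 / bound   # measured TFLOPS for the largest vector vs. the ceiling

print(f'bandwidth-bound ceiling: {bound:.4f} TFLOPS')
print(f'measured / ceiling: {ratio:.0%}')
```

Under these assumptions the element-wise kernel cannot exceed about 0.11 TFLOPS no matter how fast the arithmetic units are, which is why the measured value sits so far below the 15.7 TFLOPS FP32 peak.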

Matrix multiplication performance is a major topic in HPC, with a large body of research papers. Unfortunately the backend library, cuBLAS, is not open source. For implementation details, you may check [cutlass](https://github.com/NVIDIA/cutlass), which claims performance similar to cuBLAS.


## BERT Layer

The main body of a Transformer model is a stack of Transformer blocks. Let's benchmark the performance of a single block. In BERT, it is often called a BERT layer. Let's construct one such layer from the [BERT large model](https://huggingface.co/bert-large-uncased). We use 16-bit floating-point numbers for better performance.

In [5]:
from transformers import AutoConfig, BertLayer

config = AutoConfig.from_pretrained("bert-large-uncased")
layer = BertLayer(config).half().cuda()

In BERT pre-training, we often train with a sequence length of 128 (stage 1) or 512 (stage 2). Let's test both forward and forward+backward TFLOPS under different batch sizes.

In [6]:
h = config.hidden_size
bert_layer = defaultdict(lambda: {})
for s in [128, 512]:  # sequence length
    for b in 2**np.arange(2, 8): # batch size
        X = torch.randn(b, s, h).half().cuda()
        tflops = (24*b*s*h*h + 4*b*h*s**2) / 1e12
        bert_layer[f'batch={b}'][f'fwd seq_len={s}'] = tflops / walltime(
            'layer(X)', var_dict(layer, X))
        bert_layer[f'batch={b}'][f'fwd+bwd seq_len={s}'] = 3 * tflops / walltime(
            'layer(X)[0].sum().backward()', var_dict(layer, X))
        del X
pd.DataFrame(bert_layer)

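|                     | batch=4 | batch=8 | batch=16 | batch=32 | batch=64 | batch=128 |
| ------------------- | ------- | ------- | -------- | -------- | -------- | --------- |
| fwd seq_len=128     | 10.448  | 22.529  | 45.268   | 52.066   | 55.535   | 56.306    |
| fwd+bwd seq_len=128 | 13.627  | 27.021  | 53.081   | 59.443   | 64.373   | 67.325    |
| fwd seq_len=512     | 40.317  | 41.949  | 44.402   | 45.023   | 45.844   | 43.599    |
| fwd+bwd seq_len=512 | 44.322  | 48.512  | 51.737   | 53.667   | 54.738   | 53.326    |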
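
The FLOP count `24*b*s*h*h + 4*b*h*s**2` used in the cell above can be rebuilt from the individual matrix multiplications in the layer. The sketch below is one way to do that accounting; the function name `bert_layer_forward_flops` is illustrative, and it assumes an FFN intermediate size of 4h (as in BERT large) and ignores biases, softmax, LayerNorm and the activation. The factor of 3 for forward plus backward reflects the usual approximation that the backward pass costs about twice the forward matmul work.

```python
def bert_layer_forward_flops(b, s, h):
    """Forward matmul FLOPs of one BERT layer, split into attention and FFN."""
    qkv_proj = 3 * 2 * b * s * h * h    # query, key and value projections
    scores = 2 * b * s * s * h          # Q @ K^T, summed over all heads
    context = 2 * b * s * s * h         # softmax(scores) @ V, summed over all heads
    out_proj = 2 * b * s * h * h        # attention output projection
    ffn = 2 * 2 * b * s * h * (4 * h)   # two dense layers: h -> 4h -> h
    attention = qkv_proj + scores + context + out_proj
    return attention, ffn

b, s, h = 64, 128, 1024
attention, ffn = bert_layer_forward_flops(b, s, h)
total = attention + ffn
print(f'attention: {attention / 1e12:.4f} TFLOP, FFN: {ffn / 1e12:.4f} TFLOP')
print(total == 24*b*s*h*h + 4*b*h*s**2)
```

The attention and FFN terms are `8*b*s*h*h + 4*b*h*s*s` and `16*b*s*h*h` respectively, so the two parts can also be benchmarked separately against their own FLOP counts.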


Unsurprisingly, a large batch size helps. But the best number is still below the matrix multiplication TFLOPS. Let's find out why.

We start by benchmarking the first dense layer of the layer's Feed-Forward Network (FFN).

In [7]:
b, s = 64, 128
X = torch.randn(b, s, h).half().cuda()

'Dense layer TFLOPS: %.3f' % (8*b*s*h*h / 1e12 / walltime(    
    'layer.intermediate.dense(X)', var_dict(layer, X)))

'Dense layer TFLOPS: 78.256'

The number is pretty good. Next, run this dense layer together with the GELU activation.

In [8]:
'Dense+Activation TFLOPS: %.3f' % (8*b*s*h*h / 1e12 / walltime(
    'layer.intermediate(X)', var_dict(layer, X)))

'Dense+Activation TFLOPS: 66.739'

Even though the activation function has negligible computational complexity, it brings down the TFLOPS. The reason is the one we pointed out before: the element-wise operation of the activation function is bound by memory bandwidth.
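
A back-of-the-envelope estimate is consistent with this explanation. The sketch below assumes the activation runs as a single kernel that reads and writes each fp16 element exactly once at the V100's peak 900 GB/s; both are simplifying assumptions, not measurements. It adds that time to the run time implied by the dense-layer result above.

```python
b, s, h = 64, 128, 1024           # batch size, sequence length, hidden size used above
dense_flops = 8 * b * s * h * h   # first FFN dense layer, h -> 4h
dense_time = dense_flops / (78.256 * 1e12)   # seconds, from the measured dense TFLOPS

# Activation on the (b, s, 4h) fp16 output: read 2 bytes and write 2 bytes per element.
act_bytes = b * s * 4 * h * (2 + 2)
act_time = act_bytes / (900 * 1e9)           # seconds, assuming peak memory bandwidth

predicted = dense_flops / (dense_time + act_time) / 1e12
print(f'dense {dense_time * 1e3:.3f} ms + activation {act_time * 1e3:.3f} ms '
      f'-> {predicted:.1f} TFLOPS')
```

The estimate lands close to the measured 66.739 TFLOPS, which suggests that the extra pass over memory accounts for most of the drop.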

Now test the whole FFN.

In [9]:
ffn = 16*b*s*h*h / 1e12
'FFN TFLOPS: %.3f'%(ffn / walltime(
    'layer.output(layer.intermediate(X),X)', var_dict(layer, X)))

'FFN TFLOPS: 68.241'

The other part of the BERT layer is the multi-head self-attention.

In [10]:
att = (4*b*h*s*s + 8*b*s*h*h) / 1e12
'Attention TFLOPS: %.3f'%(
    att / walltime('layer.attention(X)', var_dict(layer, X)))

'Attention TFLOPS: 40.418'

Even though most of the computation in the attention block is still matrix multiplication, it has more memory-bound operators than the FFN, so you see a lower TFLOPS.

In [11]:
att / ffn

0.53125

The ratio of complexity between attention and FFN depends on the BERT configuration. The overall performance is a weighted average of the TFLOPS of these two components, with each component weighted by its share of the total FLOPs.
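
One way to make the weighting concrete: the layer's throughput is total FLOPs divided by total time, which works out to a FLOP-weighted harmonic mean of the two component throughputs. The sketch below plugs in the numbers measured above for batch size 64 and sequence length 128; it ignores anything in the layer outside these two components.

```python
att_over_ffn = 0.53125                    # FLOP ratio computed above
att_share = att_over_ffn / (1 + att_over_ffn)   # attention's share of the layer's FLOPs
ffn_share = 1 - att_share
att_tflops, ffn_tflops = 40.418, 68.241   # measured attention and FFN TFLOPS

# Total FLOPs / total time == FLOP-weighted harmonic mean of the two throughputs.
layer_tflops = 1 / (att_share / att_tflops + ffn_share / ffn_tflops)
print(f'estimated layer TFLOPS: {layer_tflops:.1f}')
```

The result is close to the 55.535 TFLOPS measured for the whole layer's forward pass at the same batch size and sequence length.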

To conclude, to achieve the best performance for a BERT layer, you need to use a fast data type and a large batch size. Further improvement requires rewriting the code, for example by [fusing](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#fuse-pointwise-operations) multiple kernels into a single one.