# Transformers Benchmarks

Evaluate BERT and GPT training performance on single and multiple GPUs.

List all available GPUs.

In [1]:
import torch

print('Pytorch version\t:', torch.__version__)
print('CUDA version\t:', torch.version.cuda)

for i in range(torch.cuda.device_count()):
    print(f'GPU{i}\t\t:',torch.cuda.get_device_name(i))


Pytorch version	: 1.13.0a0+08820cb
CUDA version	: 11.7
GPU0		: Tesla V100-SXM2-16GB


## Installation and Utilities

Install huggingface and deepspeed. Note that `transformers` is installed from source to run its examples.

In [1]:
from IPython.display import clear_output

!git clone https://github.com/huggingface/transformers
!cd transformers; pip install .
!pip install datasets evaluate accelerate deepspeed psutil

clear_output()

Get the model specification given its name on the [Huggingface Hub](https://huggingface.co/models).

In [25]:
import json  # used by log_summary to parse the metrics line
import os
from dataclasses import dataclass, asdict
from transformers import AutoConfig


In [18]:
@dataclass
class Config:
    model: str         # huggingface model name
    seq_len: int       # input sequence length
    batch_size: int    # batch size per GPU
        
    ## Improve speed / reduce memory  
    bf16: bool = False  # Faster, less memory. Recommended if the GPU supports it
    fp16: bool = False  # Faster, less memory, but needs loss scaling.
                        # Recommended if BF16 is not available.
    optim: str = 'adamw_hf'  # Optimization method
    grad_ckpt: bool = False  # save memory with an extra forward
    grad_accum: int = 1      # accumulate gradients for better performance
    steps: int = 50          # number of batches to benchmark
        
    ## Multi-GPUs
    gpus: str = '0'          # GPUs to use. "0,1" means use GPU 0 and 1
    ddp: bool = False        # whether to use PyTorch's DistributedDataParallel
    deepspeed: bool = False  # whether to use DeepSpeed
    ds_config: str = ''      # deepspeed config 
    
    def tflops(self):
        """Tera floating-point operations needed to train on one example"""
        spec = AutoConfig.from_pretrained(self.model)
        get = lambda *keys: max(
            [getattr(spec, k) if hasattr(spec, k) else 0 for k in keys])
        n = get('num_hidden_layers', 'n_layer')
        h = get('hidden_size', 'n_embd', 'd_model')
        s = self.seq_len
        v = get('vocab_size')
        att, ffn, embed = 4*h*s**2 + 8*s*h**2, 16*s*h**2, 2*s*h*v
        forward = n*(att+ffn) + embed
        return (4 * forward if self.grad_ckpt else 3 * forward) / 1e12
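
As a sanity check on the formula above, the sketch below evaluates it for `bert-large-uncased` without downloading the config. The hyperparameters (24 layers, hidden size 1024, vocabulary size 30522) are hardcoded here as assumptions taken from that model's published config, and `train_tflops` is a hypothetical standalone helper, not part of the notebook.

```python
def train_tflops(n, h, s, v, grad_ckpt=False):
    """Same arithmetic as the Config method, with the model spec passed in.

    n: number of layers, h: hidden size, s: sequence length, v: vocab size.
    """
    att = 4 * h * s**2 + 8 * s * h**2   # attention block
    ffn = 16 * s * h**2                 # feed-forward block
    embed = 2 * s * h * v               # output embedding projection
    forward = n * (att + ffn) + embed
    # backward costs about 2x forward; checkpointing adds one more forward
    factor = 4 if grad_ckpt else 3
    return factor * forward / 1e12


# Assumed bert-large-uncased spec with sequence length 128
per_example = train_tflops(n=24, h=1024, s=128, v=30522)
per_example_ckpt = train_tflops(n=24, h=1024, s=128, v=30522, grad_ckpt=True)
print(f'{per_example:.4f} TFLOPs per example')
print(f'{per_example_ckpt:.4f} TFLOPs per example with checkpointing')
```

Multiplying the per-example value by the measured samples per second gives the per-GPU TFLOPs that the summary prints.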

Parse the Huggingface training log to get GPU memory consumption and speed.

In [20]:
def log_summary(config, log_filename):
    with open(log_filename) as f:
        lines = f.readlines()
    for l in lines:
        if 'CUDA out of memory' in l:
            print('Out of GPU memory, try a smaller batch size')
            return
        if '{\'train_runtime' in l:
            metrics = json.loads(l.replace('\'', '\"'))
            gpu_mem = metrics['init_mem_cpu_peaked_delta'] + \
                    metrics['train_mem_gpu_alloc_delta'] + metrics['train_mem_gpu_peaked_delta']
            r = metrics['train_samples_per_second']
            num_gpus = len(config.gpus.split(','))
            print('Total samples / second\t: %.1f' % r)
            print('Per GPU memory (GB)\t: %.1f'% (gpu_mem/1e9))
            print('Per GPU TFLOPs\t\t: %.1f' % (r * config.tflops() / num_gpus))
            return
    print(f'Failed. Check "{log_filename}" to find error')
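
The parsing step above relies on the metrics being logged as a Python dict literal with single quotes. A minimal sketch of that step on a made-up line is shown below; the sample values are purely illustrative, and `ast.literal_eval` is offered only as an alternative that avoids rewriting quotes.

```python
import ast
import json

# Hypothetical log line in the dict-literal style that log_summary looks for.
# The numbers are invented for illustration, not taken from a real run.
line = "{'train_runtime': 30.0, 'train_samples_per_second': 100.0, 'epoch': 0.5}"

# Approach used in log_summary: turn single quotes into double quotes, then parse as JSON.
via_json = json.loads(line.replace('\'', '"'))

# Alternative: parse the Python literal directly.
via_ast = ast.literal_eval(line)

print(via_json == via_ast)
print(via_ast['train_samples_per_second'])
```

The quote-replacement approach breaks if a value itself contains a quote character, which is one reason to consider the literal parser.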

Get the launcher based on config.

In [21]:
def launcher(config):
    if config.ddp:
        num_gpus = len(config.gpus.split(','))
        return f'python -m torch.distributed.launch --nproc_per_node {num_gpus}'
    return 'deepspeed' if config.deepspeed else 'python'

## Bert on a Single GPU

We use the [masked language modeling](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling) task as the benchmark workload. It is a good approximation of BERT pre-training, but does not require preparing the dataset.

The following function maps our configuration into a command for `run_mlm.py`. The log is saved into `log.txt`, and we print a training summary.

In [22]:
!python transformers/examples/pytorch/language-modeling/run_mlm.py --help >help.txt

In [26]:
def run_bert(config):
    cmd = f'''rm -rf /tmp/bert; \
export CUDA_VISIBLE_DEVICES={config.gpus}; \
{launcher(config)} transformers/examples/pytorch/language-modeling/run_mlm.py \
  --config_name {config.model} \
  --tokenizer_name {config.model} \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --do_train \
  --max_seq_length {config.seq_len} \
  --per_device_train_batch_size {config.batch_size} \
  --fp16 {config.fp16} \
  --bf16 {config.bf16} \
  --optim {config.optim} \
  --gradient_accumulation_steps {config.grad_accum} \
  --gradient_checkpointing {config.grad_ckpt} \
  --max_steps {config.steps} \
  --output_dir /tmp/bert/ \
  --skip_memory_metrics False'''
    if config.deepspeed:
        cmd += f' --deepspeed {config.ds_config}'
    cmd += ' > log.txt 2>&1'
    os.system(cmd)
    log_summary(config, 'log.txt')

We use a sequence length of 128, which the original BERT paper uses for 90% of its training steps. For good performance, we then choose a batch size as large as possible without running out of memory.

In [27]:
bert_1 = Config('bert-large-uncased', 128, 56)
run_bert(bert_1)

Out of GPU memory, try a smaller batch size


Now switch to `bf16`. Use `fp16=True` if your GPU architecture predates Ampere, but we recommend `bf16` whenever it is available. Because `bf16` has more exponent bits than `fp16`, it does not require tuning the [loss scaling](https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407).
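
To make the exponent-bit argument concrete, the sketch below computes the representable range of each format from its bit layout (5 exponent and 10 mantissa bits for `fp16`, 8 and 7 for `bf16`). `float_range` is a hypothetical helper that assumes the usual IEEE-style encoding with normal numbers only.

```python
def float_range(exp_bits, mantissa_bits):
    """Largest finite value and smallest positive normal value of a binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    largest = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    smallest_normal = 2.0 ** (1 - bias)
    return largest, smallest_normal


fp16_max, fp16_min = float_range(exp_bits=5, mantissa_bits=10)
bf16_max, bf16_min = float_range(exp_bits=8, mantissa_bits=7)

print(f'fp16: max {fp16_max:.4g}, smallest normal {fp16_min:.4g}')
print(f'bf16: max {bf16_max:.4g}, smallest normal {bf16_min:.4g}')
```

Small gradients fall below the `fp16` range much sooner than the `bf16` range, which is why `fp16` training scales the loss before the backward pass.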

In [8]:
bert_2 = Config('bert-large-uncased', 128, 56, bf16=True)
run_bert(bert_2)

Total samples / second	: 135.3
Per GPU memory (GB)	: 18.2
Per GPU TFLOPs		: 35.3


You can see both an improved performance and a reduction of memory usage. The speed improvement is due to using Tensor Cores.

The memory usage comes mainly from three parts: model parameters, layer outputs in the forward pass (activations), and workspace memory used by backend libraries. It may surprise you that neither `fp16` nor `bf16` saves memory related to model parameters. The reason is that the model update still runs in 32-bit. For one model parameter:

- with normal `fp32`, we use 4 bytes for the 32-bit weight, 4 bytes for the 32-bit gradient, and 8 bytes for the two momentums in Adam, for a total of 16 bytes
- with `fp16` or `bf16`, we use 2 bytes for the 16-bit weight, 2 bytes for the 16-bit gradient (some implementations use a 32-bit gradient), 4 bytes for the master 32-bit weight, and 8 bytes for the two momentums in Adam, again for a total of 16 bytes
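
The accounting in the list above can be checked with a few lines of arithmetic. This is a rough sketch: the parameter count of about 340 million for `bert-large` is an assumption (the figure given in the BERT paper), and `bytes_per_param` is a hypothetical helper.

```python
def bytes_per_param(mixed_precision):
    """Bytes of weight, gradient and Adam state kept per model parameter."""
    if mixed_precision:
        parts = {'16-bit weight': 2, '16-bit gradient': 2,
                 '32-bit master weight': 4, 'Adam momentums': 8}
    else:
        parts = {'32-bit weight': 4, '32-bit gradient': 4, 'Adam momentums': 8}
    return sum(parts.values())


num_params = 340e6  # assumed approximate size of bert-large
fp32_gb = num_params * bytes_per_param(mixed_precision=False) / 1e9
mixed_gb = num_params * bytes_per_param(mixed_precision=True) / 1e9
print(f'fp32: {fp32_gb:.2f} GB, fp16/bf16: {mixed_gb:.2f} GB')
```

Both settings come out to the same figure, so any memory saved by 16-bit training has to come from somewhere else.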

The memory saving comes from the activations, which are now all stored in 16-bit. The activation size is linear in the batch size, sequence length, number of layers and hidden size. If you spend a lot of memory on activations, then either `bf16` or `fp16` will help a lot.

Now we can use a larger batch size, which further improves performance.

In [9]:
bert_3 = Config('bert-large-uncased', 128, 80, bf16=True)
run_bert(bert_3)

Total samples / second	: 144.9
Per GPU memory (GB)	: 23.4
Per GPU TFLOPs		: 37.8


The model update involves many vector operations, which cause non-negligible overhead. Replacing the optimizer with a fused implementation helps.

In [10]:
bert_4 = Config('bert-large-uncased', 128, 80, bf16=True, optim='adamw_apex_fused')
run_bert(bert_4)

Total samples / second	: 152.5
Per GPU memory (GB)	: 23.4
Per GPU TFLOPs		: 39.8


To further reduce the optimization overhead, we can accumulate the gradients multiple times before updating the weights. If we accumulate 4 times, the effective batch size becomes 4x larger. That may be too big for fine-tuning, but it is not a problem for pre-training. Also note that this option needs an extra buffer for gradients, so it may require a smaller batch size.

In [11]:
bert_5 = Config('bert-large-uncased', 128, 76, bf16=True, optim='adamw_apex_fused', 
                grad_accum=4, steps=10)
run_bert(bert_5)

Total samples / second	: 157.2
Per GPU memory (GB)	: 22.6
Per GPU TFLOPs		: 41.0
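
Why accumulation behaves like a larger batch can be seen on a toy problem. The sketch below uses a one-parameter least-squares model in plain Python (an assumption made purely for illustration, unrelated to the Huggingface training loop) and shows that averaging the gradients of 4 micro-batches reproduces the gradient of the full batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w*x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
accum_steps = 4
micro_batch = len(xs) // accum_steps

# Gradient computed on the whole batch at once
full_grad = grad_mse(w, xs, ys)

# Gradient accumulated over equally sized micro-batches, each scaled by 1/accum_steps
accum_grad = 0.0
for i in range(accum_steps):
    chunk = slice(i * micro_batch, (i + 1) * micro_batch)
    accum_grad += grad_mse(w, xs[chunk], ys[chunk]) / accum_steps

print(full_grad, accum_grad)

# Effective batch size of bert_5: per-GPU batch size times accumulation steps
effective_batch = 76 * 4
print(effective_batch)
```

The optimizer then runs once per 4 micro-batches, which is where the reduction in update overhead comes from.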


To further increase the batch size, we can throw away activations and recompute them when needed (gradient checkpointing). Now we can use a roughly 3.4x larger batch size (260 vs. 76).

In [12]:
bert_6 = Config('bert-large-uncased', 128, 260, bf16=True, optim='adamw_apex_fused', 
                grad_accum=4, grad_ckpt=True, steps=5)
run_bert(bert_6)

Total samples / second	: 138.9
Per GPU memory (GB)	: 19.4
Per GPU TFLOPs		: 48.3


Although it further improves TFLOPs, it decreases the number of samples per second because of the extra forward pass. So use it only when the model is so big that you cannot otherwise use an effective batch size.
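
The apparent contradiction between higher TFLOPs and lower throughput follows from how the estimate counts work: with `grad_ckpt` set, the formula charges 4 forward-equivalents per example instead of 3. A small sketch using the numbers reported above removes the recomputation from the count:

```python
reported_tflops_ckpt = 48.3    # bert_6, counts the recomputed forward pass
reported_tflops_plain = 41.0   # bert_5, no checkpointing

# Only 3 of the 4 forward-equivalents contribute to training progress.
useful_tflops_ckpt = reported_tflops_ckpt * 3 / 4

print(f'useful TFLOPs with checkpointing: {useful_tflops_ckpt:.1f}')
print(f'TFLOPs without checkpointing:     {reported_tflops_plain:.1f}')
```

Measured this way, the checkpointed run does less useful work per second, which is consistent with the lower samples-per-second figure.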

So the best option is `bert_5`. You can save it as plain text for sharing. 

In [13]:
asdict(bert_5)

{'model': 'bert-large-uncased',
 'seq_len': 128,
 'batch_size': 76,
 'bf16': True,
 'fp16': False,
 'optim': 'adamw_apex_fused',
 'grad_ckpt': False,
 'grad_accum': 4,
 'steps': 10,
 'gpus': '0',
 'ddp': False,
 'deepspeed': False,
 'ds_config': ''}
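
One way to share such a configuration is to write the dictionary to a JSON file and rebuild the dataclass from it. The sketch below uses a trimmed, hypothetical stand-in (`MiniConfig`) so that it runs on its own; the file name and the temporary directory are also assumptions.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass
class MiniConfig:
    """Trimmed stand-in for Config with only a few of its fields."""
    model: str
    seq_len: int
    batch_size: int
    bf16: bool = False
    optim: str = 'adamw_hf'
    grad_accum: int = 1


cfg = MiniConfig('bert-large-uncased', 128, 76, bf16=True,
                 optim='adamw_apex_fused', grad_accum=4)

# Save as plain text
path = os.path.join(tempfile.mkdtemp(), 'bert_5.json')
with open(path, 'w') as f:
    json.dump(asdict(cfg), f, indent=2)

# Load it back and rebuild the dataclass from the stored fields
with open(path) as f:
    restored = MiniConfig(**json.load(f))

print(restored == cfg)
```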

## GPT-2 on a Single GPU

Next we train a language model with GPT-2.

In [14]:
def run_gpt(config):
    cmd = f'''rm -rf /tmp/gpt; \
export CUDA_VISIBLE_DEVICES={config.gpus}; \
{launcher(config)} transformers/examples/pytorch/language-modeling/run_clm.py \
  --model_name_or_path {config.model} \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --do_train \
  --per_device_train_batch_size {config.batch_size} \
  --block_size {config.seq_len} \
  --learning_rate 2e-5 \
  --max_steps {config.steps} \
  --fp16 {config.fp16} \
  --bf16 {config.bf16} \
  --optim {config.optim} \
  --gradient_accumulation_steps {config.grad_accum} \
  --gradient_checkpointing {config.grad_ckpt} \
  --output_dir /tmp/gpt/ \
  --skip_memory_metrics False'''
    if config.deepspeed:
        cmd += f' --deepspeed {config.ds_config}'
    cmd += ' > log.txt 2>&1'
    os.system(cmd)
    log_summary(config, 'log.txt')

We use `gpt2-medium`, whose architecture is similar to `bert-large`. GPT-2 models use a longer sequence length; here we pick 512, which leads to a much smaller batch size.

In [15]:
gpt_1 = Config("gpt2-medium", 512, 6)
run_gpt(gpt_1)

Total samples / second	: 13.6
Per GPU memory (GB)	: 22.4
Per GPU TFLOPs		: 15.8


Use a configuration similar to `bert_5`. Note that the batch size increase from using `bf16` is smaller than for BERT. We also observed lower TFLOPs (24.6 vs. 41.0). The reason is unknown to us.

In [16]:
gpt_2 = Config("gpt2-medium", 512, 7, bf16=True, optim='adamw_apex_fused', 
                grad_accum=4)
run_gpt(gpt_2)

Total samples / second	: 21.1
Per GPU memory (GB)	: 21.2
Per GPU TFLOPs		: 24.6


## Multiple GPUs

Huggingface uses data parallelism automatically when multiple GPUs are available. Here we train on two GPUs based on the configuration `bert_5`.

In [17]:
mbert_1 = Config('bert-large-uncased', 128, 76, bf16=True, optim='adamw_apex_fused', 
                grad_accum=4, steps=10, gpus='0,1')
run_bert(mbert_1)

Total samples / second	: 255.4
Per GPU memory (GB)	: 23.1
Per GPU TFLOPs		: 33.3


You can see that the per-GPU TFLOPs is reduced due to communication overhead. Using PyTorch's DistributedDataParallel helps. But note that it uses extra memory for communication, so we need to reduce the batch size.

In [18]:
mbert_2 = mbert_1  # an alias, not a copy: the changes below also apply to mbert_1
mbert_2.ddp = True
mbert_2.batch_size = 70
run_bert(mbert_2)

Total samples / second	: 316.2
Per GPU memory (GB)	: 22.7
Per GPU TFLOPs		: 41.2


The GPUs are connected by NVLink. Let's test the speed without using NVLink.

In [25]:
!nvidia-smi topo -m

	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	NV4	0-23		N/A
GPU1	NV4	 X 	0-23		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks


In [19]:
os.environ["NCCL_P2P_DISABLE"] = "1"
run_bert(mbert_2)
os.environ["NCCL_P2P_DISABLE"] = "0"

Total samples / second	: 305.9
Per GPU memory (GB)	: 22.7
Per GPU TFLOPs		: 39.9
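
The cell above sets the variable back to `"0"` by hand. A small context manager, sketched below, restores whatever value was there before, even if the run fails. `temporary_env` is a hypothetical helper; the benchmark call is left as a comment because `run_bert` and `mbert_2` are defined elsewhere in the notebook.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable inside the block and restore it afterwards."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(name, None)   # variable did not exist before
        else:
            os.environ[name] = old       # put the previous value back


before = os.environ.get("NCCL_P2P_DISABLE")
with temporary_env("NCCL_P2P_DISABLE", "1"):
    inside = os.environ["NCCL_P2P_DISABLE"]
    # run_bert(mbert_2) would go here
after = os.environ.get("NCCL_P2P_DISABLE")
print(inside, before == after)
```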


The performance decreases only slightly. The reason is that we are using a relatively large batch size and accumulating gradients 4 times, so the weight update cost is small compared to the rest of the computation.

Next let's use DeepSpeed with ZeRO-2, which performs worse than DDP but allows a larger batch size because optimizer states and gradients are partitioned across GPUs.

In [20]:
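# mbert_1 is the same object as mbert_2, so ddp is still True here
# and launcher() keeps using torch.distributed.launch.
mbert_3 = mbert_1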
mbert_3.deepspeed = True
mbert_3.ds_config = 'transformers/tests/deepspeed/ds_config_zero2.json'
mbert_3.batch_size = 128
run_bert(mbert_3)

Total samples / second	: 297.6
Per GPU memory (GB)	: 22.4
Per GPU TFLOPs		: 38.8
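
A rough estimate shows where the room for a larger batch size comes from. The sketch assumes the per-parameter accounting used earlier for mixed precision (2 bytes weight, 2 bytes gradient, 12 bytes for master weight plus Adam state) and that ZeRO stage 2 shards gradients and optimizer states evenly while every GPU keeps a full copy of the 16-bit weights. Real runs also need communication buffers and activations, which are ignored here.

```python
def bytes_per_param_zero2(num_gpus):
    """Approximate per-GPU bytes per parameter under the stated assumptions."""
    weight = 2            # 16-bit weight, replicated on every GPU
    gradient = 2          # 16-bit gradient, sharded across GPUs
    optimizer = 4 + 8     # 32-bit master weight + two Adam momentums, sharded
    return weight + (gradient + optimizer) / num_gpus


for n in (1, 2, 4, 8):
    print(f'{n} GPU(s): {bytes_per_param_zero2(n):.2f} bytes per parameter')
```

With two GPUs the estimate drops from 16 to 9 bytes per parameter, and the memory freed this way can hold more activations.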


Next let's test GPT-2, using its default sequence length of 1024.

In [22]:
mgpt_1 = Config("gpt2-large", 1024, 2, bf16=True, optim='adamw_apex_fused', 
                grad_accum=16, gpus='0,1', steps=5, deepspeed=True, 
                ds_config='transformers/tests/deepspeed/ds_config_zero2.json')
run_gpt(mgpt_1)

Total samples / second	: 7.0
Per GPU memory (GB)	: 16.2
Per GPU TFLOPs		: 18.8


The performance is degraded more when NVLinks are not available. 

In [23]:
os.environ["NCCL_P2P_DISABLE"] = "1"
run_gpt(mgpt_1)
os.environ["NCCL_P2P_DISABLE"] = "0"

Total samples / second	: 5.6
Per GPU memory (GB)	: 16.2
Per GPU TFLOPs		: 15.0
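
The relative slowdown from disabling NVLink can be computed directly from the throughput numbers reported above for the BERT and GPT-2 runs. `slowdown` is a hypothetical helper.

```python
def slowdown(with_nvlink, without_nvlink):
    """Fraction of throughput lost when peer-to-peer transfers are disabled."""
    return 1 - without_nvlink / with_nvlink


bert_loss = slowdown(316.2, 305.9)   # mbert_2, samples per second
gpt_loss = slowdown(7.0, 5.6)        # mgpt_1, samples per second

print(f'BERT:  {bert_loss:.1%} slower')
print(f'GPT-2: {gpt_loss:.1%} slower')
```

The GPT-2 run uses a batch size of 2 per GPU on a larger model, so communication makes up a bigger share of each step than in the BERT run.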


Here is a screenshot of `nvtop` taken while executing the above cell. The GPUs are idle during communication.

![](imgs/nvtop.png)

Finally, let's train the 1.5B GPT-2 model.

In [24]:
mgpt_2 = Config("gpt2-xl", 1024, 1, bf16=True, optim='adamw_apex_fused', 
                grad_accum=16, gpus='0,1', deepspeed=True, steps=5,
                ds_config='transformers/tests/deepspeed/ds_config_zero2.json')
run_gpt(mgpt_2)

Total samples / second	: 2.2
Per GPU memory (GB)	: 15.3
Per GPU TFLOPs		: 11.8


Due to the GPU memory size, we can only process one example at a time, even when using ZeRO-2. This leads to unsatisfactory performance.

## Discussions

We explored several options for tuning Huggingface's example code and for understanding their performance. Note that HF has other flags that may further improve performance, and other libraries have reported higher performance. We will discuss more options in other notebooks.