# Transformers Benchmarks

Evaluate BERT and GPT-2 training performance on single and multiple GPUs.

## Installation and Utilities

Install Hugging Face `transformers` and DeepSpeed. Note that `transformers` is installed from source because we will run its example scripts.

In [1]:
from IPython.display import clear_output

!git clone https://github.com/huggingface/transformers
!cd transformers; pip install .
!pip install datasets evaluate accelerate deepspeed psutil

clear_output()

Get the model specification from its name on the [Hugging Face Hub](https://huggingface.co/models).

In [2]:
from dataclasses import dataclass, asdict
import requests
import json
import os

@dataclass
class ModelSpec:
    num_layers: int  # number of Transformer blocks
    hidden_size: int # attention/ffn dimensions
    vocab_size: int  # vocabulary size
        
    @staticmethod
    def from_config(model_name: str):
        page = requests.get(f'https://huggingface.co/{model_name}/raw/main/config.json')
        assert page.ok, f'failed to get the config of {model_name}'
        spec = page.json()
        get = lambda *keys: max(int(spec.get(k, 0)) for k in keys)
        num_layers = get('num_hidden_layers', 'n_layer')
        hidden_size = get('hidden_size', 'n_embd')
        vocab_size = get('vocab_size')
        return ModelSpec(num_layers, hidden_size, vocab_size)
    
ModelSpec.from_config('bert-large-uncased')

ModelSpec(num_layers=24, hidden_size=1024, vocab_size=30522)
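
The key lookup in `from_config` can be checked offline. The sketch below applies the same `max(...)` trick to two hand-written dictionaries. They are abridged stand-ins for the real `config.json` files (BERT uses `num_hidden_layers`/`hidden_size`, GPT-2 uses `n_layer`/`n_embd`), and `spec_from_dict` is a hypothetical helper that is not part of the notebook.

```python
def spec_from_dict(spec: dict) -> tuple:
    """Return (num_layers, hidden_size, vocab_size) from a config dict."""
    # A missing key contributes 0, so max() picks whichever alias is present.
    get = lambda *keys: max(int(spec.get(k, 0)) for k in keys)
    return (get('num_hidden_layers', 'n_layer'),
            get('hidden_size', 'n_embd'),
            get('vocab_size'))

# Abridged, hand-written stand-ins for the configs on the Hub.
bert_large = {'num_hidden_layers': 24, 'hidden_size': 1024, 'vocab_size': 30522}
gpt2_medium = {'n_layer': 24, 'n_embd': 1024, 'vocab_size': 50257}

print(spec_from_dict(bert_large))
print(spec_from_dict(gpt2_medium))
```

Note that a config with none of the expected keys silently yields 0 for that field, so it may be worth checking for zeros before using the result.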

Experiment configurations. 

In [20]:
@dataclass
class Config:
    model: str         # huggingface model name
    seq_len: int       # input sequence length
    batch_size: int    # batch size per GPU
        
    ## Improve speed / reduce memory  
    bf16: bool = False  # Faster, less memory. Recommended if the GPU supports it
    fp16: bool = False  # Faster, less memory, but needs loss scaling.
                        # Recommended if BF16 is not available.
    optim: str = 'adamw_hf'  # Optimization method
    grad_ckpt: bool = False  # save memory with an extra forward
    grad_accum: int = 1      # accumulate gradients for better performance
    steps: int = 50          # number of batches to benchmark
        
    ## Multi-GPUs
    gpus: str = '0'          # GPUs to use. "0,1" means use GPU 0 and 1
    ddp: bool = False        # whether to use PyTorch's DistributedDataParallel
    deepspeed: bool = False  # whether to use DeepSpeed
    ds_config: str = ''      # deepspeed config 
    
    def tflops(self):
        """Estimated TFLOPs needed to train on one example"""
        # Ignore all vector operators for simplicity; count only matrix multiplies.
        spec = ModelSpec.from_config(self.model) 
        attention = 4 * spec.hidden_size * self.seq_len**2 + \
                    8 * self.seq_len * spec.hidden_size**2 
        ffn = 16 * self.seq_len * spec.hidden_size**2
        embedding = 2 * self.seq_len * spec.hidden_size * spec.vocab_size
        forward = spec.num_layers * (attention + ffn) + embedding
        return (4 * forward if self.grad_ckpt else 3 * forward) / 1e12
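
To make the estimate concrete, the sketch below evaluates the same formula for `bert-large-uncased` at sequence length 128, with the model dimensions hard-coded so that no network access is needed. The free function `train_tflops` is a hypothetical stand-in for `Config.tflops`.

```python
def train_tflops(num_layers, hidden_size, vocab_size, seq_len, grad_ckpt=False):
    """Approximate TFLOPs to train on one example (matrix multiplies only)."""
    attention = 4 * hidden_size * seq_len**2 + 8 * seq_len * hidden_size**2
    ffn = 16 * seq_len * hidden_size**2
    embedding = 2 * seq_len * hidden_size * vocab_size
    forward = num_layers * (attention + ffn) + embedding
    # Backward costs about 2x forward; checkpointing adds one more forward.
    return (4 if grad_ckpt else 3) * forward / 1e12

bert_large = dict(num_layers=24, hidden_size=1024, vocab_size=30522)
per_example = train_tflops(seq_len=128, **bert_large)
print(f'{per_example:.4f} TFLOPs per example')   # 0.2608
```

Multiplying this per-example cost by the measured samples per second gives the achieved TFLOPS, which is what `log_summary` reports.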

Parse the training log and print a summary of memory usage and throughput.

In [36]:
def log_summary(config, log_filename):
    with open(log_filename) as f:
        lines = f.readlines()
    for l in lines:
        if 'CUDA out of memory' in l:
            print('Out of GPU memory, try a smaller batch size')
            return
        if '{\'train_runtime' in l:
            metrics = json.loads(l.replace('\'', '\"'))
            gpu_mem = metrics['init_mem_cpu_peaked_delta'] + \
                    metrics['train_mem_gpu_alloc_delta'] + metrics['train_mem_gpu_peaked_delta']
            print('Total used GPU memory:\t%.1f GB'% (gpu_mem/1e9))
            r = metrics['train_samples_per_second']
            print('# samples per second:\t%.1f' % r)
            num_gpus = len(config.gpus.split(','))
            print('Measured per GPU TFLOPs:\t%.1f' % (r * config.tflops() / num_gpus))
            return
    print(f'Failed. Check "{log_filename}" to find error')
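
The metrics line in the log is a printed Python dictionary, which is why `log_summary` swaps single quotes for double quotes before calling `json.loads`. The sketch below shows that step on a hypothetical log line (the numbers are made up for illustration) and compares it with `ast.literal_eval`, an alternative that parses the Python literal directly.

```python
import ast
import json

# Hypothetical log line in the shape the training script prints at the end.
line = "{'train_runtime': 31.5, 'train_samples_per_second': 88.9, 'epoch': 0.3}"

via_json = json.loads(line.replace("'", '"'))   # approach used in log_summary
via_ast = ast.literal_eval(line)                # parses the Python literal directly

print(via_json == via_ast)
print(via_ast['train_samples_per_second'])
```

The quote replacement is enough for purely numeric metrics. `ast.literal_eval` may be the more robust choice if a value ever contains a quote character.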

In [5]:
def launcher(config):
    if config.ddp:
        num_gpus = len(config.gpus.split(','))
        return f'python -m torch.distributed.launch --nproc_per_node {num_gpus}'
    if config.deepspeed:
        return 'deepspeed'
    return 'python'
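
A quick check of which launcher is chosen for a few settings. `launcher` is repeated here so the cell runs on its own, and `SimpleNamespace` is used as a lightweight stand-in for `Config`. Note that `ddp` takes precedence when both flags are set.

```python
from types import SimpleNamespace

def launcher(config):
    # Same logic as the notebook's launcher, repeated so this cell is self-contained.
    if config.ddp:
        num_gpus = len(config.gpus.split(','))
        return f'python -m torch.distributed.launch --nproc_per_node {num_gpus}'
    if config.deepspeed:
        return 'deepspeed'
    return 'python'

# SimpleNamespace stands in for Config; only the fields launcher reads are set.
single = SimpleNamespace(gpus='0', ddp=False, deepspeed=False)
ddp = SimpleNamespace(gpus='0,1', ddp=True, deepspeed=False)
both = SimpleNamespace(gpus='0,1', ddp=True, deepspeed=True)

for cfg in (single, ddp, both):
    print(launcher(cfg))
```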

## Bert on a Single GPU

Even though we are interested in pre-training, we fine-tune BERT with the [masked language modeling example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling) as an approximation, which only needs to download a small dataset (WikiText-2).

The following function maps our configuration into a command for `run_mlm.py`. The log is saved into `log.txt`, then we print a summary of the GPU memory usage and measured TFLOPS.

In [21]:
def run_bert(config):
    cmd = f'''rm -rf /tmp/bert; \
export CUDA_VISIBLE_DEVICES={config.gpus}; \
{launcher(config)} transformers/examples/pytorch/language-modeling/run_mlm.py \
  --model_name_or_path {config.model} \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --do_train \
  --max_seq_length {config.seq_len} \
  --per_device_train_batch_size {config.batch_size} \
  --fp16 {config.fp16} \
  --bf16 {config.bf16} \
  --optim {config.optim} \
  --gradient_accumulation_steps {config.grad_accum} \
  --gradient_checkpointing {config.grad_ckpt} \
  --max_steps {config.steps} \
  --output_dir /tmp/bert/ \
  --skip_memory_metrics False'''
    if config.deepspeed:
        cmd += f' --deepspeed {config.ds_config}'
    cmd += ' > log.txt 2>&1'
    os.system(cmd)
    log_summary(config, 'log.txt')

For good performance, use the largest batch size that does not run out of GPU memory.

In [12]:
bert_1 = Config('bert-large-uncased', 128, 56)
run_bert(bert_1)

Total used GPU memory:	23.0 GB
# samples per second:	88.9
Measured TFLOPs:	23.2
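
One way to find such a batch size is a binary search over candidate sizes. The sketch below assumes a hypothetical predicate `fits(batch_size)` that returns `True` when a short run finishes without running out of memory (for example, by checking the log for `CUDA out of memory`). Here it is replaced by a stub so the cell runs without a GPU.

```python
def largest_fitting_batch_size(fits, low=1, high=1024):
    """Largest b in [low, high] with fits(b) True, assuming fits is monotone.

    Returns None if even `low` does not fit.
    """
    if not fits(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if fits(mid):
            low = mid
        else:
            high = mid - 1
    return low

# Stub: pretend anything up to 56 examples fits in memory.
fake_fits = lambda batch_size: batch_size <= 56
print(largest_fitting_batch_size(fake_fits))   # 56
```

Each probe with a real `fits` would cost one short training run, so a small `steps` value keeps the search cheap.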


Now switch to `bf16`. You can see both improved performance and reduced memory usage. The former is due to the Tensor Cores in recent Nvidia GPUs. If you see an error or no improvement, try `fp16=True` instead. We recommend `bf16` because it does not require tuning the [loss scaling](https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407), as it has more exponent bits than `fp16`. 
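
The effect of the extra exponent bits can be seen without a GPU. The sketch below round-trips values through IEEE half precision using the standard `struct` module, and emulates `bf16` by keeping only the top 16 bits of a 32-bit float. Truncation is a simplification (real conversions usually round to nearest), and `to_fp16`/`to_bf16` are hypothetical helper names.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (5 exponent bits)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits) by truncating a float32 to its top 16 bits."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

tiny_grad = 1e-8
print(to_fp16(tiny_grad))         # underflows to 0.0 in fp16
print(to_bf16(tiny_grad) > 0)     # still nonzero in bf16

try:
    to_fp16(70000.0)              # above the fp16 maximum of 65504
except OverflowError as err:
    print('fp16 overflow:', err)
print(to_bf16(70000.0))           # in range for bf16, at reduced precision
```

Small gradients vanishing in `fp16` is the reason loss scaling is needed there. `bf16` trades mantissa precision for the same exponent range as `fp32`.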

The memory usage is mainly due to three parts: model parameters, layer outputs in the forward pass (activations), and workspace memory used by backend libraries. It may surprise you that neither `fp16` nor `bf16` saves space related to model parameters. For one model parameter:

- With plain `fp32`, we use 4 bytes for the 32-bit weight, 4 bytes for the 32-bit gradient, and 8 bytes for the two moment estimates in Adam, for a total of 16 bytes.
- With `fp16` or `bf16`, we use 2 bytes for the 16-bit weight, 2 bytes for the 16-bit gradient (some implementations use a 32-bit gradient), 4 bytes for the master 32-bit weight, and 8 bytes for the two moment estimates in Adam, again for a total of 16 bytes.

The memory saving is mainly due to all activations being stored in 16-bit instead of 32-bit. The activation size is linear in the batch size and sequence length, so if your GPU memory is large or the model is small (or you use ZeRO to shard the model), you save more.
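
The per-parameter accounting for the two cases listed above can be written out as a small calculation. The parameter count below is an assumption (a rounded figure of about 340M for `bert-large-uncased`), and `model_state_gb` is a hypothetical helper.

```python
# Bytes per parameter when training with Adam, for the two cases described above.
fp32_training = {'weight': 4, 'gradient': 4, 'adam_moments': 8}
mixed_training = {'weight_16bit': 2, 'gradient_16bit': 2,
                  'master_weight_32bit': 4, 'adam_moments': 8}

def model_state_gb(num_params: int, layout: dict) -> float:
    """Memory for weights, gradients and optimizer state, in GB (1e9 bytes)."""
    return num_params * sum(layout.values()) / 1e9

# Assumed, rounded parameter count for bert-large-uncased.
num_params = 340_000_000
print(sum(fp32_training.values()), sum(mixed_training.values()))   # 16 16
print(f'{model_state_gb(num_params, fp32_training):.1f} GB')
```

Under this assumption the model state takes roughly 5.4 GB either way, so the remaining difference between the `fp32` and `bf16` runs comes from activations and workspace memory.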

In [13]:
bert_2 = Config('bert-large-uncased', 128, 56, bf16=True)
run_bert(bert_2)

Total used GPU memory:	18.2 GB
# samples per second:	133.5
Measured TFLOPs:	34.8


Now we can use a larger batch size, which further improves performance.

In [14]:
bert_3 = Config('bert-large-uncased', 128, 80, bf16=True)
run_bert(bert_3)

Total used GPU memory:	23.4 GB
# samples per second:	145.0
Measured TFLOPs:	37.8


The model update involves multiple vector operators, which add non-negligible overhead. Replacing the optimizer with a fused implementation helps.

In [15]:
bert_4 = Config('bert-large-uncased', 128, 80, bf16=True, optim='adamw_apex_fused')
run_bert(bert_4)

Total used GPU memory:	23.4 GB
# samples per second:	154.1
Measured TFLOPs:	40.2


To further reduce the optimization overhead, we can accumulate gradients over multiple batches before updating the weights. Accumulating 4 times gives an effective batch size 4 times the per-GPU batch size. That may be too big for a fine-tuning task, but it is not a problem for pre-training.

In [19]:
bert_5 = Config('bert-large-uncased', 128, 76, bf16=True, optim='adamw_apex_fused', 
                grad_accum=4, steps=10)
run_bert(bert_5)

Total used GPU memory:	22.6 GB
# samples per second:	159.7
Measured TFLOPs:	41.6
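
Gradient accumulation does not change the gradient that is applied, only how often the optimizer runs. The toy sketch below (a scalar least-squares model in pure Python, not the Trainer's implementation) checks that averaging the gradients of 4 equally sized micro-batches matches the gradient of one large batch.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

full_batch = grad_mse(w, xs, ys)

# Accumulate over 4 micro-batches of 2 examples, averaging before the single update.
accum_steps, micro = 4, 2
accumulated = 0.0
for i in range(accum_steps):
    chunk = slice(i * micro, (i + 1) * micro)
    accumulated += grad_mse(w, xs[chunk], ys[chunk]) / accum_steps

print(math.isclose(full_batch, accumulated))   # True

# Effective batch size = per-device batch size x accumulation steps x number of GPUs.
print(76 * 4 * 1)   # 304 for the configuration in the cell above
```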


To increase the batch size further, we can throw away activations and recompute them when needed (gradient checkpointing). Now we can use a much larger batch size.

In [29]:
bert_6 = Config('bert-large-uncased', 128, 260, bf16=True, optim='adamw_apex_fused', 
                grad_accum=4, grad_ckpt=True, steps=5)
run_bert(bert_6)

Total used GPU memory:	19.2 GB
# samples per second:	139.2
Measured TFLOPs:	48.4


Although it further improves TFLOPS, it decreases the number of samples per second because of the extra forward pass. So use it only when the model is so big that you cannot otherwise use an effective batch size.
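
The gap between the two metrics comes from how the FLOPs are counted. With checkpointing the estimate charges 4 forward-equivalents per example instead of 3, but only 3 of them are work that a run without recomputation would need. The sketch below redoes the arithmetic for the measured 139.2 samples per second. `forward_flops` is a hypothetical helper using the same estimate as `Config.tflops`.

```python
def forward_flops(num_layers, hidden_size, vocab_size, seq_len):
    """Forward-pass FLOPs per example, using the same estimate as Config.tflops."""
    attention = 4 * hidden_size * seq_len**2 + 8 * seq_len * hidden_size**2
    ffn = 16 * seq_len * hidden_size**2
    embedding = 2 * seq_len * hidden_size * vocab_size
    return num_layers * (attention + ffn) + embedding

fwd = forward_flops(num_layers=24, hidden_size=1024, vocab_size=30522, seq_len=128)
samples_per_sec = 139.2                               # measured with grad_ckpt=True

hardware_tflops = samples_per_sec * 4 * fwd / 1e12    # counts the recomputed forward
useful_tflops = samples_per_sec * 3 * fwd / 1e12      # excludes the recomputation

print(f'{hardware_tflops:.1f} vs {useful_tflops:.1f}')   # 48.4 vs 36.3
```

By the second measure the checkpointed run is slower than the 41.6 TFLOPS obtained without checkpointing, which matches the drop in samples per second.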

In [39]:
asdict(bert_5)

{'model': 'bert-large-uncased',
 'batch_size': 96,
 'seq_len': 128,
 'bf16': True,
 'fp16': False,
 'optim': 'adamw_apex_fused',
 'grad_ckpt': False,
 'grad_accum': 4,
 'gpus': '0',
 'ddp': False,
 'deepspeed': False,
 'ds_config': ''}

## GPT-2 on a Single GPU

In [83]:
def run_gpt(config):
    cmd = f'''rm -rf /tmp/gpt; \
export CUDA_VISIBLE_DEVICES={config.gpus}; \
{launcher(config)} transformers/examples/pytorch/language-modeling/run_clm.py \
  --model_name_or_path {config.model} \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --do_train \
  --per_device_train_batch_size {config.batch_size} \
  --block_size {config.seq_len} \
  --learning_rate 2e-5 \
  --max_steps {config.steps} \
  --fp16 {config.fp16} \
  --bf16 {config.bf16} \
  --optim {config.optim} \
  --gradient_accumulation_steps {config.grad_accum} \
  --gradient_checkpointing {config.grad_ckpt} \
  --output_dir /tmp/gpt/ \
  --skip_memory_metrics False'''
    if config.deepspeed:
        cmd += f' --deepspeed {config.ds_config}'
    cmd += ' > log.txt 2>&1'
    os.system(cmd)
    log_summary(config, 'log.txt')

In [44]:
gpt_1 = Config("gpt2-medium", 512, 6)
run_gpt(gpt_1)

Total used GPU memory:	22.4 GB
# samples per second:	13.5
Measured TFLOPs:	15.7


In [49]:
gpt_2 = Config("gpt2-medium", 512, 7, bf16=True, optim='adamw_apex_fused', 
                grad_accum=4)
run_gpt(gpt_2)

Total used GPU memory:	21.2 GB
# samples per second:	21.1
Measured TFLOPs:	24.5


## Multiple GPUs

In [75]:
mbert_1 = Config('bert-large-uncased', 128, 76, bf16=True, optim='adamw_apex_fused', 
                grad_accum=4, steps=10, gpus='0,1')
run_bert(mbert_1)

Total used GPU memory:	23.1 GB
# samples per second:	252.2
Measured per GPU TFLOPs:	32.9


In [76]:
from dataclasses import replace

# Copy mbert_1 instead of aliasing it, so mbert_1 itself is left unchanged.
mbert_2 = replace(mbert_1, ddp=True, batch_size=70)
run_bert(mbert_2)

Total used GPU memory:	22.7 GB
# samples per second:	319.9
Measured per GPU TFLOPs:	41.7


In [77]:
os.environ["NCCL_P2P_DISABLE"] = "1"
run_bert(mbert_2)
os.environ["NCCL_P2P_DISABLE"] = "0"

Total used GPU memory:	22.7 GB
# samples per second:	310.0
Measured per GPU TFLOPs:	40.4


In [78]:
from dataclasses import replace

# Copy mbert_1 instead of aliasing it, so earlier configs are left unchanged.
mbert_3 = replace(mbert_1, deepspeed=True, batch_size=128,
                  ds_config='transformers/tests/deepspeed/ds_config_zero2.json')
run_bert(mbert_3)

Total used GPU memory:	22.4 GB
# samples per second:	297.4
Measured per GPU TFLOPs:	38.8


In [79]:
os.environ["NCCL_P2P_DISABLE"] = "1"
run_bert(mbert_3)
os.environ["NCCL_P2P_DISABLE"] = "0"

Total used GPU memory:	22.4 GB
# samples per second:	278.7
Measured per GPU TFLOPs:	36.3


In [85]:
mgpt_1 = Config("gpt2-large", 1024, 2, bf16=True, optim='adamw_apex_fused', 
                grad_accum=16, gpus='0,1', steps=5, deepspeed=True, 
                ds_config='transformers/tests/deepspeed/ds_config_zero2.json')
run_gpt(mgpt_1)

Total used GPU memory:	16.2 GB
# samples per second:	7.0
Measured per GPU TFLOPs:	18.8


In [84]:
os.environ["NCCL_P2P_DISABLE"] = "1"
run_gpt(mgpt_1)
os.environ["NCCL_P2P_DISABLE"] = "0"

Total used GPU memory:	16.2 GB
# samples per second:	5.7
Measured per GPU TFLOPs:	15.0


In [86]:
mgpt_2 = Config("gpt2-xl", 1024, 1, bf16=True, optim='adamw_apex_fused', 
                grad_accum=16, gpus='0,1', deepspeed=True, steps=5,
                ds_config='transformers/tests/deepspeed/ds_config_zero2.json')
run_gpt(mgpt_2)

Total used GPU memory:	15.3 GB
# samples per second:	2.2
Measured per GPU TFLOPs:	11.8


This is what