# Transformers Benchmarks

Evaluate Bert/GPT on single/multi GPUs. 

Install libraries for our benchmark:

In [1]:
!git clone https://github.com/huggingface/transformers
!cd transformers; pip install .
!pip install datasets evaluate deepspeed psutil

fatal: destination path 'transformers' already exists and is not an empty directory.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Processing /workspace/transformers
[33m  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.[0m
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h    Preparing wheel metadata ... [?25ldone
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 8.4 MB/s eta 0:00:01
Collecting huggingface-hub<1.0,>

Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
[K     |████████████████████████████████| 161 kB 61.4 MB/s eta 0:00:01
[?25hCollecting multidict<7.0,>=4.5
  Downloading multidict-6.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (121 kB)
[K     |████████████████████████████████| 121 kB 83.5 MB/s eta 0:00:01
[?25hCollecting async-timeout<5.0,>=4.0.0a3
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Building wheels for collected packages: deepspeed, py-cpuinfo
  Building wheel for deepspeed (setup.py) ... [?25ldone
[?25h  Created wheel for deepspeed: filename=deepspeed-0.7.0-py3-none-any.whl size=644048 sha256=71cc0e86bde6d2eebe579b0ba362d15ac0ea05d8ade2ed99c6397788c2b58dd4
  Stored in directory: /tmp/pip-ephem-wheel-cache-vzrgtz95/wheels/89/b1/1f/36abd13839a2c71b019c76b220576c62c2b16fe0558dbd326c
  Building wheel for py-cpuinfo (setup.py) ..

A few utility functions.

The TFLOPS of a BERT-like or GPT-like model to train one example. We ignored vector operations such as LayerNorm and weight updates for simplicity.

In [60]:
def model_tflops(num_layers, hidden_size, vocab_size, seq_len):
    attention = 4 * hidden_size * seq_len**2 + 8 * seq_len * hidden_size**2 
    ffn = 16 * seq_len * hidden_size**2
    embedding = 2 * seq_len * hidden_size * vocab_size
    forward = num_layers * (attention + ffn) + embedding
    return 3 * forward / 1e12

Find the number of examples per second from Huggingface's training log.

In [61]:
import json

def throughput(output):
    for l in output:
        if 'CUDA out of memory' in l:
            print('Out of GPU memory, try a smaller batch size')
            return 0
        if '{\'train_runtime' in l:
            metrics = json.loads(l.replace('\'', '\"'))
            gpu_mem = metrics['init_mem_cpu_peaked_delta'] + \
                metrics['train_mem_gpu_alloc_delta'] + metrics['train_mem_gpu_peaked_delta']
            print('Total used GPU memory:\t%.1f GB'% (gpu_mem/1e9))
            r = metrics['train_samples_per_second']
            print('# samples per second:\t%.1f' %r)
            return r
    print('Unknown error, print output to check')
    return 0

## Bert on a Single GPU

Add your model here if not exists.

In [54]:
model_spec = {
    # https://huggingface.co/bert-large-uncased/blob/main/config.json
    'bert-large-uncased' : {
        'num_layers' : 24, 'vocab_size' : 30522, 'hidden_size' : 1024},
    # https://huggingface.co/bert-base-cased/blob/main/config.json
    'bert-base-cased' : {
        'num_layers' : 12, 'vocab_size' : 28996, 'hidden_size' : 768}
}

Use fine-tuning BERT for [text-classifcation](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) as our workloads.

In [66]:
task = "mrpc"
model = "bert-large-uncased"
batch_size = 48
seq_len = 128
fp16 = True # default: False
optim = "adamw_apex_fused"  # default: adamw_hf
gradient_checkpointing = False # default: False
gradient_accumulation_steps = 4 # default: 1

cmd = f'''rm -rf /tmp/{task}; \
cd transformers/examples/pytorch/text-classification; \
python run_glue.py \
  --model_name_or_path {model} \
  --task_name {task} \
  --do_train \
  --max_seq_length {seq_len} \
  --per_device_train_batch_size {batch_size} \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --fp16 {fp16} \
  --optim {optim} \
  --gradient_accumulation_steps {gradient_accumulation_steps} \
  --gradient_checkpointing {gradient_checkpointing} \
  --output_dir /tmp/{task}/ \
  --skip_memory_metrics False \
'''

output = !$cmd

Get performance metrics.

In [68]:
tflops = model_tflops(seq_len=seq_len, **model_spec[model]) * throughput(output)
print('Measured TFLOPs:\t%.1f' % tflops)

Total used GPU memory:	14.6 GB
# samples per second:	158.0
Measured TFLOPs:	41.2


## GPT on a Single GPU

In [82]:
model_spec.update({
    'gpt2': { # https://huggingface.co/gpt2/blob/main/config.json
        'num_layers' : 12, 'vocab_size' : 50257, 'hidden_size' : 768},
    'gpt2-medium': { # https://huggingface.co/gpt2-medium/blob/main/config.json 
        'num_layers' : 24, 'vocab_size' : 50257, 'hidden_size' : 1024}},
)

In [79]:
model = "gpt2-medium"
batch_size = 4
seq_len = 512
fp16 = True # default: False
optim = "adamw_apex_fused"  # default: adamw_hf
gradient_checkpointing = False # default: False
gradient_accumulation_steps = 4 # default: 1


cmd = f'''rm -rf /tmp/clm; \
cd transformers/examples/pytorch/language-modeling; \
python run_clm.py \
    --model_name_or_path {model} \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size {batch_size} \
    --do_train \
    --block_size {seq_len} \
    --fp16 {fp16} \
    --optim {optim} \
    --gradient_accumulation_steps {gradient_accumulation_steps} \
    --gradient_checkpointing {gradient_checkpointing} \
    --skip_memory_metrics False \
    --max_steps 25 \
    --output_dir /tmp/clm
'''

output = ! $cmd

In [84]:
tflops = model_tflops(seq_len=seq_len, **model_spec[model]) * throughput(output)
print('Measured TFLOPs:\t%.1f' % tflops)

Total used GPU memory:	14.8 GB
# samples per second:	15.5
Measured TFLOPs:	18.0
