
[Benchmark] HF Trainer on A100 #15026

Open
stas00 opened this issue Jan 4, 2022 · 17 comments
Assignees: stas00
Labels: Benchmarks, WIP

Comments


stas00 commented Jan 4, 2022

🖥 Benchmarking transformers w/ HF Trainer on a single A100 40GB

We are going to use a special benchmarking tool that will do all the work for us. #14934

This is the index post and specific benchmarks are in their own posts below:

  1. fp16 vs bf16 vs tf32 vs fp32
  2. gradient accumulation steps
  3. batch size
  4. gradient checkpointing
  5. optimizers
  6. combining winning strategies ~3x speed improvement!
  7. RTX-3090 vs A100

Note that each benchmark was run only once, so multiple runs with averaging would probably give slightly different results. The purpose here is to see rough relative differences, not to produce exact numbers.

See also the same benchmarks for RTX-3090


stas00 commented Jan 4, 2022

precision: fp16 vs bf16 vs tf32 vs fp32

Main interest: benchmarking the new --bf16 and --tf32 modes on Ampere, compared to the existing fp16 and fp32 modes.

  • bf16 is autocast(dtype=torch.bfloat16)
  • tf32 is torch.backends.cuda.matmul.allow_tf32 = True (both settings are sketched below)
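
To make these two settings concrete, here is a minimal PyTorch sketch of what they correspond to outside of the Trainer (a simplified illustration, not the Trainer's actual code path):

import torch
from torch.cuda.amp import autocast

# --tf32 1: allow TensorFloat-32 math in matmuls on Ampere
# (cuDNN convolutions already default to TF32 in PyTorch)
torch.backends.cuda.matmul.allow_tf32 = True

model = torch.nn.Linear(1024, 1024, device="cuda")
x = torch.randn(64, 1024, device="cuda")

# --bf16: run the forward math under bfloat16 autocast
with autocast(dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()
loss.backward()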

Benchmark

The benchmark uses 3 different t5 models, and at the end of the section also gpt2. For t5 the main script is:

CUDA_VISIBLE_DEVICES=0 python \
 examples/pytorch/translation/run_translation.py --model_name_or_path t5-small \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 64 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 40000 --dataloader_num_workers 2 

and now adding one of:

--tf32 0 # fp32
--tf32 0 --fp16
--tf32 0 --bf16
--tf32 1
--tf32 1 --fp16
--tf32 1 --bf16

But we are going to use a special benchmarking tool that will do all the work for us. #14934

Important notes:

  1. The --tf32 0 --fp16 0 combo is just fp32 (which is the default mode - there is no explicit fp32 option per se)
  2. I changed --per_device_train_batch_size in the base command from 64 (t5-small) to 32 (t5-base) to 8 (t5-large) to be able to fit into the GPU memory while keeping it as occupied as possible.
  3. I changed --max_train_samples in the base command from 40k (t5-small) to 20k (t5-base) to 5k (t5-large) to give each run about 1-3min of run time, so that the benchmark doesn't take too long but is still long enough to put strain on the card.

*** Setup:

Datetime    : 2022-01-03 22:43:38

Software:
transformers: 4.16.0.dev0
torch       : 1.10.1
cuda        : 11.3
python      : 3.8.12

Hardware:
1 GPUs      : NVIDIA A100-SXM4-40GB, 39.59GB

Benchmark 1: t5-small

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --tf32 0 | 272.59 | 0 | 2.49 |
| --tf32 1 | 581.61 | 113 | 2.49 |
| --fp16 --tf32 0 | 643.07 | 136 | 2.49 |
| --fp16 --tf32 1 | 635.24 | 133 | 2.49 |
| --bf16 --tf32 0 | 616.23 | 126 | 2.50 |
| --bf16 --tf32 1 | 612.59 | 125 | 2.50 |

Conclusions:

  • fp16 is 136% faster than fp32
  • bf16 is ~4% slower than fp16
  • tf32 is 113% faster than fp32, and only ~10% slower than fp16
  • tf32 makes ~1% impact on bf16 and fp16 modes

*** The benchmark command line was:

CUDA_VISIBLE_DEVICES=3 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-small \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 64 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 40000 --dataloader_num_workers 2 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'|--fp16|--bf16' '--tf32 0|--tf32 1' --report-metric-keys train_loss \
--repeat-times 1 --base-variation '--tf32 0'

Benchmark 2: t5-base

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --tf32 0 | 80.10 | 0 | 2.21 |
| --tf32 1 | 214.10 | 167 | 2.21 |
| --fp16 --tf32 0 | 219.20 | 174 | 2.21 |
| --fp16 --tf32 1 | 218.46 | 173 | 2.21 |
| --bf16 --tf32 0 | 214.17 | 167 | 2.22 |
| --bf16 --tf32 1 | 225.44 | 181 | 2.22 |

Conclusions:

  • fp16 is 173% faster than fp32
  • bf16 is about the same as fp16
  • tf32 is 167% faster than fp32, and is about the same as fp16

*** The benchmark command line was:

CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 32 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'|--fp16|--bf16' '--tf32 0|--tf32 1' --report-metric-keys train_loss \
--repeat-times 1 --base-variation '--tf32 0'

Benchmark 3: t5-large

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --tf32 0 | 31.59 | 0 | 2.03 |
| --tf32 1 | 36.13 | 14 | 2.03 |
| --fp16 --tf32 0 | 34.86 | 10 | 0.00 |
| --fp16 --tf32 1 | 36.77 | 16 | 0.00 |
| --bf16 --tf32 0 | 31.35 | -1 | 2.04 |
| --bf16 --tf32 1 | 31.30 | -1 | 2.04 |

Conclusions:

  • fp16 overflows here (loss=0). (This is a well-known issue when models pretrained in bf16 are finetuned in fp16.)
  • tf32 is only 14% faster than fp32
  • bf16 for some reason performs really badly - same as fp32 (this same benchmark on RTX-3090 doesn't have this problem)

*** The benchmark command line was:

CUDA_VISIBLE_DEVICES=3 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-large \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 8 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 5000 --dataloader_num_workers 2 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'|--fp16|--bf16' '--tf32 0|--tf32 1' --report-metric-keys train_loss \
--repeat-times 1 --base-variation '--tf32 0'

If I use a higher bs=16 instead of 8, bf16 does deliver better performance than fp32, but still not on par with fp16:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --tf32 0 | 39.59 | 0 | 2.10 |
| --tf32 1 | 67.24 | 70 | 2.10 |
| --fp16 --tf32 0 | 70.88 | 79 | 0.00 |
| --fp16 --tf32 1 | 70.38 | 78 | 0.00 |
| --bf16 --tf32 0 | 61.37 | 55 | 2.12 |
| --bf16 --tf32 1 | 59.95 | 51 | 2.12 |

It'd be great to know why CUDA doesn't activate some optimization here, since not everybody is going to run benchmarks. But if you do run benchmarks and find yourself in this situation, Eddie Yan proposed adding --gradient_accumulation_steps to create a much larger effective batch for the optimizer to step with, which should help a lot.

So let's try --per_device_train_batch_size 16 --gradient_accumulation_steps 4 for a total of effective bs=64:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --tf32 0 | 45.60 | 0 | 2.35 |
| --tf32 1 | 86.78 | 90 | 2.36 |
| --fp16 --tf32 0 | 77.47 | 70 | 0.00 |
| --fp16 --tf32 1 | 79.63 | 75 | 0.00 |
| --bf16 --tf32 0 | 75.85 | 66 | 2.41 |
| --bf16 --tf32 1 | 73.19 | 61 | 2.41 |

Both bf16 and tf32 show a much better performance here.

Benchmark 4: gpt2

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --tf32 0 | 28.77 | 0 | 3.34 |
| --tf32 1 | 63.51 | 121 | 3.34 |
| --fp16 --tf32 0 | 69.60 | 142 | 3.34 |
| --fp16 --tf32 1 | 69.98 | 143 | 3.34 |
| --bf16 --tf32 0 | 70.37 | 145 | 3.34 |
| --bf16 --tf32 1 | 69.88 | 143 | 3.34 |

Conclusions:

  • fp16 is ~140% faster than fp32
  • bf16 is on par with fp16
  • tf32 is 121% faster than fp32, and only ~8% slower than fp16
  • tf32 makes almost no impact on bf16 and fp16 modes
*** The benchmark command line was:

CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--logging_strategy no --save_strategy no --do_train --max_train_samples 5000 \
--per_device_train_batch_size 16 --num_train_epochs 1 --warmup_steps 8 \
--block_size 512 --report_to none ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'|--fp16|--bf16' '--tf32 0|--tf32 1' --report-metric-keys train_loss \
--repeat-times 1 --base-variation '--tf32 0'

Benchmark 5: gpt2-medium

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --tf32 0 | 10.60 | 0 | 2.98 |
| --tf32 1 | 24.81 | 134 | 2.98 |
| --fp16 --tf32 0 | 27.67 | 161 | 2.99 |
| --fp16 --tf32 1 | 27.62 | 160 | 2.99 |
| --bf16 --tf32 0 | 27.57 | 160 | 2.99 |
| --bf16 --tf32 1 | 27.55 | 160 | 2.99 |

Conclusions:

  • fp16 is ~160% faster than fp32
  • bf16 is on par with fp16
  • tf32 is 134% faster than fp32, and only ~10% slower than fp16
  • tf32 makes no impact on bf16 and fp16 modes

CUDA_VISIBLE_DEVICES=3 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2-medium \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--logging_strategy no --save_strategy no --do_train --max_train_samples 2500 \
--per_device_train_batch_size 8 --num_train_epochs 1 --warmup_steps 8 \
--block_size 512 --report_to none ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'|--fp16|--bf16' '--tf32 0|--tf32 1' --report-metric-keys train_loss \
--repeat-times 1 --base-variation '--tf32 0'


stas00 commented Jan 4, 2022

gradient accumulation steps

Let's choose the t5-base model to test with, as it's pretty large yet doesn't overflow like t5-large.

Let's measure --gradient_accumulation_steps 1,2,4,8,16,32 with different precision configurations.

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --gradient_accumulation_steps 1 --tf32 0 | 93.68 | 0 | 2.21 |
| --gradient_accumulation_steps 1 --tf32 1 | 210.53 | 125 | 2.21 |
| --gradient_accumulation_steps 1 --tf32 0 --fp16 | 217.75 | 132 | 2.21 |
| --gradient_accumulation_steps 1 --tf32 0 --bf16 | 224.09 | 139 | 2.22 |
| --gradient_accumulation_steps 2 --tf32 0 | 97.48 | 4 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 1 | 236.39 | 152 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 0 --fp16 | 244.81 | 161 | 2.28 |
| --gradient_accumulation_steps 2 --tf32 0 --bf16 | 246.08 | 163 | 2.29 |
| --gradient_accumulation_steps 4 --tf32 0 | 99.68 | 6 | 2.39 |
| --gradient_accumulation_steps 4 --tf32 1 | 248.24 | 165 | 2.40 |
| --gradient_accumulation_steps 4 --tf32 0 --fp16 | 259.20 | 177 | 2.41 |
| --gradient_accumulation_steps 4 --tf32 0 --bf16 | 263.39 | 181 | 2.42 |
| --gradient_accumulation_steps 8 --tf32 0 | 100.67 | 7 | 2.58 |
| --gradient_accumulation_steps 8 --tf32 1 | 252.45 | 169 | 2.58 |
| --gradient_accumulation_steps 8 --tf32 0 --fp16 | 261.59 | 179 | 2.58 |
| --gradient_accumulation_steps 8 --tf32 0 --bf16 | 267.37 | 185 | 2.62 |
| --gradient_accumulation_steps 16 --tf32 0 | 100.97 | 8 | 2.83 |
| --gradient_accumulation_steps 16 --tf32 1 | 253.68 | 171 | 2.84 |
| --gradient_accumulation_steps 16 --tf32 0 --fp16 | 256.13 | 173 | 2.84 |
| --gradient_accumulation_steps 16 --tf32 0 --bf16 | 274.14 | 193 | 2.89 |

Let's look at just one subset so that it's easier to compare the gradient accumulation differences alone, re-running with just bf16 enabled (--tf32 0 --bf16):

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --gradient_accumulation_steps 1 | 228.41 | 0 | 2.22 |
| --gradient_accumulation_steps 2 | 248.77 | 9 | 2.29 |
| --gradient_accumulation_steps 4 | 263.54 | 15 | 2.42 |
| --gradient_accumulation_steps 8 | 270.12 | 18 | 2.62 |
| --gradient_accumulation_steps 16 | 271.99 | 19 | 2.89 |

Conclusions:

  • that's a significant speedup even at 4 accumulation steps
  • at 16 the gain over 8 is almost negligible
  • the loss gets noticeably worse with higher accumulation steps - my benchmark is very short, and with fewer optimizer steps to take when the effective batches are larger, the model simply doesn't have a chance to step down far enough. The same can be observed with plain batch size changes. The non-zero lr warmup also plays a role here, since it's a very short run. (A minimal sketch of the accumulation mechanism follows below.)
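
For readers who want to see what the flag does mechanically, here is a minimal PyTorch sketch of gradient accumulation (a simplified illustration, not the Trainer's actual implementation):

import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
accumulation_steps = 8                       # corresponds to --gradient_accumulation_steps 8

optimizer.zero_grad()
for step in range(64):                       # pretend dataloader of 64 micro-batches
    x = torch.randn(32, 512, device="cuda")  # micro-batch of per_device_train_batch_size=32
    loss = model(x).pow(2).mean()
    (loss / accumulation_steps).backward()   # scale so the summed grads average over the big batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one optimizer step per 8*32=256 effective samples
        optimizer.zero_grad()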

1.

CUDA_VISIBLE_DEVICES=3 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 32 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--gradient_accumulation_steps 1|--gradient_accumulation_steps 2|--gradient_accumulation_steps 4|--gradient_accumulation_steps 8|--gradient_accumulation_steps 16' \
'--tf32 0|--tf32 1|--tf32 0 --fp16|--tf32 0 --bf16' --report-metric-keys \
train_loss --repeat-times 1 

2.

CUDA_VISIBLE_DEVICES=3 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 32 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 --tf32 0 --bf16 ' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--gradient_accumulation_steps 1|--gradient_accumulation_steps 2|--gradient_accumulation_steps 4|--gradient_accumulation_steps 8|--gradient_accumulation_steps 16' \
--report-metric-keys train_loss --repeat-times 1


stas00 commented Jan 4, 2022

batch size

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --per_device_train_batch_size 1 | 7.77 | 0 | 1.90 |
| --per_device_train_batch_size 2 | 15.51 | 100 | 2.01 |
| --per_device_train_batch_size 4 | 29.66 | 282 | 2.09 |
| --per_device_train_batch_size 8 | 61.16 | 687 | 2.16 |
| --per_device_train_batch_size 16 | 115.84 | 1392 | 2.25 |
| --per_device_train_batch_size 32 | 224.96 | 2797 | 2.38 |

Conclusions:

  • No surprise here: the speed is directly proportional to GPU utilization. BS=32 is the largest batch size tested here that fits, so at BS=1 we greatly underutilize the GPU, and the speedup is almost directly proportional to the batch size.
  • as with gradient accumulation steps, the LM loss gets worse as the batch size grows, because my benchmark is very short and with fewer optimizer steps to take when the batches are larger, the model simply doesn't have a chance to step down far enough (see the quick arithmetic below).
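
To make the "fewer optimizer steps" point concrete, here is the quick arithmetic for this benchmark's fixed --max_train_samples 5000 (a back-of-the-envelope sketch, not output of the benchmark tool):

# with a fixed number of training samples, a bigger batch means fewer optimizer steps per epoch
max_train_samples = 5000
for bs in (1, 2, 4, 8, 16, 32):
    print(f"batch_size={bs:2d} -> {max_train_samples // bs:4d} optimizer steps")
# batch_size= 1 -> 5000 optimizer steps
# ...
# batch_size=32 ->  156 optimizer steps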

CUDA_VISIBLE_DEVICES=3 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 5000 --dataloader_num_workers 2 --bf16' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--per_device_train_batch_size 1|--per_device_train_batch_size 2|--per_device_train_batch_size 4|--per_device_train_batch_size 8|--per_device_train_batch_size 16|--per_device_train_batch_size 32' \
--report-metric-keys train_loss --repeat-times 1 


stas00 commented Jan 4, 2022

gradient checkpointing

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --gradient_checkpointing 0 | 225.67 | 24 | 2.30 |
| --gradient_checkpointing 1 | 182.68 | 0 | 2.30 |

Conclusions:

  • as expected, since gradient checkpointing recalculates forward activations, it costs speed - turning it off is 24% faster here.

Let's look at memory:

| Variation | Train samples per second | Diff % | Train loss | Train mem gpu alloc delta | Train mem gpu peaked delta |
|---|---|---|---|---|---|
| --gradient_checkpointing 0 | 122.81 | 35 | 2.38 | 2739MB | 1155MB |
| --gradient_checkpointing 1 | 90.92 | 0 | 2.38 | 2697MB | 3229MB |

We can clearly see that peak GPU memory is ~2/3 less.

note: I had to halve the BS in the 2nd benchmark as I was getting OOM.
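
For reference, here is roughly what the --gradient_checkpointing flag toggles - a minimal sketch assuming the standard transformers and PyTorch APIs (the Trainer calls this for you):

import torch
from torch.utils.checkpoint import checkpoint
from transformers import AutoModelForSeq2SeqLM

# --gradient_checkpointing 1: discard intermediate activations during the forward pass
# and recompute them during backward, trading extra compute for lower peak memory
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
model.gradient_checkpointing_enable()

# the plain PyTorch building block underneath: wrap a module call in checkpoint()
block = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.ReLU())
hidden = torch.randn(8, 128, 768, requires_grad=True)
out = checkpoint(block, hidden)   # block's activations get recomputed during backward
out.sum().backward()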

CUDA_VISIBLE_DEVICES=3 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 32 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 10000 --dataloader_num_workers 2 --bf16' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--gradient_checkpointing 0|--gradient_checkpointing 1' --report-metric-keys \
train_loss --repeat-times 1 

CUDA_VISIBLE_DEVICES=3 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-base \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 32 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 5000 --dataloader_num_workers 2 --bf16 --skip_memory_metrics 0' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--gradient_checkpointing 0|--gradient_checkpointing 1' --report-metric-keys \
'train_loss train_mem_gpu_alloc_delta train_mem_gpu_peaked_delta' \
--repeat-times 1

stas00 changed the title from "[WIP] [Benchmark] HF Trainer on A100" to "[Benchmark] HF Trainer on A100" on Jan 4, 2022

stas00 commented Jan 4, 2022

optimizers

Let's do fp32 first:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_hf | 214.55 | 2 | 2.21 |
| --optim adamw_torch | 209.72 | 0 | 2.21 |
| --optim adafactor | 158.56 | -24 | 2.21 |
| --optim adamw_apex_fused | 227.96 | 9 | 2.21 |

Observations:

  • apex's FusedAdam is the fastest.

fp16:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_hf | 221.08 | 5 | 2.21 |
| --optim adamw_torch | 209.85 | 0 | 2.21 |
| --optim adafactor | 160.69 | -23 | 2.21 |
| --optim adamw_apex_fused | 231.71 | 10 | 2.21 |

bf16:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_hf | 221.28 | 4 | 2.22 |
| --optim adamw_torch | 212.83 | 0 | 2.22 |
| --optim adafactor | 164.21 | -23 | 2.22 |
| --optim adamw_apex_fused | 237.31 | 12 | 2.22 |

Observations:

  • The relative speed up is similar

# fp32
CUDA_VISIBLE_DEVICES=3 python scripts/benchmark/trainer-benchmark.py --base-cmd \
' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: "  --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 \
' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--optim adamw_hf|--optim adamw_torch|--optim adafactor|--optim adamw_apex_fused' \
--report-metric-keys train_loss --base-variation '--optim adamw_torch'

# fp16 - just add --fp16 to base-cmd

# bf16 - just add --bf16 to base-cmd
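
For orientation, here is roughly what those --optim values map to when constructing the optimizers by hand - a sketch, not the Trainer's exact defaults (apex has to be installed separately for the fused variant):

import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(512, 512)

# --optim adamw_torch: PyTorch's native AdamW
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

# --optim adafactor: transformers' Adafactor (lower memory, but slowest above)
opt = Adafactor(model.parameters(), lr=5e-5, scale_parameter=False, relative_step=False)

# --optim adamw_apex_fused: NVIDIA apex's fused AdamW kernel (fastest above)
# from apex.optimizers import FusedAdam
# opt = FusedAdam(model.parameters(), lr=5e-5)

# --optim adamw_hf: transformers' own python AdamW implementation
# from transformers.optimization import AdamW
# opt = AdamW(model.parameters(), lr=5e-5)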


stas00 commented Jan 4, 2022

combining winning strategies

Now let's combine the winning strategies from each individual benchmark above and compare with the baseline:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 92.15 | 0 | 2.21 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16 | 267.21 | 190 | 2.62 |

Getting an almost 3x improvement in speed!

CUDA_VISIBLE_DEVICES=3 python \
../transformers-stas/scripts/benchmark/trainer-benchmark.py --base-cmd \
' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: "  --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 \
' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--optim adamw_torch --gradient_accumulation_steps 1 --tf32 0|--optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16' \
--report-metric-keys train_loss --base-variation '--optim adamw_torch'
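
Expressed through the Trainer API, the winning combination is roughly the following - a sketch assuming an Ampere GPU with apex installed (adjust the batch size to your card):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,   # winner from the gradient accumulation benchmark
    optim="adamw_apex_fused",        # winner from the optimizers benchmark
    tf32=True,                       # Ampere-only TF32 matmuls
    bf16=True,                       # winner from the precision benchmark
)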


stas00 commented Jan 4, 2022

RTX-3090 vs A100

In all the benchmarks above I used a bigger batch size and more samples than in the corresponding RTX-3090 benchmarks, since the A100 40GB card can handle more than the RTX-3090 24GB card. But let's now compare the two using the same config, so the RTX-3090 will be fully loaded while the A100 will be only partially loaded.

Also, each card is running in a different machine, so there is a bit of hardware difference as well.

A100

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 85.99 | 0 | 2.16 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16 | 153.72 | 79 | 2.42 |

RTX-3090

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 88.94 | 0 | 2.16 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16 | 173.15 | 95 | 2.42 |

Observations:

  • This is very unexpected: the RTX-3090 appears to be faster when we use only about half of the A100's capacity in order to match the two.
  • since the machines aren't the same, it'd be good to find a machine that has both cards and test them on equal hardware.

Same software was used for both setups:

transformers: 4.16.0.dev0
torch       : 1.10.1
cuda        : 11.3
python      : 3.8.12

CUDA_VISIBLE_DEVICES=0 python \
/hf/transformers-trainer-benchmark/scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 16 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: "  --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 \
' \
--target-metric-key train_samples_per_second --repeat-times 1 --variations \
'--optim adamw_torch --gradient_accumulation_steps 1 --tf32 0|--optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --bf16' \
--report-metric-keys train_loss

I thought that perhaps this had to do with bf16, so I re-did the same with --fp16 instead of --bf16, but the outcome is similar: the RTX-3090 still appears to be faster on the same benchmark:

A100

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 85.89 | 0 | 2.16 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --fp16 | 144.20 | 68 | 2.39 |

RTX-3090

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 88.95 | 0 | 2.16 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 --fp16 | 168.28 | 89 | 2.39 |

Still not good for A100. Let's try w/o tf32:

A100

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 86.35 | 0 | 2.16 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 0 --fp16 | 156.16 | 81 | 2.39 |

RTX-3090

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch --gradient_accumulation_steps 1 --tf32 0 | 88.87 | 0 | 2.16 |
| --optim adamw_apex_fused --gradient_accumulation_steps 8 --tf32 0 --fp16 | 167.36 | 88 | 2.39 |

This is better for A100. So tf32 was making things worse here for some reason.

Eddie Yan explained the reason for RTX-3090 being faster:

When there is underutilization on GPUs of similar architectures, it may come down to clock rates, and 3090 does have faster peak clock rates than A100.

stas00 added the Benchmarks label on Jan 5, 2022
stas00 self-assigned this on Jan 5, 2022
stas00 added the WIP label on Jan 5, 2022

eqy commented Jan 14, 2022

@stas00 Another interesting issue that I found regarding batch size is that it is an important parameter when the model is mostly in fp32 and relies on autocast to dispatch to fp16 or bf16. I believe this is because the overhead of casting back and forth can dominate the total runtime compared to the actual kernel/operator.

Consider the following microbenchmark:

import time
import torch

from torch.cuda.amp import autocast

def bench(dtype, dim, shape, auto=True):
    linear = torch.nn.Linear(shape[-1], dim, device='cuda')
    inp = torch.randn(shape, device='cuda')
    ctx_manager = None
    if not auto:
        inp = inp.to(dtype)
        linear = linear.to(dtype)
    else:
        ctx_manager = autocast(dtype=dtype)

    def run(inp, layer, ctx_manager):
        if ctx_manager is not None:
            with ctx_manager:
                layer(inp)
        else:
            layer(inp)

    run(inp, linear, ctx_manager)
    torch.cuda.synchronize()
    t1 = time.time()
    for i in range(1000):
        run(inp, linear, ctx_manager)
    torch.cuda.synchronize()
    t2 = time.time()
    return t2 - t1

if __name__ == '__main__':
    hidden_dim = 1024
    for auto in (True, False):
        print(f"autocast: {auto}")
        for batch_size in (8, 16, 32, 64):
            shape = [batch_size, 56, hidden_dim]
            times = list()
            for dtype in (torch.float32, torch.float16, torch.bfloat16):
                times.append(bench(dtype, hidden_dim, shape, auto))
            print(f"batch_size: {batch_size} fp32 {times[0]:3f} fp16 {times[1]:3f} bf16 {times[2]:3f}")
            print(f"speedup: fp32 {(times[0]/times[0]):3f} fp16 {(times[0]/times[1]):3f} bf16 {(times[0]/times[2]):3f}")

I get the following times on A6000 (similar architecture to 3090):

autocast: True
batch_size: 8 fp32 0.040668 fp16 0.061515 bf16 0.060857
speedup: fp32 1.000000 fp16 0.661102 bf16 0.668253
batch_size: 16 fp32 0.050241 fp16 0.061965 bf16 0.061339
speedup: fp32 1.000000 fp16 0.810793 bf16 0.819065
batch_size: 32 fp32 0.109936 fp16 0.089657 bf16 0.091546
speedup: fp32 1.000000 fp16 1.226184 bf16 1.200876
batch_size: 64 fp32 0.189083 fp16 0.150391 bf16 0.150227
speedup: fp32 1.000000 fp16 1.257275 bf16 1.258648
autocast: False
batch_size: 8 fp32 0.038590 fp16 0.031647 bf16 0.030893
speedup: fp32 1.000000 fp16 1.219398 bf16 1.249145
batch_size: 16 fp32 0.049446 fp16 0.032320 bf16 0.031509
speedup: fp32 1.000000 fp16 1.529887 bf16 1.569281
batch_size: 32 fp32 0.111689 fp16 0.056600 bf16 0.060192
speedup: fp32 1.000000 fp16 1.973323 bf16 1.855555
batch_size: 64 fp32 0.190082 fp16 0.103095 bf16 0.104311
speedup: fp32 1.000000 fp16 1.843766 bf16 1.822261


stas00 commented Jan 14, 2022

Thank you, @eqy.

update: edited out the original note on casting back, since the explicit casting is not being measured

Added a nicely formatted table output so it's much easier to analyze. Updated script attached: bench.txt

On RTX-3090 I get:

Autocast: True
Results:

bs torch.float32 torch.float16 torch.bfloat16
8 0.057 0.070 0.070
16 0.082 0.082 0.070
32 0.169 0.103 0.119
64 0.267 0.190 0.191

Speedup:

bs torch.float32 torch.float16 torch.bfloat16
8 1.000 0.810 0.818
16 1.000 0.997 1.179
32 1.000 1.639 1.421
64 1.000 1.411 1.398

Autocast: False
Results:

bs torch.float32 torch.float16 torch.bfloat16
8 0.052 0.040 0.040
16 0.082 0.045 0.045
32 0.170 0.073 0.090
64 0.268 0.143 0.148

Speedup:

bs torch.float32 torch.float16 torch.bfloat16
8 1.000 1.320 1.306
16 1.000 1.849 1.820
32 1.000 2.338 1.895
64 1.000 1.866 1.814


eqy commented Jan 14, 2022

I believe there are "speed-of-light" cases where the cast-back wouldn't be necessary, though this may not be possible for the architectures we're interested in. Here, I think the big picture is that once the batch-size falls below a certain amount, the "building-block" operations like GEMMs will be slower in reduced precision vs. fp32 when casts are needed.


stas00 commented Jan 14, 2022

Why do you think bs=32 is an oddball relative to the other batch sizes for speedup? In both cases, w/ and w/o amp, it's relatively faster for bf16 and fp16 than bs=64, and much more significantly so for fp16. One would expect 8 < 16 < 32 < 64, but here it is 8 < 16 < 64 < 32.

So the actual results are proportionally in line, but the speedups aren't.


eqy commented Jan 14, 2022

That's interesting; I didn't see quite such dramatic results on an A100 (80GB). Two runs:

Autocast: True

Results:

bs torch.float32 torch.float16 torch.bfloat16
8 0.062 0.086 0.077
16 0.043 0.089 0.085
32 0.073 0.084 0.084
64 0.119 0.101 0.112

Speedup:

bs torch.float32 torch.float16 torch.bfloat16
8 1.000 0.714 0.805
16 1.000 0.486 0.508
32 1.000 0.865 0.871
64 1.000 1.181 1.058

Autocast: False

Results:

bs torch.float32 torch.float16 torch.bfloat16
8 0.045 0.049 0.040
32 0.073 0.046 0.044
64 0.120 0.063 0.076

Speedup:

bs torch.float32 torch.float16 torch.bfloat16
8 1.000 0.908 1.129
16 1.000 0.855 0.873
32 1.000 1.570 1.638
64 1.000 1.913 1.580

Autocast: True

Results:

bs torch.float32 torch.float16 torch.bfloat16
8 0.062 0.086 0.077
16 0.059 0.089 0.085
32 0.073 0.084 0.084
64 0.119 0.101 0.114

Speedup:

bs torch.float32 torch.float16 torch.bfloat16
8 1.000 0.720 0.802
16 1.000 0.660 0.691
32 1.000 0.871 0.866
64 1.000 1.182 1.051

Autocast: False

Results:

bs torch.float32 torch.float16 torch.bfloat16
8 0.044 0.048 0.041
16 0.041 0.047 0.048
32 0.073 0.045 0.047
64 0.120 0.063 0.077

Speedup:

bs torch.float32 torch.float16 torch.bfloat16
8 1.000 0.929 1.094
16 1.000 0.883 0.861
32 1.000 1.615 1.541
64 1.000 1.907 1.562

@pratikchhapolika commented:

> 🖥 Benchmarking transformers w/ HF Trainer on A100 40GB
>
> We are going to use a special benchmarking tool that will do all the work for us. #14934
>
> [...]
>
> See also the same benchmarks for RTX-3090

Is all benchmarking done on A100 ("NVIDIA_TESLA_A100") single GPU? Can you also include CUDA memory required Vs Data points for training Vs No. of GPU's.


stas00 commented Mar 23, 2022

> Is all benchmarking done on A100 ("NVIDIA_TESLA_A100") single GPU?

Yes.

> Can you also include CUDA memory required Vs Data points for training Vs No. of GPU's.

I don't understand your question.

@pratikchhapolika commented:

> Is all benchmarking done on A100 ("NVIDIA_TESLA_A100") single GPU?
>
> Yes.
>
> Can you also include CUDA memory required Vs Data points for training Vs No. of GPU's.
>
> I don't understand your question.

On how many data points and epochs is it benchmarked on with Single GPU?

@pratikchhapolika commented:

> Is all benchmarking done on A100 ("NVIDIA_TESLA_A100") single GPU?
>
> Yes.
>
> Can you also include CUDA memory required Vs Data points for training Vs No. of GPU's.
>
> I don't understand your question.
>
> On how many data points and epochs is it benchmarked on with Single GPU?

I get an error with 4 GPUs, 20 epochs, on an A100, with 700000 data points:

python -m torch.distributed.launch --nproc_per_node 4 train.py --gradient_accumulation_steps 8 --per_device_train_batch_size 8 --optim adamw_hf --tf32 --bf16

Traceback (most recent call last):
  File "train.py", line 160, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1398, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1994, in training_step
    self.scaler.scale(loss).backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 0; 39.59 GiB total capacity; 17.07 GiB already allocated; 2.75 GiB free; 21.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


stas00 commented Mar 23, 2022

> On how many data points and epochs is it benchmarked on with Single GPU?

It's defined by --max_train_samples

To your last OOM comment - please let's not derail this Benchmark Issue. If you want to discuss an unrelated question please open a new issue. Best to delete it from here and post in another Issue. Thank you.
