
# Performance Comparison


## Short-sequence training

### NVIDIA A100 * 1

| Method      | Bits | TGS   | VRAM | Speed |
| ----------- | ---- | ----- | ---- | ----- |
| HF          | 16   | 2,392 | 18GB | 100%  |
| HF+FA2      | 16   | 2,954 | 17GB | 123%  |
| Unsloth+FA2 | 16   | 4,007 | 16GB | 168%  |
| HF          | 4    | 2,415 | 9GB  | 101%  |
| Unsloth+FA2 | 4    | 3,726 | 7GB  | 160%  |

### NVIDIA A100 * 2

| Method      | Bits | TGS   | VRAM | Speed |
| ----------- | ---- | ----- | ---- | ----- |
| HF          | 16   | 2,155 | 29GB | 100%  |
| HF+FA2      | 16   | 2,556 | 28GB | 119%  |
| Unsloth+FA2 | 16   | 3,400 | 27GB | 158%  |

- TGS: tokens per GPU per second
- Model: LLaMA2-7B
- Batch size: 4
- Gradient accumulation: 2
- LoRA rank: 8
- LoRA modules: all
- Max length: 1024
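
The Speed column is each method's TGS normalized to the HF 16-bit baseline on the same hardware. Below is a minimal sketch of that arithmetic in Python, using the 16-bit numbers from the A100 * 1 table; the variable names are illustrative, not part of any library:

```python
# Relative speed = a method's TGS divided by the HF 16-bit baseline TGS,
# both measured on the same hardware (A100 * 1 numbers from the table above).
baseline_tgs = 2392  # HF, 16-bit

for method, tgs in [("HF+FA2", 2954), ("Unsloth+FA2", 4007)]:
    print(f"{method}: {tgs / baseline_tgs:.0%}")
# HF+FA2: 123%
# Unsloth+FA2: 168%
```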

## Long-sequence training

VRAM usage and training throughput as the sequence length grows (columns are the maximum sequence length in tokens):

**VRAM**

| Method          | 1,024 | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 | 100,000 |
| --------------- | ----- | ----- | ----- | ----- | ------ | ------ | ------ | ------- |
| FlashAttention2 | 6GB   | 7GB   | 9GB   | 12GB  | 19GB   | 32GB   | OOM    | OOM     |
| Unsloth+FA2     | 5GB   | 6GB   | 7GB   | 8GB   | 10GB   | 16GB   | 25GB   | 37GB    |

**TGS**

| Method          | 1,024 | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 | 100,000 |
| --------------- | ----- | ----- | ----- | ----- | ------ | ------ | ------ | ------- |
| FlashAttention2 | 2,295 | 2,741 | 2,926 | 3,128 | 3,542  | 2,216  | OOM    | OOM     |
| Unsloth+FA2     | 2,556 | 3,178 | 3,413 | 3,632 | 4,050  | 2,456  | 1,820  | 1,202   |
| Improvement     | 111%  | 116%  | 117%  | 116%  | 114%   | 111%   | N/A    | N/A     |

- TGS: tokens per GPU per second
- GPU: NVIDIA A100 40GB * 1
- Model: LLaMA2-7B
- Batch size: 1
- Gradient accumulation: 4
- LoRA rank: 8
- LoRA modules: all
- Quantization bit: 4
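
The Improvement row divides the Unsloth+FA2 TGS by the FlashAttention2 TGS at each sequence length; no ratio exists where FlashAttention2 runs out of memory. A minimal sketch of that calculation in Python, with the TGS values copied from the table above and `None` standing in for OOM:

```python
seq_lens    = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 100000]
fa2_tgs     = [2295, 2741, 2926, 3128, 3542, 2216, None, None]  # None = OOM
unsloth_tgs = [2556, 3178, 3413, 3632, 4050, 2456, 1820, 1202]

for n, fa2, uns in zip(seq_lens, fa2_tgs, unsloth_tgs):
    if fa2 is None:
        print(f"{n:>7}: N/A (FlashAttention2 is OOM)")
    else:
        print(f"{n:>7}: {uns / fa2:.0%}")
# Prints 111%, 116%, 117%, 116%, 114%, 111%, then N/A for 65,536 and 100,000.
```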