bf16 is more unstable than fp16, when looking at the difference of generation logprobs and forward logprobs #31267

Open
vwxyzjn opened this issue Jun 5, 2024 · 3 comments


@vwxyzjn
Contributor

vwxyzjn commented Jun 5, 2024

System Info

  • transformers version: 4.40.1
  • Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
  • Python version: 3.12.3
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Download https://gist.github.com/vwxyzjn/a9c952e0d2baf603b394f20145532d5e
  2. run python bf16_logprobs.py --fp32
  3. run python bf16_logprobs.py --fp16
  4. run python bf16_logprobs.py --bf16
(miniforge3-22.11.1-4/envs/trl) ➜  trl git:(dataset-processor) ✗ python bf16_stuff.py --fp32
Namespace(bf16=False, fp16=False, fp32=True, seed=1)
generation_logprob[:,:5]=tensor([[-0.1527, -0.2258, -3.5535, -3.4805, -0.0519],
        [-0.1527, -0.2258, -5.2855, -0.1686, -8.4760]], device='cuda:0')
forward_logprob[:,:5]=tensor([[-0.1527, -0.2258, -3.5535, -3.4805, -0.0519],
        [-0.1527, -0.2258, -5.2855, -0.1686, -8.4760]], device='cuda:0',
       grad_fn=<SliceBackward0>)
ratio=tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0001, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0001, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000]], device='cuda:0',
       grad_fn=<ExpBackward0>)
ratio.mean()=1.0000026226043701
ratio.std()=1.7616826880839653e-05
ratio.max()=1.0000758171081543
ratio.min()=0.999953031539917
(miniforge3-22.11.1-4/envs/trl) ➜  trl git:(dataset-processor) ✗ python bf16_stuff.py --fp16
Namespace(bf16=False, fp16=True, fp32=False, seed=1)
generation_logprob[:,:5]=tensor([[-0.1486, -0.2212, -3.5586, -3.4688, -0.0520],
        [-0.1486, -0.2212, -5.2852, -0.1681, -8.4844]], device='cuda:0',
       dtype=torch.float16)
forward_logprob[:,:5]=tensor([[-0.1523, -0.2211, -3.5547, -3.4688, -0.0519],
        [-0.1523, -0.2211, -5.2812, -0.1667, -8.4766]], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward0>)
ratio=tensor([[1.0039, 1.0000, 0.9961, 1.0000, 1.0000, 1.0059, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         0.9980, 1.0000, 1.0000, 0.9961, 1.0049, 1.0000, 0.9961, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 0.9995, 0.9990, 0.9990, 1.0000, 0.9883,
         0.9922, 0.9922, 1.0000, 0.9883, 0.9941],
        [1.0039, 1.0000, 0.9961, 0.9985, 0.9922, 1.0117, 1.0088, 1.0000, 1.0000,
         1.0059, 1.0000, 1.0000, 1.0039, 1.0000, 0.9980, 1.0000, 0.9980, 1.0000,
         1.0088, 1.0020, 0.9922, 1.0000, 0.9980, 1.0000, 1.0000, 0.9961, 1.0000,
         0.9976, 1.0078, 1.0000, 0.9990, 1.0000, 1.0000, 1.0010, 1.0098, 1.0000,
         0.9976, 0.9941, 0.9893, 1.0000, 0.9990, 1.0137, 1.0000, 0.9990, 0.9985,
         1.0000, 1.0000, 1.0127, 0.9995, 0.9980]], device='cuda:0',
       dtype=torch.float16, grad_fn=<ExpBackward0>)
ratio.mean()=1.0
ratio.std()=0.00421142578125
ratio.max()=1.013671875
ratio.min()=0.98828125
(miniforge3-22.11.1-4/envs/trl) ➜  trl git:(dataset-processor) ✗ python bf16_stuff.py --bf16
Namespace(bf16=True, fp16=False, fp32=False, seed=1)
generation_logprob[:,:5]=tensor([[-0.1426, -0.1904, -3.5938, -3.4688, -0.0618],
        [-0.1426, -0.1904, -5.2812, -0.1641, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16)
forward_logprob[:,:5]=tensor([[-0.1270, -0.3457, -3.5938, -3.4688, -0.0547],
        [-0.1270, -0.3457, -5.2812, -0.1553, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<SliceBackward0>)
ratio=tensor([[0.9844, 1.1719, 1.0000, 1.0000, 0.9922, 1.0000, 0.9531, 1.0938, 1.0391,
         1.0000, 1.0078, 1.1328, 1.0859, 1.0000, 0.7109, 1.0000, 1.0000, 1.0312,
         1.2422, 0.9922, 1.0312, 0.9180, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 0.9961, 1.1328, 1.0156, 1.1172, 1.0625, 1.0000, 1.0000, 1.0000,
         1.0312, 1.0000, 0.9961, 1.0000, 1.0781],
        [0.9844, 1.1719, 1.0000, 0.9922, 1.0000, 1.0312, 0.9883, 0.9414, 1.0859,
         0.9844, 0.9922, 1.0000, 0.9102, 1.0078, 1.0000, 0.9844, 1.1016, 1.0000,
         1.0000, 1.0000, 1.0000, 0.9688, 0.9531, 1.0078, 1.0000, 1.0000, 0.9844,
         1.0156, 0.9766, 0.9844, 0.9961, 1.0000, 0.9961, 1.0000, 1.0000, 1.0000,
         1.0078, 1.0000, 1.0000, 0.9531, 1.0000, 1.0000, 1.0078, 1.0000, 1.0000,
         0.9688, 1.0078, 1.0000, 1.0000, 1.0000]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<ExpBackward0>)
ratio.mean()=1.0078125
ratio.std()=0.05712890625
ratio.max()=1.2421875
ratio.min()=0.7109375
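
For reference, here is a minimal sketch of the comparison the script performs (a reconstruction from the printed names above, not the gist's exact code; the model name is a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1b"  # placeholder; any causal LM should show the effect
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer(["The quick brown fox"], return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=50, do_sample=True,
                     return_dict_in_generate=True, output_scores=True)

# Logprobs the sampling loop assigned to each generated token.
scores = torch.stack(out.scores, dim=1)                 # (batch, new_tokens, vocab)
new_tokens = out.sequences[:, inputs.input_ids.shape[1]:]
generation_logprob = torch.gather(
    scores.log_softmax(-1), 2, new_tokens.unsqueeze(-1)).squeeze(-1)

# Logprobs for the same tokens from one forward pass over the full sequence.
logits = model(out.sequences).logits[:, inputs.input_ids.shape[1] - 1 : -1]
forward_logprob = torch.gather(
    logits.log_softmax(-1), 2, new_tokens.unsqueeze(-1)).squeeze(-1)

# Both passes score the same tokens with the same weights, so in exact
# arithmetic the ratio would be exactly 1 everywhere.
ratio = (generation_logprob - forward_logprob).exp()
print(f"{ratio.mean()=} {ratio.std()=} {ratio.max()=} {ratio.min()=}")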

Expected behavior

Basically, since both passes score the same tokens with the same weights, the ratios exp(generation_logprob - forward_logprob) should all be 1.

@vasqu
Contributor

vasqu commented Jun 5, 2024

I also encounter more instability with bf16 in RoPE-related models like Pythia (i.e. the GPT-NeoX architecture). This occurred to me when working on SDPA support for that architecture in #31031.

My guess is that it is due to RoPE, which forces an upcast to fp32. Further down the line the result is usually downcast again, which is more severe for bf16 (as it has less precision) and leads to this phenomenon. For example, the flash attention 2 variant undoes this upcast by downcasting:

# GPT-neo-X casts query and key in fp32 to apply rotary embedding in full precision
target_dtype = value.dtype
if query.dtype != target_dtype:
    query = query.to(target_dtype)
if key.dtype != target_dtype:
    key = key.to(target_dtype)
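
A toy illustration of why that fp32-to-half round trip hurts bf16 more than fp16 (just the dtype arithmetic, not the model code):

import torch

x = torch.randn(10_000)

# Round-trip through each half-precision format, as happens when RoPE
# upcasts to fp32 and the result is cast back down afterwards.
err_fp16 = (x - x.half().float()).abs().max()
err_bf16 = (x - x.bfloat16().float()).abs().max()

# bf16 keeps 7 mantissa bits vs. fp16's 10, so its rounding error on
# values of this magnitude is roughly 8x larger.
print(f"{err_fp16=:.2e} {err_bf16=:.2e}")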

Edit: tl;dr: more precision is needed than bf16 can offer.

@ArthurZucker
Collaborator

Yep, I think you should check whether applying #29285 to the model you are using solves the issue? 🤗

@vwxyzjn
Contributor Author

vwxyzjn commented Jun 7, 2024

I tried applying the same fix from #29285 naively, but it doesn't seem to make a difference.

[screenshot omitted: ratio statistics essentially unchanged after the fix]

Did some further digging; the forward-pass logprobs seem to be length-related. For example, if I run the model through the first 20 generated tokens, I get

(.venv) ➜  transformers git:(main) ✗ python -i bf16_stuff.py --bf16
Namespace(bf16=True, fp16=False, fp32=False, seed=1)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
generation_logprob[:,:5]=tensor([[-0.1426, -0.1904, -3.5938, -3.4688, -0.0618],
        [-0.1426, -0.1904, -5.2812, -0.1641, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16)
forward_logprob[:,:5]=tensor([[-0.1426, -0.1904, -3.5781, -3.4062, -0.0618],
        [-0.1426, -0.1904, -5.2812, -0.1670, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<SliceBackward0>)
ratio=tensor([[1.0000, 1.0000, 0.9844, 0.9375, 1.0000, 0.8945, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0859, 1.0000, 1.0000, 1.0000, 1.0000, 1.0312,
         1.2266],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9961, 1.0859,
         1.0625, 0.9922, 1.0000, 0.9688, 1.0000, 1.0000, 1.0000, 0.9844, 1.0000,
         1.0000]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<ExpBackward0>)
ratio.mean()=1.0078125
ratio.std()=0.048095703125
ratio.max()=1.2265625
ratio.min()=0.89453125

If I do the forward pass on the first 40 generated tokens, I get

generation_logprob[:,:5]=tensor([[-0.1426, -0.1904, -3.5938, -3.4688, -0.0618],
        [-0.1426, -0.1904, -5.2812, -0.1641, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16)
forward_logprob[:,:5]=tensor([[-0.1270, -0.3164, -3.5938, -3.3906, -0.0547],
        [-0.1270, -0.3164, -5.1562, -0.1387, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<SliceBackward0>)
ratio=tensor([[0.9844, 1.1328, 1.0000, 0.9258, 0.9922, 0.9688, 0.9023, 1.0000, 1.0000,
         1.0000, 0.9336, 1.0000, 0.8477, 1.0000, 0.8047, 1.0000, 1.0000, 0.9961,
         1.2266, 1.0078, 0.9688, 0.9258, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0234, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9961,
         1.0000, 1.0000, 1.0000],
        [0.9844, 1.1328, 0.8828, 0.9766, 1.0000, 1.0000, 0.9727, 0.9375, 0.9922,
         1.0000, 1.0000, 1.0000, 0.8828, 1.0078, 1.0000, 0.8672, 1.0156, 1.0000,
         1.0000, 1.0000, 1.0156, 1.0000, 0.8164, 1.0156, 1.0078, 1.0000, 0.9844,
         1.0234, 1.0000, 0.9688, 0.9961, 1.0000, 0.9961, 1.0234, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<ExpBackward0>)
ratio.mean()=0.98828125
ratio.std()=0.057373046875
ratio.max()=1.2265625
ratio.min()=0.8046875

Notice how the logprob changed from -0.1426 to -0.1270.
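
A direct way to probe that length dependence (a sketch, reusing the hypothetical names from the reconstruction in the issue body): score the very first generated token while only growing the suffix after it. With causal masking its logprob cannot legitimately change, but under bf16 the kernels' tiling and reduction order change with sequence length, so the value drifts anyway.

prompt_len = inputs.input_ids.shape[1]
first_token = out.sequences[:, prompt_len]
for n in (20, 40, 50):
    prefix = out.sequences[:, : prompt_len + n]
    # Logits at position prompt_len - 1 predict the first generated token;
    # everything after it is masked out, so this value "should" not move.
    logits = model(prefix).logits[:, prompt_len - 1]
    lp = logits.log_softmax(-1).gather(1, first_token.unsqueeze(-1)).squeeze(-1)
    print(n, lp)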
