bf16 is more unstable than fp16, when looking at the difference of generation logprobs and forward logprobs #31267

Open
vwxyzjn opened this issue Jun 5, 2024 · 3 comments


@vwxyzjn
Contributor

vwxyzjn commented Jun 5, 2024

System Info

  • transformers version: 4.40.1
  • Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
  • Python version: 3.12.3
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Download https://gist.github.com/vwxyzjn/a9c952e0d2baf603b394f20145532d5e
  2. run python bf16_logprobs.py --fp32
  3. run python bf16_logprobs.py --fp16
  4. run python bf16_logprobs.py --bf16
(miniforge3-22.11.1-4/envs/trl) ➜  trl git:(dataset-processor) ✗ python bf16_stuff.py --fp32
Namespace(bf16=False, fp16=False, fp32=True, seed=1)
generation_logprob[:,:5]=tensor([[-0.1527, -0.2258, -3.5535, -3.4805, -0.0519],
        [-0.1527, -0.2258, -5.2855, -0.1686, -8.4760]], device='cuda:0')
forward_logprob[:,:5]=tensor([[-0.1527, -0.2258, -3.5535, -3.4805, -0.0519],
        [-0.1527, -0.2258, -5.2855, -0.1686, -8.4760]], device='cuda:0',
       grad_fn=<SliceBackward0>)
ratio=tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0001, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0001, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000]], device='cuda:0',
       grad_fn=<ExpBackward0>)
ratio.mean()=1.0000026226043701
ratio.std()=1.7616826880839653e-05
ratio.max()=1.0000758171081543
ratio.min()=0.999953031539917
(miniforge3-22.11.1-4/envs/trl) ➜  trl git:(dataset-processor) ✗ python bf16_stuff.py --fp16
Namespace(bf16=False, fp16=True, fp32=False, seed=1)
generation_logprob[:,:5]=tensor([[-0.1486, -0.2212, -3.5586, -3.4688, -0.0520],
        [-0.1486, -0.2212, -5.2852, -0.1681, -8.4844]], device='cuda:0',
       dtype=torch.float16)
forward_logprob[:,:5]=tensor([[-0.1523, -0.2211, -3.5547, -3.4688, -0.0519],
        [-0.1523, -0.2211, -5.2812, -0.1667, -8.4766]], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward0>)
ratio=tensor([[1.0039, 1.0000, 0.9961, 1.0000, 1.0000, 1.0059, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         0.9980, 1.0000, 1.0000, 0.9961, 1.0049, 1.0000, 0.9961, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 0.9995, 0.9990, 0.9990, 1.0000, 0.9883,
         0.9922, 0.9922, 1.0000, 0.9883, 0.9941],
        [1.0039, 1.0000, 0.9961, 0.9985, 0.9922, 1.0117, 1.0088, 1.0000, 1.0000,
         1.0059, 1.0000, 1.0000, 1.0039, 1.0000, 0.9980, 1.0000, 0.9980, 1.0000,
         1.0088, 1.0020, 0.9922, 1.0000, 0.9980, 1.0000, 1.0000, 0.9961, 1.0000,
         0.9976, 1.0078, 1.0000, 0.9990, 1.0000, 1.0000, 1.0010, 1.0098, 1.0000,
         0.9976, 0.9941, 0.9893, 1.0000, 0.9990, 1.0137, 1.0000, 0.9990, 0.9985,
         1.0000, 1.0000, 1.0127, 0.9995, 0.9980]], device='cuda:0',
       dtype=torch.float16, grad_fn=<ExpBackward0>)
ratio.mean()=1.0
ratio.std()=0.00421142578125
ratio.max()=1.013671875
ratio.min()=0.98828125
(miniforge3-22.11.1-4/envs/trl) ➜  trl git:(dataset-processor) ✗ python bf16_stuff.py --bf16
Namespace(bf16=True, fp16=False, fp32=False, seed=1)
generation_logprob[:,:5]=tensor([[-0.1426, -0.1904, -3.5938, -3.4688, -0.0618],
        [-0.1426, -0.1904, -5.2812, -0.1641, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16)
forward_logprob[:,:5]=tensor([[-0.1270, -0.3457, -3.5938, -3.4688, -0.0547],
        [-0.1270, -0.3457, -5.2812, -0.1553, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<SliceBackward0>)
ratio=tensor([[0.9844, 1.1719, 1.0000, 1.0000, 0.9922, 1.0000, 0.9531, 1.0938, 1.0391,
         1.0000, 1.0078, 1.1328, 1.0859, 1.0000, 0.7109, 1.0000, 1.0000, 1.0312,
         1.2422, 0.9922, 1.0312, 0.9180, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 0.9961, 1.1328, 1.0156, 1.1172, 1.0625, 1.0000, 1.0000, 1.0000,
         1.0312, 1.0000, 0.9961, 1.0000, 1.0781],
        [0.9844, 1.1719, 1.0000, 0.9922, 1.0000, 1.0312, 0.9883, 0.9414, 1.0859,
         0.9844, 0.9922, 1.0000, 0.9102, 1.0078, 1.0000, 0.9844, 1.1016, 1.0000,
         1.0000, 1.0000, 1.0000, 0.9688, 0.9531, 1.0078, 1.0000, 1.0000, 0.9844,
         1.0156, 0.9766, 0.9844, 0.9961, 1.0000, 0.9961, 1.0000, 1.0000, 1.0000,
         1.0078, 1.0000, 1.0000, 0.9531, 1.0000, 1.0000, 1.0078, 1.0000, 1.0000,
         0.9688, 1.0078, 1.0000, 1.0000, 1.0000]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<ExpBackward0>)
ratio.mean()=1.0078125
ratio.std()=0.05712890625
ratio.max()=1.2421875
ratio.min()=0.7109375
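
For reference, here is a minimal sketch of the comparison the script performs (a reconstruction from the printed names above, not the gist's exact code; the model name is a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1b"  # placeholder; any causal LM should show the effect
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer(["The quick brown fox"], return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=50, do_sample=True,
                     return_dict_in_generate=True, output_scores=True)

# Logprobs the sampling loop assigned to each generated token.
scores = torch.stack(out.scores, dim=1)                 # (batch, new_tokens, vocab)
new_tokens = out.sequences[:, inputs.input_ids.shape[1]:]
generation_logprob = torch.gather(
    scores.log_softmax(-1), 2, new_tokens.unsqueeze(-1)).squeeze(-1)

# Logprobs for the same tokens from one forward pass over the full sequence.
logits = model(out.sequences).logits[:, inputs.input_ids.shape[1] - 1 : -1]
forward_logprob = torch.gather(
    logits.log_softmax(-1), 2, new_tokens.unsqueeze(-1)).squeeze(-1)

# Both passes score the same tokens with the same weights, so in exact
# arithmetic the ratio would be exactly 1 everywhere.
ratio = (generation_logprob - forward_logprob).exp()
print(f"{ratio.mean()=} {ratio.std()=} {ratio.max()=} {ratio.min()=}")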

Expected behavior

Basically, since both passes score the same tokens with the same weights, the ratios exp(generation_logprob - forward_logprob) should all be 1.

@vasqu
Contributor

vasqu commented Jun 5, 2024

I also encounter more instability with bf16 in RoPE-related models like Pythia (i.e. the GPT-NeoX architecture). This occurred to me when working on SDPA support for that architecture in #31031.

My guess is that it is due to RoPE, which forces an upcast to fp32. Further down the line the result is usually downcast again, which is more severe for bf16 (as it has less precision) and leads to this phenomenon. For example, the flash attention 2 variant undoes this upcast by downcasting:

# GPT-neo-X casts query and key in fp32 to apply rotary embedding in full precision
target_dtype = value.dtype
if query.dtype != target_dtype:
    query = query.to(target_dtype)
if key.dtype != target_dtype:
    key = key.to(target_dtype)
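
A toy illustration of why that fp32-to-half round trip hurts bf16 more than fp16 (just the dtype arithmetic, not the model code):

import torch

x = torch.randn(10_000)

# Round-trip through each half-precision format, as happens when RoPE
# upcasts to fp32 and the result is cast back down afterwards.
err_fp16 = (x - x.half().float()).abs().max()
err_bf16 = (x - x.bfloat16().float()).abs().max()

# bf16 keeps 7 mantissa bits vs. fp16's 10, so its rounding error on
# values of this magnitude is roughly 8x larger.
print(f"{err_fp16=:.2e} {err_bf16=:.2e}")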

Edit: tl;dr: more precision is needed than bf16 can offer.

@ArthurZucker
Collaborator

Yep, I think you should check whether applying #29285 to the model you are using solves the issue? 🤗

@vwxyzjn
Contributor Author

vwxyzjn commented Jun 7, 2024

I tried applying the same fix from #29285 naively, but it doesn't seem to make a difference.

[screenshot omitted: ratio statistics essentially unchanged after the fix]

Did some further digging; the forward-pass logprobs seem to be length-related. For example, if I run the model through the first 20 generated tokens, I get

(.venv) ➜  transformers git:(main) ✗ python -i bf16_stuff.py --bf16
Namespace(bf16=True, fp16=False, fp32=False, seed=1)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
generation_logprob[:,:5]=tensor([[-0.1426, -0.1904, -3.5938, -3.4688, -0.0618],
        [-0.1426, -0.1904, -5.2812, -0.1641, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16)
forward_logprob[:,:5]=tensor([[-0.1426, -0.1904, -3.5781, -3.4062, -0.0618],
        [-0.1426, -0.1904, -5.2812, -0.1670, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<SliceBackward0>)
ratio=tensor([[1.0000, 1.0000, 0.9844, 0.9375, 1.0000, 0.8945, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0859, 1.0000, 1.0000, 1.0000, 1.0000, 1.0312,
         1.2266],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9961, 1.0859,
         1.0625, 0.9922, 1.0000, 0.9688, 1.0000, 1.0000, 1.0000, 0.9844, 1.0000,
         1.0000]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<ExpBackward0>)
ratio.mean()=1.0078125
ratio.std()=0.048095703125
ratio.max()=1.2265625
ratio.min()=0.89453125

If I do the forward pass on the first 40 generated tokens, I get

generation_logprob[:,:5]=tensor([[-0.1426, -0.1904, -3.5938, -3.4688, -0.0618],
        [-0.1426, -0.1904, -5.2812, -0.1641, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16)
forward_logprob[:,:5]=tensor([[-0.1270, -0.3164, -3.5938, -3.3906, -0.0547],
        [-0.1270, -0.3164, -5.1562, -0.1387, -8.5625]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<SliceBackward0>)
ratio=tensor([[0.9844, 1.1328, 1.0000, 0.9258, 0.9922, 0.9688, 0.9023, 1.0000, 1.0000,
         1.0000, 0.9336, 1.0000, 0.8477, 1.0000, 0.8047, 1.0000, 1.0000, 0.9961,
         1.2266, 1.0078, 0.9688, 0.9258, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0234, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 0.9961,
         1.0000, 1.0000, 1.0000],
        [0.9844, 1.1328, 0.8828, 0.9766, 1.0000, 1.0000, 0.9727, 0.9375, 0.9922,
         1.0000, 1.0000, 1.0000, 0.8828, 1.0078, 1.0000, 0.8672, 1.0156, 1.0000,
         1.0000, 1.0000, 1.0156, 1.0000, 0.8164, 1.0156, 1.0078, 1.0000, 0.9844,
         1.0234, 1.0000, 0.9688, 0.9961, 1.0000, 0.9961, 1.0234, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<ExpBackward0>)
ratio.mean()=0.98828125
ratio.std()=0.057373046875
ratio.max()=1.2265625
ratio.min()=0.8046875

Notice how the logprob changed from -0.1426 to -0.1270.
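
A direct way to probe that length dependence (a sketch, reusing the hypothetical names from the reconstruction in the issue body): score the very first generated token while only growing the suffix after it. With causal masking its logprob cannot legitimately change, but under bf16 the kernels' tiling and reduction order change with sequence length, so the value drifts anyway.

prompt_len = inputs.input_ids.shape[1]
first_token = out.sequences[:, prompt_len]
for n in (20, 40, 50):
    prefix = out.sequences[:, : prompt_len + n]
    # Logits at position prompt_len - 1 predict the first generated token;
    # everything after it is masked out, so this value "should" not move.
    logits = model(prefix).logits[:, prompt_len - 1]
    lp = logits.log_softmax(-1).gather(1, first_token.unsqueeze(-1)).squeeze(-1)
    print(n, lp)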
