Clip floating point constants to bf16 range to avoid inf conversion #20605
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for opening a clean PR. I still have the same comment :-)
Also make sure you run make style on your branch to pass the quality tests on your PR.
src/transformers/modeling_utils.py (Outdated)
if os.environ.get("XLA_USE_BF16") == '1': | ||
return torch.bfloat16 | ||
if os.environ.get("XLA_DOWNCAST_BF16") == '1': |
As said before, we have a constant ENV_VARS_TRUE_VALUES in utils that you should reuse here, to catch any spelling the user might use when setting this environment variable. You should then test os.environ.get(xxx, "0").upper() in ENV_VARS_TRUE_VALUES.
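For illustration, a minimal sketch of the suggested pattern (the wrapper function _xla_bf16_enabled is hypothetical; ENV_VARS_TRUE_VALUES is assumed to be the truthy-string set, e.g. {"1", "ON", "YES", "TRUE"}, exported from transformers.utils):

```python
import os

from transformers.utils import ENV_VARS_TRUE_VALUES  # assumed: {"1", "ON", "YES", "TRUE"}


def _xla_bf16_enabled() -> bool:
    """Hypothetical helper: accept any truthy spelling ("1", "true", "YES", ...) of the flags."""
    return (
        os.environ.get("XLA_USE_BF16", "0").upper() in ENV_VARS_TRUE_VALUES
        or os.environ.get("XLA_DOWNCAST_BF16", "0").upper() in ENV_VARS_TRUE_VALUES
    )
```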
Thanks for iterating!
When running the HuggingFace BERT (any size) fine-tuning tutorial with transformers version >= 4.21.0 and XLA_USE_BF16=1 or XLA_DOWNCAST_BF16=1, I see NaNs in the loss after the first step.
What does this PR do?
This PR addresses an issue where the model code passes a constant that is out of bfloat16 range when XLA_USE_BF16=1 or XLA_DOWNCAST_BF16=1 is set, so the downcast converts it to -inf.
The NaNs likely come from the transformers library change #17306, which replaced many lines that used to use -float("inf") (or other large negative constants) with torch.finfo(dtype).min. For torch.float32 the minimum is -3.4028234663852886e+38, which is more negative than the bfloat16 minimum of -3.3895313892515355e+38. So the problem is that torch.finfo(torch.float32).min = -3.4028234663852886e+38 gets converted to -inf under the bf16 downcast. When the original encoder_extended_attention_mask is 1, encoder_extended_attention_mask then becomes (1.0 - 1.0) * -inf, which is NaN (via the IEEE rule 0.0 * inf = NaN).
This PR ensures the constant used is torch.finfo(torch.bfloat16).min = -3.3895313892515355e+38 rather than -inf, so the results no longer contain NaNs.
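As a self-contained illustration of the failure mode described above (not the PR's code), the float32 minimum overflows to -inf when cast to bfloat16, and the additive-mask arithmetic then yields NaN:

```python
import torch

fp32_min = torch.tensor(torch.finfo(torch.float32).min)  # -3.4028234663852886e+38
print(torch.finfo(torch.bfloat16).min)                    # -3.3895313892515355e+38

# float32's min is more negative than bfloat16 can represent, so the cast overflows.
print(fp32_min.to(torch.bfloat16))                        # tensor(-inf, dtype=torch.bfloat16)

# With an attention-mask value of 1.0, the additive mask becomes (1.0 - 1.0) * -inf = NaN.
mask = torch.tensor(1.0, dtype=torch.bfloat16)
print((1.0 - mask) * fp32_min.to(torch.bfloat16))         # tensor(nan, dtype=torch.bfloat16)
```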
The following lines check for the XLA_USE_BF16 or XLA_DOWNCAST_BF16 environment variables and set the dtype accordingly:
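A sketch of the kind of check involved, based on the diff snippet quoted in the review above (the helper name and surrounding structure are assumptions, not the PR's exact code):

```python
import os

import torch


def xla_effective_dtype(dtype: torch.dtype) -> torch.dtype:
    """Illustrative helper: the dtype XLA will actually compute in for a given tensor dtype."""
    if os.environ.get("XLA_USE_BF16") == "1" and dtype == torch.float32:
        # XLA_USE_BF16 casts all float32 tensors to bfloat16.
        return torch.bfloat16
    if os.environ.get("XLA_DOWNCAST_BF16") == "1" and dtype == torch.float32:
        # XLA_DOWNCAST_BF16 downcasts float32 -> bfloat16 (and float64 -> float32).
        return torch.bfloat16
    return dtype


# The additive attention-mask constant is then still finite after XLA's bf16 downcast.
mask_value = torch.finfo(xla_effective_dtype(torch.float32)).min  # -3.3895e+38 under bf16, not -inf
```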
Referencing related issues: aws-neuron/aws-neuron-sdk#593 and pytorch/xla#4152
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sgugger