
Fix bf16 Support #4193

Merged
merged 1 commit into from May 25, 2022

Conversation

manbearian
Contributor

@manbearian manbearian commented May 10, 2022

Fix bf16 Support
Fix bfloat16 support in helper.py and numpy_helper.py.

Motivation and Context
Without native bfloat16 support in numpy, implementing bfloat16 support for ONNX is a bit complicated. Support was recently added by encoding bfloat16 as float16. This is not correct, since the encoding is lossy; it's also a confusing design, and it indeed broke some bespoke tools on my end that didn't expect bfloat16 to be encoded as float16 in the ONNX file.
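The lossiness described above is easy to demonstrate with a small sketch (illustrative only, not the actual onnx code): bfloat16 keeps float32's 8 exponent bits and truncates the mantissa to 7 bits, while float16 has only 5 exponent bits, so reinterpreting a bfloat16 bit pattern as a float16 value changes the number.

```python
import struct

def f32_to_bf16_bits(x):
    """Bfloat16 bit pattern of a float: the upper 16 bits of the
    float32 encoding (plain truncation, for illustration only)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

bits = f32_to_bf16_bits(3.0)  # 0x4040, the bfloat16 pattern for 3.0
# Read the same 16 bits back as an IEEE float16 ("<e" is half precision):
as_fp16 = struct.unpack("<e", struct.pack("<H", bits))[0]
print(hex(bits), as_fp16)  # 0x4040 2.125 -- not 3.0, so the formats are not interchangeable
```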

Open Questions

I've implemented some changes here that give rudimentary support (not completely broken), but I have two questions that I believe need to be answered before committing these changes.

  1. What is the proper binary encoding of bfloat16 in the ONNX protobuf format? (Is this documented? Should it be?)
  2. It appears that "raw" encoding and normal encoding for 16-bit values are of different sizes. Is this intended?

Changes Not Implemented

  1. Support for big-endian encodings is not implemented in my code.
  2. Support for "correct" truncation of NaN payloads is not implemented when converting from f32->bf16 in my code.

Addresses #4189

@manbearian manbearian requested a review from a team as a code owner May 10, 2022 16:20
@jcwchen jcwchen linked an issue May 10, 2022 that may be closed by this pull request
onnx/helper.py (outdated review thread, resolved)
@manbearian
Contributor Author

How can we move forward on this? I have some time this week to continue hacking on this if there are more changes folks would like to see.

@gramalingam
Contributor

It would be great if we could make this work for both big-endian and little-endian. Is that hard? Is there some reference for the transformation you are using? I am trying to understand why truncation of 32-bit to 16-bit is incorrect.

onnx/helper.py (outdated review thread, resolved)
@manbearian
Contributor Author

manbearian commented May 20, 2022

It would be great if we could make this work for both big-endian and little-endian. Is that hard? Is there some reference for the transformation you are using? I am trying to understand why truncation of 32-bit to 16-bit is incorrect.

Endianness makes my brain hurt, but I'll see if I can work through the pain and implement it. :) (I was kind of hoping someone would say it isn't needed. ;)

See my other reply for why I chose rounding over truncation when converting from fp32 to bf16, and for a pointer to the PyTorch bfloat16 code, which is a correct implementation that I drew inspiration from.
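For reference, a round-to-nearest-even conversion in the style of the PyTorch code mentioned above might look like the following. This is a hypothetical sketch, not the code in this PR; the NaN branch is what prevents a NaN payload from truncating to infinity.

```python
import struct

def float32_to_bfloat16_bits(x):
    """Convert a Python float to a bfloat16 bit pattern using
    round-to-nearest-even (illustrative helper, not onnx.helper code)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: plain truncation could drop every set mantissa bit and
        # yield infinity, so force a quiet-NaN bit into the result.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Round to nearest, ties to even: the bias carries into bit 16
    # exactly when the discarded low half should round up.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

print(hex(float32_to_bfloat16_bits(1.0)))          # 0x3f80
print(hex(float32_to_bfloat16_bits(1.005859375)))  # 0x3f81: rounds up; truncation would give 0x3f80
```

The last line shows the one-bit difference between rounding and truncation that the thread discusses.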

@manbearian manbearian force-pushed the dev/ianb/bf16 branch 2 times, most recently from 0ea5952 to eb78e05 Compare May 20, 2022 21:19
@manbearian
Contributor Author

NaN is now implemented correctly (my previous implementation could potentially convert NaN to infinity).
Big-endian is now 'supported'... at least I believe I have the right implementation, but I have no way to test it.
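For what it's worth, the usual approach is to fix the byte order at serialization time: ONNX defines raw_data as little-endian, so a big-endian host byte-swaps before writing. A minimal sketch, assuming the tensor is already an array of uint16 bfloat16 bit patterns (not the actual onnx code):

```python
import sys
import numpy as np

def bf16_bits_to_raw_data(bf16_bits):
    """Serialize uint16 bfloat16 bit patterns as little-endian bytes,
    regardless of host byte order (illustrative sketch)."""
    arr = np.asarray(bf16_bits, dtype=np.uint16)
    if sys.byteorder == "big":
        arr = arr.byteswap()  # host is big-endian: swap to little-endian
    return arr.tobytes()

raw = bf16_bits_to_raw_data([0x3F80, 0x4040])  # 1.0 and 3.0 in bfloat16
print(raw)  # b'\x80?@@' on any host
```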

@manbearian
Contributor Author

@gramalingam and @souptc thoughts?

@souptc
Contributor

souptc commented May 23, 2022

I see. I think it is correct. I'm actually wondering how our CPU/GPU EPs handle it.

@manbearian

This comment was marked as resolved.

onnx/numpy_helper.py (outdated review thread, resolved)
onnx/numpy_helper.py (outdated review thread, resolved)
onnx/numpy_helper.py (outdated review thread, resolved)
onnx/test/helper_test.py (outdated review thread, resolved)
@lgtm-com

lgtm-com bot commented May 24, 2022

This pull request introduces 2 alerts when merging 9e0d884 into 28ed7f5 - view on LGTM.com

new alerts:

  • 1 for Use of the return value of a procedure
  • 1 for Unused local variable

@jcwchen jcwchen added this to the 1.12 milestone May 24, 2022
@gramalingam
Contributor

Nit: if we decide that rounding is the right behavior instead of truncating, it would be good to fix the "Cast" implementation in the test-case generator also.

@manbearian
Contributor Author

manbearian commented May 24, 2022

Nit: if we decide that rounding is the right behavior instead of truncating, it would be good to fix the "Cast" implementation in the test-case generator also.

@gramalingam
I'm looking at this code now and suspect I see how it can be changed, but I'm not sure how it gets used. Are there any pointers to how it works?

@manbearian
Contributor Author

I see. I think it is correct. I'm actually wondering how our CPU/GPU EPs handle it.

Do you have a pointer to where the code might be? I'm happy to take a look.

@jcwchen
Member

jcwchen commented May 24, 2022

Nit: if we decide that rounding is the right behavior instead of truncating, it would be good to fix the "Cast" implementation in the test-case generator also.

@gramalingam I'm looking at this code now and suspect I see how it can be changed, but I'm not sure how it gets used. Are there any pointers to how it works?

https://github.com/onnx/onnx/blob/main/onnx/backend/test/case/node/cast.py This file is used to create the node test models (for instance, https://github.com/onnx/onnx/tree/main/onnx/backend/test/data/node/test_cast_FLOAT_to_BFLOAT16). You can update cast.py with the correct behavior first and then use tools/update_doc.sh to regenerate the related node test models.

@jcwchen
Member

jcwchen commented May 24, 2022

I see. I think it is correct. I'm actually wondering how our CPU/GPU EPs handle it.

Do you have a pointer to where the code might be? I'm happy to take a look.

Probably here. However, different EPs seem to have different behaviors. We are trying to reach out to the code owner to understand the inconsistency and which behavior is correct.

@manbearian manbearian requested a review from a team as a code owner May 24, 2022 23:36
@manbearian
Contributor Author

@jcwchen and @gramalingam I updated cast.py/castlike.py to use rounding for f32->bf16. Let me know if this is what you were looking for.

I was expecting to see some tests fail, but I believe the tests aren't checking for exact values, so the single-bit difference isn't detectable.

@gramalingam
Contributor

I suspect that the check-ins of test data for ops other than Cast/CastLike are spurious; I don't think they should change?

@jcwchen
Member

jcwchen commented May 25, 2022

Thank you @manbearian for the quick update! As @gramalingam mentioned, please remove the other irrelevant operators' updates (like updated output.pb files caused by different numpy.random behaviors on different machines). tools/update_doc.sh updates every node's test data, and we should only take the Cast/CastLike-related tests.

I was expecting to see some tests fail, but I believe the tests aren't checking for exact values, so the single-bit difference isn't detectable.

You are right -- these models/input.pb/output.pb files exactly follow what onnx/backend/test/case/node/[operator_name].py defines. The CI only checks whether the uploaded model can be reproduced in the CI environment.

Signed-off-by: Ian Bearman <ianb@microsoft.com>
@manbearian manbearian force-pushed the dev/ianb/bf16 branch 2 times, most recently from eac0d62 to ee27388 Compare May 25, 2022 16:42
@manbearian
Contributor Author

After talking with @gramalingam offline, we decided to drop the Cast/CastLike changes for now. I have them stored in a branch if we want to bring them back. There was also a suggestion that the f32->bf16 helper support both rounding and truncation modes. I can add support for that in a future PR if there is interest (please open an issue and assign it to me).

@gramalingam
Contributor

Thanks Ian! The rationale for our decision was that the question of what Cast should do (rounding or truncation) is a separate one -- it involves a tradeoff between efficiency and precision -- and will require more time to reach consensus. The helper function make_tensor is not as critical (users can round or truncate as they wish), but it is important to fix the error in make_tensor's current handling for this release.

@gramalingam gramalingam merged commit fdd8902 into onnx:main May 25, 2022
broune pushed a commit to broune/onnx that referenced this pull request May 6, 2023
Signed-off-by: Ian Bearman <ianb@microsoft.com>
Successfully merging this pull request may close these issues.

Encoding BFLOAT16 Constant to ONNX Fails
4 participants