Fix bf16 Support #4193
Conversation
How can we move forward on this? I have some time this week to continue hacking on this if there are folks who would like to see more.
It would be great if we could make this work for both big-endian and little-endian. Is that hard? Is there some reference for the transformation you are using? I am trying to understand why truncation of 32-bit to 16-bit is incorrect.
Endianness makes my brain hurt, but I'll see if I can work through the pain and implement it. :) (I was kind of hoping someone would say it isn't needed. ;)) See my other reply for why I chose rounding over truncation when converting from fp32 to bf16, and for a pointer to the PyTorch bfloat16 code, which is a correct implementation that I drew inspiration from.
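For reference, a round-to-nearest-even fp32 to bf16 conversion can operate purely on integer bit patterns, which sidesteps the endianness question entirely. The following is a sketch in the spirit of the PyTorch approach, not the actual code in this PR (the helper name is mine):

```python
import numpy as np

def f32_to_bf16(arr):
    """Round-to-nearest-even float32 -> bfloat16 bit patterns (uint16).

    Works on integer values rather than byte buffers, so it behaves the
    same on big- and little-endian hosts. NaN inputs are forced to a
    canonical quiet NaN, since rounding a NaN payload could otherwise
    carry into the exponent and produce infinity.
    """
    a = np.asarray(arr, dtype=np.float32)
    bits = a.view(np.uint32)
    # round-half-to-even: add 0x7FFF plus the lowest bit that is kept
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) >> 16
    # map NaN to a canonical quiet NaN, preserving the sign bit
    nan_pattern = ((bits >> 16) & 0x8000) | 0x7FC0
    out = np.where(np.isnan(a), nan_pattern, rounded)
    return out.astype(np.uint16)
```

For example, 1.0 (0x3F800000) keeps its top half 0x3F80 unchanged, while a value whose discarded bits are above the halfway point rounds up to the next representable bf16 value.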
Force-pushed from 0ea5952 to eb78e05
NaN is now implemented correctly (my previous implementation could potentially convert NaN to infinity).
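To illustrate that failure mode: with plain rounding and no NaN check, a float32 NaN whose mantissa payload lives entirely in the low 16 bits rounds to the infinity bit pattern. A minimal numpy demonstration (my own construction, not the code in this PR):

```python
import numpy as np

# A float32 NaN whose mantissa bits are all in the low 16 bits
bits = np.array([0x7F800001], dtype=np.uint32)
assert np.isnan(bits.view(np.float32)[0])

# Naive round-to-nearest-even on the bit pattern, with no NaN check:
# 0x7F800001 + 0x7FFF carries into bit 15, clearing the mantissa.
naive = (bits + 0x7FFF + ((bits >> 16) & 1)) >> 16

# 0x7F80 is the bfloat16 bit pattern for +infinity, not NaN
assert naive[0] == 0x7F80
```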
@gramalingam and @souptc thoughts?
I see. I think it is correct. I'm actually wondering how our CPU/GPU EPs handle it.
This pull request introduces 2 alerts when merging 9e0d884 into 28ed7f5 - view on LGTM.com.
Nit: if we decide that rounding is the right behavior instead of truncation, it would be good to fix the "Cast" implementation in the test-case generator as well.
@gramalingam do you have a pointer to where the code might be? I'm happy to take a look.
https://github.com/onnx/onnx/blob/main/onnx/backend/test/case/node/cast.py This file is used to create the node test models (for instance, https://github.com/onnx/onnx/tree/main/onnx/backend/test/data/node/test_cast_FLOAT_to_BFLOAT16). You can update cast.py with the correct behavior first and then use tools/update_doc.sh to regenerate the related node test models.
Probably here. However, different EPs seem to have different behaviors. We are trying to reach out to the code owner to understand the inconsistency and which behavior is correct.
@jwchen and @gramalingam I updated cast.py/castlike.py to use rounding for f32->bf16. Let me know if this is what you were looking for. I was expecting to see some tests fail, but I believe the tests aren't checking for exact values, so the single-bit difference isn't detectable.
I suspect that the checked-in test data for ops other than cast/cast-like is spurious; I don't think it should change?
Thank you @manbearian for the quick update! As @gramalingam mentioned, please remove the other, irrelevant operators' updates (like output.pb changes caused by numpy.random behaving differently on different machines). tools/update_doc.sh updates every node's test data, and we should only take the Cast/CastLike related tests.
You are right -- these models/input.pb/output.pb exactly follow what onnx/backend/test/case/node/[operator_name].py defines. The CI only checks whether the uploaded model can be reproduced in the CI environment.
Signed-off-by: Ian Bearman <ianb@microsoft.com>
Force-pushed from eac0d62 to ee27388
After talking with @gramalingam offline, we decided to drop the CAST/CASTLIKE changes for now. I have them stored in a branch if we want to bring them back. There was also a suggestion that the f32->bf16 helper support both rounding and truncation modes. I can add support for that in a future PR if there is interest (please open an issue and assign it to me).
Thanks Ian! The rationale for our decision was that the question of what Cast should do (rounding or truncation) is a separate one (it involves a tradeoff between efficiency and precision) and will require more time to reach consensus. However, the helper function make_tensor is not as critical (users can round or truncate as they wish), but it is important to fix the error in make_tensor's current handling for this release.
Fix bf16 Support
Fix bfloat16 support in helper and numpy_helper.py.
Motivation and Context
Without native bfloat16 support in numpy, implementing bfloat16 support for ONNX is a bit complicated. Recently, support was added by encoding bfloat16 as float16. This is not correct, as it is a lossy encoding; it's also a confusing design, and it indeed broke some bespoke tools on my end that didn't expect bfloat16 to be encoded as float16 in the ONNX file.
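For context on why a lossless encoding is straightforward: bfloat16 is exactly the top 16 bits of float32, so widening stored bf16 bit patterns back to float32 is just a shift, with no precision loss in that direction. A minimal numpy sketch (the helper name is mine, not part of the ONNX API):

```python
import numpy as np

def bf16_to_f32(bits):
    """Widen bfloat16 bit patterns (uint16) to float32, exactly.

    bf16 is the high half of f32, so shifting the pattern into the top
    16 bits recovers the value with no rounding. Endianness-independent
    because it works on integer values, not byte buffers.
    """
    b = np.asarray(bits, dtype=np.uint16).astype(np.uint32) << 16
    return b.view(np.float32)
```

For example, the bf16 pattern 0x3F80 widens to float32 1.0, and 0xC000 widens to -2.0.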
Open Questions
I've implemented some changes here that give rudimentary support (not completely broken), but I have two questions that I believe need to be answered before committing these changes.
Not Implemented Changes
Addresses #4189