fixing default communication_data_type for bfloat16_enabled and docs #3370

Merged (3 commits) on Apr 25, 2023

Conversation

@clumsy (Contributor) commented Apr 24, 2023

fixes #2071

#2145 fixed the case where communication_data_type is set explicitly; unfortunately, the default value when bfloat16_enabled=True is still not bfp16.
Adding unit tests for the default communication_data_type.
Correcting the docs that described communication_data_type as being of type boolean.
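
For context, here is a minimal sketch of the intended defaulting rule (illustrative only, not the actual DeepSpeed source; the helper name and signature are hypothetical):

```python
import torch

# Hypothetical helper showing the defaulting rule this PR targets:
# an explicit communication_data_type always wins; otherwise fall back
# to the enabled training dtype instead of a hard-coded value.
def default_communication_data_type(fp16_enabled: bool,
                                    bf16_enabled: bool,
                                    communication_data_type=None):
    if communication_data_type is not None:
        return communication_data_type
    if bf16_enabled:   # the case fixed here: default to bf16, not fp16/fp32
        return torch.bfloat16
    if fp16_enabled:
        return torch.float16
    return torch.float32
```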

Before the fix:

tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-fp16] PASSED [ 11%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-bfp16] PASSED [ 22%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-fp32] PASSED [ 33%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-fp16] PASSED [ 44%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-bfp16] PASSED [ 55%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-fp32] PASSED [ 66%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-fp16] PASSED [ 77%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-bfp16] FAILED [ 88%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-fp32] PASSED [100%]

=================================== FAILURES ===================================
________________________ TestZeroDtypeCocktail.test[default-bfp16]

After the fix:

tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-fp16] PASSED [ 11%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-bfp16] PASSED [ 22%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-fp32] PASSED [ 33%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-fp16] PASSED [ 44%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-bfp16] PASSED [ 55%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-fp32] PASSED [ 66%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-fp16] PASSED [ 77%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-bfp16] PASSED [ 88%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-fp32] PASSED [100%]

@tjruwase tjruwase requested a review from jomayeri April 25, 2023 11:47
@tjruwase tjruwase merged commit d56268f into microsoft:master Apr 25, 2023
18 checks passed
clumsy pushed a commit to clumsy/DeepSpeed that referenced this pull request Apr 25, 2023
@Sanster commented May 4, 2023

@clumsy According to the discussion here #2911 (comment), it seems intentional that bf16's default communication_data_type is set to float32. I'm curious: did you encounter any issues when using bf16 with an fp32 communication_data_type?

@clumsy (Contributor, Author) commented May 12, 2023

Yes @Sanster, let me mention the regression issue we faced.

Perhaps more importantly, this communication_data_type conversion didn't even happen in v0.6.0, for example. So in our bfloat16=true experiments (without setting communication_data_type explicitly) the gradients were reduced as bfloat16, with no fp32 conversion.
When we switched to v0.7.3 we were unpleasantly surprised that the model trained with the newer DeepSpeed under-performed versus the older v0.6.0. It took us significant effort to bisect DeepSpeed versions and changes to identify the offending change.
So in our case, not only were we willingly sacrificing mantissa precision by using bfloat16 to fit a larger model, we were also inadvertently suffering exponent precision loss due to fp16 (which was the default, hard-coded in v0.7.3), plus the loss incurred converting between the two.
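
To illustrate the exponent-range point, here is a standalone PyTorch snippet (illustrative, not taken from our training runs): bfloat16 keeps fp32's 8-bit exponent while fp16 only has a 5-bit exponent, so casting bf16 gradients to fp16 for the reduction can overflow values that bf16 represents comfortably.

```python
import torch

# fp16's largest finite value is ~65504; bf16 can represent much larger
# magnitudes, so a bf16 -> fp16 cast before the all-reduce can yield inf.
g = torch.tensor([70000.0], dtype=torch.bfloat16)
print(g.to(torch.float16))   # tensor([inf], dtype=torch.float16)
print(g.to(torch.float32))   # finite, preserved up to bf16's reduced mantissa precision
```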

Now, back to your question of whether to default to fp32. In the spirit of "explicit is better than implicit" I'd like to avoid this, especially since the doc clearly states: "By default, this feature is not enabled ('none' value)". As a user, I don't expect any conversion to happen behind the scenes.
Another reason not to use fp32 by default, I think, is that it was not even possible to use it with reduce_scatter in ZeRO-2 before.
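
For readers who want to be explicit, the reduction dtype can be pinned in the DeepSpeed config rather than left to the default (a sketch; the accepted value strings may vary by DeepSpeed version, so check the config docs for your release):

```python
# Minimal ds_config sketch: bf16 training with an explicitly pinned
# communication_data_type, so no implicit dtype conversion is relied upon.
ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},
    # Value string per your DeepSpeed version's config docs (assumption here).
    "communication_data_type": "bf16",
}
```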

@tjruwase (Contributor) commented

> When we switched to v0.7.3 we were unpleasantly surprised that the model trained with the newer DeepSpeed under-performed versus the older v0.6.0. It took us significant effort to bisect DeepSpeed versions and changes to identify the offending change.

@clumsy, thanks for sharing this explanation. Apologies for breaking backward compatibility with the above change. I am aligned with "explicit is better than implicit" and will ensure we adhere to that better in practice.


Successfully merging this pull request may close these issues.

[BUG] FP16 used for all reduce even if BFLOAT16 is enabled
5 participants