fixing default communication_data_type for bfloat16_enabled and docs #3370

Merged (3 commits) on Apr 25, 2023

Conversation

@clumsy (Contributor) commented Apr 24, 2023

fixes #2071

#2145 fixed the case where communication_data_type is set explicitly; unfortunately, the default value when bfloat16_enabled=True is still not bfp16.
Adding unit tests for the default communication_data_type.
Correcting the docs that described communication_data_type as being of type boolean.
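
For context, here is a minimal sketch of the intended defaulting rule (illustrative only, not the actual DeepSpeed source; the helper name and signature are hypothetical):

```python
import torch

# Hypothetical helper showing the defaulting rule this PR targets:
# an explicit communication_data_type always wins; otherwise fall back
# to the enabled training dtype instead of a hard-coded value.
def default_communication_data_type(fp16_enabled: bool,
                                    bf16_enabled: bool,
                                    communication_data_type=None):
    if communication_data_type is not None:
        return communication_data_type
    if bf16_enabled:   # the case fixed here: default to bf16, not fp16/fp32
        return torch.bfloat16
    if fp16_enabled:
        return torch.float16
    return torch.float32
```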

Before the fix:

tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-fp16] PASSED [ 11%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-bfp16] PASSED [ 22%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-fp32] PASSED [ 33%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-fp16] PASSED [ 44%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-bfp16] PASSED [ 55%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-fp32] PASSED [ 66%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-fp16] PASSED [ 77%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-bfp16] FAILED [ 88%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-fp32] PASSED [100%]

=================================== FAILURES ===================================
________________________ TestZeroDtypeCocktail.test[default-bfp16]

After the fix:

tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-fp16] PASSED [ 11%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-bfp16] PASSED [ 22%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[fp16-fp32] PASSED [ 33%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-fp16] PASSED [ 44%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-bfp16] PASSED [ 55%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[bfp16-fp32] PASSED [ 66%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-fp16] PASSED [ 77%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-bfp16] PASSED [ 88%]
tests/unit/runtime/half_precision/test_bf16.py::TestZeroDtypeCocktail::test[default-fp32] PASSED [100%]

@tjruwase tjruwase requested a review from jomayeri April 25, 2023 11:47
@tjruwase tjruwase merged commit d56268f into microsoft:master Apr 25, 2023
18 checks passed
clumsy pushed a commit to clumsy/DeepSpeed that referenced this pull request Apr 25, 2023
@Sanster commented May 4, 2023

@clumsy According to the discussion here #2911 (comment), it seems intentional that bf16's default communication_data_type is set to float32. I'm curious: did you encounter any issues when using bf16 with an fp32 communication_data_type?

@clumsy (Contributor, Author) commented May 12, 2023

Yes @Sanster, let me mention the regression issue we faced.

Perhaps more importantly, this communication_data_type conversion didn't even happen in v0.6.0, for example. So in our bfloat16=true experiments (without setting communication_data_type explicitly) the gradients were reduced as bfloat16, with no fp32 conversion.
When we switched to v0.7.3 we were unpleasantly surprised that the model trained with the newer DeepSpeed under-performed versus the older v0.6.0. It took us significant effort to bisect DeepSpeed versions and changes to identify the offending change.
So in our case, not only were we willingly sacrificing mantissa precision by using bfloat16 to fit a larger model, we were also inadvertently suffering exponent precision loss due to fp16 (which was the default, hard-coded in v0.7.3), plus the loss incurred converting between the two.
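
To illustrate the exponent-range point, here is a standalone PyTorch snippet (illustrative, not taken from our training runs): bfloat16 keeps fp32's 8-bit exponent while fp16 only has a 5-bit exponent, so casting bf16 gradients to fp16 for the reduction can overflow values that bf16 represents comfortably.

```python
import torch

# fp16's largest finite value is ~65504; bf16 can represent much larger
# magnitudes, so a bf16 -> fp16 cast before the all-reduce can yield inf.
g = torch.tensor([70000.0], dtype=torch.bfloat16)
print(g.to(torch.float16))   # tensor([inf], dtype=torch.float16)
print(g.to(torch.float32))   # finite, preserved up to bf16's reduced mantissa precision
```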

Now, back to your question of whether to default to fp32. In the spirit of "explicit is better than implicit" I'd like to avoid this, especially since the doc clearly states: "By default, this feature is not enabled ('none' value)". As a user, I don't expect any conversion to happen behind the scenes.
Another reason not to use fp32 by default, I think, is that it was not even possible to use it with reduce_scatter in ZeRO-2 before.
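
For readers who want to be explicit, the reduction dtype can be pinned in the DeepSpeed config rather than left to the default (a sketch; the accepted value strings may vary by DeepSpeed version, so check the config docs for your release):

```python
# Minimal ds_config sketch: bf16 training with an explicitly pinned
# communication_data_type, so no implicit dtype conversion is relied upon.
ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},
    # Value string per your DeepSpeed version's config docs (assumption here).
    "communication_data_type": "bf16",
}
```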

@tjruwase (Contributor) commented

> When we switched to v0.7.3 we were unpleasantly surprised that the model trained with the newer DeepSpeed under-performed versus the older v0.6.0. It took us significant effort to bisect DeepSpeed versions and changes to identify the offending change.

@clumsy, thanks for sharing this explanation. Apologies for breaking backward compatibility with the above change. I am aligned with "explicit is better than implicit" and will ensure we adhere to that better in practice.


Successfully merging this pull request may close these issues.

[BUG] FP16 used for all reduce even if BFLOAT16 is enabled
5 participants