Fix issue with BF16 optimizer selection #7788
Merged
tohtana merged 1 commit into deepspeedai:master on Jan 20, 2026
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This was referenced Jan 18, 2026
sfc-gh-truwase approved these changes on Jan 20, 2026
tohtana added a commit that referenced this pull request on Jan 20, 2026

The full unit test workflow has been disabled for a while. This PR migrates the full test to our AWS test infra. To make the tests pass, we need to merge these PRs:

- #7786
- #7788
- #7789
- #7790
- #7793
- #7794

In addition to having these PRs merged, this PR makes the following changes to the full test workflow and test harness:

- Ignore flags for some known issues:
  - nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
  - GDS: GDS requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
  - Zenflow:
    1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: `test_zf_torch_adam.py` uses `torch.optim.AdamW`, which does CUDA graph capture checks that fail in forked processes (`--forked` flag; we can just move it to sequential tests).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Collaborator, Author

@sfc-gh-truwase Sorry, I forgot to change the ZeRO stage in the test. Please review #7803.
tohtana added a commit that referenced this pull request on Jan 23, 2026

Use ZeRO stage 1 to use BF16 optimizer. (We should have switched to ZeRO1 in #7788, but I missed the change. @sfc-gh-truwase)

- #7790 removed the fallback that allowed bf16 model + fp32 grad accumulation without ZeRO, so that combo now raises `NotImplementedError`.
- #7788 changed `test_bf16_optimizer_fragments` to force `BF16_Optimizer` by setting `grad_accum_dtype=fp32`, but it kept ZeRO stage 0, which is now invalid after #7790.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
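
As a rough sketch of the configuration this commit message describes, the dict below combines bf16, fp32 gradient accumulation, and ZeRO stage 1. The key names (`bf16.enabled`, `data_types.grad_accum_dtype`, `zero_optimization.stage`) follow DeepSpeed's JSON config format as far as I know; the test's actual config is not shown in this thread, and the helper `make_bf16_config` and the batch size are hypothetical.

```python
import json


def make_bf16_config(zero_stage: int = 1) -> dict:
    """Hypothetical helper: build a DeepSpeed-style config dict for illustration."""
    return {
        # Placeholder batch size; not taken from the test in this PR.
        "train_micro_batch_size_per_gpu": 1,
        # Run the model in bf16.
        "bf16": {"enabled": True},
        # Accumulate gradients in fp32 (assumed key name: data_types.grad_accum_dtype).
        "data_types": {"grad_accum_dtype": "fp32"},
        # Per the commit message, stage 0 with this combination is invalid after #7790.
        "zero_optimization": {"stage": zero_stage},
    }


config = make_bf16_config()
print(json.dumps(config, indent=2))
```

Passing `zero_stage=0` here would produce the combination that the commit message says now raises `NotImplementedError`.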
phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request on Jan 29, 2026

**Note:** Updated based on the change 64b1073 for deepspeedai#7790. With the fix, `BF16_Optimizer` now requires ZeRO stage 1 to be explicitly enabled.

The test `test_bf16_optimizer_fragments` fails with an `AssertionError` because the `BF16_Optimizer` is not being instantiated when expected. The test checks for the `_hp_mapping` attribute on parameters, which is only set by `BF16_Optimizer`.

The test `test_bf16_optimizer_fragments` fails because:

1. The test config (`bf16=True` without `grad_accum_dtype`) **correctly** uses `FP16_Optimizer`, but the test expects `BF16_Optimizer` (which sets `_hp_mapping`)
2. `BFLOAT16` and `DDP_BFLOAT16` have the same value `"bf16"`, preventing proper optimizer selection
3. `BF16_Optimizer` is missing attributes required by the base class API

This PR addresses these issues.

Optimizer selection summary:

| ZeRO Stage | Config | Optimizer | Gradient Accumulation |
|------------|--------|-----------|----------------------|
| 0 | `bf16=True` (default) | `FP16_Optimizer` | bf16 |
| 0 | `bf16=True` + `grad_accum_dtype=fp32` | `NotImplementedError` | - |
| 1 | `bf16=True` + `grad_accum_dtype=fp32` | `BF16_Optimizer` | fp32 |

This is confusing (e.g., `FP16_Optimizer` handles both fp16 and bf16). We would need to simplify the code paths and clarify the behaviors in the future.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
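
The second cause listed above (two constants sharing the value `"bf16"`) can be illustrated with a small stand-alone sketch. This is not DeepSpeed's code: `make_dispatch`, the branch labels, and the replacement value `"ddp_bf16"` are all hypothetical, chosen only to show why a string-keyed dispatch cannot tell two identically valued constants apart.

```python
from typing import Callable


def make_dispatch(bfloat16: str, ddp_bfloat16: str) -> Callable[[str], str]:
    """Build a string-keyed dispatcher from two wrapper-type constants."""

    def dispatch(optimizer_type: str) -> str:
        if optimizer_type == bfloat16:
            return "BFLOAT16 branch"
        if optimizer_type == ddp_bfloat16:
            # Unreachable whenever both constants hold the same string.
            return "DDP_BFLOAT16 branch"
        return "unhandled"

    return dispatch


# Both constants equal "bf16", as described in the PR text.
collided = make_dispatch("bf16", "bf16")
# Hypothetical distinct value, used only for this illustration.
distinct = make_dispatch("bf16", "ddp_bf16")

print(collided("bf16"))
print(distinct("ddp_bf16"))
```

With colliding values, every `"bf16"` request lands in the first branch, so the second can never be selected; giving the constants distinct values makes both branches reachable.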
**Note:** Updated based on the change 64b1073 for #7790. With the fix, `BF16_Optimizer` now requires ZeRO stage 1 to be explicitly enabled.

The test `test_bf16_optimizer_fragments` fails with an `AssertionError` because the `BF16_Optimizer` is not being instantiated when expected. The test checks for the `_hp_mapping` attribute on parameters, which is only set by `BF16_Optimizer`.

The test `test_bf16_optimizer_fragments` fails because:

1. The test config (`bf16=True` without `grad_accum_dtype`) **correctly** uses `FP16_Optimizer`, but the test expects `BF16_Optimizer` (which sets `_hp_mapping`)
2. `BFLOAT16` and `DDP_BFLOAT16` have the same value `"bf16"`, preventing proper optimizer selection
3. `BF16_Optimizer` is missing attributes required by the base class API

This PR addresses these issues.

Optimizer selection summary:

| ZeRO Stage | Config | Optimizer | Gradient Accumulation |
|------------|--------|-----------|----------------------|
| 0 | `bf16=True` (default) | `FP16_Optimizer` | bf16 |
| 0 | `bf16=True` + `grad_accum_dtype=fp32` | `NotImplementedError` | - |
| 1 | `bf16=True` + `grad_accum_dtype=fp32` | `BF16_Optimizer` | fp32 |

This is confusing (e.g., `FP16_Optimizer` handles both fp16 and bf16). We would need to simplify the code paths and clarify the behaviors in the future.
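
To make the summary easier to check at a glance, here is a minimal sketch that encodes only the three combinations listed in the PR description. The function `select_optimizer` is hypothetical and is not how the DeepSpeed engine is structured; anything the description does not list raises a `ValueError` rather than guessing at the behavior.

```python
from typing import Optional


def select_optimizer(zero_stage: int, grad_accum_dtype: Optional[str] = None) -> str:
    """Return the optimizer wrapper name for a bf16=True run, per the PR summary."""
    # With bf16=True and no explicit setting, gradients accumulate in bf16.
    accum = grad_accum_dtype or "bf16"

    if zero_stage == 0 and accum == "bf16":
        return "FP16_Optimizer"
    if zero_stage == 0 and accum == "fp32":
        # Per the PR text, #7790 removed the fallback for this combination.
        raise NotImplementedError("bf16 model + fp32 grad accumulation requires ZeRO stage 1")
    if zero_stage == 1 and accum == "fp32":
        return "BF16_Optimizer"

    # Combinations outside the summary are deliberately not modeled here.
    raise ValueError("combination not covered by the summary")


print(select_optimizer(0))
print(select_optimizer(1, "fp32"))
```

Calling `select_optimizer(0, "fp32")` raises `NotImplementedError`, matching the stage 0 + fp32 accumulation case in the summary.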