Fix issue with BF16 optimizer selection by tohtana · Pull Request #7788 · deepspeedai/DeepSpeed

tohtana · 2026-01-18T02:54:14Z

Note: Updated based on the change 64b1073 for #7790. With the fix, BF16_Optimizer now requires ZeRO stage 1 to be explicitly enabled.

The test `test_bf16_optimizer_fragments` fails with an `AssertionError` because `BF16_Optimizer` is not instantiated when expected. The test checks for the `_hp_mapping` attribute on parameters, which is only set by `BF16_Optimizer`.

The failure has three causes:

1. The test config (`bf16=True` without `grad_accum_dtype`) **correctly** uses `FP16_Optimizer`, but the test expects `BF16_Optimizer` (which sets `_hp_mapping`).
2. `BFLOAT16` and `DDP_BFLOAT16` have the same value `"bf16"`, preventing proper optimizer selection.
3. `BF16_Optimizer` is missing attributes required by the base class API.

This PR addresses these issues.
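To illustrate why two constants sharing the value `"bf16"` prevent proper selection, here is a minimal sketch. Only the constant names and their shared value come from the description above; the `pick_path` function and its return labels are hypothetical and do not reproduce DeepSpeed's actual dispatch code.

```python
# Constant names and the shared value are taken from the PR description.
BFLOAT16 = "bf16"
DDP_BFLOAT16 = "bf16"  # same value as BFLOAT16, so the two are indistinguishable


def pick_path(optimizer_type: str) -> str:
    """Hypothetical dispatch keyed on the constant's value (illustration only)."""
    if optimizer_type == DDP_BFLOAT16:
        return "path A"
    if optimizer_type == BFLOAT16:
        # Unreachable while both constants are equal: the first branch
        # always matches before this one is evaluated.
        return "path B"
    return "other"


if __name__ == "__main__":
    # Both constants land in the same branch, whichever one the caller meant.
    print(pick_path(BFLOAT16), pick_path(DDP_BFLOAT16))
```

Any comparison-based dispatch has this problem as long as the two constants are equal, so giving them distinct values is a precondition for selecting between the two paths.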

Optimizer selection summary:

| ZeRO Stage | Config | Optimizer | Gradient Accumulation |
|------------|--------|-----------|-----------------------|
| 0 | `bf16=True` (default) | `FP16_Optimizer` | bf16 |
| 0 | `bf16=True` + `grad_accum_dtype=fp32` | `NotImplementedError` | - |
| 1 | `bf16=True` + `grad_accum_dtype=fp32` | `BF16_Optimizer` | fp32 |

This is confusing (e.g., `FP16_Optimizer` handles both fp16 and bf16). We will need to simplify the code paths and clarify the behaviors in the future.
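The selection summary can be read as a small decision function. The sketch below encodes only the three rows listed in the summary and assumes `bf16=True` throughout. The function name `select_optimizer` and its string return values are hypothetical, and combinations the summary does not list are rejected instead of guessed.

```python
from typing import Optional


def select_optimizer(zero_stage: int, grad_accum_dtype: Optional[str] = None) -> str:
    """Return the optimizer wrapper name for a bf16=True config.

    Hypothetical helper mirroring the summary table; not DeepSpeed's code.
    """
    if zero_stage == 0 and grad_accum_dtype is None:
        # Default: gradients are accumulated in bf16.
        return "FP16_Optimizer"
    if zero_stage == 0 and grad_accum_dtype == "fp32":
        # bf16 model + fp32 gradient accumulation without ZeRO is rejected.
        raise NotImplementedError(
            "bf16 with fp32 gradient accumulation requires ZeRO stage 1")
    if zero_stage == 1 and grad_accum_dtype == "fp32":
        return "BF16_Optimizer"
    # Anything else is outside the three rows of the summary.
    raise ValueError("combination not covered by the summary table")


if __name__ == "__main__":
    print(select_optimizer(0))
    print(select_optimizer(1, "fp32"))
```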

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

The full unit test workflow has been disabled for a while. This PR migrates the full test to our AWS test infra.

To make the tests pass, we need to merge these PRs:

- #7786
- #7788
- #7789
- #7790
- #7793
- #7794

In addition to merging these PRs, this PR makes the following changes to the full test workflow and test harness:

- Ignore flags for some known issues:
  - nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
  - GDS: GDS requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
  - ZenFlow:
    1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: `test_zf_torch_adam.py` uses `torch.optim.AdamW`, which does CUDA graph capture checks that fail in forked processes (`--forked` flag; we can just move it to sequential tests).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana · 2026-01-21T02:07:39Z

@sfc-gh-truwase Sorry, I forgot to change the ZeRO stage in the test. Please review #7803.

@sfc-gh-truwase

Use ZeRO stage 1 to use BF16 optimizer. (We should have switched to ZeRO1 in #7788, but I missed the change. @sfc-gh-truwase)

- #7790 removed the fallback that allowed bf16 model + fp32 grad accumulation without ZeRO, so that combo now raises `NotImplementedError`.
- #7788 changed `test_bf16_optimizer_fragments` to force `BF16_Optimizer` by setting `grad_accum_dtype=fp32`, but it kept ZeRO stage 0, which is now invalid after #7790.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
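As a rough sketch of the combination this fix moves the test to, the config below enables bf16, requests fp32 gradient accumulation, and sets ZeRO stage 1. The helper name `bf16_fragment_config` and the batch-size value are assumptions for illustration. The key layout follows DeepSpeed's JSON config format as I understand it (`bf16.enabled`, `data_types.grad_accum_dtype`, `zero_optimization.stage`), and the actual test config in the repository may differ.

```python
def bf16_fragment_config(zero_stage: int = 1) -> dict:
    """Build a config dict for bf16 + fp32 grad accumulation (hypothetical helper)."""
    return {
        # Placeholder batch size; the real test value is not shown in this PR text.
        "train_micro_batch_size_per_gpu": 1,
        "bf16": {"enabled": True},
        # fp32 gradient accumulation is what selects BF16_Optimizer.
        "data_types": {"grad_accum_dtype": "fp32"},
        # Stage 1 is required; stage 0 with this combination now raises
        # NotImplementedError according to the description above.
        "zero_optimization": {"stage": zero_stage},
    }


if __name__ == "__main__":
    print(bf16_fragment_config())
```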


Fix issue with BF16 optimizer selection

8e83666

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana requested review from loadams and tjruwase as code owners January 18, 2026 02:54

This was referenced Jan 18, 2026

Add full test suite workflow #7795

Merged

Fix BF16_Optimizer being used without ZeRO #7790

Merged

sfc-gh-truwase approved these changes Jan 20, 2026

tohtana merged commit 1393f75 into deepspeedai:master Jan 20, 2026
11 checks passed

tohtana mentioned this pull request Jan 21, 2026

Fix ZeRO stage to choose BF16 optimizer in test #7803

Merged

tohtana merged 1 commit into deepspeedai:master from tohtana:tohtana/fix_bf16_optimizer_select
