fix(zero): detach flat buffer to prevent autograd inplace error on CPU accelerator #7948
Merged
delock merged 6 commits into deepspeedai:master (Apr 3, 2026)
Conversation
fix(zero): detach flat buffer to prevent autograd inplace error on CPU accelerator

The on-device flatten path (introduced in deepspeedai#7828) passes nn.Parameter objects with requires_grad=True to torch.cat(), creating a flat buffer with a CatBackward0 grad_fn. Later, _unflatten_dense_tensors produces SplitBackward0 views that are assigned to model params. An in-place copy_() on these views during the optimizer step raises:

RuntimeError: Output 0 of SplitBackward0 is a view and is being modified inplace.

This especially affects CPU training, where CPU_Accelerator.is_available() returns True and available_memory() returns system RAM, so the on-device path is always taken.

Fix: add .detach() to the flattened buffer, matching the implicit detach behavior of the CPU-offload path (param.data.cpu() + .to(device)). Also rename flatten_on_gpu -> flatten_on_accelerator and replace GPU-specific terminology in comments/logs with accelerator-generic equivalents.

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
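The failure mode described above can be reproduced in a few lines of plain PyTorch, independent of DeepSpeed internals (a minimal sketch; tensor sizes are illustrative):

```python
import torch

# Flattening nn.Parameter objects directly records autograd history:
params = [torch.nn.Parameter(torch.randn(3)) for _ in range(2)]
flat = torch.cat([p.view(-1) for p in params])
assert flat.grad_fn is not None  # CatBackward0

# Splitting the tracked buffer yields multi-output views; autograd
# forbids in-place modification of such views:
views = torch.split(flat, [3, 3])
try:
    views[0].copy_(torch.zeros(3))
    raised = False
except RuntimeError:
    raised = True  # "Output 0 of SplitBackward0 is a view and is being modified inplace."
assert raised

# Detaching first (as the fix does) severs the history, so the
# unflattened views can be updated in place during the optimizer step:
flat_ok = torch.cat([p.view(-1) for p in params]).detach()
assert flat_ok.grad_fn is None
ok_views = torch.split(flat_ok, [3, 3])
ok_views[0].copy_(torch.zeros(3))  # no error
```

This also explains why the CPU-offload path never hit the bug: param.data.cpu() operates on .data, which carries no grad_fn, so the round-trip implicitly detaches.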
Force-pushed from 47b1ebc to 0bfe45e
tohtana approved these changes (Apr 3, 2026)
assert flat.grad_fn is None, ("Flat buffer must be detached from autograd graph"
                              " to prevent inplace-modification errors during optimizer step")

data_loader = random_dataloader(model=engine, total_samples=8, hidden_dim=hidden_dim, device=engine.device)
Collaborator
Shouldn't random_dataloader take dtype? The default is preferred_dtype(), which could mismatch the test config's dtype.
Collaborator (Author)
Thanks for the catch!
…sertion

- Pass explicit dtype to random_dataloader to avoid a mismatch when preferred_dtype() (bfloat16 on CPU) differs from the test config dtype. Fixes the fp32 test failure on CPU-only CI where data was bfloat16 but the model expected float32.
- Tighten the log check from 'sufficient' to '(sufficient memory)' so it does not accidentally match '(insufficient memory)'.

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
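The second bullet guards against a classic substring pitfall: "insufficient" ends with "sufficient", so the loose check matched both branches of the flatten path. A quick illustration (the log text here is made up for the example, not the exact DeepSpeed message):

```python
# Hypothetical log lines for the two branches of the flatten path.
on_device = "flattening on accelerator (sufficient memory)"
offload = "flattening via CPU offload (insufficient memory)"

# The loose check matches BOTH lines, because "insufficient"
# contains the substring "sufficient":
assert "sufficient" in on_device
assert "sufficient" in offload  # false positive

# The tightened check only matches the intended branch:
assert "(sufficient memory)" in on_device
assert "(sufficient memory)" not in offload
```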
Collaborator (Author)
@tohtana Thanks for the comments! I also verified that the newly added test fails before this PR is applied.
tohtana pushed a commit to tohtana/DeepSpeed that referenced this pull request (Apr 4, 2026)