
fix(zero): detach flat buffer to prevent autograd inplace error on CPU accelerator #7948

Merged
delock merged 6 commits into deepspeedai:master from delock:gma/fix_cpu_train
Apr 3, 2026
Conversation

@delock (Collaborator) commented Apr 2, 2026

fix(zero): detach flat buffer to prevent autograd inplace error on CPU accelerator

The on-device flatten path (introduced in #7828) passes nn.Parameter objects with requires_grad=True to torch.cat(), creating a flat buffer with a CatBackward0 grad_fn. Later, _unflatten_dense_tensors produces SplitBackward0 views that are assigned to model params. An inplace copy_() on these views during the optimizer step raises:
RuntimeError: Output 0 of SplitBackward0 is a view and is being modified inplace.

This especially affects CPU training where CPU_Accelerator.is_available() returns True and available_memory() returns system RAM, so the on-device path is always taken.

Fix: add .detach() to the flattened buffer, matching the implicit detach behavior of the CPU-offload path (param.data.cpu() + .to(device)).

Also rename flatten_on_gpu -> flatten_on_accelerator and replace GPU-specific terminology in comments/logs with accelerator-generic equivalents.

@delock delock requested review from tjruwase and tohtana as code owners April 2, 2026 13:41
delock added 2 commits April 3, 2026 07:40
fix(zero): detach flat buffer to prevent autograd inplace error on CPU accelerator

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
Signed-off-by: Guokai Ma <guokai.ma@intel.com>
@delock delock force-pushed the gma/fix_cpu_train branch from 47b1ebc to 0bfe45e on April 2, 2026 23:41
Signed-off-by: Guokai Ma <guokai.ma@intel.com>
@delock delock requested a review from loadams as a code owner April 3, 2026 00:14
Collaborator

@tohtana tohtana left a comment


Thank you, @delock! I left a comment in the new test.

assert flat.grad_fn is None, ("Flat buffer must be detached from autograd graph"
" to prevent inplace-modification errors during optimizer step")

data_loader = random_dataloader(model=engine, total_samples=8, hidden_dim=hidden_dim, device=engine.device)
@tohtana (Collaborator) commented:

Shouldn't random_dataloader take a dtype argument? The default is preferred_dtype(), which could mismatch the test config's dtype.
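The class of error behind this review comment can be reproduced with a bare torch module (random_dataloader is a DeepSpeed test helper; this sketch only illustrates the dtype mismatch itself):

```python
import torch

model = torch.nn.Linear(8, 8)  # fp32 weights, like the test's fp32 config

# Data produced in a "preferred" dtype (bfloat16 on CPU) mismatches
# the fp32 weights and raises a RuntimeError on the matmul.
try:
    model(torch.randn(2, 8, dtype=torch.bfloat16))
except RuntimeError as err:
    print(err)

# Passing the config dtype explicitly avoids the mismatch.
out = model(torch.randn(2, 8, dtype=torch.float32))
```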

@delock (Collaborator, Author) replied:

Thanks for the catch!

delock added 2 commits April 2, 2026 19:07
…sertion

- Pass explicit dtype to random_dataloader to avoid mismatch when
  preferred_dtype() (bfloat16 on CPU) differs from the test config dtype.
  Fixes fp32 test failure on CPU-only CI where data was bfloat16 but model
  expected float32.
- Tighten log check from 'sufficient' to '(sufficient memory)' so it does
  not accidentally match '(insufficient memory)'.
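The substring pitfall in the second bullet is easy to demonstrate: 'sufficient' is a substring of 'insufficient', so only the parenthesized pattern distinguishes the two branches (the log text below is illustrative, not the exact DeepSpeed message):

```python
log_line = "flatten on accelerator skipped (insufficient memory)"

# Loose check falsely matches on the insufficient-memory branch.
assert "sufficient" in log_line

# Tightened check correctly tells the two branches apart.
assert "(sufficient memory)" not in log_line
assert "(insufficient memory)" in log_line
```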

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
@delock (Collaborator, Author) commented Apr 3, 2026

@tohtana Thanks for the comments! I also verified that the newly added test will fail before applying this PR.

@delock delock enabled auto-merge (squash) April 3, 2026 02:40
@delock delock merged commit 37e232f into deepspeedai:master Apr 3, 2026
8 of 9 checks passed
tohtana pushed a commit to tohtana/DeepSpeed that referenced this pull request Apr 4, 2026
fix(zero): detach flat buffer to prevent autograd inplace error on CPU accelerator (deepspeedai#7948)