Fix OOM regression for FSDP2 + cpu_ram_efficient_loading on large models #45649
AmineDiro wants to merge 1 commit into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45649&sha=74480d
cc @Cyrilvallez I think |
albertvillanova left a comment
Thanks a lot for the clear diagnosis and the fix: Skip CPU param materialization on non-rank-0 FSDP ranks to avoid OOM
The OOM regression in #45050 is real: zeros_like forces an immediate physical-memory commit (page fault on every zero write), whereas empty_like relies on overcommit/lazy allocation. Note this was already commented by @ArthurZucker: https://github.com/huggingface/transformers/pull/45050/changes#r3029107360
> the reason I don't want this is because it's costly!
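To make the cost difference concrete, here is a minimal sketch (my own illustration, not code from the PR) of why `zeros_like` commits physical memory up front while `empty_like` does not:

```python
import torch

# A parameter-sized tensor on the meta device, as produced by low-memory loading
meta_param = torch.empty(4096, 4096, dtype=torch.bfloat16, device="meta")

# empty_like only reserves virtual address space; pages are committed lazily on
# first write, so resident memory barely grows until the tensor is actually used.
lazy_placeholder = torch.empty_like(meta_param, device="cpu")

# zeros_like writes a zero into every element, touching every page immediately,
# so the full tensor size is committed to physical RAM on each rank right away.
eager_placeholder = torch.zeros_like(meta_param, device="cpu")
```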
Let me trace through the full flow after the change to confirm:
- On non-rank-0 FSDP ranks:
  - Parameters stay on meta device: zero physical memory committed
  - Buffers (both persistent and non-persistent) get real CPU `zeros_like` placeholders
- Then `_initialize_missing_keys` (PR #44473) marks state-dict parameters (now meta tensors) as `_is_hf_initialized = True`. `initialize_weights()` then runs: for `RotaryEmbedding`, `inv_freq` and `original_inv_freq` are non-persistent buffers, so they are not in `state_dict()`, not marked, and `_init_weights` correctly computes and copies their values into the real CPU zero tensors
- Accelerate's `fsdp2_prepare_model` then:
  - Saves non-persistent buffers (now correctly initialized by `_init_weights`) from each rank
  - Moves the model to meta; parameters that were already on meta: no-op
  - Applies `fully_shard`
  - `fsdp2_load_full_state_dict` broadcasts from rank-0 into all ranks: parameters receive correct values
  - Restores non-persistent buffers from each rank's saved copy
The original NaN bug is still fixed: parameters that _init_weights skips (marked as initialized) are subsequently overwritten by the broadcast with rank-0's values. The difference from #45050 is that we never pay the cost of materializing them on non-rank-0 in the first place.
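A rough way to state the expected end state per rank right after loading, before accelerate's FSDP2 prepare (an illustrative check of my own, not code from the PR or its tests; `model` and `is_rank0` are placeholders):

```python
def check_placeholder_policy(model, is_rank0):
    # Buffers always get real CPU placeholders so _init_weights can fill e.g. RoPE caches
    for name, buf in model.named_buffers():
        assert buf.device.type == "cpu", f"buffer {name} should be materialized on CPU"
    # Parameters may stay on meta off rank-0: fsdp2_load_full_state_dict broadcasts
    # rank-0's values into every rank, so no CPU commit is needed beforehand
    if not is_rank0:
        for name, p in model.named_parameters():
            assert p.device.type == "meta", f"param {name} should stay on meta until broadcast"
```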
The fix is correct, targeted, and eliminates the OOM without reintroducing the NaN regression (I have confirmed this). 🤗
return

# In this case we need to move everything back
# Leave parameters on meta on non-rank-0 FSDP ranks (rank-0 broadcast overwrites them); only buffers need real placeholders.
Nit: I think the comment is right, but it under-specifies the mechanism (see the sketch after this list):
- Parameters can stay on meta:
  - accelerate's `fsdp2_prepare_model` moves the whole model to meta before `fully_shard`
  - then `fsdp2_load_full_state_dict` broadcasts rank-0's state_dict to all ranks
- Only buffers need real allocations:
  - persistent buffers are also broadcast
  - but non-persistent ones (RoPE caches etc.) are per-rank and must be initialized locally by `_init_weights`
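For instance, an expanded wording along these lines (just a suggestion of mine, not the text currently in the diff):

```python
# Parameters can stay on meta on non-rank-0 ranks: accelerate's fsdp2_prepare_model
# moves the whole model to meta before fully_shard, and fsdp2_load_full_state_dict
# then broadcasts rank-0's state_dict (parameters and persistent buffers) to all ranks.
# Only buffers need real CPU allocations here, because non-persistent ones
# (RoPE caches etc.) are per-rank and must be initialized locally by _init_weights.
```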
What does this PR do?
PR #45050 replaces `torch.empty_like` with `torch.zeros_like` in `_move_missing_keys_from_meta_to_device`. While this fixes a real issue (NaN garbage in uninitialized memory), it forces a physical-memory commit of the entire model on every non-rank-0 FSDP rank. With 8 ranks per node each committing a full copy of a 30B model, peak CPU memory jumps from ~60 GB to ~480 GB :/
The regression was identified by bisecting transformers commits between 2026-04-10 (working) and 2026-04-22 (failing) using a 2-node FSDP2 control config:

- `a001f34439` (pre-#45050)
- `ff49f7c4cb` (PR #45050)

Test config: `Qwen/Qwen3-30B-A3B`, FSDP2, 2 nodes × 8 H100, DP=16, sdpa, `max_steps=5`, `fsdp_cpu_ram_efficient_loading=true`.

The placeholder values on non-rank-0 ranks for state-dict params are immediately overwritten by `fsdp2_load_full_state_dict` during accelerate's FSDP2 prepare: `accelerate` moves the entire model to the `meta` device before sharding in `accelerate.utils.fsdp_utils.fsdp2_prepare_model`. So allocating CPU placeholders for parameters on non-rank-0 ranks is unnecessary work; the parameters can stay on meta. Btw, from what I can understand, buffers (RoPE caches, attention masks, etc.) are per-rank and not part of the broadcast, so they still need real allocations.
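A minimal sketch of the resulting allocation policy (the real change lives in `_move_missing_keys_from_meta_to_device`; the helper below and its arguments are illustrative stand-ins, not the actual transformers code):

```python
import torch

def materialize_missing_key(meta_tensor, is_buffer, is_rank0):
    # Parameters on non-rank-0 ranks stay on meta: fsdp2_load_full_state_dict
    # broadcasts rank-0's values into them during accelerate's FSDP2 prepare,
    # so committing CPU RAM for them here is wasted work.
    if not is_buffer and not is_rank0:
        return meta_tensor
    # Buffers (and rank-0 tensors) need real CPU memory; zeros_like keeps the
    # NaN fix from #45050 for anything _init_weights does not overwrite.
    return torch.zeros_like(meta_tensor, device="cpu")
```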
Fixes # (issue)
Code Agent Policy
Before submitting
Who can review?
@albertvillanova @ArthurZucker