
fix zero3 init config #44236

Merged
SunMarc merged 4 commits into main from fix-dp-config-init
Feb 27, 2026

Conversation


@SunMarc SunMarc commented Feb 23, 2026

What does this PR do?

Supersedes #43847

When using ZeRO-3 with `from_config()`, the model was incorrectly initialized because we were not gathering the params before running init. A test is added as well.

cc @tohtana

SunMarc and others added 3 commits February 23, 2026 17:17
When using `from_config()` with DeepSpeed ZeRO-3, `_init_weights()` silently
operated on partitioned empty tensors, making custom initialization a no-op.
Parameters retained PyTorch's default kaiming_uniform_ instead of the intended
initialization, causing abnormally large gradients and loss.

The fix suppresses init during construction via `no_init_weights()`, then
re-initializes module-by-module using `GatheredParameters` so each module's
parameters are gathered before init runs.
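The commit message above describes the control flow: suppress init during construction, then gather each module's parameters before running `_init_weights()`, so the init actually touches full tensors rather than empty ZeRO-3 shards. A minimal pure-Python sketch of that pattern, where `FakeParam`, `gathered_parameters`, and `init_weights` are hypothetical stand-ins for DeepSpeed's partitioned parameters, `deepspeed.zero.GatheredParameters`, and a model's `_init_weights()` (this is not the actual transformers/DeepSpeed code):

```python
from contextlib import contextmanager

class FakeParam:
    """Stand-in for a ZeRO-3 partitioned parameter: the local storage is
    empty on this rank; the full tensor only exists once gathered."""
    def __init__(self):
        self.data = []           # partitioned: empty locally
        self.full = [0.0] * 4    # the full, sharded-away values

@contextmanager
def gathered_parameters(params):
    """Mimics deepspeed.zero.GatheredParameters: materialize the full
    params, run the body, then re-partition, scattering modifications."""
    for p in params:
        p.data = p.full          # gather
    try:
        yield
    finally:
        for p in params:
            p.full = p.data      # scatter modifications back
            p.data = []          # re-partition

def init_weights(param, value=0.5):
    # _init_weights-style in-place init; silently a no-op on an empty shard.
    param.data = [value] * len(param.data)

p = FakeParam()

# Without gathering, init does nothing (the reported bug):
init_weights(p)
assert p.full == [0.0] * 4

# With gathering (the fix), init reaches the full tensor:
with gathered_parameters([p]):
    init_weights(p)
assert p.full == [0.5] * 4
```

The sketch reproduces the failure mode (init over an empty shard is a silent no-op) and why wrapping the per-module init in a gather context fixes it.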

Co-Authored-By: Masahiro Tanaka <tohtana@users.noreply.github.com>
@SunMarc SunMarc requested a review from Cyrilvallez February 23, 2026 17:23

SunMarc commented Feb 23, 2026

@bot /style


github-actions bot commented Feb 23, 2026

Style fix bot fixed some files and pushed the changes.


@SunMarc SunMarc merged commit a264509 into main Feb 27, 2026
26 checks passed
@SunMarc SunMarc deleted the fix-dp-config-init branch February 27, 2026 11:36
zvik pushed a commit to zvik/transformers that referenced this pull request Mar 1, 2026
* fix zero3 init config

* Fix ZeRO-3 weight initialization in `_from_config()`

Co-Authored-By: Masahiro Tanaka <tohtana@users.noreply.github.com>
* Apply repo consistency fixes

---------

Co-authored-by: Masahiro Tanaka <tohtana@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

3 participants