
Fix missing rms_norm_eps in DeepseekV3 MLA layernorms#44585

Open
mvanhorn wants to merge 1 commit into huggingface:main from mvanhorn:osc/44261-fix-deepseek-v3-layernorm-eps

Conversation

@mvanhorn
Contributor

What does this PR do?

Passes eps=config.rms_norm_eps to both q_a_layernorm and kv_a_layernorm in the DeepseekV3 MLA attention module. Without this, these layernorms default to eps=1e-5 instead of the config value (1e-6), causing precision differences compared to vLLM and SGLang implementations.
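To make the numerical impact concrete, here is a minimal standalone sketch of the RMSNorm formula (a toy re-implementation with unit weight, not the actual `DeepseekV3RMSNorm` class), assuming the `1e-5` default described above. For small activations the epsilon term dominates the variance, so the two epsilon values diverge measurably:

```python
import math

def rms_norm(xs, eps=1e-5):
    # Toy RMSNorm with unit weight: x * rsqrt(mean(x^2) + eps).
    variance = sum(v * v for v in xs) / len(xs)
    scale = 1.0 / math.sqrt(variance + eps)
    return [v * scale for v in xs]

# Small activations make the epsilon term dominant relative to the
# variance, so the default and configured epsilons visibly disagree.
xs = [1e-3] * 8
out_default = rms_norm(xs, eps=1e-5)  # unpatched: falls back to the default
out_config = rms_norm(xs, eps=1e-6)   # patched: config.rms_norm_eps
max_diff = max(abs(a - b) for a, b in zip(out_default, out_config))
```

With these inputs the per-element outputs differ by roughly 0.4, which is exactly the kind of silent divergence from vLLM/SGLang the PR describes.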

The fix was applied to modular_deepseek_v3.py and propagated to generated modeling files (deepseek_v3, glm4_moe_lite, longcat_flash, youtu) via make fix-repo.

Note: DeepseekV2 has the same issue but is left for a separate PR to keep this focused.

Fixes #44261


Who can review?

@ArthurZucker @Cyrilvallez (text models, attention)

This contribution was developed with AI assistance (Claude Code).

Pass `eps=config.rms_norm_eps` to both `q_a_layernorm` and
`kv_a_layernorm` in DeepseekV3 attention. Without this, these
layernorms use the default eps (1e-5) instead of the config value
(1e-6), causing precision errors vs vLLM/SGLang implementations.

Edit applied to modular_deepseek_v3.py; generated modeling files
(deepseek_v3, glm4_moe_lite, longcat_flash, youtu) updated via
`make fix-repo`.

Fixes huggingface#44261

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v3, glm4_moe_lite, longcat_flash, youtu


@alvinttang alvinttang left a comment


The fix is correct — without passing eps=config.rms_norm_eps, the MLA layernorms for both q_a_layernorm and kv_a_layernorm would silently use the RMSNorm default epsilon (typically 1e-6) instead of the model-configured value, which would cause subtle numerical divergence from reference implementations without any error.

It is worth noting this same fix is applied consistently across all derived models (glm4_moe_lite, longcat_flash, youtu) and the modular source, which is the right approach.

One open question: does the DeepseekV3RMSNorm default epsilon happen to match the typical config value, masking this bug in practice? If so, a unit test asserting the epsilon is correctly propagated would prevent future regressions of this class.
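A regression test along the lines suggested could look like the sketch below. It uses hypothetical stand-in classes so it is self-contained; a real test would instead construct a small DeepseekV3Config and DeepseekV3Attention from transformers and assert on their layernorm attributes:

```python
# Stand-in classes (illustrative only; not the real transformers classes).
class StubConfig:
    rms_norm_eps = 1e-6  # the model-configured value

class StubRMSNorm:
    def __init__(self, hidden_size, eps=1e-5):  # default the bug fell back to
        self.hidden_size = hidden_size
        self.eps = eps

class StubAttention:
    def __init__(self, config):
        # The fixed wiring: epsilon comes from the config, not the default.
        self.q_a_layernorm = StubRMSNorm(128, eps=config.rms_norm_eps)
        self.kv_a_layernorm = StubRMSNorm(64, eps=config.rms_norm_eps)

def test_mla_layernorm_eps_propagated():
    attn = StubAttention(StubConfig())
    assert attn.q_a_layernorm.eps == StubConfig.rms_norm_eps
    assert attn.kv_a_layernorm.eps == StubConfig.rms_norm_eps

test_mla_layernorm_eps_propagated()
```

Such a test would fail against the unpatched wiring (where the eps keyword is omitted) and pass with this PR, guarding against the same omission reappearing in generated modeling files.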

@Cyrilvallez
Member

cc @ArthurZucker, do you know if the norms in the Attention are supposed to use the config value as well or not?

@mvanhorn
Contributor Author

The MLA attention layernorms (q_a_layernorm and kv_a_layernorm) are the only layernorms in DeepseekV3 that weren't getting config.rms_norm_eps passed through - all other norms in the model (input, post-attention, MLP) already use it. This PR just makes them consistent.

The same pattern is applied across the derived models (glm4_moe_lite, longcat_flash, youtu) and the modular source, so it stays consistent everywhere.



Development

Successfully merging this pull request may close these issues.

[Bug/Discussion] MLA q_a_layernorm Missing config.rms_norm_eps, Causing 1e-5/1e-6 Precision Error
