Fix missing rms_norm_eps in DeepseekV3 MLA layernorms#44585
mvanhorn wants to merge 1 commit into huggingface:main from
Conversation
Pass `eps=config.rms_norm_eps` to both `q_a_layernorm` and `kv_a_layernorm` in DeepseekV3 attention. Without this, these layernorms use the default eps (1e-5) instead of the config value (1e-6), causing precision errors vs vLLM/SGLang implementations. Edit applied to modular_deepseek_v3.py; generated modeling files (deepseek_v3, glm4_moe_lite, longcat_flash, youtu) updated via `make fix-repo`. Fixes huggingface#44261 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
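A minimal sketch of the before/after wiring, using a pure-Python stand-in for `DeepseekV3RMSNorm`; the `1e-5` default and the `q_lora_rank`/`kv_lora_rank` sizes in the stand-in config are assumptions for illustration, not the real transformers code:

```python
class DeepseekV3RMSNorm:
    """Stand-in for the real norm; only the eps bookkeeping matters here."""
    def __init__(self, hidden_size, eps=1e-5):  # default eps that caused the bug
        self.hidden_size = hidden_size
        self.eps = eps

class Config:
    rms_norm_eps = 1e-6  # the model-configured value
    q_lora_rank = 1536   # illustrative size
    kv_lora_rank = 512   # illustrative size

config = Config()

# before the fix: eps is omitted, so the default (1e-5) is silently used
q_norm_old = DeepseekV3RMSNorm(config.q_lora_rank)

# after the fix: the config value is forwarded explicitly
q_norm_new = DeepseekV3RMSNorm(config.q_lora_rank, eps=config.rms_norm_eps)
kv_norm_new = DeepseekV3RMSNorm(config.kv_lora_rank, eps=config.rms_norm_eps)

print(q_norm_old.eps, q_norm_new.eps)  # 1e-05 1e-06
```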
[For maintainers] Suggested jobs to run (before merge): run-slow: deepseek_v3, glm4_moe_lite, longcat_flash, youtu
alvinttang
left a comment
The fix is correct: without passing `eps=config.rms_norm_eps`, both `q_a_layernorm` and `kv_a_layernorm` would silently use the RMSNorm default epsilon (1e-5) instead of the model-configured value (1e-6), causing subtle numerical divergence from reference implementations without raising any error. It is also worth noting that the same fix is applied consistently across all derived models (glm4_moe_lite, longcat_flash, youtu) and the modular source, which is the right approach. One open question: does the DeepseekV3RMSNorm default epsilon happen to match the typical config value in some checkpoints, masking this bug in practice? If so, a unit test asserting that the epsilon is correctly propagated would prevent future regressions of this class.
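The regression test suggested above could look roughly like this. `Config`, `RMSNorm`, and `Attention` here are hypothetical stand-ins; a real test would instead instantiate `DeepseekV3Attention` with a `DeepseekV3Config` from `transformers` and check the `eps` attribute on both norms:

```python
class Config:
    rms_norm_eps = 1e-6  # model-configured value

class RMSNorm:
    def __init__(self, hidden_size, eps=1e-5):  # default that masked the bug
        self.eps = eps

class Attention:
    def __init__(self, config):
        # the fix: forward config.rms_norm_eps instead of relying on the default
        self.q_a_layernorm = RMSNorm(128, eps=config.rms_norm_eps)
        self.kv_a_layernorm = RMSNorm(64, eps=config.rms_norm_eps)

def test_eps_propagated():
    attn = Attention(Config())
    assert attn.q_a_layernorm.eps == Config.rms_norm_eps
    assert attn.kv_a_layernorm.eps == Config.rms_norm_eps

test_eps_propagated()
print("eps propagation test passed")
```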
cc @ArthurZucker, do you know if the norms in the Attention are supposed to use the config value as well or not?
The MLA attention layernorms (`q_a_layernorm` and `kv_a_layernorm`) now receive `eps=config.rms_norm_eps`. The same pattern is applied across the derived models (glm4_moe_lite, longcat_flash, youtu) and the modular source, so it stays consistent everywhere.
What does this PR do?
Passes `eps=config.rms_norm_eps` to both `q_a_layernorm` and `kv_a_layernorm` in the DeepseekV3 MLA attention module. Without this, these layernorms default to `eps=1e-5` instead of the config value (1e-6), causing precision differences compared to the vLLM and SGLang implementations.

The fix was applied to `modular_deepseek_v3.py` and propagated to the generated modeling files (`deepseek_v3`, `glm4_moe_lite`, `longcat_flash`, `youtu`) via `make fix-repo`.

Note: DeepseekV2 has the same issue but is left for a separate PR to keep this one focused.
Fixes #44261
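To see why the eps mismatch matters numerically, here is a small self-contained sketch of RMS normalization run with both epsilon values. The input values are arbitrary; small activations are chosen because they amplify the effect of eps:

```python
import math

def rms_norm(xs, eps):
    # Root-mean-square normalization: x / sqrt(mean(x^2) + eps)
    ms = sum(x * x for x in xs) / len(xs)
    scale = 1.0 / math.sqrt(ms + eps)
    return [x * scale for x in xs]

x = [0.01, -0.02, 0.03, 0.005]          # small activations
y_default = rms_norm(x, eps=1e-5)        # what the buggy layernorms used
y_config = rms_norm(x, eps=1e-6)         # what config.rms_norm_eps specifies

max_diff = max(abs(a - b) for a, b in zip(y_default, y_config))
print(max_diff)  # nonzero: outputs diverge purely from the eps mismatch
```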
Before submitting
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a GitHub issue? Link: [Bug/Discussion] MLA q_a_layernorm Missing config.rms_norm_eps, Causing 1e-5/1e-6 Precision Error #44261
Who can review?
@ArthurZucker @Cyrilvallez (text models, attention)
This contribution was developed with AI assistance (Claude Code).