feat: support Nemotron-H / Nemotron-Cascade-2 (#1711) #1712
michael-rabe wants to merge 7 commits into intel:main
Conversation
Adds initial support for hybrid Mamba2 + Attention + MoE models via the
unfused_moe registry. A bare `AutoRound("nvidia/...")` call now produces
a coherent INT4 checkpoint without launcher-side workarounds.
Core additions:
- `unfused_moe/nemotron_h.py`: linear-discoverable MoE block
- `unfused_moe/nemotron_h_setup.py`: post-load fixups (Zamba2 group_size,
SSM/router FP32 restore)
- `utils/source_tensor_overrides.py`: generic source-checkpoint tensor
reload utility
- `MODEL_CONFIG["nemotron_h"]`: dispatch + upstream-rename preservation
- `compressors/base.py`: `apply_post_load_fixups` hook (non-NH no-op; see the sketch after this list)
- Export: `norm_dtype` and `scale_dtype` per-layer overrides
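A hedged sketch of what the post-load fixups amount to for Nemotron-H. The function name `apply_nemotron_h_post_load` and the tensor suffixes come from later commits and comments in this thread; the real implementation in `nemotron_h_setup.py` also handles the Zamba2 group_size and source-checkpoint reload details that this sketch omits:

```python
import torch

# Suffixes with genuine FP32 requirements per the review discussion below:
# the SSM core tensors (A_log, dt_bias) and the router correction bias.
_FP32_SUFFIXES = ("A_log", "dt_bias", "e_score_correction_bias")

def apply_nemotron_h_post_load(model):
    """Restore precision-critical NH tensors to FP32 after checkpoint load."""
    for name, param in model.named_parameters():
        if name.endswith(_FP32_SUFFIXES):
            param.data = param.data.to(torch.float32)
    return model
```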
Documentation:
- New `.claude/skills/adapt-unfused-moe` skill covering the pipeline
- `adapt-new-llm` slimmed to point at the dedicated skill
- README / README_CN / step_by_step notes updated
Tests: 89 CPU tests across registration, post-load, export dtype
controls, source-tensor overrides, and missing-tensors symmetry.
Closes intel#1711
Signed-off-by: Michael Rabe <michaelrabe1896@gmail.com>
Unit tests now cover CPU as well as GPU devices (XPU tested). New test added for the dtype=FP32 recommendation for the residual opt-in. Documentation: recommendation for residual-stream precision (norm_dtype) added as well. Signed-off-by: Rabe, Michael <michaelrabe1896@gmail.com>
Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.
> - **Residual-stream precision (`norm_dtype`):** Opt-in kwarg on `quantize_and_save` / `save_quantized`. In deep residual architectures — especially hybrid SSM + MoE models such as Nemotron-H — accumulating residuals through BF16 norm outputs can lose precision layer over layer. Passing `norm_dtype="fp32"` exports norm weights in FP32 without touching quantized linears; disk/VRAM cost is <0.1% of a 30B model. Recommended for such hybrid architectures; optional elsewhere. Accepts `"fp16" | "bf16" | "fp32"` (string aliases) or a raw `torch.dtype`. Default (kwarg omitted): norm dtype follows the compute `amp_dtype` (FP16 on GPU/XPU with an FP16 checkpoint, BF16 on CPU/HPU).
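A usage sketch of the opt-in as documented at this point in the review (note that a later commit in this thread removes the kwarg again):

```python
import torch
from auto_round import AutoRound

ar = AutoRound("nvidia/<nemotron-h-checkpoint>", scheme="W4A16")

# String alias form, as documented in the quoted bullet above.
ar.quantize_and_save("./nemotron-h-int4", norm_dtype="fp32")

# A raw torch.dtype is documented as an accepted equivalent:
# ar.save_quantized("./nemotron-h-int4", norm_dtype=torch.float32)
```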
For this case, we should provide a better solution, similar to the handling of `e_score_correction_bias` in DeepSeek.
@xin3he is it a regression? I just found that the dtype of https://huggingface.co/Intel/DeepSeek-V3.2-int4-AutoRound/blob/main/model-00059-of-00071.safetensors is bf16.
I'm afraid it is, but I haven't optimized (reduced) much after I got coherent and stable output. Very welcome if we can reduce the number of changes; Nemotron models seem to behave quite differently from others and are more sensitive to low-precision dtypes.
Thanks for the prompt reply. Could you let me know what indicator led you to change the norm dtype to FP32? I checked the model and noticed that the original norm weights are in BF16. I'm wondering whether modifying the norm dtype in the weights might cause issues with vLLM or SGLang, and whether they would still behave as expected.

For MoE models, I would generally suggest using higher precision for certain parts, for example, 8-bit for critical components (non-MoE modules) and 4-bit for others (experts). This is a more common approach when pure 4-bit quantization leads to a significant accuracy drop, and it avoids the API change. Please feel free to correct me if I'm mistaken, as I'm not very familiar with this model. Thanks again for your PR.
For mixed bits, you could try `scheme="int4_mixed"`.
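A hedged sketch of the two suggestions combined. The `layer_config` keys below are guesses at how non-expert modules might be named, not verified against Nemotron-H:

```python
from auto_round import AutoRound

# Option 1: the predefined mixed scheme mentioned above.
ar = AutoRound("nvidia/<nemotron-h-checkpoint>", scheme="int4_mixed")

# Option 2: explicit per-layer overrides, i.e. 8-bit for critical non-MoE
# modules and 4-bit (the base scheme) for the experts. Module names here
# are hypothetical placeholders.
layer_config = {"mixer.router": {"bits": 8}, "mixer.in_proj": {"bits": 8}}
ar = AutoRound("nvidia/<nemotron-h-checkpoint>", scheme="W4A16",
               layer_config=layer_config)
```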
Ah, I missed the proper documentation (I'll fix that tonight). You're right that the original norm weights are BF16, and BF16 is numerically fine for the norm weights themselves — upcasting them to FP32 is lossless but gains nothing at the weight level.

The purpose of `norm_dtype="fp32"` is different: it's a lever for the residual stream, not for the norm weights. In HuggingFace-style RMSNorm the norm output's dtype follows the weight's dtype, so storing norm weights as FP32 pulls the post-norm tensor — and via the residual add, the residual stream itself — into FP32 at runtime. Over 50+ layers the residual is a sum of many block outputs, and keeping that accumulation in FP32 reduces BF16 rounding drift.

Two caveats: this only helps on engines that honor the stored norm-weight dtype (HF transformers does; fused-kernel runtimes like vLLM/SGLang often don't). And the Nemotron-H coherence fix itself was not `norm_dtype` — it was the always-on post-load restore of the SSM core tensors and the router correction bias to FP32, which are the tensors with genuine FP32 requirements. `norm_dtype` is orthogonal and optional.

So: FP32 norm weights aren't about upgrading the norm weights themselves — they're an export-side lever for residual-stream precision on compliant inference stacks. Please correct me if I'm wrong. I'm new to the club and still in the learning phase ;).
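To make the dtype-promotion argument concrete, here is a minimal paraphrase of the HF-style RMSNorm pattern referenced above. It is modeled on `LlamaRMSNorm` and relatives, not the Nemotron-H source verbatim:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        input_dtype = x.dtype
        x = x.to(torch.float32)
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        # Type promotion: an FP32 weight times a BF16 activation yields an
        # FP32 output, so the post-norm tensor (and the residual add that
        # consumes it) follows the stored weight dtype.
        return self.weight * x.to(input_dtype)

x = torch.randn(2, 8, dtype=torch.bfloat16)
print(RMSNorm(8).to(torch.float32)(x).dtype)   # torch.float32
print(RMSNorm(8).to(torch.bfloat16)(x).dtype)  # torch.bfloat16
```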
Progressing; the comments were helpful and will let me reduce the number of changes.
Schema conformance — `nemotron_h.py` now matches the deepseek_v3 / glm_moe pattern used by the other unfused-MoE siblings:
- remove module docstring and `from __future__ import annotations`
- remove unused `layer_idx` constructor parameter
- extract per-expert loop into `experts_forward()` helper
- fold public API (`apply_nemotron_h_post_load`, `nemotron_h_default_layer_config_patterns`) from `nemotron_h_setup.py` into `nemotron_h.py`; setup module now holds only private helpers
- update MODEL_CONFIG paths and test imports accordingly

Consolidation:
- extract `_resolve_registered_fn()` in `unfused_moe/__init__.py` to DRY up `apply_post_load_fixups` and `get_default_layer_config_patterns` (−76 lines, identical external API)
- remove `auto_round/utils/source_tensor_overrides.py`; the sole consumer (`nemotron_h_setup.py`) inlines it as a private helper
- consolidate three NH test files (`test_nemotron_h_post_load.py`, `test_nemotron_h_registration.py`, `test_source_tensor_overrides.py`) into a single `test/test_cpu/models/test_nemotron_h.py` covering all 20 scenarios

Docs:
- compress `norm_dtype` hyperparameter bullet in `step_by_step.md` to match the surrounding bullet style (AdamW, quantized lm-head, etc.)
- remove per-model "Verifying a quantized Nemotron-H checkpoint" section — no other model has troubleshooting content in this guide
- trim the Known-Issues reference to an internal dev-only skill

Misc:
- `.gitignore`: exclude local evaluation scripts

All 90 tests green: test_nemotron_h, test_norm_dtype, test_scale_dtype, test_missing_tensors.

Signed-off-by: Rabe, Michael <michaelrabe1896@gmail.com>
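A hedged sketch of the `_resolve_registered_fn()` consolidation described in that commit; the registry layout and slot names are assumptions, and only the DRY shape is the point:

```python
from typing import Callable, Optional

# Assumed registry shape: model_type -> {slot_name -> hook}.
_UNFUSED_MOE_REGISTRY: dict[str, dict[str, Callable]] = {}

def _resolve_registered_fn(model_type: str, slot: str) -> Optional[Callable]:
    """Single lookup path shared by both public entry points."""
    return _UNFUSED_MOE_REGISTRY.get(model_type, {}).get(slot)

def apply_post_load_fixups(model, model_type: str):
    fn = _resolve_registered_fn(model_type, "post_load_fixups")
    return fn(model) if fn is not None else model  # no-op for other models

def get_default_layer_config_patterns(model_type: str):
    fn = _resolve_registered_fn(model_type, "default_layer_config_patterns")
    return fn() if fn is not None else {}
```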
…simplicity. `norm_dtype` (`quantize_and_save` / `save_quantized`), `_cast_norm_modules`, and the ShardWriter upcast path are removed — NH residual precision is handled automatically by `apply_post_load_fixups` (`A_log`, `dt_bias`, `e_score_correction_bias` → FP32 via the unfused_moe post-load hook).

Also adds `layer_idx: int | None = None` to `LinearNemotronHMoE.__init__` for compatibility with transformers 5.3.0, which now passes `layer_idx` to all MIXER_TYPES constructors.

Signed-off-by: Rabe, Michael <michaelrabe1896@gmail.com>
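A hedged sketch of that compatibility change; the constructor body is elided, and the real class lives in `unfused_moe/nemotron_h.py`:

```python
import torch.nn as nn

class LinearNemotronHMoE(nn.Module):
    # layer_idx is accepted (and unused here) so that transformers 5.3.0+,
    # which passes layer_idx to every MIXER_TYPES constructor, can build
    # this block without raising a TypeError. Body elided in this sketch.
    def __init__(self, config, layer_idx: int | None = None):
        super().__init__()
        self.config = config
```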