
feat: support Nemotron-H / Nemotron-Cascade-2 (#1711) #1712

Draft
michael-rabe wants to merge 7 commits into intel:main from michael-rabe:feat/nemotron-cascade2-1711

Conversation

@michael-rabe

Adds initial support for hybrid Mamba2 + Attention + MoE models via the unfused_moe registry. A bare AutoRound("nvidia/...") call now produces a coherent INT4 checkpoint without launcher-side workarounds.
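
For illustration, the end-to-end call would look roughly like the sketch below; the model id is a placeholder and the 4-bit weight-only default is assumed:

```python
from auto_round import AutoRound

# Placeholder model id for a Nemotron-H / Nemotron-Cascade-2 checkpoint.
ar = AutoRound("nvidia/<nemotron-h-checkpoint>")  # 4-bit weight-only by default
ar.quantize_and_save(output_dir="nemotron-h-int4", format="auto_round")
```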

Core additions:

  • unfused_moe/nemotron_h.py: linear-discoverable MoE block
  • unfused_moe/nemotron_h_setup.py: post-load fixups (Zamba2 group_size, SSM/router FP32 restore)
  • utils/source_tensor_overrides.py: generic source-checkpoint tensor reload utility
  • MODEL_CONFIG["nemotron_h"]: dispatch + upstream-rename preservation
  • compressors/base.py: apply_post_load_fixups hook (non-NH no-op)
  • Export: norm_dtype and scale_dtype per-layer overrides

Documentation:

  • New .claude/skills/adapt-unfused-moe skill covering the pipeline
  • adapt-new-llm slimmed to point at the dedicated skill
  • README / README_CN / step_by_step notes updated

Tests: 89 CPU tests across registration, post-load, export dtype controls, source-tensor overrides, and missing-tensors symmetry.

Closes #1711


Michael Rabe and others added 4 commits April 21, 2026 12:33
Adds initial support for hybrid Mamba2 + Attention + MoE models via the
unfused_moe registry. A bare `AutoRound("nvidia/...")` call now produces
a coherent INT4 checkpoint without launcher-side workarounds.

Core additions:
- `unfused_moe/nemotron_h.py`: linear-discoverable MoE block
- `unfused_moe/nemotron_h_setup.py`: post-load fixups (Zamba2 group_size,
  SSM/router FP32 restore)
- `utils/source_tensor_overrides.py`: generic source-checkpoint tensor
  reload utility
- `MODEL_CONFIG["nemotron_h"]`: dispatch + upstream-rename preservation
- `compressors/base.py`: `apply_post_load_fixups` hook (non-NH no-op)
- Export: `norm_dtype` and `scale_dtype` per-layer overrides

Documentation:
- New `.claude/skills/adapt-unfused-moe` skill covering the pipeline
- `adapt-new-llm` slimmed to point at the dedicated skill
- README / README_CN / step_by_step notes updated

Tests: 89 CPU tests across registration, post-load, export dtype
controls, source-tensor overrides, and missing-tensors symmetry.

Closes intel#1711

Signed-off-by: Michael Rabe <michaelrabe1896@gmail.com>
Unit tests now cover CPU as well as GPU devices (XPU tested). A new test was added for the dtype=FP32 recommendation for the residual opt-in.
Documentation: a recommendation for residual-stream precision (norm_dtype) was added as well.
Signed-off-by: Rabe, Michael (michaelrabe1896@gmail.com)
@michael-rabe michael-rabe force-pushed the feat/nemotron-cascade2-1711 branch from 615e3fe to 69f04b4 Compare April 21, 2026 10:33
@wenhuach21
Contributor

Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.

Review comment thread on docs/step_by_step.md (Outdated)

- **Residual-stream precision (`norm_dtype`):**

Opt-in kwarg on `quantize_and_save` / `save_quantized`. In deep residual architectures — especially hybrid SSM + MoE models such as Nemotron-H — accumulating residuals through BF16 norm outputs can lose precision layer over layer. Passing `norm_dtype="fp32"` exports norm weights in FP32 without touching quantized linears; disk/VRAM cost is <0.1% of a 30B model. Recommended for such hybrid architectures; optional elsewhere. Accepts `"fp16" | "bf16" | "fp32"` (string aliases) or a raw `torch.dtype`. Default (kwarg omitted): norm dtype follows the compute `amp_dtype` (FP16 on GPU/XPU with an FP16 checkpoint, BF16 on CPU/HPU).
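
For illustration, a minimal sketch of the opt-in as this bullet describes it; the model id is a placeholder:

```python
from auto_round import AutoRound

ar = AutoRound("nvidia/<nemotron-h-checkpoint>")  # placeholder model id
# Export norm weights in FP32; quantized linears are untouched. Accepts string
# aliases ("fp16" | "bf16" | "fp32") or a raw torch.dtype.
ar.quantize_and_save("nemotron-h-int4", norm_dtype="fp32")
```
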
Contributor

For this case, we should provide a better solution, similar to the handling of e_score_correction_bias in DeepSeek.

@xin3he is this a regression? I just found that the dtype of https://huggingface.co/Intel/DeepSeek-V3.2-int4-AutoRound/blob/main/model-00059-of-00071.safetensors is bf16.

@michael-rabe
Author

Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.

I'm afraid they are, but I haven't optimized (reduced) much since I got coherent and stable output.
Looking forward to your feedback ;).

It would be very welcome if we could reduce the number of changes; Nemotron models seem to behave quite differently from others and are more sensitive to low dtypes.
I had coherence problems without these changes; the post-load "patches" became necessary to bypass the "standard" AutoRound behaviours without changing too much of the AutoRound core.

@wenhuach21
Contributor

wenhuach21 commented Apr 21, 2026

Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.

I'm afraid they are, but I haven't optimized (reduced) much since I got coherent and stable output. Looking forward to your feedback ;).

It would be very welcome if we could reduce the number of changes; Nemotron models seem to behave quite differently from others and are more sensitive to low dtypes. I had coherence problems without these changes; the post-load "patches" became necessary to bypass the "standard" AutoRound behaviours without changing too much of the AutoRound core.

Thanks for the prompt reply. Could you let me know what indicator led you to change the norm dtype to FP32? I checked the model and noticed that the original norm weights are in BF16. I’m wondering whether modifying the norm dtype in the weights might cause issues with vLLM or SGLang, and whether they would still behave as expected.

For MoE models, I would generally suggest using higher precision for certain parts, for example, 8-bit for critical components (non-moe modules) and 4-bit for others (experts). This is a more common approach when pure 4-bit quantization leads to a significant accuracy drop and avoids the API change.

Please feel free to correct me if I’m mistaken, as I’m not very familiar with this model.

Thanks again for your pr.

@wenhuach21
Contributor

For mixed bits, you could try scheme="int4_mixed".
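
As a rough sketch of that suggestion (model id is a placeholder; the scheme name comes from the comment above, intended to keep critical non-expert modules at higher precision while experts stay at 4 bit):

```python
from auto_round import AutoRound

# Placeholder model id; scheme per the mixed-bits suggestion above.
ar = AutoRound("nvidia/<nemotron-h-checkpoint>", scheme="int4_mixed")
ar.quantize_and_save(output_dir="nemotron-h-int4-mixed")
```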

@michael-rabe
Author

michael-rabe commented Apr 21, 2026

Thanks for the PR. Are all these file changes necessary? It seems to modify many unrelated files.

I'm afraid they are, but I haven't optimized (reduced) much since I got coherent and stable output. Looking forward to your feedback ;).
It would be very welcome if we could reduce the number of changes; Nemotron models seem to behave quite differently from others and are more sensitive to low dtypes. I had coherence problems without these changes; the post-load "patches" became necessary to bypass the "standard" AutoRound behaviours without changing too much of the AutoRound core.

Thanks for the prompt reply. Could you let me know what indicator led you to change the norm dtype to FP32? I checked the model and noticed that the original norm weights are in BF16. I’m wondering whether modifying the norm dtype in the weights might cause issues with vLLM or SGLang, and whether they would still behave as expected.

For MoE models, I would generally suggest using higher precision for certain parts, for example, 8-bit for critical components (non-moe modules) and 4-bit for others (experts). This is a more common approach when pure 4-bit quantization leads to a significant accuracy drop and avoids the API change.

Please feel free to correct me if I’m mistaken, as I’m not very familiar with this model.

Thanks again for your pr.

Ah, I missed the proper documentation (I'll fix that tonight).
Thanks for the valuable suggestions; I'm following up on them and have set the PR to "draft" status.

You're right that the original norm weights are BF16, and BF16 is numerically fine for the norm weights themselves — upcasting them to FP32 is lossless but gains nothing at the weight level.

The purpose of norm_dtype="fp32" is different: it's a lever for the residual stream, not for the norm weights. In HuggingFace-style RMSNorm the norm output's dtype follows the weight's dtype, so storing norm weights as FP32 pulls the post-norm tensor — and via the residual add, the residual stream itself — into FP32 at runtime. Over 50+ layers the residual is a sum of many block outputs, and keeping that accumulation in FP32 reduces BF16 rounding drift.
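
To make the mechanism concrete, here is a minimal sketch with plain tensors rather than the actual HF module, assuming the usual HF RMSNorm forward (variance in FP32, result cast back to the input dtype, then multiplied by the weight):

```python
import torch

hidden = torch.randn(4, 8, dtype=torch.bfloat16)   # residual-stream activation (BF16)
weight = torch.ones(8, dtype=torch.float32)        # norm weight exported as FP32

# HF-style RMSNorm: variance computed in FP32, normed result cast back to the
# input dtype, then multiplied by the weight.
variance = hidden.to(torch.float32).pow(2).mean(-1, keepdim=True)
normed = hidden.to(torch.float32) * torch.rsqrt(variance + 1e-6)
out = weight * normed.to(hidden.dtype)             # BF16 * FP32 promotes to FP32

print(out.dtype)             # torch.float32 -> norm output follows the weight dtype
print((hidden + out).dtype)  # torch.float32 -> the residual add is pulled up as well
```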

Two caveats: this only helps on engines that honor the stored norm-weight dtype (HF transformers does, fused-kernel runtimes like vLLM/SGLang often don't). And the Nemotron-H coherence fix itself was not norm_dtype — it was the always-on post-load restore of the SSM core tensors and the router correction bias to FP32, which are the tensors with genuine FP32 requirements. norm_dtype is orthogonal and optional.
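
For reference, the post-load restore boils down to something like the sketch below; the tensor-name suffixes correspond to the SSM core tensors and router correction bias mentioned above, while the traversal itself is only illustrative, not the actual unfused_moe hook:

```python
import torch

# Suffixes of the tensors with genuine FP32 requirements in Nemotron-H.
_FP32_SUFFIXES = ("A_log", "dt_bias", "e_score_correction_bias")

def restore_fp32_sensitive_tensors(model: torch.nn.Module) -> None:
    """Upcast SSM core and router-bias tensors back to FP32 after loading."""
    for name, param in model.named_parameters():
        if name.endswith(_FP32_SUFFIXES):
            param.data = param.data.to(torch.float32)
```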

So: FP32 norm weights aren't about upgrading the norm weights themselves — they're an export-side lever for residual-stream precision on compliant inference stacks.

Please correct me if I'm wrong. I'm new to the club and still in the learning phase ;).
I appreciate your questions.
Sorry for the wall of text.

@michael-rabe michael-rabe marked this pull request as draft April 21, 2026 17:21
@michael-rabe
Author

Making progress; the comments were helpful and I can reduce the number of changes.
Still investigating, as I'm not yet happy with the results (accuracy) and want to make sure the issue isn't in the quantization itself.

Michael Rabe and others added 3 commits April 24, 2026 18:36
Schema conformance — nemotron_h.py now matches the deepseek_v3 / glm_moe
pattern used by the other unfused-MoE siblings:
- remove module docstring and `from __future__ import annotations`
- remove unused `layer_idx` constructor parameter
- extract per-expert loop into `experts_forward()` helper
- fold public API (`apply_nemotron_h_post_load`,
  `nemotron_h_default_layer_config_patterns`) from `nemotron_h_setup.py`
  into `nemotron_h.py`; setup module now holds only private helpers
- update MODEL_CONFIG paths and test imports accordingly

Consolidation:
- extract `_resolve_registered_fn()` in `unfused_moe/__init__.py` to
  DRY up `apply_post_load_fixups` and `get_default_layer_config_patterns`
  (−76 lines, identical external API)
- remove `auto_round/utils/source_tensor_overrides.py`; the sole
  consumer (`nemotron_h_setup.py`) inlines it as a private helper
- consolidate three NH test files (`test_nemotron_h_post_load.py`,
  `test_nemotron_h_registration.py`, `test_source_tensor_overrides.py`)
  into a single `test/test_cpu/models/test_nemotron_h.py` covering
  all 20 scenarios

Docs:
- compress `norm_dtype` hyperparameter bullet in `step_by_step.md` to
  match the surrounding bullet style (AdamW, quantized lm-head, etc.)
- remove per-model "Verifying a quantized Nemotron-H checkpoint"
  section — no other model has troubleshooting content in this guide
- trim the Known-Issues reference to an internal dev-only skill

Misc:
- `.gitignore`: exclude local evaluation scripts

All 90 tests green: test_nemotron_h, test_norm_dtype, test_scale_dtype,
test_missing_tensors.

Signed-off-by: Rabe, Michael <michaelrabe1896@gmail.com>
…simplicity.

norm_dtype (quantize_and_save/save_quantized), _cast_norm_modules,
and the ShardWriter upcast path are removed — NH residual precision
is handled automatically by apply_post_load_fixups (A_log, dt_bias,
e_score_correction_bias → FP32 via the unfused_moe post-load hook).

Also adds layer_idx: int | None = None to LinearNemotronHMoE.__init__
for compatibility with transformers 5.3.0, which now passes layer_idx
to all MIXER_TYPES constructors.

Signed-off-by: Rabe, Michael (michaelrabe1896@gmail.com)
