Goal
Enable AutoRound to quantize Nemotron-H family models — in particular nvidia/Nemotron-Cascade-2-30B-A3B — and produce a checkpoint whose decoded output is coherent without requiring launcher-side workarounds.
Target user flow:
from auto_round import AutoRound
AutoRound("nvidia/Nemotron-Cascade-2-30B-A3B", bits=4, group_size=128, sym=True) \
.quantize_and_save(output_dir=..., format="auto_round")
Scope
- Nemotron-H architecture (hybrid Mamba2 SSM + Attention + routed MoE + shared expert)
- auto_round export format, W4A16 weight-only
- CPU + CUDA/XPU calibration paths
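
For readers unfamiliar with the W4A16 settings in the target flow (bits=4, group_size=128, sym=True), the sketch below shows what symmetric, group-wise 4-bit weight quantization means numerically. It uses plain round-to-nearest, not AutoRound's tuned rounding, and the function names are hypothetical, chosen only for illustration.

```python
# Illustrative only: symmetric group-wise weight quantization with
# round-to-nearest. AutoRound tunes the rounding; this sketch does not.

def quantize_group_sym(weights, bits=4):
    """Quantize one group of float weights to signed integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    # Symmetric: no zero point, one scale per group.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate float weights from integers and the group scale."""
    return [v * scale for v in q]


def quantize_row(row, group_size=128, bits=4):
    """Split one weight row into consecutive groups and quantize each."""
    return [
        quantize_group_sym(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]
```

Each group of 128 weights shares a single scale, so the per-weight reconstruction error in this sketch is at most half of that group's scale. Activations stay in 16-bit, which is the "A16" half of W4A16.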
Acceptance
- A bare AutoRound("nvidia/...") call produces a checkpoint that loads and generates coherent text in a downstream inference engine.
- No regressions for other model_type values (non-Nemotron-H models must be unaffected).
- CPU test suite covers registration, post-load behaviour, and export-time precision controls.
Out of scope (follow-ups)
- Pure-Mamba architectures (no attention layers)
- Other Nemotron-H variants beyond the Cascade-2 checkpoint family