[GGUF] Add support for Qwen3.5 MoE (qwen35moe arch) #45668
Open
lucaspirola wants to merge 3 commits into huggingface:main
Conversation
added 3 commits
April 28, 2026 00:09
Qwen3.5 MoE GGUF files identify their architecture as "qwen35moe"
(matches gguf-py>=0.18.0's MODEL_ARCH.QWEN35MOE). Without an entry
here, loading a Qwen3.5 MoE GGUF dies with:
RuntimeError: Architecture qwen35moe not supported
Wire up four spots so Qwen3_5MoeForCausalLM can load a "qwen35moe"
GGUF end to end (a condensed sketch follows the list):
* integrations/ggml.py: GGUF_CONFIG_MAPPING["qwen3_5_moe_text"] for
the metadata-to-config translation. Includes feed_forward_length
for the non-MoE layers in the hybrid stack and the regular
expert_*_length keys.
* integrations/ggml.py: GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] =
GGUFQwen2Converter (Qwen3.5 reuses the qwen2/qwen3 BPE convention).
* integrations/ggml.py: GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"]
with norm_topk_prob=True. Same trap as qwen3_moe; llama.cpp's
qwen35moe.cpp normalizes routed expert weights, so the HF default
has to be overridden to keep routing math consistent.
* modeling_gguf_pytorch_utils.py: TENSOR_PROCESSORS["qwen35moe"] =
Qwen2MoeTensorProcessor for the fused 3-D ffn_*_exps splitting
into per-expert {gate,up,down}_proj. Without this, MoE expert
weights silently fall through to the default processor and aren't
sliced. Plus the matching qwen3_5_moe_text -> qwen35moe alias in
get_gguf_hf_weights_map.
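A condensed sketch of the four registrations. The HF-side field names shown here are illustrative (the exact metadata block is in the diff); the registry names and classes are the ones listed above.

```python
# src/transformers/integrations/ggml.py -- illustrative excerpt; only the
# shape of the entries is shown, most metadata keys are elided.
GGUF_CONFIG_MAPPING["qwen3_5_moe_text"] = {
    "block_count": "num_hidden_layers",
    "feed_forward_length": "intermediate_size",  # non-MoE layers in the hybrid stack
    "expert_count": "num_experts",
    "expert_used_count": "num_experts_per_tok",
    "expert_feed_forward_length": "moe_intermediate_size",
    # ... standard attention / rope / norm keys elided ...
}
GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] = GGUFQwen2Converter
GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"] = {"norm_topk_prob": True}

# src/transformers/modeling_gguf_pytorch_utils.py
TENSOR_PROCESSORS["qwen35moe"] = Qwen2MoeTensorProcessor
```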
Text-only target (qwen3_5_moe_text, not qwen3_5_moe) is intentional:
Qwen3_5MoeForCausalLM is backed by Qwen3_5MoeTextConfig, and Qwen3.5
MoE GGUF distributions ship as text-only; vision weights, when
present, ride in a co-located mmproj-*.gguf.
Adds a smoke test in tests/quantization/ggml/test_ggml.py marked
@unittest.skip because the only public Qwen3.5 MoE GGUF (~12.7 GB)
is too large for routine CI. Maintainers with a smaller fixture in
hand can drop the skip.
Signed-off-by: Lucas <lucas@eliteaero.com.br>
Follow-up to 73aa1cb. Map the remaining qwen35moe metadata that
convert_hf_to_gguf actually writes (via Qwen3NextModel.set_gguf_parameters),
sketched just below:
* full_attention_interval -> kwarg consumed by
  Qwen3_5MoeTextConfig.__post_init__ to derive the hybrid layer_types list.
  Without this, layer_types silently falls back to the default interval of 4.
  The 35B-A3B target happens to use 4, but any future GGUF with a different
  cadence would load with the wrong attention pattern.
* ssm.conv_kernel -> linear_conv_kernel_dim
* ssm.state_size -> linear_key_head_dim
* ssm.group_count -> linear_num_key_heads
* ssm.time_step_rank -> linear_num_value_heads
ssm.inner_size is derived (linear_value_head_dim * linear_num_value_heads)
and has no direct config field, so it's mapped to None;
linear_value_head_dim falls back to its config default, which matches the
writer's contract.
The keys flagged by review but not actually emitted by the qwen35moe
converter path (rope.scaling.*, attention.sliding_window) are left unmapped
on purpose; adding them now would be speculative.
Signed-off-by: Lucas <lucas@eliteaero.com.br>
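A sketch of how these additions land, following the existing `{gguf_metadata_key: hf_config_kwarg}` convention in ggml.py (shown as an `update` for brevity; in the diff they are part of the dict literal):

```python
# None means the key is recognized but has no direct config field.
GGUF_CONFIG_MAPPING["qwen3_5_moe_text"].update({
    "full_attention_interval": "full_attention_interval",
    "ssm.conv_kernel": "linear_conv_kernel_dim",
    "ssm.state_size": "linear_key_head_dim",
    "ssm.group_count": "linear_num_key_heads",
    "ssm.time_step_rank": "linear_num_value_heads",
    "ssm.inner_size": None,  # derived: linear_value_head_dim * linear_num_value_heads
})
```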
…size
Closes the one remaining silent-default in the qwen35moe load path. The
convert_hf_to_gguf writer doesn't emit linear_value_head_dim as its own KV;
it only emits ssm.inner_size, which equals
linear_value_head_dim * linear_num_value_heads. Without recovery,
linear_value_head_dim falls back to the Qwen3_5MoeTextConfig default (128).
That happens to match the 35B-A3B target, but any GGUF where the writer's
contract holds with a different per-head dim would load silently
misconfigured. Mirrors the per-arch post-processing pattern already used
for lfm2, gpt_oss, minimax_m2, and gemma3 (sketched below).
Signed-off-by: Lucas <lucas@eliteaero.com.br>
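A minimal sketch of the recovery hook. The surrounding variable names (`architecture`, `gguf_metadata`, `config`) are hypothetical; the real hook follows the per-arch post-processing pattern in modeling_gguf_pytorch_utils.py.

```python
# Recover the per-head dim from the product KV the writer emits.
if architecture == "qwen35moe":
    inner_size = gguf_metadata.get("qwen35moe.ssm.inner_size")  # product KV
    num_value_heads = config.get("linear_num_value_heads")
    if inner_size and num_value_heads:
        # writer contract: ssm.inner_size == linear_value_head_dim * linear_num_value_heads
        config["linear_value_head_dim"] = inner_size // num_value_heads
```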
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: ggml
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45668&sha=a29bda
What does this PR do?
Currently, GGUF versions of Qwen3.5 MoE models raise
`GGUF model with architecture qwen35moe is not supported yet.` This PR resolves that.
The arch is the one llama.cpp's `convert_hf_to_gguf.py` writes for
`Qwen3_5MoeForCausalLM` / `Qwen3_5MoeForConditionalGeneration` (see
`MODEL_ARCH.QWEN35MOE` in `gguf-py>=0.18.0`). Loading routes to the
text-only `Qwen3_5MoeTextConfig` so `Qwen3_5MoeForCausalLM` gets the
matching config; vision weights, when present, ride in a co-located
`mmproj-*.gguf`.
Changes
`src/transformers/integrations/ggml.py`
* `GGUF_CONFIG_MAPPING["qwen3_5_moe_text"]` covering the qwen35moe
  metadata block: standard transformer fields, the MoE-specific
  `expert_*_length` keys, the hybrid-pattern `full_attention_interval`,
  and the SSM block (`ssm.conv_kernel`, `ssm.state_size`,
  `ssm.group_count`, `ssm.time_step_rank`) for the GatedDeltaNet
  linear-attention layers. `ssm.inner_size` is mapped to `None` (it's a
  derived value; recovered via post-processing, see below).
* `GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"]` with
  `norm_topk_prob: True`; same trap as qwen3_moe. llama.cpp normalizes
  routed expert weights by default while transformers defaults to
  `False`, so the override is needed to keep routing math consistent
  (see Override Transformers defaults by GGUF defaults #42770 for the
  qwen3_moe precedent; a minimal routing sketch follows this list).
* `GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] = GGUFQwen2Converter`
  (Qwen3.5 reuses the qwen2/qwen3 BPE tokenizer convention).
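For intuition, a minimal sketch of the routing behaviour `norm_topk_prob` controls (not the actual modeling code):

```python
import torch

def route(router_logits: torch.Tensor, top_k: int, norm_topk_prob: bool):
    # Pick the top-k experts per token from the softmaxed router scores.
    probs = torch.softmax(router_logits, dim=-1)
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)
    if norm_topk_prob:
        # llama.cpp's qwen35moe path: renormalize so routed weights sum to 1.
        weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, expert_ids
```

With the override, the HF side mixes expert outputs with the same renormalized weights llama.cpp uses, so both runtimes produce consistent routing math for the same GGUF.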
`src/transformers/modeling_gguf_pytorch_utils.py`
* Arch alias `qwen35moe` → `qwen3_5_moe_text`.
* `TENSOR_PROCESSORS["qwen35moe"] = Qwen2MoeTensorProcessor` plus the
  matching alias in `get_gguf_hf_weights_map`. Without this, the fused
  3-D `ffn_{gate,up,down}_exps` tensors silently fall through to the
  default processor and aren't sliced into per-expert
  `{gate,up,down}_proj` (conceptual sketch after this list).
* Recovery of `linear_value_head_dim` from
  `ssm.inner_size / linear_num_value_heads`. The writer doesn't emit
  `linear_value_head_dim` directly (it only emits the product via
  `ssm.inner_size`), so without recovery it would silently default
  to 128. Mirrors the per-arch hooks already used for `lfm2`,
  `gpt_oss`, `minimax_m2`, and `gemma3`.
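Conceptually, what the tensor processor does with a fused expert tensor (a sketch, not the `Qwen2MoeTensorProcessor` source; names illustrative):

```python
import numpy as np

def split_fused_experts(prefix: str, fused: np.ndarray) -> dict[str, np.ndarray]:
    # A fused ffn_gate_exps tensor has shape (num_experts, out_dim, in_dim);
    # slice the leading axis into one 2-D weight matrix per expert.
    return {f"{prefix}.{i}.weight": fused[i] for i in range(fused.shape[0])}
```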
Testing
A smoke test is added in `tests/quantization/ggml/test_ggml.py` under
`test_qwen35moe_iq3_s`, but marked `@unittest.skip` because the only
public Qwen3.5 MoE GGUF (`unsloth/Qwen3.6-35B-A3B-GGUF`, ~12.7 GB for
the IQ3_S quant) is too large for routine CI. Same approach used for
other large MoE models like Qwen3-30B-A3B in #42854 and MiniMax-M2.1
in #44526. Maintainers with a smaller fixture in hand can drop the
skip. A condensed version of the test is sketched below.
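The shape of the skipped test, condensed (the GGUF filename here is illustrative; see the test file for the exact fixture):

```python
import unittest
from transformers import AutoModelForCausalLM, AutoTokenizer

class GgufQwen35MoeTest(unittest.TestCase):
    @unittest.skip("~12.7 GB fixture is too large for routine CI")
    def test_qwen35moe_iq3_s(self):
        repo = "unsloth/Qwen3.6-35B-A3B-GGUF"
        gguf_file = "Qwen3.6-35B-A3B-IQ3_S.gguf"  # filename illustrative
        tokenizer = AutoTokenizer.from_pretrained(repo, gguf_file=gguf_file)
        model = AutoModelForCausalLM.from_pretrained(repo, gguf_file=gguf_file)
        inputs = tokenizer("The capital of France is", return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=8)
        self.assertTrue(tokenizer.decode(out[0], skip_special_tokens=True))
```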
Verified end-to-end locally by loading the IQ3_S quant of the 35B-A3B
model via `AutoModelForCausalLM.from_pretrained(..., gguf_file=...)`
and confirming generation produces sensible output.
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link
  to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the
  documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@SunMarc @MekkCyber @Cyrilvallez