[GGUF] Add support for Qwen3.5 MoE (qwen35moe arch) #45668
Open
lucaspirola wants to merge 3 commits into huggingface:main
Conversation
added 3 commits
April 28, 2026 00:09
Qwen3.5 MoE GGUF files identify their architecture as "qwen35moe"
(matches gguf-py>=0.18.0's MODEL_ARCH.QWEN35MOE). Without an entry
here, loading a Qwen3.5 MoE GGUF dies with:
RuntimeError: Architecture qwen35moe not supported
Wire up four spots so Qwen3_5MoeForCausalLM can load a "qwen35moe"
GGUF end to end (a condensed sketch follows the list):
* integrations/ggml.py: GGUF_CONFIG_MAPPING["qwen3_5_moe_text"] for
the metadata-to-config translation. Includes feed_forward_length
for the non-MoE layers in the hybrid stack and the regular
expert_*_length keys.
* integrations/ggml.py: GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] =
GGUFQwen2Converter (Qwen3.5 reuses the qwen2/qwen3 BPE convention).
* integrations/ggml.py: GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"]
with norm_topk_prob=True. Same trap as qwen3_moe; llama.cpp's
qwen35moe.cpp normalizes routed expert weights, so the HF default
has to be overridden to keep routing math consistent.
* modeling_gguf_pytorch_utils.py: TENSOR_PROCESSORS["qwen35moe"] =
Qwen2MoeTensorProcessor for the fused 3-D ffn_*_exps splitting
into per-expert {gate,up,down}_proj. Without this, MoE expert
weights silently fall through to the default processor and aren't
sliced. Plus the matching qwen3_5_moe_text -> qwen35moe alias in
get_gguf_hf_weights_map.
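A condensed sketch of the four registrations. The HF-side field names shown here are illustrative (the exact metadata block is in the diff); the registry names and classes are the ones listed above.

```python
# src/transformers/integrations/ggml.py -- illustrative excerpt; only the
# shape of the entries is shown, most metadata keys are elided.
GGUF_CONFIG_MAPPING["qwen3_5_moe_text"] = {
    "block_count": "num_hidden_layers",
    "feed_forward_length": "intermediate_size",  # non-MoE layers in the hybrid stack
    "expert_count": "num_experts",
    "expert_used_count": "num_experts_per_tok",
    "expert_feed_forward_length": "moe_intermediate_size",
    # ... standard attention / rope / norm keys elided ...
}
GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] = GGUFQwen2Converter
GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"] = {"norm_topk_prob": True}

# src/transformers/modeling_gguf_pytorch_utils.py
TENSOR_PROCESSORS["qwen35moe"] = Qwen2MoeTensorProcessor
```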
Text-only target (qwen3_5_moe_text, not qwen3_5_moe) is intentional:
Qwen3_5MoeForCausalLM is backed by Qwen3_5MoeTextConfig, and Qwen3.5
MoE GGUF distributions ship as text-only; vision weights, when
present, ride in a co-located mmproj-*.gguf.
Adds a smoke test in tests/quantization/ggml/test_ggml.py marked
@unittest.skip because the only public Qwen3.5 MoE GGUF (~12.7 GB)
is too large for routine CI. Maintainers with a smaller fixture in
hand can drop the skip.
Signed-off-by: Lucas <lucas@eliteaero.com.br>
Follow-up to 73aa1cb. Map the remaining qwen35moe metadata that
convert_hf_to_gguf actually writes (via Qwen3NextModel.set_gguf_parameters),
sketched just below:
* full_attention_interval -> kwarg consumed by
  Qwen3_5MoeTextConfig.__post_init__ to derive the hybrid layer_types list.
  Without this, layer_types silently falls back to the default interval of 4.
  The 35B-A3B target happens to use 4, but any future GGUF with a different
  cadence would load with the wrong attention pattern.
* ssm.conv_kernel -> linear_conv_kernel_dim
* ssm.state_size -> linear_key_head_dim
* ssm.group_count -> linear_num_key_heads
* ssm.time_step_rank -> linear_num_value_heads
ssm.inner_size is derived (linear_value_head_dim * linear_num_value_heads)
and has no direct config field, so it's mapped to None;
linear_value_head_dim falls back to its config default, which matches the
writer's contract.
The keys flagged by review but not actually emitted by the qwen35moe
converter path (rope.scaling.*, attention.sliding_window) are left unmapped
on purpose; adding them now would be speculative.
Signed-off-by: Lucas <lucas@eliteaero.com.br>
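A sketch of how these additions land, following the existing `{gguf_metadata_key: hf_config_kwarg}` convention in ggml.py (shown as an `update` for brevity; in the diff they are part of the dict literal):

```python
# None means the key is recognized but has no direct config field.
GGUF_CONFIG_MAPPING["qwen3_5_moe_text"].update({
    "full_attention_interval": "full_attention_interval",
    "ssm.conv_kernel": "linear_conv_kernel_dim",
    "ssm.state_size": "linear_key_head_dim",
    "ssm.group_count": "linear_num_key_heads",
    "ssm.time_step_rank": "linear_num_value_heads",
    "ssm.inner_size": None,  # derived: linear_value_head_dim * linear_num_value_heads
})
```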
…size
Closes the one remaining silent-default in the qwen35moe load path. The
convert_hf_to_gguf writer doesn't emit linear_value_head_dim as its own KV;
it only emits ssm.inner_size, which equals
linear_value_head_dim * linear_num_value_heads. Without recovery,
linear_value_head_dim falls back to the Qwen3_5MoeTextConfig default (128).
That happens to match the 35B-A3B target, but any GGUF where the writer's
contract holds with a different per-head dim would load silently
misconfigured. Mirrors the per-arch post-processing pattern already used
for lfm2, gpt_oss, minimax_m2, and gemma3 (sketched below).
Signed-off-by: Lucas <lucas@eliteaero.com.br>
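A minimal sketch of the recovery hook. The surrounding variable names (`architecture`, `gguf_metadata`, `config`) are hypothetical; the real hook follows the per-arch post-processing pattern in modeling_gguf_pytorch_utils.py.

```python
# Recover the per-head dim from the product KV the writer emits.
if architecture == "qwen35moe":
    inner_size = gguf_metadata.get("qwen35moe.ssm.inner_size")  # product KV
    num_value_heads = config.get("linear_num_value_heads")
    if inner_size and num_value_heads:
        # writer contract: ssm.inner_size == linear_value_head_dim * linear_num_value_heads
        config["linear_value_head_dim"] = inner_size // num_value_heads
```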
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: ggml
Contributor
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45668&sha=a29bda
What does this PR do?
Currently, GGUF versions of Qwen3.5 MoE models raise
`GGUF model with architecture qwen35moe is not supported yet.` This PR resolves that.
The arch is the one llama.cpp's `convert_hf_to_gguf.py` writes for
`Qwen3_5MoeForCausalLM` / `Qwen3_5MoeForConditionalGeneration` (see
`MODEL_ARCH.QWEN35MOE` in `gguf-py>=0.18.0`). Loading routes to the
text-only `Qwen3_5MoeTextConfig` so `Qwen3_5MoeForCausalLM` gets the
matching config; vision weights, when present, ride in a co-located
`mmproj-*.gguf`.
Changes
`src/transformers/integrations/ggml.py`
* `GGUF_CONFIG_MAPPING["qwen3_5_moe_text"]` covering the qwen35moe
  metadata block: standard transformer fields, the MoE-specific
  `expert_*_length` keys, the hybrid-pattern `full_attention_interval`,
  and the SSM block (`ssm.conv_kernel`, `ssm.state_size`,
  `ssm.group_count`, `ssm.time_step_rank`) for the GatedDeltaNet
  linear-attention layers. `ssm.inner_size` is mapped to `None` (it's a
  derived value; recovered via post-processing, see below).
* `GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"]` with
  `norm_topk_prob: True`; same trap as qwen3_moe. llama.cpp normalizes
  routed expert weights by default while transformers defaults to
  `False`, so the override is needed to keep routing math consistent
  (see Override Transformers defaults by GGUF defaults #42770 for the
  qwen3_moe precedent; a minimal routing sketch follows this list).
* `GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] = GGUFQwen2Converter`
  (Qwen3.5 reuses the qwen2/qwen3 BPE tokenizer convention).
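For intuition, a minimal sketch of the routing behaviour `norm_topk_prob` controls (not the actual modeling code):

```python
import torch

def route(router_logits: torch.Tensor, top_k: int, norm_topk_prob: bool):
    # Pick the top-k experts per token from the softmaxed router scores.
    probs = torch.softmax(router_logits, dim=-1)
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)
    if norm_topk_prob:
        # llama.cpp's qwen35moe path: renormalize so routed weights sum to 1.
        weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, expert_ids
```

With the override, the HF side mixes expert outputs with the same renormalized weights llama.cpp uses, so both runtimes produce consistent routing math for the same GGUF.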
`src/transformers/modeling_gguf_pytorch_utils.py`
* Arch alias `qwen35moe` → `qwen3_5_moe_text`.
* `TENSOR_PROCESSORS["qwen35moe"] = Qwen2MoeTensorProcessor` plus the
  matching alias in `get_gguf_hf_weights_map`. Without this, the fused
  3-D `ffn_{gate,up,down}_exps` tensors silently fall through to the
  default processor and aren't sliced into per-expert
  `{gate,up,down}_proj` (conceptual sketch after this list).
* Recovery of `linear_value_head_dim` from
  `ssm.inner_size / linear_num_value_heads`. The writer doesn't emit
  `linear_value_head_dim` directly (it only emits the product via
  `ssm.inner_size`), so without recovery it would silently default
  to 128. Mirrors the per-arch hooks already used for `lfm2`,
  `gpt_oss`, `minimax_m2`, and `gemma3`.
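Conceptually, what the tensor processor does with a fused expert tensor (a sketch, not the `Qwen2MoeTensorProcessor` source; names illustrative):

```python
import numpy as np

def split_fused_experts(prefix: str, fused: np.ndarray) -> dict[str, np.ndarray]:
    # A fused ffn_gate_exps tensor has shape (num_experts, out_dim, in_dim);
    # slice the leading axis into one 2-D weight matrix per expert.
    return {f"{prefix}.{i}.weight": fused[i] for i in range(fused.shape[0])}
```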
Testing
A smoke test is added in `tests/quantization/ggml/test_ggml.py` under
`test_qwen35moe_iq3_s`, but marked `@unittest.skip` because the only
public Qwen3.5 MoE GGUF (`unsloth/Qwen3.6-35B-A3B-GGUF`, ~12.7 GB for
the IQ3_S quant) is too large for routine CI. Same approach used for
other large MoE models like Qwen3-30B-A3B in #42854 and MiniMax-M2.1
in #44526. Maintainers with a smaller fixture in hand can drop the
skip. A condensed version of the test is sketched below.
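The shape of the skipped test, condensed (the GGUF filename here is illustrative; see the test file for the exact fixture):

```python
import unittest
from transformers import AutoModelForCausalLM, AutoTokenizer

class GgufQwen35MoeTest(unittest.TestCase):
    @unittest.skip("~12.7 GB fixture is too large for routine CI")
    def test_qwen35moe_iq3_s(self):
        repo = "unsloth/Qwen3.6-35B-A3B-GGUF"
        gguf_file = "Qwen3.6-35B-A3B-IQ3_S.gguf"  # filename illustrative
        tokenizer = AutoTokenizer.from_pretrained(repo, gguf_file=gguf_file)
        model = AutoModelForCausalLM.from_pretrained(repo, gguf_file=gguf_file)
        inputs = tokenizer("The capital of France is", return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=8)
        self.assertTrue(tokenizer.decode(out[0], skip_special_tokens=True))
```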
Verified end-to-end locally by loading the IQ3_S quant of the 35B-A3B
model via `AutoModelForCausalLM.from_pretrained(..., gguf_file=...)`
and confirming generation produces sensible output.
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link
  to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the
  documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@SunMarc @MekkCyber @Cyrilvallez