
[GGUF] Add support for Qwen3.5 MoE (qwen35moe arch)#45668

Open
lucaspirola wants to merge 3 commits into huggingface:main from lucaspirola:add-qwen35moe-text-gguf

Conversation

@lucaspirola

What does this PR do?

Currently, GGUF versions of Qwen3.5 MoE models raise:

    GGUF model with architecture qwen35moe is not supported yet.

This PR resolves that.

The arch is the one llama.cpp's convert_hf_to_gguf.py writes for
Qwen3_5MoeForCausalLM / Qwen3_5MoeForConditionalGeneration (see
MODEL_ARCH.QWEN35MOE in gguf-py>=0.18.0). Loading routes to the
text-only Qwen3_5MoeTextConfig so Qwen3_5MoeForCausalLM gets the
matching config; vision weights, when present, ride in a co-located
mmproj-*.gguf.

Changes

src/transformers/integrations/ggml.py

  • GGUF_CONFIG_MAPPING["qwen3_5_moe_text"] covering the qwen35moe
    metadata block: standard transformer fields, the MoE-specific
    expert_*_length keys, the hybrid-pattern full_attention_interval,
    and the SSM block (ssm.conv_kernel, ssm.state_size,
    ssm.group_count, ssm.time_step_rank) for the GatedDeltaNet
    linear-attention layers. ssm.inner_size is mapped to None
    (it's a derived value; recovered via post-processing — see below).
  • GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"] with
    norm_topk_prob: True — same trap as qwen3_moe. llama.cpp
    normalizes routed expert weights by default while transformers
    defaults to False, so the override is needed to keep routing math
    consistent (see #42770, "Override Transformers defaults by GGUF
    defaults", for the qwen3_moe precedent).
  • GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] = GGUFQwen2Converter
    (Qwen3.5 reuses the qwen2/qwen3 BPE tokenizer convention); a sketch
    of this wiring follows the list.
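
To make the registration shape concrete, here is a condensed, patch-shaped sketch of the ggml.py wiring, not a verbatim excerpt of the diff: the ssm.* targets mirror the mapping spelled out in the commit messages below, while the standard-field and expert-length target names are illustrative assumptions.

```python
# Condensed sketch of the integrations/ggml.py registrations described above.
# The ssm.* targets follow the second commit message; the standard-field and
# expert-length target names are illustrative assumptions, not the patch itself.
GGUF_CONFIG_MAPPING["qwen3_5_moe_text"] = {
    "context_length": "max_position_embeddings",  # standard fields (assumed names)
    "block_count": "num_hidden_layers",
    "feed_forward_length": "intermediate_size",  # non-MoE layers in the hybrid stack
    "expert_feed_forward_length": "moe_intermediate_size",  # expert_*_length (assumed)
    "full_attention_interval": "full_attention_interval",  # hybrid-pattern cadence
    "ssm.conv_kernel": "linear_conv_kernel_dim",
    "ssm.state_size": "linear_key_head_dim",
    "ssm.group_count": "linear_num_key_heads",
    "ssm.time_step_rank": "linear_num_value_heads",
    "ssm.inner_size": None,  # derived product; recovered in post-processing
}

# llama.cpp normalizes routed expert weights by default; transformers defaults to False.
GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"] = {"norm_topk_prob": True}

# Qwen3.5 reuses the qwen2/qwen3 BPE tokenizer convention.
GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] = GGUFQwen2Converter
```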
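Why the norm_topk_prob override matters, as a toy torch illustration of the routing mismatch (this is not code from either project):

```python
import torch

# Toy illustration of the routing mismatch behind norm_topk_prob=True.
# llama.cpp renormalizes the selected top-k expert weights to sum to 1;
# transformers' default (False) keeps the raw softmax slices.
router_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
topk_weights, topk_ids = torch.topk(router_logits.softmax(dim=-1), k=2)

raw = topk_weights                                # norm_topk_prob=False: sums to < 1
renormalized = topk_weights / topk_weights.sum()  # norm_topk_prob=True: sums to 1
assert raw.sum() < 1.0 and abs(renormalized.sum().item() - 1.0) < 1e-6
```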

src/transformers/modeling_gguf_pytorch_utils.py

  • Architecture detection: qwen35moe -> qwen3_5_moe_text.
  • TENSOR_PROCESSORS["qwen35moe"] = Qwen2MoeTensorProcessor plus the
    matching alias in get_gguf_hf_weights_map. Without this, the
    fused 3-D ffn_{gate,up,down}_exps tensors silently fall through
    to the default processor and aren't sliced into per-expert
    {gate,up,down}_proj (see the slicing sketch after this list).
  • Per-arch post-process to recover linear_value_head_dim from
    ssm.inner_size / linear_num_value_heads. The writer doesn't
    emit linear_value_head_dim directly (it only emits the product
    via ssm.inner_size), so without recovery it would silently
    default to 128. Mirrors the per-arch hooks already used for
    lfm2, gpt_oss, minimax_m2, and gemma3.
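
What the tensor processor has to do, as a minimal runnable sketch: it assumes the fused GGUF tensor is laid out as (num_experts, out_features, in_features) and uses illustrative names; the real logic lives in Qwen2MoeTensorProcessor.

```python
import numpy as np

# Minimal sketch of the fused-expert split, assuming the fused tensor is laid
# out as (num_experts, out_features, in_features). Illustrative stand-in for
# Qwen2MoeTensorProcessor; names and layout are assumptions.
def split_fused_experts(name: str, fused: np.ndarray) -> dict[str, np.ndarray]:
    proj = {"ffn_gate_exps": "gate_proj", "ffn_up_exps": "up_proj", "ffn_down_exps": "down_proj"}
    suffix = next(k for k in proj if k in name)
    return {
        name.replace(suffix, f"experts.{i}.{proj[suffix]}"): expert
        for i, expert in enumerate(fused)
    }

fused = np.zeros((8, 768, 2048), dtype=np.float32)  # toy shape: 8 experts
sliced = split_fused_experts("blk.0.ffn_gate_exps.weight", fused)
assert len(sliced) == 8 and sliced["blk.0.experts.0.gate_proj.weight"].shape == (768, 2048)
```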

Testing

A smoke test is added in tests/quantization/ggml/test_ggml.py
under test_qwen35moe_iq3_s, but marked @unittest.skip because
the only public Qwen3.5 MoE GGUF (unsloth/Qwen3.5-35B-A3B-GGUF,
~12.7 GB for the IQ3_S quant) is too large for routine CI. Same
approach used for other large MoE models like Qwen3-30B-A3B in
#42854 and MiniMax-M2.1 in #44526. Maintainers with a smaller
fixture in hand can drop the skip.

Verified end-to-end locally by loading the IQ3_S quant of the
35B-A3B model via AutoModelForCausalLM.from_pretrained(..., gguf_file=...) and confirming generation produces sensible output.
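
For anyone reproducing the check, the flow reduces to the standard gguf_file path; the exact quant filename below is an assumption about the repo layout.

```python
# Sketch of the local end-to-end check described above. The gguf filename is
# an assumption; the repo id is the one cited under Testing.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "unsloth/Qwen3.5-35B-A3B-GGUF"
gguf_file = "Qwen3.5-35B-A3B-IQ3_S.gguf"  # assumed filename for the IQ3_S quant

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)

inputs = tokenizer("Give me a short introduction to large language models.", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```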

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc @MekkCyber @Cyrilvallez

Lucas added 3 commits April 28, 2026 00:09
Qwen3.5 MoE GGUF files identify their architecture as "qwen35moe"
(matches gguf-py>=0.18.0's MODEL_ARCH.QWEN35MOE). Without an entry
here, loading a Qwen3.5 MoE GGUF dies with:

    RuntimeError: Architecture qwen35moe not supported

Wire up four spots so Qwen3_5MoeForCausalLM can load a "qwen35moe"
GGUF end to end:

* integrations/ggml.py: GGUF_CONFIG_MAPPING["qwen3_5_moe_text"] for
  the metadata-to-config translation. Includes feed_forward_length
  for the non-MoE layers in the hybrid stack and the regular
  expert_*_length keys.
* integrations/ggml.py: GGUF_TO_FAST_CONVERTERS["qwen3_5_moe_text"] =
  GGUFQwen2Converter (Qwen3.5 reuses the qwen2/qwen3 BPE convention).
* integrations/ggml.py: GGUF_CONFIG_DEFAULTS_MAPPING["qwen3_5_moe_text"]
  with norm_topk_prob=True. Same trap as qwen3_moe; llama.cpp's
  qwen35moe.cpp normalizes routed expert weights, so the HF default
  has to be overridden to keep routing math consistent.
* modeling_gguf_pytorch_utils.py: TENSOR_PROCESSORS["qwen35moe"] =
  Qwen2MoeTensorProcessor for the fused 3-D ffn_*_exps splitting
  into per-expert {gate,up,down}_proj. Without this, MoE expert
  weights silently fall through to the default processor and aren't
  sliced. Plus the matching qwen3_5_moe_text -> qwen35moe alias in
  get_gguf_hf_weights_map.

Text-only target (qwen3_5_moe_text, not qwen3_5_moe) is intentional:
Qwen3_5MoeForCausalLM is backed by Qwen3_5MoeTextConfig, and Qwen3.5
MoE GGUF distributions ship as text-only; vision weights, when
present, ride in a co-located mmproj-*.gguf.

Adds a smoke test in tests/quantization/ggml/test_ggml.py marked
@unittest.skip because the only public Qwen3.5 MoE GGUF (~12.7 GB)
is too large for routine CI. Maintainers with a smaller fixture in
hand can drop the skip.

Signed-off-by: Lucas <lucas@eliteaero.com.br>
Follow-up to 73aa1cb. Map the remaining qwen35moe metadata that
convert_hf_to_gguf actually writes (via Qwen3NextModel.set_gguf_parameters):

* full_attention_interval -> kwarg consumed by
  Qwen3_5MoeTextConfig.__post_init__ to derive the hybrid layer_types
  list. Without this, layer_types silently falls back to the default
  interval of 4. The 35B-A3B target happens to use 4, but any future
  GGUF with a different cadence would load with the wrong attention
  pattern (see the sketch after this list).
* ssm.conv_kernel    -> linear_conv_kernel_dim
* ssm.state_size     -> linear_key_head_dim
* ssm.group_count    -> linear_num_key_heads
* ssm.time_step_rank -> linear_num_value_heads
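
To make the full_attention_interval contract concrete, a minimal sketch of the hybrid layer_types derivation; the string values and the one-indexed cadence are assumptions, not the exact __post_init__ logic:

```python
# Sketch of the hybrid layer_types derivation driven by full_attention_interval.
# Assumption: every full_attention_interval-th layer is full attention and the
# rest are GatedDeltaNet linear-attention layers; string values are illustrative.
def derive_layer_types(num_hidden_layers: int, full_attention_interval: int = 4) -> list[str]:
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_hidden_layers)
    ]

# With the default interval of 4, every fourth layer is full attention.
assert derive_layer_types(8) == (["linear_attention"] * 3 + ["full_attention"]) * 2
```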

ssm.inner_size is derived (linear_value_head_dim *
linear_num_value_heads) and has no direct config field, so it's
mapped to None — linear_value_head_dim falls back to its config
default, which matches the writer's contract.

The keys flagged by review but not actually emitted by the qwen35moe
converter path (rope.scaling.*, attention.sliding_window) are left
unmapped on purpose; adding them now would be speculative.

Signed-off-by: Lucas <lucas@eliteaero.com.br>
…size

Closes the one remaining silent-default in the qwen35moe load path.

The convert_hf_to_gguf writer doesn't emit linear_value_head_dim as
its own KV — it only emits ssm.inner_size, which equals
linear_value_head_dim * linear_num_value_heads. Without recovery,
linear_value_head_dim falls back to the Qwen3_5MoeTextConfig default
(128). That happens to match the 35B-A3B target, but any GGUF where
the writer's contract holds with a different per-head dim would load
silently misconfigured.

Mirrors the per-arch post-processing pattern already used for lfm2,
gpt_oss, minimax_m2, and gemma3.
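
The recovery itself is one division; a minimal sketch with illustrative names (the real hook lives in modeling_gguf_pytorch_utils.py):

```python
# Sketch of the per-arch post-processing described above. The writer emits only
# the product ssm.inner_size = linear_value_head_dim * linear_num_value_heads,
# so the per-head dim is recovered by division. Names are illustrative.
def recover_linear_value_head_dim(config: dict, ssm_inner_size: int) -> None:
    num_value_heads = config["linear_num_value_heads"]
    # Without this step, linear_value_head_dim silently defaults to 128.
    config["linear_value_head_dim"] = ssm_inner_size // num_value_heads

config = {"linear_num_value_heads": 32}
recover_linear_value_head_dim(config, ssm_inner_size=32 * 128)
assert config["linear_value_head_dim"] == 128
```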

Signed-off-by: Lucas <lucas@eliteaero.com.br>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: ggml

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45668&sha=a29bda

