Qwen3 GPU inference produces garbage output (architecture not supported) #479

@noahgift

Summary

Qwen3 models (e.g., Qwen3-4B) produce garbage output during GPU inference via apr run. Both FP16 and Q4K quantized models produce random token sequences. Qwen2.5 models work correctly with the same inference path.

Reproduction

apr import /path/to/qwen3-4b/ --arch qwen3 --quantize q4k -o qwen3-4b-q4k.apr
apr run qwen3-4b-q4k.apr --prompt "def fibonacci(n):" --max-tokens 32 --json --chat
# Output: gibberish tokens (e.g., "obufæīĢæľīçļĦ...")

Same result with FP16 (no quantization), which rules out quantization as the cause and points to the inference architecture rather than the import pipeline.

Root Cause Analysis

Qwen3 config differences from Qwen2:

  • attention_bias: false (Qwen2 has QKV biases)
  • head_dim: 128 (explicit, not inferred)
  • model_type: "qwen3" (not "qwen2")
  • Qwen3ForCausalLM architecture class
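
The config differences listed above can be checked directly against a checkpoint's config.json before import. This is a minimal sketch, assuming the Hugging Face key names (`hidden_size`, `num_attention_heads`, `head_dim`, `attention_bias`, `model_type`). The helper `summarize_attention_config` is hypothetical and is not part of apr or realizar, and the numbers in the example dict are illustrative, not taken from this report.

```python
import json
from pathlib import Path


def load_config(model_dir: str) -> dict:
    """Read config.json from a model directory (Hugging Face layout assumed)."""
    return json.loads((Path(model_dir) / "config.json").read_text())


def summarize_attention_config(config: dict) -> dict:
    """Collect the attention-related fields an inference path has to honor.

    Hypothetical helper for illustration only.
    """
    model_type = config.get("model_type")
    inferred_head_dim = config["hidden_size"] // config["num_attention_heads"]
    # Qwen3 states head_dim explicitly; fall back to the inferred value otherwise.
    head_dim = config.get("head_dim", inferred_head_dim)
    # Assumption: a config without attention_bias uses QKV biases only if it
    # is a qwen2 config; every other model type defaults to no bias.
    qkv_bias = config.get("attention_bias", model_type == "qwen2")
    return {
        "model_type": model_type,
        "head_dim": head_dim,
        "head_dim_matches_inferred": head_dim == inferred_head_dim,
        "qkv_bias": qkv_bias,
    }


# Illustrative values only, not read from a real checkpoint.
qwen3_like = {
    "model_type": "qwen3",
    "hidden_size": 2560,
    "num_attention_heads": 32,
    "head_dim": 128,
    "attention_bias": False,
}
summary = summarize_attention_config(qwen3_like)
print(summary)
```

With these illustrative numbers, inferring the head dimension as hidden_size / num_attention_heads gives 80 rather than the explicit 128, so a loader that ignores the explicit field would reshape Q/K/V with the wrong width.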

The realizar GPU inference path likely:

  1. Doesn't recognize qwen3 as a supported architecture
  2. Falls through to a generic path that incorrectly applies QKV biases
  3. Or uses the wrong attention pattern (Qwen3 may use a different RoPE or attention layout)
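
One way to rule out the silent fall-through described in the first two points is to resolve the architecture through an explicit table and fail loudly on anything unknown. The sketch below only illustrates that idea in Python. `AttentionSpec`, `SUPPORTED_ARCHITECTURES`, and `resolve_attention_spec` are hypothetical names, and realizar's actual dispatch code may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionSpec:
    """Per-architecture attention settings (hypothetical, for illustration)."""
    qkv_bias: bool           # whether Q/K/V projections carry bias terms
    explicit_head_dim: bool  # whether head_dim must be read from the config


# Hypothetical registry reflecting only the differences listed in this issue.
SUPPORTED_ARCHITECTURES = {
    "qwen2": AttentionSpec(qkv_bias=True, explicit_head_dim=False),
    "qwen3": AttentionSpec(qkv_bias=False, explicit_head_dim=True),
}


class UnsupportedArchitectureError(ValueError):
    """Raised instead of falling back to a generic inference path."""


def resolve_attention_spec(model_type: str) -> AttentionSpec:
    """Look up the spec for model_type, refusing unknown architectures."""
    try:
        return SUPPORTED_ARCHITECTURES[model_type]
    except KeyError:
        known = ", ".join(sorted(SUPPORTED_ARCHITECTURES))
        raise UnsupportedArchitectureError(
            f"unsupported architecture {model_type!r}; known: {known}"
        ) from None


spec = resolve_attention_spec("qwen3")
print(spec)
```

An unsupported model would then stop with a clear error at load time instead of producing random tokens at generation time.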

Environment

Expected Behavior

Qwen3 models should produce coherent text, matching Qwen2.5 quality for equivalent parameter counts.

Impact

Blocks Qwen3-4B HumanEval evaluation. Currently falling back to Qwen2.5-Coder-7B-Instruct (proven 85.37% pass@1).

Metadata

Labels: bug
