Summary
Qwen3 models (e.g., Qwen3-4B) produce garbage output during GPU inference via apr run. Both FP16 and Q4K quantized models produce random token sequences. Qwen2.5 models work correctly with the same inference path.
Reproduction
apr import /path/to/qwen3-4b/ --arch qwen3 --quantize q4k -o qwen3-4b-q4k.apr
apr run qwen3-4b-q4k.apr --prompt "def fibonacci(n):" --max-tokens 32 --json --chat
# Output: gibberish tokens (e.g., "obufæīĢæľīçļĦ...")
The FP16 model (no quantization) produces the same result, which rules out Q4K quantization as the cause and points to the inference path rather than the import pipeline.
Root Cause Analysis
Qwen3 config differences from Qwen2:
- attention_bias: false (Qwen2 has QKV biases)
- head_dim: 128 (explicit, not inferred)
- model_type: "qwen3" (not "qwen2")
- Qwen3ForCausalLM architecture class
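The differences listed above can be checked directly against a checkpoint's config.json. The following is a minimal sketch, not part of apr or realizar: the function name summarize_attention_config is hypothetical, and it assumes the standard Hugging Face field names (hidden_size, num_attention_heads, head_dim, attention_bias, model_type, architectures).

```python
import json
from pathlib import Path


def summarize_attention_config(config_path):
    """Collect the config fields this issue lists as differing between Qwen2 and Qwen3."""
    cfg = json.loads(Path(config_path).read_text())

    hidden = cfg.get("hidden_size")
    heads = cfg.get("num_attention_heads")
    # What a loader would compute if it ignored an explicit head_dim (assumption).
    inferred = hidden // heads if hidden and heads else None
    explicit = cfg.get("head_dim")

    return {
        "model_type": cfg.get("model_type"),
        "architectures": cfg.get("architectures", []),
        "attention_bias": cfg.get("attention_bias"),
        "head_dim_explicit": explicit,
        "head_dim_inferred": inferred,
        # True when inferring head_dim would disagree with the explicit value.
        "head_dim_mismatch": (
            explicit is not None and inferred is not None and explicit != inferred
        ),
    }
```

If a loader infers head_dim as hidden_size / num_attention_heads, any checkpoint whose explicit head_dim differs from that quotient would get wrongly shaped attention projections, which would be consistent with the garbage output described here. Running this on the imported Qwen3-4B directory should show whether that applies.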
The realizar GPU inference path likely:
- Doesn't recognize qwen3 as a supported architecture
- Falls through to a generic path that incorrectly applies QKV biases
- Or uses the wrong attention pattern (Qwen3 may use a different RoPE or attention layout)
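To illustrate the first two hypotheses, here is a sketch of an architecture dispatch that refuses unknown model types instead of falling through to a generic path. It is an assumption about how such a check could look, not realizar's actual code: ARCH_DEFAULTS and resolve_attention_settings are hypothetical names, and the per-architecture defaults only encode what this issue states (Qwen2 has QKV biases, Qwen3 does not).

```python
# Hypothetical per-architecture defaults, taken from the differences listed in this issue.
ARCH_DEFAULTS = {
    "qwen2": {"qkv_bias": True},
    "qwen3": {"qkv_bias": False},
}


def resolve_attention_settings(cfg):
    """Map a model config dict to attention settings, failing on unknown architectures."""
    model_type = cfg.get("model_type")
    if model_type not in ARCH_DEFAULTS:
        # Fail loudly rather than silently applying another architecture's layout.
        raise ValueError(f"unsupported model_type: {model_type!r}")

    settings = dict(ARCH_DEFAULTS[model_type])

    # An explicit attention_bias in the config overrides the architecture default.
    if "attention_bias" in cfg:
        settings["qkv_bias"] = bool(cfg["attention_bias"])

    # Prefer an explicit head_dim; only infer it when the config omits it.
    head_dim = cfg.get("head_dim")
    if head_dim is None:
        head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    settings["head_dim"] = head_dim

    return settings
```

With a check like this, an unrecognized architecture would surface as an error at load time rather than as random tokens at generation time, which would make the first hypothesis easy to confirm or rule out.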
Environment
Expected Behavior
Qwen3 models should produce coherent text, matching Qwen2.5 quality for equivalent parameter counts.
Impact
Blocks Qwen3-4B HumanEval evaluation. Currently falling back to Qwen2.5-Coder-7B-Instruct (proven 85.37% pass@1).