Choosing a Model for Coding

maeddesg edited this page Jun 12, 2026 · 2 revisions

VulkanForge runs several models well suited to coding. They trade off quality, speed, and context window, so there is no single "best". The table below compares three of them on the reference hardware (Radeon RX 9070 XT, 16 GB), using results from VulkanForge's own coding tests. Pick based on your priorities.

| | Gemma-4-26B-A4B · Q3_K_M | Gemma-4-26B-A4B · QAT (Q4_0) | Qwen3.6-27B · Q3_K_S |
|---|---|---|---|
| Quant | 3-bit | 4-bit (quantization-aware) | 3-bit |
| Architecture | MoE, ~4B active/token | MoE, ~4B active/token | dense, ~27B active/token |
| Decode speed | ~108 tok/s | ~119 tok/s | ~35 tok/s |
| Prefill @512 / @1850 | 2242 / 3018 tok/s | 2838 / 3776 tok/s | not benchmarked here ¹ |
| Coding (hard task) ² | did not produce a complete solution | ~20 / 25 | ~21 / 25 |
| Coding (easy task) ² | passes | passes | not tested |
| Context headroom (16 GB) | most (ctx 4096+) | ctx 3072 comfortably (4096 tight) | ctx 3072 with ample free VRAM ³ |
| File size | 12.1 GB | 13.3 GB | 12.6 GB |

¹ Qwen3.6-27B is a dense model (all parameters active each token), so prefill and decode are generally slower than on the MoE Gemma variants; it was not separately prefill-benchmarked at these exact lengths.

² The "hard" task is an expression evaluator (25 cases, greedy decoding); the "easy" task is an LRU cache. This is a small sample: indicative, not a comprehensive benchmark. On the easy task both Gemma variants produce correct, passing code (Qwen3.6-27B was not run on it); the hard task is where the tested models separate.

³ This Qwen build's native context length is 2048; running longer relies on RoPE position extension. It has the smallest runtime footprint of the three, leaving the most spare VRAM at a given context.

Picking a model

  • Fastest, with top-tier coding quality → Gemma-4-26B-A4B QAT (Q4_0). The fastest decode of the three (~119 tok/s) and close to the best score on the hard coding task (~20 / 25, against ~21 / 25 for Qwen3.6-27B). A solid all-round pick if you want speed and quality together.
  • Most context on 16 GB (large files, long prompts) → Gemma-4-26B-A4B Q3_K_M. The smallest file of the three (12.1 GB) and the most context headroom in the table (comfortably reaches ctx 4096+), but it was the weakest of the three on the hard coding task, where it did not produce a complete solution.
  • A dense (non-MoE) alternative → Qwen3.6-27B. Coding quality comparable to the QAT Gemma in these tests, but roughly 3× slower to decode because all ~27B parameters run for every token.
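
The "roughly 3× slower" figure follows directly from the decode speeds in the table. As a rough sketch, the snippet below turns those speeds into approximate generation times. The 1000-token response length is an arbitrary assumption chosen for illustration, and prefill time is ignored because Qwen3.6-27B was not prefill-benchmarked here.

```python
# Decode speeds in tokens per second, copied from the comparison table.
DECODE_TOK_PER_S = {
    "Gemma-4-26B-A4B Q3_K_M": 108.0,
    "Gemma-4-26B-A4B QAT (Q4_0)": 119.0,
    "Qwen3.6-27B Q3_K_S": 35.0,
}


def decode_seconds(tok_per_s: float, output_tokens: int) -> float:
    """Approximate seconds to generate output_tokens, ignoring prefill."""
    return output_tokens / tok_per_s


if __name__ == "__main__":
    output_tokens = 1000  # hypothetical response length, not from the tests
    fastest = max(DECODE_TOK_PER_S.values())
    for name, speed in DECODE_TOK_PER_S.items():
        secs = decode_seconds(speed, output_tokens)
        slowdown = fastest / speed  # 1.0 for the fastest model
        print(f"{name}: ~{secs:.1f} s ({slowdown:.1f}x the fastest model's time)")
```

These are back-of-the-envelope numbers: real decode speed usually varies with context length and sampling settings, so treat the output as a ratio check rather than a latency prediction.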

Context length

--max-context 3072 is a good general setting for coding on 16 GB: it fits all three models and leaves enough generation budget for a substantial function or file (a smaller window can truncate long outputs). Use --max-context 4096 only with spare VRAM; the loader warns when free VRAM is low (see Troubleshooting and Configuration).

The Gemma-4-26B models require VULKANFORGE_KV_FP8=1: the engine aborts at load without it, and the setting also halves KV-cache VRAM. On serve, auto context sizing (v0.8.0) picks a fitting context automatically when --ctx-size is omitted.
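
To see why an FP8 KV cache halves KV-cache VRAM and why the context length matters, here is a minimal sketch of the standard KV-cache size estimate for a transformer. It assumes the non-FP8 baseline stores 16-bit values, and the layer count, KV-head count, and head dimension are hypothetical placeholders, not the actual dimensions of any model on this page. VulkanForge's real allocation may differ.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int) -> int:
    """Estimate KV-cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic concrete.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    for ctx in (3072, 4096):
        fp16 = kv_cache_bytes(ctx=ctx, bytes_per_elem=2, **shape)  # assumed baseline
        fp8 = kv_cache_bytes(ctx=ctx, bytes_per_elem=1, **shape)   # 1 byte per value
        print(f"ctx {ctx}: 16-bit {fp16 / 2**20:.0f} MiB, FP8 {fp8 / 2**20:.0f} MiB")
```

The estimate is linear in both context length and bytes per element, which is why dropping from 16-bit to 8-bit values halves the cache and why moving from ctx 3072 to 4096 costs a third more KV memory under these assumptions.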

Caveats (read before trusting the table)

  • Numbers are VulkanForge on a 16 GB RX 9070 XT — other hardware and VRAM budgets will differ.
  • Coding quality here is from a small internal test, not an industry benchmark — treat it as indicative.
  • This is not a quant-controlled comparison. The three models differ on several axes at once: quantization (3-bit vs 4-bit; a 4-bit Qwen-27B does not fit in 16 GB), architecture (dense vs MoE), and active parameters. It answers the practical question "which variant that runs on this card codes best for me", not "which quantization or architecture is better in the abstract".

See also: Supported Models · Benchmarks · Configuration.
