-
Notifications
You must be signed in to change notification settings - Fork 1
Usage
VulkanForge has three subcommands: chat, bench, and serve.
vulkanforge chat --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]
vulkanforge chat --help lists every flag (sampling, max-tokens, think-filter, max-context).
# GGUF — no flag needed (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf
# FP8 SafeTensors — one flag; auto-loads tokenizer.json + chat_template from the dir
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/
# Gemma-4-26B-A4B MoE — KV-FP8 recommended to fit 16 GB
VULKANFORGE_KV_FP8=1 vulkanforge chat --model ~/models/google_gemma-4-26B-A4B-it-Q3_K_M.ggufThe REPL accepts /help, /quit, and /reset (clear the KV cache + history without reloading the
model). Common options: --max-tokens, --temperature (0 = greedy), --top-k, --top-p,
--ctx-size, --no-think-filter. See Configuration for the full list and their env-var forms.
VF_PROMPT="What is the capital of France?" \
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf --max-tokens 32 --temperature 0VF_PROMPT="..." runs exactly one turn and exits. It prints the response plus a one-line summary
([N prompt, M gen, prefill … tok/s, decode … tok/s]).
vulkanforge bench --model ~/models/Qwen3-8B-Q4_K_M.gguf --runs 5Reports decode (canonical N/decode_time, prefill-subtracted, warm, median) and a prefill sweep.
bench accepts Q4_K_M GGUF (Q8_0 chat works but does not bench). For the full prefill matrix
methodology used in the Benchmarks page, the repo ships examples/run_pp_bench.
vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080Single-model, single-stream OpenAI Chat Completions API (/v1/chat/completions, streaming SSE,
legacy text completions, tool calling for Qwen3/Hermes). It is single-user — no batching /
concurrent sessions. Optional cross-request KV prefix-reuse via VF_KV_PREFIX_REUSE=1 (default off).
See Supported Models for what to load and Configuration for flags.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 ·
Repository · Releases