-
Notifications
You must be signed in to change notification settings - Fork 1
Usage
VulkanForge has three subcommands: chat, bench, and serve.
vulkanforge chat --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]
vulkanforge chat --help lists every flag (sampling, max-tokens, think-filter, max-context).
# GGUF — no flag needed (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf
# FP8 SafeTensors — one flag; auto-loads tokenizer.json + chat_template from the dir
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/
# Gemma-4-26B-A4B MoE — KV-FP8 REQUIRED (engine aborts at load without it)
VULKANFORGE_KV_FP8=1 vulkanforge chat --model ~/models/google_gemma-4-26B-A4B-it-Q3_K_M.ggufThe REPL accepts /help, /quit, and /reset (clear the KV cache + history without reloading the
model). Common options: --max-tokens, --temperature (0 = greedy), --top-k, --top-p,
--ctx-size, --no-think-filter. See Configuration for the full list and their env-var forms.
VF_PROMPT="What is the capital of France?" \
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf --max-tokens 32 --temperature 0VF_PROMPT="..." runs exactly one turn and exits. It prints the response plus a one-line summary
([N prompt, M gen, prefill … tok/s, decode … tok/s]).
vulkanforge bench --model ~/models/Qwen3-8B-Q4_K_M.gguf --runs 5Reports decode (canonical N/decode_time, prefill-subtracted, warm, median) and a prefill sweep.
bench accepts Q4_K_M GGUF (Q8_0 chat works but does not bench). For the full prefill matrix
methodology used in the Benchmarks page, the repo ships examples/run_pp_bench.
vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080Single-model, single-stream OpenAI Chat Completions API (/v1/chat/completions, streaming SSE,
legacy text completions, and tool / function calling for Qwen3 / Hermes and Gemma-4 — Gemma-4's
native format added in v0.8.0). It is single-user — no batching / concurrent sessions. Optional
cross-request KV prefix-reuse via VF_KV_PREFIX_REUSE=1 (default off).
Since v0.8.0, omit --ctx-size and the server auto-sizes the KV context from free VRAM + the
model and prints what it chose at startup (explicit --ctx-size still overrides; capped at 16384 on
RDNA4). See Configuration.
Ctrl+C (or SIGTERM) stops the server. Since v0.9.2 it shuts down cleanly — it lets in-flight
requests finish, then frees all GPU resources in order and exits with code 0 (earlier builds could
crash with a segfault on shutdown).
A lean standalone command-line client (its own crate, no engine dependencies) ships with the engine
for talking to a running server. Chat (streaming REPL + headless one-shot, visible markers for
truncated/empty answers) and, since v0.9.0, an agentic --agent mode (tools, tiered permissions,
workspace confinement, constitution). See vf-clide.
See Supported Models for what to load and Configuration for flags.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 ·
Repository · Releases