-
Notifications
You must be signed in to change notification settings - Fork 1
Usage
VulkanForge has three subcommands: chat, bench, and serve.
vulkanforge chat --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]
vulkanforge chat --help lists every flag (sampling, max-tokens, think-filter, max-context).
# GGUF — no flag needed (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf
# FP8 SafeTensors — one flag; auto-loads tokenizer.json + chat_template from the dir
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/
# Gemma-4-26B-A4B MoE — KV-FP8 REQUIRED (engine aborts at load without it)
VULKANFORGE_KV_FP8=1 vulkanforge chat --model ~/models/google_gemma-4-26B-A4B-it-Q3_K_M.ggufThe REPL accepts /help, /quit, and /reset (clear the KV cache + history without reloading the
model). Common options: --max-tokens, --temperature (0 = greedy), --top-k, --top-p,
--ctx-size, --no-think-filter. See Configuration for the full list and their env-var forms.
VF_PROMPT="What is the capital of France?" \
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf --max-tokens 32 --temperature 0VF_PROMPT="..." runs exactly one turn and exits. It prints the response plus a one-line summary
([N prompt, M gen, prefill … tok/s, decode … tok/s]).
vulkanforge bench --model ~/models/Qwen3-8B-Q4_K_M.gguf --runs 5Reports decode (canonical N/decode_time, prefill-subtracted, warm, median) and a prefill sweep.
bench accepts Q4_K_M GGUF (Q8_0 chat works but does not bench). For the full prefill matrix
methodology used in the Benchmarks page, the repo ships examples/run_pp_bench.
vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080Single-model, single-stream OpenAI Chat Completions API (/v1/chat/completions, streaming SSE,
legacy text completions, and tool / function calling for Qwen3 / Hermes and Gemma-4 — Gemma-4's
native format added in v0.8.0). It is single-user — no batching / concurrent sessions. Optional
cross-request KV prefix-reuse via VF_KV_PREFIX_REUSE=1 (default off).
Since v0.8.0, omit --ctx-size and the server auto-sizes the KV context from free VRAM + the
model and prints what it chose at startup (explicit --ctx-size still overrides; capped at 16384 on
RDNA4). See Configuration.
Ctrl+C (or SIGTERM) stops the server. Since v0.9.2 it shuts down cleanly — it lets in-flight
requests finish, then frees all GPU resources in order and exits with code 0 (earlier builds could
crash with a segfault on shutdown).
serve can host a persistent, project-scoped semantic memory — opt-in, off by default. It needs a binary
built with --features memory (see Installation) and is activated per run:
vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --memory
# env alias: VULKANFORGE_MEMORY=1 vulkanforge serve --model … Without --memory, the /memory/* endpoints return 503 and the server runs inference only — no embedder is
loaded and no database is opened. (Pass --memory to a lean binary and it stops with a clear rebuild with --features memory message.)
When enabled, the server exposes VF-native endpoints under /memory/* (a namespace separate from the
OpenAI-compatible /v1/*). project_key is optional everywhere — omit it for a shared global scope:
# Write a note on purpose
curl -s localhost:8080/memory/remember -H 'content-type: application/json' -d '{
"project_key": "vf", "kind": "Learning",
"text": "Dispatch reduction does not help on gfx1201"
}' # → {"id":1}
# Read it back by meaning
curl -s localhost:8080/memory/recall -H 'content-type: application/json' -d '{
"project_key": "vf", "query": "do fewer barriers help performance?", "k": 3
}' # → {"hits":[{"id":1,"kind":"Learning","name":"…","text":"…","status":"active","score":0.72}]}
# Create / list projects
curl -s localhost:8080/memory/projects -H 'content-type: application/json' -d '{"project_key":"vf"}'
curl -s localhost:8080/memory/projects # GET → {"projects":[…]}The store lives at ~/.vulkanforge/memory.db (override VF_MEMORY_DB); the first --memory start downloads the
embedding model once into ~/.vulkanforge/embed-cache/. The embedder runs on the CPU and never takes the GPU
permit, so a recall never waits behind a generation. What memory is (and isn't), the full endpoint shapes, and the
roadmap: Memory.
A lean standalone command-line client (its own crate, no engine dependencies) ships with the engine
for talking to a running server. Chat (streaming REPL + headless one-shot, visible markers for
truncated/empty answers) and, since v0.9.0, an agentic --agent mode (tools, tiered permissions,
workspace confinement, constitution). See vf-clide.
See Supported Models for what to load and Configuration for flags.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 ·
Repository · Releases