Skip to content
maeddesg edited this page Jun 13, 2026 · 5 revisions

Usage

VulkanForge has three subcommands: chat, bench, and serve.

vulkanforge chat   --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench  --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve  --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]

vulkanforge chat --help lists every flag (sampling, max-tokens, think-filter, max-context).

Chat

# GGUF — no flag needed (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf

# FP8 SafeTensors — one flag; auto-loads tokenizer.json + chat_template from the dir
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/

# Gemma-4-26B-A4B MoE — KV-FP8 REQUIRED (engine aborts at load without it)
VULKANFORGE_KV_FP8=1 vulkanforge chat --model ~/models/google_gemma-4-26B-A4B-it-Q3_K_M.gguf

The REPL accepts /help, /quit, and /reset (clear the KV cache + history without reloading the model). Common options: --max-tokens, --temperature (0 = greedy), --top-k, --top-p, --ctx-size, --no-think-filter. See Configuration for the full list and their env-var forms.

Single-shot (scripting / CI)

VF_PROMPT="What is the capital of France?" \
  vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf --max-tokens 32 --temperature 0

VF_PROMPT="..." runs exactly one turn and exits. It prints the response plus a one-line summary ([N prompt, M gen, prefill … tok/s, decode … tok/s]).

Bench

vulkanforge bench --model ~/models/Qwen3-8B-Q4_K_M.gguf --runs 5

Reports decode (canonical N/decode_time, prefill-subtracted, warm, median) and a prefill sweep. bench accepts Q4_K_M GGUF (Q8_0 chat works but does not bench). For the full prefill matrix methodology used in the Benchmarks page, the repo ships examples/run_pp_bench.

Serve (OpenAI-compatible API)

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080

Single-model, single-stream OpenAI Chat Completions API (/v1/chat/completions, streaming SSE, legacy text completions, and tool / function calling for Qwen3 / Hermes and Gemma-4 — Gemma-4's native format added in v0.8.0). It is single-user — no batching / concurrent sessions. Optional cross-request KV prefix-reuse via VF_KV_PREFIX_REUSE=1 (default off).

Since v0.8.0, omit --ctx-size and the server auto-sizes the KV context from free VRAM + the model and prints what it chose at startup (explicit --ctx-size still overrides; capped at 16384 on RDNA4). See Configuration.

vf-clide — the CLI chat & agentic coding client

A lean standalone command-line client (its own crate, no engine dependencies) ships with the engine for talking to a running server. Chat (streaming REPL + headless one-shot, visible markers for truncated/empty answers) and, since v0.9.0, an agentic --agent mode (tools, tiered permissions, workspace confinement, constitution). See vf-clide.

See Supported Models for what to load and Configuration for flags.

Clone this wiki locally