Skip to content
maeddesg edited this page Jun 15, 2026 · 5 revisions

Usage

VulkanForge has three subcommands: chat, bench, and serve.

vulkanforge chat   --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench  --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve  --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]

vulkanforge chat --help lists every flag (sampling, max-tokens, think-filter, max-context).

Chat

# GGUF — no flag needed (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf

# FP8 SafeTensors — one flag; auto-loads tokenizer.json + chat_template from the dir
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/

# Gemma-4-26B-A4B MoE — KV-FP8 REQUIRED (engine aborts at load without it)
VULKANFORGE_KV_FP8=1 vulkanforge chat --model ~/models/google_gemma-4-26B-A4B-it-Q3_K_M.gguf

The REPL accepts /help, /quit, and /reset (clear the KV cache + history without reloading the model). Common options: --max-tokens, --temperature (0 = greedy), --top-k, --top-p, --ctx-size, --no-think-filter. See Configuration for the full list and their env-var forms.

Single-shot (scripting / CI)

VF_PROMPT="What is the capital of France?" \
  vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf --max-tokens 32 --temperature 0

VF_PROMPT="..." runs exactly one turn and exits. It prints the response plus a one-line summary ([N prompt, M gen, prefill … tok/s, decode … tok/s]).

Bench

vulkanforge bench --model ~/models/Qwen3-8B-Q4_K_M.gguf --runs 5

Reports decode (canonical N/decode_time, prefill-subtracted, warm, median) and a prefill sweep. bench accepts Q4_K_M GGUF (Q8_0 chat works but does not bench). For the full prefill matrix methodology used in the Benchmarks page, the repo ships examples/run_pp_bench.

Serve (OpenAI-compatible API)

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080

Single-model, single-stream OpenAI Chat Completions API (/v1/chat/completions, streaming SSE, legacy text completions, and tool / function calling for Qwen3 / Hermes and Gemma-4 — Gemma-4's native format added in v0.8.0). It is single-user — no batching / concurrent sessions. Optional cross-request KV prefix-reuse via VF_KV_PREFIX_REUSE=1 (default off).

Since v0.8.0, omit --ctx-size and the server auto-sizes the KV context from free VRAM + the model and prints what it chose at startup (explicit --ctx-size still overrides; capped at 16384 on RDNA4). See Configuration.

Ctrl+C (or SIGTERM) stops the server. Since v0.9.2 it shuts down cleanly — it lets in-flight requests finish, then frees all GPU resources in order and exits with code 0 (earlier builds could crash with a segfault on shutdown).

Memory (optional)

serve can host a persistent, project-scoped semantic memory — opt-in, off by default. It needs a binary built with --features memory (see Installation) and is activated per run:

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --memory
# env alias: VULKANFORGE_MEMORY=1 vulkanforge serve --model … 

Without --memory, the /memory/* endpoints return 503 and the server runs inference only — no embedder is loaded and no database is opened. (Pass --memory to a lean binary and it stops with a clear rebuild with --features memory message.)

When enabled, the server exposes VF-native endpoints under /memory/* (a namespace separate from the OpenAI-compatible /v1/*). project_key is optional everywhere — omit it for a shared global scope:

# Write a note on purpose
curl -s localhost:8080/memory/remember -H 'content-type: application/json' -d '{
  "project_key": "vf", "kind": "Learning",
  "text": "Dispatch reduction does not help on gfx1201"
}'                                  # → {"id":1}

# Read it back by meaning
curl -s localhost:8080/memory/recall -H 'content-type: application/json' -d '{
  "project_key": "vf", "query": "do fewer barriers help performance?", "k": 3
}'                                  # → {"hits":[{"id":1,"kind":"Learning","name":"…","text":"…","status":"active","score":0.72}]}

# Create / list projects
curl -s localhost:8080/memory/projects -H 'content-type: application/json' -d '{"project_key":"vf"}'
curl -s localhost:8080/memory/projects   # GET → {"projects":[…]}

The store lives at ~/.vulkanforge/memory.db (override VF_MEMORY_DB); the first --memory start downloads the embedding model once into ~/.vulkanforge/embed-cache/. The embedder runs on the CPU and never takes the GPU permit, so a recall never waits behind a generation. What memory is (and isn't), the full endpoint shapes, and the roadmap: Memory.

vf-clide — the CLI chat & agentic coding client

A lean standalone command-line client (its own crate, no engine dependencies) ships with the engine for talking to a running server. Chat (streaming REPL + headless one-shot, visible markers for truncated/empty answers) and, since v0.9.0, an agentic --agent mode (tools, tiered permissions, workspace confinement, constitution). See vf-clide.

See Supported Models for what to load and Configuration for flags.

Clone this wiki locally