Usage

VulkanForge has three subcommands: chat, bench, and serve.

vulkanforge chat   --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench  --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve  --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]

vulkanforge chat --help lists every flag (sampling, max-tokens, think-filter, max-context).

Chat

# GGUF — no flag needed (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf

# FP8 SafeTensors — one flag; auto-loads tokenizer.json + chat_template from the dir
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/

# Gemma-4-26B-A4B MoE — KV-FP8 REQUIRED (engine aborts at load without it)
VULKANFORGE_KV_FP8=1 vulkanforge chat --model ~/models/google_gemma-4-26B-A4B-it-Q3_K_M.gguf

The REPL accepts /help, /quit, and /reset (clear the KV cache + history without reloading the model). Common options: --max-tokens, --temperature (0 = greedy), --top-k, --top-p, --ctx-size, --no-think-filter. See Configuration for the full list and their env-var forms.

Single-shot (scripting / CI)

VF_PROMPT="What is the capital of France?" \
  vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf --max-tokens 32 --temperature 0

VF_PROMPT="..." runs exactly one turn and exits. It prints the response plus a one-line summary ([N prompt, M gen, prefill … tok/s, decode … tok/s]).

Bench

vulkanforge bench --model ~/models/Qwen3-8B-Q4_K_M.gguf --runs 5

Reports decode (canonical N/decode_time, prefill-subtracted, warm, median) and a prefill sweep. bench accepts Q4_K_M GGUF (Q8_0 chat works but does not bench). For the full prefill matrix methodology used in the Benchmarks page, the repo ships examples/run_pp_bench.

Serve (OpenAI-compatible API)

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080

Single-model, single-stream OpenAI Chat Completions API (/v1/chat/completions, streaming SSE, legacy text completions, and tool / function calling for Qwen3 / Hermes and Gemma-4 — Gemma-4's native format added in v0.8.0). It is single-user — no batching / concurrent sessions. Optional cross-request KV prefix-reuse via VF_KV_PREFIX_REUSE=1 (default off).

Since v0.8.0, omit --ctx-size and the server auto-sizes the KV context from free VRAM + the model and prints what it chose at startup (explicit --ctx-size still overrides; capped at 16384 on RDNA4). See Configuration.

Ctrl+C (or SIGTERM) stops the server. Since v0.9.2 it shuts down cleanly — it lets in-flight requests finish, then frees all GPU resources in order and exits with code 0 (earlier builds could crash with a segfault on shutdown).

Memory (optional)

serve can host a persistent, project-scoped semantic memory — opt-in, off by default. It needs a binary built with --features memory (see Installation) and is activated per run:

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --memory
# env alias: VULKANFORGE_MEMORY=1 vulkanforge serve --model …

Without --memory, the /memory/* endpoints return 503 and the server runs inference only — no embedder is loaded and no database is opened. (Pass --memory to a lean binary and it stops with a clear rebuild with --features memory message.)

When enabled, the server exposes VF-native endpoints under /memory/* (a namespace separate from the OpenAI-compatible /v1/*). project_key is optional everywhere — omit it for a shared global scope:

# Write a note on purpose
curl -s localhost:8080/memory/remember -H 'content-type: application/json' -d '{
  "project_key": "vf", "kind": "Learning",
  "text": "Dispatch reduction does not help on gfx1201"
}'                                  # → {"id":1}

# Read it back by meaning
curl -s localhost:8080/memory/recall -H 'content-type: application/json' -d '{
  "project_key": "vf", "query": "do fewer barriers help performance?", "k": 3
}'                                  # → {"hits":[{"id":1,"kind":"Learning","name":"…","text":"…","status":"active","score":0.72}]}

# Create / list projects
curl -s localhost:8080/memory/projects -H 'content-type: application/json' -d '{"project_key":"vf"}'
curl -s localhost:8080/memory/projects   # GET → {"projects":[…]}

The store lives at ~/.vulkanforge/memory.db (override VF_MEMORY_DB); the first --memory start downloads the embedding model once into ~/.vulkanforge/embed-cache/. The embedder runs on the CPU and never takes the GPU permit, so a recall never waits behind a generation. What memory is (and isn't), the full endpoint shapes, and the roadmap: Memory.

vf-clide — the CLI chat & agentic coding client

A lean standalone command-line client (its own crate, no engine dependencies) ships with the engine for talking to a running server. Chat (streaming REPL + headless one-shot, visible markers for truncated/empty answers) and, since v0.9.0, an agentic --agent mode (tools, tiered permissions, workspace confinement, constitution). See vf-clide.

See Supported Models for what to load and Configuration for flags.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases

VulkanForge Wiki

Get Started

Use VulkanForge

Reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Usage

Usage

Chat

Single-shot (scripting / CI)

Bench

Serve (OpenAI-compatible API)

Memory (optional)

vf-clide — the CLI chat & agentic coding client

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

VulkanForge Wiki

Clone this wiki locally