Usage

VulkanForge has three subcommands: chat, bench, and serve.

vulkanforge chat   --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench  --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve  --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]

vulkanforge chat --help lists every flag (sampling, max-tokens, think-filter, max-context).

Chat

# GGUF — no flag needed (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf

# FP8 SafeTensors — one flag; auto-loads tokenizer.json + chat_template from the dir
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/

# Gemma-4-26B-A4B MoE — KV-FP8 REQUIRED (engine aborts at load without it)
VULKANFORGE_KV_FP8=1 vulkanforge chat --model ~/models/google_gemma-4-26B-A4B-it-Q3_K_M.gguf

The REPL accepts /help, /quit, and /reset (clear the KV cache + history without reloading the model). Common options: --max-tokens, --temperature (0 = greedy), --top-k, --top-p, --ctx-size, --no-think-filter. See Configuration for the full list and their env-var forms.

Single-shot (scripting / CI)

VF_PROMPT="What is the capital of France?" \
  vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf --max-tokens 32 --temperature 0

VF_PROMPT="..." runs exactly one turn and exits. It prints the response plus a one-line summary ([N prompt, M gen, prefill … tok/s, decode … tok/s]).

Bench

vulkanforge bench --model ~/models/Qwen3-8B-Q4_K_M.gguf --runs 5

Reports decode (canonical N/decode_time, prefill-subtracted, warm, median) and a prefill sweep. bench accepts Q4_K_M GGUF (Q8_0 chat works but does not bench). For the full prefill matrix methodology used in the Benchmarks page, the repo ships examples/run_pp_bench.

Serve (OpenAI-compatible API)

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080

Single-model, single-stream OpenAI Chat Completions API (/v1/chat/completions, streaming SSE, legacy text completions, and tool / function calling for Qwen3 / Hermes and Gemma-4 — Gemma-4's native format added in v0.8.0). It is single-user — no batching / concurrent sessions. Optional cross-request KV prefix-reuse via VF_KV_PREFIX_REUSE=1 (default off).

Since v0.8.0, omit --ctx-size and the server auto-sizes the KV context from free VRAM + the model and prints what it chose at startup (explicit --ctx-size still overrides; capped at 16384 on RDNA4). See Configuration.

Ctrl+C (or SIGTERM) stops the server. Since v0.9.2 it shuts down cleanly — it lets in-flight requests finish, then frees all GPU resources in order and exits with code 0 (earlier builds could crash with a segfault on shutdown).

vf-clide — the CLI chat & agentic coding client

A lean standalone command-line client (its own crate, no engine dependencies) ships with the engine for talking to a running server. Chat (streaming REPL + headless one-shot, visible markers for truncated/empty answers) and, since v0.9.0, an agentic --agent mode (tools, tiered permissions, workspace confinement, constitution). See vf-clide.

See Supported Models for what to load and Configuration for flags.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases

VulkanForge Wiki

Get Started

Use VulkanForge

Reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Usage

Usage

Chat

Single-shot (scripting / CI)

Bench

Serve (OpenAI-compatible API)

vf-clide — the CLI chat & agentic coding client

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

VulkanForge Wiki

Clone this wiki locally