Usage

VulkanForge has three subcommands: chat, bench, and serve.

vulkanforge chat   --model <PATH> [--tokenizer-from <GGUF>] ...
vulkanforge bench  --model <PATH> [--tokenizer-from <GGUF>] [--runs N]
vulkanforge serve  --model <PATH> [--host 127.0.0.1] [--port 8080] [--cors]

vulkanforge chat --help lists every flag (sampling, max-tokens, think-filter, max-context).

Chat

# GGUF — no flag needed (Mesa 26.1+ recommended)
vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf

# FP8 SafeTensors — one flag; auto-loads tokenizer.json + chat_template from the dir
VF_FP8=auto vulkanforge chat --model ~/models/Qwen3-8B-FP8/

# Gemma-4-26B-A4B MoE — KV-FP8 recommended to fit 16 GB
VULKANFORGE_KV_FP8=1 vulkanforge chat --model ~/models/google_gemma-4-26B-A4B-it-Q3_K_M.gguf

The REPL accepts /help, /quit, and /reset (clear the KV cache + history without reloading the model). Common options: --max-tokens, --temperature (0 = greedy), --top-k, --top-p, --ctx-size, --no-think-filter. See Configuration for the full list and their env-var forms.

Single-shot (scripting / CI)

VF_PROMPT="What is the capital of France?" \
  vulkanforge chat --model ~/models/Qwen3-8B-Q4_K_M.gguf --max-tokens 32 --temperature 0

VF_PROMPT="..." runs exactly one turn and exits. It prints the response plus a one-line summary ([N prompt, M gen, prefill … tok/s, decode … tok/s]).

Bench

vulkanforge bench --model ~/models/Qwen3-8B-Q4_K_M.gguf --runs 5

Reports decode (canonical N/decode_time, prefill-subtracted, warm, median) and a prefill sweep. bench accepts Q4_K_M GGUF (Q8_0 chat works but does not bench). For the full prefill matrix methodology used in the Benchmarks page, the repo ships examples/run_pp_bench.

Serve (OpenAI-compatible API)

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080

Single-model, single-stream OpenAI Chat Completions API (/v1/chat/completions, streaming SSE, legacy text completions, tool calling for Qwen3/Hermes). It is single-user — no batching / concurrent sessions. Optional cross-request KV prefix-reuse via VF_KV_PREFIX_REUSE=1 (default off).

See Supported Models for what to load and Configuration for flags.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases

VulkanForge Wiki

Get Started

Use VulkanForge

Reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Usage

Usage

Chat

Single-shot (scripting / CI)

Bench

Serve (OpenAI-compatible API)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

VulkanForge Wiki

Clone this wiki locally