Skip to content
maeddesg edited this page Jun 12, 2026 · 8 revisions

vf-clide — the CLI chat client

vf-clide is a lean command-line chat client for the VulkanForge server, shipped alongside the engine since v0.8.0. It is its own crate (GPL-3.0) with no engine dependencies — it talks only to the OpenAI-compatible API over HTTP, so it builds and runs independently of the Vulkan stack.

As of v0.1.0 it is a chat client (streaming/non-streaming, REPL + headless). The agentic loop (tools, file access, memory) is Phase 2.

Full reference — install, flags, limitations, troubleshooting: vf-clide/README.md.

Build

cargo build --release --manifest-path vf-clide/Cargo.toml   # → ./vf-clide/target/release/vf-clide

Two terminals: server, then client

Terminal 1 — start the server (it stays in the foreground; see Usage):

vulkanforge serve --model ~/models/Qwen_Qwen3-14B-Q4_K_M.gguf --port 8080

Terminal 2 — the client:

# interactive REPL (streams live)
vf-clide --url http://localhost:8080

# headless one-shot
vf-clide --url http://localhost:8080 -p "Capital of Japan? One word."

REPL commands

/clear (drop history) · /model <name> (label) · /max-tokens <N> · /think · /no-think · /quit (/q, /exit).

Flags

Flag Default Purpose
-p, --prompt Ask one question, print the answer, exit (headless)
--url http://localhost:8080 Server address
--model Qwen3-14B-Q4_K_M Label sent in the request only (the server decides which model answers)
--max-tokens 6144 Token budget; generous so thinking models have room
--no-think off Append /no_think → answer without the reasoning block
--no-stream off Full answer instead of streaming
--temperature 0.0 Sampling temperature
--project Project scope (placeholder; no effect yet)

Good to know

  • One model per server. --model is only a label — what answers is whatever the server loaded. Switch model = restart the server.
  • Visible markers, not silent failures. If an answer is cut off at the token limit it prints [truncated at the token limit (N) …] on stderr; if a thinking model produces only a <think> block with no visible answer it prints [empty answer — the budget was likely consumed by the <think> block …]. stdout stays clean.
  • The REPL needs a real terminal (TTY) — not pipe-able. For scripting use the headless -p mode.
  • Validated through the client across gemma (QAT / Q3 @KV-FP8), Qwen3 (14B / 8B), Llama-3.1-8B, Mistral-7B and DeepSeek-R1-Distill.

See Usage for the server side and Troubleshooting for empty/truncated answers and the context-size ceiling.

Clone this wiki locally