Run a real coding agent against a local model, route every request through one OpenAI-compatible endpoint, and watch every token in a self-hosted dashboard. No cloud, no per-token bill, nothing invisible.
▶ Live interactive guide: https://hoodini.github.io/local-ai-stack-101/
An animated, single-page walkthrough of a fully-local, fully-observable AI coding setup — the spine you can build on your own machine in an afternoon. Works on Windows + NVIDIA (WSL2 / CUDA + vLLM) or macOS (Apple Silicon + LM Studio). The page is one self-contained index.html — no build step, no dependencies, no tracking.
Coding Agent → LiteLLM Gateway → Local Model every token → Langfuse
(OpenCode / (one OpenAI- (vLLM on NVIDIA / (self-hosted traces,
Hermes) compatible URL) LM Studio on Mac) cost, latency)
The golden rule: the agent talks only to LiteLLM. That single choke point is why every request, token and dollar is observable — the agent never knows (or cares) whether vLLM or LM Studio actually answered. Swap the model behind the gateway and nothing upstream changes.
| Layer | Role | Pick |
|---|---|---|
| Coding agent | Where you actually write code | OpenCode (easy on-ramp) or Hermes (autonomous jobs, cron, skills) — both speak the OpenAI API |
| LiteLLM gateway | One OpenAI-compatible endpoint; logs tokens & cost | Routes to any backend, emits a Langfuse callback on every call |
| Local model | Does the inference, on your hardware | vLLM (NVIDIA/WSL2) or LM Studio (Apple Silicon). Size the model to your VRAM |
| Langfuse | Self-hosted observability | Every trace, token count, latency and cost — MIT, free, runs in Docker |
| Mesh (optional) | Reach a second machine's model | Tailscale, so a Mac can borrow a Windows box's GPU (or vice-versa) |
The live page has copy-paste blocks for each step with platform-specific commands. The short version:
- Stand up Langfuse (Docker) — gives you the trace dashboard at
http://localhost:3000. - Run the LiteLLM proxy with a Langfuse success/failure callback and a Postgres
DATABASE_URL(the DB is what enables the:4000/uilogin and key management). - Point your agent at the gateway — set
OPENAI_BASE_URLtohttp://localhost:4000and use your local master key. The agent thinks it's talking to OpenAI. - Make a request, then open Langfuse → Tracing — your call appears with prompt, response, token counts and latency. If it's there, the spine works. ✅
Open https://hoodini.github.io/local-ai-stack-101/ and click ⚡ Run a request to watch the round-trip animate end-to-end.
- LiteLLM UI (
http://localhost:4000/ui) → spend, request & token counts per model / key / day, plus key management. - Langfuse (
http://localhost:3000) → per-call traces: tokens in/out, latency, model, cost. Click any trace to drill in.
The live page has an interactive stepper — click a stop or press ▶ Play and each hop lights up the matching node in the diagram. Here's the same journey in text. A single prompt travels five stops:
Before you start, open three tabs: LiteLLM :4000/ui (login admin / admin), Langfuse :3000 (the email + password you set on first run), and — if you're using LM Studio — its desktop app on the Developer / Local Server tab.
Open your agent and send a prompt. It talks only to LiteLLM, never to the model directly.
# OpenCode — it's already pointed at the gateway
opencode
# then type your prompt and hit Enter
# …or Hermes, one-shot:
hermes -m qwen-coder -z "Reverse a string in Python."Which model?
qwen-coder/seed-coder→ Ollama (works immediately).qwen3.5-…-mlx→ LM Studio (load a model in its app first). What you'll see: the answer streams back in your terminal.
The gateway reads the model name and picks the backend — same request shape, different engine. This single choke point is why every token is observable: routing and the Langfuse callback both fire here.
What you'll see: nothing to click — routing is instant. A copy of the call has already been shipped to Langfuse via the
success_callback: ["langfuse"]line inconfig.yaml.
Your local model does the actual inference, on your own hardware. Nothing leaves the machine.
What you'll see: in LM Studio, the Developer/Local Server log shows
Received POST /v1/chat/completionsthe instant you sent it. Ollama logs to its own console. That's proof the gateway routed your request into the engine.
Open the gateway dashboard and log in.
- Logs → newest row is your call; click it for the full request & response + token counts.
- Usage → spend, requests & tokens per model / key / day.
What you'll see: your prompt, the reply,
prompt / completion / totaltokens, latency, and which backend answered.
Open Langfuse, pick your project, click Tracing in the sidebar.
- The top row (newest first) is your call — click it.
- Full input + output text, model, latency, tokens in/out, cost.
What you'll see: the same call you saw in LiteLLM, now with full prompt/response and a timing span. If it shows up in both places, the spine works end to end. ✅ Cost reads ~$0 for local models — that's correct.
Agent (OpenCode / Hermes)
│ model name: "qwen-coder" or "qwen3.5-…-mlx"
▼
LiteLLM :4000 ───────────────► Logs + Usage (gateway ledger)
│ routes by name │
│ └─ success_callback ─► Langfuse :3000 (full trace)
▼
Engine: Ollama :11434 OR LM Studio :1234 (the actual inference)
Every credential in this guide — admin/admin, postgres:postgres@localhost, sk-local-master, sk-lf-…, langfuse123 — is an intentional localhost placeholder for teaching, not a real secret. They live behind localhost on your own machine. The compose file ships insecure CHANGEME defaults on purpose; regenerate them (openssl rand -hex 32) before exposing anything to a network. Nothing here is wired to a real deployment.
No hardware is hard-coded either — the guide is written for "NVIDIA on Windows" or "Apple Silicon on Mac" generically. Pick the model that fits your VRAM.
It's a single index.html (pure HTML/CSS/SVG/JS, zero dependencies). Edit it, open it in a browser, done. PRs that improve the explanations, add a platform, or polish the animations are welcome.
MIT — use it, fork it, teach with it.



