Know exactly what you spend on AI — and automatically spend less.
A local proxy that sits between your tools and the LLM APIs. It tracks every token, calculates real costs, and optimizes requests on the fly. Works with Claude Code, VS Code, Claude Desktop, OpenClaw, and 15+ API providers (OpenAI, Groq, Mistral, DeepSeek, xAI, Perplexity, Cerebras, Together, Fireworks, Cohere, OpenRouter, Ollama, Amazon Bedrock, Google Gemini, and more).
If you use Claude Code or VS Code AI tools heavily, you're probably spending $2–10/day without knowing exactly where it goes. This tool:
- Shows you a real-time breakdown by tool, model, session, and task type
- Automatically reduces your bill by 50–80% via prompt caching and smart model routing
- Works locally — your prompts never leave your machine
Full spend analytics at http://localhost:8082/dashboard:
- Cost by day / week / month — trend chart with daily breakdown
- By task type — Coding vs Shell vs Agent vs Planning vs Web
- By model — Claude Fable 5, Opus 4.8, Sonnet 4.6, Haiku 4.5, GPT-4o, etc. (218 models total)
- By source — Claude Code, Claude Desktop, VS Code Extensions, OpenClaw, GitHub Copilot (usage only), API providers
- Cache analytics — hit rate, money saved vs. what you'd pay without caching
- Session view — every conversation: tokens in/out, cost, tools used
- RAW Logs — last 500 requests with full prompt preview
- Health grade A–F — scores your efficiency and tells you what to fix
Always-visible cost tracker in your menu bar. Click to see:
- Today's spend + health grade
- 7-day trend chart with hover tooltips
- Breakdown by task, model, cache, tools
- Smart Routing optimizer stats
- One-click sync from local logs
The proxy applies these silently on every request:
| Optimization | What it does | Typical savings |
|---|---|---|
| Prompt Cache | Auto-tags large system prompts and user messages for caching | 60–90% on repeat reads |
| Smart Routing | Routes simple requests to cheaper models (Fable→Haiku, Opus→Sonnet, etc.) | 5–25× cheaper per request |
| Thinking Budget | Caps budget_tokens for extended thinking based on complexity |
80–90% on thinking tokens |
| Message Trim | Removes old messages when context exceeds 50k tokens | Prevents runaway costs |
| Session Cap | Limits history to 40 messages per session | Prevents VS Code "prompt too long" errors |
| Deduplication | Returns cached response for identical requests within 5s | 100% on accidental double-sends |
Smart Routing scores each prompt 0–10 (length, keywords, code presence) and silently downgrades model when complexity is low. You get the same response, cheaper.
Requirements: Python 3.9+ · macOS (Monterey 12+) or Windows 10/11
Step 1 — first time only:
cd ~ && git clone https://github.com/mr-beaver/tokencost && cd tokencost && bash onbording.shThe setup script installs everything and adds a tokencost command to your terminal.
Step 2 — every time after (restart / update):
Open Terminal and type:
tokencost
First install adds this command automatically. Open a new terminal tab after step 1 for it to appear.
The setup script:
- Creates a Python virtualenv and installs dependencies
- Imports your full Claude usage history from local logs
- Sets
ANTHROPIC_BASE_URL=http://localhost:8082in~/.zshrcand macOS launchd - Adds
tokencostalias for quick restarts - Starts the proxy and opens the dashboard
Step 1 — first time only (PowerShell):
cd $HOME; git clone https://github.com/mr-beaver/tokencost; cd tokencost; powershell -ExecutionPolicy Bypass -File onbording.ps1Step 2 — every time after: double-click tokencost.bat in the repo folder, or run the same onbording.ps1 command again. Pick option 1 to start, 2 to disable.
The Windows setup script (onbording.ps1):
- Creates a Python virtualenv and installs dependencies (
requirements.txt) - Imports your full Claude usage history from local logs (
%APPDATA%\Claude,%APPDATA%\Code,~/.claude) - Sets
ANTHROPIC_BASE_URL=http://localhost:8082as a User environment variable (picked up by Claude Code, VS Code, Claude Desktop, and new terminals) - Registers autostart at logon — a Scheduled Task if you run it as Administrator (this also adds the unattended 5-minute auto-sync task), otherwise a no-admin launcher in your Startup folder
- Starts the proxy and opens the dashboard
Auto-sync needs admin. Without elevation the proxy still autostarts and captures live traffic in real time; historical logs sync whenever you run setup or click Sync now in the dashboard. Run
onbording.ps1from an Administrator PowerShell if you want the unattended 5-minute sync as well.
The macOS menu-bar widget is not available on Windows — use the web dashboard at
http://localhost:8082/dashboard.
Updating (Windows): open the dashboard, click the version badge, and copy the Update command — it's shell-agnostic (runs in cmd.exe or PowerShell), does git pull, and restarts the proxy non-interactively. Or just re-run tokencost.bat.
After install, all your Claude Code and VS Code requests flow through the proxy automatically.
The pre-built app is included in the repo — onbording.sh installs and launches it automatically.
To install manually:
cp -R menubar/TokenCostBar.app ~/Applications/
open ~/Applications/TokenCostBar.appBuild from source (requires Xcode Command Line Tools):
cd menubar && bash build.sh
mv TokenCostBar.app ~/Applications/
open ~/Applications/TokenCostBar.appPaths are resolved per-platform — macOS ~/Library/Application Support/..., Windows %APPDATA%\..., Linux ~/.config/....
- Claude Code / Claude CLI —
~/.claude/projects/**/*.jsonl(all platforms) - Claude Desktop —
local-agent-mode-sessions/**/*.jsonlunderClaude(macOS~/Library/Application Support/Claude, Windows%APPDATA%\Claude) - OpenClaw —
~/.openclaw/agents/**/*.jsonl - VS Code Extensions (Cline, Roo Code, Kilo Code, IBM Bob) —
globalStorage/*/tasks/ui_messages.jsonunderCode(macOS~/Library/Application Support/Code, Windows%APPDATA%\Code) - GitHub Copilot — usage tracking from VS Code logs (model + requests, pricing unavailable)
Sync cadence: every 5 minutes on macOS (launchd) and on admin Windows (Scheduled Task); on non-admin Windows, on each setup run or via Sync now in the dashboard.
Set the base URL for each provider you want to track:
export ANTHROPIC_BASE_URL=http://localhost:8082 # Claude (auto-set by onbording.sh)
export OPENAI_BASE_URL=http://localhost:8082/openai # OpenAI
export GROQ_API_BASE=http://localhost:8082/groq # Groq
export MISTRAL_API_BASE=http://localhost:8082/mistral # Mistral
export DEEPSEEK_API_BASE=http://localhost:8082/deepseek # DeepSeekOn Windows, ANTHROPIC_BASE_URL is set for you by onbording.ps1; set additional providers with PowerShell, e.g.:
[Environment]::SetEnvironmentVariable("OPENAI_BASE_URL", "http://localhost:8082/openai", "User")Supported: Anthropic · OpenAI · Groq · Mistral · DeepSeek · xAI · Perplexity · Cerebras · Together · Fireworks · Cohere · OpenRouter · Ollama · Amazon Bedrock · Google Gemini
218 models in the pricing database.
Your tool (Claude Code / VS Code / etc.)
↓
localhost:8082 ← proxy runs here
↓ scores complexity, applies optimizations, logs tokens+cost
api.anthropic.com / api.openai.com / etc.
↓
Response back through proxy → your tool
The proxy is transparent — your tools see it as the real API. Zero changes to your workflow.
TokenCost forwards every Anthropic path transparently (/v1/* and /api/oauth/*), so it can sit downstream of another Anthropic proxy (e.g. a context-compression layer) that points its upstream at TokenCost:
Your tool → other proxy (compress) → localhost:8082 (TokenCost logs cost) → api.anthropic.com
Point the upstream proxy's Anthropic target at http://localhost:8082 (subscription-usage polls pass through untouched and aren't logged).
When enabled (setup script → option 1), the proxy scores the last user message:
| Score | Original model | Routes to | Savings |
|---|---|---|---|
| 0–2 | Fable 5 | Haiku | ~50× cheaper |
| 0–2 | Opus 4.8 | Haiku | ~25× cheaper |
| 0–2 | Sonnet 4.6 | Haiku | ~5× cheaper |
| 3–5 | Fable 5 | Sonnet | ~17× cheaper |
| 3–5 | Opus 4.8 | Sonnet | ~5× cheaper |
| 6–10 | any | unchanged | — |
Simple questions (what is X, explain Y), short messages, and tool-chain intermediates score 0–2. Long coding tasks with keywords like implement, refactor, debug score 6+.
Supported Claude models: Fable 5 ($10/$50M), Opus 4.8 ($5/$25M), Sonnet 4.6 ($3/$15M), Haiku 4.5 ($1/$5M) — plus legacy 3.x versions.
Fable 5 specifics: Fable is only downrouted for very simple requests (score ≤2). For complex tasks (score 3+), Fable is preserved to take advantage of its reasoning capabilities. This means expensive long-horizon tasks stay on Fable, but throwaway pings and quick questions automatically use Haiku.
Does switching models break the cache? No. The cache key is based on request content, not the model — so when the proxy routes Opus → Haiku, the cheaper model still gets the 90% cache discount. The only nuance: different models have different input prices, so the absolute dollar value of the discount varies. But Haiku's base price is so much lower that the combined savings (cheaper model + cache discount) always exceed staying on the original model with cache alone. High-complexity requests that benefit most from caching (score 6–10) are never routed away from the original model.
curl http://localhost:8082/stats?period=today # today's stats (JSON)
curl http://localhost:8082/stats?period=7d # last 7 days
curl http://localhost:8082/stats?period=30d # last 30 days
curl http://localhost:8082/raw-logs?limit=100 # recent requests- Everything stored locally in
tracker.db(SQLite) - Nothing is sent to any external service
- The proxy only reads your token usage metadata — not the content of your conversations (prompt previews are stored locally, never transmitted)
- Stop the proxy anytime via the setup script's option 2 (
onbording.shon macOS,tokencost.baton Windows)
macOS:
bash onbording.sh # choose option 2 — stops proxy, removes env varsWindows: run tokencost.bat (or onbording.ps1) and choose option 2 — stops the proxy, removes the autostart entry (Scheduled Task / Startup launcher), and clears ANTHROPIC_BASE_URL.
⚠️ Don't kill the proxy process manually if you're using Claude Code — Claude routes through it and the call will fail. Always use the setup script's option 2.
tokencost/
├── proxy.py — FastAPI proxy, port 8082
├── optimizer.py — Request optimizations (cache, routing, trim)
├── db.py — SQLite schema, cost calculations, analytics
├── import_history.py — Import historical logs from Claude CLI, Desktop, OpenClaw
├── projects.py — Project/session tracking
├── dashboard.html — Web dashboard UI
├── onbording.sh — Setup / start / stop script (macOS)
├── onbording.ps1 — Setup / start / stop script (Windows)
├── tokencost.bat — Windows launcher (double-click)
├── requirements.txt — Python dependencies
└── menubar/ — macOS SwiftUI menu bar app
├── build.sh
└── Sources/TokenCostBar/
├── MenuBarView.swift
└── StatsModel.swift
No — and this is the most important thing to understand about how the optimizations interact.
Anthropic's prompt caching works at the model level: each model maintains its own isolated KV cache for a given prefix. Haiku and Opus have completely different architectures and incompatible internal representations — there is no cross-family KV sharing. When the proxy routes a request from Opus to Haiku, Haiku builds its own cache entry for that prefix. The next Haiku request with the same prefix reads from Haiku's cache at 0.1× input cost.
The one real cost: the first request after a model switch pays a cache write (1.25× input cost) rather than a cache read (0.1× input cost). If you were switching models constantly — every request — that write overhead would accumulate. But the routing logic prevents this:
tool_resultturns with no user text → scored 10, never routed. Active tool chains stay on the original model.- Any request where the last 4 messages contain
tool_useortool_resultblocks → +2 to score. - Complex keywords (
implement,refactor,debug, code blocks, file extensions) → score 6–10, original model preserved.
In practice, Haiku handles isolated simple questions ("what is X", short Q&A), while Opus stays on multi-turn coding tasks. Both models warm up their own cache entries and stay on them. The write overhead on the first routing switch is a one-time cost; everything after that is cheap reads.
Net result: prompt cache savings + model routing savings compound rather than cancel. Verified in optimizer.py — has_recent_tool_results() and the tool_result type check in complexity_score() are the guards that make this work correctly.
Benchmarks below use Opus 4.8 base pricing with a typical vibe-coding session mix: 35% simple Q&A (score 0–2 → Haiku), 25% medium refactors (score 3–5 → Sonnet), 40% complex tasks (score 6–10 → Opus). System prompt ~3k tokens reused across the session, 15% duplicate requests filtered.
| Optimization | Saves | Rate | How |
|---|---|---|---|
| Prompt cache | $1,200–1,800 | 12–18% | System prompt reads at 0.1× cost |
| Smart routing | $1,600–2,400 | 16–24% | 60% of requests → Haiku/Sonnet |
| Thinking budget | $300–600 | 3–6% | Caps budget_tokens by complexity |
| Dedup + trim | $100–200 | 1–2% | 5s dedup window + 50k context trim |
| Total | $3,800–5,200 | 38–52% | Combined effect |
$30k ┤
$25k ┤ ◉ Max savings
$20k ┤
$15k ┤
$10k ┤ ◉
$5k ┤ ◉
$2k ┤ ◉
$0k ┼────────────────────────────────────────────────────
$1k $5k $10k $20k $50k
Monthly spend
At $10k/month: saves ~$3,800–5,200 → $45–62k/year.
At $50k/month: saves ~$19k–26k.
| Score | Routes to | Mix |
|---|---|---|
| 0–2 (explain, what is, short Q&A) | Haiku | 35% |
| 3–5 (medium refactors, review) | Sonnet | 25% |
| 6–10 (implement, debug, architect) | Opus | 40% |
Results vary by usage pattern. Heavy Opus users with long system prompts see the largest gains — prompt cache + smart routing compound. Minimal savings if you already route manually or use Haiku exclusively.
MIT