Skip to content

mr-beaver/tokencost

Repository files navigation

TokenCost

Know exactly what you spend on AI — and automatically spend less.

A local proxy that sits between your tools and the LLM APIs. It tracks every token, calculates real costs, and optimizes requests on the fly. Works with Claude Code, VS Code, Claude Desktop, OpenClaw, and 15+ API providers (OpenAI, Groq, Mistral, DeepSeek, xAI, Perplexity, Cerebras, Together, Fireworks, Cohere, OpenRouter, Ollama, Amazon Bedrock, Google Gemini, and more).

image image image image

Why

If you use Claude Code or VS Code AI tools heavily, you're probably spending $2–10/day without knowing exactly where it goes. This tool:

  • Shows you a real-time breakdown by tool, model, session, and task type
  • Automatically reduces your bill by 50–80% via prompt caching and smart model routing
  • Works locally — your prompts never leave your machine

What it does

📊 Live Dashboard

Full spend analytics at http://localhost:8082/dashboard:

  • Cost by day / week / month — trend chart with daily breakdown
  • By task type — Coding vs Shell vs Agent vs Planning vs Web
  • By model — Claude Fable 5, Opus 4.8, Sonnet 4.6, Haiku 4.5, GPT-4o, etc. (218 models total)
  • By source — Claude Code, Claude Desktop, VS Code Extensions, OpenClaw, GitHub Copilot (usage only), API providers
  • Cache analytics — hit rate, money saved vs. what you'd pay without caching
  • Session view — every conversation: tokens in/out, cost, tools used
  • RAW Logs — last 500 requests with full prompt preview
  • Health grade A–F — scores your efficiency and tells you what to fix

🍎 macOS Menu Bar Widget

Always-visible cost tracker in your menu bar. Click to see:

  • Today's spend + health grade
  • 7-day trend chart with hover tooltips
  • Breakdown by task, model, cache, tools
  • Smart Routing optimizer stats
  • One-click sync from local logs
image

⚡ Automatic Optimizations

The proxy applies these silently on every request:

Optimization What it does Typical savings
Prompt Cache Auto-tags large system prompts and user messages for caching 60–90% on repeat reads
Smart Routing Routes simple requests to cheaper models (Fable→Haiku, Opus→Sonnet, etc.) 5–25× cheaper per request
Thinking Budget Caps budget_tokens for extended thinking based on complexity 80–90% on thinking tokens
Message Trim Removes old messages when context exceeds 50k tokens Prevents runaway costs
Session Cap Limits history to 40 messages per session Prevents VS Code "prompt too long" errors
Deduplication Returns cached response for identical requests within 5s 100% on accidental double-sends

Smart Routing scores each prompt 0–10 (length, keywords, code presence) and silently downgrades model when complexity is low. You get the same response, cheaper.

image

Install

Requirements: Python 3.9+ · macOS (Monterey 12+) or Windows 10/11

macOS

Step 1 — first time only:

cd ~ && git clone https://github.com/mr-beaver/tokencost && cd tokencost && bash onbording.sh

The setup script installs everything and adds a tokencost command to your terminal.

Step 2 — every time after (restart / update):

Open Terminal and type:

tokencost

First install adds this command automatically. Open a new terminal tab after step 1 for it to appear.

The setup script:

  1. Creates a Python virtualenv and installs dependencies
  2. Imports your full Claude usage history from local logs
  3. Sets ANTHROPIC_BASE_URL=http://localhost:8082 in ~/.zshrc and macOS launchd
  4. Adds tokencost alias for quick restarts
  5. Starts the proxy and opens the dashboard

Windows

Step 1 — first time only (PowerShell):

cd $HOME; git clone https://github.com/mr-beaver/tokencost; cd tokencost; powershell -ExecutionPolicy Bypass -File onbording.ps1

Step 2 — every time after: double-click tokencost.bat in the repo folder, or run the same onbording.ps1 command again. Pick option 1 to start, 2 to disable.

The Windows setup script (onbording.ps1):

  1. Creates a Python virtualenv and installs dependencies (requirements.txt)
  2. Imports your full Claude usage history from local logs (%APPDATA%\Claude, %APPDATA%\Code, ~/.claude)
  3. Sets ANTHROPIC_BASE_URL=http://localhost:8082 as a User environment variable (picked up by Claude Code, VS Code, Claude Desktop, and new terminals)
  4. Registers autostart at logon — a Scheduled Task if you run it as Administrator (this also adds the unattended 5-minute auto-sync task), otherwise a no-admin launcher in your Startup folder
  5. Starts the proxy and opens the dashboard

Auto-sync needs admin. Without elevation the proxy still autostarts and captures live traffic in real time; historical logs sync whenever you run setup or click Sync now in the dashboard. Run onbording.ps1 from an Administrator PowerShell if you want the unattended 5-minute sync as well.

The macOS menu-bar widget is not available on Windows — use the web dashboard at http://localhost:8082/dashboard.

Updating (Windows): open the dashboard, click the version badge, and copy the Update command — it's shell-agnostic (runs in cmd.exe or PowerShell), does git pull, and restarts the proxy non-interactively. Or just re-run tokencost.bat.

After install, all your Claude Code and VS Code requests flow through the proxy automatically.

macOS Menu Bar Widget (optional)

The pre-built app is included in the repo — onbording.sh installs and launches it automatically.

To install manually:

cp -R menubar/TokenCostBar.app ~/Applications/
open ~/Applications/TokenCostBar.app

Build from source (requires Xcode Command Line Tools):

cd menubar && bash build.sh
mv TokenCostBar.app ~/Applications/
open ~/Applications/TokenCostBar.app

Supported Sources

Local applications (auto-tracked from local logs)

Paths are resolved per-platform — macOS ~/Library/Application Support/..., Windows %APPDATA%\..., Linux ~/.config/....

  • Claude Code / Claude CLI~/.claude/projects/**/*.jsonl (all platforms)
  • Claude Desktoplocal-agent-mode-sessions/**/*.jsonl under Claude (macOS ~/Library/Application Support/Claude, Windows %APPDATA%\Claude)
  • OpenClaw~/.openclaw/agents/**/*.jsonl
  • VS Code Extensions (Cline, Roo Code, Kilo Code, IBM Bob) — globalStorage/*/tasks/ui_messages.json under Code (macOS ~/Library/Application Support/Code, Windows %APPDATA%\Code)
  • GitHub Copilot — usage tracking from VS Code logs (model + requests, pricing unavailable)

Sync cadence: every 5 minutes on macOS (launchd) and on admin Windows (Scheduled Task); on non-admin Windows, on each setup run or via Sync now in the dashboard.

API Providers (real-time via proxy — set base URL)

Set the base URL for each provider you want to track:

export ANTHROPIC_BASE_URL=http://localhost:8082            # Claude (auto-set by onbording.sh)
export OPENAI_BASE_URL=http://localhost:8082/openai        # OpenAI
export GROQ_API_BASE=http://localhost:8082/groq            # Groq
export MISTRAL_API_BASE=http://localhost:8082/mistral      # Mistral
export DEEPSEEK_API_BASE=http://localhost:8082/deepseek    # DeepSeek

On Windows, ANTHROPIC_BASE_URL is set for you by onbording.ps1; set additional providers with PowerShell, e.g.:

[Environment]::SetEnvironmentVariable("OPENAI_BASE_URL", "http://localhost:8082/openai", "User")

Supported: Anthropic · OpenAI · Groq · Mistral · DeepSeek · xAI · Perplexity · Cerebras · Together · Fireworks · Cohere · OpenRouter · Ollama · Amazon Bedrock · Google Gemini

218 models in the pricing database.


How it works

Your tool (Claude Code / VS Code / etc.)
        ↓
  localhost:8082  ← proxy runs here
        ↓  scores complexity, applies optimizations, logs tokens+cost
  api.anthropic.com / api.openai.com / etc.
        ↓
  Response back through proxy → your tool

The proxy is transparent — your tools see it as the real API. Zero changes to your workflow.

Chaining behind another proxy

TokenCost forwards every Anthropic path transparently (/v1/* and /api/oauth/*), so it can sit downstream of another Anthropic proxy (e.g. a context-compression layer) that points its upstream at TokenCost:

Your tool → other proxy (compress) → localhost:8082 (TokenCost logs cost) → api.anthropic.com

Point the upstream proxy's Anthropic target at http://localhost:8082 (subscription-usage polls pass through untouched and aren't logged).


Smart Routing details

When enabled (setup script → option 1), the proxy scores the last user message:

Score Original model Routes to Savings
0–2 Fable 5 Haiku ~50× cheaper
0–2 Opus 4.8 Haiku ~25× cheaper
0–2 Sonnet 4.6 Haiku ~5× cheaper
3–5 Fable 5 Sonnet ~17× cheaper
3–5 Opus 4.8 Sonnet ~5× cheaper
6–10 any unchanged

Simple questions (what is X, explain Y), short messages, and tool-chain intermediates score 0–2. Long coding tasks with keywords like implement, refactor, debug score 6+.

Supported Claude models: Fable 5 ($10/$50M), Opus 4.8 ($5/$25M), Sonnet 4.6 ($3/$15M), Haiku 4.5 ($1/$5M) — plus legacy 3.x versions.

Fable 5 specifics: Fable is only downrouted for very simple requests (score ≤2). For complex tasks (score 3+), Fable is preserved to take advantage of its reasoning capabilities. This means expensive long-horizon tasks stay on Fable, but throwaway pings and quick questions automatically use Haiku.

Does switching models break the cache? No. The cache key is based on request content, not the model — so when the proxy routes Opus → Haiku, the cheaper model still gets the 90% cache discount. The only nuance: different models have different input prices, so the absolute dollar value of the discount varies. But Haiku's base price is so much lower that the combined savings (cheaper model + cache discount) always exceed staying on the original model with cache alone. High-complexity requests that benefit most from caching (score 6–10) are never routed away from the original model.


Stats API

curl http://localhost:8082/stats?period=today   # today's stats (JSON)
curl http://localhost:8082/stats?period=7d      # last 7 days
curl http://localhost:8082/stats?period=30d     # last 30 days
curl http://localhost:8082/raw-logs?limit=100   # recent requests

Data & Privacy

  • Everything stored locally in tracker.db (SQLite)
  • Nothing is sent to any external service
  • The proxy only reads your token usage metadata — not the content of your conversations (prompt previews are stored locally, never transmitted)
  • Stop the proxy anytime via the setup script's option 2 (onbording.sh on macOS, tokencost.bat on Windows)

Stop / Uninstall

macOS:

bash onbording.sh   # choose option 2 — stops proxy, removes env vars

Windows: run tokencost.bat (or onbording.ps1) and choose option 2 — stops the proxy, removes the autostart entry (Scheduled Task / Startup launcher), and clears ANTHROPIC_BASE_URL.

⚠️ Don't kill the proxy process manually if you're using Claude Code — Claude routes through it and the call will fail. Always use the setup script's option 2.


Project structure

tokencost/
├── proxy.py           — FastAPI proxy, port 8082
├── optimizer.py       — Request optimizations (cache, routing, trim)
├── db.py              — SQLite schema, cost calculations, analytics
├── import_history.py  — Import historical logs from Claude CLI, Desktop, OpenClaw
├── projects.py        — Project/session tracking
├── dashboard.html     — Web dashboard UI
├── onbording.sh       — Setup / start / stop script (macOS)
├── onbording.ps1      — Setup / start / stop script (Windows)
├── tokencost.bat      — Windows launcher (double-click)
├── requirements.txt   — Python dependencies
└── menubar/           — macOS SwiftUI menu bar app
    ├── build.sh
    └── Sources/TokenCostBar/
        ├── MenuBarView.swift
        └── StatsModel.swift

FAQ

Does switching models break the prompt cache?

No — and this is the most important thing to understand about how the optimizations interact.

Anthropic's prompt caching works at the model level: each model maintains its own isolated KV cache for a given prefix. Haiku and Opus have completely different architectures and incompatible internal representations — there is no cross-family KV sharing. When the proxy routes a request from Opus to Haiku, Haiku builds its own cache entry for that prefix. The next Haiku request with the same prefix reads from Haiku's cache at 0.1× input cost.

The one real cost: the first request after a model switch pays a cache write (1.25× input cost) rather than a cache read (0.1× input cost). If you were switching models constantly — every request — that write overhead would accumulate. But the routing logic prevents this:

  • tool_result turns with no user text → scored 10, never routed. Active tool chains stay on the original model.
  • Any request where the last 4 messages contain tool_use or tool_result blocks → +2 to score.
  • Complex keywords (implement, refactor, debug, code blocks, file extensions) → score 6–10, original model preserved.

In practice, Haiku handles isolated simple questions ("what is X", short Q&A), while Opus stays on multi-turn coding tasks. Both models warm up their own cache entries and stay on them. The write overhead on the first routing switch is a one-time cost; everything after that is cheap reads.

Net result: prompt cache savings + model routing savings compound rather than cancel. Verified in optimizer.pyhas_recent_tool_results() and the tool_result type check in complexity_score() are the guards that make this work correctly.


How much can you save?

Benchmarks below use Opus 4.8 base pricing with a typical vibe-coding session mix: 35% simple Q&A (score 0–2 → Haiku), 25% medium refactors (score 3–5 → Sonnet), 40% complex tasks (score 6–10 → Opus). System prompt ~3k tokens reused across the session, 15% duplicate requests filtered.

Savings from $10k/month

Optimization Saves Rate How
Prompt cache $1,200–1,800 12–18% System prompt reads at 0.1× cost
Smart routing $1,600–2,400 16–24% 60% of requests → Haiku/Sonnet
Thinking budget $300–600 3–6% Caps budget_tokens by complexity
Dedup + trim $100–200 1–2% 5s dedup window + 50k context trim
Total $3,800–5,200 38–52% Combined effect

Savings by spend level

 $30k ┤
 $25k ┤                                                    ◉ Max savings
 $20k ┤
 $15k ┤
 $10k ┤                                         ◉
  $5k ┤                              ◉
  $2k ┤               ◉
  $0k ┼────────────────────────────────────────────────────
       $1k           $5k          $10k        $20k        $50k
                          Monthly spend

At $10k/month: saves ~$3,800–5,200 → $45–62k/year.
At $50k/month: saves ~$19k–26k.

Usage mix assumptions

Score Routes to Mix
0–2 (explain, what is, short Q&A) Haiku 35%
3–5 (medium refactors, review) Sonnet 25%
6–10 (implement, debug, architect) Opus 40%

Results vary by usage pattern. Heavy Opus users with long system prompts see the largest gains — prompt cache + smart routing compound. Minimal savings if you already route manually or use Haiku exclusively.


License

MIT

About

See exactly what you spend on Claude CLI + cost optimizations functions

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors