An MCP-compatible outbound LLM request shim that cuts cloud token usage by running a local small model (via Ollama or any OpenAI-compatible endpoint) as a triage layer in front of a frontier cloud model.
Individual tactics (routing, compression, caching, local drafting) are known in isolation. What is not well-documented is how they combine on a realistic coding-agent workload, and which combinations give the best marginal savings vs. quality loss. This project answers that question empirically.
Requires Python 3.12+, uv, and Ollama running locally.
```shell
# Install
uv sync

# Pull the required local models
ollama pull llama3.2:3b
ollama pull nomic-embed-text

# Configure
cp config.example.yaml config.yaml
# Edit config.yaml: set your cloud endpoint + API key env var

# Run tests
uv run pytest -q

# Start the proxy
uv run local-splitter serve-http --config config.yaml
```

local-splitter sits between your coding agent and the cloud LLM. It exposes two interfaces:
- OpenAI-compatible HTTP proxy (`/v1/chat/completions`) — point any agent at `OPENAI_API_BASE=http://127.0.0.1:7788/v1`
- MCP stdio server — register with any MCP-aware agent (Claude Code, Cursor, etc.)
Both interfaces feed the same internal pipeline:
```
Request → T1 route → T3 cache → T2 compress → T6 intent
        → T5 diff → T7 batch → T4 draft → Cloud
```
Each tactic is independently togglable via config. Disabled tactics are zero-cost pass-throughs.
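As a sketch, per-tactic toggles in config.yaml might look like this (key names are illustrative — config.example.yaml has the real schema):

```yaml
# Hypothetical sketch; see config.example.yaml for the actual keys
tactics:
  route: true        # T1
  compress: true     # T2
  sem_cache: false   # T3 - disabled tactics are zero-cost pass-throughs
  draft: false       # T4
  diff: false        # T5
  intent: false      # T6
  batch: false       # T7
```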
| # | Name | Type | What it does |
|---|---|---|---|
| T1 | route | short-circuit | Local model classifies requests as TRIVIAL/COMPLEX. Trivials answered locally — never hit the cloud. |
| T2 | compress | transform | Local model shortens long prompts (system prompts, history, RAG chunks) before they reach the cloud. |
| T3 | sem-cache | short-circuit | Semantic similarity cache (SQLite + sqlite-vec). Near-duplicate queries return cached responses. |
| T4 | draft | replace | Local model drafts the answer; cloud model reviews/patches it instead of generating from scratch. |
| T5 | diff | transform | For code-edit requests, extracts minimal diff context so the cloud only sees the surgical change. |
| T6 | intent | transform | Parses verbose free-text prompts into structured intent fields — cloud gets a tight template. |
| T7 | batch | tag | Tags stable prompt prefixes with `cache_control` for vendor-side caching discounts. |
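To illustrate the T3 mechanism, here is a minimal semantic-cache lookup in Python: cosine similarity over stored embeddings with a hit threshold. The real implementation is backed by SQLite + sqlite-vec; the class and threshold below are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Toy in-memory stand-in for the sqlite-vec backed cache."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def store(self, embedding, response):
        self.entries.append((embedding, response))

    def lookup(self, embedding):
        # Return the best-matching cached response, or None below threshold.
        best, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

cache = SemanticCache()
cache.store([1.0, 0.0], "cached answer")
print(cache.lookup([0.99, 0.05]))  # near-duplicate query → "cached answer"
print(cache.lookup([0.0, 1.0]))    # unrelated query → None
```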
Fail-open everywhere: if the local model is unreachable or returns garbage, every tactic defaults to passing the request through to the cloud unchanged.
Two sets of presets — pick based on how you're using the splitter.
For agents pointed at the splitter's HTTP endpoint. Requires a cloud backend — the splitter calls it on your behalf.
| Preset | Tactics | Savings | Use case |
|---|---|---|---|
| `proxy/conservative` | T1 | 29-69% | Safest — only routes trivials locally |
| `proxy/recommended` | T1+T2 | 45-79% | Best default — route + compress |
| `proxy/max-savings` | T1+T2+T3 | 43-80% | Adds caching — best for repetitive workloads |
| `proxy/rag-heavy` | T1+T2+T3+T4+T5 | 51% on RAG | Long-context workloads with retrieved chunks |
```shell
cp configs/proxy/recommended.yaml config.yaml
# Edit: set your cloud endpoint + API key env var
```

For Claude Code, Cursor, and other MCP-aware agents. No cloud backend needed — the splitter answers trivials locally and returns compressed prompts for the agent's own model.
| Preset | Tactics | Savings | Use case |
|---|---|---|---|
| `mcp/conservative` | T1 | 29-69% | Safest — complex requests pass through untouched |
| `mcp/recommended` | T1+T2 | 45-79% | Best default — route + compress |
| `mcp/max-savings` | T1+T2+T3 | 43-80% | Adds caching — compounds with query repetition |
| `mcp/rag-heavy` | T1+T2+T3+T5 | 44-51% | Long-context RAG workloads |
```shell
cp configs/mcp/recommended.yaml config.yaml
# No cloud config needed — just Ollama
```

Evaluated with llama3.2:3b (local) and gemma3:4b (cloud), 10 samples per workload, mean of 2 runs:
| Config | WL1 (edit) | WL2 (explain) | WL3 (chat) | WL4 (RAG) | Avg |
|---|---|---|---|---|---|
| conservative (T1) | 29% | 69% | 59% | 38% | 49% |
| recommended (T1+T2) | 45% | 79% | 57% | 44% | 56% |
| max-savings (T1+T2+T3) | 43% | 80% | 60% | 44% | 56% |
| rag-heavy (proxy, +T4+T5) | 29% | 72% | 59% | 51% | 53% |
| rag-heavy (mcp, +T5) | — | — | — | 44-51% | — |
Key observations:
- Start with `recommended` (T1+T2): 56% average savings, works across all workload types.
- `max-savings` adds T3 caching — same average but compounds over repeated queries (support bots, multi-user teams).
- `rag-heavy` proxy wins on WL4 because T4 (draft-review) helps with long outputs. MCP mode skips T4 (the agent is the reviewer).
- `conservative` still saves 49% — use it if quality is the top priority.
- Quality cost: baseline wins ~3x more judge verdicts on explanation-heavy workloads. Acceptable on edit/RAG workloads. See the paper for details.
Config resolution order: explicit `--config` flag > `$LOCAL_SPLITTER_CONFIG` env var > `.local_splitter/config.yaml` > `./config.yaml`.
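The resolution order amounts to "first match wins". A sketch of that precedence as a helper function — the function name and existence checks are hypothetical, not the splitter's actual loader:

```python
import os
from pathlib import Path

def resolve_config(explicit=None):
    """Return the first config path that applies, mirroring the documented
    precedence. Hypothetical helper for illustration only."""
    candidates = [
        explicit,                                 # 1. explicit --config flag
        os.environ.get("LOCAL_SPLITTER_CONFIG"),  # 2. env var
        ".local_splitter/config.yaml",            # 3. project-local default
        "./config.yaml",                          # 4. cwd fallback
    ]
    for candidate in candidates:
        if candidate and Path(candidate).exists():
            return Path(candidate)
    return None
```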
Three ways to use local-splitter, from most transparent to most explicit.
The agent has no idea the splitter exists. Every request is intercepted, tactics run, and the response comes back with fewer cloud tokens spent.
Requires a cloud backend — use a `configs/proxy/` preset.
```shell
# 1. Configure
cp configs/proxy/recommended.yaml config.yaml
# Edit config.yaml: set your cloud endpoint + API key env var

# 2. Start the proxy
uv run local-splitter serve-http --config config.yaml

# 3. Point your agent at it
```

| Agent | Command |
|---|---|
| Claude Code | `ANTHROPIC_BASE_URL=http://127.0.0.1:7788 claude` |
| Cursor / Continue | Set API Base to `http://127.0.0.1:7788/v1` in settings |
| Codex CLI | `OPENAI_API_BASE=http://127.0.0.1:7788/v1 codex` |
| Any OpenAI-compatible | `export OPENAI_API_BASE=http://127.0.0.1:7788/v1` |
The proxy speaks both OpenAI (`/v1/chat/completions`) and Anthropic (`/v1/messages`) formats with streaming support.
The agent registers the splitter as an MCP tool and calls `split.transform` before sending prompts. No cloud backend needed — the agent IS the cloud model.
```shell
# 1. Configure (local model only)
cp configs/mcp/recommended.yaml config.yaml

# 2. Register with Claude Code
```

Add to `~/.claude/settings.json` or your project's `.mcp.json`:
```json
{
  "mcpServers": {
    "local-splitter": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/local-splitter",
               "local-splitter", "serve-mcp", "--config", "config.yaml"]
    }
  }
}
```

The agent can then call these MCP tools:
| Tool | What it does |
|---|---|
| `split.transform` | Run tactics, return local answer or transformed prompt |
| `split.complete` | Full pipeline (auto-detects local-only mode) |
| `split.classify` | T1 classifier only — TRIVIAL or COMPLEX |
| `split.cache_lookup` | Check T3 cache without writing |
| `split.stats` | Aggregate metrics since startup |
| `split.config` | Read-only config view |
How `split.transform` works:

```
Agent calls split.transform(messages=[...])
        │
        ├─ TRIVIAL (T1) → {"action": "answer", "response": "2 + 2 = 4"}
        │                 Agent uses this directly. Zero cloud tokens.
        │
        └─ COMPLEX → {"action": "passthrough", "messages": [...]}
                     Agent sends the (compressed) messages to its own model.
```
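Agent-side handling of the result is a simple branch on the action field. A sketch, assuming the response shapes shown above; `send_to_model` stands in for the agent's own completion call:

```python
def handle_transform(result, send_to_model):
    """Dispatch on split.transform's action field. `send_to_model` is a
    hypothetical stand-in for the agent's own model call."""
    if result["action"] == "answer":
        return result["response"]             # served locally, zero cloud tokens
    return send_to_model(result["messages"])  # compressed passthrough

print(handle_transform({"action": "answer", "response": "2 + 2 = 4"},
                       send_to_model=lambda msgs: "(cloud call)"))  # → 2 + 2 = 4
```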
To make the agent use `split.transform` by default, add to your project's CLAUDE.md:

```
Before processing any user request, call split.transform with the full
message. If action=answer, use that response. If action=passthrough,
use the transformed_messages instead of the original prompt.
```
One-shot CLI command for agent hooks. Reads a prompt, runs tactics, prints JSON to stdout. Bridges local-only mode into a transparent flow.
```shell
# Plain text — answered locally
echo "what is 2+2" | local-splitter transform -c config.yaml
# → {"action": "answer", "response": "2 + 2 = 4", "served_by": "local", ...}

# Complex — passes through
echo "refactor the auth middleware with JWT rotation" | local-splitter transform -c config.yaml
# → {"action": "passthrough", "messages": [...], ...}

# Or with --prompt flag
local-splitter transform -p "explain merge sort" -c config.yaml
```

Claude Code hook setup — add to `~/.claude/settings.json`:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Task|Bash|Edit|Write",
        "hook": "echo \"$PROMPT\" | local-splitter transform -c /path/to/config.yaml"
      }
    ]
  }
}
```

| Endpoint | Format | Streaming |
|---|---|---|
| `POST /v1/chat/completions` | OpenAI | SSE (`stream: true`) |
| `POST /v1/messages` | Anthropic | SSE (`stream: true`) |
| `GET /v1/models` | OpenAI | — |
| `GET /v1/splitter/stats` | JSON | — |
| `GET /healthz` | JSON | — |
Both HTTP surfaces add a `splitter` key to responses with a pipeline trace:
```json
{
  "splitter": {
    "served_by": "local",
    "latency_ms": 42.3,
    "pipeline_trace": [
      { "stage": "t1_classify", "decision": "TRIVIAL", "ms": 12.1 },
      { "stage": "t1_local_answer", "decision": "APPLIED", "ms": 30.2 }
    ],
    "tokens_local": { "input": 15, "output": 8 }
  }
}
```

Override the pipeline per-request:
```python
# Via HTTP extra_body (OpenAI surface)
{"extra_body": {"splitter": {"force_local": True}}}
{"extra_body": {"splitter": {"force_cloud": True}}}

# Via MCP model_hint
{"model_hint": "local"}
{"model_hint": "cloud"}
```

The evaluation harness measures per-tactic and per-combination savings across four workload classes:
| Workload | Description | Trivial% |
|---|---|---|
| WL1 edit-heavy | Refactoring sessions, many file edits | ~25% |
| WL2 explain | "What does X do" onboarding questions | ~45% |
| WL3 chat | General-purpose mixed chat | ~50% |
| WL4 RAG | Long system prompts with retrieved chunks | ~20% |
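The headline savings figure is the fraction of cloud tokens avoided relative to the no-tactics baseline. A sketch of that computation — the harness's real metric lives in evals/metrics.py, so treat this as illustrative:

```python
def token_savings(baseline_cloud_tokens, tactic_cloud_tokens):
    """Fraction of cloud tokens avoided vs. the no-tactics baseline run."""
    if baseline_cloud_tokens == 0:
        return 0.0
    return 1.0 - tactic_cloud_tokens / baseline_cloud_tokens

# e.g. if the baseline spends 1000 cloud tokens and a tactic subset spends 440:
print(f"{token_savings(1000, 440):.0%}")  # → 56%
```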
```shell
# Specific subsets on specific workloads
uv run local-splitter eval \
  -w evals/workloads/wl3_chat.jsonl \
  --config config.yaml \
  --subsets baseline,T1_only,T1_T2_T3

# All subsets on all workloads (full matrix)
uv run local-splitter eval \
  -w evals/workloads/wl1_edit.jsonl \
  -w evals/workloads/wl2_explain.jsonl \
  -w evals/workloads/wl3_chat.jsonl \
  -w evals/workloads/wl4_rag.jsonl \
  --config config.yaml

# Full eval script (produces paper-ready summary)
uv run python evals/run_eval.py

# Run specific subsets only
uv run python evals/run_eval.py T5_only T6_only T7_only

# Include judge-model quality evaluation (pairwise A/B comparison)
uv run python evals/run_eval.py --quality
```

Results land in `.local_splitter/eval/`:
- `results.csv` — one row per (workload × tactic subset)
- `runs.jsonl` — per-sample detail log
- `summary.json` — aggregates for the paper (includes quality verdicts with `--quality`)
Available tactic subsets:

```
baseline, T1_only, T2_only, T3_only, T4_only, T5_only,
T6_only, T7_only, T1_T2, T1_T3, T1_T2_T3, T1_T3_T4,
T1_T2_T3_T6, all
```
```
src/local_splitter/
├── cli.py               # Typer CLI (serve-http, serve-mcp, eval)
├── config.py            # YAML config loader
├── models/              # Backend implementations
│   ├── base.py          # ChatClient protocol + data types
│   ├── ollama.py        # Ollama native API client
│   ├── openai_compat.py # OpenAI-compatible client
│   └── factory.py       # Build client from config
├── pipeline/            # The seven tactics + orchestrator
│   ├── __init__.py      # Pipeline orchestrator
│   ├── types.py         # PipelineRequest/Response, StageEvent
│   ├── route.py         # T1 — local classifier
│   ├── compress.py      # T2 — prompt compression
│   ├── sem_cache.py     # T3 — semantic cache (sqlite-vec)
│   ├── draft.py         # T4 — draft + review
│   ├── diff.py          # T5 — minimal diff extraction
│   ├── intent.py        # T6 — intent extraction
│   └── batch.py         # T7 — prompt-cache tagging
├── transport/           # External interfaces
│   ├── http_proxy.py    # FastAPI OpenAI-compat proxy
│   └── mcp_server.py    # FastMCP stdio server
└── evals/               # Evaluation harness
    ├── types.py         # WorkloadSample, SampleResult, RunResult
    ├── runner.py        # Matrix runner + tactic subsets
    ├── metrics.py       # Token savings, cost, routing accuracy
    ├── quality.py       # Judge-model pairwise quality evaluation
    └── report.py        # CSV + markdown export

evals/workloads/         # Evaluation datasets (JSONL)
tests/                   # 175 tests
```
```shell
uv sync                 # install deps
uv run pytest -q        # run tests (175 tests, <1s)
uv run ruff check src/  # lint
```

To add a new tactic:

- Create `src/local_splitter/pipeline/<name>.py` with an `apply()` function
- Wire it into `Pipeline.complete()` in `pipeline/__init__.py`
- Add the config flag to `TacticsConfig` in `config.py`
- Add tests in `tests/test_pipeline_<name>.py`
- Add eval subsets in `evals/runner.py`
Every tactic must:
- Emit a `StageEvent` for observability
- Fail open on errors (pass request through unchanged)
- Use `temperature=0` for deterministic classifier/extractor calls
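A new tactic's `apply()` typically follows the same fail-open shape. A hypothetical skeleton — the real `StageEvent` and request types live in `pipeline/types.py`, and the simplified stand-ins below are illustrative only:

```python
import time
from dataclasses import dataclass, field

@dataclass
class StageEvent:          # simplified stand-in for pipeline/types.py
    stage: str
    decision: str
    ms: float

@dataclass
class Request:             # simplified stand-in for PipelineRequest
    messages: list
    events: list = field(default_factory=list)

def transform_messages(messages):
    # Tactic-specific work would go here; simulate a local-model failure.
    raise RuntimeError("local model unreachable")

def apply(request):
    """Hypothetical tactic: always emits a StageEvent, fails open on errors."""
    start = time.perf_counter()
    try:
        messages, decision = transform_messages(request.messages), "APPLIED"
    except Exception:
        messages, decision = request.messages, "SKIPPED"  # pass through unchanged
    request.events.append(
        StageEvent("t8_example", decision, (time.perf_counter() - start) * 1000))
    request.messages = messages
    return request

req = apply(Request(messages=["hello"]))
print(req.messages, req.events[0].decision)  # → ['hello'] SKIPPED
```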
If you use local-splitter in your research, please cite:
```bibtex
@article{agyemang2026localsplitter,
  title   = {Local-Splitter: A Measurement Study of Seven Tactics for
             Reducing Cloud LLM Token Usage on Coding-Agent Workloads},
  author  = {Owusu Agyemang, Justice and Kponyo, Jerry John and
             Amponsah, Elliot and Addo Boakye, Godfred Manu and
             Obour Agyekum, Kwame Opuni-Boachie},
  journal = {arXiv preprint arXiv:2604.12301},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.12301}
}
```

MIT
- Python scaffold (uv, package skeleton, 175 tests)
- Model backends (Ollama + OpenAI-compatible)
- Transport layer (MCP stdio + HTTP proxy: OpenAI + Anthropic)
- Streaming support (SSE, both API surfaces)
- All seven tactics implemented and tested
- Evaluation harness (runner, metrics, quality judge, CSV/markdown)
- CLI eval command + seed workloads (4 classes, 40 samples)
- Evaluation with real numbers (Ollama llama3.2:3b + gemma3)
- Paper with results (tables, figures, quality evaluation)