An MCP-compatible outbound LLM request shim that cuts cloud token usage by running a local small model (via Ollama or any OpenAI-compatible endpoint) as a triage layer in front of a frontier cloud model.
Individual tactics (routing, compression, caching, local drafting) are known in isolation. What is not well-documented is how they combine on a realistic coding-agent workload, and which combinations give the best marginal savings vs. quality loss. This project answers that question empirically.
Requires Python 3.12+, uv, and Ollama running locally.
```shell
# Install
uv sync

# Pull the required local models
ollama pull llama3.2:3b
ollama pull nomic-embed-text

# Configure
cp config.example.yaml config.yaml
# Edit config.yaml: set your cloud endpoint + API key env var

# Run tests
uv run pytest -q

# Start the proxy
uv run local-splitter serve-http --config config.yaml
```

local-splitter sits between your coding agent and the cloud LLM. It exposes two interfaces:
- OpenAI-compatible HTTP proxy (`/v1/chat/completions`) — point any agent at `OPENAI_API_BASE=http://127.0.0.1:7788/v1`
- MCP stdio server — register with any MCP-aware agent (Claude Code, Cursor, etc.)
Both interfaces feed the same internal pipeline:
```
Request → T1 route → T3 cache → T2 compress → T6 intent
        → T5 diff → T7 batch → T4 draft → Cloud
```
Each tactic is independently togglable via config. Disabled tactics are zero-cost pass-throughs.
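As a sketch, per-tactic toggles in config.yaml might look like this (key names are illustrative — config.example.yaml has the real schema):

```yaml
# Hypothetical sketch; see config.example.yaml for the actual keys
tactics:
  route: true        # T1
  compress: true     # T2
  sem_cache: false   # T3 - disabled tactics are zero-cost pass-throughs
  draft: false       # T4
  diff: false        # T5
  intent: false      # T6
  batch: false       # T7
```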
| # | Name | Type | What it does |
|---|---|---|---|
| T1 | route | short-circuit | Local model classifies requests as TRIVIAL/COMPLEX. Trivials answered locally — never hit the cloud. |
| T2 | compress | transform | Local model shortens long prompts (system prompts, history, RAG chunks) before they reach the cloud. |
| T3 | sem-cache | short-circuit | Semantic similarity cache (SQLite + sqlite-vec). Near-duplicate queries return cached responses. |
| T4 | draft | replace | Local model drafts the answer; cloud model reviews/patches it instead of generating from scratch. |
| T5 | diff | transform | For code-edit requests, extracts minimal diff context so the cloud only sees the surgical change. |
| T6 | intent | transform | Parses verbose free-text prompts into structured intent fields — cloud gets a tight template. |
| T7 | batch | tag | Tags stable prompt prefixes with `cache_control` for vendor-side caching discounts. |
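To illustrate the T3 mechanism, here is a minimal semantic-cache lookup in Python: cosine similarity over stored embeddings with a hit threshold. The real implementation is backed by SQLite + sqlite-vec; the class and threshold below are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Toy in-memory stand-in for the sqlite-vec backed cache."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def store(self, embedding, response):
        self.entries.append((embedding, response))

    def lookup(self, embedding):
        # Return the best-matching cached response, or None below threshold.
        best, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

cache = SemanticCache()
cache.store([1.0, 0.0], "cached answer")
print(cache.lookup([0.99, 0.05]))  # near-duplicate query → "cached answer"
print(cache.lookup([0.0, 1.0]))    # unrelated query → None
```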
Fail-open everywhere: if the local model is unreachable or returns garbage, every tactic defaults to passing the request through to the cloud unchanged.
Two sets of presets — pick based on how you're using the splitter.
For agents pointed at the splitter's HTTP endpoint. Requires a cloud backend — the splitter calls it on your behalf.
| Preset | Tactics | Savings | Use case |
|---|---|---|---|
| `proxy/conservative` | T1 | 29-69% | Safest — only routes trivials locally |
| `proxy/recommended` | T1+T2 | 45-79% | Best default — route + compress |
| `proxy/max-savings` | T1+T2+T3 | 43-80% | Adds caching — best for repetitive workloads |
| `proxy/rag-heavy` | T1+T2+T3+T4+T5 | 51% on RAG | Long-context workloads with retrieved chunks |
```shell
cp configs/proxy/recommended.yaml config.yaml
# Edit: set your cloud endpoint + API key env var
```

For Claude Code, Cursor, and other MCP-aware agents. No cloud backend needed — the splitter answers trivials locally and returns compressed prompts for the agent's own model.
| Preset | Tactics | Savings | Use case |
|---|---|---|---|
| `mcp/conservative` | T1 | 29-69% | Safest — complex requests pass through untouched |
| `mcp/recommended` | T1+T2 | 45-79% | Best default — route + compress |
| `mcp/max-savings` | T1+T2+T3 | 43-80% | Adds caching — compounds with query repetition |
| `mcp/rag-heavy` | T1+T2+T3+T5 | 44-51% | Long-context RAG workloads |
```shell
cp configs/mcp/recommended.yaml config.yaml
# No cloud config needed — just Ollama
```

Evaluated with llama3.2:3b (local) and gemma3:4b (cloud), 10 samples per workload, mean of 2 runs:
| Config | WL1 (edit) | WL2 (explain) | WL3 (chat) | WL4 (RAG) | Avg |
|---|---|---|---|---|---|
| conservative (T1) | 29% | 69% | 59% | 38% | 49% |
| recommended (T1+T2) | 45% | 79% | 57% | 44% | 56% |
| max-savings (T1+T2+T3) | 43% | 80% | 60% | 44% | 56% |
| rag-heavy (proxy, +T4+T5) | 29% | 72% | 59% | 51% | 53% |
| rag-heavy (mcp, +T5) | — | — | — | 44-51% | — |
Key observations:
- Start with `recommended` (T1+T2): 56% average savings, works across all workload types.
- `max-savings` adds T3 caching — same average but compounds over repeated queries (support bots, multi-user teams).
- `rag-heavy` proxy wins on WL4 because T4 (draft-review) helps with long outputs. MCP mode skips T4 (the agent is the reviewer).
- `conservative` still saves 49% — use it if quality is the top priority.
- Quality cost: baseline wins ~3x more judge verdicts on explanation-heavy workloads. Acceptable on edit/RAG workloads. See the paper for details.
Config resolution order: explicit `--config` flag > `$LOCAL_SPLITTER_CONFIG` env var > `.local_splitter/config.yaml` > `./config.yaml`.
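The resolution order amounts to "first match wins". A sketch of that precedence as a helper function — the function name and existence checks are hypothetical, not the splitter's actual loader:

```python
import os
from pathlib import Path

def resolve_config(explicit=None):
    """Return the first config path that applies, mirroring the documented
    precedence. Hypothetical helper for illustration only."""
    candidates = [
        explicit,                                 # 1. explicit --config flag
        os.environ.get("LOCAL_SPLITTER_CONFIG"),  # 2. env var
        ".local_splitter/config.yaml",            # 3. project-local default
        "./config.yaml",                          # 4. cwd fallback
    ]
    for candidate in candidates:
        if candidate and Path(candidate).exists():
            return Path(candidate)
    return None
```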
Three ways to use local-splitter, from most transparent to most explicit.
The agent has no idea the splitter exists. Every request is intercepted, tactics run, and the response comes back with fewer cloud tokens spent.
Requires a cloud backend — use a `configs/proxy/` preset.
```shell
# 1. Configure
cp configs/proxy/recommended.yaml config.yaml
# Edit config.yaml: set your cloud endpoint + API key env var

# 2. Start the proxy
uv run local-splitter serve-http --config config.yaml

# 3. Point your agent at it
```

| Agent | Command |
|---|---|
| Claude Code | `ANTHROPIC_BASE_URL=http://127.0.0.1:7788 claude` |
| Cursor / Continue | Set API Base to `http://127.0.0.1:7788/v1` in settings |
| Codex CLI | `OPENAI_API_BASE=http://127.0.0.1:7788/v1 codex` |
| Any OpenAI-compatible | `export OPENAI_API_BASE=http://127.0.0.1:7788/v1` |
The proxy speaks both OpenAI (`/v1/chat/completions`) and Anthropic (`/v1/messages`) formats with streaming support.
The agent registers the splitter as an MCP tool and calls `split.transform` before sending prompts. No cloud backend needed — the agent IS the cloud model.
```shell
# 1. Configure (local model only)
cp configs/mcp/recommended.yaml config.yaml

# 2. Register with Claude Code
```

Add to `~/.claude/settings.json` or your project's `.mcp.json`:
```json
{
  "mcpServers": {
    "local-splitter": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/local-splitter",
               "local-splitter", "serve-mcp", "--config", "config.yaml"]
    }
  }
}
```

The agent can then call these MCP tools:
| Tool | What it does |
|---|---|
| `split.transform` | Run tactics, return local answer or transformed prompt |
| `split.complete` | Full pipeline (auto-detects local-only mode) |
| `split.classify` | T1 classifier only — TRIVIAL or COMPLEX |
| `split.cache_lookup` | Check T3 cache without writing |
| `split.stats` | Aggregate metrics since startup |
| `split.config` | Read-only config view |
How `split.transform` works:

```
Agent calls split.transform(messages=[...])
        │
        ├─ TRIVIAL (T1) → {"action": "answer", "response": "2 + 2 = 4"}
        │                 Agent uses this directly. Zero cloud tokens.
        │
        └─ COMPLEX → {"action": "passthrough", "messages": [...]}
                     Agent sends the (compressed) messages to its own model.
```
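Agent-side handling of the result is a simple branch on the action field. A sketch, assuming the response shapes shown above; `send_to_model` stands in for the agent's own completion call:

```python
def handle_transform(result, send_to_model):
    """Dispatch on split.transform's action field. `send_to_model` is a
    hypothetical stand-in for the agent's own model call."""
    if result["action"] == "answer":
        return result["response"]             # served locally, zero cloud tokens
    return send_to_model(result["messages"])  # compressed passthrough

print(handle_transform({"action": "answer", "response": "2 + 2 = 4"},
                       send_to_model=lambda msgs: "(cloud call)"))  # → 2 + 2 = 4
```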
To make the agent use `split.transform` by default, add to your project's CLAUDE.md:

```
Before processing any user request, call split.transform with the full
message. If action=answer, use that response. If action=passthrough,
use the transformed_messages instead of the original prompt.
```
One-shot CLI command for agent hooks. Reads a prompt, runs tactics, prints JSON to stdout. Bridges local-only mode into a transparent flow.
```shell
# Plain text — answered locally
echo "what is 2+2" | local-splitter transform -c config.yaml
# → {"action": "answer", "response": "2 + 2 = 4", "served_by": "local", ...}

# Complex — passes through
echo "refactor the auth middleware with JWT rotation" | local-splitter transform -c config.yaml
# → {"action": "passthrough", "messages": [...], ...}

# Or with --prompt flag
local-splitter transform -p "explain merge sort" -c config.yaml
```

Claude Code hook setup — add to `~/.claude/settings.json`:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Task|Bash|Edit|Write",
        "hook": "echo \"$PROMPT\" | local-splitter transform -c /path/to/config.yaml"
      }
    ]
  }
}
```

| Endpoint | Format | Streaming |
|---|---|---|
| `POST /v1/chat/completions` | OpenAI | SSE (`stream: true`) |
| `POST /v1/messages` | Anthropic | SSE (`stream: true`) |
| `GET /v1/models` | OpenAI | — |
| `GET /v1/splitter/stats` | JSON | — |
| `GET /healthz` | JSON | — |
Both HTTP surfaces add a `splitter` key to responses with a pipeline trace:
```json
{
  "splitter": {
    "served_by": "local",
    "latency_ms": 42.3,
    "pipeline_trace": [
      { "stage": "t1_classify", "decision": "TRIVIAL", "ms": 12.1 },
      { "stage": "t1_local_answer", "decision": "APPLIED", "ms": 30.2 }
    ],
    "tokens_local": { "input": 15, "output": 8 }
  }
}
```

Override the pipeline per-request:
```python
# Via HTTP extra_body (OpenAI surface)
{"extra_body": {"splitter": {"force_local": True}}}
{"extra_body": {"splitter": {"force_cloud": True}}}

# Via MCP model_hint
{"model_hint": "local"}
{"model_hint": "cloud"}
```

The evaluation harness measures per-tactic and per-combination savings across four workload classes:
| Workload | Description | Trivial% |
|---|---|---|
| WL1 edit-heavy | Refactoring sessions, many file edits | ~25% |
| WL2 explain | "What does X do" onboarding questions | ~45% |
| WL3 chat | General-purpose mixed chat | ~50% |
| WL4 RAG | Long system prompts with retrieved chunks | ~20% |
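The headline savings figure is the fraction of cloud tokens avoided relative to the no-tactics baseline. A sketch of that computation — the harness's real metric lives in evals/metrics.py, so treat this as illustrative:

```python
def token_savings(baseline_cloud_tokens, tactic_cloud_tokens):
    """Fraction of cloud tokens avoided vs. the no-tactics baseline run."""
    if baseline_cloud_tokens == 0:
        return 0.0
    return 1.0 - tactic_cloud_tokens / baseline_cloud_tokens

# e.g. if the baseline spends 1000 cloud tokens and a tactic subset spends 440:
print(f"{token_savings(1000, 440):.0%}")  # → 56%
```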
```shell
# Specific subsets on specific workloads
uv run local-splitter eval \
  -w evals/workloads/wl3_chat.jsonl \
  --config config.yaml \
  --subsets baseline,T1_only,T1_T2_T3

# All subsets on all workloads (full matrix)
uv run local-splitter eval \
  -w evals/workloads/wl1_edit.jsonl \
  -w evals/workloads/wl2_explain.jsonl \
  -w evals/workloads/wl3_chat.jsonl \
  -w evals/workloads/wl4_rag.jsonl \
  --config config.yaml

# Full eval script (produces paper-ready summary)
uv run python evals/run_eval.py

# Run specific subsets only
uv run python evals/run_eval.py T5_only T6_only T7_only

# Include judge-model quality evaluation (pairwise A/B comparison)
uv run python evals/run_eval.py --quality
```

Results land in `.local_splitter/eval/`:
- `results.csv` — one row per (workload × tactic subset)
- `runs.jsonl` — per-sample detail log
- `summary.json` — aggregates for the paper (includes quality verdicts with `--quality`)
Available tactic subsets:

```
baseline, T1_only, T2_only, T3_only, T4_only, T5_only,
T6_only, T7_only, T1_T2, T1_T3, T1_T2_T3, T1_T3_T4,
T1_T2_T3_T6, all
```
```
src/local_splitter/
├── cli.py               # Typer CLI (serve-http, serve-mcp, eval)
├── config.py            # YAML config loader
├── models/              # Backend implementations
│   ├── base.py          # ChatClient protocol + data types
│   ├── ollama.py        # Ollama native API client
│   ├── openai_compat.py # OpenAI-compatible client
│   └── factory.py       # Build client from config
├── pipeline/            # The seven tactics + orchestrator
│   ├── __init__.py      # Pipeline orchestrator
│   ├── types.py         # PipelineRequest/Response, StageEvent
│   ├── route.py         # T1 — local classifier
│   ├── compress.py      # T2 — prompt compression
│   ├── sem_cache.py     # T3 — semantic cache (sqlite-vec)
│   ├── draft.py         # T4 — draft + review
│   ├── diff.py          # T5 — minimal diff extraction
│   ├── intent.py        # T6 — intent extraction
│   └── batch.py         # T7 — prompt-cache tagging
├── transport/           # External interfaces
│   ├── http_proxy.py    # FastAPI OpenAI-compat proxy
│   └── mcp_server.py    # FastMCP stdio server
└── evals/               # Evaluation harness
    ├── types.py         # WorkloadSample, SampleResult, RunResult
    ├── runner.py        # Matrix runner + tactic subsets
    ├── metrics.py       # Token savings, cost, routing accuracy
    ├── quality.py       # Judge-model pairwise quality evaluation
    └── report.py        # CSV + markdown export

evals/workloads/         # Evaluation datasets (JSONL)
tests/                   # 175 tests
```
```shell
uv sync                 # install deps
uv run pytest -q        # run tests (175 tests, <1s)
uv run ruff check src/  # lint
```

To add a new tactic:

- Create `src/local_splitter/pipeline/<name>.py` with an `apply()` function
- Wire it into `Pipeline.complete()` in `pipeline/__init__.py`
- Add the config flag to `TacticsConfig` in `config.py`
- Add tests in `tests/test_pipeline_<name>.py`
- Add eval subsets in `evals/runner.py`
Every tactic must:
- Emit a `StageEvent` for observability
- Fail open on errors (pass request through unchanged)
- Use `temperature=0` for deterministic classifier/extractor calls
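A new tactic's `apply()` typically follows the same fail-open shape. A hypothetical skeleton — the real `StageEvent` and request types live in `pipeline/types.py`, and the simplified stand-ins below are illustrative only:

```python
import time
from dataclasses import dataclass, field

@dataclass
class StageEvent:          # simplified stand-in for pipeline/types.py
    stage: str
    decision: str
    ms: float

@dataclass
class Request:             # simplified stand-in for PipelineRequest
    messages: list
    events: list = field(default_factory=list)

def transform_messages(messages):
    # Tactic-specific work would go here; simulate a local-model failure.
    raise RuntimeError("local model unreachable")

def apply(request):
    """Hypothetical tactic: always emits a StageEvent, fails open on errors."""
    start = time.perf_counter()
    try:
        messages, decision = transform_messages(request.messages), "APPLIED"
    except Exception:
        messages, decision = request.messages, "SKIPPED"  # pass through unchanged
    request.events.append(
        StageEvent("t8_example", decision, (time.perf_counter() - start) * 1000))
    request.messages = messages
    return request

req = apply(Request(messages=["hello"]))
print(req.messages, req.events[0].decision)  # → ['hello'] SKIPPED
```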
If you use local-splitter in your research, please cite:
```bibtex
@article{agyemang2026localsplitter,
  title   = {Local-Splitter: A Measurement Study of Seven Tactics for
             Reducing Cloud LLM Token Usage on Coding-Agent Workloads},
  author  = {Owusu Agyemang, Justice and Kponyo, Jerry John and
             Amponsah, Elliot and Addo Boakye, Godfred Manu and
             Obour Agyekum, Kwame Opuni-Boachie},
  journal = {arXiv preprint arXiv:2604.12301},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.12301}
}
```

MIT
- Python scaffold (uv, package skeleton, 175 tests)
- Model backends (Ollama + OpenAI-compatible)
- Transport layer (MCP stdio + HTTP proxy: OpenAI + Anthropic)
- Streaming support (SSE, both API surfaces)
- All seven tactics implemented and tested
- Evaluation harness (runner, metrics, quality judge, CSV/markdown)
- CLI eval command + seed workloads (4 classes, 40 samples)
- Evaluation with real numbers (Ollama llama3.2:3b + gemma3)
- Paper with results (tables, figures, quality evaluation)