Stop feeding entire API responses to your LLM. Give it a handle instead.
When an agent calls a REST API, the full JSON response lands in the context window, even if the agent only needs one field. On a 500-item list, that can mean roughly 97 KB of JSON spent as tokens just to read two values. ContextCutter intercepts those responses before they reach the model, stores them in a fast in-memory store, and returns a compact structural summary (a teaser) plus a deterministic handle ID. The agent then queries only the fields it actually needs.
The result: 86–99% fewer tokens spent on API responses in typical agent workflows.
┌─────────┐ fetch_json_cutted(url) ┌──────────────────┐ HTTP GET ┌─────────────┐
│ Agent │ ────────────────────────► │ ContextCutter │ ───────────► │ Remote API │
│ (LLM) │ │ MCP Server │ ◄─────────── │ │
│ │ ◄──────────────────────── │ (Rust binary) │ JSON blob └─────────────┘
│ │ { handle_id, teaser } │ │
│ │ │ DashMap store │
│ │ query_handle(id, path) │ (in-memory) │
│ │ ────────────────────────► │ │
│ │ ◄──────────────────────── │ │
└─────────┘ "$.users[0].email" └──────────────────┘
→ "alice@example.com"
Step 1 — fetch: The agent calls fetch_json_cutted(url). The server fetches the URL, stores the full JSON payload, and responds with a teaser (structural summary) and a handle_id.
Step 2 — query: The agent inspects the teaser to understand the shape of the data, then calls query_handle(handle_id, "$.path.to.field") to retrieve only what it needs.
The full payload never enters the context window.
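At the protocol level, the two steps are ordinary MCP tool calls. The sketch below builds the JSON-RPC `tools/call` requests a client would send; the tool and argument names come from this README, while the URL, the request ids, the example handle value, and the `tool_call` helper are placeholders for illustration.

```python
import json

def tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message (hypothetical helper)."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)

# Step 1: fetch. The server stores the payload and answers with a handle and a teaser.
fetch_request = tool_call(1, "fetch_json_cutted", {"url": "https://api.example.com/users"})

# Step 2: query. Only the requested field comes back into the context window.
# "hdl_0123456789ab" is a placeholder; a real handle comes from the step 1 response.
query_request = tool_call(
    2,
    "query_handle",
    {"handle_id": "hdl_0123456789ab", "json_path": "$.users[0].email"},
)

print(fetch_request)
print(query_request)
```

In practice the agent client builds these messages itself; the point is only that each step is a small request with a small response.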
Measured against realistic API response shapes:
| Response type | Full payload | Teaser returned | Tokens saved |
|---|---|---|---|
| 10-item paginated list | 2,005 chars | 287 chars | 86% |
| 50-item repo listing | 11,576 chars | 268 chars | 98% |
| 100-item event stream | 21,005 chars | 283 chars | 99% |
| 500-item batch export | 97,465 chars | 261 chars | >99% |
| Deep nested config blob | 19,943 chars | 341 chars | 98% |
Teaser size stays roughly constant (~250–350 chars) regardless of payload size, because it describes structure, not values.
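To see why a structural summary barely grows with the payload, here is a minimal sketch of a teaser generator. It is not ContextCutter's actual teaser format: the `sketch_teaser` function, the `_type`/`_length`/`_item` keys, and the synthetic payload are all assumptions made for illustration. Character counts are used as a rough proxy for tokens.

```python
import json

def sketch_teaser(value, depth=0, max_depth=3):
    """Summarize the shape of a JSON value: keys and types, never the values themselves."""
    if isinstance(value, dict):
        if depth >= max_depth:
            return "object"
        return {key: sketch_teaser(child, depth + 1, max_depth) for key, child in value.items()}
    if isinstance(value, list):
        # Describe only the first element and record the length,
        # so 10 items and 500 items cost about the same to summarize.
        item_shape = sketch_teaser(value[0], depth + 1, max_depth) if value else None
        return {"_type": "array", "_length": len(value), "_item": item_shape}
    if isinstance(value, bool):  # checked before int, because bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if value is None:
        return "null"
    return "string"

def make_payload(count):
    """Synthetic list response standing in for a real API."""
    users = [{"id": i, "email": f"user{i}@example.com", "active": i % 2 == 0} for i in range(count)]
    return {"users": users, "page": 1, "total": count}

for count in (10, 500):
    payload = make_payload(count)
    full_chars = len(json.dumps(payload))
    teaser_chars = len(json.dumps(sketch_teaser(payload)))
    saved = 100 * (1 - teaser_chars / full_chars)
    print(f"{count} items: full={full_chars} chars, teaser={teaser_chars} chars, saved={saved:.1f}%")
```

The only part of this toy teaser that changes between the two runs is the recorded array length, which is why the savings percentage climbs as the payload grows.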
The fastest way to try ContextCutter is with npx — no install required:
npx context-cutter-mcp

Add it to your agent client in under a minute:
OpenCode (~/.config/opencode/config.json):
{
"mcp": {
"context-cutter": {
"type": "local",
"command": "npx",
"args": ["-y", "context-cutter-mcp"]
}
}
}

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"context-cutter": {
"command": "npx",
"args": ["-y", "context-cutter-mcp"]
}
}
}

Once connected, ContextCutter registers two tools with your agent automatically. No prompting or configuration is needed: the server describes itself via MCP.
See examples/ for Cursor, VS Code, OpenAI Agents SDK, and LangChain configs.
fetch_json_cutted fetches a URL, stores the JSON response, and returns a structural teaser.
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | none | HTTPS URL to fetch (required) |
| method | string | GET | HTTP method |
| headers | object | {} | Additional request headers |
| body | any | none | Request body (serialized as JSON) |
| timeout_seconds | number | 45 | Request timeout in seconds |
Returns: { handle_id: "hdl_<12hex>", teaser: { ... } }
query_handle runs a JSONPath expression against a previously stored payload.
| Parameter | Type | Description |
|---|---|---|
| handle_id | string | Handle returned by fetch_json_cutted |
| json_path | string | JSONPath expression (e.g. $.users[0].email) |
Returns: The matched value(s) as JSON.
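For intuition about what a path query does against a stored payload, the following sketch evaluates a small JSONPath subset (`$`, `.name`, `[index]`, `[*]`). It is an illustration only: `query_path_sketch` is a hypothetical name, and the real server presumably supports far more of JSONPath than this.

```python
import re

# One path step: .name, [index], or [*]
_TOKEN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]|\[(\*)\]")

def query_path_sketch(document, json_path):
    """Evaluate a small JSONPath subset and return the list of matched values."""
    if not json_path.startswith("$"):
        raise ValueError("path must start with '$'")
    matches = [document]
    position = 1
    while position < len(json_path):
        token = _TOKEN.match(json_path, position)
        if token is None:
            raise ValueError(f"unsupported syntax at offset {position}")
        name, index, wildcard = token.groups()
        next_matches = []
        for node in matches:
            if name is not None and isinstance(node, dict) and name in node:
                next_matches.append(node[name])
            elif index is not None and isinstance(node, list) and int(index) < len(node):
                next_matches.append(node[int(index)])
            elif wildcard is not None and isinstance(node, list):
                next_matches.extend(node)
        matches = next_matches
        position = token.end()
    return matches

stored = {"users": [{"email": "alice@example.com"}, {"email": "bob@example.com"}]}
print(query_path_sketch(stored, "$.users[0].email"))  # ['alice@example.com']
print(query_path_sketch(stored, "$.users[*].email"))  # ['alice@example.com', 'bob@example.com']
```

A path that matches nothing yields an empty list in this sketch; how the real tool reports a miss is not specified here.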
Handle IDs are deterministic (SHA-256 of canonicalized JSON) — the same payload always produces the same hdl_<12hex>, making repeated fetches idempotent.
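A content-derived handle of this kind can be sketched in a few lines. Two details are assumptions rather than documented behavior: that canonicalization means sorted keys with compact separators, and that the 12 hex characters are the leading characters of the digest. The function name `handle_id_sketch` is hypothetical.

```python
import hashlib
import json

def handle_id_sketch(payload) -> str:
    """Derive a content-addressed handle: SHA-256 over a canonical JSON encoding."""
    # Assumption: "canonical" means sorted keys and no insignificant whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Assumption: the handle uses the first 12 hex characters of the digest.
    return "hdl_" + digest[:12]

a = handle_id_sketch({"name": "alice", "roles": ["admin", "dev"]})
b = handle_id_sketch({"roles": ["admin", "dev"], "name": "alice"})
print(a == b)  # True: key order does not change the canonical form
```

Because the ID depends only on content, fetching the same response twice maps to the same store entry instead of creating a duplicate.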
Download the pre-built binary for your platform from Releases and place it on PATH:
| Platform | Binary name |
|---|---|
| Linux x86_64 | context-cutter-mcp-x86_64-linux-gnu |
| macOS Intel | context-cutter-mcp-x86_64-apple-darwin |
| macOS Apple Silicon | context-cutter-mcp-aarch64-apple-darwin |
| Windows x86_64 | context-cutter-mcp-x86_64-pc-windows-msvc.exe |
Then point your client at the binary directly instead of using npx.
npx context-cutter-mcp

Downloads the matching GitHub Release binary on first run. Suitable for development and CI.
npm install -g context-cutter-mcp
context-cutter-mcp

docker run --rm -i ghcr.io/nikitaclicks/context-cutter-mcp:latest

Requires Rust 1.77+:
cargo build --release --bin context-cutter-mcp
./target/release/context-cutter-mcp

For embedding ContextCutter directly in a Python agent without running a separate process:
pip install context-cutter

from context_cutter import store_response, generate_teaser, query_handle
handle = store_response(api_response_dict)
teaser = generate_teaser(handle) # compact summary for the model
value = query_handle(handle, "$.users[0].email")

The @lazy_handle decorator wraps any function that returns JSON:
import requests

from context_cutter import lazy_handle

@lazy_handle
def get_users() -> dict:
    # The decorator stores the returned JSON and hands back a handle plus a teaser.
    return requests.get("https://api.example.com/users").json()

result = get_users()
# result = {"handle_id": "hdl_...", "teaser": {...}}

See CONTRIBUTING.md for full Python SDK documentation.
Environment variables for the MCP server:
| Variable | Default | Description |
|---|---|---|
| CONTEXT_CUTTER_MAX_HANDLES | 1000 | Max payloads held in the LRU store |
| CONTEXT_CUTTER_TTL_SECS | 3600 | Seconds before a handle expires |
| CONTEXT_CUTTER_MAX_PAYLOAD_BYTES | 10485760 | Max accepted response size (10 MB) |
| CONTEXT_CUTTER_LOG_FORMAT | plain | plain or json structured logs |
| RUST_LOG | info | Tracing filter (e.g. debug, trace) |
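The first two settings describe a store that is bounded both by count (LRU eviction) and by age (TTL). A minimal sketch of that combination follows, assuming nothing about the Rust implementation: `BoundedStoreSketch` and its method names are hypothetical, and the defaults simply mirror the documented values.

```python
import time
from collections import OrderedDict

class BoundedStoreSketch:
    """In-memory store with a capacity limit (LRU eviction) and a per-entry TTL."""

    def __init__(self, max_handles=1000, ttl_secs=3600.0, clock=time.monotonic):
        self.max_handles = max_handles
        self.ttl_secs = ttl_secs
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # handle_id -> (stored_at, payload)

    def put(self, handle_id, payload):
        self._entries[handle_id] = (self.clock(), payload)
        self._entries.move_to_end(handle_id)
        while len(self._entries) > self.max_handles:
            self._entries.popitem(last=False)  # drop the least recently used entry

    def get(self, handle_id):
        entry = self._entries.get(handle_id)
        if entry is None:
            return None
        stored_at, payload = entry
        if self.clock() - stored_at >= self.ttl_secs:
            del self._entries[handle_id]  # expired
            return None
        self._entries.move_to_end(handle_id)  # mark as recently used
        return payload

store = BoundedStoreSketch(max_handles=2, ttl_secs=60)
store.put("hdl_a", {"n": 1})
store.put("hdl_b", {"n": 2})
store.get("hdl_a")            # touch hdl_a so hdl_b becomes the eviction candidate
store.put("hdl_c", {"n": 3})  # capacity exceeded, so hdl_b is evicted
print(store.get("hdl_b"))     # None
```

An agent that queries an evicted or expired handle would need to fetch again, so these limits trade memory for the chance of a repeat request.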
- HTTPS-only URL fetching (SSRF hardening; http:// is rejected)
- Null-byte rejection on all string inputs
- JSONPath expressions capped at 4096 characters
- Payload size enforced before storing (MAX_PAYLOAD_BYTES)
- No credentials stored: headers are not persisted with payloads
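The input checks listed above are simple to express. The sketch below shows one way they could look, using the limits stated in this README; the function names and error messages are hypothetical and do not reflect the server's actual code.

```python
from urllib.parse import urlsplit

MAX_JSON_PATH_CHARS = 4096            # cap stated in the list above
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # documented default (10 MB)

def validate_url(url: str) -> str:
    """Accept only https:// URLs with a host and no null bytes."""
    if "\x00" in url:
        raise ValueError("null byte in URL")
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("only https:// URLs are accepted")
    return url

def validate_json_path(json_path: str) -> str:
    """Reject null bytes and overlong expressions before any parsing happens."""
    if "\x00" in json_path:
        raise ValueError("null byte in JSONPath")
    if len(json_path) > MAX_JSON_PATH_CHARS:
        raise ValueError("JSONPath too long")
    return json_path

def validate_payload_size(body: bytes) -> bytes:
    """Enforce the size limit before the payload is stored."""
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds size limit")
    return body

for candidate in ("https://api.example.com/users", "http://api.example.com/users"):
    try:
        validate_url(candidate)
        print(candidate, "accepted")
    except ValueError as error:
        print(candidate, "rejected:", error)
```

Running the checks before any network or parsing work keeps malformed input from reaching the more complex code paths.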
Operation latencies (median, on commodity hardware):
| Operation | Median latency |
|---|---|
| generate_teaser (medium payload) | 35 µs |
| store_response (small payload) | 64 µs |
| query_handle (wildcard path) | 94 µs |
Throughput: ~10,000–27,000 operations/second per operation type.
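Median latency and single-threaded throughput are roughly reciprocal, which is one way to sanity-check figures like these. The sketch below converts the tabulated medians to implied operations per second and includes a small timing harness; both helper names are hypothetical, and the timed operation is a stand-in rather than a ContextCutter call.

```python
import statistics
import time

def median_latency_seconds(operation, repeats=1000):
    """Time `operation` repeatedly and return the median wall-clock duration in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def ops_per_second(latency_seconds):
    """Single-threaded throughput implied by a per-operation latency."""
    return 1.0 / latency_seconds

# The medians from the table, converted to implied single-threaded throughput.
for name, micros in (("generate_teaser", 35), ("store_response", 64), ("query_handle", 94)):
    print(f"{name}: ~{ops_per_second(micros * 1e-6):,.0f} ops/s")

# Measuring a stand-in operation; swap in a real call to benchmark it.
print(median_latency_seconds(lambda: sorted(range(100))))
```

The implied values land in the same order of magnitude as the quoted throughput range; measured throughput also includes per-call overhead that a simple reciprocal ignores.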
The problem of tool-result context bloat is well-recognized across the AI engineering community and is being addressed from several directions. The table below situates ContextCutter among the most relevant approaches at the mechanism level.
| Approach | Who executes filtering | Model must write code? | Requires sandbox? | Scope |
|---|---|---|---|---|
| ContextCutter | Rust MCP proxy — intercepts before the model sees anything | No | No | Any HTTPS JSON API |
| Programmatic Tool Calling (Nov 2025) | Model writes Python; runs in Anthropic's Code Execution sandbox | Yes | Yes | Any tool registered with allowed_callers |
| Web Search Dynamic Filtering (Feb 2026) | Model writes Python; runs in Anthropic's Code Execution sandbox | Yes | Yes | Web search / web fetch tools only |
| Tool Search Tool (Nov 2025) | Host-side deferred loading | No | No | Tool schema definitions — a different problem |
Programmatic Tool Calling and Dynamic Filtering pursue the same goal — keeping intermediate data out of the context window — by letting the model generate filtering code executed in a sandboxed environment. Anthropic reports a 37% token reduction (PTC on complex research tasks) and a 24% token reduction with 11% accuracy improvement (Dynamic Filtering on web search benchmarks). ContextCutter achieves 86–99% savings by intercepting at the transport layer before any model inference, with no code generation or sandbox dependency.
The Tool Search Tool addresses a complementary but distinct problem: schema-level bloat from large tool libraries (one measured case: 106 MySQL tools → 54,600 tokens of schema before a single query [Layered.dev, 2026]). ContextCutter and Tool Search Tool can be used together.
- SUPO, Summarization-augmented Policy Optimization (ICLR 2026, under review): trains LLM agents via RL to periodically compress tool-use history with LLM-generated summaries, enabling long-horizon tasks beyond a fixed context limit. Related problem (context overflow from sequential tool results) but a learned, fine-tuning-based approach rather than a deterministic proxy. [arXiv preprint]
- NormCode (arXiv 2512.10563, Dec 2025): a semi-formal language for context-isolated AI planning where each step receives only explicitly-passed inputs, eliminating cross-step contamination by construction. Operates at the workflow-language level rather than the transport layer. [arXiv]
- Unified Tool Integration for LLMs (arXiv 2508.02979, Aug 2025): a protocol-agnostic function-calling framework with automated schema generation and dual-mode concurrent execution, reporting 60–80% code reduction across integration scenarios. [arXiv]
# Rust
cargo test
cargo clippy -- -D warnings
cargo fmt --check
# Python SDK
pip install -e ".[dev]"
maturin develop --features python
pytest -m "not ai_e2e_live"
# Benchmarks
pytest -m benchmark --benchmark-json benchmark.json

See CONTRIBUTING.md for the full contributor workflow and architecture notes.
src/
engine.rs Pure Rust: handle ID, store, teaser, JSONPath query
store.rs Bounded in-memory store (TTL + LRU eviction)
parser.rs Teaser generation and JSONPath helpers
lib.rs Optional PyO3 bindings (--features python)
bin/mcp.rs MCP stdio server binary
python/context_cutter/
core.py store_response, generate_teaser, query_path
interceptor.py @lazy_handle decorator
store.py BaseStore, InMemoryStore, RedisStore
tools.py generate_tool_manifest (OpenAI-style schemas)
examples/
opencode.md Full OpenCode walkthrough with session transcript
claude-desktop.md Claude Desktop showcase
openai-agents-sdk.py
langchain_mcp.md
MIT. See LICENSE.