Skip to content

nikitaclicks/context-cutter

ContextCutter

Tests codecov

Stop feeding entire API responses to your LLM. Give it a handle instead.

When an agent calls a REST API, the full JSON response lands in the context window — even if the agent only needs one field. On a 500-item list, that's 97 KB of tokens consumed to read two values. ContextCutter intercepts those responses before they reach the model, stores them in a fast in-memory store, and returns a compact structural summary (a teaser) plus a deterministic handle ID. The agent then queries only the fields it actually needs.

The result: 86–99% fewer tokens spent on API responses in typical agent workflows.

How it works

┌─────────┐   fetch_json_cutted(url)  ┌──────────────────┐   HTTP GET   ┌─────────────┐
│  Agent  │ ────────────────────────► │  ContextCutter   │ ───────────► │  Remote API │
│  (LLM)  │                           │   MCP Server     │ ◄─────────── │             │
│         │ ◄──────────────────────── │  (Rust binary)   │   JSON blob  └─────────────┘
│         │   { handle_id, teaser }   │                  │
│         │                           │  DashMap store   │
│         │   query_handle(id, path)  │  (in-memory)     │
│         │ ────────────────────────► │                  │
│         │ ◄──────────────────────── │                  │
└─────────┘   "$.users[0].email"      └──────────────────┘
              → "alice@example.com"

Step 1 — fetch: The agent calls fetch_json_cutted(url). The server fetches the URL, stores the full JSON payload, and responds with a teaser (structural summary) and a handle_id.

Step 2 — query: The agent inspects the teaser to understand the shape of the data, then calls query_handle(handle_id, "$.path.to.field") to retrieve only what it needs.

The full payload never enters the context window.

Token savings

Measured against realistic API response shapes:

Response type Full payload Teaser returned Tokens saved
10-item paginated list 2,005 chars 287 chars 86%
50-item repo listing 11,576 chars 268 chars 98%
100-item event stream 21,005 chars 283 chars 99%
500-item batch export 97,465 chars 261 chars 100%
Deep nested config blob 19,943 chars 341 chars 98%

Teaser size stays roughly constant (~250–350 chars) regardless of payload size, because it describes structure, not values.

Quickstart

The fastest way to try ContextCutter is with npx — no install required:

npx context-cutter-mcp

Add it to your agent client in under a minute:

OpenCode (~/.config/opencode/config.json):

{
  "mcp": {
    "context-cutter": {
      "type": "local",
      "command": "npx",
      "args": ["-y", "context-cutter-mcp"]
    }
  }
}

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "context-cutter": {
      "command": "npx",
      "args": ["-y", "context-cutter-mcp"]
    }
  }
}

Once connected, ContextCutter registers two tools with your agent automatically. No prompting or configuration needed — the server describes itself via MCP.

See examples/ for Cursor, VS Code, OpenAI Agents SDK, and LangChain configs.

MCP tool reference

fetch_json_cutted

Fetches a URL, stores the JSON response, and returns a structural teaser.

Parameter Type Default Description
url string HTTPS URL to fetch (required)
method string GET HTTP method
headers object {} Additional request headers
body any Request body (serialized as JSON)
timeout_seconds number 45 Request timeout

Returns: { handle_id: "hdl_<12hex>", teaser: { ... } }

query_handle

Runs a JSONPath expression against a previously stored payload.

Parameter Type Description
handle_id string Handle returned by fetch_json_cutted
json_path string JSONPath expression (e.g. $.users[0].email)

Returns: The matched value(s) as JSON.

Handle IDs are deterministic (SHA-256 of canonicalized JSON) — the same payload always produces the same hdl_<12hex>, making repeated fetches idempotent.

Install

Binary (recommended for production)

Download the pre-built binary for your platform from Releases and place it on PATH:

Platform Binary name
Linux x86_64 context-cutter-mcp-x86_64-linux-gnu
macOS Intel context-cutter-mcp-x86_64-apple-darwin
macOS Apple Silicon context-cutter-mcp-aarch64-apple-darwin
Windows x86_64 context-cutter-mcp-x86_64-pc-windows-msvc.exe

Then point your client at the binary directly instead of using npx.

npx (zero-install)

npx context-cutter-mcp

Downloads the matching GitHub Release binary on first run. Suitable for development and CI.

npm (global install)

npm install -g context-cutter-mcp
context-cutter-mcp

Docker

docker run --rm -i ghcr.io/nikitaclicks/context-cutter-mcp:latest

Build from source

Requires Rust 1.77+:

cargo build --release --bin context-cutter-mcp
./target/release/context-cutter-mcp

Python SDK (optional)

For embedding ContextCutter directly in a Python agent without running a separate process:

pip install context-cutter
from context_cutter import store_response, generate_teaser, query_handle

handle = store_response(api_response_dict)
teaser = generate_teaser(handle)   # compact summary for the model
value  = query_handle(handle, "$.users[0].email")

The @lazy_handle decorator wraps any function that returns JSON:

from context_cutter import lazy_handle

@lazy_handle
def get_users() -> dict:
    return requests.get("https://api.example.com/users").json()

result = get_users()
# result = {"handle_id": "hdl_...", "teaser": {...}}

See CONTRIBUTING.md for full Python SDK documentation.

Configuration

Environment variables for the MCP server:

Variable Default Description
CONTEXT_CUTTER_MAX_HANDLES 1000 Max payloads held in the LRU store
CONTEXT_CUTTER_TTL_SECS 3600 Seconds before a handle expires
CONTEXT_CUTTER_MAX_PAYLOAD_BYTES 10485760 Max accepted response size (10 MB)
CONTEXT_CUTTER_LOG_FORMAT plain plain or json structured logs
RUST_LOG info Tracing filter (e.g. debug, trace)

Security

  • HTTPS-only URL fetching (SSRF hardening — http:// is rejected)
  • Null-byte rejection on all string inputs
  • JSONPath expressions capped at 4096 characters
  • Payload size enforced before storing (MAX_PAYLOAD_BYTES)
  • No credentials stored — headers are not persisted with payloads

Performance

Operation latencies (median, on commodity hardware):

Operation Median latency
generate_teaser (medium payload) 35 µs
store_response (small payload) 64 µs
query_handle (wildcard path) 94 µs

Throughput: ~10,000–27,000 operations/second per operation type.

Prior art & related work

The problem of tool-result context bloat is well-recognized across the AI engineering community and is being addressed from several directions. The table below situates ContextCutter among the most relevant approaches at the mechanism level.

Comparison with Anthropic's built-in mitigations

Approach Who executes filtering Model must write code? Requires sandbox? Scope
ContextCutter Rust MCP proxy — intercepts before the model sees anything No No Any HTTPS JSON API
Programmatic Tool Calling (Nov 2025) Model writes Python; runs in Anthropic's Code Execution sandbox Yes Yes Any tool registered with allowed_callers
Web Search Dynamic Filtering (Feb 2026) Model writes Python; runs in Anthropic's Code Execution sandbox Yes Yes Web search / web fetch tools only
Tool Search Tool (Nov 2025) Host-side deferred loading No No Tool schema definitions — a different problem

Programmatic Tool Calling and Dynamic Filtering pursue the same goal — keeping intermediate data out of the context window — by letting the model generate filtering code executed in a sandboxed environment. Anthropic reports a 37% token reduction (PTC on complex research tasks) and a 24% token reduction with 11% accuracy improvement (Dynamic Filtering on web search benchmarks). ContextCutter achieves 86–99% savings by intercepting at the transport layer before any model inference, with no code generation or sandbox dependency.

The Tool Search Tool addresses a complementary but distinct problem: schema-level bloat from large tool libraries (one measured case: 106 MySQL tools → 54,600 tokens of schema before a single query [Layered.dev, 2026]). ContextCutter and Tool Search Tool can be used together.

Research context

  • SUPO — Summarization-augmented Policy Optimization (ICLR 2026, under review): trains LLM agents via RL to periodically compress tool-use history with LLM-generated summaries, enabling long-horizon tasks beyond a fixed context limit. Related problem (context overflow from sequential tool results) but a learned, fine-tuning-based approach rather than a deterministic proxy. [arXiv preprint]

  • NormCode (arXiv 2512.10563, Dec 2025): a semi-formal language for context-isolated AI planning where each step receives only explicitly-passed inputs, eliminating cross-step contamination by construction. Operates at the workflow-language level rather than the transport layer. [arXiv]

  • Unified Tool Integration for LLMs (arXiv 2508.02979, Aug 2025): a protocol-agnostic function-calling framework with automated schema generation and dual-mode concurrent execution, reporting 60–80% code reduction across integration scenarios. [arXiv]

Development

# Rust
cargo test
cargo clippy -- -D warnings
cargo fmt --check

# Python SDK
pip install -e ".[dev]"
maturin develop --features python
pytest -m "not ai_e2e_live"

# Benchmarks
pytest -m benchmark --benchmark-json benchmark.json

See CONTRIBUTING.md for the full contributor workflow and architecture notes.

Project layout

src/
  engine.rs        Pure Rust: handle ID, store, teaser, JSONPath query
  store.rs         Bounded in-memory store (TTL + LRU eviction)
  parser.rs        Teaser generation and JSONPath helpers
  lib.rs           Optional PyO3 bindings (--features python)
  bin/mcp.rs       MCP stdio server binary
python/context_cutter/
  core.py          store_response, generate_teaser, query_path
  interceptor.py   @lazy_handle decorator
  store.py         BaseStore, InMemoryStore, RedisStore
  tools.py         generate_tool_manifest (OpenAI-style schemas)
examples/
  opencode.md      Full OpenCode walkthrough with session transcript
  claude-desktop.md  Claude Desktop showcase
  openai-agents-sdk.py
  langchain_mcp.md

License

MIT. See LICENSE.

About

Lightweight, high-performance middleware for token-efficient Tool Calling. Shrink your API overhead by 10x with Lazy Handles.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors