Sandboxing an LLM agent's tool authority with Capa's type system #5
nelsonduarte started this conversation in Show and tell
Just landed a worked example in v0.8.2-beta: a Capa program that runs a real Anthropic-model agent loop whose declared capabilities provably exclude `Net`, `Fs`, `Env`, and `Unsafe`, even though a real model is in the loop deciding which tools to call.

The pattern is three pieces.
First, each tool is a user-defined capability:
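As a loose analogy in Python (not Capa syntax, and without Capa's static guarantees), a tool-as-capability can be pictured as a distinct type whose value must be held in order to use the tool. The class names `SearchWeb` and `RunCode` mirror names used in this post; the methods, stub bodies, and the `answer_with_search` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchWeb:
    """Capability value: holding one is what grants authority to search."""

    def query(self, text: str) -> str:
        # Stub result; a real tool would perform the search here.
        return f"results for {text!r}"


@dataclass(frozen=True)
class RunCode:
    """A second capability that some other program might hold."""

    def run(self, source: str) -> str:
        # Stub result; a real tool would execute the code here.
        return f"ran {len(source)} bytes"


def answer_with_search(search: SearchWeb, question: str) -> str:
    # The signature doubles as the declaration of authority: this function
    # can search because it was handed a SearchWeb, and nothing else.
    return search.query(question)


print(answer_with_search(SearchWeb(), "capability security"))
```

Python only honours this by convention, since any code can construct a `RunCode()`. The point of the post is that Capa checks the equivalent property in the type system.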
Second, the agent loop only takes the tools it is allowed to call:
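A minimal Python sketch of that shape, again an analogy and not Capa code: the loop builds its string-keyed dispatch table only from the capability values passed in, so a request for any other tool has nothing to resolve to. The tool names `search_web` and `run_code`, the `query` method, and the list standing in for model output are all assumptions. The loop in the post also holds `Stdio` and `LlmClient`, which are omitted here to keep the sketch self-contained.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class SearchWeb:
    """Hypothetical capability value for the one tool this agent may call."""

    def query(self, text: str) -> str:
        return f"results for {text!r}"  # stub, no real network access


def agent_loop(search: SearchWeb, tool_calls: Iterable[Tuple[str, str]]) -> List[str]:
    # The dispatch table is built only from the capabilities passed in.
    # No RunCode value exists in this scope, so there is nothing to dispatch to.
    tools: Dict[str, Callable[[str], str]] = {"search_web": search.query}
    transcript = []
    for name, argument in tool_calls:
        handler = tools.get(name)
        if handler is None:
            # The model asked for a tool this loop was never given.
            transcript.append(f"refused: {name}")
        else:
            transcript.append(handler(argument))
    return transcript


# Stand-in for model output: one allowed call and one disallowed call.
requested = [("search_web", "capa language"), ("run_code", "import os")]
print(agent_loop(SearchWeb(), requested))
```

The refusal branch here is a runtime check, which is weaker than what the post describes: in Capa the absence of the capability is established from the function's declared parameters before the program runs.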
Third, the manifest emits the bound.
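One way to read such a bound is as a set difference: collect every capability reachable through the call graph, then subtract that from the set of all known capabilities. The Python sketch below illustrates that reading only. The field names `declared_capabilities` and `provably_excluded_capabilities` come from this post, while the call graph, the `dispatch_tool` function, and the capability universe are hypothetical, and Capa's actual analysis may work differently.

```python
from typing import Dict, List, Set

# Hypothetical universe of capabilities known to the toolchain.
ALL_CAPABILITIES: Set[str] = {
    "Stdio", "LlmClient", "SearchWeb", "RunCode", "Net", "Fs", "Env", "Unsafe",
}

# Hypothetical call graph: function name -> (own capabilities, callees).
CALL_GRAPH = {
    "agent_loop": ({"Stdio", "LlmClient"}, ["dispatch_tool"]),
    "dispatch_tool": ({"SearchWeb"}, []),
}


def reachable_capabilities(function: str) -> Set[str]:
    """Union of capabilities over everything reachable from `function`."""
    seen, stack, caps = set(), [function], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        own, callees = CALL_GRAPH[current]
        caps |= own
        stack.extend(callees)
    return caps


def manifest_entry(function: str) -> Dict[str, List[str]]:
    declared = reachable_capabilities(function)
    return {
        "declared_capabilities": sorted(declared),
        "provably_excluded_capabilities": sorted(ALL_CAPABILITIES - declared),
    }


print(manifest_entry("agent_loop"))
```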
`agent_loop`'s `declared_capabilities` is `[Stdio, LlmClient, SearchWeb]`, and the `provably_excluded_capabilities` field lists `Net`, `Fs`, `Env`, `Unsafe`, and every other built-in or user-defined capability the call graph does not reach.

Compared to allow-list approaches: allow-lists at the model boundary are advisory (the model may still be tricked into formatting a disallowed tool call). The Capa version is structural (the runtime simply has no `RunCode` value to invoke).

Code

- `examples/llm_tool_sandbox.capa`
- `examples/llm_agent_runner.capa`
- `examples/llm_anthropic_real.capa`
- `examples/llm_anthropic_agent.capa`

Writeup with the comparison vs allow-lists and OS-level sandboxes: `docs/llm-tool-sandbox.md`.

Honest limits

The discipline addresses the authority axis, not the content axis. A `SearchWeb`-only agent can still be prompted into leaking conversation history through `search.query`. The mitigation for that is orthogonal (redaction, output filtering, user-confirmation pattern). What Capa enforces is the upper bound on the set of side-channels the agent can use at all, auditable from the SBOM rather than from a runtime sandbox's config.

Question for the room
Does the pattern generalise to other agent frameworks you've used? Anything that breaks if you swap models, or if the dispatch shape is not string-keyed?