A Robot Framework library for testing MCP servers, Agent Skills, Hooks, SubAgents, and coding-agent CLIs — provider-agnostic via LiteLLM, BFCL-grade tool-call matching, and statistical gating with N≥10 runs by default.
AgentGuard turns the moving parts of an agent stack into Robot Framework keywords. Connect to an MCP server, grade a SKILL.md, drive Claude Code / Codex / Aider, replay an A2A subagent task, run BFCL tool-call comparisons, and gate the result with scipy-backed statistics — all without leaving a .robot file.
- MCP servers — stdio / SSE / streamable-HTTP / in-memory, full FastMCP client surface
- Agent Skills — SKILL.md discovery, frontmatter validation, Inspect-AI grading
- Hooks — synthesise the 12 Claude Code hook events and assert handler decisions
- SubAgents — A2A 1.0 task lifecycle + LangGraph / CrewAI / AutoGen / OpenAI Agents bridges
- Coding agents — drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; emit the 12 #42796 behavioural metrics
- Public benchmarks — SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
- Security — default-deny skill scanner, redactor, sandboxed execution, AIDefence integration
- Statistics — Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N
Note: the package is pending publication to PyPI. Until then, install from the GitHub source.
From PyPI (once published):

```
pip install robotframework-agentguard
# or
uv add robotframework-agentguard
```

From source (today):

```
pip install git+https://github.com/manykarim/robotframework-agentguard.git
# or, with uv
uv add git+https://github.com/manykarim/robotframework-agentguard.git
```

Optional extras:

```
pip install 'robotframework-agentguard[bridges]'      # LangGraph / CrewAI / AutoGen / OpenAI Agents
pip install 'robotframework-agentguard[benchmarks]'   # SWE-bench, Aider, HumanEval, MBPP, LiveCodeBench
```

Configure the LLM provider with a `.env` file in your project root:

```
OPENROUTER_API_KEY=sk-or-...
# optional overrides
AGENTGUARD_DEFAULT_MODEL=openrouter/anthropic/claude-sonnet-4-5
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
```

Verify the install:

```
agentguard doctor     # provider, env, MCP reachability
agentguard version
```

A minimal smoke test:

```robotframework
*** Settings ***
Library    AgentGuard    provider=litellm    model=openrouter/anthropic/claude-sonnet-4-5

*** Test Cases ***
AgentGuard Should Be Loaded
    ${info}=    Get AgentGuard Info
    Log    ${info}
```

The kitchen-sink `Library    AgentGuard` import exposes every keyword. Prefer narrow imports for bigger suites:
```robotframework
*** Settings ***
Library    AgentGuard.MCP
Library    AgentGuard.Skill
Library    AgentGuard.Stats
```

| Import line | Purpose |
|---|---|
| `Library    AgentGuard` | Kitchen-sink — every keyword reachable |
| `Library    AgentGuard.MCP` | Test MCP servers (stdio / SSE / streamable-HTTP / in-memory) |
| `Library    AgentGuard.Skill` | Discover, parse, validate, grade Agent Skills |
| `Library    AgentGuard.Tool` | BFCL-style tool-call AST + trajectory matching (sketch after this table) |
| `Library    AgentGuard.Stats` | Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N |
| `Library    AgentGuard.Judge` | Classification-based LLM-as-Judge with Cohen's κ calibration |
| `Library    AgentGuard.Security` | Default-deny skill scanner, redactor, sandbox, AIDefence |
| `Library    AgentGuard.Hook` | Claude Code hook lifecycle (12 events × 4 handler types) |
| `Library    AgentGuard.SubAgent` | A2A 1.0 task lifecycle, framework bridges |
| `Library    AgentGuard.Coding` | Drive Claude Code / Codex / Aider / OpenCode + #42796 metric pack |
| `Library    AgentGuard.Benchmark` | SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench |
| `Library    AgentGuard.Scenario` | Unified scenario harness — drop-in for manykarim/rf-mcp tests/e2e/ |

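For intuition, BFCL-style matching (the `AgentGuard.Tool` row above) compares tool calls as structured data rather than strings: same function name, structurally equal argument tree, and — for trajectories — the same call sequence in order. The sketch below is an illustrative reimplementation of that idea, not AgentGuard's matcher; the `ToolCall` class and both helper functions are invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolCall:
    """Hypothetical value object: one tool invocation."""
    name: str
    args: dict[str, Any] | str  # providers may emit arguments as a JSON string

def _arg_tree(args: dict[str, Any] | str) -> dict[str, Any]:
    # Parse JSON-string arguments so both sides compare as structured data;
    # dict equality in Python is structural, so key order and whitespace are irrelevant.
    return json.loads(args) if isinstance(args, str) else args

def ast_match(expected: ToolCall, actual: ToolCall) -> bool:
    """AST-level equality: same function name, structurally equal argument tree."""
    return expected.name == actual.name and _arg_tree(expected.args) == _arg_tree(actual.args)

def trajectory_match(expected: list[ToolCall], actual: list[ToolCall]) -> bool:
    """Trajectory comparison: the whole call sequence must AST-match, in order."""
    return len(expected) == len(actual) and all(map(ast_match, expected, actual))

# Key order and JSON formatting don't matter; argument values and call order do.
want = ToolCall("read_file", {"path": "a.txt", "encoding": "utf-8"})
assert ast_match(want, ToolCall("read_file", '{"encoding": "utf-8", "path": "a.txt"}'))
assert not ast_match(want, ToolCall("read_file", {"path": "b.txt", "encoding": "utf-8"}))
```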
Keyword documentation (browsable HTML, generated by Robot Framework libdoc):
- https://manykarim.github.io/robotframework-agentguard/api/ — full per-library API reference, served via GitHub Pages
- `docs/KEYWORDS.md` — alphabetical text reference for all 147 keywords (Markdown, viewable on GitHub)
- `docs/api/` — the source HTML files (regenerated via `./docs/api/generate.sh`)
Every Get-style keyword accepts the standard `(assertion_operator, assertion_expected, message)` parameters from robotframework-assertion-engine, so a single keyword does both retrieval and assertion:

```robotframework
Tool Hit Rate             ${result}     >=    ${0.7}
Failed Tool Call Count    ${result}     ==    ${0}
Read Edit Ratio           ${session}    >=    ${0.5}
```

Without operator arguments the same keywords just return the value. See `examples/13_assertion_engine_idiom.robot` and ADR-022.
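Under the hood this dual behaviour is the assertion-engine pattern: the keyword computes its value, hands it to the engine together with the optional operator, and the engine either returns the value or raises. A rough Python sketch of how such a keyword could be written — the `Tool Hit Rate` implementation, the result fields, and the keyword class are invented for illustration; `verify_assertion` and `AssertionOperator` are the real assertionengine entry points, used here with the `(value, operator, expected, message)` ordering common in assertion-engine-based libraries (double-check against the version you install):

```python
from assertionengine import AssertionOperator, verify_assertion
from robot.api.deco import keyword

class MetricsKeywords:
    """Illustrative keyword class, not AgentGuard's actual implementation."""

    @keyword("Tool Hit Rate")
    def tool_hit_rate(
        self,
        result: dict,
        assertion_operator: AssertionOperator | None = None,
        assertion_expected: float | None = None,
        message: str = "",
    ) -> float:
        hits = result["tool_hits"]    # hypothetical session-result fields
        calls = result["tool_calls"]
        rate = hits / calls if calls else 0.0
        # With no operator this just returns `rate`; with one it also asserts.
        return verify_assertion(rate, assertion_operator, assertion_expected, message)
```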
Runnable Robot suites under examples/:
| File | Topic |
|---|---|
| `01_mcp_server_basics.robot` | Connect to an MCP server, list and call tools |
| `02_skill_grading.robot` | Grade a SKILL.md against an LLM with Cohen's κ calibration |
| `03_hook_block_destructive.robot` | Synthesise hook events, assert blocking decisions |
| `04_subagent_a2a.robot` | A2A subagent task lifecycle + trajectory matching |
| `05_coding_agent_metrics.robot` | Compute the 12 #42796 behavioural metrics from a session |
| `06_bfcl_tool_selection.robot` | BFCL AST equality + trajectory comparison |
| `07_sandbox_run.robot` | Run untrusted code under a default-deny Docker sandbox |
| `08_swe_bench.robot` | SWE-bench Verified loader + pass@k gate |
| `09_humaneval_live.robot` | HumanEval live grading |
| `10_rf_mcp_integration.robot` | Drop-in replacement for manykarim/rf-mcp e2e patterns |
| `11_agentskills_grading.robot` | Grade manykarim/robotframework-agentskills SKILL.md files |
| `12_mcp_scenario_replacement.robot` | YAML-driven scenarios + live LLM driver |
| `13_assertion_engine_idiom.robot` | Side-by-side: operator form vs old Should-pair form |
| `14_facade_imports.robot` | Side-by-side import variants |

Run any example:
```
PYTHONPATH=. robot --outputdir _out examples/01_mcp_server_basics.robot
```

The library is organised into bounded contexts:

| Context | Purpose |
|---|---|
| Provider | LiteLLM-backed LLMProviderAdapter + thin vendor adapters |
| MCP | FastMCP client wrapper for stdio / SSE / streamable-http / in-memory |
| Skills | SKILL.md discovery, frontmatter validation, Inspect-AI grading |
| Hooks | Synthesise the 12 Claude Code hook events; assert handler decisions |
| SubAgents | A2A 1.0 task lifecycle + delegation-chain assertions |
| CodingAgent | Drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; normalise JSONL |
| Statistics | scipy-backed Mann-Whitney, Cliff's δ, bootstrap CI, pass@k, TARr@N (sketch below) |
| Judge | Classification-based LLM-as-Judge with calibration gating (Cohen's κ ≥ 0.7) |
| Security | Default-deny skill scanner, redactor, sandbox policy, AIDefence integration |
| Telemetry | OTel spans + Robot Framework listener embedding scorecards in log.html |
| BehavioralMetrics | The 12 calculators from anthropic/claude-code#42796 |
| ToolCallCorrectness | BFCL AST/trajectory matcher used by MCP, Skills, SubAgents |
Aggregates, value objects, and ACLs are in docs/ddd/.
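To make the Statistics and Judge gates concrete, the sketch below shows the standard estimators they build on: the unbiased pass@k estimator from the HumanEval paper, Cliff's δ derived from the Mann-Whitney U statistic, and a Cohen's κ ≥ 0.7 calibration gate. This is illustrative maths rather than AgentGuard's API — the function names and toy data are invented; only the scipy and scikit-learn calls are real.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import cohen_kappa_score

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k draws from n samples
    (of which c passed) is a pass, i.e. 1 - C(n-c, k)/C(n, k), computed stably."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def cliffs_delta(x: list[float], y: list[float]) -> float:
    """Cliff's delta via Mann-Whitney: delta = 2U/(n*m) - 1, in [-1, +1].
    Vargha-Delaney A is the related quantity U/(n*m), so delta = 2A - 1."""
    u, _ = mannwhitneyu(x, y, alternative="two-sided")
    return 2.0 * u / (len(x) * len(y)) - 1.0

# Calibration gate in the spirit of the Judge context: only trust an LLM judge
# whose labels agree with human labels at kappa >= 0.7 on a held-out set.
human = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]
kappa = cohen_kappa_score(human, judge)
assert kappa >= 0.7, f"judge not calibrated: kappa={kappa:.2f}"

print(pass_at_k(n=10, c=3, k=5))   # chance that 5 sampled runs out of 10 include a pass
print(cliffs_delta([0.9, 0.8, 0.85], [0.6, 0.7, 0.65]))  # all x > y, so delta = 1.0
```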
Hard ceilings — CI fails if any benchmark exceeds its budget by more than 20% (e.g. the 1 ms BFCL AST budget fails CI only above 1.2 ms).
| Surface | Budget |
|---|---|
| MCP in-memory roundtrip (p50 / p95) | ≤ 5 / 10 ms |
| MCP stdio roundtrip (p50) | ≤ 50 ms |
| BFCL AST match (mean per call) | ≤ 1 ms |
| `mannwhitneyu` n=30/30 | ≤ 5 ms |
| `bootstrap` n=30 / 1000 resamples | ≤ 100 ms |
| Library import + Suite Setup (cold) | ≤ 2 s |
Run the suite locally:
```
uv run pytest benchmarks/ --benchmark-only
```

Full budget table and cost model: docs/performance/.
- Plan — `docs/PLAN.md`
- Keyword reference — https://manykarim.github.io/robotframework-agentguard/api/ (GitHub Pages) · `docs/KEYWORDS.md` (Markdown)
- Architecture Decision Records — `docs/adr/`
- Domain model — `docs/ddd/`
- Performance budgets — `docs/performance/`
- Research dossier — `docs/research/research.md`
- Contributing — `CONTRIBUTING.md`
- Security — `SECURITY.md`
Apache-2.0 — see LICENSE.