
robotframework-agentguard


A Robot Framework library for testing MCP servers, Agent Skills, Hooks, SubAgents, and coding-agent CLIs — provider-agnostic via LiteLLM, BFCL-grade tool-call matching, statistical N≥10 by default.

AgentGuard turns the moving parts of an agent stack into Robot Framework keywords. Connect to an MCP server, grade a SKILL.md, drive Claude Code / Codex / Aider, replay an A2A subagent task, run BFCL tool-call comparisons, and gate the result with scipy-backed statistics — all without leaving a .robot file.

What it tests

  • MCP servers — stdio / SSE / streamable-HTTP / in-memory, full FastMCP client surface
  • Agent Skills — SKILL.md discovery, frontmatter validation, Inspect-AI grading
  • Hooks — synthesise the 12 Claude Code hook events and assert handler decisions
  • SubAgents — A2A 1.0 task lifecycle + LangGraph / CrewAI / AutoGen / OpenAI Agents bridges
  • Coding agents — drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; emit the 12 #42796 behavioural metrics
  • Public benchmarks — SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
  • Security — default-deny skill scanner, redactor, sandboxed execution, AIDefence integration
  • Statistics — Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N (see the pass@k sketch after this list)
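
The pass@k in the last bullet is the standard unbiased estimator from the HumanEval paper (Chen et al., 2021): given n samples per task of which c pass, it estimates the probability that at least one of k randomly drawn samples passes. A standalone sketch of the statistic, not AgentGuard's internal implementation:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k-draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-12  # pass@1 is the raw pass rate
print(pass_at_k(10, 3, 5))                     # ~0.92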

Installation

Note: the package is pending publication to PyPI. Until then, install from the GitHub source.

From PyPI (once published):

pip install robotframework-agentguard
# or
uv add robotframework-agentguard

From source (today):

pip install git+https://github.com/manykarim/robotframework-agentguard.git
# or, with uv
uv add git+https://github.com/manykarim/robotframework-agentguard.git

Optional extras:

pip install 'robotframework-agentguard[bridges]'      # LangGraph / CrewAI / AutoGen / OpenAI Agents
pip install 'robotframework-agentguard[benchmarks]'   # SWE-bench, Aider, HumanEval, MBPP, LiveCodeBench

Configure the LLM provider with a .env file in your project root:

OPENROUTER_API_KEY=sk-or-...
# optional overrides
AGENTGUARD_DEFAULT_MODEL=openrouter/anthropic/claude-sonnet-4-5
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
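
The model strings above are plain LiteLLM identifiers, so the provider setup can be sanity-checked outside AgentGuard. A minimal sketch using the standard LiteLLM API directly, assuming python-dotenv and litellm are installed:

from dotenv import load_dotenv
from litellm import completion

load_dotenv()  # reads OPENROUTER_API_KEY from .env

response = completion(
    model="openrouter/anthropic/claude-sonnet-4-5",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)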

Verify the install:

agentguard doctor    # provider, env, MCP reachability
agentguard version

Quickstart

*** Settings ***
Library    AgentGuard    provider=litellm    model=openrouter/anthropic/claude-sonnet-4-5

*** Test Cases ***
AgentGuard Should Be Loaded
    ${info}=    Get AgentGuard Info
    Log    ${info}

The kitchen-sink import Library AgentGuard exposes every keyword. Prefer the narrower sub-library imports for larger suites:

*** Settings ***
Library    AgentGuard.MCP
Library    AgentGuard.Skill
Library    AgentGuard.Stats

Sub-library imports

| Import line | Purpose |
| --- | --- |
| Library AgentGuard | Kitchen-sink — every keyword reachable |
| Library AgentGuard.MCP | Test MCP servers (stdio / SSE / streamable-HTTP / in-memory) |
| Library AgentGuard.Skill | Discover, parse, validate, grade Agent Skills |
| Library AgentGuard.Tool | BFCL-style tool-call AST + trajectory matching |
| Library AgentGuard.Stats | Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N |
| Library AgentGuard.Judge | Classification-based LLM-as-Judge with Cohen's κ calibration (sketch below) |
| Library AgentGuard.Security | Default-deny skill scanner, redactor, sandbox, AIDefence |
| Library AgentGuard.Hook | Claude Code hook lifecycle (12 events × 4 handler types) |
| Library AgentGuard.SubAgent | A2A 1.0 task lifecycle, framework bridges |
| Library AgentGuard.Coding | Drive Claude Code / Codex / Aider / OpenCode + #42796 metric pack |
| Library AgentGuard.Benchmark | SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench |
| Library AgentGuard.Scenario | Unified scenario harness — drop-in for manykarim/rf-mcp tests/e2e/ |
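
AgentGuard.Judge gates verdicts on judge-human agreement of Cohen's κ ≥ 0.7, i.e. observed agreement corrected for chance agreement. A standalone sketch of the statistic itself, not AgentGuard's calibration API:

from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in ca) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

human = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail"]
print(cohens_kappa(human, judge))  # ~0.67, just below the 0.7 gate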

Keyword documentation is available as browsable HTML, generated by Robot Framework libdoc.

Operator-driven assertions

Every Get-style keyword accepts the standard (assertion_operator, assertion_expected, message) parameters from robotframework-assertion-engine, so a single keyword does both retrieval and assertion:

Tool Hit Rate              ${result}    >=    ${0.7}
Failed Tool Call Count     ${result}    ==    ${0}
Read Edit Ratio            ${session}   >=    ${0.5}

Without operator arguments the same keywords just return the value. See examples/13_assertion_engine_idiom.robot and ADR-022.

Examples

Runnable Robot suites under examples/:

| File | Topic |
| --- | --- |
| 01_mcp_server_basics.robot | Connect to an MCP server, list and call tools |
| 02_skill_grading.robot | Grade a SKILL.md against an LLM with Cohen's κ calibration |
| 03_hook_block_destructive.robot | Synthesise hook events, assert blocking decisions |
| 04_subagent_a2a.robot | A2A subagent task lifecycle + trajectory matching |
| 05_coding_agent_metrics.robot | Compute the 12 #42796 behavioural metrics from a session |
| 06_bfcl_tool_selection.robot | BFCL AST equality + trajectory comparison |
| 07_sandbox_run.robot | Run untrusted code under default-deny Docker sandbox |
| 08_swe_bench.robot | SWE-bench Verified loader + pass@k gate |
| 09_humaneval_live.robot | HumanEval live grading |
| 10_rf_mcp_integration.robot | Drop-in replacement for manykarim/rf-mcp e2e patterns |
| 11_agentskills_grading.robot | Grade manykarim/robotframework-agentskills SKILL.md files |
| 12_mcp_scenario_replacement.robot | YAML-driven scenarios + live LLM driver |
| 13_assertion_engine_idiom.robot | Side-by-side: operator form vs old Should-pair form |
| 14_facade_imports.robot | Side-by-side import variants |

Run any example:

PYTHONPATH=. robot --outputdir _out examples/01_mcp_server_basics.robot

Architecture (12 bounded contexts)

| Context | Purpose |
| --- | --- |
| Provider | LiteLLM-backed LLMProviderAdapter + thin vendor adapters |
| MCP | FastMCP client wrapper for stdio / SSE / streamable-HTTP / in-memory |
| Skills | SKILL.md discovery, frontmatter validation, Inspect-AI grading |
| Hooks | Synthesise the 12 Claude Code hook events; assert handler decisions |
| SubAgents | A2A 1.0 task lifecycle + delegation-chain assertions |
| CodingAgent | Drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; normalise JSONL |
| Statistics | scipy-backed Mann-Whitney, Cliff's δ, bootstrap CI, pass@k, TARr@N |
| Judge | Classification-based LLM-as-Judge with calibration gating (Cohen's κ ≥ 0.7) |
| Security | Default-deny skill scanner, redactor, sandbox policy, AIDefence integration |
| Telemetry | OTel spans + Robot Framework listener embedding scorecards in log.html |
| BehavioralMetrics | The 12 calculators from anthropic/claude-code#42796 |
| ToolCallCorrectness | BFCL AST/trajectory matcher used by MCP, Skills, SubAgents (sketch below) |

Aggregates, value objects, and ACLs are in docs/ddd/.
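
For orientation, BFCL-style matching compares tool calls structurally, as (function name, keyword arguments), rather than as text. A simplified sketch of that core idea; an illustration only, since the real matcher typically also handles type coercion and lists of acceptable values:

from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def ast_match(expected: ToolCall, actual: ToolCall) -> bool:
    # Name must agree and every argument must match; dict equality is
    # order-insensitive, which is what makes this an AST check, not a string check.
    return expected.name == actual.name and expected.args == actual.args

def trajectory_match(expected: list[ToolCall], actual: list[ToolCall]) -> bool:
    # Strict in-order comparison of whole call sequences.
    return len(expected) == len(actual) and all(
        ast_match(e, a) for e, a in zip(expected, actual)
    )

want = [ToolCall("read_file", {"path": "README.md"})]
got = [ToolCall("read_file", {"path": "README.md"})]
assert trajectory_match(want, got)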

Performance

Hard ceilings — CI fails if any benchmark exceeds its budget by more than 20% (a sketch of such a gate follows the table).

| Surface | Budget |
| --- | --- |
| MCP in-memory roundtrip (p50 / p95) | ≤ 5 / 10 ms |
| MCP stdio roundtrip (p50) | ≤ 50 ms |
| BFCL AST match (mean per call) | ≤ 1 ms |
| mannwhitneyu n=30/30 | ≤ 5 ms |
| bootstrap n=30 / 1000 resamples | ≤ 100 ms |
| Library import + Suite Setup (cold) | ≤ 2 s |
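
A gate of this shape can be reproduced with a plain timer. A minimal sketch against the mannwhitneyu row above; illustration only, since the actual suite uses pytest-benchmark (see the command below):

import time
from statistics import median

import numpy as np
from scipy.stats import mannwhitneyu

BUDGET_S, TOLERANCE = 0.005, 1.20  # 5 ms budget, +20% allowed overshoot
a, b = np.random.rand(30), np.random.rand(30)

samples = []
for _ in range(200):
    t0 = time.perf_counter()
    mannwhitneyu(a, b)
    samples.append(time.perf_counter() - t0)

p50 = median(samples)
assert p50 <= BUDGET_S * TOLERANCE, f"p50 {p50 * 1e3:.2f} ms exceeds budget"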

Run the suite locally:

uv run pytest benchmarks/ --benchmark-only

Full budget table and cost model: docs/performance/.

Documentation

License

Apache-2.0 — see LICENSE.
