
robotframework-agentguard


A Robot Framework library for testing MCP servers, Agent Skills, Hooks, SubAgents, and coding-agent CLIs — provider-agnostic via LiteLLM, BFCL-grade tool-call matching, statistical N≥10 by default.

AgentGuard turns the moving parts of an agent stack into Robot Framework keywords. Connect to an MCP server, grade a SKILL.md, drive Claude Code / Codex / Aider, replay an A2A subagent task, run BFCL tool-call comparisons, and gate the result with scipy-backed statistics — all without leaving a .robot file.

What it tests

  • MCP servers — stdio / SSE / streamable-HTTP / in-memory, full FastMCP client surface
  • Agent Skills — SKILL.md discovery, frontmatter validation, Inspect-AI grading
  • Hooks — synthesise the 12 Claude Code hook events and assert handler decisions
  • SubAgents — A2A 1.0 task lifecycle + LangGraph / CrewAI / AutoGen / OpenAI Agents bridges
  • Coding agents — drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; emit the 12 #42796 behavioural metrics
  • Public benchmarks — SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
  • Security — default-deny skill scanner, redactor, sandboxed execution, AIDefence integration
  • Statistics — Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N (see the pass@k sketch after this list)
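
The pass@k in the last bullet is the standard unbiased estimator from the HumanEval paper (Chen et al., 2021): given n samples per task of which c pass, it estimates the probability that at least one of k randomly drawn samples passes. A standalone sketch of the statistic, not AgentGuard's internal implementation:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k-draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-12  # pass@1 is the raw pass rate
print(pass_at_k(10, 3, 5))                     # ~0.92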

Installation

Note: the package is pending publication to PyPI. Until then, install from the GitHub source.

From PyPI (once published):

pip install robotframework-agentguard
# or
uv add robotframework-agentguard

From source (today):

pip install git+https://github.com/manykarim/robotframework-agentguard.git
# or, with uv
uv add git+https://github.com/manykarim/robotframework-agentguard.git

Optional extras:

pip install 'robotframework-agentguard[bridges]'      # LangGraph / CrewAI / AutoGen / OpenAI Agents
pip install 'robotframework-agentguard[benchmarks]'   # SWE-bench, Aider, HumanEval, MBPP, LiveCodeBench

Configure the LLM provider with a .env file in your project root:

OPENROUTER_API_KEY=sk-or-...
# optional overrides
AGENTGUARD_DEFAULT_MODEL=openrouter/anthropic/claude-sonnet-4-5
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
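
The model strings above are plain LiteLLM identifiers, so the provider setup can be sanity-checked outside AgentGuard. A minimal sketch using the standard LiteLLM API directly, assuming python-dotenv and litellm are installed:

from dotenv import load_dotenv
from litellm import completion

load_dotenv()  # reads OPENROUTER_API_KEY from .env

response = completion(
    model="openrouter/anthropic/claude-sonnet-4-5",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)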

Verify the install:

agentguard doctor    # provider, env, MCP reachability
agentguard version

Quickstart

*** Settings ***
Library    AgentGuard    provider=litellm    model=openrouter/anthropic/claude-sonnet-4-5

*** Test Cases ***
AgentGuard Should Be Loaded
    ${info}=    Get AgentGuard Info
    Log    ${info}

The kitchen-sink import Library AgentGuard exposes every keyword. Prefer the narrower sub-library imports for larger suites:

*** Settings ***
Library    AgentGuard.MCP
Library    AgentGuard.Skill
Library    AgentGuard.Stats

Sub-library imports

| Import line | Purpose |
| --- | --- |
| Library AgentGuard | Kitchen-sink — every keyword reachable |
| Library AgentGuard.MCP | Test MCP servers (stdio / SSE / streamable-HTTP / in-memory) |
| Library AgentGuard.Skill | Discover, parse, validate, grade Agent Skills |
| Library AgentGuard.Tool | BFCL-style tool-call AST + trajectory matching |
| Library AgentGuard.Stats | Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N |
| Library AgentGuard.Judge | Classification-based LLM-as-Judge with Cohen's κ calibration (sketch below) |
| Library AgentGuard.Security | Default-deny skill scanner, redactor, sandbox, AIDefence |
| Library AgentGuard.Hook | Claude Code hook lifecycle (12 events × 4 handler types) |
| Library AgentGuard.SubAgent | A2A 1.0 task lifecycle, framework bridges |
| Library AgentGuard.Coding | Drive Claude Code / Codex / Aider / OpenCode + #42796 metric pack |
| Library AgentGuard.Benchmark | SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench |
| Library AgentGuard.Scenario | Unified scenario harness — drop-in for manykarim/rf-mcp tests/e2e/ |
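
AgentGuard.Judge gates verdicts on judge-human agreement of Cohen's κ ≥ 0.7, i.e. observed agreement corrected for chance agreement. A standalone sketch of the statistic itself, not AgentGuard's calibration API:

from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in ca) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

human = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail"]
print(cohens_kappa(human, judge))  # ~0.67, just below the 0.7 gate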

Keyword documentation is available as browsable HTML, generated by Robot Framework libdoc.

Operator-driven assertions

Every Get-style keyword accepts the standard (assertion_operator, assertion_expected, message) parameters from robotframework-assertion-engine, so a single keyword does both retrieval and assertion:

Tool Hit Rate              ${result}    >=    ${0.7}
Failed Tool Call Count     ${result}    ==    ${0}
Read Edit Ratio            ${session}   >=    ${0.5}

Without operator arguments the same keywords just return the value. See examples/13_assertion_engine_idiom.robot and ADR-022.

Examples

Runnable Robot suites under examples/:

| File | Topic |
| --- | --- |
| 01_mcp_server_basics.robot | Connect to an MCP server, list and call tools |
| 02_skill_grading.robot | Grade a SKILL.md against an LLM with Cohen's κ calibration |
| 03_hook_block_destructive.robot | Synthesise hook events, assert blocking decisions |
| 04_subagent_a2a.robot | A2A subagent task lifecycle + trajectory matching |
| 05_coding_agent_metrics.robot | Compute the 12 #42796 behavioural metrics from a session |
| 06_bfcl_tool_selection.robot | BFCL AST equality + trajectory comparison |
| 07_sandbox_run.robot | Run untrusted code under default-deny Docker sandbox |
| 08_swe_bench.robot | SWE-bench Verified loader + pass@k gate |
| 09_humaneval_live.robot | HumanEval live grading |
| 10_rf_mcp_integration.robot | Drop-in replacement for manykarim/rf-mcp e2e patterns |
| 11_agentskills_grading.robot | Grade manykarim/robotframework-agentskills SKILL.md files |
| 12_mcp_scenario_replacement.robot | YAML-driven scenarios + live LLM driver |
| 13_assertion_engine_idiom.robot | Side-by-side: operator form vs old Should-pair form |
| 14_facade_imports.robot | Side-by-side import variants |

Run any example:

PYTHONPATH=. robot --outputdir _out examples/01_mcp_server_basics.robot

Architecture (12 bounded contexts)

| Context | Purpose |
| --- | --- |
| Provider | LiteLLM-backed LLMProviderAdapter + thin vendor adapters |
| MCP | FastMCP client wrapper for stdio / SSE / streamable-HTTP / in-memory |
| Skills | SKILL.md discovery, frontmatter validation, Inspect-AI grading |
| Hooks | Synthesise the 12 Claude Code hook events; assert handler decisions |
| SubAgents | A2A 1.0 task lifecycle + delegation-chain assertions |
| CodingAgent | Drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; normalise JSONL |
| Statistics | scipy-backed Mann-Whitney, Cliff's δ, bootstrap CI, pass@k, TARr@N |
| Judge | Classification-based LLM-as-Judge with calibration gating (Cohen's κ ≥ 0.7) |
| Security | Default-deny skill scanner, redactor, sandbox policy, AIDefence integration |
| Telemetry | OTel spans + Robot Framework listener embedding scorecards in log.html |
| BehavioralMetrics | The 12 calculators from anthropic/claude-code#42796 |
| ToolCallCorrectness | BFCL AST/trajectory matcher used by MCP, Skills, SubAgents (sketch below) |

Aggregates, value objects, and ACLs are in docs/ddd/.
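
For orientation, BFCL-style matching compares tool calls structurally, as (function name, keyword arguments), rather than as text. A simplified sketch of that core idea; an illustration only, since the real matcher typically also handles type coercion and lists of acceptable values:

from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def ast_match(expected: ToolCall, actual: ToolCall) -> bool:
    # Name must agree and every argument must match; dict equality is
    # order-insensitive, which is what makes this an AST check, not a string check.
    return expected.name == actual.name and expected.args == actual.args

def trajectory_match(expected: list[ToolCall], actual: list[ToolCall]) -> bool:
    # Strict in-order comparison of whole call sequences.
    return len(expected) == len(actual) and all(
        ast_match(e, a) for e, a in zip(expected, actual)
    )

want = [ToolCall("read_file", {"path": "README.md"})]
got = [ToolCall("read_file", {"path": "README.md"})]
assert trajectory_match(want, got)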

Performance

Hard ceilings — CI fails if any benchmark exceeds its budget by more than 20% (a sketch of such a gate follows the table).

| Surface | Budget |
| --- | --- |
| MCP in-memory roundtrip (p50 / p95) | ≤ 5 / 10 ms |
| MCP stdio roundtrip (p50) | ≤ 50 ms |
| BFCL AST match (mean per call) | ≤ 1 ms |
| mannwhitneyu n=30/30 | ≤ 5 ms |
| bootstrap n=30 / 1000 resamples | ≤ 100 ms |
| Library import + Suite Setup (cold) | ≤ 2 s |
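
A gate of this shape can be reproduced with a plain timer. A minimal sketch against the mannwhitneyu row above; illustration only, since the actual suite uses pytest-benchmark (see the command below):

import time
from statistics import median

import numpy as np
from scipy.stats import mannwhitneyu

BUDGET_S, TOLERANCE = 0.005, 1.20  # 5 ms budget, +20% allowed overshoot
a, b = np.random.rand(30), np.random.rand(30)

samples = []
for _ in range(200):
    t0 = time.perf_counter()
    mannwhitneyu(a, b)
    samples.append(time.perf_counter() - t0)

p50 = median(samples)
assert p50 <= BUDGET_S * TOLERANCE, f"p50 {p50 * 1e3:.2f} ms exceeds budget"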

Run the suite locally:

uv run pytest benchmarks/ --benchmark-only

Full budget table and cost model: docs/performance/.

Documentation

License

Apache-2.0 — see LICENSE.
