Agent-first research and development loop: propose, code, execute, evaluate, repeat.
- Skills under skills/: high-level orchestration (rd-agent) and stage-specific skills (rd-propose, rd-code, rd-execute, rd-evaluate)
- CLI tool catalog via rdagent-tool: direct inspection and primitive operations for when a skill boundary is insufficient
- Python package rd_agent: contracts, orchestration services, ports, and algorithms backing the skill and CLI surfaces
- Enterprise infrastructure: PostgreSQL state store, Prometheus metrics, structured logging, HTTP API, resource-gated concurrency
- Full-chain execution tracing: crash-safe JSONL trace from agent reasoning through graph engine decisions
- Eval suite: acceptance tests (AC-1/AC-2), property-based tests, Kaggle LLM E2E harness
- Tests: regression suites that lock the public surface and contracts
The public surface is transport-free: skills first, CLI tools second, no server abstraction.
uv sync --extra test

One-command setup with skill installation and verification:

bash scripts/setup_env.sh # Claude, local, quick
bash scripts/setup_env.sh --all --scope-all --full-verify # everything

The installer copies canonical skills/ packages into Claude/Codex runtime roots and creates a managed runtime bundle for CLI tool execution.

# Local install (repo-scoped)
uv run python scripts/install_agent_skills.py --runtime claude --scope local
uv run python scripts/install_agent_skills.py --runtime codex --scope local
# Global install (home-scoped)
uv run python scripts/install_agent_skills.py --runtime claude --scope global
uv run python scripts/install_agent_skills.py --runtime codex --scope global

Files are copied, so the install is self-contained and independent of the source repo.
The installer writes:
- Skills into .codex/skills/ or .claude/skills/ (local), or ~/.codex/skills/ or ~/.claude/skills/ (global)
- A managed standalone runtime bundle at .codex/rd-agent/, .claude/rd-agent/, ~/.codex/rd-agent/, or ~/.claude/rd-agent/

Direct CLI catalog commands should be called from that installed runtime bundle root, not from an unrelated caller repo.
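
The install locations listed above can be summarized in a small helper. This is a minimal sketch, not the installer's code: the function name `install_targets` and its arguments are hypothetical, and only the directory names come from the list above.

```python
from pathlib import Path

RUNTIMES = ("claude", "codex")
SCOPES = ("local", "global")


def install_targets(runtime: str, scope: str, repo_root: Path, home: Path) -> dict:
    """Return the skills directory and runtime bundle root for one install.

    Hypothetical helper: it only mirrors the documented directory layout.
    """
    if runtime not in RUNTIMES:
        raise ValueError(f"unknown runtime: {runtime!r}")
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    # Local installs are repo-scoped; global installs are home-scoped.
    base = (repo_root if scope == "local" else home) / f".{runtime}"
    return {"skills": base / "skills", "bundle": base / "rd-agent"}


targets = install_targets("codex", "global", Path("/work/repo"), Path("/home/me"))
print(targets["bundle"])  # the directory to cd into before calling rdagent-tool
```

The bundle path is the one that matters for direct CLI calls, since catalog commands are meant to run from the installed bundle root.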
The operator playbook for using the pipeline.
Use rd-agent first. It routes plain-language intent through persisted state:
- If a paused run exists, it recommends the matching continuation skill
- If preflight blockers exist, it surfaces the blocker and a repair action
- If starting fresh, it recommends multi-branch exploration by default
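
The three routing rules above can be pictured as a simple decision function. This is an illustrative sketch only: `PersistedState`, its fields, and `recommend` are hypothetical names, and checking the rules in the order they are listed is an assumption rather than documented behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersistedState:
    """Hypothetical snapshot of persisted run state (not the real contract)."""

    paused_stage_skill: Optional[str] = None  # e.g. "rd-code" when a run is paused
    blockers: List[str] = field(default_factory=list)


def recommend(state: PersistedState) -> str:
    """Apply the three routing rules in the order they are listed."""
    if state.paused_stage_skill is not None:
        return f"continue with {state.paused_stage_skill}"
    if state.blockers:
        # Surface the first blocker; a repair action would accompany it.
        return f"blocked: {state.blockers[0]}"
    return "start multi-branch exploration"


print(recommend(PersistedState(paused_stage_skill="rd-execute")))
print(recommend(PersistedState(blockers=["missing dataset"])))
print(recommend(PersistedState()))
```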
Round execution modes: host_parallel, host_sequential, local_sequential, blocked, unknown.
rd-agent records the best verified round mode before branch dispatch instead of asking the operator to pre-pick a host path.
Use local_sequential as the rollback-safe path when host-assisted execution is unavailable or unverified.
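
One way to read "best verified round mode" is as a preference list with a local fallback. The sketch below assumes a preference order of host_parallel over host_sequential, which the text does not state; `RoundMode` and `best_verified_mode` are hypothetical names, and only the mode strings come from the list above.

```python
from enum import Enum
from typing import Iterable


class RoundMode(str, Enum):
    HOST_PARALLEL = "host_parallel"
    HOST_SEQUENTIAL = "host_sequential"
    LOCAL_SEQUENTIAL = "local_sequential"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


# Assumed preference order, best first; the source does not define one.
HOST_PREFERENCE = (RoundMode.HOST_PARALLEL, RoundMode.HOST_SEQUENTIAL)


def best_verified_mode(verified: Iterable[RoundMode]) -> RoundMode:
    """Pick the most preferred verified host mode, else fall back locally."""
    verified_set = set(verified)
    for mode in HOST_PREFERENCE:
        if mode in verified_set:
            return mode
    # Host-assisted execution is unavailable or unverified.
    return RoundMode.LOCAL_SEQUENTIAL


print(best_verified_mode([RoundMode.HOST_SEQUENTIAL]).value)
print(best_verified_mode([]).value)
```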
For the public start contract, see skills/rd-agent/SKILL.md.
Inspect current state before continuing. Use the skill contract first; drop to rd-tool-catalog only when you need a specific CLI tool:

cd ~/.codex/rd-agent
uv run rdagent-tool list
uv run rdagent-tool describe rd_run_start

If a round is degraded or blocked, inspect persisted round truth first, apply one recovery action, then continue through rd-agent after truth is repaired.
Simulated evidence verifies V3 contract behavior; it does not prove that a real host runtime behaved the same way.
Treat signal loss as an inspect-first state: re-check host results and persist repaired truth before continuing.

Route to the stage skill matching the paused run:
| Stage | Skill | Entrypoint |
|---|---|---|
| Framing | rd-propose | rd_agent.entry.rd_propose.rd_propose |
| Build | rd-code | rd_agent.entry.rd_code.rd_code |
| Verify | rd-execute | rd_agent.entry.rd_execute.rd_execute |
| Synthesize | rd-evaluate | rd_agent.entry.rd_evaluate.rd_evaluate |
Each skill package at skills/<name>/SKILL.md has the exact continuation
contract and field-level details.
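
The stage table above maps each skill to a dotted entrypoint. As a small sketch, the mapping can be held in a dictionary and split into a module path and a callable name. Treating the last segment as the callable is an assumption made by analogy with the `from rd_agent.entry.rd_agent import rd_agent` import used elsewhere in this README; `STAGE_ENTRYPOINTS` and `split_entrypoint` are hypothetical names.

```python
from typing import Dict, Tuple

# Skill name -> dotted entrypoint, copied from the stage table.
STAGE_ENTRYPOINTS: Dict[str, str] = {
    "rd-propose": "rd_agent.entry.rd_propose.rd_propose",
    "rd-code": "rd_agent.entry.rd_code.rd_code",
    "rd-execute": "rd_agent.entry.rd_execute.rd_execute",
    "rd-evaluate": "rd_agent.entry.rd_evaluate.rd_evaluate",
}


def split_entrypoint(skill: str) -> Tuple[str, str]:
    """Return (module path, callable name) for a stage skill."""
    dotted = STAGE_ENTRYPOINTS[skill]
    module_path, _, attr = dotted.rpartition(".")
    return module_path, attr


module_path, attr = split_entrypoint("rd-code")
print(module_path, attr)  # rd_agent.entry.rd_code rd_code
```

With the package installed, the callable could then be loaded with `importlib.import_module(module_path)` followed by `getattr`; the arguments each entrypoint expects are defined in the corresponding SKILL.md, not here.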
- Skill: skills/rd-agent/SKILL.md
- Entrypoint: rd_agent.entry.rd_agent.rd_agent
- Purpose: start or continue the loop across single-branch and multi-branch execution

Two multi-branch contracts:
- branch_hypotheses: label-only multi-branch exploration (legacy)
- hypothesis_specs: structured exploration with DAG topology, parent selection, dynamic pruning, cross-branch sharing, holdout finalization, and standardized ranking. Holdout finalization is enabled when you provide holdout_evaluation_port; default split / evaluation / embedding helpers are available via rd_agent.ports.defaults

Optional embedding adapters can be injected through rd_agent(..., embedding_port=...).
For example, an Ollama-backed adapter can live outside the core defaults:

from rd_agent.adapters import OllamaEmbeddingPort
from rd_agent.entry.rd_agent import rd_agent

result = rd_agent(
    ...,
    hypothesis_specs=specs,
    embedding_port=OllamaEmbeddingPort(model="embeddinggemma"),
)

Pull the embedding model in Ollama before use, for example:

ollama pull embeddinggemma

When finalization completes, the response is finalization-first: the holdout winner is the selected branch.

- Skill: skills/rd-tool-catalog/SKILL.md
- Module: rd_agent.entry.tool_catalog
- CLI: rdagent-tool

uv run rdagent-tool list # list all tools
uv run rdagent-tool describe rd_run_start # inspect one tool
uv run rdagent-tool describe rd_explore_round

Tool categories: orchestration, inspection, primitives.
Primitive subcategories: branch_lifecycle, branch_knowledge, branch_selection, memory.

- rd-agent: default entry unless already inside a known stage
- Stage skills: rd-propose / rd-code / rd-execute / rd-evaluate when working inside one owned stage
- rd-tool-catalog: selective downshift when a skill boundary is insufficient
- Narrow by category → primitive subcategory → specific tool

Quick gate:
make test-quick

Full gate:
make test
make lint
uv run lint-imports

Optional production features, installed with extras:
pip install "rd-agent[postgres]" # PostgreSQL state store
pip install "rd-agent[observability]" # structlog + Prometheus
pip install "rd-agent[api]" # FastAPI HTTP server
pip install "rd-agent[enterprise]" # all of the above

- Structured logging: structlog with contextvars propagation (run_id, branch_id, stage_key)
- Prometheus metrics: 8 collectors (runs, rounds, branches, dispatch, stages, state ops, memory ops, trace events)
- Full-chain tracing: crash-safe JSONL at .state/runs/{run_id}/trace.jsonl

Trace events cover three layers:
| Layer | Events | Source |
|---|---|---|
| Graph Engine | round_start/end, branch_spawn/merge/prune, convergence_eval | MultiBranchService |
| Stage Execution | stage_start/complete/decision, error | SkillLoopService |
| Agent Reasoning | agent_response, dispatch_result | Dispatch adapters, ReceiptCollectionService |

Query traces:
cat .state/runs/*/trace.jsonl | jq .kind | sort | uniq -c
cat .state/runs/{run_id}/trace.jsonl | jq 'select(.branch_id=="br-1")'

Configure with configure_tracing(enabled=True, root=".state"); tracing is auto-configured in all entry points.
Agent-level hooks: the AgentTraceHook protocol for Python-side capture, and hooks/post_tool_trace.sh for Claude Code (CC) PostToolUse capture.
See dev_doc/TRACING_ARCHITECTURE.md for full design.
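
The jq queries above can also be expressed in Python when jq is not available. This is a minimal sketch: the `kind` and `branch_id` field names are taken from the jq examples, while the sample events, the helper names, and the choice to skip lines that fail to parse (for example a final line cut short by a crash) are assumptions, not documented behavior.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, List


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSONL line, skipping blank or unparseable lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a truncated last line
        if isinstance(event, dict):
            yield event


def count_kinds(lines: Iterable[str]) -> Counter:
    """Equivalent of: jq .kind | sort | uniq -c"""
    return Counter(event.get("kind") for event in iter_events(lines))


def for_branch(lines: Iterable[str], branch_id: str) -> List[dict]:
    """Equivalent of: jq 'select(.branch_id=="...")'"""
    return [e for e in iter_events(lines) if e.get("branch_id") == branch_id]


# Made-up sample events for illustration only.
sample = [
    '{"kind": "round_start"}',
    '{"kind": "stage_start", "branch_id": "br-1"}',
    '{"kind": "stage_complete", "branch_id": "br-1"}',
    '{"kind": "stage_sta',  # truncated final line
]
print(count_kinds(sample))
print(for_branch(sample, "br-1"))
```

For a real run, pass an open file handle instead of the sample list, for example `with open(path) as fh: count_kinds(fh)`.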
Wave dispatch uses checkpoint-and-resume:
- WaveDispatchCheckpoint persisted after each wave
- DispatchRecoveryAssessment determines resume/restart/skip action
- Resource gate (SlotBasedResourceGate) bounds concurrent branch execution
See dev_doc/RESOURCE_CONCURRENCY_ARCHITECTURE.md for details.
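
The checkpoint-and-resume idea can be illustrated with a toy loop that persists progress after each wave and skips completed waves on the next call. This sketch covers only the resume action; the JSON layout, the `run_waves` function, and the simulated crash are hypothetical and do not reflect the real WaveDispatchCheckpoint or DispatchRecoveryAssessment types.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence


def run_waves(
    waves: Sequence[List[str]],
    dispatch: Callable[[str], None],
    checkpoint_path: Path,
) -> int:
    """Dispatch waves in order, persisting progress after each wave.

    Returns the number of waves executed during this call.
    """
    completed = 0
    if checkpoint_path.exists():
        completed = json.loads(checkpoint_path.read_text())["completed_waves"]

    executed = 0
    for index in range(completed, len(waves)):
        for branch_id in waves[index]:
            dispatch(branch_id)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp = checkpoint_path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"completed_waves": index + 1}))
        tmp.replace(checkpoint_path)
        executed += 1
    return executed


waves = [["br-1", "br-2"], ["br-3"], ["br-4"]]
dispatched: List[str] = []
crash_on = {"br-3"}


def dispatch(branch_id: str) -> None:
    if branch_id in crash_on:
        crash_on.discard(branch_id)  # fail only once
        raise RuntimeError(f"simulated crash on {branch_id}")
    dispatched.append(branch_id)


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "wave_checkpoint.json"
    try:
        run_waves(waves, dispatch, checkpoint)
    except RuntimeError:
        pass  # the first wave is checkpointed; the second never completed
    resumed = run_waves(waves, dispatch, checkpoint)

print(dispatched)  # the first wave is not dispatched again on resume
print(resumed)
```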
uvicorn rd_agent.api.app:create_app --factory

Routes: POST /api/v1/runs, GET /api/v1/runs/{run_id}, GET /health

Acceptance tests under tests/eval/:
| Suite | Purpose |
|---|---|
| AC-1 (test_ac1_*) | Single-round, multi-round, finalization correctness |
| AC-2 (test_ac2_*) | PUCT, pruning, holdout, decay, convergence, fault injection |
| Kaggle E2E | LLM agent on Titanic competition with trajectory scoring |

python -m pytest tests/eval/ -v # all eval tests
python -m tests.eval.kaggle.run_llm_e2e # LLM E2E (requires Claude Code)

See dev_doc/EVAL_SPECIFICATION.md for design.

my-RDagent/
  pyproject.toml
  .importlinter
  Makefile
  scripts/
    setup_env.sh
    install_agent_skills.py
    bump_version.py
  hooks/
    post_tool_trace.sh # CC PostToolUse trace hook
  skills/
    _shared/references/ # cross-skill shared context
    rd-agent/ # orchestration skill
      SKILL.md
      workflows/
      references/
    rd-propose/ # framing stage
    rd-code/ # build stage
    rd-execute/ # verify stage
    rd-evaluate/ # synthesize stage
    rd-tool-catalog/ # CLI tool inspection
  rd_agent/
    adapters/ # concrete implementations (filesystem, postgres)
    algorithms/ # pure math: decay, PUCT, pruning, holdout, merge
    api/ # FastAPI HTTP routes and middleware
    compat/legacy/ # legacy translation seam (isolated)
    contracts/ # pydantic data contracts (18 models)
    devtools/ # skill installer
    entry/ # public entrypoints and CLI
    observability/ # logging, metrics, tracing
    orchestration/ # service layer (40+ modules)
    ports/ # abstract ports (9 interfaces)
    tools/ # CLI tool implementations
  tests/
    eval/ # AC-1/AC-2 acceptance + Kaggle E2E
  dev_doc/ # architecture documents
  docs/CODEMAPS/ # token-lean architecture maps