[1/4] RFC 005: Agentic Harness Integration #387
Merged
Greptile Summary
This PR introduces RFC 005 (Agentic Harness Integration) and implements foundational components from RFC 004 (Rubrics) to support LLM-as-a-judge evaluation.
RFC 005 - Agentic Harness Integration:
RFC 004 Implementation (LLMClient + LLMJudge):
Test Coverage:
Backward Compatibility:
Confidence Score: 5/5
Important Files Changed
Flowchart
flowchart TB
subgraph pr["PR #387: RFC 005 + RFC 004 Implementation"]
rfc005["RFC 005
Agentic Harness Integration"]
llm["LLMClient Abstraction
OpenAIClient Implementation"]
judge["LLMJudge Rubric
(RFC 004)"]
chess["Chess Environment
Rubric Migration"]
end
subgraph design["Design Decisions"]
wrapping["Wrapping Pattern:
Container provides FS,
harness runs inside"]
injection["MCP Tool Injection:
Env tools → harness config
before session start"]
modes["Two Modes:
Production: proxy to harness
Simulation: step() = full session"]
end
subgraph impl["RFC 004 Implementation"]
llmclient["LLMClient ABC +
OpenAIClient"]
llmjudge["LLMJudge extends Rubric
async forward()"]
chessmig["Chess env migrated to
ChessWinLossRubric"]
end
rfc005 --> wrapping
rfc005 --> injection
rfc005 --> modes
rfc005 -.->|"enables"| judge
llm --> llmclient
judge --> llmjudge
chess --> chessmig
llmjudge -.->|"uses"| llmclient
classDef rfcBox fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef designBox fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef implBox fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
class rfc005,llm,judge,chess rfcBox
class wrapping,injection,modes designBox
class llmclient,llmjudge,chessmig implBox
Last reviewed commit: 79a784c
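For readers skimming the flowchart, the relationship "LLMJudge extends Rubric, async forward(), uses LLMClient" could look roughly like the sketch below. This is not the code from the PR: the `complete` method name, the `forward(action, observation)` signature, the prompt template, the clamping to [0, 1], and the `FakeClient` helper are all assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """Minimal LLM client interface (the method name `complete` is assumed)."""

    @abstractmethod
    async def complete(self, prompt: str) -> str:
        ...


class Rubric(ABC):
    """Stand-in for the RFC 004 Rubric base class (the real API may differ)."""

    @abstractmethod
    async def forward(self, action: str, observation: str) -> float:
        ...


class LLMJudge(Rubric):
    """Rubric that asks an LLM for a score and parses the reply as a float."""

    def __init__(self, client: LLMClient, prompt_template: str) -> None:
        self.client = client
        self.prompt_template = prompt_template

    async def forward(self, action: str, observation: str) -> float:
        prompt = self.prompt_template.format(action=action, observation=observation)
        reply = await self.client.complete(prompt)
        try:
            score = float(reply.strip())
        except ValueError:
            return 0.0  # unparseable reply: treated as zero reward in this sketch
        return max(0.0, min(1.0, score))  # clamp to [0, 1]


class FakeClient(LLMClient):
    """Hypothetical offline client that always returns a fixed reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    async def complete(self, prompt: str) -> str:
        return self.reply


judge = LLMJudge(FakeClient("0.8"), "Rate this move.\nAction: {action}\nResult: {observation}")
score = asyncio.run(judge.forward("e2e4", "board updated"))
```

Keeping the judge behind an abstract client means tests can swap in a fake like the one above, while a concrete client such as OpenAIClient would implement the same interface.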
Defines how OpenEnv integrates with external agentic harnesses (OpenClaw, Claude Code, Gemini CLI, etc.) that own their own ReAct control loop, tools, and filesystem management.

Key design decisions:
- Wrapping pattern: OpenEnv container provides filesystem, harness runs inside
- MCP tool injection: Env tools plugged into harness before session start
- Multi-turn episodes: Each step() is one conversational turn, harness maintains context across turns within an episode
- Streaming: send_message_streaming() yields HarnessEvents as they happen
- Standard trajectory format: HarnessEvent schema for uniform observability
- Production mode: Proxy to harness with streaming, OpenEnv handles sessions
- OpenClaw as first concrete integration

Closes #385
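The multi-turn and streaming decisions in this commit message can be pictured with a small sketch. Only the names `send_message_streaming`, `HarnessEvent`, and `step()` come from the text; the event fields (`kind`, `payload`), the event kinds, and the `FakeHarness` class are hypothetical, and the real schema in the RFC may look different.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator


@dataclass
class HarnessEvent:
    """One observable event in a harness turn (field names are assumed)."""

    kind: str  # e.g. "tool_call", "text", "turn_end"
    payload: dict[str, Any] = field(default_factory=dict)


class FakeHarness:
    """Hypothetical harness that keeps conversation context across turns."""

    def __init__(self) -> None:
        self.history: list[str] = []

    async def send_message_streaming(self, message: str) -> AsyncIterator[HarnessEvent]:
        self.history.append(message)
        yield HarnessEvent("tool_call", {"name": "read_file"})
        await asyncio.sleep(0)  # hand control back to the event loop between events
        yield HarnessEvent("text", {"content": f"reply to turn {len(self.history)}"})
        yield HarnessEvent("turn_end")


async def step(harness: FakeHarness, message: str) -> list[HarnessEvent]:
    """One step() is one conversational turn: collect events until the stream ends."""
    return [event async for event in harness.send_message_streaming(message)]


async def run_episode() -> list[list[HarnessEvent]]:
    harness = FakeHarness()  # one harness instance per episode, so context persists
    return [await step(harness, m) for m in ("fix the bug", "now add a test")]


trajectory = asyncio.run(run_episode())
```

Here the second turn sees the first turn's history because both calls go to the same harness object, which is the "context across turns within an episode" behaviour described above.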
79a784c to 70ee465
Add four new sections addressing review feedback:
- Harness Security Boundary: network isolation (harness cannot reach orchestration API), MCP tool scoping (no reward tools), reward boundary (rubric runs after turn, not during)
- Trajectory Semantics: comparison table of traditional vs harness step/episode/trajectory definitions, with implications for rubric authors, training loops, and monitoring
- Temporal Semantics: how harness turns relate to RFC 001's "Time Problem" (wall-clock execution, synchronous from training loop, delays transparent to harness)
- Architectural note on /harness endpoint: explicitly scoped as harness-specific specialization, not a general agent-hosting pattern; future generalization requires its own RFC
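The "MCP tool scoping (no reward tools)" rule could be enforced with a filter applied before tools are injected into the harness. The sketch below is one possible shape, not the RFC's mechanism: the `Tool` type, the `tags` field, the `"reward"` tag, and the tool names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool descriptor; tagging reward tools is an assumption."""

    name: str
    tags: frozenset[str] = frozenset()


def scope_tools_for_harness(tools: list[Tool]) -> list[Tool]:
    """Drop anything tagged as reward-related before MCP injection."""
    return [t for t in tools if "reward" not in t.tags]


env_tools = [
    Tool("make_move"),
    Tool("get_board"),
    Tool("score_position", frozenset({"reward"})),
]
injected = scope_tools_for_harness(env_tools)
injected_names = [t.name for t in injected]
```

With this kind of filter the harness never sees a tool that exposes reward, which complements the reward boundary above where the rubric runs only after the turn finishes.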
Summary
Key Design Decisions
- step() triggers entire harness ReAct loop
New Abstractions
- HarnessConfig - Pydantic model for harness configuration
- HarnessAdapter - ABC for harness-specific lifecycle management
- HarnessEnvironment - MCPEnvironment subclass wrapping a harness
- HarnessTransport - Enum for communication transport (stdio, HTTP, MCP)
Open Questions (for review)
Test plan
Closes #385
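As a rough picture of how the abstractions listed under New Abstractions might fit together, here is a minimal sketch. It uses a stdlib dataclass as a stand-in for the Pydantic model so it runs without dependencies, and every field name, method name, and the `EchoAdapter` class are assumptions rather than the API defined in the RFC. HarnessEnvironment is omitted because it depends on MCPEnvironment.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum


class HarnessTransport(Enum):
    """Communication transport between OpenEnv and the harness."""

    STDIO = "stdio"
    HTTP = "http"
    MCP = "mcp"


@dataclass
class HarnessConfig:
    """Dataclass stand-in for the Pydantic model; all field names are assumed."""

    name: str
    command: list[str]
    transport: HarnessTransport = HarnessTransport.STDIO
    mcp_tools: list[str] = field(default_factory=list)


class HarnessAdapter(ABC):
    """Harness-specific lifecycle management; method names are assumptions."""

    def __init__(self, config: HarnessConfig) -> None:
        self.config = config

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def inject_tools(self, tool_names: list[str]) -> None: ...

    @abstractmethod
    def send_message(self, message: str) -> str: ...

    @abstractmethod
    def stop(self) -> None: ...


class EchoAdapter(HarnessAdapter):
    """Hypothetical adapter used only to exercise the interface."""

    def __init__(self, config: HarnessConfig) -> None:
        super().__init__(config)
        self.running = False

    def start(self) -> None:
        self.running = True

    def inject_tools(self, tool_names: list[str]) -> None:
        # Mirrors the "before session start" rule for MCP tool injection.
        if self.running:
            raise RuntimeError("tools must be injected before session start")
        self.config.mcp_tools.extend(tool_names)

    def send_message(self, message: str) -> str:
        if not self.running:
            raise RuntimeError("harness is not running")
        return f"[{self.config.name}] {message}"

    def stop(self) -> None:
        self.running = False


config = HarnessConfig(name="demo", command=["demo-harness"])
adapter = EchoAdapter(config)
adapter.inject_tools(["make_move"])
adapter.start()
reply = adapter.send_message("hello")
adapter.stop()
```

Splitting configuration (plain data) from the adapter (lifecycle) would let each harness integration supply its own adapter while sharing one config shape.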