feat QED Math Environment with MCP tools and LLM-judge rubric #446
rycerzes wants to merge 30 commits into
This PR implements the env properly with rubrics and MCP, both tested.
Merge order:
- #456 must be merged into `main` first; it fixes bug #455 in the current `main`.
- This branch can then be rebased onto `main` and merged (it has already cherry-picked the patch from #456).
The GRPO post training code would be a different PR.
@burtenshaw
Greptile Summary
This PR adds the QED Math Environment, a mathematical proof generation and evaluation env that uses MCP tools as the sole agent API and an LLM-judge rubric (0–7 scale) for grading. It is a well-structured reference implementation. Several concerns raised in previous review rounds have been addressed.
Confidence Score: 4/5
Two P1 issues should be resolved before merging: the `reference_solution` leak in `reset()` and agent-controllable reward shaping via `output_length_tokens`.
The environment is functionally solid and prior review concerns have largely been addressed. The two remaining P1 issues directly affect training validity: the `reset()` `reference_solution` leak undermines proof-mode isolation, and the agent-supplied `output_length_tokens` in the `submit_proof` tool breaks reward integrity by letting the agent opt out of shaping.
`envs/qed_math_env/server/qed_math_environment.py` (`reset()` `reference_solution` leak, dead `_verify_math` method) and `envs/qed_math_env/server/mcp_server.py` (`output_length_tokens` in `submit_proof` tool).
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Agent
participant QEDMathEnv as QEDMathEnv (client)
participant Server as QEDMathEnvironment (server)
participant MathVerifier as MathVerifierService
participant LLMJudge as MathProofRubric (LLM)
Agent->>QEDMathEnv: reset(problem_id?)
QEDMathEnv->>Server: POST /reset
Server-->>QEDMathEnv: ProblemObservation (problem, guidelines, [ref_solution if answer-mode])
QEDMathEnv-->>Agent: ProblemObservation
Agent->>QEDMathEnv: call_tool(get_problem)
QEDMathEnv->>Server: CallToolAction get_problem()
Server-->>QEDMathEnv: payload (problem, guidelines, [ref_solution if answer-mode])
QEDMathEnv-->>Agent: ProblemObservation
Agent->>QEDMathEnv: call_tool(submit_proof, proof=...)
QEDMathEnv->>Server: CallToolAction submit_proof(proof)
alt evaluation_mode == answer
Server->>MathVerifier: verify_answer(prediction, gold)
MathVerifier-->>Server: VerifyResponse (correct/wrong)
else evaluation_mode == proof
Server->>LLMJudge: grade(proof, problem, ref_solution, guidelines)
LLMJudge-->>Server: GradingResult (score 0-7, reward 0-1)
end
Server->>Server: _apply_reward_shaping(reward, output_length_tokens)
Server-->>QEDMathEnv: ProofSubmissionObservation (score, reward, done, feedback)
QEDMathEnv-->>Agent: ProofSubmissionObservation
Prompt To Fix All With AI
This is a comment left during a code review.
Path: envs/qed_math_env/server/qed_math_environment.py
Line: 848-860
Comment:
**`reset()` leaks `reference_solution` for proof-mode problems**
`get_problem_payload()` (line 1063) was correctly fixed to gate `reference_solution` on `evaluation_mode == "answer"`. However, the `reset()` method still returns a `ProblemObservation` with `reference_solution` unconditionally — for every problem type including `proof` and `multi_step`.
In the `QEDMathEnv` client (`client.py`), the result of `reset()` is normalized into a `ProblemObservation` and returned to the caller. In the inference example (`examples/qed_math_inference.py`) the agent drives the episode from the reset observation, so `reference_solution` is visible at episode start even without calling `get_problem`.
Apply the same guard used in `get_problem_payload`:
```suggestion
        return ProblemObservation(
            problem=self._current_problem.get("problem", ""),
            reference_solution=(
                self._current_problem.get("reference_solution", "")
                if self._current_problem.get("evaluation_mode", "proof") == "answer"
                else ""
            ),
            grading_guidelines=parse_schema(
                self._current_problem.get("grading_guidelines", "") or ""
            ),
```
**ALIGNMENT FLAG**: Exposes the scoring key to the agent via the reset path for `proof`-type problems, undermining training validity.
- **Principle at stake**: Agent isolation invariant — agents must not access simulation controls or grading keys outside answer-mode.
- **The concern**: `get_problem_payload()` was guarded, but the `reset()` HTTP response is also agent-accessible and contains the same leak.
- **Suggested reviewer**: `@darktex`
How can I resolve this? If you propose a fix, please make it concise.
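One way to keep the two code paths from drifting apart again is to route both through a single helper. The sketch below is only an illustration: `visible_reference_solution` is a hypothetical name, and the problem record is assumed to be a plain dict with the `evaluation_mode` and `reference_solution` keys used in the suggestion above.

```python
def visible_reference_solution(problem: dict) -> str:
    """Return the reference solution only for answer-mode problems.

    Hypothetical helper: both reset() and get_problem_payload() could call
    this so the gating rule lives in exactly one place.
    """
    # A missing evaluation_mode is treated as "proof", matching the suggestion.
    if problem.get("evaluation_mode", "proof") == "answer":
        return problem.get("reference_solution", "")
    return ""
```

With a helper like this, a regression test only needs to assert that every agent-facing payload for a `proof` or `multi_step` problem carries an empty `reference_solution`.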
---
This is a comment left during a code review.
Path: envs/qed_math_env/server/mcp_server.py
Line: 46-55
Comment:
**`output_length_tokens` is agent-controllable, undermining reward shaping**
The `submit_proof` MCP tool exposes `output_length_tokens` as an agent-supplied parameter. This creates a reward integrity problem:
1. An adversarial agent can pass the default value of `0` to bypass the discount factor and length penalty entirely, receiving the unmodified base reward regardless of actual generation length.
2. Even well-intentioned agents will omit this parameter in most integrations, silently disabling shaping when a non-trivial `discount_factor` or `buffer_tokens` is configured.
The token count of the agent's generation should be determined server-side (e.g., passed by the training harness that dispatches the HTTP step action, or measured on the proof text itself) rather than supplied by the agent through the MCP tool interface.
**ALIGNMENT FLAG**: The reward-inside-environment principle is violated because the agent can choose its own reward-shaping inputs.
- **Principle at stake**: Reward integrity — the environment is the sole authority on reward values.
- **The concern**: Discount factor and length penalty become optional opt-ins from the agent's perspective.
- **Suggested reviewer**: `@darktex`
How can I resolve this? If you propose a fix, please make it concise.
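A minimal sketch of server-side shaping, for illustration only. It assumes the discount is `discount_factor ** tokens` and that the length penalty ramps linearly across a buffer zone ending at a `max_tokens` limit. The function names, the `max_tokens` parameter, the whitespace token counter, and the penalty shape are all assumptions, not the PR's implementation.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: whitespace-delimited tokens."""
    return len(text.split())


def shape_reward(
    base_reward: float,
    proof_text: str,
    discount_factor: float = 1.0,
    max_tokens: int = 1000,
    buffer_tokens: int = 0,
) -> float:
    """Apply discount and length penalty using a server-measured token count."""
    # The count is measured here, never read from an agent-supplied argument.
    n_tokens = count_tokens(proof_text)
    reward = base_reward * (discount_factor ** n_tokens)
    # Assumed penalty shape: zero until the buffer zone begins, then a linear
    # ramp that reaches 1.0 at max_tokens.
    if buffer_tokens > 0:
        overflow = n_tokens - (max_tokens - buffer_tokens)
        if overflow > 0:
            reward -= min(overflow / buffer_tokens, 1.0)
    return max(reward, 0.0)
```

Because `shape_reward` takes the proof text rather than a count, the agent has no parameter through which to opt out of shaping.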
---
This is a comment left during a code review.
Path: envs/qed_math_env/server/qed_math_environment.py
Line: 1070-1115
Comment:
**`_verify_math` static method is dead code**
`QEDMathEnvironment._verify_math` implements answer verification via `math_verify` with a signal-based timeout, but it is never called in any code path. Answer-mode grading routes through `_grade_answer_submission` → `_verifier_service.verify_answer()` (the process-pool based service), which is the correct async-safe implementation.
This method is also not imported or exercised in the test suite. Consider removing it to avoid the maintenance confusion of two parallel verification implementations — especially since a previous review comment already flagged that the signal-based timeout (`signal.SIGALRM`) is unsafe in async contexts.
How can I resolve this? If you propose a fix, please make it concise.
Reviews (2): Last reviewed commit: "Merge branch 'main' into feat/qed-math-e..."
- add `QEDMathAction` and `QEDMathObservation` dataclasses in `models.py`
- implement `QEDMathEnv` client with `reset`, `step`, `get_problem`, `submit_proof`
- main env class
- mcp server tools
- step & reset logic
- impl `MathProofRubric`
- LLM Grading Logic
- rubric config in env
- map problem data structure
- support multiple types of problems
- wss handling
- client methods
- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping.
- Introduced `remove_reasoning` function to strip reasoning traces from model output.
- Added `length_penalty` function to compute penalties for overlong sequences.
- Adjusted grading logic to apply discount factors and penalties based on token count.
- implement metrics aggregation
- fix dockerfile
- refer to huggingface#456
d810587 to f54cc91
…solution handling - update tests for answer mode
- integrate with QED Math environment
- Fix MCP client parsing bug where reset/step observations were coerced into base `Observation`, dropping env-specific fields
- tests
rycerzes left a comment
Moved answer-mode verification into a process-isolated worker pool (math_verify_service.py) since Math-Verify's signal-based timeouts are unsafe in threaded contexts. The service uses a bounded ProcessPoolExecutor with heartbeat monitoring, dead-worker restart, request-ID ownership to discard stale responses, and deterministic status-to-reward mapping so infra failures score 0 without crashing rollouts. Gold targets are pre-parsed and cached by problem_id. Proof-mode grading and the client API are unchanged.
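The request-ID ownership and status-to-reward mapping described above can be sketched roughly as follows. This is a simplified illustration, not the code in `math_verify_service.py`: the `VerifyStatus` members, the `PendingRequests` class, and the fields on `VerifyResponse` are hypothetical, and the process pool and heartbeat monitoring are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class VerifyStatus(Enum):
    CORRECT = "correct"
    WRONG = "wrong"
    TIMEOUT = "timeout"
    WORKER_DIED = "worker_died"


# Only a verified-correct answer earns reward; every failure mode maps to 0,
# so infrastructure problems never raise into the rollout loop.
STATUS_TO_REWARD = {
    VerifyStatus.CORRECT: 1.0,
    VerifyStatus.WRONG: 0.0,
    VerifyStatus.TIMEOUT: 0.0,
    VerifyStatus.WORKER_DIED: 0.0,
}


@dataclass
class VerifyResponse:
    request_id: str
    status: VerifyStatus


class PendingRequests:
    """Tracks which request IDs are still owned by a waiting caller."""

    def __init__(self) -> None:
        self._owned: Set[str] = set()

    def register(self, request_id: str) -> None:
        self._owned.add(request_id)

    def abandon(self, request_id: str) -> None:
        # Called when the caller stops waiting, e.g. after its own timeout.
        self._owned.discard(request_id)

    def resolve(self, response: VerifyResponse) -> Optional[float]:
        """Return the reward for an owned request, or None for a stale reply."""
        if response.request_id not in self._owned:
            return None  # late or duplicate reply: discard it
        self._owned.discard(response.request_id)
        return STATUS_TO_REWARD[response.status]
```

Keeping the mapping in a single table makes the reward for each failure mode easy to audit and to test.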
Fixed
- step count on async path
- reset randomness bug
- reference solution leakage
Thanks @rycerzes. I'll set off the auto review, then try it out tomorrow.
fixed the leak and reward integrity
@burtenshaw if you could PTAL :)
Darktex left a comment
Note: This is an automated review by Claude Code, not a human review.
Tier 1 — Bugs / Lint
- `envs/qed_math_env/server/qed_math_environment.py` + `math_verify_service.py`: duplicate `UnparsableException` / `NoAnswerException` / `EmptyBoxedException` classes with different base classes (`VerificationError` vs `Exception`). The copy in `qed_math_environment.py` is only reachable from the static, never-called `_verify_math`. Consolidate or remove the duplicates.
- `qed_math_environment.py::_verify_math` is dead code: `@staticmethod`, never invoked outside tests. Either expose as a public utility (and document) or remove and refactor the tests to exercise `_grade_answer_submission` with a mocked verifier.
- `envs/qed_math_env/server/rubric.py`: `MathProofRubric.__init__` defaults `grader_model="gemini-2.0-flash"`, but `QEDMathConfig` and `openenv.yaml` both default to `gemini-3-pro`. Direct instantiation of `MathProofRubric` silently uses a different model. Align the defaults or add an explicit comment.
- `envs/qed_math_env/prompts/evaluator_prompts/v2.md`: template exposes `{problem}`, `{marking_scheme}`, `{solution}` but no `{human_solution}` slot, while `MathProofRubric._build_prompt` passes `human_solution=reference_solution` to `.format()`, which silently ignores it. Likely intentional for the IMO rubric; add a one-line comment so it doesn't read as a forgotten placeholder.
- `envs/qed_math_env/pyproject.toml`: missing trailing comma after `"trackio>=0.19.0"`. Valid TOML, but inconsistent with all surrounding entries.
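The `.format()` behavior behind the `v2.md` item is easy to reproduce in isolation: `str.format` accepts keyword arguments that the template never references and simply drops them. The template text below is a made-up stand-in, not the contents of `v2.md`.

```python
# Hypothetical stand-in for the evaluator template: no {human_solution} slot.
TEMPLATE = "Problem: {problem}\nScheme: {marking_scheme}\nSolution: {solution}"

prompt = TEMPLATE.format(
    problem="Prove that the sum of two even integers is even.",
    marking_scheme="7 points for a complete proof.",
    solution="Write the integers as 2a and 2b; their sum is 2(a + b).",
    human_solution="REFERENCE TEXT",  # accepted by str.format, never rendered
)

# The extra keyword raises no error and leaves no trace in the output.
assert "REFERENCE TEXT" not in prompt
```

So if the reference solution were ever meant to reach the judge, the omission would not surface as an exception, which is why a comment in the template is worth adding.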
Tier 2 — Alignment (questions for human reviewers)
- Widening `GenericMCPObservation` in shared core: `src/openenv/core/mcp_client.py` switches to `extra="allow"`. This weakens the wire-level contract for every MCP environment, not just QED Math. Should this instead be a per-env override of `_parse_result` in `qed_math_env/client.py` (the typed-client pattern in `PATTERNS.md`)? The `_as_problem_observation` helpers already in `client.py` suggest a typed override was considered.
- `output_length_tokens` injection via `step_async` kwargs: read from `kwargs` by the training harness and consumed inside `submit_proof_payload` to drive discount-factor reward shaping. This is an out-of-band channel that bypasses the typed harness interface. If the WebSocket `step` path doesn't enforce session-level authorization, an agent-side caller could inject this value to manipulate reward shaping. Is there an existing threat-model doc, or does this warrant one?
- RFC requirement: The PR introduces a non-trivial new grading architecture (dual-mode proof/answer paths, process-pool verifier, LLM judge with retries) and touches shared core (`mcp_client.py`). Per project policy, architectural changes warrant an RFC. Is `qed_math_env` exempt because it's an `envs/` addition, or should a short RFC document the design before this lands?
Overall
High-quality, well-documented port. The mechanical Tier-1 items are small but concrete; the core-layer GenericMCPObservation widening and the output_length_tokens channel are the two items I'd want a human sign-off on before merge.
Automated review by Claude Code
Summary
Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. This environment deeply integrates both MCP tools (as the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the upcoming MCP + Rubrics features.
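As a rough illustration of the 0–7 grading with normalized rewards, the sketch below parses a `<score>N</score>` tag and maps it to a reward in [0, 1]. It is not the `MathProofRubric` code: the function names are hypothetical, and taking the last tag, rejecting out-of-range scores, and dividing linearly by 7 are all assumptions.

```python
import re
from typing import Optional

MAX_SCORE = 7
_SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")


def parse_score(judge_output: str) -> Optional[int]:
    """Extract the last <score>N</score> tag; None if absent or out of range."""
    matches = _SCORE_RE.findall(judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 0 <= score <= MAX_SCORE else None


def score_to_reward(score: int, threshold: bool = False) -> float:
    """Normalize a 0-7 score to [0, 1], optionally collapsing 1-5 to 1 first."""
    if threshold and 1 <= score <= 5:
        score = 1  # assumed reading of "collapses 1-5 to 1"
    return score / MAX_SCORE
```

Returning `None` for an unparsable judge reply leaves the retry decision to the caller instead of silently scoring the proof as 0.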
What's included
Environment (qed_math_env)
- `QEDMathEnvironment` - extends `MCPEnvironment`; manages problem lifecycle, dataset loading, and proof submission
- `MathProofRubric` - extends `openenv.core.rubrics.base.Rubric`; LLM-judge grading via OpenAI-compatible endpoints with `<score>N</score>` parsing, retry logic, and optional score thresholding
- `QEDMathEnv` client - extends `MCPToolClient` with typed `ProblemObservation` / `ProofSubmissionObservation` models
- `get_problem`, `submit_proof`, `get_grading_guidelines`
Key features
- `math_verify`-based `\boxed{}` checking (no LLM call needed)
- … γ^tokens), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
- … `</think>`) removed before grading
- … `verifier/rollouts/success`, `verifier/failures/*`, latency, token counts) in observation metadata
- … `prompts/evaluator_prompts/v2.md`)
Testing
Test file (`test_qed_math_environment.py`) covers:
- … `ListToolsAction`, `CallToolAction` for all 3 tools
- … `MathProofRubric.grade()`, score→reward normalization
- … `\boxed{}` via `math_verify`
- … `</think>` delimiter handling, fallback behavior
- … `\boxed{}` wrapping
- … `GradingResult` and submission payload
- … `@pytest.mark.integration`)
TODO before review & merge
- … `examples/qed_math_inference.py`)
Env Inference example with logs
Output
CC: @burtenshaw