feat QED Math Environment with MCP tools and LLM-judge rubric by rycerzes · Pull Request #446 · huggingface/OpenEnv

rycerzes · 2026-03-17T09:07:15Z

Summary

Adds the QED Math Environment, a mathematical proof generation and evaluation env ported from QED-Nano. This environment deeply integrates both MCP tools (as the sole agent API) and LLM-judge rubrics (structured 0–7 grading with normalized rewards), making it a reference implementation for the upcoming MCP + Rubrics features.

What's included

Environment (qed_math_env)

QEDMathEnvironment - extends MCPEnvironment; manages problem lifecycle, dataset loading, and proof submission
MathProofRubric - extends openenv.core.rubrics.base.Rubric; LLM-judge grading via OpenAI-compatible endpoints with <score>N</score> parsing, retry logic, and optional score thresholding
QEDMathEnv client - extends MCPToolClient with typed ProblemObservation / ProofSubmissionObservation models
3 MCP tools: get_problem, submit_proof, get_grading_guidelines

Key features

Flexible dataset loading: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
Answer-mode verification: math_verify-based \boxed{} checking (no LLM call needed)
Reward shaping: discount factor (γ^tokens), length penalty (buffer zone), optional score thresholding (collapses 1–5 → 1)
Reasoning stripping: configurable delimiters (e.g. </think>) removed before grading
Multi-step problems: configurable max attempts with per-attempt feedback
Verifier metrics: QED-Nano-compatible metrics (verifier/rollouts/success, verifier/failures/*, latency, token counts) in observation metadata
Evaluator prompts: template-based (prompts/evaluator_prompts/v2.md)

Testing

Test file (test_qed_math_environment.py) covers:

Area	Tests
MCP tools	`ListToolsAction`, `CallToolAction` for all 3 tools
Rubric grading	Mocked `MathProofRubric.grade()`, score→reward normalization
Answer verification	Correct/wrong/missing `\boxed{}` via `math_verify`
Reward shaping	Discount factor, length penalty, score thresholding
Reasoning stripping	`</think>` delimiter handling, fallback behavior
Dataset loading	Local JSONL, problem ID selection, answer-mode `\boxed{}` wrapping
Multi-step problems	Retry logic with attempt tracking
Original problem field	RC stream compatibility
Verifier metrics	Metrics on `GradingResult` and submission payload
Server integration	Full HTTP/WebSocket stack (marked `@pytest.mark.integration`)

PYTHONPATH=src:envs uv run pytest tests/envs/test_qed_math_environment.py -v

TODO before review & merge

Commit test_qed_math_environment.py
Add inference example script (examples/qed_math_inference.py)
Consider adding GRPO training example (separate PR)

Env Inference example with logs

PYTHONPATH=src:envs uv run python examples/qed_math_inference.py

Output

QED Math server: http://localhost:8000
API:   https://api.groq.com/openai/v1
Model: glm-5
Tools: ['get_problem', 'submit_proof', 'get_grading_guidelines']
Episodes: 3
============================================================
Episode 1  |  problem_id=(random)  |  type=proof
============================================================
--- Step 1 ---
Tool: get_problem({})
Result: {"problem": "Prove that the sum of two even integers is even.", ...
--- Step 2 ---
Tool: get_grading_guidelines({})
Result: {"grading_guidelines": "Award full credit for a correct parity argument.", ...
--- Step 3 ---
Tool: submit_proof({"proof": "**Proof.**\n\nAn integer is *even* if and only if it can be expressed as $2k$...
Result: {"done": true, "reward": 1.0, ...
Outcome: CORRECT  score=7/7  reward=1.000

CC: @burtenshaw

rycerzes

This PR implements the env properly with rubrics and MCP, both tested.

Merge order:

#456 must be merged into main first - it fixes bug #455 in the current main
This branch can then be rebased onto main and merged (it has already cherry-picked the patch from #456

The GRPO post training code would be a different PR.
@burtenshaw

greptile-apps · 2026-03-26T11:10:16Z

Greptile Summary

This PR adds the QED Math Environment — a mathematical proof generation and evaluation env that uses MCP tools as the sole agent API and an LLM-judge rubric (0–7 scale) for grading. It is a well-structured reference implementation that introduces MathVerifierService (process-pool based, async-safe), MathProofRubric, flexible dataset loading, and a comprehensive test suite.

Several concerns raised in previous review rounds have been addressed: step_count is now incremented in step_async, unseeded reset() now calls random.choice, get_problem_payload() gates reference_solution to answer-mode only, and the signal-based timeout issue is resolved by moving answer verification into worker processes. Two new issues remain:

reset() still leaks reference_solution for proof-mode problems (qed_math_environment.py:850): the guard applied to get_problem_payload() was not mirrored in the reset() return path. Since QEDMathEnv.reset() normalizes and returns this to the caller, proof-mode agents can see the grading key at episode start.
output_length_tokens is agent-controllable in the submit_proof MCP tool (mcp_server.py:46): the discount factor and length penalty are applied using a token count supplied by the agent. An agent that omits or sets this to zero bypasses all reward shaping, undermining the integrity of the shaped reward signal.

Additionally, the _verify_math static method on QEDMathEnvironment is dead code — the live grading path goes through MathVerifierService — and should be removed to avoid confusion with the signal-unsafe pattern it embodies.

Confidence Score: 4/5

Two P1 issues should be resolved before merging: reference_solution leak in reset() and agent-controllable reward shaping via output_length_tokens.

The environment is functionally solid and prior review concerns have largely been addressed. The two remaining P1 issues directly affect training validity: the reset() reference_solution leak undermines proof-mode isolation, and the agent-supplied output_length_tokens in the submit_proof tool breaks reward integrity by letting the agent opt out of shaping.

envs/qed_math_env/server/qed_math_environment.py (reset() reference_solution leak, dead _verify_math method) and envs/qed_math_env/server/mcp_server.py (output_length_tokens in submit_proof tool).

Important Files Changed

Filename	Overview
envs/qed_math_env/server/qed_math_environment.py	Core environment implementation — two P1 issues: reset() leaks reference_solution for proof-mode (fix applied to get_problem_payload was not mirrored here), and dead _verify_math static method shadowing the correct process-based verifier path.
envs/qed_math_env/server/mcp_server.py	MCP tool registration — P1 alignment issue: submit_proof exposes output_length_tokens as an agent-controllable parameter, allowing agents to bypass or manipulate reward shaping.
envs/qed_math_env/server/rubric.py	LLM judge rubric with 0-7 scoring, retry logic, and verifier metrics — well structured; broad except Exception in _call_llm was flagged in a prior thread and remains.
envs/qed_math_env/server/math_verify_service.py	Process-pool based answer verifier — correctly isolates signal-based timeouts to worker processes, includes lazy start, retry, pool restart, and metrics; no issues found.
envs/qed_math_env/client.py	Typed client wrapper over MCPToolClient — normalizes observations into typed models; no issues beyond inheriting the reference_solution leak from reset().
src/openenv/core/mcp_client.py	Adds GenericMCPObservation with extra=allow to pass through env-specific fields when no typed observation class matches; small whitespace fix in docstring.
tests/envs/test_qed_math_environment.py	Comprehensive test suite covering rubric grading, answer verification, reward shaping, dataset loading, and multi-step logic; autouse fixture patches OpenAI client cleanly.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant QEDMathEnv as QEDMathEnv (client)
    participant Server as QEDMathEnvironment (server)
    participant MathVerifier as MathVerifierService
    participant LLMJudge as MathProofRubric (LLM)

    Agent->>QEDMathEnv: reset(problem_id?)
    QEDMathEnv->>Server: POST /reset
    Server-->>QEDMathEnv: ProblemObservation (problem, guidelines, [ref_solution if answer-mode])
    QEDMathEnv-->>Agent: ProblemObservation

    Agent->>QEDMathEnv: call_tool(get_problem)
    QEDMathEnv->>Server: CallToolAction get_problem()
    Server-->>QEDMathEnv: payload (problem, guidelines, [ref_solution if answer-mode])
    QEDMathEnv-->>Agent: ProblemObservation

    Agent->>QEDMathEnv: call_tool(submit_proof, proof=...)
    QEDMathEnv->>Server: CallToolAction submit_proof(proof)

    alt evaluation_mode == answer
        Server->>MathVerifier: verify_answer(prediction, gold)
        MathVerifier-->>Server: VerifyResponse (correct/wrong)
    else evaluation_mode == proof
        Server->>LLMJudge: grade(proof, problem, ref_solution, guidelines)
        LLMJudge-->>Server: GradingResult (score 0-7, reward 0-1)
    end

    Server->>Server: _apply_reward_shaping(reward, output_length_tokens)
    Server-->>QEDMathEnv: ProofSubmissionObservation (score, reward, done, feedback)
    QEDMathEnv-->>Agent: ProofSubmissionObservation

Prompt To Fix All With AI

This is a comment left during a code review.
Path: envs/qed_math_env/server/qed_math_environment.py
Line: 848-860

Comment:
**`reset()` leaks `reference_solution` for proof-mode problems**

`get_problem_payload()` (line 1063) was correctly fixed to gate `reference_solution` on `evaluation_mode == "answer"`. However, the `reset()` method still returns a `ProblemObservation` with `reference_solution` unconditionally — for every problem type including `proof` and `multi_step`.

In the `QEDMathEnv` client (`client.py`), the result of `reset()` is normalized into a `ProblemObservation` and returned to the caller. In the inference example (`examples/qed_math_inference.py`) the agent drives the episode from the reset observation, so `reference_solution` is visible at episode start even without calling `get_problem`.

Apply the same guard used in `get_problem_payload`:

```suggestion
        return ProblemObservation(
            problem=self._current_problem.get("problem", ""),
            reference_solution=(
                self._current_problem.get("reference_solution", "")
                if self._current_problem.get("evaluation_mode", "proof") == "answer"
                else ""
            ),
            grading_guidelines=parse_schema(
                self._current_problem.get("grading_guidelines", "") or ""
            ),
```

**ALIGNMENT FLAG**: Exposes the scoring key to the agent via the reset path for `proof`-type problems, undermining training validity.
- **Principle at stake**: Agent isolation invariant — agents must not access simulation controls or grading keys outside answer-mode.
- **The concern**: `get_problem_payload()` was guarded, but the `reset()` HTTP response is also agent-accessible and contains the same leak.
- **Suggested reviewer**: `@darktex`

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/qed_math_env/server/mcp_server.py
Line: 46-55

Comment:
**`output_length_tokens` is agent-controllable, undermining reward shaping**

The `submit_proof` MCP tool exposes `output_length_tokens` as an agent-supplied parameter. This creates a reward integrity problem:

1. An adversarial agent can pass the default value of `0` to bypass the discount factor and length penalty entirely, receiving the unmodified base reward regardless of actual generation length.
2. Even well-intentioned agents will omit this parameter in most integrations, silently disabling shaping when a non-trivial `discount_factor` or `buffer_tokens` is configured.

The token count of the agent's generation should be determined server-side (e.g., passed by the training harness that dispatches the HTTP step action, or measured on the proof text itself) rather than supplied by the agent through the MCP tool interface.

**ALIGNMENT FLAG**: The reward-inside-environment principle is violated because the agent can choose its own reward-shaping inputs.
- **Principle at stake**: Reward integrity — the environment is the sole authority on reward values.
- **The concern**: Discount factor and length penalty become optional opt-ins from the agent's perspective.
- **Suggested reviewer**: `@darktex`

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: envs/qed_math_env/server/qed_math_environment.py
Line: 1070-1115

Comment:
**`_verify_math` static method is dead code**

`QEDMathEnvironment._verify_math` implements answer verification via `math_verify` with a signal-based timeout, but it is never called in any code path. Answer-mode grading routes through `_grade_answer_submission` → `_verifier_service.verify_answer()` (the process-pool based service), which is the correct async-safe implementation.

This method is also not imported or exercised in the test suite. Consider removing it to avoid the maintenance confusion of two parallel verification implementations — especially since a previous review comment already flagged that the signal-based timeout (`signal.SIGALRM`) is unsafe in async contexts.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (2): Last reviewed commit: "Merge branch 'main' into feat/qed-math-e..." | Re-trigger Greptile}

burtenshaw · 2026-03-26T12:30:43Z

@rycerzes nice work. #456 is now merged to main.

Could you please address the greplit review. Mainly:

step count on async path
reset randomness bug
reference solution leakage

- add QEDMathAction and QEDMathObservation dataclasses in models.py - implement QEDMathEnv client with reset, step, get_problem, submit_proof

- main env class - mcp server tools - step & reset logic

- impl MathProofRubric - LLM Grading Logic - rubric config in env

- map problem data structure - support multiple types of problems

- wss handling - client methods

- `submit_proof` method accepts an optional `output_length_tokens` parameter for reward shaping. - Introduced `remove_reasoning` function to strip reasoning traces from model output. - Added `length_penalty` function to compute penalties for overlong sequences. - Adjusted grading logic to apply discount factors and penalties based on token count.

- implement metrics aggregation

dockerfile

- fix dockerfile

- refer to huggingface#456

…solution handling - update tests for answer mode

- integrate with QED Math environment

- tests

- Fix MCP client parsing bug where reset/step observations were coerced into base Observation, dropping env-specific fields - tests

rycerzes

Moved answer-mode verification into a process-isolated worker pool (math_verify_service.py) since Math-Verify's signal-based timeouts are unsafe in threaded contexts. The service uses a bounded ProcessPoolExecutor with heartbeat monitoring, dead-worker restart, request-ID ownership to discard stale responses, and deterministic status-to-reward mapping so infra failures score 0 without crashing rollouts. Gold targets are pre-parsed and cached by problem_id. Proof-mode grading and the client API are unchanged.

Fixed

step count on async path
reset randomness bug
reference solution leakage

@burtenshaw

burtenshaw · 2026-03-29T18:42:21Z

Thanks @rycerzes . I'll set of the auto review then try it out tomorrow.

@greptile-apps

…ated methods

rycerzes · 2026-03-30T05:07:49Z

fixed the leak and reward integrity

rycerzes · 2026-04-11T15:20:58Z

@burtenshaw if you could PTAL :)

from huggingface#446

Darktex

Note: This is an automated review by Claude Code, not a human review.

Tier 1 — Bugs / Lint

envs/qed_math_env/server/qed_math_environment.py + math_verify_service.py — duplicate UnparsableException / NoAnswerException / EmptyBoxedException classes with different base classes (VerificationError vs Exception). The copy in qed_math_environment.py is only reachable from the static, never-called _verify_math. Consolidate or remove the duplicates.
qed_math_environment.py::_verify_math is dead code — @staticmethod, never invoked outside tests. Either expose as a public utility (and document) or remove and refactor the tests to exercise _grade_answer_submission with a mocked verifier.
envs/qed_math_env/server/rubric.py — MathProofRubric.__init__ defaults grader_model="gemini-2.0-flash", but QEDMathConfig and openenv.yaml both default to gemini-3-pro. Direct instantiation of MathProofRubric silently uses a different model. Align the defaults or add an explicit comment.
envs/qed_math_env/prompts/evaluator_prompts/v2.md — template exposes {problem}, {marking_scheme}, {solution} but no {human_solution} slot, while MathProofRubric._build_prompt passes human_solution=reference_solution to .format(), which silently ignores it. Likely intentional for the IMO rubric — add a one-line comment so it doesn't read as a forgotten placeholder.
envs/qed_math_env/pyproject.toml — missing trailing comma after "trackio>=0.19.0". Valid TOML, but inconsistent with all surrounding entries.

Tier 2 — Alignment (questions for human reviewers)

Widening GenericMCPObservation in shared core: src/openenv/core/mcp_client.py switches to extra="allow". This weakens the wire-level contract for every MCP environment, not just QED Math. Should this instead be a per-env override of _parse_result in qed_math_env/client.py (the typed-client pattern in PATTERNS.md)? The _as_problem_observation helpers already in client.py suggest a typed override was considered.
output_length_tokens injection via step_async kwargs: read from kwargs by the training harness and consumed inside submit_proof_payload to drive discount-factor reward shaping. This is an out-of-band channel that bypasses the typed harness interface. If the WebSocket step path doesn't enforce session-level authorization, an agent-side caller could inject this value to manipulate reward shaping. Is there an existing threat-model doc, or does this warrant one?
RFC requirement: The PR introduces a non-trivial new grading architecture (dual-mode proof/answer paths, process-pool verifier, LLM judge with retries) and touches shared core (mcp_client.py). Per project policy, architectural changes warrant an RFC. Is qed_math_env exempt because it's an envs/ addition, or should a short RFC document the design before this lands?

Overall

High-quality, well-documented port. The mechanical Tier-1 items are small but concrete; the core-layer GenericMCPObservation widening and the output_length_tokens channel are the two items I'd want a human sign-off on before merge.

Automated review by Claude Code | Learn more

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 17, 2026

rycerzes commented Mar 26, 2026

View reviewed changes

rycerzes marked this pull request as ready for review March 26, 2026 11:03

greptile-apps Bot reviewed Mar 26, 2026

View reviewed changes

rycerzes added 17 commits March 28, 2026 21:53

initial qed-math OpenEnv scaffold with models and client

c3a8f64

- add QEDMathAction and QEDMathObservation dataclasses in models.py - implement QEDMathEnv client with reset, step, get_problem, submit_proof

implement MCP server tools and QEDMath environment

016aa35

- main env class - mcp server tools - step & reset logic

rubric implementation

7d725b5

- impl MathProofRubric - LLM Grading Logic - rubric config in env

data pipeline integration

e01a3f0

- map problem data structure - support multiple types of problems

grading prompts

efa2544

API & Client

bc95317

- wss handling - client methods

fix deps

8d91ff8

fix eval prompt

31ba418

original_problem field for grading in QED-Nano RC stream

8caae19

add verifier metrics to grading results

74be02a

- implement metrics aggregation

docs

5305d42

dockerfile

tests for QED Math Env

477024f

add QED Math inference example

244b5c0

- fix dockerfile

improve async step handling

bdb5678

- refer to huggingface#456

increase timeout for LLM judge calls and adjust WebSocket ping settings

d6b0c15

add feedback output for incorrect answers in QED Math inference

f54cc91

rycerzes force-pushed the feat/qed-math-env branch from d810587 to f54cc91 Compare March 28, 2026 16:23

rycerzes added 6 commits March 28, 2026 22:45

refactor get_problem method to include metadata and adjust reference …

aeebd21

…solution handling - update tests for answer mode

refactor math expression parsing and simplify async calls in tests

a41919f

impl process-based math verification service

f69b83c

- integrate with QED Math environment

gold answer caching and request admission control

83811d4

- tests

worker pool restart and health probe reporting

8d99084

preserve custom observation fields during parsing

e13f22a

- Fix MCP client parsing bug where reset/step observations were coerced into base Observation, dropping env-specific fields - tests

rycerzes added 3 commits March 29, 2026 20:46

metrics tracking for MathVerifierService and QEDMathEnvironment

60c171a

update env docs

8f86fc8

Merge branch 'main' into feat/qed-math-env

3f822b3

rycerzes commented Mar 29, 2026

View reviewed changes

greptile-apps Bot reviewed Mar 29, 2026

View reviewed changes

Comment thread envs/qed_math_env/server/qed_math_environment.py

Comment thread envs/qed_math_env/server/mcp_server.py Outdated

refactor: remove output_length_tokens param from submit_proof and rel…

389980d

…ated methods

rycerzes added a commit to 3xcaffeine/qed-math-openenv that referenced this pull request Apr 2, 2026

tests and inference script from huggingface/OpenEnv#446

d4507be

rycerzes added a commit to 3xcaffeine/qed-math-openenv that referenced this pull request Apr 2, 2026

tests and inference script from huggingface/OpenEnv#446

cfda707

rycerzes pushed a commit to 3xcaffeine/qed-math-openenv that referenced this pull request Apr 2, 2026

tests and inference script from huggingface/OpenEnv#446

ea0063b

Merge branch 'main' into feat/qed-math-env

e35b91e

rycerzes mentioned this pull request Apr 9, 2026

Help with MCP Environment #491

Open

rycerzes added a commit to rycerzes/OpenEnv that referenced this pull request Apr 22, 2026

fix: preserve custom observation fields during parsing

ed144af

from huggingface#446

Saadhana B Venkataraman and others added 2 commits May 11, 2026 21:06

chore(lint): usort + ruff format

7e3f9c6

Merge branch 'main' into feat/qed-math-env

fe3d8de

Darktex suggested changes May 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat QED Math Environment with MCP tools and LLM-judge rubric#446

feat QED Math Environment with MCP tools and LLM-judge rubric#446
rycerzes wants to merge 30 commits into
huggingface:mainfrom
rycerzes:feat/qed-math-env

rycerzes commented Mar 17, 2026 •

edited

Loading

Uh oh!

rycerzes left a comment •

edited

Loading

Uh oh!

greptile-apps Bot commented Mar 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

burtenshaw commented Mar 26, 2026

Uh oh!

rycerzes left a comment

Uh oh!

burtenshaw commented Mar 29, 2026

Uh oh!

Uh oh!

Uh oh!

rycerzes commented Mar 30, 2026

Uh oh!

rycerzes commented Apr 11, 2026

Uh oh!

Darktex left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rycerzes commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's included

Environment (qed_math_env)

Key features

Testing

TODO before review & merge

Env Inference example with logs

Uh oh!

rycerzes left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

burtenshaw commented Mar 26, 2026

Uh oh!

rycerzes left a comment

Choose a reason for hiding this comment

Uh oh!

burtenshaw commented Mar 29, 2026

Uh oh!

Uh oh!

Uh oh!

rycerzes commented Mar 30, 2026

Uh oh!

rycerzes commented Apr 11, 2026

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Tier 1 — Bugs / Lint

Tier 2 — Alignment (questions for human reviewers)

Overall

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rycerzes commented Mar 17, 2026 •

edited

Loading

rycerzes left a comment •

edited

Loading

greptile-apps Bot commented Mar 26, 2026 •

edited

Loading