Recenter verity-benchmark on task-oriented architecture #2
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bbee2a025e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Addressed the open review findings in
Local verification passed with
💡 Codex Review
Reviewed commit: 8b49a08802
💡 Codex Review
Reviewed commit: f747f89bc3
💡 Codex Review
Reviewed commit: 94daccc3b9
💡 Codex Review
Reviewed commit: bfee47c8bf
💡 Codex Review
Reviewed commit: 26d511f520
💡 Codex Review
Reviewed commit: 63515ab218
💡 Codex Review
Reviewed commit: 1816d9460f
💡 Codex Review
Reviewed commit: a9edaa0bbe
💡 Codex Review
Reviewed commit: 72c2e12f0e
💡 Codex Review
Reviewed commit: 8a6dab9f50
Addressed the current agent-runtime review findings in
Verification run in this workspace:
Live benchmark note:
@codex review
💡 Codex Review
Reviewed commit: 8df14ed83c
Follow-up in
Fresh verification in this workspace:
Current live results on
So the harness/runtime path is now verified across all three modes on the same task, and the remaining interactive failure is model behavior rather than benchmark breakage. @codex review
Follow-up in
Verification in this workspace:
💡 Codex Review
Reviewed commit: 91b379f535
Addressed the remaining relevant open review threads in
Verification in this workspace:
I also resolved the matching review threads on the PR. The unrelated local-only edit in
Follow-up in
Verification in this workspace:
I also resolved the remaining non-outdated review threads that these changes cover. The unrelated local-only edit in
💡 Codex Review
Reviewed commit: f52e480a17
💡 Codex Review
Reviewed commit: d3cb1dffc2
💡 Codex Review
Reviewed commit: e8efdbf0fe
```python
if statuses:
    if any(status == "failed" for status in statuses):
        case_status = "failed"
    elif any(status == "passed" for status in statuses):
        case_status = "passed"
```
Treat missing task artifacts as non-passing case status
When aggregate_results() summarizes a case, it only looks at the statuses of artifacts that already exist. If one sibling task passed but another task in the same case never produced a result file (for example because run_task.sh crashed before writing JSON), this branch still marks the whole case passed. I reproduced that by leaving only one ethereum/deposit_contract_minimal task artifact on disk: case_summary.json reported the case as passed even though 4 task refs were missing. Any consumer of results/case_summary.json will therefore overstate case health after partial or crashed runs.
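One way to address this finding is to aggregate over every task declared in the case manifest, not just the artifacts found on disk. The sketch below is a hypothetical helper, not the repository's actual `aggregate_results()` code; the names `summarize_case`, `expected_tasks`, and the `"missing"` status are assumptions for illustration.

```python
def summarize_case(expected_tasks, artifact_statuses):
    """Aggregate per-task statuses into one case status.

    `expected_tasks` is every task ref declared in the case manifest;
    `artifact_statuses` maps task ref -> status for artifacts that exist.
    A task with no artifact counts against the case, so crashed or
    partial runs can never report the case as passed.
    """
    statuses = [artifact_statuses.get(ref, "missing") for ref in expected_tasks]
    if not statuses:
        return "skipped"
    if any(s in ("failed", "missing") for s in statuses):
        return "failed"
    if all(s == "passed" for s in statuses):
        return "passed"
    return "unknown"
```

With this shape, the reproduced scenario (one passing artifact, four missing task refs) aggregates to `failed` instead of `passed`.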
```python
if name == "search_public_defs":
    limit = int(arguments.get("limit", 20))
    return self.search_public_defs(str(arguments.get("query", "")), limit=limit)
```
Reject malformed interactive tool limits instead of raising
Interactive runs currently trust the model to serialize search_public_defs.limit as a valid integer. In practice tool-call arguments are untrusted model output, and int(arguments.get("limit", 20)) will raise ValueError/TypeError for payloads like {"limit":"bad"} or {"limit":null}. I reproduced that directly against TaskProofRuntime.execute_tool(). A single malformed tool call will abort the entire benchmark run instead of returning a rejected tool result and letting the model continue.
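A defensive version of the limit handling might coerce the untrusted argument and report a tool-level error instead of raising. This is a sketch under assumptions: `parse_limit`, its return convention, and the bounds are hypothetical, not the benchmark's actual API.

```python
def parse_limit(arguments, default=20, maximum=100):
    """Coerce an untrusted tool-call `limit` argument to a bounded int.

    Returns (limit, None) on success, or (None, error_message) so the
    caller can send back a rejected tool result and let the model
    continue, instead of ValueError/TypeError aborting the whole run.
    """
    raw = arguments.get("limit", default)
    if isinstance(raw, bool):  # bool is an int subclass; reject it explicitly
        return None, "limit must be an integer"
    try:
        limit = int(raw)
    except (TypeError, ValueError):
        return None, "limit must be an integer"
    if limit < 1:
        return None, "limit must be >= 1"
    return min(limit, maximum), None
```

Payloads like `{"limit": "bad"}` or `{"limit": null}` then produce an error string rather than an unhandled exception.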
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```python
return re.sub(r"(?s)<think>.*?</think>\s*", "", "\n".join(parts)).strip()
reasoning_content = message.get("reasoning_content")
if isinstance(reasoning_content, str):
    return reasoning_content.strip()
```
Adapter returns reasoning content as candidate proof text
Medium Severity
extract_text in openai_compatible_adapter.py falls back to returning reasoning_content when content is null, but the same function in default_agent.py correctly returns an empty string. This means reasoning-only API responses (common with o1/o3-style models where content is null) would have their internal chain-of-thought submitted as proof candidate text through the command adapter path, instead of triggering the empty_response handling. The check.sh test only covers default_agent.py's version and uses content: "" rather than content: None, so this inconsistency is untested.
Additional Locations (1)
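A fix consistent with `default_agent.py` would never surface `reasoning_content` as candidate text. The sketch below is illustrative only: the message-dict shape is assumed from OpenAI-style chat responses, and this is not the adapter's verbatim code.

```python
import re

def extract_text(message):
    """Extract candidate proof text from a chat-completion message dict.

    Mirrors the stricter default_agent.py behavior: reasoning_content is
    never returned as proof text, so a reasoning-only response (content
    is None) yields "" and triggers empty_response handling upstream.
    """
    content = message.get("content")
    if isinstance(content, str):
        return re.sub(r"(?s)<think>.*?</think>\s*", "", content).strip()
    if isinstance(content, list):  # OpenAI-style list of content parts
        parts = [p.get("text", "") for p in content if isinstance(p, dict)]
        return re.sub(r"(?s)<think>.*?</think>\s*", "", "\n".join(parts)).strip()
    return ""  # content is None: do NOT fall back to reasoning_content
```

A regression test would then cover `content: None` with a populated `reasoning_content`, which the current `check.sh` coverage misses.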
```python
for index, item in enumerate(value):
    errors.extend(validate(item, schema["items"], f"{path}[{index}]"))

return errors
```
Schema validation logic duplicated across two files
Low Severity
type_matches and validate are independently implemented in both validate_manifests.py and default_agent.py. The default_agent.py version is a strict superset, adding minLength, minItems, uniqueItems, and minimum checks that the validate_manifests.py copy lacks. A bug fix or feature addition in one copy won't automatically propagate to the other, and the incomplete copy in validate_manifests.py silently skips constraint types that schemas may rely on.
Additional Locations (1)
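One way to converge the two copies is a single shared module carrying the superset of constraints, imported by both scripts. This is a sketch, not the project's code: the module name `schema_checks.py`, the error strings, and the exact constraint semantics are assumptions.

```python
# schema_checks.py -- hypothetical shared module that both
# validate_manifests.py and default_agent.py would import, so the
# minLength/minItems/uniqueItems/minimum checks exist in one place.
def validate(value, schema, path="$"):
    """Minimal JSON-Schema-style validator (illustrative subset)."""
    errors = []
    expected = schema.get("type")
    if expected == "string":
        if not isinstance(value, str):
            return [f"{path}: expected string"]
        if len(value) < schema.get("minLength", 0):
            errors.append(f"{path}: shorter than minLength")
    elif expected == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        if len(value) < schema.get("minItems", 0):
            errors.append(f"{path}: fewer than minItems")
        if schema.get("uniqueItems") and len(set(map(repr, value))) != len(value):
            errors.append(f"{path}: items not unique")
        for index, item in enumerate(value):
            errors.extend(validate(item, schema.get("items", {}), f"{path}[{index}]"))
    elif expected in ("integer", "number"):
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            return [f"{path}: expected number"]
        if "minimum" in schema and value < schema["minimum"]:
            errors.append(f"{path}: below minimum")
    return errors
```

Fixes then propagate to both call sites automatically, and `validate_manifests.py` stops silently skipping constraint types.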
Codex P2 #1: narrow `environment_error` classification to real infra failures. Previously any missing .olean matched, but that string is also produced when the model imports a non-existent module. In the latter case we want the normal stagnation/temperature logic to kick in so the model can correct itself. Now only classify as environment_error when the missing module is under `Benchmark.*` (our dependency tree, which should have been pre-built) or when lean itself is missing. The generic "failed to load environment" substring no longer short-circuits.

Codex P2 #2: raise `_backoff_delay` Retry-After ceiling from 60s to 600s. Providers routinely request several-minute waits on 429/503; the old 60s clamp caused retries to fire while the rate limit was still in force, undermining the "honour Retry-After" behaviour.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
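The Retry-After change described above can be sketched as follows. This is a hypothetical reconstruction, not the repository's `_backoff_delay`: the function name, parameters, and jitter scheme are assumptions.

```python
import random

RETRY_AFTER_CEILING_S = 600  # raised from 60s: providers routinely ask for minutes

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Choose a delay in seconds before retrying a 429/503 response.

    Honors the server's Retry-After hint (clamped to a 600s ceiling)
    when present and valid; otherwise falls back to jittered
    exponential backoff capped at `cap` seconds.
    """
    if retry_after is not None:
        try:
            requested = float(retry_after)
        except (TypeError, ValueError):
            requested = 0.0  # malformed header: ignore it
        if requested > 0:
            return min(requested, RETRY_AFTER_CEILING_S)
    # Exponential backoff with multiplicative jitter to avoid thundering herds.
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
```

With the old 60s ceiling, a provider asking for a 300s wait would have been retried five times too early; the 600s clamp lets the requested wait actually elapse.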


Summary
- `run_task.sh`/task aggregation so execution is centered on tasks instead of only cases

Verification

- `python3 scripts/validate_manifests.py`
- `python3 scripts/generate_metadata.py`
- `./scripts/run_task.sh ethereum/deposit_contract_minimal/deposit_count` (currently fails in this workspace because `lake build` runs out of disk space while compiling Verity dependencies)
- `./scripts/run_all.sh` (same disk-space limitation in this workspace)

Notes
Note
Medium Risk
Medium risk due to broad refactor of benchmark manifests/layout plus new CI/harness execution paths and encrypted environment handling, which could break builds or automation if misconfigured.
Overview
Re-centers the benchmark around task-level proof generation: adds per-case `tasks/*.yaml` manifests with explicit `implementation_files`, `specification_files`, editable proof templates under `Benchmark/Generated/`, and updates `benchmark.toml`, `REPORT.md`, and `benchmark-inventory.json` accordingly.

Adds two new active benchmark cases (Kleros SortitionTrees and Nexus Mutual RAMM price band), including Verity contract slices, frozen specs, and hidden reference proofs, and wires them into the top-level `Benchmark` imports.

Introduces a runner/validation pipeline: a new GitHub Actions `check` workflow, expanded `scripts/run_all.sh` to execute/aggregate tasks, manifest validation, and `dotenvx`-backed encrypted `.env` support for default/custom agent runs (with new `.gitignore` entries and result-partitioning paths).

Written by Cursor Bugbot for commit e8efdbf. This will update automatically on new commits.