
Recenter verity-benchmark on task-oriented architecture #2

Merged
Th0rgal merged 41 commits into main from codex/task-benchmark-integration on Mar 20, 2026

Conversation

@Th0rgal
Member

@Th0rgal Th0rgal commented Mar 18, 2026

Summary

  • add family and implementation manifests plus machine-readable schemas for families, implementations, cases, tasks, and run results
  • migrate the existing benchmark metadata to task-oriented manifests with explicit translation/spec/proof readiness and abstraction metadata
  • add a harness scaffold plus run_task.sh/task aggregation so execution is centered on tasks instead of only cases
  • regenerate the benchmark inventory and report to surface family, implementation, task, and status summaries

Verification

  • python3 scripts/validate_manifests.py
  • python3 scripts/generate_metadata.py
  • ./scripts/run_task.sh ethereum/deposit_contract_minimal/deposit_count (currently fails in this workspace because lake build runs out of disk space while compiling Verity dependencies)
  • ./scripts/run_all.sh (same disk-space limitation in this workspace)

Notes

  • The new harness is honest about current maturity: active tasks are spec-ready and proof-planned, not yet proof-complete.
  • Manifest validation and metadata generation pass cleanly.

Note

Medium Risk
Medium risk due to broad refactor of benchmark manifests/layout plus new CI/harness execution paths and encrypted environment handling, which could break builds or automation if misconfigured.

Overview
Re-centers the benchmark around task-level proof generation: adds per-case tasks/*.yaml manifests with explicit implementation_files and specification_files plus editable proof templates under Benchmark/Generated/, and updates benchmark.toml, REPORT.md, and benchmark-inventory.json accordingly.
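As a rough sketch, a per-case task manifest implied by this description might look as follows; only implementation_files, specification_files, the Benchmark/Generated/ template location, and the readiness flags are named in this PR, so the exact keys and values here are illustrative guesses, not the actual schema:

```yaml
# tasks/deposit_count.yaml — illustrative shape only, not the real schema
id: ethereum/deposit_contract_minimal/deposit_count
implementation_files:
  - Benchmark/Cases/Ethereum/DepositContractMinimal/Impl.lean   # guessed path
specification_files:
  - Benchmark/Cases/Ethereum/DepositContractMinimal/Specs.lean  # guessed path
proof_template: Benchmark/Generated/Ethereum/DepositContractMinimal/Tasks/DepositCount.lean
readiness:            # explicit readiness metadata per the Summary above
  translation: ready
  spec: ready
  proof: planned
```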

Adds two new active benchmark cases (Kleros SortitionTrees and Nexus Mutual RAMM price band), including Verity contract slices, frozen specs, and hidden reference proofs, and wires them into the top-level Benchmark imports.

Introduces a runner/validation pipeline: new GitHub Actions check workflow, expanded scripts/run_all.sh to execute/aggregate tasks, manifest validation, and dotenvx-backed encrypted .env support for default/custom agent runs (with new .gitignore entries and result partitioning paths).

Written by Cursor Bugbot for commit e8efdbf. This will update automatically on new commits.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbee2a025e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread harness/task_runner.py Outdated
Comment thread harness/task_runner.py
Comment thread harness/task_runner.py
Comment thread scripts/generate_metadata.py
@Th0rgal
Member Author

Th0rgal commented Mar 18, 2026

Addressed the open review findings in 8b49a08:

  • task execution now validates the selected declaration, not just the module
  • case aggregation includes active cases with no task manifests
  • metadata readiness now matches runner readiness for translation
  • run-result schema matches emitted payload fields

Local verification passed with python3 scripts/validate_manifests.py, python3 scripts/generate_metadata.py, ./scripts/run_all.sh, and ./scripts/check.sh. The remaining GitHub check is the pending Bugbot rerun.

Comment thread schemas/run-result.schema.json
Comment thread scripts/generate_metadata.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 8b49a08802

Comment thread scripts/validate_manifests.py
Comment thread harness/task_runner.py
Comment thread scripts/generate_metadata.py Outdated
Comment thread Benchmark/Cases/NexusMutual/RammPriceBand/Specs.lean Outdated
Comment thread harness/task_runner.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f747f89bc3

Comment thread scripts/validate_manifests.py
Comment thread scripts/validate_manifests.py
Comment thread scripts/validate_manifests.py
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 94daccc3b9

Comment thread schemas/run-result.schema.json
Comment thread schemas/task.schema.json Outdated
Comment thread scripts/validate_manifests.py
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: bfee47c8bf

Comment thread Benchmark/Cases/NexusMutual/RammPriceBand/Specs.lean Outdated
Comment thread Benchmark/Cases/Kleros/SortitionTrees/Specs.lean Outdated
Comment thread scripts/validate_manifests.py Outdated
Comment thread cases/paladin_votes/stream_recovery_claim_usdc/case.yaml Outdated
Comment thread schemas/run-result.schema.json
Comment thread cases/nexus_mutual/ramm_price_band/case.yaml Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 26d511f520

Comment thread Benchmark/Cases/Kleros/SortitionTrees/Specs.lean Outdated
Comment thread scripts/validate_manifests.py Outdated
Comment thread scripts/validate_manifests.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 63515ab218

Comment thread Benchmark/Cases/NexusMutual/RammPriceBand/Specs.lean Outdated
Comment thread scripts/validate_manifests.py
Comment thread harness/default_agent.py
Comment thread harness/default_agent.py
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 1816d9460f

Comment thread harness/default_agent.py Outdated
Comment thread harness/task_runner.py
Comment thread harness/default_agent.py Outdated
Comment thread harness/default_agent.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: a9edaa0bbe

Comment thread Benchmark/Cases/Ethereum/DepositContractMinimal/Proofs.lean Outdated
Comment thread harness/default_agent.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 72c2e12f0e

Comment thread scripts/validate_manifests.py Outdated
Comment thread scripts/validate_manifests.py
Comment thread Benchmark/Cases.lean
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 8a6dab9f50

Comment thread harness/default_agent.py Outdated
Comment thread scripts/validate_manifests.py Outdated
Comment thread harness/default_agent.py Outdated
@Th0rgal
Member Author

Th0rgal commented Mar 20, 2026

Addressed the current agent-runtime review findings in 8df14ed:

  • harness/default_agent.py run now returns the task evaluation exit code instead of always exiting 0
  • candidate evaluation now rejects hidden Benchmark.Cases.*.Proofs imports
  • candidate evaluation now rejects theorem statement changes before Lean checking

Verification run in this workspace:

  • python3 scripts/validate_manifests.py
  • python3 scripts/generate_metadata.py
  • lake build Benchmark
  • ./scripts/check.sh
  • python3 harness/default_agent.py validate-config harness/agents/interactive.json
  • local regression check for theorem_statement_mismatch and hidden_proof_import_detected
  • live probe with VERITY_BENCHMARK_AGENT_API_KEY=... VERITY_BENCHMARK_AGENT_MODEL=builtin/smart python3 harness/default_agent.py probe --profile interactive --ensure-model
  • live probe with VERITY_BENCHMARK_AGENT_API_KEY=... VERITY_BENCHMARK_AGENT_BASE_URL=https://agent-backend.thomas.md/v1 VERITY_BENCHMARK_AGENT_MODEL=builtin/smart python3 harness/default_agent.py probe --profile openai-compatible --ensure-model
  • live task run with builtin/smart through interactive and openai-compatible profiles on ethereum/deposit_contract_minimal/deposit_count

Live benchmark note:

  • the new runtimes are wired correctly against the real backend
  • the tested builtin/smart runs still failed the proof task (lean_check_failed), so this was a harness/runtime verification pass rather than a passing benchmark score
  • an attempted full interactive active-suite run stalled after the first two tasks, so I stopped it rather than leave an unbounded backend call running

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 8df14ed83c

Comment thread harness/interactive_runtime.py
Comment thread harness/default_agent.py Outdated
@Th0rgal
Member Author

Th0rgal commented Mar 20, 2026

Follow-up in 2097068:

  • fixed the public benchmark proof in Benchmark/Generated/Ethereum/DepositContractMinimal/Tasks/DepositCount.lean; the previous checked-in proof did not type-check, which explained the earlier lean_check_failed result on a correct candidate

Fresh verification in this workspace:

  • ./scripts/check.sh
  • lake env lean --root=. Benchmark/Generated/Ethereum/DepositContractMinimal/Tasks/DepositCount.lean
  • VERITY_BENCHMARK_AGENT_API_KEY=... VERITY_BENCHMARK_AGENT_MODEL=builtin/smart python3 harness/default_agent.py probe --profile interactive --ensure-model
  • VERITY_BENCHMARK_AGENT_API_KEY=... VERITY_BENCHMARK_AGENT_BASE_URL=https://agent-backend.thomas.md/v1 VERITY_BENCHMARK_AGENT_MODEL=builtin/smart python3 harness/default_agent.py probe --profile openai-compatible --ensure-model
  • VERITY_BENCHMARK_AGENT_API_KEY=... VERITY_BENCHMARK_AGENT_MODEL=builtin/smart python3 harness/agent_runner.py run ethereum/deposit_contract_minimal/deposit_count --profile interactive
  • VERITY_BENCHMARK_AGENT_API_KEY=... VERITY_BENCHMARK_AGENT_BASE_URL=https://agent-backend.thomas.md/v1 VERITY_BENCHMARK_AGENT_MODEL=builtin/smart python3 harness/agent_runner.py run ethereum/deposit_contract_minimal/deposit_count --config harness/default-agent.example.json
  • strict-mode spot check via a temporary config cloned from harness/agents/openai-proxy-fast.json with model=builtin/smart

Current live results on ethereum/deposit_contract_minimal/deposit_count:

  • custom with builtin/smart: passes
  • strict with builtin/smart: passes
  • interactive with builtin/smart: still fails, now correctly as theorem_statement_mismatch because the model rewrites the theorem contract instead of preserving the editable declaration

So the harness/runtime path is now verified across all three modes on the same task, and the remaining interactive failure is model behavior rather than benchmark breakage.

@codex review

@Th0rgal
Member Author

Th0rgal commented Mar 20, 2026

Follow-up in 91b379f:

  • restored Benchmark/Generated/Ethereum/DepositContractMinimal/Tasks/DepositCount.lean to a real benchmark placeholder instead of a solved proof
  • interactive candidate evaluation now rejects non-public Benchmark.Cases.* imports, not just direct *.Proofs imports
  • bundled command adapters that shell out to openai_compatible_adapter.py now fail fast on missing base_url / model / api_key inputs during config resolution
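The import screen described in the second bullet can be sketched roughly as follows; the regex and function name are illustrative, not the harness's actual code:

```python
import re

# Reject candidate proofs that import anything under Benchmark.Cases.*,
# since the hidden reference proofs live in that namespace (sketch only).
HIDDEN_IMPORT = re.compile(r"^\s*import\s+(Benchmark\.Cases\.\S+)", re.MULTILINE)

def find_hidden_imports(candidate_source: str) -> list[str]:
    """Return every non-public Benchmark.Cases.* import in a candidate file."""
    return HIDDEN_IMPORT.findall(candidate_source)
```

Under this sketch, a candidate containing `import Benchmark.Cases.Ethereum` is flagged, while imports under Benchmark.Generated pass through.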

Verification in this workspace:

  • ./scripts/check.sh
  • targeted regression: resolve_config(..., require_secrets=True) now errors immediately for harness/default-agent.example.json when VERITY_BENCHMARK_AGENT_BASE_URL is unset
  • targeted regression: TaskProofRuntime.evaluate_candidate() now returns hidden_case_import_detected for import Benchmark.Cases.Ethereum
  • live probes with builtin/smart still succeed for interactive and custom
  • live task spot checks after the earlier runtime work still complete end to end for strict and custom; both currently fail on proof quality rather than harness errors

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 91b379f535

Comment thread scripts/validate_manifests.py
Comment thread scripts/validate_manifests.py
@Th0rgal
Member Author

Th0rgal commented Mar 20, 2026

Addressed the remaining relevant open review threads in 3db3345:

  • validate case manifests against their suite/project/case directory path
  • validate task manifests against filename and suite path
  • validate implementation case_ids against discovered cases, including inverse coverage
  • reject duplicate task refs across cases/ and backlog
  • scope harness/task_runner.py aggregate case summaries to the explicitly requested task set
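The directory-path check in the first bullet might look roughly like this; the cases/<family>/<case>/case.yaml layout is inferred from paths elsewhere in this PR (e.g. cases/nexus_mutual/ramm_price_band/case.yaml), and the function name is hypothetical:

```python
from pathlib import Path

def case_id_matches_path(manifest_path: Path, case_id: str) -> bool:
    """A case manifest's id must mirror its <family>/<case> directory."""
    family, case = manifest_path.parent.parts[-2:]
    return case_id == f"{family}/{case}"
```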

Verification in this workspace:

  • python3 scripts/validate_manifests.py
  • python3 harness/task_runner.py aggregate --suite active ethereum/deposit_contract_minimal/deposit_count
  • ./scripts/check.sh

I also resolved the matching review threads on the PR. The unrelated local-only edit in Benchmark/Generated/Ethereum/DepositContractMinimal/Tasks/DepositCount.lean remains untouched and is not part of this push.

@Th0rgal
Member Author

Th0rgal commented Mar 20, 2026

Follow-up in f52e480:

  • validated family and implementation manifests against their directory paths
  • validated task family/implementation provenance against the parent case manifest
  • enforced inverse family membership coverage for case/implementation lists
  • emitted evaluation, source_ref, and task_interface_version consistently with schemas/run-result.schema.json
  • made task aggregation surface missing result files explicitly instead of silently dropping them
  • redacted resolved auth headers from agent-run artifacts and describe-command output
  • handled non-JSON HTTP 200 responses from the backend as structured failures
  • stopped one-off run / run-case agent summaries from overwriting the suite summary path
  • removed the extra task re-resolution in build_result()
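The header redaction described above can be sketched like this; which header names count as sensitive is an assumption, not something the PR specifies:

```python
# Redact resolved auth headers before they reach run artifacts or
# describe-command output (illustrative; the real header set may differ).
SENSITIVE_HEADERS = {"authorization", "x-api-key"}

def redact_headers(headers: dict) -> dict:
    return {
        name: "<redacted>" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }
```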

Verification in this workspace:

  • python3 scripts/validate_manifests.py
  • python3 harness/task_runner.py run ethereum/deposit_contract_minimal/deposit_count
  • missing-result regression via temporary removal of results/tasks/ethereum__deposit_contract_minimal__deposit_count.json followed by python3 harness/task_runner.py aggregate --suite active ethereum/deposit_contract_minimal/deposit_count
  • dry-run regression with a temporary config using header_envs, confirming stored agent headers are redacted and non-suite summaries write under results/agent_summaries/<track>/<run_slug>/...
  • ./scripts/check.sh

I also resolved the remaining non-outdated review threads that these changes cover. The unrelated local-only edit in Benchmark/Generated/Ethereum/DepositContractMinimal/Tasks/DepositCount.lean remains untouched and is not part of this push.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f52e480a17

Comment thread harness/interactive_runtime.py
Comment thread scripts/validate_manifests.py
Comment thread harness/interactive_runtime.py
Comment thread harness/default_agent.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: d3cb1dffc2

Comment thread harness/default_agent.py Outdated
@Th0rgal Th0rgal merged commit 709e6e9 into main Mar 20, 2026
2 of 3 checks passed
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: e8efdbf0fe

Comment thread harness/task_runner.py
Comment on lines +377 to +381
if statuses:
    if any(status == "failed" for status in statuses):
        case_status = "failed"
    elif any(status == "passed" for status in statuses):
        case_status = "passed"

P2: Treat missing task artifacts as non-passing case status

When aggregate_results() summarizes a case, it only looks at the statuses of artifacts that already exist. If one sibling task passed but another task in the same case never produced a result file (for example because run_task.sh crashed before writing JSON), this branch still marks the whole case passed. I reproduced that by leaving only one ethereum/deposit_contract_minimal task artifact on disk: case_summary.json reported the case as passed even though 4 task refs were missing. Any consumer of results/case_summary.json will therefore overstate case health after partial or crashed runs.

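A fix along the lines this review suggests would let any absent artifact block a passing verdict; the function and field names below are hypothetical, not the harness's actual API:

```python
def case_status(expected_refs, statuses_by_ref):
    """Summarize a case; any missing task artifact blocks a 'passed' verdict."""
    statuses = [statuses_by_ref.get(ref) for ref in expected_refs]
    if any(s == "failed" for s in statuses):
        return "failed"
    if statuses and all(s == "passed" for s in statuses):
        return "passed"
    # None entries are task refs whose result file never appeared on disk.
    return "incomplete"
```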

Comment on lines +293 to +295
if name == "search_public_defs":
    limit = int(arguments.get("limit", 20))
    return self.search_public_defs(str(arguments.get("query", "")), limit=limit)

P2: Reject malformed interactive tool limits instead of raising

Interactive runs currently trust the model to serialize search_public_defs.limit as a valid integer. In practice tool-call arguments are untrusted model output, and int(arguments.get("limit", 20)) will raise ValueError/TypeError for payloads like {"limit":"bad"} or {"limit":null}. I reproduced that directly against TaskProofRuntime.execute_tool(). A single malformed tool call will abort the entire benchmark run instead of returning a rejected tool result and letting the model continue.

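One way to harden this, sketched with hypothetical names: coerce the untrusted limit to a bounded int and signal a rejection instead of raising.

```python
def parse_limit(arguments: dict, default: int = 20, maximum: int = 100):
    """Coerce an untrusted tool-call `limit` to a bounded int, or None."""
    raw = arguments.get("limit", default)
    try:
        limit = int(raw)
    except (TypeError, ValueError):
        return None  # caller emits a rejected tool result; the run continues
    return max(1, min(limit, maximum))
```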

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

return re.sub(r"(?s)<think>.*?</think>\s*", "", "\n".join(parts)).strip()
reasoning_content = message.get("reasoning_content")
if isinstance(reasoning_content, str):
    return reasoning_content.strip()

Adapter returns reasoning content as candidate proof text

Medium Severity

extract_text in openai_compatible_adapter.py falls back to returning reasoning_content when content is null, but the same function in default_agent.py correctly returns an empty string. This means reasoning-only API responses (common with o1/o3-style models where content is null) would have their internal chain-of-thought submitted as proof candidate text through the command adapter path, instead of triggering the empty_response handling. The check.sh test only covers default_agent.py's version and uses content: "" rather than content: None, so this inconsistency is untested.
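The behaviour Bugbot recommends — never fall back to reasoning_content — could look like this simplified sketch (the real extract_text also strips <think> blocks and joins multiple content parts):

```python
def extract_text(message: dict) -> str:
    """Return candidate text from an OpenAI-style chat message.

    Reasoning-only responses (content is null) yield "" so that the
    adapter's empty_response handling fires instead of submitting
    chain-of-thought as a proof candidate.
    """
    content = message.get("content")
    if isinstance(content, str):
        return content.strip()
    # Deliberately ignore reasoning_content: chain-of-thought is not a proof.
    return ""
```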


for index, item in enumerate(value):
    errors.extend(validate(item, schema["items"], f"{path}[{index}]"))

return errors

Schema validation logic duplicated across two files

Low Severity

type_matches and validate are independently implemented in both validate_manifests.py and default_agent.py. The default_agent.py version is a strict superset, adding minLength, minItems, uniqueItems, and minimum checks that the validate_manifests.py copy lacks. A bug fix or feature addition in one copy won't automatically propagate to the other, and the incomplete copy in validate_manifests.py silently skips constraint types that schemas may rely on.


Th0rgal pushed a commit that referenced this pull request Apr 22, 2026
Codex P2 #1: narrow `environment_error` classification to real infra
failures. Previously any missing .olean matched, but that string is
also produced when the model imports a non-existent module. In the
latter case we want the normal stagnation/temperature logic to kick in
so the model can correct itself. Now only classify as environment_error
when the missing module is under `Benchmark.*` (our dependency tree
which should have been pre-built) or when lean itself is missing. The
generic "failed to load environment" substring no longer
short-circuits.

Codex P2 #2: raise `_backoff_delay` Retry-After ceiling from 60s to
600s. Providers routinely request several-minute waits on 429/503; the
old 60s clamp caused retries to fire while the rate limit was still in
force, undermining the "honour Retry-After" behaviour.
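The two fixes in this commit message can be sketched as follows; the function names and the non-environment failure label are illustrative (the message itself only names environment_error and the Retry-After ceiling):

```python
def classify_missing_module(missing_module, lean_available=True):
    """Classify environment_error only for real infra failures (sketch)."""
    if not lean_available:
        return "environment_error"          # lean binary itself is missing
    if missing_module is None:
        return "lean_check_failed"
    if missing_module == "Benchmark" or missing_module.startswith("Benchmark."):
        # Our pre-built dependency tree is genuinely broken.
        return "environment_error"
    # Model imported a non-existent module: let the normal
    # stagnation/temperature logic kick in so it can correct itself.
    return "lean_check_failed"

def backoff_delay(retry_after_seconds: float) -> float:
    """Honour Retry-After up to a 600s ceiling (previously clamped at 60s)."""
    return min(max(retry_after_seconds, 0.0), 600.0)
```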

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>