Scaffold generic interactive harness improvement loop by Th0rgal · Pull Request #9 · lfglabs-dev/verity-benchmark

Th0rgal · 2026-03-23T16:45:06Z

Summary

scaffold branch for iterative work on generic interactive harness improvements
intended follow-up work: search frontier, verifier-guided repair, lightweight decomposition, and low-cost gemini benchmarking on a small parallel slice

Notes

this PR intentionally starts with an empty commit
benchmark and implementation work will be added incrementally on this branch

Note

Medium Risk
Medium risk because it significantly changes interactive execution control flow (turn/attempt accounting, transcript rewriting, token budgeting) which can affect benchmark stability and result formats.

Overview
Improves interactive-mode reliability by turning the agent runner into a multi-turn proof-attempt loop that detects token-length cutoffs, nudges the model toward write_editable_proof/run_lean_check, and compacts transcripts after Lean failures (with a forced “restart” strategy after repeated failures).

Enhances repair feedback by expanding heuristic guidance for common Lean errors and by returning repair_hints directly from run_lean_check tool failures.

Updates interactive agent configs to be backend-agnostic (use VERITY_BENCHMARK_AGENT_BASE_URL env var) and adds a new interactive-candidate.json profile for candidate runs.

^{Written by Cursor Bugbot for commit 3afeba2. This will update automatically on new commits. Configure here.}

Three high-leverage fixes for the interactive proof generation harness: 1. Fix interactive.json: rename `attempts` to `max_attempts` (the field the code actually reads), remove unused fields, use env-based base_url for portability. 2. Add structured repair guidance to both run_lean_check tool results (interactive_runtime.py) and the retry messages (default_agent.py). The guidance maps common Lean error patterns to actionable hints (split failures, unsolved goals, type mismatches, unknown identifiers). 3. Fix early-exit bug in execute_interactive_agent_task: when the model returns text without tool calls and the proof fails, inject a repair message and continue the loop instead of immediately returning. This lets the model use remaining attempts (up to 16) rather than wasting them after a single failed text response. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Two changes that together improve 5-task pass rate from 1/5 to 3/5 while reducing total tokens from 572k to 369k: 1. Interactive mode instructions: tell the model that all files are already provided inline, so it should skip redundant read_public_file calls and start with write_editable_proof + run_lean_check directly. This alone saved ~25% of tokens and improved the hard subset 0/3→1/3. 2. Transcript compaction: after a failed run_lean_check, reset the transcript to [base_messages + exploration phase + compact repair msg] instead of accumulating the full history of failed attempts. The exploration phase (file reads, searches) is preserved so the model retains context, but old write-check cycles are replaced with a single repair message containing the latest failed proof + error + guidance. Also adds early-exit when run_lean_check returns passed. Measured on the 5-task confirmation slice (builtin/fast = glm-5-turbo): chain_start_threshold: failed → passed (6k tok, 7s) deposit_count: failed → failed (115k tok) node_id_bijection: passed → passed (96k tok) sync_sets_book_value: failed → passed (23k tok, 21s) claim_marks_user: failed → failed (127k tok) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Enhanced build_repair_guidance (both default_agent.py and interactive_runtime.py) with better error-specific hints: - "unknown constant": explicitly tell the model not to guess lemma names and to use simp/omega/decide instead - "simp made no progress": suggest adding more definitions or removing unproductive simp calls - "failed to infer binder type": hint to add explicit type annotations These changes help the model avoid common failure modes where it hallucinates Lean 4 lemma names that don't exist in the environment. Measured: deposit_count went from never passing to sometimes passing (high variance due to model non-determinism at temp=0 on the proxy). Overall 5-task slice holds at 3/5 with 3 reliably passing tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

After 6 consecutive failed lean checks, compact the transcript without the failed proof and instead instruct the model to try a fundamentally different proof strategy. This prevents the model from getting stuck in a local minimum where it makes minor variations on the same failing approach. The restart threshold of 6 was chosen to avoid interfering with tasks that are close to solving (which may need 3-5 repair iterations), while still helping tasks where the model is genuinely stuck. 5-task confirmation: 3/5 passed, 345k tokens (stable). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Search-only turns (read_public_file, search_public_defs) no longer count toward the proof attempt budget. This gives the model ~30-40% more actual proof attempts. A nudge message is injected after 2 consecutive search-only turns to redirect the model back to proof writing. Total turns are capped at 2x max_attempts to prevent infinite search loops. Before: deposit_count got 5 proof attempts out of 17 total turns After: deposit_count gets 16 proof attempts out of 20 total turns Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add hints for common syntax mistakes that waste proof attempts: - "unexpected token" / "expected 'by'" -> remind about := by syntax - "Function expected at" / "unknown identifier" -> remind about s.storage N function application syntax Synced to both default_agent.py and interactive_runtime.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Models that use internal reasoning (thinking tokens) can exhaust max_completion_tokens without producing visible output or tool calls (finish_reason=length). This caused 70-90% of turns to be wasted. Changes: - Add max_tokens_override to send_chat_completion for adaptive budgets - Detect finish_reason=length with no output: bump budget 1.5x (cap 4500) on first occurrence, inject simplification nudge on subsequent ones - Restructure empty-response path: only evaluate proof when a new candidate is actually provided (prevents false lean_check_failed) - Remove empty_response from retry failure modes (now handled separately) - Increase candidate config to 3000 max_completion_tokens, 180s timeout Results (5-task slice): 3/5 pass (sync_sets_book_value, node_id_bijection, chain_start_threshold). Previous baseline: median 1/5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…dance Add targeted repair hints when Lean goals contain `match` expressions (suggesting `split`/`cases` tactics) and improve `if`-expression hints to suggest `by_cases` instead of generic `native_decide`. Also updates REPORT.md with local validation notes for the uniswap backlog case. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Preserves result transcripts from all experimental runs: - baseline_run: 0/3 passed (initial harness, no improvements) - run2_instructions_only: 0/5 passed (prompt-only changes) - run3_aggressive_compact: 0/5 passed (aggressive transcript compaction) - run4_refined_compact: 3/5 passed (refined compaction + all harness improvements) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Without this, the counter accumulated across non-lean turns (search-only turns) and could trigger a premature strategy restart even when recent proof attempts had different failure modes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ents) Resolves merge conflicts by keeping our improved interactive loop (transcript compaction, separate proof/search budgets, strategy restart, adaptive token budget) while accepting main's content updates for README, REPORT, benchmark-inventory, backlog cases, and harness README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

cursor · 2026-03-24T08:01:13Z

+            "- Use `s.storage 0` (function application) not `s.storage[0]` or `s.storage.0`. "
+            "ContractState.storage is a function `Nat → Uint256`."
+        )
+    return "\n".join(hints)


Repair guidance function fully duplicated across two files

Medium Severity

_build_repair_guidance in interactive_runtime.py is a character-for-character duplicate of build_repair_guidance in default_agent.py. Since default_agent.py already imports from interactive_runtime, one module could expose a single shared implementation. Maintaining identical logic in two places risks future divergence when only one copy gets a bug fix or new hint.

Additional Locations (1)

harness/default_agent.py#L546-L605

cursor · 2026-03-24T08:01:13Z

+            if tool_name == "run_lean_check" and result.get("failure_mode") == "lean_check_failed":
+                last_lean_error = str(result.get("details", ""))[:MAX_ERROR_FEEDBACK_CHARS]
+                saw_lean_failure = True
+                consecutive_lean_failures += 1


Consecutive lean failure counter increments per tool call

Low Severity

consecutive_lean_failures increments once per run_lean_check tool call, not once per proof-attempt turn. If the model issues multiple write+check cycles in a single turn, the counter advances by more than one while proof_attempts only advances by one. This can cause the RESTART_AFTER_CONSECUTIVE_FAILURES threshold to be reached prematurely, triggering a strategy restart earlier than intended.

Claude Agent and others added 16 commits March 20, 2026 14:30

Simplify benchmark docs

cc7bf4e

Default interactive agent to 16 rounds

6737612

Add Uniswap V2 swap benchmark candidate

94ad4f8

docs: record local validation for uniswap pair backlog case

a4f4bd1

chore: scaffold generic harness search improvements

9f3c65e

Th0rgal marked this pull request as ready for review March 24, 2026 07:51

Th0rgal merged commit 1136ecc into main Mar 24, 2026
2 checks passed

cursor Bot reviewed Mar 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scaffold generic interactive harness improvement loop#9

Scaffold generic interactive harness improvement loop#9
Th0rgal merged 16 commits intomainfrom
codex/generic-harness-search-pr

Th0rgal commented Mar 23, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Mar 24, 2026

Uh oh!

cursor Bot Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Th0rgal commented Mar 23, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notes

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Mar 24, 2026

Choose a reason for hiding this comment

Repair guidance function fully duplicated across two files

Uh oh!

cursor Bot Mar 24, 2026

Choose a reason for hiding this comment

Consecutive lean failure counter increments per tool call

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Th0rgal commented Mar 23, 2026 •

edited by cursor Bot

Loading