Scaffold generic interactive harness improvement loop #9
Conversation
Three high-leverage fixes for the interactive proof generation harness:

1. Fix interactive.json: rename `attempts` to `max_attempts` (the field the code actually reads), remove unused fields, and use an env-based base_url for portability.
2. Add structured repair guidance to both run_lean_check tool results (interactive_runtime.py) and the retry messages (default_agent.py). The guidance maps common Lean error patterns to actionable hints (split failures, unsolved goals, type mismatches, unknown identifiers).
3. Fix the early-exit bug in execute_interactive_agent_task: when the model returns text without tool calls and the proof fails, inject a repair message and continue the loop instead of immediately returning. This lets the model use its remaining attempts (up to 16) rather than wasting them after a single failed text response.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
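The early-exit fix in item 3 can be sketched as a minimal loop. This is an illustrative sketch, not the harness's actual API: `run_model_turn`, the message wording, and the return shape are all assumptions.

```python
def run_attempt_loop(run_model_turn, max_attempts=16):
    """Drive the model for up to max_attempts turns, repairing on failure.

    Old behavior: a failed text-only turn returned immediately.
    New behavior: inject a repair message and keep looping so the model
    can use its remaining attempts.
    """
    messages = [{"role": "user", "content": "Prove the theorem."}]
    for attempt in range(1, max_attempts + 1):
        reply = run_model_turn(messages)
        if reply.get("proof_passed"):
            return {"passed": True, "attempts": attempt}
        # Keep the model's text and add a repair nudge instead of returning.
        messages.append({"role": "assistant", "content": reply.get("text", "")})
        messages.append({
            "role": "user",
            "content": "The proof still fails. Submit a corrected proof "
                       "via run_lean_check.",
        })
    return {"passed": False, "attempts": max_attempts}
```

The key difference from the buggy version is that the failure branch appends a message and continues rather than returning on the first text-only miss.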
Two changes that together improve the 5-task pass rate from 1/5 to 3/5 while reducing total tokens from 572k to 369k:

1. Interactive mode instructions: tell the model that all files are already provided inline, so it should skip redundant read_public_file calls and start with write_editable_proof + run_lean_check directly. This alone saved ~25% of tokens and improved the hard subset 0/3 → 1/3.
2. Transcript compaction: after a failed run_lean_check, reset the transcript to [base_messages + exploration phase + compact repair msg] instead of accumulating the full history of failed attempts. The exploration phase (file reads, searches) is preserved so the model retains context, but old write-check cycles are replaced with a single repair message containing the latest failed proof + error + guidance. Also adds an early exit when run_lean_check returns passed.

Measured on the 5-task confirmation slice (builtin/fast = glm-5-turbo):

- chain_start_threshold: failed → passed (6k tok, 7s)
- deposit_count: failed → failed (115k tok)
- node_id_bijection: passed → passed (96k tok)
- sync_sets_book_value: failed → passed (23k tok, 21s)
- claim_marks_user: failed → failed (127k tok)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
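The compaction step described in item 2 can be sketched as follows, assuming list-of-dict chat messages; the function and field names are illustrative, not the harness's actual ones.

```python
def compact_transcript(base_messages, exploration_messages,
                       failed_proof, lean_error, guidance):
    """Rebuild the transcript after a failed run_lean_check.

    Keeps the base prompt and the exploration phase (file reads, searches)
    so the model retains context, and replaces all accumulated write/check
    cycles with one compact repair message carrying the latest failed
    proof, the Lean error, and repair guidance.
    """
    repair_message = {
        "role": "user",
        "content": (
            "Your latest proof failed.\n\n"
            f"Proof:\n{failed_proof}\n\n"
            f"Lean error:\n{lean_error}\n\n"
            f"Guidance:\n{guidance}"
        ),
    }
    return list(base_messages) + list(exploration_messages) + [repair_message]
```

Because the returned list is rebuilt from the base and exploration phases each time, failed write-check cycles never accumulate across repair iterations.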
Enhanced build_repair_guidance (in both default_agent.py and interactive_runtime.py) with better error-specific hints:

- "unknown constant": explicitly tell the model not to guess lemma names and to use simp/omega/decide instead
- "simp made no progress": suggest adding more definitions or removing unproductive simp calls
- "failed to infer binder type": hint to add explicit type annotations

These changes help the model avoid common failure modes where it hallucinates Lean 4 lemma names that don't exist in the environment. Measured: deposit_count went from never passing to sometimes passing (high variance due to model non-determinism at temp=0 on the proxy). Overall the 5-task slice holds at 3/5, with 3 reliably passing tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
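The hint table amounts to a pattern-to-advice mapping. The patterns below are the ones named in these commits; the exact hint wording and the function shape are illustrative sketches, not the real implementation.

```python
# Lean error substrings mapped to repair advice (patterns from the commits;
# hint wording is illustrative).
ERROR_HINTS = [
    ("unknown constant",
     "Do not guess lemma names; prefer simp/omega/decide over hypothetical lemmas."),
    ("simp made no progress",
     "Unfold more definitions in simp, or remove the unproductive simp call."),
    ("failed to infer binder type",
     "Add explicit type annotations to the binders."),
    ("unsolved goals",
     "Split the goal and discharge each subgoal explicitly."),
    ("type mismatch",
     "Check the expected vs. actual types and add coercions or rewrite steps."),
]

def build_repair_guidance(error_text):
    """Map Lean error text to actionable hints (sketch of the shared helper)."""
    lowered = error_text.lower()
    hints = [hint for pattern, hint in ERROR_HINTS if pattern in lowered]
    return "\n".join(hints)
```

A table like this is easy to keep in sync across files, which is exactly why the review below flags the duplicated copy as a divergence risk.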
After 6 consecutive failed lean checks, compact the transcript without the failed proof and instead instruct the model to try a fundamentally different proof strategy. This prevents the model from getting stuck in a local minimum where it makes minor variations on the same failing approach.

The restart threshold of 6 was chosen to avoid interfering with tasks that are close to solving (which may need 3-5 repair iterations), while still helping tasks where the model is genuinely stuck.

5-task confirmation: 3/5 passed, 345k tokens (stable).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
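The restart check can be sketched as a small guard; the constant matches the threshold above, while the function name and message wording are illustrative.

```python
RESTART_AFTER_CONSECUTIVE_FAILURES = 6

def maybe_restart_transcript(consecutive_lean_failures, base_messages):
    """Return a restarted transcript once the failure streak hits the
    threshold, or None to continue with normal repair messages.

    The failed proof is deliberately omitted so the model does not keep
    producing minor variations on the same dead-end approach.
    """
    if consecutive_lean_failures < RESTART_AFTER_CONSECUTIVE_FAILURES:
        return None
    return list(base_messages) + [{
        "role": "user",
        "content": ("Your previous attempts all failed with this approach. "
                    "Discard it and try a fundamentally different proof "
                    "strategy."),
    }]
```

Returning None below the threshold keeps the normal repair path (latest proof + error + guidance) in control until the model is demonstrably stuck.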
Search-only turns (read_public_file, search_public_defs) no longer count toward the proof attempt budget, giving the model ~30-40% more actual proof attempts. A nudge message is injected after 2 consecutive search-only turns to redirect the model back to proof writing, and total turns are capped at 2x max_attempts to prevent infinite search loops.

Before: deposit_count got 5 proof attempts out of 17 total turns.
After: deposit_count gets 16 proof attempts out of 20 total turns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
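The split accounting might look like the sketch below; the state fields and the boolean "nudge due" return value are assumptions, not the harness's real interface.

```python
SEARCH_ONLY_TOOLS = {"read_public_file", "search_public_defs"}
NUDGE_AFTER_SEARCH_TURNS = 2

def account_turn(tool_names, state):
    """Update per-turn budgets; return True when a redirect nudge is due.

    Search-only turns do not consume proof attempts, but total turns are
    still counted so the caller can cap them at 2 * max_attempts.
    """
    state["total_turns"] += 1
    if tool_names and set(tool_names) <= SEARCH_ONLY_TOOLS:
        state["consecutive_search_turns"] += 1
        return state["consecutive_search_turns"] >= NUDGE_AFTER_SEARCH_TURNS
    # Any turn that writes or checks a proof resets the search streak.
    state["consecutive_search_turns"] = 0
    state["proof_attempts"] += 1
    return False
```

The caller would enforce the hard cap separately, e.g. stop once `state["total_turns"] >= 2 * max_attempts`.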
Add hints for common syntax mistakes that waste proof attempts:

- "unexpected token" / "expected 'by'" → remind about `:= by` syntax
- "Function expected at" / "unknown identifier" → remind about `s.storage N` function-application syntax

Synced to both default_agent.py and interactive_runtime.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Models that use internal reasoning (thinking tokens) can exhaust max_completion_tokens without producing visible output or tool calls (finish_reason=length). This caused 70-90% of turns to be wasted.

Changes:
- Add max_tokens_override to send_chat_completion for adaptive budgets
- Detect finish_reason=length with no output: bump the budget 1.5x (cap 4500) on the first occurrence, inject a simplification nudge on subsequent ones
- Restructure the empty-response path: only evaluate the proof when a new candidate is actually provided (prevents false lean_check_failed)
- Remove empty_response from the retry failure modes (now handled separately)
- Increase the candidate config to 3000 max_completion_tokens, 180s timeout

Results (5-task slice): 3/5 pass (sync_sets_book_value, node_id_bijection, chain_start_threshold). Previous baseline: median 1/5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
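The adaptive-budget step can be sketched as below. The 1.5x factor and 4500-token cap follow the description above; the function shape, the `(budget, bumped, nudge)` return tuple, and the nudge text are assumptions.

```python
TOKEN_BUDGET_CAP = 4500

def adapt_token_budget(finish_reason, visible_output, budget, bumped_before):
    """Handle reasoning models that spend the whole completion budget on
    thinking tokens (finish_reason == "length" with no visible output).

    Returns (new_budget, bumped, nudge_message_or_None).
    """
    if finish_reason == "length" and not visible_output:
        if not bumped_before:
            # First cutoff: raise the budget 1.5x, capped at 4500 tokens.
            return min(int(budget * 1.5), TOKEN_BUDGET_CAP), True, None
        # Subsequent cutoffs: keep the budget, nudge the model to simplify.
        return budget, True, ("Your previous reply was cut off. Answer more "
                              "concisely and call a tool directly.")
    return budget, bumped_before, None
```

The one-time bump plus subsequent nudges keeps a runaway reasoning model from consuming the whole attempt budget on invisible thinking.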
…dance

Add targeted repair hints when Lean goals contain `match` expressions (suggesting `split`/`cases` tactics) and improve `if`-expression hints to suggest `by_cases` instead of generic `native_decide`. Also updates REPORT.md with local validation notes for the uniswap backlog case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Preserves result transcripts from all experimental runs:

- baseline_run: 0/3 passed (initial harness, no improvements)
- run2_instructions_only: 0/5 passed (prompt-only changes)
- run3_aggressive_compact: 0/5 passed (aggressive transcript compaction)
- run4_refined_compact: 3/5 passed (refined compaction + all harness improvements)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without this, the counter accumulated across non-lean turns (search-only turns) and could trigger a premature strategy restart even when recent proof attempts had different failure modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
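A minimal sketch of the corrected accounting, with illustrative names: only turns that actually ran a lean check move the streak, and a passing check resets it.

```python
def update_failure_streak(streak, turn_ran_lean_check, lean_check_failed):
    """Advance the consecutive-failure streak only on turns that actually
    ran a lean check; search-only turns leave it untouched, and a passing
    check resets it to zero."""
    if not turn_ran_lean_check:
        return streak  # search-only turn: streak unchanged
    return streak + 1 if lean_check_failed else 0
```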
…ents)

Resolves merge conflicts by keeping our improved interactive loop (transcript compaction, separate proof/search budgets, strategy restart, adaptive token budget) while accepting main's content updates for README, REPORT, benchmark-inventory, backlog cases, and the harness README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
        "- Use `s.storage 0` (function application) not `s.storage[0]` or `s.storage.0`. "
        "ContractState.storage is a function `Nat → Uint256`."
    )
    return "\n".join(hints)
Repair guidance function fully duplicated across two files
Medium Severity
_build_repair_guidance in interactive_runtime.py is a character-for-character duplicate of build_repair_guidance in default_agent.py. Since default_agent.py already imports from interactive_runtime, one module could expose a single shared implementation. Maintaining identical logic in two places risks future divergence when only one copy gets a bug fix or new hint.
    if tool_name == "run_lean_check" and result.get("failure_mode") == "lean_check_failed":
        last_lean_error = str(result.get("details", ""))[:MAX_ERROR_FEEDBACK_CHARS]
        saw_lean_failure = True
        consecutive_lean_failures += 1
Consecutive lean failure counter increments per tool call
Low Severity
consecutive_lean_failures increments once per run_lean_check tool call, not once per proof-attempt turn. If the model issues multiple write+check cycles in a single turn, the counter advances by more than one while proof_attempts only advances by one. This can cause the RESTART_AFTER_CONSECUTIVE_FAILURES threshold to be reached prematurely, triggering a strategy restart earlier than intended.
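One way to address this, sketched under the review's framing: aggregate the turn's run_lean_check results first and advance the streak by at most one per turn. The function shape and result-dict fields are illustrative assumptions.

```python
def streak_after_turn(tool_results, streak):
    """Per-turn streak update: several failed run_lean_check calls in one
    turn count as a single increment, and any passing check in the turn
    resets the streak."""
    checks = [r for r in tool_results if r.get("tool") == "run_lean_check"]
    if not checks:
        return streak  # no lean checks this turn: streak unchanged
    if any(r.get("failure_mode") is None for r in checks):
        return 0       # at least one check passed
    return streak + 1  # one or more failures still count once
```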


Summary

Risk: Medium. This PR significantly changes interactive execution control flow (turn/attempt accounting, transcript rewriting, token budgeting), which can affect benchmark stability and result formats.
Overview
Improves interactive-mode reliability by turning the agent runner into a multi-turn proof-attempt loop that detects token-length cutoffs, nudges the model toward `write_editable_proof`/`run_lean_check`, and compacts transcripts after Lean failures (with a forced "restart" strategy after repeated failures).

Enhances repair feedback by expanding heuristic guidance for common Lean errors and by returning `repair_hints` directly from `run_lean_check` tool failures.

Updates interactive agent configs to be backend-agnostic (using the `VERITY_BENCHMARK_AGENT_BASE_URL` env var) and adds a new `interactive-candidate.json` profile for candidate runs.

Written by Cursor Bugbot for commit 3afeba2. This will update automatically on new commits.