Skip to content

Scaffold generic interactive harness improvement loop#9

Merged
Th0rgal merged 16 commits intomainfrom
codex/generic-harness-search-pr
Mar 24, 2026
Merged

Scaffold generic interactive harness improvement loop#9
Th0rgal merged 16 commits intomainfrom
codex/generic-harness-search-pr

Conversation

@Th0rgal
Copy link
Copy Markdown
Member

@Th0rgal Th0rgal commented Mar 23, 2026

Summary

  • scaffold branch for iterative work on generic interactive harness improvements
  • intended follow-up work: search frontier, verifier-guided repair, lightweight decomposition, and low-cost gemini benchmarking on a small parallel slice

Notes

  • this PR intentionally starts with an empty commit
  • benchmark and implementation work will be added incrementally on this branch

Note

Medium Risk
Medium risk because it significantly changes interactive execution control flow (turn/attempt accounting, transcript rewriting, token budgeting) which can affect benchmark stability and result formats.

Overview
Improves interactive-mode reliability by turning the agent runner into a multi-turn proof-attempt loop that detects token-length cutoffs, nudges the model toward write_editable_proof/run_lean_check, and compacts transcripts after Lean failures (with a forced “restart” strategy after repeated failures).

Enhances repair feedback by expanding heuristic guidance for common Lean errors and by returning repair_hints directly from run_lean_check tool failures.

Updates interactive agent configs to be backend-agnostic (use VERITY_BENCHMARK_AGENT_BASE_URL env var) and adds a new interactive-candidate.json profile for candidate runs.

Written by Cursor Bugbot for commit 3afeba2. This will update automatically on new commits. Configure here.

Claude Agent and others added 16 commits March 20, 2026 14:30
Three high-leverage fixes for the interactive proof generation harness:

1. Fix interactive.json: rename `attempts` to `max_attempts` (the field
   the code actually reads), remove unused fields, use env-based
   base_url for portability.

2. Add structured repair guidance to both run_lean_check tool results
   (interactive_runtime.py) and the retry messages (default_agent.py).
   The guidance maps common Lean error patterns to actionable hints
   (split failures, unsolved goals, type mismatches, unknown identifiers).

3. Fix early-exit bug in execute_interactive_agent_task: when the model
   returns text without tool calls and the proof fails, inject a repair
   message and continue the loop instead of immediately returning. This
   lets the model use remaining attempts (up to 16) rather than wasting
   them after a single failed text response.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two changes that together improve 5-task pass rate from 1/5 to 3/5 while
reducing total tokens from 572k to 369k:

1. Interactive mode instructions: tell the model that all files are
   already provided inline, so it should skip redundant read_public_file
   calls and start with write_editable_proof + run_lean_check directly.
   This alone saved ~25% of tokens and improved the hard subset 0/3→1/3.

2. Transcript compaction: after a failed run_lean_check, reset the
   transcript to [base_messages + exploration phase + compact repair msg]
   instead of accumulating the full history of failed attempts. The
   exploration phase (file reads, searches) is preserved so the model
   retains context, but old write-check cycles are replaced with a single
   repair message containing the latest failed proof + error + guidance.
   Also adds early-exit when run_lean_check returns passed.

Measured on the 5-task confirmation slice (builtin/fast = glm-5-turbo):
  chain_start_threshold: failed → passed (6k tok, 7s)
  deposit_count: failed → failed (115k tok)
  node_id_bijection: passed → passed (96k tok)
  sync_sets_book_value: failed → passed (23k tok, 21s)
  claim_marks_user: failed → failed (127k tok)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enhanced build_repair_guidance (both default_agent.py and
interactive_runtime.py) with better error-specific hints:

- "unknown constant": explicitly tell the model not to guess lemma
  names and to use simp/omega/decide instead
- "simp made no progress": suggest adding more definitions or removing
  unproductive simp calls
- "failed to infer binder type": hint to add explicit type annotations

These changes help the model avoid common failure modes where it
hallucinates Lean 4 lemma names that don't exist in the environment.

Measured: deposit_count went from never passing to sometimes passing
(high variance due to model non-determinism at temp=0 on the proxy).
Overall 5-task slice holds at 3/5 with 3 reliably passing tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After 6 consecutive failed lean checks, compact the transcript without
the failed proof and instead instruct the model to try a fundamentally
different proof strategy. This prevents the model from getting stuck in
a local minimum where it makes minor variations on the same failing
approach.

The restart threshold of 6 was chosen to avoid interfering with tasks
that are close to solving (which may need 3-5 repair iterations), while
still helping tasks where the model is genuinely stuck.

5-task confirmation: 3/5 passed, 345k tokens (stable).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Search-only turns (read_public_file, search_public_defs) no longer count
toward the proof attempt budget. This gives the model ~30-40% more actual
proof attempts. A nudge message is injected after 2 consecutive search-only
turns to redirect the model back to proof writing. Total turns are capped
at 2x max_attempts to prevent infinite search loops.

Before: deposit_count got 5 proof attempts out of 17 total turns
After:  deposit_count gets 16 proof attempts out of 20 total turns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add hints for common syntax mistakes that waste proof attempts:
- "unexpected token" / "expected 'by'" -> remind about := by syntax
- "Function expected at" / "unknown identifier" -> remind about
  s.storage N function application syntax

Synced to both default_agent.py and interactive_runtime.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Models that use internal reasoning (thinking tokens) can exhaust
max_completion_tokens without producing visible output or tool calls
(finish_reason=length). This caused 70-90% of turns to be wasted.

Changes:
- Add max_tokens_override to send_chat_completion for adaptive budgets
- Detect finish_reason=length with no output: bump budget 1.5x (cap 4500)
  on first occurrence, inject simplification nudge on subsequent ones
- Restructure empty-response path: only evaluate proof when a new
  candidate is actually provided (prevents false lean_check_failed)
- Remove empty_response from retry failure modes (now handled separately)
- Increase candidate config to 3000 max_completion_tokens, 180s timeout

Results (5-task slice): 3/5 pass (sync_sets_book_value, node_id_bijection,
chain_start_threshold). Previous baseline: median 1/5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dance

Add targeted repair hints when Lean goals contain `match` expressions
(suggesting `split`/`cases` tactics) and improve `if`-expression hints
to suggest `by_cases` instead of generic `native_decide`. Also updates
REPORT.md with local validation notes for the uniswap backlog case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Preserves result transcripts from all experimental runs:
- baseline_run: 0/3 passed (initial harness, no improvements)
- run2_instructions_only: 0/5 passed (prompt-only changes)
- run3_aggressive_compact: 0/5 passed (aggressive transcript compaction)
- run4_refined_compact: 3/5 passed (refined compaction + all harness improvements)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without this, the counter accumulated across non-lean turns (search-only
turns) and could trigger a premature strategy restart even when recent
proof attempts had different failure modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ents)

Resolves merge conflicts by keeping our improved interactive loop
(transcript compaction, separate proof/search budgets, strategy restart,
adaptive token budget) while accepting main's content updates for
README, REPORT, benchmark-inventory, backlog cases, and harness README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Th0rgal Th0rgal marked this pull request as ready for review March 24, 2026 07:51
@Th0rgal Th0rgal merged commit 1136ecc into main Mar 24, 2026
2 checks passed
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

"- Use `s.storage 0` (function application) not `s.storage[0]` or `s.storage.0`. "
"ContractState.storage is a function `Nat → Uint256`."
)
return "\n".join(hints)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repair guidance function fully duplicated across two files

Medium Severity

_build_repair_guidance in interactive_runtime.py is a character-for-character duplicate of build_repair_guidance in default_agent.py. Since default_agent.py already imports from interactive_runtime, one module could expose a single shared implementation. Maintaining identical logic in two places risks future divergence when only one copy gets a bug fix or new hint.

Additional Locations (1)
Fix in Cursor Fix in Web

Comment thread harness/default_agent.py
if tool_name == "run_lean_check" and result.get("failure_mode") == "lean_check_failed":
last_lean_error = str(result.get("details", ""))[:MAX_ERROR_FEEDBACK_CHARS]
saw_lean_failure = True
consecutive_lean_failures += 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consecutive lean failure counter increments per tool call

Low Severity

consecutive_lean_failures increments once per run_lean_check tool call, not once per proof-attempt turn. If the model issues multiple write+check cycles in a single turn, the counter advances by more than one while proof_attempts only advances by one. This can cause the RESTART_AFTER_CONSECUTIVE_FAILURES threshold to be reached prematurely, triggering a strategy restart earlier than intended.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants