Genetic sharp playground#2
Merged
jsboige merged 2 commits intoJun 4, 2024
Merged
Conversation
jsboige
added a commit
that referenced
this pull request
Jan 20, 2026
…ground Genetic sharp playground
jsboige
added a commit
that referenced
this pull request
Feb 25, 2026
Added 4 challenge cells to notebooks for session with advanced students: - Challenge #1 (10 pts): Creative prompt for AI poem - 2_PromptEngineering - Challenge #2 (15 pts): Intelligent router with function calling - 4_Function_Calling - Challenge #3 (15 pts): Audio conversational loop with TTS/Whisper - Audio/01-3 - Challenge #4 (10 pts): Project image generation with DALL-E-3 - Image/01-1 Each challenge includes: - Clear objectives and point values - Success criteria checklist - Technical hints - PR submission instructions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsboige
added a commit
that referenced
this pull request
Feb 25, 2026
…cution - Fixed 2_PromptEngineering.ipynb: removed temperature params (gpt-5-mini only supports temp=1.0) - Cleaned Papermill error outputs and fixed notebook structure - Updated point values from 10/15 to 1/1.5 (on 20 scale) - Re-executed notebook with clean outputs Challenges now use appropriate point scale: - Challenge #1 (Prompt creatif): 1 pt - Challenge #2 (Routeur intelligent): 1.5 pts - Challenge #3 (Boucle audio): 1.5 pts - Challenge #4 (Image projet): 1 pt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Mar 14, 2026
Closed
13 tasks
This was referenced Apr 29, 2026
jsboige
added a commit
that referenced
this pull request
May 4, 2026
…oops CRITICAL #1: Separate multi-asset (n_vars=n_assets) from single-asset (n_vars=n_features) modes to fix adjacency dimension mismatch. CRITICAL #2: Add val_loader for early stopping, test set untouched until final evaluation. MAJOR #4: STGAT now builds and passes sector adjacency with self-loops to prevent NaN from isolated nodes (softmax(-inf) on disconnected VIX). MODERATE: normalize_sequences param order (X_val before X_test). MINOR: shuffle=True on train_loader. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6 tasks
jsboige
added a commit
that referenced
this pull request
May 4, 2026
…im fixes) (#713) * feat(ml): add MTGNN & STGAT cross-asset GNN training scripts with tests (Stage 3) Implements Stage 3 (GNN Cross-Asset) of the ML training curriculum: Track A - MTGNN (Wu et al., KDD 2020): - GraphConstructor: learnable adaptive adjacency via node embeddings - MixHopPropagation: multi-hop graph convolution with depth parameter - DilatedCausalConv + TCNBlock: temporal convolution with gated activation - MTGNNModel: adaptive/static/dynamic graph combination (alpha/beta/gamma) - Combined adjacency: A = 0.6*A_adaptive + 0.2*A_static + 0.2*A_pearson Track B - STGAT (Spatial-Temporal Graph Attention): - GraphAttentionLayer + MultiHeadGAT (Velickovic et al., ICLR 2018) - TemporalAttention: scaled dot-product with causal mask - STGATBlock: spatial GAT + temporal attention + FFN with residuals - STGATModel: stacked blocks with positional encoding Both scripts share the Stage 2 interface pattern: - build_sequences / normalize_sequences / compute_majority_class_baseline - Anti-bias policy (panier_anti_bias, 26 symbols, 7 asset classes) - SECTOR_MAP for static adjacency construction - AMP training, cosine annealing, gradient clipping - CLI with --dry-run / --walk-forward, architecture hyperparams - Import Stage 0 discipline: walk_forward, baselines, transaction_costs Tests: 50/50 passing (26 MTGNN + 24 STGAT), CPU-only smoke tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(ml): address PR #713 review — val leakage, adjacency dims, self-loops CRITICAL #1: Separate multi-asset (n_vars=n_assets) from single-asset (n_vars=n_features) modes to fix adjacency dimension mismatch. CRITICAL #2: Add val_loader for early stopping, test set untouched until final evaluation. MAJOR #4: STGAT now builds and passes sector adjacency with self-loops to prevent NaN from isolated nodes (softmax(-inf) on disconnected VIX). MODERATE: normalize_sequences param order (X_val before X_test). MINOR: shuffle=True on train_loader. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
3 tasks
jsboige
added a commit
that referenced
this pull request
May 10, 2026
Verified end-to-end on `_SmokeTest.lean` (zero_add) — agents now talk to LLM, use tools, build via real lake. None of these were detectable without traces. 1. **lean_server.py — WSL lake silent success** `_run_lake_build` was calling `wsl bash -c "source ~/.elan/env > /dev/null 2>&1 && lake build"`. WSL Debian 13 has no elan installed -> `source` failed -> `&&` short-circuited -> subprocess returned empty output with exit 0 -> `_extract_errors([])` = [] -> `success=True`. Every lake build was a no-op silently. Fix: new `_resolve_lake_command()` resolves to Windows lake.exe by default (`~/.elan/bin/lake.exe`), shares `.lake/build/lib/` cache with manual builds (~1.7-2.2s cache hit verified). LEAN_LAKE_BIN override + LEAN_USE_WSL=1 opt-in still supported. Also added defensive synthetic error when returncode != 0 but no parsed errors, so the WSL-style silent success can never re-emerge. Also patched check_axioms() to use the same resolver. 2. **workflow.py — asyncio.wait_for breaks framework telemetry** Wrapping `agent.run()` in `asyncio.wait_for(...)` spawns the coroutine in a child Task with its own context. `AgentTelemetryLayer` sets a ContextVar Token in the Task and tries to reset it — fails with `ValueError: <Token> was created in a different Context`. Verified in trace `multi_SMOKE_ZERO_ADD_local_*.spans.jsonl`: every single LLM call errored with this message, the workflow then propagated `"Agent error: <Token>..."` as the next message to every downstream agent. Net effect: 20 LLM calls, 0 useful output, iteration cap hit. Fix: removed the per-call asyncio.wait_for. The workflow-level wall clock cap in provers.py still bounds total session length. 3. **agents.py — max_tokens unset on reasoning models** z.ai GLM-5.1 / local Qwen3.6 burn 90-99% of output budget in `reasoning_content` before producing visible tokens. Default 2048 max_tokens -> `finish_reason: length` on trivial smoke tests, no tactic ever reaches the workflow. Added `default_options=ChatOptions(max_tokens=...)`: 16384 for reasoning agents (Tactic, Critic, Coordinator), 8192 for fast agents (Search). 4. **otel_setup.py (NEW) — JSONL span exporter** Wraps MS Agent Framework's `configure_otel_providers` so every agent span (chat, tool calls, message routing, executor.process) lands as a line in `baselines/traces/<session>.spans.jsonl`. Idempotent across calls; sensitive data (prompts, completions) captured. Without this exporter, the framework's instrumentation goes to the default console exporter — unusable when the prover is iterating on multi-page Lean errors. The smoke-test trace was 133KB / 68 spans and it's how I found bug #2. 5. **workflow.py — DEFAULT_AGENT_TIMEOUT_S 90 -> 600** Reasoning models routinely take 60-300s wall clock on a single Mathlib goal because of `reasoning_content` spend. 6. **_SmokeTest.lean (NEW)** — trivial `0 + n = n` sorry for end-to-end harness validation. Lets us prove the prover bottoms-out on a target that should be one-shot, *before* running on Voting.lean targets. Out of scope of this commit (next): wiring `prover/knowledge.py` (ProofKnowledgeBase, dead code today) into SearchTools so successful proofs persist across sessions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jsboige
added a commit
that referenced
this pull request
May 11, 2026
…ard (#904) * feat(lean-prover): redirect DEMO 9 to provable L325 + honest sorry guard Root cause of "5+ vagues PROVED reverted" cycle on po-2026 (cf project_lean_median_misdiagnosis.md, 2026-05-10): - DEMO 9 was configured `line=254` which auto-resolved to L261 — the *weak* `median_voter_theorem` sorry. The in-file FIXME comment (L252-260) documents that this version is **unprovable** with `single_peaked` (weak preference) and `margin_pos` (strict). Counter-example written in the comment. Yet the prover kept attacking it, reporting "PROVED", and the PR kept getting reverted. This PR makes the prover refuse honest sorrys and target the real work. CHANGES (Python tooling only, no Lean diff): 1. `prover/config.py` — DEMO 9 redirected to L325 (`median_voter_theorem_strict` counting lemma case `peaks_j < median`). Added DEMO 14 for the symmetric L348 case. New `HONEST_SORRIES` registry lists L261 with the reason ("WEAK version unprovable, use strict variant"). 2. `prover/lean_utils.py` — `is_honest_sorry()` scans comments above a sorry for FIXME / "cannot be proved" / "counter-example" / "needs a stronger hypothesis" / etc. Returns (is_honest, reason). 3. `prover/provers.py`: - `_refuse_honest_sorry()` early-exits both MultiAgentSorryProver and AutonomousProver when the target sorry is honest. Returns a structured refusal dict so callers can distinguish "skipped" from "failed". - Auto-detect "closest sorry line" now excludes lines where `sorry` appears only inside a `--` comment (was matching FIXME lines). - Both provers now load + persist cross-session attempt history. 4. `prover/tools.py` — `SearchTools._known_lemmas` enriched with List/Sort/Pairwise/countP/Finset.filter lemmas needed for the median counting argument (verified in Lean 4.29.1 + current Mathlib). Keyword scoring boosted for counting/sorting/perm/card patterns. 5. `prover/attempt_history.py` (new) — JSON-per-(file, line) record of past tactic attempts. Surfaced to agents as "DO NOT REPEAT" context block. Storage: `prover/.prover_history/<stem>_<hash>_L<line>.json`. VERIFICATION (sanity): - L261 -> is_honest=True (matches FIXME pattern) - L325 / L348 -> is_honest=False (real combinatorial work) - Misconfigured `line=254` -> auto-detect resolves to L261 -> refusal kicks in - attempt_history roundtrip (record + load + format_for_agent) works - All 5 modified Python files compile clean NOT IN THIS PR (deferred for live-test follow-up): - Empirical run of the multi-agent prover on DEMO 9 to measure progress on L325. Will be done in a follow-up commit/PR with full trace logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(lean-prover): L325 step 1 verified + DEMO 9 scaffold + known_lemmas API fix Voting.lean L325: replaced bare sorry with proven Step 1 (complementarity A.card + B.card = n via Finset.card_filter_add_card_filter_not + filter_congr + not_le.symm). Step 2 (A.card <= n/2 via sortedness) marked as the remaining work. lake build SUCCESS, sorry count unchanged (4) but proof more advanced. config.py DEMO 9 scaffolding: now ships the verified Step 1 code as a starting point so po-2026's prover doesn't re-derive complementarity. Documents the 3 remaining steps with named Mathlib lemmas (mergeSort_perm, pairwise_mergeSort, Pairwise.rel_get_of_le, countP_le_length, Finset.length_toList). tools.py SearchTools._known_lemmas: replaced obsolete entry Finset.card_filter_add_card_filter_eq_card with the current Mathlib name Finset.card_filter_add_card_filter_not, plus calling-convention note about named args (s := ...) (p := ...). Also added Finset.filter_congr and not_le with usage hints — both encountered on this iteration. Live ai-01 iteration on the actual Voting.lean (not a unit test) — these are the real frictions that the multi-agent prover needs to bypass without a human in the loop. PR #904 base unchanged; this commit extends it with the concrete progress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(lean-prover): 4 silent-failure bugs found via OTel JSONL traces Verified end-to-end on `_SmokeTest.lean` (zero_add) — agents now talk to LLM, use tools, build via real lake. None of these were detectable without traces. 1. **lean_server.py — WSL lake silent success** `_run_lake_build` was calling `wsl bash -c "source ~/.elan/env > /dev/null 2>&1 && lake build"`. WSL Debian 13 has no elan installed -> `source` failed -> `&&` short-circuited -> subprocess returned empty output with exit 0 -> `_extract_errors([])` = [] -> `success=True`. Every lake build was a no-op silently. Fix: new `_resolve_lake_command()` resolves to Windows lake.exe by default (`~/.elan/bin/lake.exe`), shares `.lake/build/lib/` cache with manual builds (~1.7-2.2s cache hit verified). LEAN_LAKE_BIN override + LEAN_USE_WSL=1 opt-in still supported. Also added defensive synthetic error when returncode != 0 but no parsed errors, so the WSL-style silent success can never re-emerge. Also patched check_axioms() to use the same resolver. 2. **workflow.py — asyncio.wait_for breaks framework telemetry** Wrapping `agent.run()` in `asyncio.wait_for(...)` spawns the coroutine in a child Task with its own context. `AgentTelemetryLayer` sets a ContextVar Token in the Task and tries to reset it — fails with `ValueError: <Token> was created in a different Context`. Verified in trace `multi_SMOKE_ZERO_ADD_local_*.spans.jsonl`: every single LLM call errored with this message, the workflow then propagated `"Agent error: <Token>..."` as the next message to every downstream agent. Net effect: 20 LLM calls, 0 useful output, iteration cap hit. Fix: removed the per-call asyncio.wait_for. The workflow-level wall clock cap in provers.py still bounds total session length. 3. **agents.py — max_tokens unset on reasoning models** z.ai GLM-5.1 / local Qwen3.6 burn 90-99% of output budget in `reasoning_content` before producing visible tokens. Default 2048 max_tokens -> `finish_reason: length` on trivial smoke tests, no tactic ever reaches the workflow. Added `default_options=ChatOptions(max_tokens=...)`: 16384 for reasoning agents (Tactic, Critic, Coordinator), 8192 for fast agents (Search). 4. **otel_setup.py (NEW) — JSONL span exporter** Wraps MS Agent Framework's `configure_otel_providers` so every agent span (chat, tool calls, message routing, executor.process) lands as a line in `baselines/traces/<session>.spans.jsonl`. Idempotent across calls; sensitive data (prompts, completions) captured. Without this exporter, the framework's instrumentation goes to the default console exporter — unusable when the prover is iterating on multi-page Lean errors. The smoke-test trace was 133KB / 68 spans and it's how I found bug #2. 5. **workflow.py — DEFAULT_AGENT_TIMEOUT_S 90 -> 600** Reasoning models routinely take 60-300s wall clock on a single Mathlib goal because of `reasoning_content` spend. 6. **_SmokeTest.lean (NEW)** — trivial `0 + n = n` sorry for end-to-end harness validation. Lets us prove the prover bottoms-out on a target that should be one-shot, *before* running on Voting.lean targets. Out of scope of this commit (next): wiring `prover/knowledge.py` (ProofKnowledgeBase, dead code today) into SearchTools so successful proofs persist across sessions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(lean-prover): bridge submit_tactic/decomposition to msg.tactic + KB wiring Critical bridge fix discovered via trace deep-dive of run iter 1 (BG task bcc62hc5h, DEMO 9 L338 — fr-FR + 4-agent multi-prover): TacticAgent's `submit_tactic` and `submit_decomposition` tools wrote to `state.tactic_history` but never touched the workflow `ProofMessage`. VerifyExecutor reads `msg.tactic` first, finds it None, sets `error="No tactic submitted"`, routes back to TacticAgent until the iteration cap. Result: 0 verify calls in a 12-iteration session. Bridge in AgentExecutor.handle (workflow.py L117-157): - Snapshot `len(state.tactic_history)` before `agent.run` - After agent returns, if history grew, lift latest attempt to `msg.tactic` + `msg.is_decomposition` - Trace-log the bridge event (`role="tactic_bridge"`) Also wires: - `is_decomposition: bool` field on TacticAttempt (state.py) - `add_tactic_attempt(..., is_decomposition=...)` parameter - `submit_decomposition` now actually records via `add_tactic_attempt` (was returning JSON only, never reaching verifier) KB wiring (B.1 from issue #820): - SearchTools accepts `kb` param (defaults to ProofKnowledgeBase singleton) - `lookup_proven_pattern(goal)` exposed as a tool (search agent) - `search_mathlib_lemmas` prepends KB exact + similar before LSP/fallback - VerifyExecutor records (goal,tactic) on success via `kb.record_success` - Single shared `kb` instance flows from MultiAgentSorryProver → SearchTools + ProofWorkflowBuilder → VerifyExecutor (so within-session successes are visible to subsequent SearchAgent calls) Instructions rewrite (no hard-coded Lean tactics — per user directive "sans mettre des conseils specifiques en dur dans le code ce qui reviendrait a tricher"): - SEARCH: tool order (KB → LSP → fallback), no Lean tactics - TACTIC: incremental building, decomposition strategy, generic fix patterns (type mismatch → cast, unfold failed → show) - CRITIC: route by error nature (lemma issue → Search, strategy issue → Coordinator, fragile tactic → Tactic) - COORDINATOR: REQUIRES non-empty plan, natural language only * fix(lean-prover): flush=True on TraceLogger prints for BG observability Without flush, BG runs (no terminal attached) buffered console output indefinitely. The OTel JSONL spans showed real progress while stdout was frozen at the first '[+0.1s] CoordinatorAgent [receive]' line for 8+ minutes. Adding flush=True per-print so 'tail -f' on the BG output file shows live progress. Trivial change, no behavior impact beyond stdout flushing. * fix(lean-prover): bump fast budget to 16384 + empty-response guard Found via BG iter 2 trace deep-dive (multi_VOTING_MEDIAN_COUNTING_LT_zai 1778455641): SearchAgent (local Qwen3.6, fast budget 8192) hit `finish_reason: length` with `parts: []` — burned its entire output budget in reasoning_content and emitted no visible response. Downstream TacticAgent received an empty content string from the inter-agent message and had to redo all the file_read_lines + compile_probe_goal work itself, wasting wall-clock time and tokens. Two fixes: 1. `FAST_MAX_TOKENS = 16384` (was 8192). Local Qwen3.6 needs the same headroom as reasoning models because every call is reasoning-heavy. z.ai GLM-5.1 had similar issues at 2048. 2. AgentExecutor empty-response guard: when an agent returns no visible text AND didn't submit a tactic, inject a synthetic recovery message so the next agent knows to proceed using state + tools directly instead of treating silence as 'nothing to add'. * fix(lean-prover): per-chat-completion timeout 240s + max_retries=1 BG iter 2 (PR #904 branch) hung TacticAgent on a chat call to z.ai for 16+ minutes — Python process alive, CPU 1.9s, idle waiting on HTTP response that never came. Trace mtime froze at 01:32:26 with 51 spans; process killed manually at 01:48Z. Cause: AsyncOpenAI default has no read timeout (httpx.Timeout(None) for read), so a stalled response blocks indefinitely. The workflow-level wall-clock cap (2400s) does eventually kick in, but only after burning the entire budget on one hang. Fix: pass an explicit AsyncOpenAI(timeout=240, max_retries=1) into OpenAIChatCompletionClient so a single chat call can't hang the whole session. Reasoning models legitimately take 30-90s; 240s is generous. With max_retries=1 the framework will retry once before raising, so the empty-response guard added in e9b9b16 picks up the failure cleanly. * feat(lean-prover): commit run_prover_bg.py BG harness + .gitignore traces run_prover_bg.py was created locally to launch the multi-agent prover via Bash run_in_background but was never tracked. Iter 1-3 ran on this script without it being in git, which means iter 4+ on a fresh clone would have nothing to launch. Commit it now. Behavior: - Reads DEMOS dict from prover.config, picks demo_id from CLI - Forces LEAN_PROJECT to social_choice_lean before importing prover modules - Captures pre/post sorry counts on the target file - Emits parseable [BG] lines (iter, FINAL, OTEL_PATH, TRACE_PATH) so external harvest scripts can scrape progress from stdout - TraceLogger writes to prover/baselines/traces/ (per-session subdir) .gitignore: untrack the generated trace dirs and the local proof_knowledge.json (machine-local, would create constant churn). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(lean-prover): trace clarity + multi-message response_text extraction Three small harness fixes spotted while watching iter 3 trace: 1. trace.py: live console line for tool calls printed `set_attack_plan({})` even when the call had 4 KB of args, because log_agent_response wraps the JSON string as `{"call_id": x, "arguments": "..."}` and the print formatter just dumped the whole wrapper truncated to 80 chars. New `_format_tool_args` unwraps `arguments`, falls through to a tool_result preview when the call had no args (e.g. get_proof_state()), and shows "[no args] -> <preview>" so the trace still tells you something happened. 2. tools.py set_attack_plan: tool_args was None when the trace logged the plan, which is what made the trace line print `({})`. Pass `{"steps_count": N, "reason": ...}` explicitly so iter 3+ traces show the count + truncated reason instead of empty parens. 3. workflow.py AgentExecutor: previously `last.text if hasattr(last, 'text') else str(last)` grabbed only the LAST message text. When an agent emits tool calls then text then a result-sentinel as final message, `last.text` is empty and the burned-budget guard fires — even though earlier messages had real content. New `_extract_response_text` walks messages in reverse for the most recent non-empty text, then synthesizes a "[harness summary]" from the last 5 function_call payloads if no text surfaced. Burned-budget guard still fires when both text AND tool calls were truly absent. These don't change the workflow logic, only what the next agent sees / what the live console shows. Effective from iter 4 onwards (iter 3 already running on the old code). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(lean-prover): tighten SearchAgent prompt to stop budget-burn Iter 2 + 3 both showed local Qwen3.6 SearchAgent burning the full output budget (16384 tokens) in reasoning_content with `parts: []` empty visible response. ~236s wall-clock wasted before the empty-response guard fires the fallback. That's 10% of a 40-min cap on no-output. Root cause is prompt scope creep: the agent feels obligated to "think through" the search, when its actual job is just to call 1-3 tools and emit a short candidate list. This patch makes the contract explicit: - "ECONOMIE D'OUTPUT" section at the top forbids in-text reasoning, caps text output at ~200 tokens, mandates a bullet-list format - Stop conditions: 1 KB exact_match OR 3 lemmas → return immediately - get_proof_state max 1 call (was unbounded) - search_mathlib_lemmas max 1-2 reformulations (was unbounded) - file_read_lines marked "rare" (was step 6 default) - Reminds the model that re-reasoning the proof is the TacticAgent's job No tool implementation change; instructions only. Effective from iter 4 onwards. Iter 3 already running on the previous prompt — we'll see if the guard saves it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(lean-prover): allow decomposition + commit partial improvements Two related bugs that combined into all-or-nothing behavior: 1) sorry_guard rejected ANY count increase. Replacing 1 hard sorry by 3 sub-sorries that all build was treated as regression. Now accepts up to original_sorry_count + _decomposition_budget (default 5), with a `decomposition_progress` trace log when the count grows but stays under ceiling. 2) provers.py finally-block restored original whenever final_sorry >= original_sorry_count. A build-passing structural decomposition was thrown away. tools.py now tracks _last_build_ok_content/_sorry_count on every successful build_check; provers.py prefers that snapshot over restore-original, so partial structural progress is committed even when the count didn't drop. Original is only restored when no build-ok edit exists. 3) success predicate gains a `structural_progress` branch so the run isn't logged as FAILED when the file genuinely improved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 tasks
3 tasks
simonnaulet
pushed a commit
to simonnaulet/CoursIA
that referenced
this pull request
May 18, 2026
…26 (jsboige#833) Background: po-2026 dashboard post 2026-05-08T21:22:58Z documented 5 issues on the multi-agent prover after the Cycle 12 baseline run on median_voter_theorem (Voting.lean L251). This PR addresses jsboige#1 / jsboige#2 / jsboige#3 — the three blockers that made the prover unsafe and unusable on external LLM providers. CRITICAL jsboige#1 — `lean_server` import was path-dependent (verifier.py) Before: `from lean_server import LeanVerifier` only resolved when the caller's cwd was `agent_tests/`. From any other context (tests, notebooks, sub-process) it raised `ModuleNotFoundError` and either crashed the prover or — worse — was caught silently and made compile() return success=True regardless of build state. Fix: spec-based import via `importlib.util.spec_from_file_location` using `Path(__file__).parent.parent / "lean_server.py"`. Idempotent via a module-level cache. The verifier now works regardless of cwd. HIGH jsboige#3 — sorry guard was bypassable (tools.py) Before: the guard only counted occurrences of the literal token "sorry". An LLM could "improve" sorry_count by inserting `exact _`, `exact ?_`, `admit`, bare `_`, etc. — all non-compiling placeholders that look like progress. Fix: - new `_check_stub_guard()` rejects 6 known placeholder patterns before writing - new `_build_check_or_revert()` runs `lake build` after each write and reverts the file on failure - both `file_replace_sorry` and `file_replace_lines` gain a `build_check=True` default; callers can opt out for speed but the prover loop now never operates on a broken file HIGH jsboige#2 — multi-agent workflow could hang forever (workflow.py + provers.py) Before: `AgentExecutor.handle()` did `await self._agent.run(...)` with no timeout. A stalled provider (zai/openrouter occasionally do) blocked the entire graph. There was also no iteration cap on the cyclic critic→tactic→verify subgraph. Fix: - per-agent timeout (default 90s) via `asyncio.wait_for` - `ProofMessage.iteration` increments on every traversal; when it exceeds `max_iterations` the executor yields output instead of looping - `MultiAgentSorryProver.prove_sorry()` wraps `workflow.run()` in a wall-clock `asyncio.wait_for` (default `max_iterations * 120s`) Tests New `agent_tests/tests/test_prover_guards.py` (20 tests, all passing): - lean_server class loads via spec import (CRITICAL jsboige#1) - get_verifier instantiates with explicit project_dir - 6 stub patterns rejected by `_check_stub_guard` - 7 real tactics accepted by `_check_stub_guard` - comments containing `exact _` ignored - `file_replace_sorry` blocks `exact _` without writing - `file_replace_sorry` writes a real tactic - `ProofMessage` carries iteration cap + bounded default timeout - `MultiAgentSorryProver.prove_sorry` exposes workflow_timeout_s Install New `agent_tests/requirements.txt` documents the prover deps: `agent-framework-openai>=1.3.0`, `openai>=2.0.0`, `python-dotenv`, `pytest`, `pytest-asyncio`. Verified by installing into `coursia-ml-training` env and running the 20 tests. Out of scope (po-2026 issues jsboige#4 / jsboige#5) jsboige#4 (goal extraction cascading errors on Voting.lean) and jsboige#5 (LLM selection) are separate concerns — jsboige#4 needs Voting.lean fixes upstream, jsboige#5 is a config call left to the operator. Both are tracked under jsboige#833. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 tasks
6 tasks
This was referenced May 26, 2026
jsboige
added a commit
that referenced
this pull request
May 26, 2026
- P1: Extract _record_sorry_count() shared method, eliminating duplication between _build_check_or_revert and compile(). Both callers now use the same single method for delta0 streak tracking. - P2: Make probe cache context window configurable (probe_ctx_lines=10). - P3: Make cumulative_fails_threshold configurable via env var PROVER_CUMULATIVE_FAILS_THRESHOLD (default 3, override to 5 for conservative mode). Addresses Hermes CHANGES_REQUESTED on PR #1616 (concern #1 duplication, concerns #2-#3 magic numbers). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jsboige
added a commit
that referenced
this pull request
May 27, 2026
…ntory (#1616) * fix(prover): harness P1+P2 stagnation guard + probe cache + consolidate Lean inventory P1 HIGH: Move delta0 tracking into _build_check_or_revert so file_replace_lines/sorry also update consecutive_delta0 counter. Previously only compile() updated it, allowing 75+ replace calls to bypass the stagnation guard (20-30h cumulative waste). P2 HIGH: compile_probe_goal cache uses surrounding context hash (±10 lines) instead of full-file MD5. Edits far from the sorry no longer invalidate probe cache. P3 MEDIUM: Lower cumulative_fails_threshold from 5 to 3 for earlier Director escalation. Also: LEAN_INVENTORY.md updated (sorry 4→3 cooperative_games, MechanismDesign added, GO/NO-GO decisions per project). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(prover): address Hermes review — factor delta0 + configurable params - P1: Extract _record_sorry_count() shared method, eliminating duplication between _build_check_or_revert and compile(). Both callers now use the same single method for delta0 streak tracking. - P2: Make probe cache context window configurable (probe_ctx_lines=10). - P3: Make cumulative_fails_threshold configurable via env var PROVER_CUMULATIVE_FAILS_THRESHOLD (default 3, override to 5 for conservative mode). Addresses Hermes CHANGES_REQUESTED on PR #1616 (concern #1 duplication, concerns #2-#3 magic numbers). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(gametheory): execute Lean cells in 15b-CooperativeGames notebook Two Lean 4 code cells (stable marriage definitions, GSState) verified via `lean` CLI in WSL and injected with execution_count + outputs. Notebook now 17/17 cells executed, 0 errors. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(lean): repair lean4-wsl kernel + re-execute 15b with real kernel outputs Root cause: kernel.json pointed to OLD bash wrapper (~/lean4-jupyter-wrapper.sh) which lacks Windows→WSL path conversion and NTFS permission handling. Fixed by pointing to correct Python wrapper (~/.lean4-kernel-wrapper.py v5). - Update kernel.json to use correct Python wrapper with path conversion - Re-execute 15b-CooperativeGames notebook with working lean4-wsl kernel (17/17 cells, 0 errors, 4.6s) replacing manually injected outputs - Update wsl-kernels.md rule with full kernel architecture, diagnostic, setup scripts reference, and incident documentation Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(lean): re-execute 6 Lean intro notebooks via repaired lean4-wsl kernel All notebooks now 100% executed with real kernel outputs: - Lean-2-Dependent-Types: 24/24 (was 20/24) - Lean-3-Propositions-Proofs: 24/24 (was 23/24) - Lean-4-Quantifiers: 22/22 (was 21/22) - Lean-5-Tactics: 37/37 (was 33/37) - Lean-6-Mathlib-Essentials: 19/19 (was 18/19) - Lean-11-TorchLean: 10/10 (was 0/10) Kernel lean4-wsl was repaired in previous commit (955dc2f). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: regenerate catalog + update README markers for Lean notebook re-executions Catalog regenerated (494→498 entries) after re-executing 7 Lean notebooks. Updated CATALOG-STATUS blocks in 4 READMEs: top-level, GameTheory, QuantConnect, SymbolicAI. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(ci): align catalog drift CI with git-tracked notebooks only Local catalog included 4 untracked files (3 gitignored QuantConnect quantbooks + test_lean4_kernel.ipynb) that CI never sees, causing persistent drift failures. Add --git-tracked-only flag to generate_catalog.py and use it in CI workflow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
7 tasks
jsboige
added a commit
that referenced
this pull request
May 28, 2026
…1689) Adds check_kernel_wrapper() to validate_lean_setup.py: - Reads ~/.local/share/jupyter/kernels/lean4-wsl/kernel.json (or %APPDATA%\jupyter\kernels\lean4-wsl\kernel.json on Windows). - ERRORS if argv mentions the obsolete bash wrapper ~/lean4-jupyter-wrapper.sh (incident 2026-05-27). - OK if argv points to ~/.lean4-kernel-wrapper.py (v5). - WARNS otherwise. Wired into both validate_windows() and validate_wsl(). Verified locally (Windows mode): 13/13 OK, wrapper detected correct. Addresses acceptance criterion #2 of #1618 ("Validation script detects wrong wrapper registration"). Remaining criteria (single entry point, archive old wrapper, doc) deferred to follow-up PRs. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 28, 2026
This was referenced May 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
il manque la gestion du risque , on a de la documentation https://www.abcbourse.com/apprendre/19_variance_covariance.html , désolé pour la nomenclature du dossier