Skip to content

Genetic sharp playground#2

Merged
jsboige merged 2 commits into
jsboige:mainfrom
AntoineBendafiSchulmann:genetic_sharp_playground
Jun 4, 2024
Merged

Genetic sharp playground#2
jsboige merged 2 commits into
jsboige:mainfrom
AntoineBendafiSchulmann:genetic_sharp_playground

Conversation

@AntoineBendafiSchulmann
Copy link
Copy Markdown

il manque la gestion du risque , on a de la documentation https://www.abcbourse.com/apprendre/19_variance_covariance.html , désolé pour la nomenclature du dossier

@jsboige jsboige merged commit ac63f70 into jsboige:main Jun 4, 2024
jsboige pushed a commit that referenced this pull request Oct 8, 2025
jsboige added a commit that referenced this pull request Jan 20, 2026
jsboige pushed a commit that referenced this pull request Jan 20, 2026
jsboige added a commit that referenced this pull request Feb 25, 2026
Added 4 challenge cells to notebooks for session with advanced students:
- Challenge #1 (10 pts): Creative prompt for AI poem - 2_PromptEngineering
- Challenge #2 (15 pts): Intelligent router with function calling - 4_Function_Calling
- Challenge #3 (15 pts): Audio conversational loop with TTS/Whisper - Audio/01-3
- Challenge #4 (10 pts): Project image generation with DALL-E-3 - Image/01-1

Each challenge includes:
- Clear objectives and point values
- Success criteria checklist
- Technical hints
- PR submission instructions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsboige added a commit that referenced this pull request Feb 25, 2026
…cution

- Fixed 2_PromptEngineering.ipynb: removed temperature params (gpt-5-mini only supports temp=1.0)
- Cleaned Papermill error outputs and fixed notebook structure
- Updated point values from 10/15 to 1/1.5 (on 20 scale)
- Re-executed notebook with clean outputs

Challenges now use appropriate point scale:
- Challenge #1 (Prompt creatif): 1 pt
- Challenge #2 (Routeur intelligent): 1.5 pts
- Challenge #3 (Boucle audio): 1.5 pts
- Challenge #4 (Image projet): 1 pt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jsboige added a commit that referenced this pull request May 4, 2026
…oops

CRITICAL #1: Separate multi-asset (n_vars=n_assets) from single-asset
(n_vars=n_features) modes to fix adjacency dimension mismatch.
CRITICAL #2: Add val_loader for early stopping, test set untouched
until final evaluation.
MAJOR #4: STGAT now builds and passes sector adjacency with self-loops
to prevent NaN from isolated nodes (softmax(-inf) on disconnected VIX).
MODERATE: normalize_sequences param order (X_val before X_test).
MINOR: shuffle=True on train_loader.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jsboige added a commit that referenced this pull request May 4, 2026
…im fixes) (#713)

* feat(ml): add MTGNN & STGAT cross-asset GNN training scripts with tests (Stage 3)

Implements Stage 3 (GNN Cross-Asset) of the ML training curriculum:

Track A - MTGNN (Wu et al., KDD 2020):
- GraphConstructor: learnable adaptive adjacency via node embeddings
- MixHopPropagation: multi-hop graph convolution with depth parameter
- DilatedCausalConv + TCNBlock: temporal convolution with gated activation
- MTGNNModel: adaptive/static/dynamic graph combination (alpha/beta/gamma)
- Combined adjacency: A = 0.6*A_adaptive + 0.2*A_static + 0.2*A_pearson

Track B - STGAT (Spatial-Temporal Graph Attention):
- GraphAttentionLayer + MultiHeadGAT (Velickovic et al., ICLR 2018)
- TemporalAttention: scaled dot-product with causal mask
- STGATBlock: spatial GAT + temporal attention + FFN with residuals
- STGATModel: stacked blocks with positional encoding

Both scripts share the Stage 2 interface pattern:
- build_sequences / normalize_sequences / compute_majority_class_baseline
- Anti-bias policy (panier_anti_bias, 26 symbols, 7 asset classes)
- SECTOR_MAP for static adjacency construction
- AMP training, cosine annealing, gradient clipping
- CLI with --dry-run / --walk-forward, architecture hyperparams
- Import Stage 0 discipline: walk_forward, baselines, transaction_costs

Tests: 50/50 passing (26 MTGNN + 24 STGAT), CPU-only smoke tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ml): address PR #713 review — val leakage, adjacency dims, self-loops

CRITICAL #1: Separate multi-asset (n_vars=n_assets) from single-asset
(n_vars=n_features) modes to fix adjacency dimension mismatch.
CRITICAL #2: Add val_loader for early stopping, test set untouched
until final evaluation.
MAJOR #4: STGAT now builds and passes sector adjacency with self-loops
to prevent NaN from isolated nodes (softmax(-inf) on disconnected VIX).
MODERATE: normalize_sequences param order (X_val before X_test).
MINOR: shuffle=True on train_loader.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jsboige added a commit that referenced this pull request May 5, 2026
Switches compute_rsi() from rolling(period).mean() (SMA) to
ewm(alpha=1/period) (Wilder exponential smoothing).

This aligns RSI values with TradingView, Yahoo Finance, and
standard financial platforms. Finding #2 from pipeline audit (#705).
jsboige added a commit that referenced this pull request May 5, 2026
Switches compute_rsi() from rolling(period).mean() (SMA) to
ewm(alpha=1/period) (Wilder exponential smoothing).

This aligns RSI values with TradingView, Yahoo Finance, and
standard financial platforms. Finding #2 from pipeline audit (#705).
jsboige added a commit that referenced this pull request May 10, 2026
Verified end-to-end on `_SmokeTest.lean` (zero_add) — agents now talk to LLM,
use tools, build via real lake. None of these were detectable without traces.

1. **lean_server.py — WSL lake silent success**
   `_run_lake_build` was calling `wsl bash -c "source ~/.elan/env > /dev/null 2>&1 && lake build"`.
   WSL Debian 13 has no elan installed -> `source` failed -> `&&` short-circuited ->
   subprocess returned empty output with exit 0 -> `_extract_errors([])` = []
   -> `success=True`. Every lake build was a no-op silently.

   Fix: new `_resolve_lake_command()` resolves to Windows lake.exe by default
   (`~/.elan/bin/lake.exe`), shares `.lake/build/lib/` cache with manual builds
   (~1.7-2.2s cache hit verified). LEAN_LAKE_BIN override + LEAN_USE_WSL=1
   opt-in still supported. Also added defensive synthetic error when
   returncode != 0 but no parsed errors, so the WSL-style silent success
   can never re-emerge.

   Also patched check_axioms() to use the same resolver.

2. **workflow.py — asyncio.wait_for breaks framework telemetry**
   Wrapping `agent.run()` in `asyncio.wait_for(...)` spawns the coroutine
   in a child Task with its own context. `AgentTelemetryLayer` sets a
   ContextVar Token in the Task and tries to reset it — fails with
   `ValueError: <Token> was created in a different Context`.

   Verified in trace `multi_SMOKE_ZERO_ADD_local_*.spans.jsonl`: every
   single LLM call errored with this message, the workflow then propagated
   `"Agent error: <Token>..."` as the next message to every downstream
   agent. Net effect: 20 LLM calls, 0 useful output, iteration cap hit.

   Fix: removed the per-call asyncio.wait_for. The workflow-level wall
   clock cap in provers.py still bounds total session length.

3. **agents.py — max_tokens unset on reasoning models**
   z.ai GLM-5.1 / local Qwen3.6 burn 90-99% of output budget in
   `reasoning_content` before producing visible tokens. Default 2048
   max_tokens -> `finish_reason: length` on trivial smoke tests, no
   tactic ever reaches the workflow. Added `default_options=ChatOptions(max_tokens=...)`:
   16384 for reasoning agents (Tactic, Critic, Coordinator), 8192 for fast
   agents (Search).

4. **otel_setup.py (NEW) — JSONL span exporter**
   Wraps MS Agent Framework's `configure_otel_providers` so every agent
   span (chat, tool calls, message routing, executor.process) lands as a
   line in `baselines/traces/<session>.spans.jsonl`. Idempotent across
   calls; sensitive data (prompts, completions) captured.

   Without this exporter, the framework's instrumentation goes to the
   default console exporter — unusable when the prover is iterating on
   multi-page Lean errors. The smoke-test trace was 133KB / 68 spans and
   it's how I found bug #2.

5. **workflow.py — DEFAULT_AGENT_TIMEOUT_S 90 -> 600**
   Reasoning models routinely take 60-300s wall clock on a single Mathlib
   goal because of `reasoning_content` spend.

6. **_SmokeTest.lean (NEW)** — trivial `0 + n = n` sorry for end-to-end
   harness validation. Lets us prove the prover bottoms-out on a target
   that should be one-shot, *before* running on Voting.lean targets.

Out of scope of this commit (next): wiring `prover/knowledge.py`
(ProofKnowledgeBase, dead code today) into SearchTools so successful
proofs persist across sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jsboige added a commit that referenced this pull request May 11, 2026
…ard (#904)

* feat(lean-prover): redirect DEMO 9 to provable L325 + honest sorry guard

Root cause of "5+ vagues PROVED reverted" cycle on po-2026 (cf
project_lean_median_misdiagnosis.md, 2026-05-10):

- DEMO 9 was configured `line=254` which auto-resolved to L261 — the
  *weak* `median_voter_theorem` sorry. The in-file FIXME comment (L252-260)
  documents that this version is **unprovable** with `single_peaked` (weak
  preference) and `margin_pos` (strict). Counter-example written in the
  comment. Yet the prover kept attacking it, reporting "PROVED", and the
  PR kept getting reverted.

This PR makes the prover refuse honest sorrys and target the real work.

CHANGES (Python tooling only, no Lean diff):

1. `prover/config.py` — DEMO 9 redirected to L325 (`median_voter_theorem_strict`
   counting lemma case `peaks_j < median`). Added DEMO 14 for the symmetric
   L348 case. New `HONEST_SORRIES` registry lists L261 with the reason
   ("WEAK version unprovable, use strict variant").

2. `prover/lean_utils.py` — `is_honest_sorry()` scans comments above a sorry
   for FIXME / "cannot be proved" / "counter-example" / "needs a stronger
   hypothesis" / etc. Returns (is_honest, reason).

3. `prover/provers.py`:
   - `_refuse_honest_sorry()` early-exits both MultiAgentSorryProver and
     AutonomousProver when the target sorry is honest. Returns a structured
     refusal dict so callers can distinguish "skipped" from "failed".
   - Auto-detect "closest sorry line" now excludes lines where `sorry`
     appears only inside a `--` comment (was matching FIXME lines).
   - Both provers now load + persist cross-session attempt history.

4. `prover/tools.py` — `SearchTools._known_lemmas` enriched with
   List/Sort/Pairwise/countP/Finset.filter lemmas needed for the median
   counting argument (verified in Lean 4.29.1 + current Mathlib). Keyword
   scoring boosted for counting/sorting/perm/card patterns.

5. `prover/attempt_history.py` (new) — JSON-per-(file, line) record of past
   tactic attempts. Surfaced to agents as "DO NOT REPEAT" context block.
   Storage: `prover/.prover_history/<stem>_<hash>_L<line>.json`.

VERIFICATION (sanity):
- L261 -> is_honest=True (matches FIXME pattern)
- L325 / L348 -> is_honest=False (real combinatorial work)
- Misconfigured `line=254` -> auto-detect resolves to L261 -> refusal kicks in
- attempt_history roundtrip (record + load + format_for_agent) works
- All 5 modified Python files compile clean

NOT IN THIS PR (deferred for live-test follow-up):
- Empirical run of the multi-agent prover on DEMO 9 to measure progress
  on L325. Will be done in a follow-up commit/PR with full trace logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(lean-prover): L325 step 1 verified + DEMO 9 scaffold + known_lemmas API fix

Voting.lean L325: replaced bare sorry with proven Step 1 (complementarity
A.card + B.card = n via Finset.card_filter_add_card_filter_not + filter_congr
+ not_le.symm). Step 2 (A.card <= n/2 via sortedness) marked as the remaining
work. lake build SUCCESS, sorry count unchanged (4) but proof more advanced.

config.py DEMO 9 scaffolding: now ships the verified Step 1 code as a starting
point so po-2026's prover doesn't re-derive complementarity. Documents the 3
remaining steps with named Mathlib lemmas (mergeSort_perm, pairwise_mergeSort,
Pairwise.rel_get_of_le, countP_le_length, Finset.length_toList).

tools.py SearchTools._known_lemmas: replaced obsolete entry
Finset.card_filter_add_card_filter_eq_card with the current Mathlib name
Finset.card_filter_add_card_filter_not, plus calling-convention note about
named args (s := ...) (p := ...). Also added Finset.filter_congr and not_le
with usage hints — both encountered on this iteration.

Live ai-01 iteration on the actual Voting.lean (not a unit test) — these are
the real frictions that the multi-agent prover needs to bypass without a
human in the loop. PR #904 base unchanged; this commit extends it with the
concrete progress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lean-prover): 4 silent-failure bugs found via OTel JSONL traces

Verified end-to-end on `_SmokeTest.lean` (zero_add) — agents now talk to LLM,
use tools, build via real lake. None of these were detectable without traces.

1. **lean_server.py — WSL lake silent success**
   `_run_lake_build` was calling `wsl bash -c "source ~/.elan/env > /dev/null 2>&1 && lake build"`.
   WSL Debian 13 has no elan installed -> `source` failed -> `&&` short-circuited ->
   subprocess returned empty output with exit 0 -> `_extract_errors([])` = []
   -> `success=True`. Every lake build was a no-op silently.

   Fix: new `_resolve_lake_command()` resolves to Windows lake.exe by default
   (`~/.elan/bin/lake.exe`), shares `.lake/build/lib/` cache with manual builds
   (~1.7-2.2s cache hit verified). LEAN_LAKE_BIN override + LEAN_USE_WSL=1
   opt-in still supported. Also added defensive synthetic error when
   returncode != 0 but no parsed errors, so the WSL-style silent success
   can never re-emerge.

   Also patched check_axioms() to use the same resolver.

2. **workflow.py — asyncio.wait_for breaks framework telemetry**
   Wrapping `agent.run()` in `asyncio.wait_for(...)` spawns the coroutine
   in a child Task with its own context. `AgentTelemetryLayer` sets a
   ContextVar Token in the Task and tries to reset it — fails with
   `ValueError: <Token> was created in a different Context`.

   Verified in trace `multi_SMOKE_ZERO_ADD_local_*.spans.jsonl`: every
   single LLM call errored with this message, the workflow then propagated
   `"Agent error: <Token>..."` as the next message to every downstream
   agent. Net effect: 20 LLM calls, 0 useful output, iteration cap hit.

   Fix: removed the per-call asyncio.wait_for. The workflow-level wall
   clock cap in provers.py still bounds total session length.

3. **agents.py — max_tokens unset on reasoning models**
   z.ai GLM-5.1 / local Qwen3.6 burn 90-99% of output budget in
   `reasoning_content` before producing visible tokens. Default 2048
   max_tokens -> `finish_reason: length` on trivial smoke tests, no
   tactic ever reaches the workflow. Added `default_options=ChatOptions(max_tokens=...)`:
   16384 for reasoning agents (Tactic, Critic, Coordinator), 8192 for fast
   agents (Search).

4. **otel_setup.py (NEW) — JSONL span exporter**
   Wraps MS Agent Framework's `configure_otel_providers` so every agent
   span (chat, tool calls, message routing, executor.process) lands as a
   line in `baselines/traces/<session>.spans.jsonl`. Idempotent across
   calls; sensitive data (prompts, completions) captured.

   Without this exporter, the framework's instrumentation goes to the
   default console exporter — unusable when the prover is iterating on
   multi-page Lean errors. The smoke-test trace was 133KB / 68 spans and
   it's how I found bug #2.

5. **workflow.py — DEFAULT_AGENT_TIMEOUT_S 90 -> 600**
   Reasoning models routinely take 60-300s wall clock on a single Mathlib
   goal because of `reasoning_content` spend.

6. **_SmokeTest.lean (NEW)** — trivial `0 + n = n` sorry for end-to-end
   harness validation. Lets us prove the prover bottoms-out on a target
   that should be one-shot, *before* running on Voting.lean targets.

Out of scope of this commit (next): wiring `prover/knowledge.py`
(ProofKnowledgeBase, dead code today) into SearchTools so successful
proofs persist across sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lean-prover): bridge submit_tactic/decomposition to msg.tactic + KB wiring

Critical bridge fix discovered via trace deep-dive of run iter 1
(BG task bcc62hc5h, DEMO 9 L338 — fr-FR + 4-agent multi-prover):
TacticAgent's `submit_tactic` and `submit_decomposition` tools wrote to
`state.tactic_history` but never touched the workflow `ProofMessage`.
VerifyExecutor reads `msg.tactic` first, finds it None, sets
`error="No tactic submitted"`, routes back to TacticAgent until the
iteration cap. Result: 0 verify calls in a 12-iteration session.

Bridge in AgentExecutor.handle (workflow.py L117-157):
  - Snapshot `len(state.tactic_history)` before `agent.run`
  - After agent returns, if history grew, lift latest attempt to
    `msg.tactic` + `msg.is_decomposition`
  - Trace-log the bridge event (`role="tactic_bridge"`)

Also wires:
  - `is_decomposition: bool` field on TacticAttempt (state.py)
  - `add_tactic_attempt(..., is_decomposition=...)` parameter
  - `submit_decomposition` now actually records via `add_tactic_attempt`
    (was returning JSON only, never reaching verifier)

KB wiring (B.1 from issue #820):
  - SearchTools accepts `kb` param (defaults to ProofKnowledgeBase singleton)
  - `lookup_proven_pattern(goal)` exposed as a tool (search agent)
  - `search_mathlib_lemmas` prepends KB exact + similar before LSP/fallback
  - VerifyExecutor records (goal,tactic) on success via `kb.record_success`
  - Single shared `kb` instance flows from MultiAgentSorryProver →
    SearchTools + ProofWorkflowBuilder → VerifyExecutor (so within-session
    successes are visible to subsequent SearchAgent calls)

Instructions rewrite (no hard-coded Lean tactics — per user directive
"sans mettre des conseils specifiques en dur dans le code ce qui
reviendrait a tricher"):
  - SEARCH: tool order (KB → LSP → fallback), no Lean tactics
  - TACTIC: incremental building, decomposition strategy, generic fix
    patterns (type mismatch → cast, unfold failed → show)
  - CRITIC: route by error nature (lemma issue → Search,
    strategy issue → Coordinator, fragile tactic → Tactic)
  - COORDINATOR: REQUIRES non-empty plan, natural language only

* fix(lean-prover): flush=True on TraceLogger prints for BG observability

Without flush, BG runs (no terminal attached) buffered console output
indefinitely. The OTel JSONL spans showed real progress while stdout
was frozen at the first '[+0.1s] CoordinatorAgent [receive]' line for
8+ minutes. Adding flush=True per-print so 'tail -f' on the BG output
file shows live progress.

Trivial change, no behavior impact beyond stdout flushing.

* fix(lean-prover): bump fast budget to 16384 + empty-response guard

Found via BG iter 2 trace deep-dive (multi_VOTING_MEDIAN_COUNTING_LT_zai
1778455641): SearchAgent (local Qwen3.6, fast budget 8192) hit
`finish_reason: length` with `parts: []` — burned its entire output
budget in reasoning_content and emitted no visible response. Downstream
TacticAgent received an empty content string from the inter-agent
message and had to redo all the file_read_lines + compile_probe_goal
work itself, wasting wall-clock time and tokens.

Two fixes:
1. `FAST_MAX_TOKENS = 16384` (was 8192). Local Qwen3.6 needs the same
   headroom as reasoning models because every call is reasoning-heavy.
   z.ai GLM-5.1 had similar issues at 2048.
2. AgentExecutor empty-response guard: when an agent returns no visible
   text AND didn't submit a tactic, inject a synthetic recovery message
   so the next agent knows to proceed using state + tools directly
   instead of treating silence as 'nothing to add'.

* fix(lean-prover): per-chat-completion timeout 240s + max_retries=1

BG iter 2 (PR #904 branch) hung TacticAgent on a chat call to z.ai for
16+ minutes — Python process alive, CPU 1.9s, idle waiting on HTTP
response that never came. Trace mtime froze at 01:32:26 with 51 spans;
process killed manually at 01:48Z.

Cause: AsyncOpenAI default has no read timeout (httpx.Timeout(None) for
read), so a stalled response blocks indefinitely. The workflow-level
wall-clock cap (2400s) does eventually kick in, but only after burning
the entire budget on one hang.

Fix: pass an explicit AsyncOpenAI(timeout=240, max_retries=1) into
OpenAIChatCompletionClient so a single chat call can't hang the whole
session. Reasoning models legitimately take 30-90s; 240s is generous.
With max_retries=1 the framework will retry once before raising, so the
empty-response guard added in e9b9b16 picks up the failure cleanly.

* feat(lean-prover): commit run_prover_bg.py BG harness + .gitignore traces

run_prover_bg.py was created locally to launch the multi-agent prover via
Bash run_in_background but was never tracked. Iter 1-3 ran on this script
without it being in git, which means iter 4+ on a fresh clone would have
nothing to launch. Commit it now.

Behavior:
- Reads DEMOS dict from prover.config, picks demo_id from CLI
- Forces LEAN_PROJECT to social_choice_lean before importing prover modules
- Captures pre/post sorry counts on the target file
- Emits parseable [BG] lines (iter, FINAL, OTEL_PATH, TRACE_PATH) so
  external harvest scripts can scrape progress from stdout
- TraceLogger writes to prover/baselines/traces/ (per-session subdir)

.gitignore: untrack the generated trace dirs and the local
proof_knowledge.json (machine-local, would create constant churn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lean-prover): trace clarity + multi-message response_text extraction

Three small harness fixes spotted while watching iter 3 trace:

1. trace.py: live console line for tool calls printed `set_attack_plan({})`
   even when the call had 4 KB of args, because log_agent_response wraps
   the JSON string as `{"call_id": x, "arguments": "..."}` and the print
   formatter just dumped the whole wrapper truncated to 80 chars. New
   `_format_tool_args` unwraps `arguments`, falls through to a tool_result
   preview when the call had no args (e.g. get_proof_state()), and shows
   "[no args] -> <preview>" so the trace still tells you something happened.

2. tools.py set_attack_plan: tool_args was None when the trace logged the
   plan, which is what made the trace line print `({})`. Pass
   `{"steps_count": N, "reason": ...}` explicitly so iter 3+ traces show
   the count + truncated reason instead of empty parens.

3. workflow.py AgentExecutor: previously `last.text if hasattr(last, 'text')
   else str(last)` grabbed only the LAST message text. When an agent emits
   tool calls then text then a result-sentinel as final message, `last.text`
   is empty and the burned-budget guard fires — even though earlier messages
   had real content. New `_extract_response_text` walks messages in reverse
   for the most recent non-empty text, then synthesizes a "[harness summary]"
   from the last 5 function_call payloads if no text surfaced. Burned-budget
   guard still fires when both text AND tool calls were truly absent.

These don't change the workflow logic, only what the next agent sees /
what the live console shows. Effective from iter 4 onwards (iter 3 already
running on the old code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lean-prover): tighten SearchAgent prompt to stop budget-burn

Iter 2 + 3 both showed local Qwen3.6 SearchAgent burning the full
output budget (16384 tokens) in reasoning_content with `parts: []` empty
visible response. ~236s wall-clock wasted before the empty-response guard
fires the fallback. That's 10% of a 40-min cap on no-output.

Root cause is prompt scope creep: the agent feels obligated to "think
through" the search, when its actual job is just to call 1-3 tools and
emit a short candidate list.

This patch makes the contract explicit:
- "ECONOMIE D'OUTPUT" section at the top forbids in-text reasoning,
  caps text output at ~200 tokens, mandates a bullet-list format
- Stop conditions: 1 KB exact_match OR 3 lemmas → return immediately
- get_proof_state max 1 call (was unbounded)
- search_mathlib_lemmas max 1-2 reformulations (was unbounded)
- file_read_lines marked "rare" (was step 6 default)
- Reminds the model that re-reasoning the proof is the TacticAgent's job

No tool implementation change; instructions only. Effective from iter 4
onwards. Iter 3 already running on the previous prompt — we'll see if
the guard saves it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lean-prover): allow decomposition + commit partial improvements

Two related bugs that combined into all-or-nothing behavior:

1) sorry_guard rejected ANY count increase. Replacing 1 hard sorry by 3
   sub-sorries that all build was treated as regression. Now accepts up
   to original_sorry_count + _decomposition_budget (default 5), with a
   `decomposition_progress` trace log when the count grows but stays
   under ceiling.

2) provers.py finally-block restored original whenever final_sorry >=
   original_sorry_count. A build-passing structural decomposition was
   thrown away. tools.py now tracks _last_build_ok_content/_sorry_count
   on every successful build_check; provers.py prefers that snapshot
   over restore-original, so partial structural progress is committed
   even when the count didn't drop. Original is only restored when no
   build-ok edit exists.

3) success predicate gains a `structural_progress` branch so the run
   isn't logged as FAILED when the file genuinely improved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jsboige added a commit that referenced this pull request May 14, 2026
- Fix stale SHA/branch reference (use branch name only)
- Add QWEN_ASR_API_KEY and FUNASR_API_KEY to GenAI/Audio table (#3)
- Add HUGGINGFACE_TOKEN to Phase 3 distribution table (#4)
- Update PR title to reflect Phase 1+2+3 scope (#2)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
simonnaulet pushed a commit to simonnaulet/CoursIA that referenced this pull request May 18, 2026
…26 (jsboige#833)

Background: po-2026 dashboard post 2026-05-08T21:22:58Z documented 5 issues
on the multi-agent prover after the Cycle 12 baseline run on
median_voter_theorem (Voting.lean L251). This PR addresses jsboige#1 / jsboige#2 / jsboige#3 —
the three blockers that made the prover unsafe and unusable on external
LLM providers.

CRITICAL jsboige#1 — `lean_server` import was path-dependent (verifier.py)
  Before: `from lean_server import LeanVerifier` only resolved when the
  caller's cwd was `agent_tests/`. From any other context (tests,
  notebooks, sub-process) it raised `ModuleNotFoundError` and either
  crashed the prover or — worse — was caught silently and made compile()
  return success=True regardless of build state.
  Fix: spec-based import via `importlib.util.spec_from_file_location`
  using `Path(__file__).parent.parent / "lean_server.py"`. Idempotent
  via a module-level cache. The verifier now works regardless of cwd.

HIGH jsboige#3 — sorry guard was bypassable (tools.py)
  Before: the guard only counted occurrences of the literal token
  "sorry". An LLM could "improve" sorry_count by inserting `exact _`,
  `exact ?_`, `admit`, bare `_`, etc. — all non-compiling placeholders
  that look like progress.
  Fix:
   - new `_check_stub_guard()` rejects 6 known placeholder patterns
     before writing
   - new `_build_check_or_revert()` runs `lake build` after each write
     and reverts the file on failure
   - both `file_replace_sorry` and `file_replace_lines` gain a
     `build_check=True` default; callers can opt out for speed but the
     prover loop now never operates on a broken file

HIGH jsboige#2 — multi-agent workflow could hang forever (workflow.py + provers.py)
  Before: `AgentExecutor.handle()` did `await self._agent.run(...)` with
  no timeout. A stalled provider (zai/openrouter occasionally do)
  blocked the entire graph. There was also no iteration cap on the
  cyclic critic→tactic→verify subgraph.
  Fix:
   - per-agent timeout (default 90s) via `asyncio.wait_for`
   - `ProofMessage.iteration` increments on every traversal; when it
     exceeds `max_iterations` the executor yields output instead of
     looping
   - `MultiAgentSorryProver.prove_sorry()` wraps `workflow.run()` in a
     wall-clock `asyncio.wait_for` (default `max_iterations * 120s`)

Tests
  New `agent_tests/tests/test_prover_guards.py` (20 tests, all passing):
   - lean_server class loads via spec import (CRITICAL jsboige#1)
   - get_verifier instantiates with explicit project_dir
   - 6 stub patterns rejected by `_check_stub_guard`
   - 7 real tactics accepted by `_check_stub_guard`
   - comments containing `exact _` ignored
   - `file_replace_sorry` blocks `exact _` without writing
   - `file_replace_sorry` writes a real tactic
   - `ProofMessage` carries iteration cap + bounded default timeout
   - `MultiAgentSorryProver.prove_sorry` exposes workflow_timeout_s

Install
  New `agent_tests/requirements.txt` documents the prover deps:
  `agent-framework-openai>=1.3.0`, `openai>=2.0.0`, `python-dotenv`,
  `pytest`, `pytest-asyncio`. Verified by installing into
  `coursia-ml-training` env and running the 20 tests.

Out of scope (po-2026 issues jsboige#4 / jsboige#5)
  jsboige#4 (goal extraction cascading errors on Voting.lean) and jsboige#5 (LLM
  selection) are separate concerns — jsboige#4 needs Voting.lean fixes upstream,
  jsboige#5 is a config call left to the operator. Both are tracked under jsboige#833.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jsboige added a commit that referenced this pull request May 26, 2026
- P1: Extract _record_sorry_count() shared method, eliminating duplication
  between _build_check_or_revert and compile(). Both callers now use the
  same single method for delta0 streak tracking.
- P2: Make probe cache context window configurable (probe_ctx_lines=10).
- P3: Make cumulative_fails_threshold configurable via env var
  PROVER_CUMULATIVE_FAILS_THRESHOLD (default 3, override to 5 for
  conservative mode).

Addresses Hermes CHANGES_REQUESTED on PR #1616 (concern #1 duplication,
concerns #2-#3 magic numbers).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jsboige added a commit that referenced this pull request May 27, 2026
…ntory (#1616)

* fix(prover): harness P1+P2 stagnation guard + probe cache + consolidate Lean inventory

P1 HIGH: Move delta0 tracking into _build_check_or_revert so file_replace_lines/sorry
also update consecutive_delta0 counter. Previously only compile() updated it, allowing
75+ replace calls to bypass the stagnation guard (20-30h cumulative waste).

P2 HIGH: compile_probe_goal cache uses surrounding context hash (±10 lines) instead
of full-file MD5. Edits far from the sorry no longer invalidate probe cache.

P3 MEDIUM: Lower cumulative_fails_threshold from 5 to 3 for earlier Director escalation.

Also: LEAN_INVENTORY.md updated (sorry 4→3 cooperative_games, MechanismDesign added,
GO/NO-GO decisions per project).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(prover): address Hermes review — factor delta0 + configurable params

- P1: Extract _record_sorry_count() shared method, eliminating duplication
  between _build_check_or_revert and compile(). Both callers now use the
  same single method for delta0 streak tracking.
- P2: Make probe cache context window configurable (probe_ctx_lines=10).
- P3: Make cumulative_fails_threshold configurable via env var
  PROVER_CUMULATIVE_FAILS_THRESHOLD (default 3, override to 5 for
  conservative mode).

Addresses Hermes CHANGES_REQUESTED on PR #1616 (concern #1 duplication,
concerns #2-#3 magic numbers).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(gametheory): execute Lean cells in 15b-CooperativeGames notebook

Two Lean 4 code cells (stable marriage definitions, GSState) verified
via `lean` CLI in WSL and injected with execution_count + outputs.
Notebook now 17/17 cells executed, 0 errors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(lean): repair lean4-wsl kernel + re-execute 15b with real kernel outputs

Root cause: kernel.json pointed to OLD bash wrapper (~/lean4-jupyter-wrapper.sh)
which lacks Windows→WSL path conversion and NTFS permission handling.
Fixed by pointing to correct Python wrapper (~/.lean4-kernel-wrapper.py v5).

- Update kernel.json to use correct Python wrapper with path conversion
- Re-execute 15b-CooperativeGames notebook with working lean4-wsl kernel
  (17/17 cells, 0 errors, 4.6s) replacing manually injected outputs
- Update wsl-kernels.md rule with full kernel architecture, diagnostic,
  setup scripts reference, and incident documentation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(lean): re-execute 6 Lean intro notebooks via repaired lean4-wsl kernel

All notebooks now 100% executed with real kernel outputs:
- Lean-2-Dependent-Types: 24/24 (was 20/24)
- Lean-3-Propositions-Proofs: 24/24 (was 23/24)
- Lean-4-Quantifiers: 22/22 (was 21/22)
- Lean-5-Tactics: 37/37 (was 33/37)
- Lean-6-Mathlib-Essentials: 19/19 (was 18/19)
- Lean-11-TorchLean: 10/10 (was 0/10)

Kernel lean4-wsl was repaired in previous commit (955dc2f).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: regenerate catalog + update README markers for Lean notebook re-executions

Catalog regenerated (494→498 entries) after re-executing 7 Lean notebooks.
Updated CATALOG-STATUS blocks in 4 READMEs: top-level, GameTheory, QuantConnect, SymbolicAI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ci): align catalog drift CI with git-tracked notebooks only

Local catalog included 4 untracked files (3 gitignored QuantConnect
quantbooks + test_lean4_kernel.ipynb) that CI never sees, causing
persistent drift failures. Add --git-tracked-only flag to
generate_catalog.py and use it in CI workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jsboige added a commit that referenced this pull request May 28, 2026
…1689)

Adds check_kernel_wrapper() to validate_lean_setup.py:
- Reads ~/.local/share/jupyter/kernels/lean4-wsl/kernel.json
  (or %APPDATA%\jupyter\kernels\lean4-wsl\kernel.json on Windows).
- ERRORS if argv mentions the obsolete bash wrapper
  ~/lean4-jupyter-wrapper.sh (incident 2026-05-27).
- OK if argv points to ~/.lean4-kernel-wrapper.py (v5).
- WARNS otherwise.

Wired into both validate_windows() and validate_wsl().
Verified locally (Windows mode): 13/13 OK, wrapper detected correct.

Addresses acceptance criterion #2 of #1618 ("Validation script
detects wrong wrapper registration"). Remaining criteria (single
entry point, archive old wrapper, doc) deferred to follow-up PRs.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants