docs(spec): SHIP-TWO-001 §70 — §69 RC3 CONFIRMED on gx10 + FIX DISCHARGED via 3/3 §68-trio flips#1636
Merged
…RGED via 3/3 §68-trio flips (PMAT-CODE-SHIP-TWO-SECTION-70)

§69 (PR #1633) enumerated 4 candidate root causes for the apr eval HumanEval false-failure. §70 reports the empirical disambiguation on gx10 via the diagnostic surface (PR #1634), the 1-PR fix (PR #1635), and the discharge proof.

70.1 RC disambiguation on gx10 (canonical 7B Q4K APR teacher):
- RC1 (state leak): FALSIFIED (coherent 1031-byte response)
- RC2 (false-negative): FALSIFIED (python3 actually exited 1)
- RC3 (format!() bug): CONFIRMED (imports stripped)
- RC4 (max_tokens trunc): FALSIFIED (524-char completion present)

70.2 Why §68 was wrong: §68's R1+R2 0/3 flip rate on the known-failed trio was correct evidence; the inference ("Class B sampling/quantization") was a leap. The TRUE class was Class C (harness-RC3), invisible to R1+R2 because R1+R2 doesn't touch the format!() at line 400.

70.3 The fix (PR #1635): new `extract_prompt_preamble(prompt, entry)` helper + ChatML-branch prepend in run_humaneval_inference. 7 unit tests cover the helper + RC3 falsifier.

70.4 Discharge proof, 3/3 §68 trio flip:

| Task        | §68 pre-fix | §68 R1+R2-only | §70 RC3-fix |
| HumanEval/1 | FAIL        | FAIL           | PASS        |
| HumanEval/3 | FAIL        | FAIL           | PASS        |
| HumanEval/6 | FAIL        | FAIL           | PASS        |

Flip rate: 100%.

70.5 SHIP-005 path: 164-run dispatched on gx10 (commit b7e69bf); ~5h CPU wall. Discharge condition: post-fix pass@1 >= 84.80%.

70.6 Methodology lesson #17 NEW: pre-fix RED smoke can mask the bug class. A 0/N flip rate in a smoke proves only that the candidate fix doesn't move the needle, NOT that any specific failure class is responsible. The class must be identified via diagnostic instrumentation (APR_EVAL_DEBUG=1), not inferred from a flip rate.

70.7 Cumulative methodology lessons through §70 (lesson #17 added).

70.8 Ship-% movement: MODEL-1 stays 94% pending 164-run completion; path to 95% is single rerun + verdict check, no further code changes. MODEL-2 unchanged at 57%.

Spec version: 3.14.0 → **3.16.0** (also reapplies §69 banner at v3.15.0 since PR #1633 has not yet landed on main; when #1633 lands, the §69 section will exist, and this commit's banner stack accommodates that).

Refs:
- contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0
- evidence/section-70-rc3-fix-2026-05-12/findings.json
- /tmp/apr_eval_debug_HumanEval_{1,3,6}.json (gx10 evidence)
- PR #1633 (§69 spec), PR #1634 (diagnostic surface), PR #1635 (RC3 fix)

Closes task #52 (PMAT-CODE-SHIP-TWO-SECTION-70).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
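The RC3 mechanism (70.1) and the shape of the fix (70.3) can be illustrated with a minimal Python sketch. This is not the harness code from PR #1635, which is not shown on this page; the function body, the `assemble_program` helper, and the abbreviated HumanEval/1-shaped prompt are all assumptions made for illustration. The idea is that a chat-style completion restates the function but not the prompt's imports, so whatever precedes `def <entry_point>(` in the prompt has to be prepended before the program is handed to python3.

```python
import re


def extract_prompt_preamble(prompt: str, entry_point: str) -> str:
    """Return the part of `prompt` that precedes `def <entry_point>(`.

    Hypothetical re-implementation for illustration only. For
    HumanEval-style prompts the preamble is typically the import block
    (e.g. `from typing import List`). Returns "" if no such def is found.
    """
    pattern = rf"^def\s+{re.escape(entry_point)}\s*\("
    match = re.search(pattern, prompt, flags=re.MULTILINE)
    return prompt[: match.start()] if match else ""


def assemble_program(prompt: str, entry_point: str, completion: str,
                     with_preamble: bool) -> str:
    """Build the source text a harness would execute (hypothetical helper)."""
    preamble = extract_prompt_preamble(prompt, entry_point) if with_preamble else ""
    return preamble + completion


# An abbreviated, illustrative prompt shaped like HumanEval/1.
PROMPT = (
    "from typing import List\n"
    "\n"
    "\n"
    "def separate_paren_groups(paren_string: str) -> List[str]:\n"
    '    """Split a string into balanced paren groups."""\n'
)

# A chat-style completion: the full function, but without the import.
COMPLETION = (
    "def separate_paren_groups(paren_string: str) -> List[str]:\n"
    "    groups, depth, current = [], 0, ''\n"
    "    for ch in paren_string.replace(' ', ''):\n"
    "        current += ch\n"
    "        depth += 1 if ch == '(' else -1\n"
    "        if depth == 0:\n"
    "            groups.append(current)\n"
    "            current = ''\n"
    "    return groups\n"
)

if __name__ == "__main__":
    print(assemble_program(PROMPT, "separate_paren_groups", COMPLETION, True))
```

Without the preamble the assembled program never mentions `typing`, which is consistent with the `NameError: name 'List' is not defined` reported for HumanEval/1; with it, the import is restored ahead of the completion.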
noahgift added a commit that referenced this pull request on May 12, 2026
…s@1 (PMAT-CODE-SHIP-TWO-SECTION-71) (#1642)

§70 (PR #1636) confirmed RC3 (format!() drops imports) on gx10 and shipped the fix (PR #1635) + diagnostic surface (PR #1634). §71 reports the empirical 164-run discharge proof on gx10:

Result: 142/164 problems passed → pass@1 = 86.59%
Floor: 84.80% (AC-SHIP1-005 with 1.2% tolerance)
Headroom above floor: +1.79pp
Compared to §67 baseline (H4 ChatML only): 80.49% (132/164)
RC3 fix flipped 10 additional problems → +6.10pp gain
pass@10 ≈ 100%, pass@100 = 100%

SHIP-005 LIVE-DISCHARGED. The §65→§71 cascade is closed for SHIP-005.

Run metadata:
- Host: gx10-a5b5 (Blackwell GB10, aarch64)
- Binary: /home/noah/src/aprender/target/release/apr @ b7e69bf
- Artifact: qwen2.5-coder-7b-instruct-q4k.apr
- Wall: 5h 50min (08:10 → 14:00 UTC)
- Sample: T=0.0, 1 sample, max_tokens=512 (greedy)

§17.5 chain post-§71:
- SHIP-002 DISCHARGED (no change)
- SHIP-005 PARTIAL → LIVE-DISCHARGED ← §71
- SHIP-006 DISCHARGED (no change)
- SHIP-007 PARTIAL: multi-PR CUDA cascade (§63, separate track)
- SHIP-008 DISCHARGED (no change)

MODEL-1 ship %: 94% → 95% (4 of 5 §17.5 PARTIALs LIVE-discharged). Path to 96% requires SHIP-007 multi-PR CUDA cascade.
MODEL-2 ship %: unchanged at 57% (independent track).

Methodology lesson #18 NEW: §70 → §71 closes the predict-then-verify loop. A fix whose 3/3 smoke flip and whose mechanism-based lift estimate (§70.5 predicted +5-15pp) land within the predicted band (actual +6.10pp) IS the discharge evidence; no further investigation needed. The cascade arc closes when prediction matches empirical.

Spec v3.16.0 → v3.17.0.

Evidence:
- evidence/section-71-ship-005-discharged-2026-05-12/humaneval-164-rc3-gx10.json (full 164-problem JSON, 24KB)
- evidence/section-71-ship-005-discharged-2026-05-12/findings.json
- evidence/section-70-rc3-fix-2026-05-12/findings.json (3/3 trio)
- evidence/section-69-harness-bug-2026-05-12/findings.json (smoking-gun)
- evidence/section-67-h4-164-run-result-2026-05-12/findings.json (baseline)

Closes task #56 (PMAT-CODE-SHIP-TWO-SECTION-71).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
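The §71 headline numbers follow from simple arithmetic on the pass counts, which the short Python sketch below reproduces. The counts (142, 132, 164) and the 84.80% floor come from the commit message above; the function and variable names are illustrative and not part of the `apr` tooling.

```python
def pass_at_1(passed: int, total: int) -> float:
    """pass@1 as a percentage for a single greedy sample per problem."""
    return 100.0 * passed / total


TOTAL = 164          # HumanEval problems in the full run
FLOOR = 84.80        # discharge floor quoted for AC-SHIP1-005

post_fix = pass_at_1(142, TOTAL)   # §71 run with the RC3 fix
baseline = pass_at_1(132, TOTAL)   # §67 baseline (H4 ChatML only)

gain_pp = post_fix - baseline      # lift attributed to the RC3 fix
headroom_pp = post_fix - FLOOR     # margin above the floor
discharged = post_fix >= FLOOR     # discharge condition from §70.5

if __name__ == "__main__":
    print(f"post-fix pass@1: {post_fix:.2f}%")
    print(f"baseline pass@1: {baseline:.2f}%")
    print(f"gain:            +{gain_pp:.2f}pp")
    print(f"headroom:        +{headroom_pp:.2f}pp")
    print(f"discharged:      {discharged}")
```

Ten additional passing problems out of 164 is 10/164 ≈ 6.10 percentage points, which sits inside the +5-15pp band predicted in §70.5.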
noahgift added a commit that referenced this pull request on May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP)

🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001. All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical 7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090, --features cuda).

This release prep PR ships:

1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights:
   - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE)
   - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75), 124.6 tok/s
   - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71), pass@1 86.59%
   - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634)
   - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649)
   - Added: MBPP harness H4 fix (PR #1645)
   - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness-invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0)
   - Methodology lessons #16-22 captured in MEMORY.md
   - Spec: v3.13.0 → v3.21.0 across §67-§75
2. Workspace version bump:
   - [workspace.package].version: 0.32.0 → 0.33.0
   - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0
   - 28 sub-crate version literals: 0.32.0 → 0.33.0
3. `cargo check -p aprender` → clean (workspace builds at 0.33.0).

Out of scope for this PR (separate steps after #1651/1652 land + this PR lands):
- Tag release `v0.33.0` on main
- Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md: 15 user-facing crates + 7 internal-tier in topological dependency order; uses `make publish CRATE=<name>`)
- Post-publish QA per `feedback_post_publish_qa_required.md`: `cargo install aprender --force` + `/dogfood` GO verdict required before declaring release done (v0.31.1 was yanked for skipping this)
- GitHub Release with §75 narrative
- HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256 already verified by §72 SHIP-010 LIVE evidence; double-check before release announcement)

This PR ships ONLY the version-bump + CHANGELOG. Publishing is the next step after merge.

Refs:
- §75 MODEL-1 100% (PR #1652)
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- §72 5-AC LIVE cascade (PR #1646)
- §71 SHIP-005 LIVE-DISCHARGED (PR #1642)
- §70 RC3 fix (PR #1636)
- §69 Q4K hypothesis falsified (PR #1633)
- PR #1635 RC3 prepend
- PR #1634 diagnostic surface + contract
- PR #1648 SHIP-007 contract scaffold
- PR #1649 SHIP-007 PR-B stage dump
- PR #1651 SHIP-007 PR-E F32 GEMV layout fix

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP) (#1653)
Summary
§70 closes the §69 cascade with empirical evidence on gx10:
- `APR_EVAL_DEBUG=1` on HumanEval/1 → RC3 CONFIRMED (`NameError: name 'List' is not defined`); RC1/RC2/RC4 FALSIFIED.
- `extract_prompt_preamble(prompt, entry_point)` helper + ChatML-branch prepend.
- §68's "Class B sampling/quantization" interpretation FALSIFIED: those were Class C harness-RC3 false-failures all along.
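Discharge table

| Task | §68 pre-fix | §68 R1+R2-only | §70 RC3-fix |
|---|---|---|---|
| HumanEval/1 | FAIL | FAIL | PASS |
| HumanEval/3 | FAIL | FAIL | PASS |
| HumanEval/6 | FAIL | FAIL | PASS |

Flip rate: 100%.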
SHIP-005 path
- 164-run dispatched on gx10 (commit b7e69bfc8, full canonical set, T=0.0, ~5h CPU wall)

Methodology lesson #17 NEW
Pre-fix RED smoke can mask the bug class. A 0/N flip rate in a smoke proves only that the candidate fix doesn't move the needle — NOT that any specific failure class is responsible. The class must be identified via diagnostic instrumentation (APR_EVAL_DEBUG=1), not inferred from a flip rate.
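One way to act on this lesson is to classify a failure from its diagnostic output instead of from a flip rate. The Python sketch below is a hypothetical triage helper, not part of the `APR_EVAL_DEBUG=1` surface: it assumes only that the captured stderr of the failed python3 run and the original task prompt are available, and it flags the case where the undefined name is one the prompt itself imports. That pattern points at the harness dropping the preamble, not at the model.

```python
import re
from typing import Optional


def missing_name(stderr: str) -> Optional[str]:
    """Return the undefined name from a Python NameError message, if any."""
    match = re.search(r"NameError: name '([^']+)' is not defined", stderr)
    return match.group(1) if match else None


def imported_names(prompt: str) -> set:
    """Collect names bound by single-line import statements in `prompt`.

    Simplified on purpose: parenthesised multi-line imports are not handled.
    """
    names = set()
    for raw_line in prompt.splitlines():
        line = raw_line.strip()
        from_match = re.match(r"from\s+\S+\s+import\s+(.+)", line)
        plain_match = re.match(r"import\s+(.+)", line)
        if from_match:
            items = from_match.group(1)
        elif plain_match:
            items = plain_match.group(1)
        else:
            continue
        for item in items.split(","):
            # "x as y" binds y; a dotted plain import "a.b" binds a.
            bound = item.strip().split(" as ")[-1].strip()
            names.add(bound.split(".")[0])
    return names


def looks_like_harness_import_drop(prompt: str, stderr: str) -> bool:
    """True when the failing name is one the prompt's own imports provide."""
    name = missing_name(stderr)
    return name is not None and name in imported_names(prompt)


# Illustrative inputs shaped like the HumanEval/1 evidence.
EXAMPLE_PROMPT = (
    "from typing import List\n\n\n"
    "def separate_paren_groups(paren_string: str) -> List[str]:\n"
    '    """..."""\n'
)
EXAMPLE_STDERR = "NameError: name 'List' is not defined"

if __name__ == "__main__":
    print(looks_like_harness_import_drop(EXAMPLE_PROMPT, EXAMPLE_STDERR))
```

A check like this separates a harness-side false failure from a genuine wrong answer (for example an `AssertionError` from the test body), which a 0/N flip rate alone cannot do.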
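Ship-% movement
- MODEL-1 stays 94% pending 164-run completion; path to 95% is single rerun + verdict check, no further code changes.
- MODEL-2 unchanged at 57%.

Refs
- contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0
- evidence/section-70-rc3-fix-2026-05-12/findings.json
- /tmp/apr_eval_debug_HumanEval_{1,3,6}.json (gx10 evidence)
- PR #1633 (§69 spec), PR #1634 (diagnostic surface), PR #1635 (RC3 fix)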
🤖 Generated with Claude Code