Skip to content

docs(spec): SHIP-TWO-001 §70 — §69 RC3 CONFIRMED on gx10 + FIX DISCHARGED via 3/3 §68-trio flips#1636

Merged
noahgift merged 2 commits into
mainfrom
docs/section-70-rc3-discharge
May 12, 2026
Merged

docs(spec): SHIP-TWO-001 §70 — §69 RC3 CONFIRMED on gx10 + FIX DISCHARGED via 3/3 §68-trio flips#1636
noahgift merged 2 commits into
mainfrom
docs/section-70-rc3-discharge

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

Summary

§70 closes the §69 cascade with empirical evidence on gx10:

  1. RC disambiguation via APR_EVAL_DEBUG=1 on HumanEval/1 → RC3 CONFIRMED (NameError: name 'List' is not defined); RC1/RC2/RC4 FALSIFIED.
  2. Fix in PR fix(apr-cli): §69 RC3 CONFIRMED on gx10 — prepend prompt preamble to HumanEval full_program #1635 — new extract_prompt_preamble(prompt, entry_point) helper + ChatML-branch prepend.
  3. Discharge proof — 3/3 of §68's known-failed trio (HumanEval/1, /3, /6) flip to PASS with the fix.

§68's "Class B sampling/quantization" interpretation FALSIFIED: those were Class C harness-RC3 false-failures all along.

Discharge table

Task §68 pre-fix §68 R1+R2-only §70 RC3-fix
HumanEval/1 FAIL FAIL PASS (exit_code=0)
HumanEval/3 FAIL FAIL PASS (exit_code=0)
HumanEval/6 FAIL FAIL PASS (exit_code=0)

Flip rate: 100%.

SHIP-005 path

  • Pre-fix pass@1: 80.49% (§67, gx10 164-run)
  • Floor (AC-SHIP1-005): 84.80%
  • Expected post-fix pass@1: 85-95% (3/3 trio flip + ~70% of canonical HumanEval signatures use typing aliases)
  • Action: 164-run dispatched on gx10 (commit b7e69bfc8, full canonical set, T=0.0, ~5h CPU wall)
  • Discharge condition: post-fix pass@1 ≥ 84.80% → SHIP-005 LIVE-discharges → MODEL-1 ship % 94% → 95%

Methodology lesson #17 NEW

Pre-fix RED smoke can mask the bug class. A 0/N flip rate in a smoke proves only that the candidate fix doesn't move the needle — NOT that any specific failure class is responsible. The class must be identified via diagnostic instrumentation (APR_EVAL_DEBUG=1), not inferred from a flip rate.

Ship-% movement

  • MODEL-1: stays 94% pending 164-run completion. Path to 95% is now a single 164-run + verdict check, no further code changes needed.
  • MODEL-2: unchanged at 57%.

Refs

🤖 Generated with Claude Code

…RGED via 3/3 §68-trio flips (PMAT-CODE-SHIP-TWO-SECTION-70)

§69 (PR #1633) enumerated 4 candidate root causes for the apr eval
HumanEval false-failure. §70 reports the empirical disambiguation on
gx10 via the diagnostic surface (PR #1634), the 1-PR fix (PR #1635),
and the discharge proof.

70.1 RC disambiguation on gx10 (canonical 7B Q4K APR teacher):
  - RC1 (state leak)        : FALSIFIED — coherent 1031-byte response
  - RC2 (false-negative)    : FALSIFIED — python3 actually exited 1
  - RC3 (format!() bug)     : CONFIRMED — imports stripped
  - RC4 (max_tokens trunc)  : FALSIFIED — 524-char completion present

70.2 Why §68 was wrong: §68's R1+R2 0/3 flip rate on the known-failed
trio was correct evidence; the inference ("Class B sampling/
quantization") was a leap. The TRUE class was Class C (harness-RC3),
invisible to R1+R2 because R1+R2 doesn't touch the format!() at line 400.

70.3 The fix (PR #1635): new `extract_prompt_preamble(prompt, entry)`
helper + ChatML-branch prepend in run_humaneval_inference. 7 unit
tests cover the helper + RC3 falsifier.

70.4 Discharge proof — 3/3 §68 trio flip:
  | Task         | §68 pre-fix | §68 R1+R2-only | §70 RC3-fix |
  | HumanEval/1  | FAIL        | FAIL           | PASS        |
  | HumanEval/3  | FAIL        | FAIL           | PASS        |
  | HumanEval/6  | FAIL        | FAIL           | PASS        |
  Flip rate: 100%.

70.5 SHIP-005 path: 164-run dispatched on gx10 (commit b7e69bf);
~5h CPU wall. Discharge condition: post-fix pass@1 >= 84.80%.

70.6 Methodology lesson #17 NEW: pre-fix RED smoke can mask the bug
class. A 0/N flip rate in a smoke proves only that the candidate fix
doesn't move the needle, NOT that any specific failure class is
responsible. The class must be identified via diagnostic instrumentation
(APR_EVAL_DEBUG=1), not inferred from a flip rate.

70.7 Cumulative methodology lessons through §70 (lesson #17 added).

70.8 Ship-% movement: MODEL-1 stays 94% pending 164-run completion;
path to 95% is single rerun + verdict check, no further code changes.
MODEL-2 unchanged at 57%.

Spec version: 3.14.0 → **3.16.0** (also reapplies §69 banner at v3.15.0
since PR #1633 has not yet landed on main — when #1633 lands, the §69
section will exist; this commit's banner stack accommodates that).

Refs:
- contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0
- evidence/section-70-rc3-fix-2026-05-12/findings.json
- /tmp/apr_eval_debug_HumanEval_{1,3,6}.json (gx10 evidence)
- PR #1633 (§69 spec), PR #1634 (diagnostic surface), PR #1635 (RC3 fix)

Closes task #52 (PMAT-CODE-SHIP-TWO-SECTION-70).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit cd7cd35 into main May 12, 2026
10 checks passed
@noahgift noahgift deleted the docs/section-70-rc3-discharge branch May 12, 2026 12:53
noahgift added a commit that referenced this pull request May 12, 2026
…s@1 (PMAT-CODE-SHIP-TWO-SECTION-71) (#1642)

§70 (PR #1636) confirmed RC3 (format!() drops imports) on gx10 and
shipped the fix (PR #1635) + diagnostic surface (PR #1634). §71
reports the empirical 164-run discharge proof on gx10:

  Result: 142/164 problems passed → pass@1 = 86.59%
  Floor:  84.80% (AC-SHIP1-005 with 1.2% tolerance)
  Headroom above floor: +1.79pp

  Compared to §67 baseline (H4 ChatML only): 80.49% (132/164)
  RC3 fix flipped 10 additional problems → +6.10pp gain
  pass@10 ≈ 100%, pass@100 = 100%

SHIP-005 LIVE-DISCHARGED. The §65→§71 cascade is closed for SHIP-005.

Run metadata:
  Host:    gx10-a5b5 (Blackwell GB10, aarch64)
  Binary:  /home/noah/src/aprender/target/release/apr @ b7e69bf
  Artifact: qwen2.5-coder-7b-instruct-q4k.apr
  Wall:    5h 50min (08:10 → 14:00 UTC)
  Sample:  T=0.0, 1 sample, max_tokens=512 (greedy)

§17.5 chain post-§71:
  SHIP-002  DISCHARGED (no change)
  SHIP-005  PARTIAL → LIVE-DISCHARGED  ←  §71
  SHIP-006  DISCHARGED (no change)
  SHIP-007  PARTIAL — multi-PR CUDA cascade (§63 — separate track)
  SHIP-008  DISCHARGED (no change)

MODEL-1 ship %: 94% → 95% (4 of 5 §17.5 PARTIALs LIVE-discharged).
Path to 96% requires SHIP-007 multi-PR CUDA cascade.

MODEL-2 ship %: unchanged at 57% (independent track).

Methodology lesson #18 NEW: §70 → §71 closes the predict-then-verify
loop. A fix whose 3/3 smoke flip and whose mechanism-based lift
estimate (§70.5 predicted +5-15pp) land within the predicted band
(actual +6.10pp) IS the discharge evidence; no further investigation
needed. The cascade arc closes when prediction matches empirical.

Spec v3.16.0 → v3.17.0.

Evidence:
- evidence/section-71-ship-005-discharged-2026-05-12/humaneval-164-rc3-gx10.json (full 164-problem JSON, 24KB)
- evidence/section-71-ship-005-discharged-2026-05-12/findings.json
- evidence/section-70-rc3-fix-2026-05-12/findings.json (3/3 trio)
- evidence/section-69-harness-bug-2026-05-12/findings.json (smoking-gun)
- evidence/section-67-h4-164-run-result-2026-05-12/findings.json (baseline)

Closes task #56 (PMAT-CODE-SHIP-TWO-SECTION-71).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP)

🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001.

All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical
7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090,
--features cuda).

This release prep PR ships:
1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights:
   - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE)
   - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75) — 124.6 tok/s
   - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71) — pass@1 86.59%
   - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634)
   - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649)
   - Added: MBPP harness H4 fix (PR #1645)
   - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness-
     invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0)
   - Methodology lessons #16-22 captured in MEMORY.md
   - Spec: v3.13.0 → v3.21.0 across §67-§75

2. Workspace version bump:
   - [workspace.package].version: 0.32.0 → 0.33.0
   - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0
   - 28 sub-crate version literals: 0.32.0 → 0.33.0

3. `cargo check -p aprender` → clean (workspace builds at 0.33.0).

Out of scope for this PR (separate steps after #1651/1652 land + this
PR lands):
- Tag release `v0.33.0` on main
- Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md
  — 15 user-facing crates + 7 internal-tier in topological dependency
  order; uses `make publish CRATE=<name>`)
- Post-publish QA per `feedback_post_publish_qa_required.md` —
  `cargo install aprender --force` + `/dogfood` GO verdict required
  before declaring release done (v0.31.1 was yanked for skipping this)
- GitHub Release with §75 narrative
- HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256
  already verified by §72 SHIP-010 LIVE evidence; double-check before
  release announcement)

This PR ships ONLY the version-bump + CHANGELOG. Publishing is the
next step after merge.

Refs:
- §75 MODEL-1 100% (PR #1652)
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- §72 5-AC LIVE cascade (PR #1646)
- §71 SHIP-005 LIVE-DISCHARGED (PR #1642)
- §70 RC3 fix (PR #1636)
- §69 Q4K hypothesis falsified (PR #1633)
- PR #1635 RC3 prepend
- PR #1634 diagnostic surface + contract
- PR #1648 SHIP-007 contract scaffold
- PR #1649 SHIP-007 PR-B stage dump
- PR #1651 SHIP-007 PR-E F32 GEMV layout fix

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP) (#1653)

🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001.

All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical
7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090,
--features cuda).

This release prep PR ships:
1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights:
   - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE)
   - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75) — 124.6 tok/s
   - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71) — pass@1 86.59%
   - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634)
   - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649)
   - Added: MBPP harness H4 fix (PR #1645)
   - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness-
     invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0)
   - Methodology lessons #16-22 captured in MEMORY.md
   - Spec: v3.13.0 → v3.21.0 across §67-§75

2. Workspace version bump:
   - [workspace.package].version: 0.32.0 → 0.33.0
   - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0
   - 28 sub-crate version literals: 0.32.0 → 0.33.0

3. `cargo check -p aprender` → clean (workspace builds at 0.33.0).

Out of scope for this PR (separate steps after #1651/1652 land + this
PR lands):
- Tag release `v0.33.0` on main
- Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md
  — 15 user-facing crates + 7 internal-tier in topological dependency
  order; uses `make publish CRATE=<name>`)
- Post-publish QA per `feedback_post_publish_qa_required.md` —
  `cargo install aprender --force` + `/dogfood` GO verdict required
  before declaring release done (v0.31.1 was yanked for skipping this)
- GitHub Release with §75 narrative
- HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256
  already verified by §72 SHIP-010 LIVE evidence; double-check before
  release announcement)

This PR ships ONLY the version-bump + CHANGELOG. Publishing is the
next step after merge.

Refs:
- §75 MODEL-1 100% (PR #1652)
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- §72 5-AC LIVE cascade (PR #1646)
- §71 SHIP-005 LIVE-DISCHARGED (PR #1642)
- §70 RC3 fix (PR #1636)
- §69 Q4K hypothesis falsified (PR #1633)
- PR #1635 RC3 prepend
- PR #1634 diagnostic surface + contract
- PR #1648 SHIP-007 contract scaffold
- PR #1649 SHIP-007 PR-B stage dump
- PR #1651 SHIP-007 PR-E F32 GEMV layout fix

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant