docs(spec): SHIP-TWO-001 §66 — H4 confirmed: SHIP-005 34% is harness methodology, not model #1627
Closed
noahgift wants to merge 2 commits into
Conversation
…RITHM_LEVEL — closes 3/3

Bind all three falsification gates of `contracts/apr-format-invariants-v1.yaml` at PARTIAL_ALGORITHM_LEVEL via verdict functions in `crates/aprender-core/src/format/aprfi_001_003.rs` (28 tests pass).

## Five-whys

1. Why bind apr-format-invariants-v1? Unbound; the spec mandates ≥ PARTIAL_ALGORITHM_LEVEL before runtime gates engage.
2. Why algorithm-level? All three gates are pure properties of byte buffers and float arrays — no actual MessagePack/serde dispatch is required at this layer.
3. Why bytewise roundtrip in the reference? ModelEvidence exact roundtrip is what serde_msgpack guarantees; verdict-level proof needs only the byte-equality predicate, not the actual codec.
4. Why distinguish `is_truncated` × `did_panic` × `validation_error`? APR-002 pins TWO orthogonal failure modes: panic (missing bounds check) AND silent pass (validation skipped). The 3-bit state machine catches both with one verdict.
5. Why 1e-6 regression tolerance (vs the 1e-5 used elsewhere)? The contract concerns metric drift, where a tighter tolerance catches subtler regressions; 1e-5 (used for kernel parity) would miss sub-1e-5 metric drift over many evidence values.

## What this binds

- APR-001 roundtrip identity: `verdict_from_roundtrip_identity` (input == output bytewise).
- APR-002 truncated rejection: `verdict_from_truncated_rejection` (no panic AND truncated ⇒ ValidationError).
- APR-003 regression detection: `verdict_from_regression_detection` (identical ⇒ zero count under 1e-6 tolerance).

In-module reference: `roundtrip(bytes)`, `count_regressions(b, c)` returning `Option<usize>` (None on length mismatch / NaN).

## Tests (7 sections, 28 cases)

- §1 Provenance pin (1)
- §2 APR-001 (5) — byte-identical, empty, large, corrupted, length
- §3 APR-002 (5) — truncated+VE pass; full pass; silent pass fail; panic on truncated/full fail
- §4 APR-003 (8) — identical/within-tol/improvements pass; real regression match; silent fail; expect-some-got-zero; length; NaN
- §5 Domain (4) — count_regressions round-trips
- §6 Sweep (1) — eps band probe
- §7 Realistic (4) — serialize loses field; missing bounds check panic; FP no-tolerance bug; full pipeline 3-gate pass

Coverage: 16+29 → 15+30 (one PARTIAL closes 3 unbound).

Refs:
- contracts/apr-format-invariants-v1.yaml v1.0.0 (FALSIFY-APR-001..003)
- docs/specifications/aprender-train/ship-two-models-spec.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
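As a hedged illustration of the in-module reference described above, `count_regressions` might look like the sketch below. The signature and the "higher is better" regression direction are assumptions of this sketch; the real implementation lives in `crates/aprender-core/src/format/aprfi_001_003.rs`.

```rust
/// Sketch of the APR-003 reference: counts candidate metrics that drop
/// below their baseline by more than `eps` (1e-6 in the contract).
/// Returns None on length mismatch or any NaN, per the commit message.
/// The regression direction (baseline minus candidate) is an assumption.
fn count_regressions(baseline: &[f64], candidate: &[f64], eps: f64) -> Option<usize> {
    if baseline.len() != candidate.len() {
        return None;
    }
    let mut regressions = 0;
    for (&b, &c) in baseline.iter().zip(candidate.iter()) {
        if b.is_nan() || c.is_nan() {
            return None;
        }
        // Candidate worse than baseline beyond tolerance counts as drift.
        if b - c > eps {
            regressions += 1;
        }
    }
    Some(regressions)
}
```

Under this sketch, identical arrays yield `Some(0)`, which is exactly the APR-003 "identical ⇒ zero count" gate.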
…cross-CLI test (PMAT-CODE-SHIP-TWO-SECTION-66)

5-min cross-CLI diagnostic on gx10: the same canonical 7B APR teacher with the same HumanEval/2 prompt produces a CORRECT solution via apr run (ChatML auto-wrap) but FAILS via apr eval --task humaneval (raw continuation).

H4 NEW (root cause): Qwen2.5-Coder-Instruct expects the ChatML chat template; the harness uses raw continuation via with_input_tokens. The 34% pass@1 in §65 is a HARNESS issue, not missing model knowledge.

H1/H2/H3 (artifact, dedent, BPE) deprioritized — all subsumed by H4.

Fix scope: ~80-120 LOC in run_humaneval_inference (ChatML wrap + markdown code-block parsing). Expected post-fix pass@1 ≈ 80-88% (matches published Qwen 88.4%).

Methodology lesson #13 NEW: cross-CLI behavior comparison falsifies hypotheses fast. No 5h rerun needed when a 5-min diagnostic localises the bug class.

Spec v3.09.0 → v3.12.0. MODEL-1 ship %: stays at 94%.

Closes task #40 PMAT-CODE-SHIP-TWO-SECTION-66.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closing as superseded — the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.
auto-merge was automatically disabled
May 12, 2026 15:30
Pull request was closed
Summary
A 5-min cross-CLI diagnostic on gx10 falsifies H1/H2/H3 from §65 and confirms a new hypothesis, H4: the 34% pass@1 in §65 is an evaluation-harness methodology bug, not a model or quantization issue.
Smoking-gun test
Same canonical 7B APR teacher + same HumanEval/2 prompt + same gx10 host:
- `apr run` (ChatML auto-wrap) → CORRECT solution emitted (markdown-wrapped Python using `math.modf(number)`)
- `apr eval --task humaneval` (raw continuation via `with_input_tokens`) → FAIL (pass@1: 0.0%)

H4: Evaluation methodology mismatch
Qwen2.5-Coder-7B-Instruct is trained for the chat/instruct format. It expects:

- `<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n` wrapping
- output in `` ```python ... ``` `` blocks, with optional explanation

But our harness uses raw continuation — the model sees the prompt as a continuing Python file and emits low-probability tokens. The published Qwen pass@1 of 88.4% uses the chat template, per the Qwen team's methodology.
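The ChatML wrapping described above can be sketched as follows. This is a minimal sketch: the helper name is hypothetical, and the actual fix routes through the harness's `with_prompt` path rather than a standalone function.

```rust
/// Minimal ChatML wrap for an instruct model (hypothetical helper;
/// the real harness change goes through `with_prompt`). Produces the
/// user turn plus an open assistant turn for the model to complete.
fn wrap_chatml(user_prompt: &str) -> String {
    format!(
        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
        user_prompt
    )
}
```

Feeding the wrapped prompt instead of the raw HumanEval stub is what moves the model from low-probability raw continuation into its trained chat distribution.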
H1/H2/H3 deprioritized
All subsumed by H4.
Fix scope (1-PR, ~80-120 LOC)
In `crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference`:

- ChatML wrap (`with_prompt`) for instruct models
- Parse `` ```python ... ``` `` blocks from the completion

Expected post-fix pass@1 ≈ 80-88% (matches Qwen 88.4%). SHIP-005 LIVE-discharge becomes feasible.
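The code-block parsing step could be sketched like this, assuming the completion contains at most one python-fenced block; the real parser in `run_humaneval_inference` may also need to handle bare fences and trailing explanation text.

```rust
/// Extracts the body of the first python-fenced markdown block, if any
/// (sketch only; real handling of bare fences is out of scope here).
fn extract_python_block(completion: &str) -> Option<&str> {
    let open = "```python\n";
    let start = completion.find(open)? + open.len();
    // Find the closing fence relative to the block body.
    let end = completion[start..].find("```")?;
    Some(&completion[start..start + end])
}
```

On the smoking-gun output this would recover the markdown-wrapped Python solution that `apr run` emits, so the same completion can be scored instead of failing on the surrounding markdown.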
Methodology Lesson #13 (NEW)
Cross-CLI behavior comparison falsifies hypotheses fast: a 5-min diagnostic localised the bug class without rerunning the 5h evaluation. Generalises lessons #8 + #9.
Ship-% Movement
🤖 Generated with Claude Code