docs(spec): SHIP-TWO-001 §66 — H4 confirmed: SHIP-005 34% is harness methodology, not model #1627
Closed
noahgift wants to merge 2 commits into
Conversation
…RITHM_LEVEL — closes 3/3

Bind all three falsification gates of `contracts/apr-format-invariants-v1.yaml` at PARTIAL_ALGORITHM_LEVEL via verdict functions in `crates/aprender-core/src/format/aprfi_001_003.rs` (28 tests pass).

## Five-whys

1. Why bind apr-format-invariants-v1? Unbound; the spec mandates ≥ PARTIAL_ALGORITHM_LEVEL before runtime gates engage.
2. Why algorithm-level? All three gates are pure properties of byte buffers and float arrays — no actual MessagePack/serde dispatch is required at this layer.
3. Why bytewise roundtrip in the reference? ModelEvidence exact roundtrip is what serde_msgpack guarantees; verdict-level proof needs only the byte-equality predicate, not the actual codec.
4. Why distinguish `is_truncated` × `did_panic` × `validation_error`? APR-002 pins TWO orthogonal failure modes: panic (missing bounds check) AND silent pass (validation skipped). The 3-bit state machine catches both with one verdict.
5. Why 1e-6 regression tolerance (vs the 1e-5 used elsewhere)? The contract concerns metric drift, where a tighter tolerance catches subtler regressions; 1e-5 (used for kernel parity) would miss sub-1e-5 metric drift over many evidence values.

## What this binds

- APR-001 roundtrip identity: `verdict_from_roundtrip_identity` (input == output bytewise).
- APR-002 truncated rejection: `verdict_from_truncated_rejection` (no panic AND truncated ⇒ ValidationError).
- APR-003 regression detection: `verdict_from_regression_detection` (identical ⇒ zero count under 1e-6 tolerance).

In-module reference: `roundtrip(bytes)`, `count_regressions(b, c)` returning `Option<usize>` (None on length mismatch / NaN).

## Tests (7 sections, 28 cases)

- §1 Provenance pin (1)
- §2 APR-001 (5) — byte-identical, empty, large, corrupted, length
- §3 APR-002 (5) — truncated+VE pass; full pass; silent pass fail; panic on truncated/full fail
- §4 APR-003 (8) — identical/within-tol/improvements pass; real regression match; silent fail; expect-some-got-zero; length; NaN
- §5 Domain (4) — count_regressions round-trips
- §6 Sweep (1) — eps band probe
- §7 Realistic (4) — serialize loses field; missing bounds check panic; FP no-tolerance bug; full pipeline 3-gate pass

Coverage: 16+29 → 15+30 (one PARTIAL closes 3 unbound).

Refs:
- contracts/apr-format-invariants-v1.yaml v1.0.0 (FALSIFY-APR-001..003)
- docs/specifications/aprender-train/ship-two-models-spec.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
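As a hedged illustration of the in-module reference described above, `count_regressions` might look like the sketch below. The signature and the "higher is better" regression direction are assumptions of this sketch; the real implementation lives in `crates/aprender-core/src/format/aprfi_001_003.rs`.

```rust
/// Sketch of the APR-003 reference: counts candidate metrics that drop
/// below their baseline by more than `eps` (1e-6 in the contract).
/// Returns None on length mismatch or any NaN, per the commit message.
/// The regression direction (baseline minus candidate) is an assumption.
fn count_regressions(baseline: &[f64], candidate: &[f64], eps: f64) -> Option<usize> {
    if baseline.len() != candidate.len() {
        return None;
    }
    let mut regressions = 0;
    for (&b, &c) in baseline.iter().zip(candidate.iter()) {
        if b.is_nan() || c.is_nan() {
            return None;
        }
        // Candidate worse than baseline beyond tolerance counts as drift.
        if b - c > eps {
            regressions += 1;
        }
    }
    Some(regressions)
}
```

Under this sketch, identical arrays yield `Some(0)`, which is exactly the APR-003 "identical ⇒ zero count" gate.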
…cross-CLI test (PMAT-CODE-SHIP-TWO-SECTION-66)

5-min cross-CLI diagnostic on gx10: the same canonical 7B APR teacher with the same HumanEval/2 prompt produces a CORRECT solution via apr run (ChatML auto-wrap) but FAILS via apr eval --task humaneval (raw continuation).

H4 NEW (root cause): Qwen2.5-Coder-Instruct expects the ChatML chat template; the harness uses raw continuation via with_input_tokens. The 34% pass@1 in §65 is a HARNESS issue, not missing model knowledge.

H1/H2/H3 (artifact, dedent, BPE) deprioritized — all subsumed by H4.

Fix scope: ~80-120 LOC in run_humaneval_inference (ChatML wrap + markdown code-block parsing). Expected post-fix pass@1 ≈ 80-88% (matches published Qwen 88.4%).

Methodology lesson #13 NEW: cross-CLI behavior comparison falsifies hypotheses fast. No 5h rerun needed when a 5-min diagnostic localises the bug class.

Spec v3.09.0 → v3.12.0. MODEL-1 ship %: stays at 94%.

Closes task #40 PMAT-CODE-SHIP-TWO-SECTION-66.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closing as superseded — the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.
auto-merge was automatically disabled
May 12, 2026 15:30
Pull request was closed
Summary
A 5-min cross-CLI diagnostic on gx10 falsifies H1/H2/H3 from §65 and confirms a new hypothesis, H4: the 34% pass@1 in §65 is an evaluation-harness methodology bug, not a model or quantization issue.
Smoking-gun test
Same canonical 7B APR teacher + same HumanEval/2 prompt + same gx10 host:
- `apr run` (ChatML auto-wrap) → CORRECT solution emitted (markdown-wrapped Python using `math.modf(number)`)
- `apr eval --task humaneval` (raw continuation via `with_input_tokens`) → FAIL (pass@1: 0.0%)

H4: Evaluation methodology mismatch
Qwen2.5-Coder-7B-Instruct is trained for the chat/instruct format. It expects:

- `<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n` wrapping
- output in `` ```python ... ``` `` blocks, with optional explanation

But our harness uses raw continuation — the model sees the prompt as a continuing Python file and emits low-probability tokens. The published Qwen pass@1 of 88.4% uses the chat template, per the Qwen team's methodology.
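The ChatML wrapping described above can be sketched as follows. This is a minimal sketch: the helper name is hypothetical, and the actual fix routes through the harness's `with_prompt` path rather than a standalone function.

```rust
/// Minimal ChatML wrap for an instruct model (hypothetical helper;
/// the real harness change goes through `with_prompt`). Produces the
/// user turn plus an open assistant turn for the model to complete.
fn wrap_chatml(user_prompt: &str) -> String {
    format!(
        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
        user_prompt
    )
}
```

Feeding the wrapped prompt instead of the raw HumanEval stub is what moves the model from low-probability raw continuation into its trained chat distribution.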
H1/H2/H3 deprioritized
All subsumed by H4.
Fix scope (1-PR, ~80-120 LOC)
In `crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference`:

- ChatML wrap (`with_prompt`) for instruct models
- Parse `` ```python ... ``` `` blocks from the completion

Expected post-fix pass@1 ≈ 80-88% (matches Qwen 88.4%). SHIP-005 LIVE-discharge becomes feasible.
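The code-block parsing step could be sketched like this, assuming the completion contains at most one python-fenced block; the real parser in `run_humaneval_inference` may also need to handle bare fences and trailing explanation text.

```rust
/// Extracts the body of the first python-fenced markdown block, if any
/// (sketch only; real handling of bare fences is out of scope here).
fn extract_python_block(completion: &str) -> Option<&str> {
    let open = "```python\n";
    let start = completion.find(open)? + open.len();
    // Find the closing fence relative to the block body.
    let end = completion[start..].find("```")?;
    Some(&completion[start..start + end])
}
```

On the smoking-gun output this would recover the markdown-wrapped Python solution that `apr run` emits, so the same completion can be scored instead of failing on the surrounding markdown.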
Methodology Lesson #13 (NEW)
Cross-CLI behavior comparison falsifies hypotheses fast: a 5-min diagnostic localised the bug class without rerunning the 5h evaluation. Generalises lessons #8 + #9.
Ship-% Movement
🤖 Generated with Claude Code