
docs(spec): SHIP-TWO-001 §66 — H4 confirmed: SHIP-005 34% is harness methodology, not model #1627

Closed

noahgift wants to merge 2 commits into main from docs/section-66-final


Conversation

@noahgift
Contributor

Summary

A 5-min cross-CLI diagnostic on gx10 falsifies H1/H2/H3 from §65 and confirms H4 (NEW): the 34% pass@1 in §65 is an evaluation-harness methodology bug, not a model or quantization issue.

Smoking-gun test

Same canonical 7B APR teacher + same HumanEval/2 prompt + same gx10 host:

  • apr run (ChatML auto-wrap) → CORRECT solution emitted (markdown-wrapped Python using math.modf(number))
  • apr eval --task humaneval (raw-continuation via with_input_tokens) → FAIL (pass@1: 0.0%)

H4: Evaluation methodology mismatch

Qwen2.5-Coder-7B-Instruct is trained for chat/instruct format and expects:

  • <|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n wrapping
  • Response as a fenced ```python ... ``` code block, with optional explanation

But our harness uses raw-continuation — the model sees the prompt as a continuing Python file and emits low-probability tokens. The published Qwen pass@1 of 88.4% uses the chat template, per the Qwen team's methodology.
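To make the two prompt modes concrete, here is a minimal sketch (hypothetical helper names, not the apr-cli source) of what the model sees in each case, assuming the standard ChatML role markers:

```rust
// Hypothetical sketch of the two prompt modes compared above.
// Neither function exists in apr-cli; they only illustrate the shape
// of the input the model receives in each mode.

/// Instruct-tuned models expect the request inside ChatML role markers.
fn wrap_chatml(user_prompt: &str) -> String {
    format!(
        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
        user_prompt
    )
}

/// The §65 harness fed the HumanEval stub as-is: the model sees a
/// half-finished Python file, a low-probability context for an
/// instruct-tuned model.
fn raw_continuation(humaneval_prompt: &str) -> String {
    humaneval_prompt.to_string()
}

fn main() {
    let stub = "def truncate_number(number: float) -> float:\n    ...";
    let chat = wrap_chatml(stub);
    assert!(chat.starts_with("<|im_start|>user\n"));
    assert!(chat.ends_with("<|im_start|>assistant\n"));
    assert_eq!(raw_continuation(stub), stub);
}
```

The only difference is the wrapping, which is exactly what the smoking-gun test isolates.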

H1/H2/H3 deprioritized

  • H1 (artifact): gx10 first-10 passes match lambda-vector 8/10
  • H2 (dedent): works correctly on HumanEval/0
  • H3 (BPE): raw-continuation is the underlying issue

All subsumed by H4.

Fix scope (1-PR, ~80-120 LOC)

crates/apr-cli/src/commands/eval/inference.rs::run_humaneval_inference:

  1. ChatML auto-wrap (with_prompt) for instruct models
  2. Parse fenced ```python ... ``` code blocks from the completion
  3. Fall back to raw-continuation if no code block is found

Expected post-fix pass@1 ≈ 80-88% (matches Qwen 88.4%). SHIP-005 LIVE-discharge becomes feasible.
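Steps 2 and 3 of the fix can be sketched as follows (a hypothetical standalone function, not the actual `run_humaneval_inference` code): extract the first fenced ```python block from a chat completion, falling back to the raw completion when no fence is present.

```rust
// Hypothetical sketch of the code-block extraction with fallback.
// The real fix lives in crates/apr-cli; this only shows the logic.
fn extract_python_block(completion: &str) -> &str {
    if let Some(start) = completion.find("```python") {
        let body = &completion[start + "```python".len()..];
        let body = body.strip_prefix('\n').unwrap_or(body);
        if let Some(end) = body.find("```") {
            // Return only the code inside the fence.
            return body[..end].trim_end();
        }
    }
    // Fallback (step 3): treat the whole completion as raw continuation.
    completion
}

fn main() {
    let chat = "Here is the fix:\n```python\ndef f():\n    return 1\n```\nDone.";
    assert_eq!(extract_python_block(chat), "def f():\n    return 1");
    let raw = "    return number % 1.0";
    assert_eq!(extract_python_block(raw), raw);
}
```

The fallback keeps the harness safe for base (non-instruct) models that genuinely emit raw continuations.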

Methodology Lesson #13 (NEW)

Cross-CLI behavior comparison falsifies hypotheses fast: a 5-min diagnostic localised the bug class without rerunning the 5-hour evaluation. Generalises lessons #8 and #9.

Ship-% Movement

  • MODEL-1 ship %: stays at 94% (no discharge here; H4 fix is 1-PR follow-up)
  • MODEL-2 ship %: unchanged at 57%

🤖 Generated with Claude Code

noahgift and others added 2 commits May 11, 2026 18:24
…RITHM_LEVEL — closes 3/3

Bind all three falsification gates of `contracts/apr-format-invariants-v1.yaml`
at PARTIAL_ALGORITHM_LEVEL via verdict functions in
`crates/aprender-core/src/format/aprfi_001_003.rs` (28 tests pass).

## Five-whys

1. Why bind apr-format-invariants-v1?  Unbound; spec mandates ≥
   PARTIAL_ALGORITHM_LEVEL before runtime gates engage.
2. Why algorithm-level?  All three gates are pure properties of byte
   buffers and float arrays — no actual MessagePack/serde dispatch
   required at this layer.
3. Why bytewise roundtrip in the reference?  ModelEvidence exact
   roundtrip is what serde_msgpack guarantees; verdict-level proof
   needs only the byte-equality predicate, not the actual codec.
4. Why distinguish `is_truncated` × `did_panic` × `validation_error`?
   APR-002 pins TWO orthogonal failure modes: panic (missing bounds
   check) AND silent pass (validation skipped). The 3-bit state
   machine catches both with one verdict.
5. Why 1e-6 regression tolerance (vs the 1e-5 used elsewhere)?
   Contract concerns metric drift, where tighter tolerance catches
   subtler regressions; 1e-5 (used for kernel parity) would miss
   sub-1e-5 metric drift over many evidence values.

## What this binds

- APR-001 roundtrip identity: `verdict_from_roundtrip_identity`
  (input == output bytewise).
- APR-002 truncated rejection: `verdict_from_truncated_rejection`
  (no panic AND truncated ⇒ ValidationError).
- APR-003 regression detection: `verdict_from_regression_detection`
  (identical ⇒ zero count under 1e-6 tolerance).

In-module reference: `roundtrip(bytes)`, `count_regressions(b, c)`
returning `Option<usize>` (None on length mismatch / NaN).
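The `count_regressions` reference described above can be sketched as follows (an illustrative reimplementation from this description, not the in-tree `aprfi_001_003.rs` code): count metric drops beyond the 1e-6 tolerance, returning `None` on length mismatch or NaN.

```rust
// Illustrative sketch of the reference described in the commit message;
// the direction convention (higher metric = better) is an assumption.
fn count_regressions(baseline: &[f64], candidate: &[f64]) -> Option<usize> {
    const TOL: f64 = 1e-6;
    if baseline.len() != candidate.len() {
        return None; // length mismatch is not comparable
    }
    let mut count = 0;
    for (&b, &c) in baseline.iter().zip(candidate) {
        if b.is_nan() || c.is_nan() {
            return None; // NaN poisons the comparison
        }
        // A regression is a drop beyond tolerance; improvements and
        // within-tolerance drift do not count.
        if b - c > TOL {
            count += 1;
        }
    }
    Some(count)
}

fn main() {
    let base = [0.90, 0.85, 0.70];
    assert_eq!(count_regressions(&base, &base), Some(0)); // identical ⇒ zero
    assert_eq!(count_regressions(&base, &[0.90, 0.85]), None); // length mismatch
    assert_eq!(count_regressions(&base, &[0.90, 0.80, 0.70]), Some(1));
    assert_eq!(count_regressions(&[f64::NAN], &[0.5]), None); // NaN
}
```

This is the "identical ⇒ zero count under 1e-6 tolerance" property that `verdict_from_regression_detection` gates on.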

## Tests (7 sections, 28 cases)

- §1 Provenance pin (1)
- §2 APR-001 (5) — byte-identical, empty, large, corrupted, length
- §3 APR-002 (5) — truncated+VE pass; full pass; silent pass fail;
  panic on truncated/full fail
- §4 APR-003 (8) — identical/within-tol/improvements pass; real
  regression match; silent fail; expect-some-got-zero; length; NaN
- §5 Domain (4) — count_regressions round-trips
- §6 Sweep (1) — eps band probe
- §7 Realistic (4) — serialize loses field; missing bounds check
  panic; FP no-tolerance bug; full pipeline 3-gate Pass

Coverage: 16+29 → 15+30 (one PARTIAL closes 3 unbound).

Refs:
- contracts/apr-format-invariants-v1.yaml v1.0.0 (FALSIFY-APR-001..003)
- docs/specifications/aprender-train/ship-two-models-spec.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cross-CLI test (PMAT-CODE-SHIP-TWO-SECTION-66)

5-min cross-CLI diagnostic on gx10: same canonical 7B APR teacher +
same HumanEval/2 prompt produces CORRECT solution via apr run
(ChatML auto-wrap) but FAIL via apr eval --task humaneval
(raw-continuation).

H4 NEW (root cause): Qwen2.5-Coder-Instruct expects ChatML chat
template; the harness uses raw-continuation via with_input_tokens.
The 34% pass@1 in §65 is a HARNESS issue, not a model-knowledge issue.

H1/H2/H3 (artifact, dedent, BPE) deprioritized — all subsumed by H4.

Fix scope: ~80-120 LOC in run_humaneval_inference (ChatML wrap +
markdown code-block parsing). Expected post-fix pass@1 ≈ 80-88%
(matches published Qwen 88.4%).

Methodology lesson #13 NEW: Cross-CLI behavior comparison falsifies
hypotheses fast. No 5h rerun needed when 5-min diagnostic localises
the bug class.

Spec v3.09.0 → v3.12.0. MODEL-1 ship %: stays at 94%.

Closes task #40 PMAT-CODE-SHIP-TWO-SECTION-66.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 11, 2026 16:25
@noahgift
Contributor Author

Closing as superseded — the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.

@noahgift noahgift closed this May 12, 2026
auto-merge was automatically disabled May 12, 2026 15:30

Pull request was closed
