Skip to content

docs(spec): SHIP-TWO-001 §67 — H4 LIVE result: pass@1 = 80.49% (+46pp gain, 4.31pp below floor)#1629

Merged
noahgift merged 1 commit into
mainfrom
docs/section-67-h4-result
May 12, 2026
Merged

docs(spec): SHIP-TWO-001 §67 — H4 LIVE result: pass@1 = 80.49% (+46pp gain, 4.31pp below floor)#1629
noahgift merged 1 commit into
mainfrom
docs/section-67-h4-result

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

Summary

PR #1628 H4 fix LIVE 164-run on canonical 7B APR teacher (gx10, 5.8h CPU wall): 132/164 = 80.49% pass@1.

Verdict

passed = 132/164
pass@1   = 0.8049  ← FAIL (4.31pp below 0.848 effective floor)
pass@10  = 1.0000
pass@100 = 1.0000

Massive improvement vs §65 baseline

Run pass@1 Delta Verdict
§65 raw-continuation 34.15% (baseline) FAIL (50pp gap)
§67 H4 ChatML 80.49% +46.34pp FAIL (4.31pp gap)

H4 closed 92% of the original gap. Remaining 4.31pp is refinement-scale.

Model fully capable

pass@10 ≈ 100%, pass@100 = 100%. Every problem solvable given enough samples.

Four refinement candidates for the 4.31pp residual

Candidate Description Est. gain
R1 extraction robustness (failed problems may not fence) 2-3pp
R2 function-targeted extraction (def {entry_point}( preferred) 1-2pp
R3 Q4K → FP16 (published 88.4% may use FP16) 2-3pp
R4 sampling refinement (temp=0.2, samples=3, majority) 1-2pp

R1+R2 = cheapest 1-PR slice in extract_python_code_block + 5h gx10 rerun. Likely closes the gap.

Methodology Lesson #14 (NEW)

Near-miss results bound refinement scope. A 50pp gap (§65) signals a methodology issue; a 4pp gap (§67) signals a refinement issue. Different fix archetypes.

Ship-% Movement

  • MODEL-1 ship %: stays at 94% (no LIVE-discharge). SHIP-005 path bounded to R1-R4 refinement cascade.
  • MODEL-2 ship %: unchanged at 57%.

🤖 Generated with Claude Code

… gain, 4.31pp below floor) (PMAT-CODE-SHIP-TWO-SECTION-67)

PR #1628 H4 fix (ChatML wrap + extract_python_code_block) shipped;
gx10 164-run completed in 5.8h CPU wall.

Result: 132/164 = 80.49% pass@1.

Comparison:
- §65 raw-continuation: 34.15% (baseline)
- §67 H4 ChatML:        80.49% (+46.34pp gain)

pass@10 = 1.0000 (essentially 100%); pass@100 = 1.0000. Model
fully capable. Remaining 4.31pp gap is refinement-scale.

SHIP-005 stays PARTIAL but path bounded to 4 refinement candidates:
- R1: extraction robustness (some completions may not fence)
- R2: function-targeted extraction (prefer def {entry_point}( block)
- R3: Q4K → FP16 (published 88.4% may use FP16; Q4K loses 1-3pp)
- R4: sampling refinement (temperature=0.2, samples=3, majority)

R1+R2 are cheapest 1-PR slice + 5h gx10 rerun.

Methodology lesson #14 NEW: Near-miss results bound refinement
scope. 50pp gap = methodology issue; 4pp gap = refinement issue.
Different fix archetypes. Generalises lesson #11.

Spec movement:
- v3.09.0 → v3.13.0
- MODEL-1 ship %: stays at 94%; will flip to 95% if R1+R2 close
  the 4.31pp gap
- MODEL-2 ship %: unchanged at 57%

Closes task #43 PMAT-CODE-SHIP-TWO-SECTION-67.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant