contracts(ccpa): v1.29.0 — register FALSIFY-CCPA-018 arena_recovery_rate_bound (PROPOSED) by noahgift · Pull Request #1705 · paiml/aprender

noahgift · 2026-05-15T14:47:11Z

Summary

Bumps contracts/claude-code-parity-apr-v1.yaml v1.28.0 → v1.29.0 to
register FALSIFY-CCPA-018 (arena_recovery_rate_bound) at status:
PROPOSED. Gate count 17 → 18.

Chained on #1684 (v1.28.0 / CCPA-017). Once #1684 merges to main,
this PR's base will retarget automatically.

What changed

- version: "1.28.0" → "1.29.0" (single field bump)
- Status comment: "18/18 gates registered ... + 2 PROPOSED (CCPA-017 + CCPA-018)"
- invariants[] summary: added 1-line CCPA-018 entry
- falsification_conditions[]: new full FALSIFY-CCPA-018 block (~80 lines) with assertion / test_harness / rationale / semantic_change_log
- status_history[]: new v1.29.0 entry at top with Phase 5 M194-M206 narrative

Pure additive bump — no schema change, no existing gate touched.

CCPA-018 design

DUAL threshold:

- recovery_rate >= 0.5 — agent recovers from bash failures
- oracle_passed_rate >= 0.3 — agent solves real fixtures

The asymmetric give-up-fast synthetic fixture is the canonical R3
distinguishing test: a system that solves easy tasks zero-shot but
never recovers from a hard task's first failure passes CCPA-017 but
FAILS CCPA-018.

CCPA-018 measures agent quality (recovery), distinct from
CCPA-016/017 which measure functional outcome.
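The dual threshold can be sketched as a conjunction of two floors. This is a minimal illustration, not the contract's actual test harness: `ArenaScores`, `ccpa_018_passes`, and `give_up_fast_fixture` are hypothetical names, and scoring the give-up-fast fixture as a full oracle pass rate with zero recovery is an assumption about how that fixture is scored.

```rust
/// Aggregate scores from one Arena bench run.
/// Field names follow the metric names in the PR text; the struct itself is hypothetical.
#[derive(Debug, Clone, Copy)]
pub struct ArenaScores {
    pub recovery_rate: f64,
    pub oracle_passed_rate: f64,
}

// Tentative floors quoted in the PR description.
pub const RECOVERY_RATE_FLOOR: f64 = 0.5;
pub const ORACLE_PASSED_RATE_FLOOR: f64 = 0.3;

/// Hypothetical helper: the gate holds only when BOTH floors are met.
pub fn ccpa_018_passes(scores: ArenaScores) -> bool {
    scores.recovery_rate >= RECOVERY_RATE_FLOOR
        && scores.oracle_passed_rate >= ORACLE_PASSED_RATE_FLOOR
}

/// Assumed scoring of the synthetic give-up-fast system:
/// every task solved zero-shot, no recovery after a first failure.
pub fn give_up_fast_fixture() -> ArenaScores {
    ArenaScores {
        recovery_rate: 0.0,
        oracle_passed_rate: 1.0,
    }
}
```

Under these assumptions `ccpa_018_passes(give_up_fast_fixture())` is false even though the oracle floor is cleared, which is the property that separates CCPA-018 from an outcome-only gate.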

Why PROPOSED

No operator-dispatched Arena bench has produced
`evidence/phase-5/arena-scores.json` yet. The live-evidence test is
`#[ignore]`'d until that file exists. Once the operator runs
`bash scripts/phase-5-arena-bench.sh` and the gate passes against
real data, a v1.30.0 bump will flip PROPOSED → ACTIVE_RUNTIME.
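One way such an ignored live-evidence test could be laid out is sketched below. Only the evidence path and the use of `#[ignore]` come from the PR text; `evidence_path`, `evidence_available`, and the test name are hypothetical, and the real test presumably also parses the scores and checks them against the floors.

```rust
use std::path::{Path, PathBuf};

/// Evidence path quoted in the PR description, relative to the repository root.
pub const EVIDENCE_REL_PATH: &str = "evidence/phase-5/arena-scores.json";

/// Hypothetical helper: where the evidence file would live under `repo_root`.
pub fn evidence_path(repo_root: &Path) -> PathBuf {
    repo_root.join(EVIDENCE_REL_PATH)
}

/// True once a bench run has written the evidence file.
pub fn evidence_available(repo_root: &Path) -> bool {
    evidence_path(repo_root).is_file()
}

#[cfg(test)]
mod tests {
    use super::*;

    // Skipped by a plain `cargo test`; run it with `cargo test -- --ignored`.
    #[test]
    #[ignore = "requires evidence/phase-5/arena-scores.json from an operator-dispatched bench"]
    fn live_evidence_is_present() {
        assert!(evidence_available(Path::new(".")));
    }
}
```

Keeping the test compiled but ignored means the gate's harness stays in the build while CI remains green until real evidence exists.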

Companion-repo Phase 5 sequence

- M194 (PR #181 4011bea) — phase-5-arena-runner-plan.md
- M196 (PR #183 6a7fe39) — P5.1 scaffolding
- M200 (PR #187 75ef8e6) — P5.2 multi-turn loop body
- M202 (PR #189 e381d05) — P5.3 bench runner
- M204 (PR #191 aa58ed6) — P5.4 CCPA-018 gate test scaffold
- M206 (PR #193 b95be66) — P5.5 falsifier-of-falsifier

Test plan

- `pv validate contracts/claude-code-parity-apr-v1.yaml` → "0 error(s), 0 warning(s). Contract is valid."
- CI `workspace-test` GREEN
- CI `ci / gate` GREEN
- Companion-repo M22 5-step ritual mirror PR opens chained on this

🤖 Generated with Claude Code

…recovery_rate_bound (PROPOSED)

Adds 1 new falsification gate to claude-code-parity-apr-v1: CCPA-018 (Arena recovery-rate bound) at status: PROPOSED. Gate count: 17 → 18.

Phase 5 (companion-repo M194-M206) operationalizes design-audit.md (M192 operator-authored) R2 + R3 recommendations: a live multi-turn execution harness where the agent gets bash/test feedback per turn and must recover from failures.

DUAL-threshold design:
- recovery_rate >= 0.5
- oracle_passed_rate >= 0.3

The asymmetric give-up-fast synthetic fixture is the canonical R3 distinguishing test: a system that solves easy tasks zero-shot but never recovers from a hard task's first failure passes CCPA-017 but FAILS CCPA-018.

CCPA-018 measures AGENT QUALITY (does the agent recover when bash fails?), distinct from CCPA-016/017 which measure FUNCTIONAL OUTCOME. Tentative 0.5/0.3 POC-tier floors; recalibration awaits first operator-dispatched Arena bench against M182 corpus.

CCPA-018 enters at status: PROPOSED because no operator-dispatched Arena bench has produced evidence/phase-5/arena-scores.json yet. The live-evidence test is #[ignore]'d until that file exists. Once the operator runs `bash scripts/phase-5-arena-bench.sh` and the gate passes against real data, a v1.30.0 bump will flip PROPOSED → ACTIVE_RUNTIME.

Companion-repo Phase 5 sequence:
- M194 (PR #181 4011bea) — phase-5-arena-runner-plan.md
- M196 (PR #183 6a7fe39) — P5.1 scaffolding
- M200 (PR #187 75ef8e6) — P5.2 multi-turn loop body
- M202 (PR #189 e381d05) — P5.3 bench runner
- M204 (PR #191 aa58ed6) — P5.4 CCPA-018 gate test scaffold
- M206 (PR #193 b95be66) — P5.5 falsifier-of-falsifier

Gate-level statuses post-v1.29.0: 4 ACTIVE_RUNTIME (CCPA-013/014/015/016) + 2 PROPOSED (CCPA-017 + CCPA-018) + rest at PLANNED_M*/IN_REVIEW per their lifecycle phase. No OPEN residue.

Pure additive bump: new gate + new status_history entry. No schema bump. pv validate clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…te CCPA-008 (#1735)

THREE changes bundled. v1.29.0 is SKIPPED — aprender#1705 (the original v1.29.0 PR) auto-CLOSED when its base #1684 (v1.28.0) squash-merged and deleted its feature branch. Companion-repo has been at the v1.29.0 contract YAML since M208 (pin.lock pointed at #1705's feature-branch HEAD); this v1.30.0 upstream-flip realigns aprender main with companion's contract content AND adds the M224/M230 deltas the operator-dispatched Phase 5 Arena bench produced.

CHANGE (1): FALSIFY-CCPA-018 (arena_recovery_rate_bound) added to gate registry at status: PROPOSED. Gate count: 17 → 18. Asserts recovery_rate >= 0.5 AND oracle_passed_rate >= 0.3 on the M182 5-fixture project-scale corpus driven via the live multi-turn Arena harness (crates/ccpa-arena/, companion-repo M196-M210). Measures AGENT QUALITY (does the agent recover when bash fails?), distinct from CCPA-016/017 which measure FUNCTIONAL OUTCOME. The asymmetric give-up-fast synthetic fixture (100% pass BUT zero recovery → FAILS recovery floor) is the canonical R3 distinguishing test.

CHANGE (2): FALSIFY-CCPA-008 (parity_score_bound) soft-deprecated to status: ADVISORY in its summary. Gate STILL enforces aggregate >= 0.95, per-fixture >= 0.80 on the 30 AUTHORED canonical fixtures — only the INTERPRETATION flipped. Reframed from SYSTEM-LEVEL parity validation ("apr code matches claude on real engineering tasks") → METER VALIDATION ("the differ + scorer + per-tool equivalence rules correctly recognize equivalent traces"). The system-level interpretation was empirically FALSIFIED by the M224 Arena bench (see CHANGE (3)). Foreground parity claims move to CCPA-016 (function-scale) + CCPA-017 (project-scale, PROPOSED) + CCPA-018 (Arena recovery-rate, PROPOSED).

CHANGE (3): Records M224 first-operator-dispatched Phase 5 Arena bench result in status_history. Operator ran scripts/phase-5-arena-bench.sh against the M182 5-fixture project-scale corpus three times:
- Run 1 (180s/turn) was noisy (6/10 timeout-killed)
- Run 2 (600s/turn, 2400s wall) — clean teacher, 4/5 student apr-serve errors
- Run 3 (post-aprender#1712 workaround M228) — 2/5 student cleanly completed 20 turns

All three: oracle_passed_rate = 0.0000 for BOTH systems. recovery_rate = 0 for both. Verdict: evaluate_static_vs_arena(1.0, 0.0, ...) → FalsifierOutcome::StaticFalsified.

Important nuance preserved in the status_history reason field: 0/5 for BOTH systems means neither solves these specific tasks under this harness — Axis 2 closure CEILING, not teacher-vs-student gap. The Popperian comparator is deterministic; if a cleaner re-run (post aprender#1712 fix) lifts recovery_rate or oracle_passed_rate, the verdict revises automatically.

Cross-references in this PR:
- companion-repo M194-M210 = Phase 5 P5.1-P5.5 + coverage closure
- companion-repo M208 (PR #195) = the now-obsolete v1.29.0 mirror
- companion-repo M224 (PR #211) = evidence + headline revision
- companion-repo M226 (PR #213) = aprender#1712 + pkill workaround
- companion-repo M230 (PR #215) = soft-deprecation spec rewrite + new docs/specifications/static-fixture-deprecation.md (~140 lines)
- aprender#1712 = apr serve subprocess leak (root cause of the 3 remaining student driver_errors in Run 3)

Tentative threshold values (CCPA-017: 0.3/0.3; CCPA-018: 0.5/0.3) WILL be recalibrated after a cleaner re-run post-aprender#1712 upstream fix.

`pv validate contracts/claude-code-parity-apr-v1.yaml` → 0 errors, 0 warnings.

Pure additive bump (CCPA-018) + interpretation amendment (CCPA-008) + history record (M224). No schema change, no existing gate behavior touched.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
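The verdict quoted in the commit message, `evaluate_static_vs_arena(1.0, 0.0, ...)` → `FalsifierOutcome::StaticFalsified`, elides the remaining arguments, so the following is only a guess at the comparator's shape. The function name and the `StaticFalsified` variant come from the commit message; the two floor parameters, the other two variants, and the decision table are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FalsifierOutcome {
    /// Variant named in the commit message.
    StaticFalsified,
    /// Hypothetical variant: static and Arena evidence agree.
    StaticCorroborated,
    /// Hypothetical variant: the static score makes no parity claim to test.
    StaticInconclusive,
}

/// Assumed signature: the real function's trailing arguments are elided in the source.
/// `static_floor` and `arena_floor` are hypothetical parameters.
pub fn evaluate_static_vs_arena(
    static_score: f64,
    arena_oracle_passed_rate: f64,
    static_floor: f64,
    arena_floor: f64,
) -> FalsifierOutcome {
    let static_claims_parity = static_score >= static_floor;
    let arena_meets_floor = arena_oracle_passed_rate >= arena_floor;
    // Pure function of its inputs, so a re-run with better Arena scores
    // yields a different verdict without any code change.
    match (static_claims_parity, arena_meets_floor) {
        (true, false) => FalsifierOutcome::StaticFalsified,
        (true, true) => FalsifierOutcome::StaticCorroborated,
        (false, _) => FalsifierOutcome::StaticInconclusive,
    }
}
```

With the assumed floors of 0.95 (the CCPA-008 aggregate bound) and 0.3 (the oracle floor), the call `evaluate_static_vs_arena(1.0, 0.0, 0.95, 0.3)` reproduces the recorded `StaticFalsified` outcome.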

…P0-K root cause discovered (#1738)

* contracts(ccpa): v1.28.0 → v1.30.0 — register CCPA-018 + soft-deprecate CCPA-008

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(p2c): SPEC §84 P2-C live findings — audit hypothesis falsified, P0-K root cause discovered (PMAT-681 → PMAT-690)

P2-C 50K-step training dispatched on lambda-vector cuda:0 (2026-05-17):
- Multi-source corpus assembly: 17 min (49.6B tokens, 18.3M docs, 22.5K docs/s)
- Pull the-stack-dedup: ~6 min (28.6 GB / 144 parquet)
- Pull codeparrot-clean: ~6 min (12.8 GB / 54 .json.gz)
- Decompress gz → jsonl: ~25 s
- Tokenize merge → qwen-v3: 17 min (5,000 .bin shards, 4× compute-optimal)
- Training (50K steps requested): EARLY_STOP at 27 epochs / 2700 steps
- Best val_loss: 4.91 @ epoch 20 — IDENTICAL termination shape to §82 (which had 1.24B-token single-source corpus).

The audit's Chinchilla-data-starvation hypothesis is FALSIFIED.

Corpus comparison:
- §82 qwen-v2: 1.24B tokens, 1 source, val_loss best 4.71
- P2-C qwen-v3: 49.6B tokens, 2 sources, val_loss best 4.91

80× more corpus tokens produced 0.2 worse val_loss (likely held-out val set distribution effect, not real regression).

Root cause discovered (NEW P0-K, PMAT-690): apr convert (HF-safetensors → APR import path) does NOT stamp hf_architecture, embedded tokenizer.vocabulary, or tokenizer.merges into the imported APR. The §81-§83 5-PR Class 3 cascade (P0-D/E/F/G/H/J) wired downstream propagation correctly, but had nothing to propagate because the upstream producer was incomplete.

Live P2-C trained checkpoint re-exhibits all 5 prior failures:
- apr qa → "APR missing embedded tokenizer"
- apr bench → PASS (315.6 tok/s — C-03 arch dims are stamped by training even when init didn't have them)
- apr export → 72 qkv biases leak as passthrough (arch stays Llama)
- llama-cli → "cannot find tokenizer merges in model file"

Methodology lesson #33 NEW: upstream metadata defects masquerade as downstream packaging defects. When 5th Class 3 fix is in the same area, pause and check the upstream producer. ~30 min inventorying the producer is cheaper than a 6th, 7th, 8th consumer fix.

Ship %: stays at 79.

Next:
- PMAT-690 P0-K (NEW critical): apr convert stamps hf_architecture + tokenizer.vocabulary + tokenizer.merges. Scope ~100 LOC.
- After P0-K: re-import qwen2.5-coder-0.5b → re-train → re-export → llama-cli should work end-to-end, transitively closing PMAT-679 P0-H.

Files:
- evidence/p2c-2026-05-17/findings.md
- evidence/p2c-2026-05-17/loss-trajectory.tsv (27-epoch trace)
- evidence/p2c-2026-05-17/bench-epoch-020.json (315.6 tok/s)
- evidence/p2c-2026-05-17/epoch-020.metadata.json
- docs/roadmaps/roadmap.yaml (PMAT-681 → completed, PMAT-690 added)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
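The "check the upstream producer" lesson from the P0-K finding can be pictured as a guard that inventories required metadata before any downstream step runs. This is a hedged sketch: the three key names come from the commit message, but treating them as flat string keys, and the names `REQUIRED_APR_METADATA_KEYS` and `missing_metadata`, are assumptions rather than the actual `apr convert` code.

```rust
/// Metadata fields the commit message says the import path fails to stamp.
/// Representing them as flat string keys is an assumption of this sketch.
pub const REQUIRED_APR_METADATA_KEYS: [&str; 3] = [
    "hf_architecture",
    "tokenizer.vocabulary",
    "tokenizer.merges",
];

/// Hypothetical producer-side check: returns the required keys that are
/// absent from `present`, in the order they are declared above.
pub fn missing_metadata(present: &[&str]) -> Vec<&'static str> {
    REQUIRED_APR_METADATA_KEYS
        .iter()
        .copied()
        .filter(|key| !present.contains(key))
        .collect()
}
```

A non-empty result at import time would surface the defect once at the producer, instead of as separate failures in each consumer such as `apr qa`, `apr export`, and `llama-cli`.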

noahgift and others added 17 commits May 15, 2026 16:43

Merge branch 'm190-ccpa017-v1.28.0' into m208-ccpa018-v1.29.0 (16 merge commits):

f8b4f25, c962408, 0856c9f, 149c295, 7eb161a, 47e0f7e, f72b63a, d7a3c45, 72ab398, af36f1e, 934c796, f827d27, 16ff22d, f557822, d67ab6f, f7cf3e4

noahgift deleted the branch m190-ccpa017-v1.28.0 May 16, 2026 06:05

noahgift closed this May 16, 2026

noahgift wants to merge 17 commits into m190-ccpa017-v1.28.0 from m208-ccpa018-v1.29.0
