contracts(ccpa): v1.29.0 — register FALSIFY-CCPA-018 arena_recovery_rate_bound (PROPOSED)#1705

Closed
noahgift wants to merge 17 commits into m190-ccpa017-v1.28.0 from m208-ccpa018-v1.29.0

@noahgift
Contributor

Summary

Bumps contracts/claude-code-parity-apr-v1.yaml v1.28.0 → v1.29.0 to
register FALSIFY-CCPA-018 (arena_recovery_rate_bound) at status:
PROPOSED. Gate count 17 → 18.

Chained on #1684 (v1.28.0 / CCPA-017). Once #1684 merges to main,
this PR's base will retarget automatically.

What changed

  • version: "1.28.0" → "1.29.0" (single field bump)
  • Status comment: "18/18 gates registered ... + 2 PROPOSED (CCPA-017 + CCPA-018)"
  • invariants[] summary: added 1-line CCPA-018 entry
  • falsification_conditions[]: new full FALSIFY-CCPA-018 block (~80 lines)
    with assertion / test_harness / rationale / semantic_change_log
  • status_history[]: new v1.29.0 entry at top with Phase 5 M194-M206
    narrative

Pure additive bump — no schema change, no existing gate touched.

CCPA-018 design

DUAL threshold:

  • recovery_rate >= 0.5 — agent recovers from bash failures
  • oracle_passed_rate >= 0.3 — agent solves real fixtures

The asymmetric give-up-fast synthetic fixture is the canonical R3
distinguishing test: a system that solves easy tasks zero-shot but
never recovers from a hard task's first failure passes CCPA-017 but
FAILS CCPA-018.

CCPA-018 measures agent quality (recovery), distinct from
CCPA-016/017 which measure functional outcome.
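
A minimal sketch of the dual-threshold check described above, assuming the two rates arrive as plain `f64` values. The `ArenaScores` struct, the `GateVerdict` enum, and the `check_ccpa_018` function are hypothetical names introduced for illustration; only the 0.5 / 0.3 floors come from this PR, and the real gate test in the companion repo may be shaped differently.

```rust
/// Tentative POC-tier floors quoted in this PR.
const RECOVERY_RATE_FLOOR: f64 = 0.5;
const ORACLE_PASSED_RATE_FLOOR: f64 = 0.3;

/// Hypothetical container for the two measured rates.
#[derive(Debug, Clone, Copy)]
struct ArenaScores {
    recovery_rate: f64,
    oracle_passed_rate: f64,
}

/// Hypothetical verdict type: records which floor(s) were met on failure.
#[derive(Debug, PartialEq)]
enum GateVerdict {
    Pass,
    Fail { recovery_ok: bool, oracle_ok: bool },
}

/// Both floors must hold; passing only one of them is a failure.
fn check_ccpa_018(scores: ArenaScores) -> GateVerdict {
    let recovery_ok = scores.recovery_rate >= RECOVERY_RATE_FLOOR;
    let oracle_ok = scores.oracle_passed_rate >= ORACLE_PASSED_RATE_FLOOR;
    if recovery_ok && oracle_ok {
        GateVerdict::Pass
    } else {
        GateVerdict::Fail { recovery_ok, oracle_ok }
    }
}
```

Under this sketch, the give-up-fast fixture (every task solved, nothing recovered) maps to `ArenaScores { recovery_rate: 0.0, oracle_passed_rate: 1.0 }` and fails on the recovery floor alone, which is the distinction from CCPA-017 that the design relies on.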

Why PROPOSED

No operator-dispatched Arena bench has produced
`evidence/phase-5/arena-scores.json` yet. The live-evidence test is
`#[ignore]`'d until that file exists. Once the operator runs
`bash scripts/phase-5-arena-bench.sh` and the gate passes against
real data, a v1.30.0 bump will flip PROPOSED → ACTIVE_RUNTIME.
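
For readers unfamiliar with the pattern, here is a rough sketch of how an `#[ignore]`'d live-evidence test can be laid out. The test name `ccpa_018_live_evidence` and the helper `evidence_present` are hypothetical; only the evidence path is taken from this PR, and the actual test in the companion repo is not reproduced here.

```rust
use std::path::Path;

/// Evidence path quoted in this PR description.
const EVIDENCE_PATH: &str = "evidence/phase-5/arena-scores.json";

/// Hypothetical helper: true only when the evidence file exists as a regular file.
fn evidence_present(path: &Path) -> bool {
    path.is_file()
}

/// Skipped by a plain `cargo test`; runs only when ignored tests are requested.
#[test]
#[ignore = "needs evidence from an operator-dispatched Arena bench"]
fn ccpa_018_live_evidence() {
    assert!(
        evidence_present(Path::new(EVIDENCE_PATH)),
        "missing {EVIDENCE_PATH}; run the Arena bench first"
    );
    // Parsing the scores and applying the two floors would follow here.
}
```

With this layout, `cargo test` stays green while the file is absent, and `cargo test -- --ignored` exercises the live check once the bench has written its output.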

Companion-repo Phase 5 sequence

Test plan

  • `pv validate contracts/claude-code-parity-apr-v1.yaml` → "0 error(s), 0 warning(s). Contract is valid."
  • CI `workspace-test` GREEN
  • CI `ci / gate` GREEN
  • Companion-repo M22 5-step ritual mirror PR opens chained on this

🤖 Generated with Claude Code

noahgift and others added 17 commits May 15, 2026 16:43
…recovery_rate_bound (PROPOSED)

Adds 1 new falsification gate to claude-code-parity-apr-v1: CCPA-018
(Arena recovery-rate bound) at status: PROPOSED. Gate count: 17 → 18.

Phase 5 (companion-repo M194-M206) operationalizes design-audit.md
(M192 operator-authored) R2 + R3 recommendations: a live multi-turn
execution harness where the agent gets bash/test feedback per turn
and must recover from failures.

DUAL-threshold design:
- recovery_rate >= 0.5
- oracle_passed_rate >= 0.3

The asymmetric give-up-fast synthetic fixture is the canonical R3
distinguishing test: a system that solves easy tasks zero-shot but
never recovers from a hard task's first failure passes CCPA-017
but FAILS CCPA-018.

CCPA-018 measures AGENT QUALITY (does the agent recover when bash
fails?), distinct from CCPA-016/017 which measure FUNCTIONAL OUTCOME.

Tentative 0.5/0.3 POC-tier floors; recalibration awaits first
operator-dispatched Arena bench against M182 corpus.

CCPA-018 enters at status: PROPOSED because no operator-dispatched
Arena bench has produced evidence/phase-5/arena-scores.json yet.
The live-evidence test is #[ignore]'d until that file exists.
Once the operator runs `bash scripts/phase-5-arena-bench.sh` and
the gate passes against real data, a v1.30.0 bump will flip
PROPOSED → ACTIVE_RUNTIME.

Companion-repo Phase 5 sequence:
- M194 (PR #181 4011bea) — phase-5-arena-runner-plan.md
- M196 (PR #183 6a7fe39) — P5.1 scaffolding
- M200 (PR #187 75ef8e6) — P5.2 multi-turn loop body
- M202 (PR #189 e381d05) — P5.3 bench runner
- M204 (PR #191 aa58ed6) — P5.4 CCPA-018 gate test scaffold
- M206 (PR #193 b95be66) — P5.5 falsifier-of-falsifier

Gate-level statuses post-v1.29.0: 4 ACTIVE_RUNTIME (CCPA-013/014/015/
016) + 2 PROPOSED (CCPA-017 + CCPA-018) + rest at PLANNED_M*/
IN_REVIEW per their lifecycle phase. No OPEN residue.

Pure additive bump: new gate + new status_history entry. No
schema bump. pv validate clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift deleted the branch m190-ccpa017-v1.28.0 May 16, 2026 06:05
@noahgift noahgift closed this May 16, 2026
noahgift added a commit that referenced this pull request May 17, 2026
…te CCPA-008 (#1735)

THREE changes bundled. v1.29.0 is SKIPPED — aprender#1705 (the original
v1.29.0 PR) auto-CLOSED when its base #1684 (v1.28.0) squash-merged
and deleted its feature branch. Companion-repo has been at the v1.29.0
contract YAML since M208 (pin.lock pointed at #1705's feature-branch
HEAD); this v1.30.0 upstream-flip realigns aprender main with companion's
contract content AND adds the M224/M230 deltas the operator-dispatched
Phase 5 Arena bench produced.

CHANGE (1): FALSIFY-CCPA-018 (arena_recovery_rate_bound) added to gate
registry at status: PROPOSED. Gate count: 17 → 18. Asserts
recovery_rate >= 0.5 AND oracle_passed_rate >= 0.3 on the M182 5-fixture
project-scale corpus driven via the live multi-turn Arena harness
(crates/ccpa-arena/, companion-repo M196-M210). Measures AGENT QUALITY
(does the agent recover when bash fails?), distinct from CCPA-016/017
which measure FUNCTIONAL OUTCOME. The asymmetric give-up-fast synthetic
fixture (100% pass BUT zero recovery → FAILS recovery floor) is the
canonical R3 distinguishing test.

CHANGE (2): FALSIFY-CCPA-008 (parity_score_bound) soft-deprecated to
status: ADVISORY in its summary. Gate STILL enforces aggregate >= 0.95,
per-fixture >= 0.80 on the 30 AUTHORED canonical fixtures — only the
INTERPRETATION flipped. Reframed from SYSTEM-LEVEL parity validation
("apr code matches claude on real engineering tasks") → METER VALIDATION
("the differ + scorer + per-tool equivalence rules correctly recognize
equivalent traces"). The system-level interpretation was empirically
FALSIFIED by the M224 Arena bench (see CHANGE (3)). Foreground parity
claims move to CCPA-016 (function-scale) + CCPA-017 (project-scale,
PROPOSED) + CCPA-018 (Arena recovery-rate, PROPOSED).
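
A small sketch of the two-level bound that CCPA-008 still enforces, under the assumption (not stated in this PR) that the aggregate is the arithmetic mean of the per-fixture scores. The function name `parity_score_bound_sketch` is hypothetical; the 0.95 and 0.80 floors are the ones quoted above.

```rust
/// Floors quoted in CHANGE (2).
const AGGREGATE_FLOOR: f64 = 0.95;
const PER_FIXTURE_FLOOR: f64 = 0.80;

/// Returns the aggregate on success, or a message naming the violated floor.
/// Assumption: aggregate = arithmetic mean of per-fixture scores.
fn parity_score_bound_sketch(scores: &[f64]) -> Result<f64, String> {
    if scores.is_empty() {
        return Err("no fixture scores supplied".to_string());
    }
    // Per-fixture floor: a single weak fixture fails the gate.
    if let Some((i, s)) = scores
        .iter()
        .enumerate()
        .find(|(_, s)| **s < PER_FIXTURE_FLOOR)
    {
        return Err(format!(
            "fixture {} scored {:.2}, below per-fixture floor {:.2}",
            i, s, PER_FIXTURE_FLOOR
        ));
    }
    // Aggregate floor: the mean must also clear the higher bar.
    let aggregate = scores.iter().sum::<f64>() / scores.len() as f64;
    if aggregate < AGGREGATE_FLOOR {
        return Err(format!(
            "aggregate {:.4} below floor {:.2}",
            aggregate, AGGREGATE_FLOOR
        ));
    }
    Ok(aggregate)
}
```

The check is the same before and after the soft-deprecation. What changes is only how a pass is read: as evidence that the meter works, not as evidence of system-level parity.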

CHANGE (3): Records M224 first-operator-dispatched Phase 5 Arena bench
result in status_history. Operator ran scripts/phase-5-arena-bench.sh
against the M182 5-fixture project-scale corpus three times:
  - Run 1 (180s/turn) was noisy (6/10 timeout-killed)
  - Run 2 (600s/turn, 2400s wall) — clean teacher, 4/5 student
    apr-serve errors
  - Run 3 (post-aprender#1712 workaround M228) — 2/5 student cleanly
    completed 20 turns
All three: oracle_passed_rate = 0.0000 for BOTH systems. recovery_rate
= 0 for both. Verdict: evaluate_static_vs_arena(1.0, 0.0, ...) →
FalsifierOutcome::StaticFalsified.
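
To make the verdict concrete, a hedged sketch of a deterministic static-vs-arena comparator follows. The source names only `evaluate_static_vs_arena` and `FalsifierOutcome::StaticFalsified` and elides the remaining arguments, so the function `sketch_static_vs_arena`, its two floor parameters, and the `StaticCorroborated` / `Inconclusive` variants are all assumptions made for illustration.

```rust
/// Only `StaticFalsified` appears in the source; the other variants are hypothetical.
#[derive(Debug, PartialEq)]
enum FalsifierOutcome {
    StaticFalsified,
    StaticCorroborated, // hypothetical
    Inconclusive,       // hypothetical
}

/// Hypothetical stand-in for the real comparator, whose full signature is not shown.
/// The static claim is falsified when the static score clears its floor
/// but the live Arena rate does not clear its own.
fn sketch_static_vs_arena(
    static_score: f64,
    arena_oracle_passed_rate: f64,
    static_floor: f64, // assumed parameter
    arena_floor: f64,  // assumed parameter
) -> FalsifierOutcome {
    let static_passes = static_score >= static_floor;
    let arena_passes = arena_oracle_passed_rate >= arena_floor;
    match (static_passes, arena_passes) {
        (true, false) => FalsifierOutcome::StaticFalsified,
        (true, true) => FalsifierOutcome::StaticCorroborated,
        (false, _) => FalsifierOutcome::Inconclusive,
    }
}
```

Because the mapping is a pure function of its inputs, feeding it a higher Arena rate from a later run changes the outcome without any manual reinterpretation.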

Important nuance preserved in the status_history reason field: 0/5 for
BOTH systems means neither solves these specific tasks under this
harness — Axis 2 closure CEILING, not teacher-vs-student gap. The
Popperian comparator is deterministic; if a cleaner re-run (post
aprender#1712 fix) lifts recovery_rate or oracle_passed_rate, the
verdict revises automatically.

Cross-references in this PR:
- companion-repo M194-M210 = Phase 5 P5.1-P5.5 + coverage closure
- companion-repo M208 (PR #195) = the now-obsolete v1.29.0 mirror
- companion-repo M224 (PR #211) = evidence + headline revision
- companion-repo M226 (PR #213) = aprender#1712 + pkill workaround
- companion-repo M230 (PR #215) = soft-deprecation spec rewrite +
  new docs/specifications/static-fixture-deprecation.md (~140 lines)
- aprender#1712 = apr serve subprocess leak (root cause of the 3
  remaining student driver_errors in Run 3)

Tentative threshold values (CCPA-017: 0.3/0.3; CCPA-018: 0.5/0.3) WILL
be recalibrated after a cleaner re-run post-aprender#1712 upstream fix.

`pv validate contracts/claude-code-parity-apr-v1.yaml` → 0 errors, 0
warnings. Pure additive bump (CCPA-018) + interpretation amendment
(CCPA-008) + history record (M224). No schema change, no existing gate
behavior touched.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…P0-K root cause discovered (#1738)

* contracts(ccpa): v1.28.0 → v1.30.0 — register CCPA-018 + soft-deprecate CCPA-008

* docs(p2c): SPEC §84 P2-C live findings — audit hypothesis falsified, P0-K root cause discovered (PMAT-681 → PMAT-690)

P2-C 50K-step training dispatched on lambda-vector cuda:0 (2026-05-17):

  Multi-source corpus assembly:  17 min (49.6B tokens, 18.3M docs, 22.5K docs/s)
  Pull the-stack-dedup:           ~6 min  (28.6 GB / 144 parquet)
  Pull codeparrot-clean:          ~6 min  (12.8 GB / 54 .json.gz)
  Decompress gz → jsonl:          ~25 s
  Tokenize merge → qwen-v3:       17 min  (5,000 .bin shards, 4× compute-optimal)
  Training (50K steps requested): EARLY_STOP at 27 epochs / 2700 steps

Best val_loss: 4.91 @ epoch 20 — IDENTICAL termination shape to §82
(which had 1.24B-token single-source corpus). The audit's
Chinchilla-data-starvation hypothesis is FALSIFIED.

Corpus comparison:

  §82 qwen-v2:  1.24B tokens, 1 source, val_loss best 4.71
  P2-C qwen-v3: 49.6B tokens, 2 sources, val_loss best 4.91

40× more corpus tokens produced a val_loss 0.2 worse (likely a
held-out val set distribution effect, not a real regression).

Root cause discovered (NEW P0-K, PMAT-690):

  apr convert (HF-safetensors → APR import path) does NOT stamp
  hf_architecture, embedded tokenizer.vocabulary, or tokenizer.merges
  into the imported APR. The §81-§83 5-PR Class 3 cascade
  (P0-D/E/F/G/H/J) wired downstream propagation correctly, but had
  nothing to propagate because the upstream producer was incomplete.
  Live P2-C trained checkpoint re-exhibits all 5 prior failures:
    - apr qa  → "APR missing embedded tokenizer"
    - apr bench → PASS (315.6 tok/s — C-03 arch dims are stamped
      by training even when init didn't have them)
    - apr export → 72 qkv biases leak as passthrough (arch stays Llama)
    - llama-cli → "cannot find tokenizer merges in model file"
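
One way to catch this class of defect at the producer is a completeness check over the metadata keys named above. The sketch below is illustrative only: `missing_producer_keys` is a hypothetical helper, and it assumes the imported file's metadata keys can be listed as strings, which this commit message does not confirm.

```rust
/// The three fields that, per the finding above, the import path does not stamp.
const REQUIRED_KEYS: [&str; 3] = [
    "hf_architecture",
    "tokenizer.vocabulary",
    "tokenizer.merges",
];

/// Hypothetical helper: returns the required keys absent from `present`,
/// in the order they are listed in `REQUIRED_KEYS`.
fn missing_producer_keys(present: &[&str]) -> Vec<&'static str> {
    REQUIRED_KEYS
        .iter()
        .copied()
        .filter(|required| !present.iter().any(|key| key == required))
        .collect()
}
```

Run against a freshly imported file, a non-empty result would flag the gap before any downstream consumer (`apr qa`, `apr export`, `llama-cli`) gets a chance to fail on it.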

Methodology lesson #33 NEW: upstream metadata defects masquerade as
downstream packaging defects. When a 5th Class 3 fix lands in the same
area, pause and check the upstream producer. ~30 min spent inventorying
the producer is cheaper than a 6th, 7th, 8th consumer fix.

Ship %: stays at 79.

Next:
- PMAT-690 P0-K (NEW critical): apr convert stamps hf_architecture +
  tokenizer.vocabulary + tokenizer.merges. Scope ~100 LOC.
- After P0-K: re-import qwen2.5-coder-0.5b → re-train → re-export →
  llama-cli should work end-to-end, transitively closing PMAT-679 P0-H.

Files:
  evidence/p2c-2026-05-17/findings.md
  evidence/p2c-2026-05-17/loss-trajectory.tsv  (27-epoch trace)
  evidence/p2c-2026-05-17/bench-epoch-020.json  (315.6 tok/s)
  evidence/p2c-2026-05-17/epoch-020.metadata.json
  docs/roadmaps/roadmap.yaml  (PMAT-681 → completed, PMAT-690 added)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>