
docs(p2c): SPEC §84 P2-C live findings — audit hypothesis falsified, P0-K root cause discovered #1738

Merged
noahgift merged 3 commits into main from docs/p2c-2026-05-17-evidence
May 17, 2026

Conversation

@noahgift
Contributor

Summary

P2-C 50K-step training dispatched on lambda-vector (2026-05-17) falsifies the §83 audit hypothesis that corpus diversity was the binding constraint.

| Run | Corpus | Tokens | Sources | Best val_loss | Termination |
|------|---------|--------|---------|---------------|-------------------------------|
| §82 | qwen-v2 | 1.24B | 1 | 4.71 | EARLY_STOP 27 ep / 2700 steps |
| P2-C | qwen-v3 | 49.6B | 2 | 4.91 | EARLY_STOP 27 ep / 2700 steps |

Identical termination shape, with +0.2 val_loss despite 40× more corpus tokens (49.6B vs 1.24B). The audit's Chinchilla data-starvation hypothesis is falsified.
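Using the token counts and losses from the table, the comparison arithmetic can be checked directly (values copied from the run records above):

```python
# Sanity-check the §82 vs P2-C comparison from the table above.
tokens_s82 = 1.24e9   # §82 qwen-v2 corpus tokens
tokens_p2c = 49.6e9   # P2-C qwen-v3 corpus tokens
val_s82, val_p2c = 4.71, 4.91

ratio = tokens_p2c / tokens_s82
delta = val_p2c - val_s82
print(f"corpus ratio: {ratio:.0f}x")     # 40x
print(f"val_loss delta: {delta:+.2f}")   # +0.20
```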

Root cause discovered (NEW P0-K)

The fresh P2-C checkpoint re-exhibits every prior P0-D/E/F/G/H failure because their shared upstream — apr convert (HF safetensors → APR import path) — never stamps:

  • hf_architecture
  • embedded tokenizer.vocabulary
  • embedded tokenizer.merges

§81-§83's 5-PR Class 3 cascade wired downstream propagation correctly but had nothing to propagate because the upstream producer was incomplete.
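A preflight check on the producer side would have surfaced this directly. The sketch below is illustrative, not from the apr codebase: the dotted keys mirror the bullet list above, and `missing_stamps` is a hypothetical helper that walks a nested metadata dict.

```python
# Hedged sketch (not the actual apr implementation): verify an imported
# APR carries the three upstream stamps that `apr convert` omits per the
# P0-K finding. Keys mirror the bullet list above.
REQUIRED_STAMPS = (
    "hf_architecture",
    "tokenizer.vocabulary",
    "tokenizer.merges",
)

def missing_stamps(metadata: dict) -> list[str]:
    """Return the dotted keys absent from a nested metadata dict."""
    missing = []
    for key in REQUIRED_STAMPS:
        node = metadata
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(key)
                break
            node = node[part]
    return missing

# A P2-C-style checkpoint whose producer never stamped the fields:
print(missing_stamps({"arch_dims": {"n_layer": 24}}))
# → ['hf_architecture', 'tokenizer.vocabulary', 'tokenizer.merges']
```

Had such a check run at import time, every downstream P0-D/E/F/G/H consumer fix would have been recognized as propagating an empty upstream.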

Methodology lesson #33 NEW

"Upstream metadata defects masquerade as downstream packaging defects." When the 5th Class 3 fix lands in the same area, pause and check the upstream producer: ~30 minutes inventorying the producer is cheaper than a 6th, 7th, or 8th consumer fix.

Evidence

  • evidence/p2c-2026-05-17/findings.md — full Five-Whys + ship % analysis
  • evidence/p2c-2026-05-17/loss-trajectory.tsv — 27-epoch trace
  • evidence/p2c-2026-05-17/bench-epoch-020.json — 315.6 tok/s
  • evidence/p2c-2026-05-17/epoch-020.metadata.json — best checkpoint metadata

Next

PMAT-690 P0-K (critical): make apr convert stamp hf_architecture + embedded tokenizer. ~100 LOC. Unblocks all downstream + retroactively makes the §81-§83 cascade work end-to-end on real-world imported models.

🤖 Generated with Claude Code

noahgift and others added 2 commits May 17, 2026 08:56
…te CCPA-008

THREE changes bundled. v1.29.0 is SKIPPED — aprender#1705 (the original
v1.29.0 PR) auto-CLOSED when its base #1684 (v1.28.0) squash-merged
and deleted its feature branch. Companion-repo has been at the v1.29.0
contract YAML since M208 (pin.lock pointed at #1705's feature-branch
HEAD); this v1.30.0 upstream-flip realigns aprender main with companion's
contract content AND adds the M224/M230 deltas the operator-dispatched
Phase 5 Arena bench produced.

CHANGE (1): FALSIFY-CCPA-018 (arena_recovery_rate_bound) added to gate
registry at status: PROPOSED. Gate count: 17 → 18. Asserts
recovery_rate >= 0.5 AND oracle_passed_rate >= 0.3 on the M182 5-fixture
project-scale corpus driven via the live multi-turn Arena harness
(crates/ccpa-arena/, companion-repo M196-M210). Measures AGENT QUALITY
(does the agent recover when bash fails?), distinct from CCPA-016/017
which measure FUNCTIONAL OUTCOME. The asymmetric give-up-fast synthetic
fixture (100% pass BUT zero recovery → FAILS recovery floor) is the
canonical R3 distinguishing test.
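The gate predicate described above can be sketched as a plain conjunction of the two floors; the function name is illustrative, not from the ccpa-arena crate:

```python
# Hedged sketch of the FALSIFY-CCPA-018 predicate described above:
# both floors must hold for the gate to pass.
def ccpa_018_passes(recovery_rate: float, oracle_passed_rate: float) -> bool:
    return recovery_rate >= 0.5 and oracle_passed_rate >= 0.3

# The asymmetric give-up-fast synthetic fixture: perfect functional
# outcome, zero recovery -> fails the recovery floor.
print(ccpa_018_passes(recovery_rate=0.0, oracle_passed_rate=1.0))  # False
```

This is exactly why the gate measures agent quality rather than functional outcome: a fixture can pass every oracle while never once recovering from a failed bash step.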

CHANGE (2): FALSIFY-CCPA-008 (parity_score_bound) soft-deprecated to
status: ADVISORY in its summary. Gate STILL enforces aggregate >= 0.95,
per-fixture >= 0.80 on the 30 AUTHORED canonical fixtures — only the
INTERPRETATION flipped. Reframed from SYSTEM-LEVEL parity validation
("apr code matches claude on real engineering tasks") → METER VALIDATION
("the differ + scorer + per-tool equivalence rules correctly recognize
equivalent traces"). The system-level interpretation was empirically
FALSIFIED by the M224 Arena bench (see CHANGE (3)). Foreground parity
claims move to CCPA-016 (function-scale) + CCPA-017 (project-scale,
PROPOSED) + CCPA-018 (Arena recovery-rate, PROPOSED).

CHANGE (3): Records M224 first-operator-dispatched Phase 5 Arena bench
result in status_history. Operator ran scripts/phase-5-arena-bench.sh
against the M182 5-fixture project-scale corpus three times:
  - Run 1 (180s/turn) was noisy (6/10 timeout-killed)
  - Run 2 (600s/turn, 2400s wall) — clean teacher, 4/5 student
    apr-serve errors
  - Run 3 (post-aprender#1712 workaround M228) — 2/5 student cleanly
    completed 20 turns
All three: oracle_passed_rate = 0.0000 for BOTH systems. recovery_rate
= 0 for both. Verdict: evaluate_static_vs_arena(1.0, 0.0, ...) →
FalsifierOutcome::StaticFalsified.

Important nuance preserved in the status_history reason field: 0/5 for
BOTH systems means neither solves these specific tasks under this
harness — Axis 2 closure CEILING, not teacher-vs-student gap. The
Popperian comparator is deterministic; if a cleaner re-run (post
aprender#1712 fix) lifts recovery_rate or oracle_passed_rate, the
verdict revises automatically.
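The determinism claim above can be illustrated with a minimal Python sketch; the real `evaluate_static_vs_arena` lives elsewhere in the gate code, so the enum variants follow the commit text while the signature and thresholds here are assumptions:

```python
# Hedged sketch of the Popperian comparator: static fixtures claim
# parity, the live Arena contradicts them, so the static claim is
# falsified. Pure function of its inputs -> the verdict revises
# automatically on a cleaner re-run.
from enum import Enum

class FalsifierOutcome(Enum):
    STATIC_FALSIFIED = "StaticFalsified"
    CONSISTENT = "Consistent"

def evaluate_static_vs_arena(static_parity: float,
                             arena_oracle_rate: float,
                             parity_floor: float = 0.95,
                             oracle_floor: float = 0.3) -> FalsifierOutcome:
    if static_parity >= parity_floor and arena_oracle_rate < oracle_floor:
        return FalsifierOutcome.STATIC_FALSIFIED
    return FalsifierOutcome.CONSISTENT

# M224 inputs: perfect static parity, zero Arena oracle passes.
print(evaluate_static_vs_arena(1.0, 0.0))
# → FalsifierOutcome.STATIC_FALSIFIED
```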

Cross-references in this PR:
- companion-repo M194-M210 = Phase 5 P5.1-P5.5 + coverage closure
- companion-repo M208 (PR #195) = the now-obsolete v1.29.0 mirror
- companion-repo M224 (PR #211) = evidence + headline revision
- companion-repo M226 (PR #213) = aprender#1712 + pkill workaround
- companion-repo M230 (PR #215) = soft-deprecation spec rewrite +
  new docs/specifications/static-fixture-deprecation.md (~140 lines)
- aprender#1712 = apr serve subprocess leak (root cause of the 3
  remaining student driver_errors in Run 3)

Tentative threshold values (CCPA-017: 0.3/0.3; CCPA-018: 0.5/0.3) WILL
be recalibrated after a cleaner re-run post-aprender#1712 upstream fix.

`pv validate contracts/claude-code-parity-apr-v1.yaml` → 0 errors, 0
warnings. Pure additive bump (CCPA-018) + interpretation amendment
(CCPA-008) + history record (M224). No schema change, no existing gate
behavior touched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…P0-K root cause discovered (PMAT-681 → PMAT-690)

P2-C 50K-step training dispatched on lambda-vector cuda:0 (2026-05-17):

  Multi-source corpus assembly:  17 min (49.6B tokens, 18.3M docs, 22.5K docs/s)
  Pull the-stack-dedup:           ~6 min  (28.6 GB / 144 parquet)
  Pull codeparrot-clean:          ~6 min  (12.8 GB / 54 .json.gz)
  Decompress gz → jsonl:          ~25 s
  Tokenize merge → qwen-v3:       17 min  (5,000 .bin shards, 4× compute-optimal)
  Training (50K steps requested): EARLY_STOP at 27 epochs / 2700 steps

Best val_loss: 4.91 @ epoch 20 — IDENTICAL termination shape to §82
(which had 1.24B-token single-source corpus). The audit's
Chinchilla-data-starvation hypothesis is FALSIFIED.

Corpus comparison:

  §82 qwen-v2:  1.24B tokens, 1 source, val_loss best 4.71
  P2-C qwen-v3: 49.6B tokens, 2 sources, val_loss best 4.91

40× more corpus tokens produced a 0.2-worse best val_loss (likely a
held-out val-set distribution effect, not a real regression).

Root cause discovered (NEW P0-K, PMAT-690):

  apr convert (HF-safetensors → APR import path) does NOT stamp
  hf_architecture, embedded tokenizer.vocabulary, or tokenizer.merges
  into the imported APR. The §81-§83 5-PR Class 3 cascade
  (P0-D/E/F/G/H/J) wired downstream propagation correctly, but had
  nothing to propagate because the upstream producer was incomplete.
  Live P2-C trained checkpoint re-exhibits all 5 prior failures:
    - apr qa  → "APR missing embedded tokenizer"
    - apr bench → PASS (315.6 tok/s — C-03 arch dims are stamped
      by training even when init didn't have them)
    - apr export → 72 qkv biases leak as passthrough (arch stays Llama)
    - llama-cli → "cannot find tokenizer merges in model file"

Methodology lesson #33 NEW: upstream metadata defects masquerade as
downstream packaging defects. When the 5th Class 3 fix lands in the
same area, pause and check the upstream producer. ~30 min inventorying
the producer is cheaper than a 6th, 7th, or 8th consumer fix.

Ship %: stays at 79.

Next:
- PMAT-690 P0-K (NEW critical): apr convert stamps hf_architecture +
  tokenizer.vocabulary + tokenizer.merges. Scope ~100 LOC.
- After P0-K: re-import qwen2.5-coder-0.5b → re-train → re-export →
  llama-cli should work end-to-end, transitively closing PMAT-679 P0-H.

Files:
  evidence/p2c-2026-05-17/findings.md
  evidence/p2c-2026-05-17/loss-trajectory.tsv  (27-epoch trace)
  evidence/p2c-2026-05-17/bench-epoch-020.json  (315.6 tok/s)
  evidence/p2c-2026-05-17/epoch-020.metadata.json
  docs/roadmaps/roadmap.yaml  (PMAT-681 → completed, PMAT-690 added)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 17, 2026 08:18
@noahgift noahgift merged commit 7e3bc19 into main May 17, 2026
10 checks passed
@noahgift noahgift deleted the docs/p2c-2026-05-17-evidence branch May 17, 2026 08:46