…es-to-apples: spec v3.04.0 → v3.05.0

M-FFN-GGUF-5 fix shipped (aprender PR #1550, MERGED 2026-05-07T05:50) + M-FFN-GGUF-7 multi-layer chain (PR #1548, MERGED 2026-05-07T05:15).

MAJOR PLOT TWIST in M-FFN-GGUF-5 fix PR: §27's 18.23× std-ratio was a TEST METHODOLOGY ARTIFACT, NOT a numerical bug. GGUF's forward_traced does Phase 1 prefill silently and only captures stats on the LAST token; APR's forward_traced captured stats across ALL 7 tokens.

The §27 measurement compared:
  APR std across 7 tokens × 28672 elements
  GGUF std across 1 token × 4096 elements
Fundamentally incomparable. Different counts, different distributions.

Two coherent fixes in PR #1550:
  1. forward_traced uses Q4K+Q8K dispatch (matches production semantics; 7 call sites updated via new matmul_q4k_or_f32_traced helper)
  2. M89 harness compares apples-to-apples last-token-only stats

EMPIRICAL END-TO-END (2026-05-07, RTX 4090, 178s):
  layer-3 ratio = 1.245× → H1 CONFIRMED
  All 28 layers within H1 band [0.5, 2.0]
  15,233 lib tests pass; production hot paths byte-unchanged

The cascade's per-tensor mechanism (M94 0.077%) and compounding (M95 5.70× / M-FFN-GGUF-7 1.81× saturation) ARE real but didn't explain §27's 1723%; that was methodology-inflated.

Methodology lesson #7 NEW (feedback_test_methodology_can_fake_bugs.md): when comparing two implementations via summary statistics, VERIFY both sides measure the same distribution shape BEFORE trusting the comparison. Mismatched shapes can fake bugs.

Total session: 28 PRs / 2 days including 1 actual fix landing.

Discharge potential per §17.5: 5 MODEL-1 PARTIALs (SHIP-002/005/006/007/008) ready for individual discharge follow-ups. MODEL-1 ship % 91% → 96% pending those.

Spec v3.04.0 → v3.05.0. Atomic next action banner update only; full §60 narrative deferred to deliberate session.

Refs PMAT-CCPA, SHIP-007 §22, M91-M103, M-FFN-GGUF-5 PR #1550.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Spec banner update recording the SHIP-007 §22 cascade closure with the actual fix landing.
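What landed
M-FFN-GGUF-5 fix: aprender PR #1550, merged 2026-05-07T05:50.
M-FFN-GGUF-7 multi-layer chain: PR #1548, merged 2026-05-07T05:15.
Spec v3.04.0 → v3.05.0: atomic next-action banner update only; the full §60 narrative is deferred.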
MAJOR PLOT TWIST
§27's 18.23× std-ratio was a TEST METHODOLOGY ARTIFACT, NOT a numerical bug. GGUF's forward_traced only captures stats on the LAST token; APR's captured stats across ALL 7 tokens. Comparing a 7-token APR std against a 1-token GGUF std inflated the apparent magnitude.
Empirical end-to-end (2026-05-07, RTX 4090, 178s)
layer-3 ratio = 1.245× → H1 CONFIRMED
All 28 layers within H1 band [0.5, 2.0]
15,233 lib tests pass; production hot paths byte-unchanged
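To make the artifact concrete, here is a minimal Rust sketch, not aprender code. It assumes a row-major [tokens, hidden] activation buffer and uses invented toy values (3 tokens × 4 elements) in which both "implementations" produce identical activations. The helper names `std_dev`, `last_token`, `std_ratio_same_shape`, and `in_h1_band` are hypothetical. Only the H1 band [0.5, 2.0] is taken from the commit message.

```rust
// Toy illustration only: none of these names or values come from aprender.

/// Population standard deviation, accumulated in f64. `None` for empty input.
fn std_dev(xs: &[f32]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().map(|&x| f64::from(x)).sum::<f64>() / n;
    let var = xs
        .iter()
        .map(|&x| {
            let d = f64::from(x) - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    Some(var.sqrt())
}

/// Last token's slice of a row-major [tokens, hidden] buffer (assumed layout).
fn last_token(acts: &[f32], hidden: usize) -> Option<&[f32]> {
    if hidden == 0 || acts.is_empty() || acts.len() % hidden != 0 {
        return None;
    }
    Some(&acts[acts.len() - hidden..])
}

/// Std ratio a/b that refuses to compare buffers with different element counts.
fn std_ratio_same_shape(a: &[f32], b: &[f32]) -> Result<f64, String> {
    if a.len() != b.len() {
        return Err(format!("shape mismatch: {} vs {} elements", a.len(), b.len()));
    }
    let sa = std_dev(a).ok_or("empty input")?;
    let sb = std_dev(b).ok_or("empty input")?;
    if sb == 0.0 {
        return Err("zero std in denominator".to_string());
    }
    Ok(sa / sb)
}

/// H1 acceptance band quoted in the commit message: ratio in [0.5, 2.0].
fn in_h1_band(ratio: f64) -> bool {
    (0.5..=2.0).contains(&ratio)
}

fn main() {
    let hidden = 4;
    // Hypothetical activations: early tokens are large, the last token is small.
    let acts: Vec<f32> = vec![
        10.0, -10.0, 10.0, -10.0, // token 0
        5.0, -5.0, 5.0, -5.0, // token 1
        1.0, -1.0, 1.0, -1.0, // token 2 (last)
    ];
    let last = last_token(&acts, hidden).expect("non-empty, aligned buffer");

    // Mismatched comparison: all-token std on one side, last-token std on the other.
    let naive = std_dev(&acts).unwrap() / std_dev(last).unwrap();
    println!("mismatched ratio = {naive:.3}, in H1 band: {}", in_h1_band(naive));

    // The guard rejects the mismatched comparison outright.
    println!("guarded mismatched: {:?}", std_ratio_same_shape(&acts, last));

    // Apples-to-apples: last-token-only stats on both sides.
    let fair = std_ratio_same_shape(last, last).unwrap();
    println!("last-token ratio = {fair:.3}, in H1 band: {}", in_h1_band(fair));
}
```

With identical data on both sides, the mismatched ratio still falls outside the band, while the last-token-only ratio is exactly 1. Returning an error on differing element counts is one possible way to keep a harness from producing the first number at all.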
Total session
28 PRs across 2 days, including 1 actual fix landing. The cascade's findings (M94 0.077% per-matvec mechanism, M95 5.70× compounding, M-FFN-GGUF-7 1.81× saturation) ARE real numerical findings, but the 1723% magnitude in §27 that made the bug look severe was inflated by test methodology.
Discharge potential
5 MODEL-1 PARTIALs (SHIP-002/005/006/007/008) ready for individual discharge follow-ups per §17.5. MODEL-1 ship % 91% → 96% pending those.
Methodology lesson #7 NEW
feedback_test_methodology_can_fake_bugs.md: when comparing two implementations via summary statistics, VERIFY both sides measure the same distribution shape BEFORE trusting the comparison.
Status changes
Test plan
🤖 Generated with Claude Code