chore(dogfood): add Gates 13-17 to catch the 2026-05-22 bug cluster by noahgift · Pull Request #1872 · paiml/aprender

noahgift · 2026-05-22T07:15:02Z

Summary

The v0.35.0 release dogfood (2026-05-22) surfaced 4 real bugs that the existing 12 gates didn't catch. This PR codifies each as a falsifier so future dogfood runs reject the same regression class without manual hunting.

New gates

| Gate | Catches | Contract |
|---|---|---|
| G13 Worktree HEAD Sanity | #1862 `apr --version` stale SHA in worktree | `apr-version-traceability-v1` § FALSIFY-VERSION-004 |
| G14 APR → GGUF Export Round-trip | #1865 `apr export` panic on missing `num_layers` | `apr-export-num-layers-v1` |
| G15 validate --quality Sanity | #1866 Grade F on every working model | `apr-validate-quality-threshold-v1` |
| G16 `apr run` Exit Sanity | #1864 secondary: gibberish + exit 0 | `apr-cpu-vs-gpu-output-parity-v1` |
| G17 7B Inference Smoke | #1864 directly: Qwen2.5-7B Q4_K Golden Output | `apr-cpu-vs-gpu-output-parity-v1` |
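
As a rough illustration of the G13 row, a check along these lines could pull the SHA out of the version string and compare it with the worktree HEAD. This is only a sketch: the PR does not show the format of `apr --version` output, so the helper assumes the SHA appears as a standalone run of 7 to 40 hex characters, and `version_sha_matches` is a hypothetical name rather than anything in the skill.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when the first SHA-like token in a version
# string agrees with the expected HEAD SHA (either may be the shorter one).
version_sha_matches() {
  local version_string=$1 head_sha=$2 found
  # Assumption: the SHA is a standalone run of 7-40 lowercase hex characters.
  found=$(printf '%s\n' "$version_string" | grep -oE '\b[0-9a-f]{7,40}\b' | head -n 1)
  [ -n "$found" ] || return 2          # no SHA found in the version string
  case "$head_sha" in
    "$found"*) return 0 ;;             # version SHA is a prefix of HEAD
  esac
  case "$found" in
    "$head_sha"*) return 0 ;;          # HEAD short SHA is a prefix of version SHA
  esac
  return 1                             # stale or mismatched SHA
}
```

A runner might call it as `version_sha_matches "$(apr --version)" "$(git rev-parse --short HEAD)"` from inside the worktree after a rebuild. A non-zero status would then indicate the stale-SHA condition from #1862.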

Pre-Gate methodology note

Two of the 4 bugs filed in the 2026-05-22 session were briefly mis-flagged because the falsifier piped output to `head` and then read `$?`, which returns `head`'s status (0), not the command's. The new note documents the correct pattern and is referenced from each new gate:

    # WRONG: $? is head's exit, not apr's
    apr publish /nonexistent paiml/test 2>&1 | head -8; echo "exit=$?"

    # RIGHT: captures the command's real exit code
    OUT=$(apr publish /nonexistent paiml/test 2>&1); EC=$?

See memory/feedback_test_methodology_can_fake_bugs.md.
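
The pitfall can be reproduced without `apr` at all. The sketch below uses a small shell function, `failing_cmd`, as a stand-in for a failing command (an assumption for illustration only) and contrasts the misleading `$?` after a pipe with the capture pattern from the note.

```bash
#!/usr/bin/env bash
set +e  # errexit would abort at the failing assignment before EC is read

# Stand-in for a failing command (assumption): prints a message, exits 3.
failing_cmd() { echo "error: something broke"; return 3; }

# WRONG: $? after a pipeline is the status of the last stage (head).
failing_cmd 2>&1 | head -8 >/dev/null
PIPE_EC=$?

# RIGHT: capture the output first and read $? immediately.
OUT=$(failing_cmd 2>&1); EC=$?

# Trimming for display happens afterwards, so it cannot mask the exit code.
printf '%s\n' "$OUT" | head -8
echo "pipe=$PIPE_EC captured=$EC"
```

With `pipefail` off (the shell default), the last line reads `pipe=0 captured=3`. In bash, reading `${PIPESTATUS[0]}` immediately after the pipeline is another way to recover the first stage's status, though the capture pattern also works in plain POSIX shells.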

Verdict update

Gates 1-17 now enumerated. FAIL conditions include the new failure classes (panic, exit-code lie, silent gibberish, stale --version, validate false-negative).
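
One way the "silent gibberish" and "exit-code lie" classes could be told apart is to count chat-template markers in the captured output and cross-check the exit code. The sketch below is illustrative only: `classify_run`, the threshold of 3 markers, and the verdict labels are assumptions, not the skill's actual logic. The `<|im_start|>` marker is taken from #1864, and 101 is the status a Rust binary exits with after a panic.

```bash
#!/usr/bin/env bash
# Hypothetical classifier: maps (exit code, captured output) to a verdict label.
classify_run() {
  local ec=$1 out=$2 markers
  # grep -o prints one line per occurrence, so wc -l counts the markers.
  markers=$(printf '%s\n' "$out" | { grep -oF '<|im_start|>' || true; } | wc -l)
  markers=$((markers))                 # normalise whitespace from wc
  if [ "$ec" -eq 101 ]; then
    echo "FAIL:panic"                  # Rust panics exit with status 101
  elif [ "$markers" -ge 3 ] && [ "$ec" -eq 0 ]; then
    echo "FAIL:exit-code-lie"          # gibberish, yet success was reported
  elif [ "$markers" -ge 3 ]; then
    echo "FAIL:gibberish"              # gibberish with an honest non-zero exit
  elif [ "$ec" -ne 0 ]; then
    echo "FAIL:exit=$ec"
  else
    echo "PASS"
  fi
}
```

The function expects the exit code and output to have been captured with the `OUT=$(...); EC=$?` pattern, for example `classify_run "$EC" "$OUT"`.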

Test plan

- SKILL.md parses as expected (`grep "^## Gate"` shows 17 gates)
- Each new gate references its contract YAML in tree
- Pre-Gate methodology note is the first thing a runner sees about exit codes
- CI: this is a .claude/skills/ change, no Rust tests affected

🤖 Generated with Claude Code

The v0.35.0 release dogfood (2026-05-22) surfaced 4 real bugs that the existing 12 gates didn't catch. This PR codifies each as a falsifier so future dogfood runs reject the same regression class without manual hunting.

New gates:

**G13: Worktree HEAD Sanity** (#1862)
Verify `apr --version` SHA matches `git rev-parse --short HEAD` after rebuild, and that build.rs uses `git rev-parse --git-dir / --git-common-dir` instead of a hardcoded `../../.git/HEAD` path (which doesn't exist in a worktree layout).
Contract: apr-version-traceability-v1 § FALSIFY-VERSION-004.

**G14: APR → GGUF Export Round-trip** (#1865)
Every .apr file in ~/models must export to .gguf without panic. Exit 5 (clean ValidationFailed) is acceptable; exit 101 (panic) is a FAIL.
Contract: apr-export-num-layers-v1.

**G15: validate --quality Sanity** (#1866)
When `apr qa` says ✓ ALL GATES PASSED, `apr validate --quality` must NOT exit non-zero. The threshold gate cannot count stubbed `Skip(Not implemented)` checks against working models.
Contract: apr-validate-quality-threshold-v1.

**G16: `apr run` Exit Code Reflects Output Validity** (#1864 secondary)
When `apr run` emits chat-template gibberish (repeated `<|im_start|>` etc.), exit must be non-zero. Catches the silent-success failure mode where a partial GPU fallback produced wrong output but `apr run` exited 0.
Contract: apr-cpu-vs-gpu-output-parity-v1.

**G17: 7B Inference Smoke** (#1864 directly)
Re-exercise `apr qa` Golden Output gate on the canonical 7B Q4_K model (the README's headline 225 tok/s RTX 4090 configuration). FAILs on the cuBLAS FP8 regression that #1864 captured.

Pre-Gate methodology note added:

**Exit-code capture**. Two of the 4 bugs filed in the 2026-05-22 session were briefly mis-flagged because the falsifier piped output to `head` and then read `$?`, which returns head's status (0), not the command's. The note documents the correct `OUT=$(...); EC=$?` pattern and is referenced from each new gate.

See memory/feedback_test_methodology_can_fake_bugs.md.

Verdict section updated: 17 gates, FAIL conditions enumerated by issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
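
G14's exit-code rule (0 or 5 acceptable, 101 a FAIL) can be sketched as a loop over `.apr` files. The exact `apr export` invocation is not shown in this PR, so the sketch takes the command as arguments instead of hardcoding one. `export_gate` is a hypothetical name, and treating any other exit code as a FAIL is an assumption rather than something the gate text states.

```bash
#!/usr/bin/env bash
# Hypothetical G14-style loop. Usage: export_gate <dir> <command...>
# The command is run once per *.apr file with the file path appended.
export_gate() {
  local dir=$1; shift
  local f out ec failures=0
  for f in "$dir"/*.apr; do
    [ -e "$f" ] || continue            # glob matched nothing
    ec=0
    out=$("$@" "$f" 2>&1) || ec=$?     # capture the command's own exit code
    case "$ec" in
      0|5) echo "PASS $f (exit=$ec)" ;;   # success or clean ValidationFailed
      101) echo "FAIL $f (panic, exit=101)"
           failures=$((failures + 1)) ;;
      *)   echo "FAIL $f (unexpected exit=$ec)"   # assumption: treat as FAIL
           printf '%s\n' "$out" >&2
           failures=$((failures + 1)) ;;
    esac
  done
  [ "$failures" -eq 0 ]                # non-zero status if any file failed
}
```

The first argument would be the models directory (`~/models` in the gate description), followed by whatever export command line the gate actually uses.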

…ly cron (#1875)

Adds an end-to-end "Qwen story" that exercises every core apr command group against the Qwen scale ladder (0.5B → 1.5B → 7B → 30B-MoE). The story is the single canonical demo in README.md AND a regression gate via runnable script + falsification contract + nightly cron.

## Beats

1. **Discover** (Registry): pull, list
2. **Trust** (QA): qa, validate, lint
3. **Explore** (Inspection): inspect, tensors, tree
4. **Adapt** (Model ops): export, diff, convert/quantize
5. **Use** (Inference): run, chat, code
6. **Serve** (REST): serve run + curl /v1/chat/completions OpenAI-compat
7. **Operate** (Profiling): profile, gpu, serve plan (7B Q4K GGUF)
8. **Scale** (MoE): inspect, tensors on 30B-MoE qwen3moe

## Pmat bug-hunt layer

When run with `PMAT_HUNT=1` (default), each beat emits a structured manifest of high-risk untested code in the command-handler modules it just exercised:

    -- pmat bug-hunt manifest (run chat code) --
    gap crates/apr-cli/src/commands/run.rs:resolve_model_alias (impact=42.3)
    churn crates/apr-cli/src/commands/code.rs:dispatch_agent (commits=11)
    fault crates/aprender-serve/src/api/cuda_chat_backend.rs:try_qwen3_moe (unwrap)

The nightly cron uploads this manifest as an artifact, compares against the previous successful run, and opens (or comments on) a tracking issue when growth exceeds 5 lines, so untested branches in command handlers can't accumulate quietly.

## Files

- `scripts/qwen-story.sh` (336 LOC): runnable story with proper exit-code capture (`OUT=$(cmd); EC=$?` everywhere; no pipe-then-`$?` per memory rule)
- `contracts/qwen-story-v1.yaml`: 3 equations + 8 falsifiers, all PASS locally (script exists+executable, 8 beats, run_cmd helper, pmat_hunt per beat, README link, daily cron file, bashrs clean, Beat 7 skips `apr qa` on 7B Q4K due to #1864)
- `README.md`: new `## A Qwen story` section replacing the flat `## CLI examples` block. Fixes two README bugs surfaced during dogfood: `apr profile --roofline` (no such flag; just `apr profile <file>`) and `apr bench --assert-tps` (flag is on `apr qa`, not `bench`).
- `.github/workflows/qwen-story-daily.yml`: self-hosted GPU runner, 04:17 UTC cron + workflow_dispatch, uploads pmat manifest + story log artifacts, files tracking issue when story regresses or manifest grows.

## Verification

    $ bash scripts/qwen-story.sh   # local smoke
    -- Beat 1: Discover (Registry) --
    ✓ PASS B1 list
    -- Beat 2: Trust (QA gates) --
    ✓ PASS B2 apr qa
    ✗ FAIL B2 apr validate --quality - exit=5 (after #1866 fix this should be 0)
    -- Beat 3: Explore (Inspection) --
    ✓ PASS B3 apr inspect --json (arch=qwen2)
    ✓ PASS B3 apr tensors --json (339 tensors)
    ✓ PASS B3 apr tree
    -- Beat 4: Adapt (Model ops) --
    ✗ FAIL B4 apr export - PANIC (exit=101) - #1865 regression
    -- Beat 5: Use (Inference) --
    ✓ PASS B5 apr run (Rust code completion)
    ✓ PASS B5 apr code -p
    -- Beat 6: Serve (REST API) --
    ✓ PASS B6 apr serve run (port=22915)
    ✓ PASS B6 /v1/chat/completions (got OK...)
    -- Beat 7: Operate (Profiling) --
    ✓ PASS B7 apr profile
    ✓ PASS B7 apr gpu --json
    ✓ PASS B7 apr serve plan -- 7B VRAM budget
    -- Beat 8: Scale (MoE introspection) --
    ✓ PASS B8 apr inspect --json (arch=qwen3moe)
    ✓ PASS B8 apr tensors --json (579 tensors)
    14 PASS / 2 FAIL / 0 SKIP

The 2 FAILs are EXPECTED until the in-flight fixes land:

- B2 validate --quality: closed by #1870
- B4 export panic: closed by #1868

Once those PRs merge, this story will be 16 PASS / 0 FAIL / 0 SKIP on a host with all 4 Qwen models cached.

## Follow-up

A separate PR will add `/dogfood` Gate 18 that invokes this script (kept separate to avoid conflict with PR #1872 which is already adding Gates 13-17 to the dogfood skill).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
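
The manifest comparison in the nightly cron (flag when growth exceeds 5 lines) might look something like the sketch below. The workflow file itself is not shown here, so this is an assumption about the approach: it compares raw line counts, whereas the real job may diff entries instead. `manifest_growth_exceeds` and both file arguments are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check: exit 0 when the current manifest has grown by more than
# a threshold number of lines relative to the previous one, exit 1 otherwise.
manifest_growth_exceeds() {
  local previous=$1 current=$2 threshold=${3:-5}
  local prev_lines cur_lines growth
  prev_lines=$(wc -l < "$previous")
  cur_lines=$(wc -l < "$current")
  growth=$((cur_lines - prev_lines))   # negative when the manifest shrank
  echo "growth=$growth lines (threshold=$threshold)"
  [ "$growth" -gt "$threshold" ]
}
```

A cron step could then branch on the status, for example `if manifest_growth_exceeds prev.txt cur.txt; then ...; fi`, with the two paths pointing at the downloaded artifact and the freshly generated manifest.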

noahgift enabled auto-merge (squash) May 22, 2026 07:15

noahgift mentioned this pull request May 22, 2026

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

Open

Merge branch 'main' into chore/dogfood-skill-add-gates-13-17

1855f19

noahgift mentioned this pull request May 22, 2026

feat(qwen-story): 8-beat E2E narrative + pmat bug-hunt + daily cron #1875

Merged

7 tasks

noahgift merged commit ed645ce into main May 22, 2026
10 checks passed

noahgift deleted the chore/dogfood-skill-add-gates-13-17 branch May 22, 2026 08:26

noahgift mentioned this pull request May 22, 2026

spec(SPEC-CUBLAS-FP8-7B-FIX-001): epic to root-cause cuBLAS FP8 7B gibberish (holds v0.35.0) #1882

Closed

3 tasks

noahgift merged 2 commits into main from chore/dogfood-skill-add-gates-13-17
