chore(dogfood): add Gates 13-17 to catch the 2026-05-22 bug cluster #1872
Merged
The v0.35.0 release dogfood (2026-05-22) surfaced 4 real bugs that the existing 12 gates didn't catch. This PR codifies each as a falsifier so future dogfood runs reject the same regression class without manual hunting.

New gates:

- **G13: Worktree HEAD Sanity** (#1862). Verify that the `apr --version` SHA matches `git rev-parse --short HEAD` after a rebuild, and that build.rs uses `git rev-parse --git-dir / --git-common-dir` instead of a hardcoded `../../.git/HEAD` path (which doesn't exist in a worktree layout). Contract: apr-version-traceability-v1 § FALSIFY-VERSION-004.
- **G14: APR → GGUF Export Round-trip** (#1865). Every .apr file in ~/models must export to .gguf without a panic. Exit 5 (clean ValidationFailed) is acceptable; exit 101 (panic) is a FAIL. Contract: apr-export-num-layers-v1.
- **G15: validate --quality Sanity** (#1866). When `apr qa` says ✓ ALL GATES PASSED, `apr validate --quality` must NOT exit non-zero. The threshold gate must not count stubbed `Skip(Not implemented)` checks against working models. Contract: apr-validate-quality-threshold-v1.
- **G16: `apr run` Exit Code Reflects Output Validity** (#1864 secondary). When `apr run` emits chat-template gibberish (repeated `<|im_start|>` etc.), the exit code must be non-zero. This catches the silent-success failure mode where a partial GPU fallback produced wrong output but `apr run` exited 0. Contract: apr-cpu-vs-gpu-output-parity-v1.
- **G17: 7B Inference Smoke** (#1864 directly). Re-exercise the `apr qa` Golden Output gate on the canonical 7B Q4_K model (the README's headline 225 tok/s RTX 4090 configuration). FAILs on the cuBLAS FP8 regression that #1864 captured.

Pre-Gate methodology note added: **Exit-code capture**. Two of the 4 bugs filed in the 2026-05-22 session were briefly mis-flagged because the falsifier piped output to `head` and then read `$?`, which returns `head`'s status (0), not the command's. The note documents the correct `OUT=$(...); EC=$?` pattern and is referenced from each new gate. See memory/feedback_test_methodology_can_fake_bugs.md.

Verdict section updated: 17 gates, FAIL conditions enumerated by issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
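The exit-code capture pitfall from the methodology note can be reproduced in a few lines. The following is a minimal sketch, assuming default bash options (no `errexit`, no `pipefail`); `failing_cmd` is a hypothetical stand-in for any command that prints output and exits with a panic-style status, not a real `apr` invocation.

```shell
#!/usr/bin/env bash
# Assumption: default shell options. Under `set -e` the capture line would
# abort the script, and under `pipefail` the pipeline would report 101.
set +e
set +o pipefail

# Hypothetical stand-in for a failing command: prints two lines, exits 101.
failing_cmd() {
  printf 'line 1\nline 2\n'
  return 101
}

# Wrong: $? is the status of `head`, the last command in the pipeline.
failing_cmd | head -n 1 >/dev/null
WRONG_EC=$?

# Right: capture the output first, read the status, then trim for display.
OUT=$(failing_cmd)
EC=$?
FIRST_LINE=$(printf '%s\n' "$OUT" | head -n 1)

echo "pipe-then-\$? reported: $WRONG_EC"
echo "capture-then-\$? reported: $EC"
echo "first line of output: $FIRST_LINE"
```

Capturing into a variable keeps the full output available for later inspection, so a falsifier can still show only the first few lines without losing the command's own exit status.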
noahgift added a commit that referenced this pull request on May 22, 2026
…ly cron (#1875)

Adds an end-to-end "Qwen story" that exercises every core apr command group against the Qwen scale ladder (0.5B → 1.5B → 7B → 30B-MoE). The story is the single canonical demo in README.md AND a regression gate via runnable script + falsification contract + nightly cron.

## Beats

1. **Discover** (Registry): pull, list
2. **Trust** (QA): qa, validate, lint
3. **Explore** (Inspection): inspect, tensors, tree
4. **Adapt** (Model ops): export, diff, convert/quantize
5. **Use** (Inference): run, chat, code
6. **Serve** (REST): serve run + curl /v1/chat/completions (OpenAI-compat)
7. **Operate** (Profiling): profile, gpu, serve plan (7B Q4K GGUF)
8. **Scale** (MoE): inspect, tensors on 30B-MoE qwen3moe

## Pmat bug-hunt layer

When run with `PMAT_HUNT=1` (default), each beat emits a structured manifest of high-risk untested code in the command-handler modules it just exercised:

    -- pmat bug-hunt manifest (run chat code) --
    gap crates/apr-cli/src/commands/run.rs:resolve_model_alias (impact=42.3)
    churn crates/apr-cli/src/commands/code.rs:dispatch_agent (commits=11)
    fault crates/aprender-serve/src/api/cuda_chat_backend.rs:try_qwen3_moe (unwrap)

The nightly cron uploads this manifest as an artifact, compares it against the previous successful run, and opens (or comments on) a tracking issue when growth exceeds 5 lines, so untested branches in command handlers can't accumulate quietly.

## Files

- `scripts/qwen-story.sh` (336 LOC): runnable story with proper exit-code capture (`OUT=$(cmd); EC=$?` everywhere; no pipe-then-`$?` per memory rule)
- `contracts/qwen-story-v1.yaml`: 3 equations + 8 falsifiers, all PASS locally (script exists+executable, 8 beats, run_cmd helper, pmat_hunt per beat, README link, daily cron file, bashrs clean, Beat 7 skips `apr qa` on 7B Q4K due to #1864)
- `README.md`: new `## A Qwen story` section replacing the flat `## CLI examples` block. Fixes two README bugs surfaced during dogfood: `apr profile --roofline` (no such flag; just `apr profile <file>`) and `apr bench --assert-tps` (flag is on `apr qa`, not `bench`).
- `.github/workflows/qwen-story-daily.yml`: self-hosted GPU runner, 04:17 UTC cron + workflow_dispatch, uploads pmat manifest + story log artifacts, files tracking issue when story regresses or manifest grows.

## Verification

    $ bash scripts/qwen-story.sh # local smoke
    -- Beat 1: Discover (Registry) --
    ✓ PASS B1 list
    -- Beat 2: Trust (QA gates) --
    ✓ PASS B2 apr qa
    ✗ FAIL B2 apr validate --quality - exit=5 (after #1866 fix this should be 0)
    -- Beat 3: Explore (Inspection) --
    ✓ PASS B3 apr inspect --json (arch=qwen2)
    ✓ PASS B3 apr tensors --json (339 tensors)
    ✓ PASS B3 apr tree
    -- Beat 4: Adapt (Model ops) --
    ✗ FAIL B4 apr export - PANIC (exit=101) - #1865 regression
    -- Beat 5: Use (Inference) --
    ✓ PASS B5 apr run (Rust code completion)
    ✓ PASS B5 apr code -p
    -- Beat 6: Serve (REST API) --
    ✓ PASS B6 apr serve run (port=22915)
    ✓ PASS B6 /v1/chat/completions (got OK...)
    -- Beat 7: Operate (Profiling) --
    ✓ PASS B7 apr profile
    ✓ PASS B7 apr gpu --json
    ✓ PASS B7 apr serve plan -- 7B VRAM budget
    -- Beat 8: Scale (MoE introspection) --
    ✓ PASS B8 apr inspect --json (arch=qwen3moe)
    ✓ PASS B8 apr tensors --json (579 tensors)
    14 PASS / 2 FAIL / 0 SKIP

The 2 FAILs are EXPECTED until the in-flight fixes land:

- B2 validate --quality: closed by #1870
- B4 export panic: closed by #1868

Once those PRs merge, this story will be 16 PASS / 0 FAIL / 0 SKIP on a host with all 4 Qwen models cached.

## Follow-up

A separate PR will add `/dogfood` Gate 18 that invokes this script (kept separate to avoid conflict with PR #1872, which is already adding Gates 13-17 to the dogfood skill).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
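The manifest-growth rule in the Pmat bug-hunt layer (a tracking issue when growth exceeds 5 lines) can be sketched as a line-count comparison. This is an illustration only: `manifest_growth` and `should_file_issue` are hypothetical helper names, and the actual comparison in `qwen-story-daily.yml` may work differently.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical, not the workflow's own logic.

# Print how many lines the current manifest gained over the previous one.
# Arguments: <previous manifest file> <current manifest file>
manifest_growth() {
  local prev_lines cur_lines
  prev_lines=$(wc -l < "$1")
  cur_lines=$(wc -l < "$2")
  echo $(( cur_lines - prev_lines ))
}

# Succeed (exit 0) only when growth exceeds the 5-line threshold.
should_file_issue() {
  [ "$(manifest_growth "$1" "$2")" -gt 5 ]
}

# Demo with throwaway files: 2 entries before, 8 entries now.
prev=$(mktemp)
cur=$(mktemp)
printf 'gap a\nchurn b\n' > "$prev"
printf 'gap a\nchurn b\nfault c\nfault d\ngap e\ngap f\nchurn g\nfault h\n' > "$cur"

GROWTH=$(manifest_growth "$prev" "$cur")
if should_file_issue "$prev" "$cur"; then
  ACTION="file-issue"
else
  ACTION="none"
fi
echo "growth=$GROWTH action=$ACTION"
rm -f "$prev" "$cur"
```

Comparing raw line counts is the simplest reading of "growth exceeds 5 lines"; a diff-based comparison would also notice entries that were swapped rather than added, which this sketch does not.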
Closed
Summary
The v0.35.0 release dogfood (2026-05-22) surfaced 4 real bugs that the existing 12 gates didn't catch. This PR codifies each as a falsifier so future dogfood runs reject the same regression class without manual hunting.
New gates
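| Gate | Issue | Catches | Contract |
|---|---|---|---|
| G13 Worktree HEAD Sanity | #1862 | `apr --version` stale SHA in worktree | `apr-version-traceability-v1` § FALSIFY-VERSION-004 |
| G14 APR → GGUF Export Round-trip | #1865 | `apr export` panic on missing `num_layers` | `apr-export-num-layers-v1` |
| G15 validate --quality Sanity | #1866 | `apr validate --quality` exits non-zero after `apr qa` reports all gates passed | `apr-validate-quality-threshold-v1` |
| G16 `apr run` Exit Sanity | #1864 (secondary) | `apr run` exits 0 on chat-template gibberish | `apr-cpu-vs-gpu-output-parity-v1` |
| G17 7B Inference Smoke | #1864 | cuBLAS FP8 regression on the canonical 7B Q4_K model | `apr-cpu-vs-gpu-output-parity-v1` |

Pre-Gate methodology note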
Two of the 4 bugs filed in the 2026-05-22 session were briefly mis-flagged because the falsifier piped output to `head` and then read `$?`, which returns `head`'s status (0), not the command's. The new note documents the correct pattern, `OUT=$(...); EC=$?`, and is referenced from each new gate.

See `memory/feedback_test_methodology_can_fake_bugs.md`.

Verdict update
Gates 1-17 now enumerated. FAIL conditions include the new failure classes (panic, exit-code lie, silent gibberish, stale --version, validate false-negative).
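Two of the failure classes listed above, panic and silent gibberish, lend themselves to small checks on the falsifier side. The sketch below is an assumption about how such checks could look, not the skill file's own code: `export_verdict`, `count_im_start`, and `looks_like_gibberish` are hypothetical names, the threshold of 3 markers is a guess at what "repeated" means, and exit codes other than 0, 5, and 101 are not specified by the PR, so they are only flagged for review.

```shell
#!/usr/bin/env bash
# Sketch only: all helper names and the marker threshold are assumptions.

# G14-style verdict from an export exit code, using the codes named in the PR:
# 5 is a clean ValidationFailed (acceptable), 101 is a panic (FAIL).
export_verdict() {
  case "$1" in
    0|5) echo "PASS" ;;
    101) echo "FAIL (panic)" ;;
    *)   echo "REVIEW (exit=$1)" ;;  # not covered by the PR text
  esac
}

# Count literal <|im_start|> markers in a captured output string.
count_im_start() {
  local n
  n=$(printf '%s' "$1" | grep -o -F '<|im_start|>' | wc -l)
  echo $(( n ))  # arithmetic expansion strips any padding from wc
}

# G16-style check: succeed when the output contains 3 or more markers.
looks_like_gibberish() {
  [ "$(count_im_start "$1")" -ge 3 ]
}

GOOD='fn main() { println!("hi"); }'
BAD='<|im_start|><|im_start|><|im_start|>assistant'

for sample in "$GOOD" "$BAD"; do
  if looks_like_gibberish "$sample"; then
    echo "gibberish: exit code should be non-zero"
  else
    echo "output looks plausible"
  fi
done
echo "export exit 101 -> $(export_verdict 101)"
```

In a gate, the gibberish check would be combined with the captured exit code: output that looks like gibberish together with `EC=0` is the silent-success case G16 is meant to reject.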
Test plan
- `grep "^## Gate"` shows 17 gates
- `.claude/skills/` change, no Rust tests affected

🤖 Generated with Claude Code