
chore(dogfood): add Gates 13-17 to catch the 2026-05-22 bug cluster #1872

Merged: noahgift merged 2 commits into main from chore/dogfood-skill-add-gates-13-17 on May 22, 2026

@noahgift
Contributor

Summary

The v0.35.0 release dogfood (2026-05-22) surfaced 4 real bugs that the existing 12 gates didn't catch. This PR codifies each as a falsifier so future dogfood runs reject the same regression class without manual hunting.

New gates

| Gate | Catches | Contract |
| --- | --- | --- |
| G13 Worktree HEAD Sanity | #1862: `apr --version` stale SHA in worktree | apr-version-traceability-v1 § FALSIFY-VERSION-004 |
| G14 APR → GGUF Export Round-trip | #1865: `apr export` panic on missing num_layers | apr-export-num-layers-v1 |
| G15 validate --quality Sanity | #1866: Grade F on every working model | apr-validate-quality-threshold-v1 |
| G16 apr run Exit Sanity | #1864 (secondary): gibberish + exit 0 | apr-cpu-vs-gpu-output-parity-v1 |
| G17 7B Inference Smoke | #1864 (directly): Qwen2.5-7B Q4_K Golden Output | apr-cpu-vs-gpu-output-parity-v1 |

Pre-Gate methodology note

Two of the 4 bugs filed in the 2026-05-22 session were briefly mis-flagged because the falsifier piped output to head and then read $? — which returns head's status (0), not the command's. The new note documents the correct pattern and is referenced from each new gate:

    # WRONG: $? is head's exit, not apr's
    apr publish /nonexistent paiml/test 2>&1 | head -8; echo "exit=$?"

    # RIGHT: captures the command's real exit code
    OUT=$(apr publish /nonexistent paiml/test 2>&1); EC=$?

See [memory/feedback_test_methodology_can_fake_bugs.md].
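
The difference between the two patterns can be reproduced without `apr` at all. The following is a minimal sketch in which `failing_cmd` is a hypothetical stand-in for any command that prints output and exits non-zero; it is not part of the skill itself.

```bash
# failing_cmd is a hypothetical stand-in for a command that fails
# (for example, publishing a path that does not exist).
failing_cmd() {
  echo "error: no such file"
  return 3
}

# WRONG: $? reports the status of head, the last command in the pipeline.
failing_cmd 2>&1 | head -8 >/dev/null
WRONG_EC=$?

# RIGHT: a plain assignment from command substitution keeps the command's status.
OUT=$(failing_cmd 2>&1); RIGHT_EC=$?

echo "wrong=$WRONG_EC right=$RIGHT_EC"
```

With default shell options this prints `wrong=0 right=3`. In bash, `set -o pipefail` or `${PIPESTATUS[0]}` are alternative ways to recover the first stage's status when the pipe itself is needed.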

Verdict update

Gates 1-17 now enumerated. FAIL conditions include the new failure classes (panic, exit-code lie, silent gibberish, stale --version, validate false-negative).

Test plan

  • SKILL.md parses as expected (grep "^## Gate" shows 17 gates)
  • Each new gate references its contract YAML in tree
  • Pre-Gate methodology note is the first thing a runner sees about exit codes
  • CI: this is a .claude/skills/ change, no Rust tests affected
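
The first test-plan item can be turned into a small check. This is a sketch only: `count_gates` is a hypothetical helper, and the path to SKILL.md is passed in because its exact location under `.claude/skills/` is not given here.

```bash
# Prints the number of top-level gate headings in a skill file.
# The pattern '^## Gate' is the one quoted in the test plan.
count_gates() {
  grep -c '^## Gate' "$1"
}

# Hypothetical usage (path is an assumption):
#   N=$(count_gates path/to/SKILL.md); EC=$?
#   [ "$N" -eq 17 ] || echo "expected 17 gates, found $N"
```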

🤖 Generated with Claude Code

The v0.35.0 release dogfood (2026-05-22) surfaced 4 real bugs that the
existing 12 gates didn't catch. This PR codifies each as a falsifier so
future dogfood runs reject the same regression class without manual
hunting.

New gates:

**G13 — Worktree HEAD Sanity** (#1862)
  Verify `apr --version` SHA matches `git rev-parse --short HEAD` after
  rebuild, and that build.rs uses `git rev-parse --git-dir / --git-common-dir`
  instead of a hardcoded `../../.git/HEAD` path (which doesn't exist in a
  worktree layout). Contract: apr-version-traceability-v1 § FALSIFY-VERSION-004.
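
A minimal sketch of the G13 comparison follows. It assumes that `apr --version` embeds the short commit SHA somewhere in its output; the exact version-string format is not specified in this PR, and `version_has_sha` is a hypothetical helper name.

```bash
# Returns 0 when the version string contains the given short SHA,
# 1 when it does not, and 2 when no SHA was supplied.
# ASSUMPTION: the SHA appears verbatim somewhere in the version output.
version_has_sha() {
  version_output=$1
  head_sha=$2
  [ -n "$head_sha" ] || return 2        # refuse to match an empty SHA
  case "$version_output" in
    *"$head_sha"*) return 0 ;;
    *)             return 1 ;;
  esac
}

# Hypothetical usage inside a checkout or worktree (not executed here):
#   HEAD_SHA=$(git rev-parse --short HEAD); EC=$?
#   VERSION=$(apr --version 2>&1); EC=$?
#   version_has_sha "$VERSION" "$HEAD_SHA" || echo "G13 FAIL: stale SHA"
#
# In a linked worktree, `git rev-parse --git-dir` and `--git-common-dir`
# print different paths, and .git is a file rather than a directory,
# so a hardcoded ../../.git/HEAD lookup does not resolve.
```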

**G14 — APR → GGUF Export Round-trip** (#1865)
  Every .apr file in ~/models must export to .gguf without panic. Exit 5
  (clean ValidationFailed) is acceptable; exit 101 (panic) is a FAIL.
  Contract: apr-export-num-layers-v1.
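
The exit-code rule for G14 can be sketched as a small classifier. The handling of 5 and 101 follows the gate text (101 is the exit code of a Rust panic); treating 0 as a pass and routing every other code to a REVIEW bucket are assumptions of this sketch, and `classify_export_exit` is a hypothetical name.

```bash
# Maps an `apr export` exit code to a G14 verdict.
classify_export_exit() {
  case "$1" in
    0)   echo "PASS" ;;      # export succeeded (assumed to be the normal success code)
    5)   echo "PASS" ;;      # clean ValidationFailed is acceptable
    101) echo "FAIL" ;;      # panic
    *)   echo "REVIEW" ;;    # ASSUMPTION: anything else needs a human look
  esac
}

# Hypothetical loop; <export-args> is a placeholder, not confirmed syntax:
#   for f in "$HOME"/models/*.apr; do
#     OUT=$(apr export "$f" <export-args> 2>&1); EC=$?
#     echo "$(classify_export_exit "$EC")  $f  exit=$EC"
#   done
```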

**G15 — validate --quality Sanity** (#1866)
  When `apr qa` says ✓ ALL GATES PASSED, `apr validate --quality` must NOT
  exit non-zero. The threshold gate must not count stubbed
  `Skip(Not implemented)` checks as failures against working models.
  Contract: apr-validate-quality-threshold-v1.
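
G15 is an implication between two exit codes, which can be sketched as below. The sketch assumes that `apr qa` reporting all gates passed corresponds to exit 0; `g15_verdict` is a hypothetical helper, and both arguments are expected to come from the `OUT=$(...); EC=$?` capture pattern.

```bash
# G15: qa passed  =>  validate --quality must exit 0.
g15_verdict() {
  qa_ec=$1
  validate_ec=$2
  if [ "$qa_ec" -eq 0 ] && [ "$validate_ec" -ne 0 ]; then
    echo "FAIL"    # validate rejects a model that qa accepted
  else
    echo "PASS"    # both agree, or qa itself failed (outside G15's scope)
  fi
}
```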

**G16 — `apr run` Exit Code Reflects Output Validity** (#1864 secondary)
  When `apr run` emits chat-template gibberish (repeated `<|im_start|>` etc.),
  exit must be non-zero. Catches the silent-success failure mode where a
  partial GPU fallback produced wrong output but `apr run` exited 0.
  Contract: apr-cpu-vs-gpu-output-parity-v1.
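
One way to approximate the G16 check is to count chat-template markers in the captured output and compare that against the exit code. This is a sketch under stated assumptions: the helper names are hypothetical, only `<|im_start|>` is counted, and the threshold of 2 markers is illustrative rather than taken from the contract.

```bash
# Counts literal occurrences of <|im_start|> in a string.
count_im_start() {
  printf '%s' "$1" | grep -oF '<|im_start|>' | wc -l | tr -d ' '
}

# FAIL when the output looks like template gibberish but the exit code is 0.
# $3 is the maximum number of markers tolerated (ASSUMPTION: default 2).
g16_verdict() {
  output=$1
  exit_code=$2
  max_markers=${3:-2}
  n=$(count_im_start "$output")
  if [ "$n" -gt "$max_markers" ] && [ "$exit_code" -eq 0 ]; then
    echo "FAIL"    # silent success: wrong output, exit 0
  else
    echo "PASS"
  fi
}
```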

**G17 — 7B Inference Smoke** (#1864 directly)
  Re-exercise `apr qa` Golden Output gate on the canonical 7B Q4_K model
  (the README's headline 225 tok/s RTX 4090 configuration). FAILs on the
  cuBLAS FP8 regression that #1864 captured.

Pre-Gate methodology note added:

**Exit-code capture**. Two of the 4 bugs filed in the 2026-05-22 session
were briefly mis-flagged because the falsifier piped output to `head` and
then read `$?` — which returns head's status (0), not the command's. The
note documents the correct `OUT=$(...); EC=$?` pattern and is referenced
from each new gate. See memory/feedback_test_methodology_can_fake_bugs.md.

Verdict section updated: 17 gates, FAIL conditions enumerated by issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit ed645ce into main on May 22, 2026
10 checks passed
noahgift deleted the chore/dogfood-skill-add-gates-13-17 branch on May 22, 2026 08:26
noahgift added a commit that referenced this pull request May 22, 2026
…ly cron (#1875)

Adds an end-to-end "Qwen story" that exercises every core apr command
group against the Qwen scale ladder (0.5B → 1.5B → 7B → 30B-MoE). The
story is the single canonical demo in README.md AND a regression gate
via runnable script + falsification contract + nightly cron.

## Beats

1. **Discover** (Registry) — pull, list
2. **Trust** (QA) — qa, validate, lint
3. **Explore** (Inspection) — inspect, tensors, tree
4. **Adapt** (Model ops) — export, diff, convert/quantize
5. **Use** (Inference) — run, chat, code
6. **Serve** (REST) — serve run + curl /v1/chat/completions OpenAI-compat
7. **Operate** (Profiling) — profile, gpu, serve plan (7B Q4K GGUF)
8. **Scale** (MoE) — inspect, tensors on 30B-MoE qwen3moe

## Pmat bug-hunt layer

When run with `PMAT_HUNT=1` (default), each beat emits a structured
manifest of high-risk untested code in the command-handler modules it
just exercised:

    -- pmat bug-hunt manifest (run chat code) --
        gap   crates/apr-cli/src/commands/run.rs:resolve_model_alias (impact=42.3)
        churn crates/apr-cli/src/commands/code.rs:dispatch_agent (commits=11)
        fault crates/aprender-serve/src/api/cuda_chat_backend.rs:try_qwen3_moe (unwrap)

The nightly cron uploads this manifest as an artifact, compares against
the previous successful run, and opens (or comments on) a tracking issue
when growth exceeds 5 lines — so untested branches in command handlers
can't accumulate quietly.
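
The growth comparison described above might look roughly like this. It is a sketch, not the workflow's actual logic: the helper names are hypothetical, and it assumes growth is measured as the raw line-count difference between the previous and current manifest files (including any header lines).

```bash
# Prints how many lines the current manifest grew relative to the previous
# one (negative when it shrank).
manifest_growth() {
  prev_lines=$(wc -l < "$1" | tr -d ' ')
  curr_lines=$(wc -l < "$2" | tr -d ' ')
  echo $((curr_lines - prev_lines))
}

# Returns 0 when growth exceeds the threshold (5 lines per the description).
manifest_regressed() {
  growth=$(manifest_growth "$1" "$2")
  [ "$growth" -gt "${3:-5}" ]
}

# Hypothetical usage:
#   manifest_regressed previous-manifest.txt current-manifest.txt \
#     && echo "manifest grew by more than 5 lines"
```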

## Files

- `scripts/qwen-story.sh` (336 LOC) — runnable story with proper exit-code
  capture (`OUT=$(cmd); EC=$?` everywhere; no pipe-then-`$?` per memory rule)
- `contracts/qwen-story-v1.yaml` — 3 equations + 8 falsifiers, all PASS
  locally (script exists+executable, 8 beats, run_cmd helper, pmat_hunt
  per beat, README link, daily cron file, bashrs clean, Beat 7 skips
  `apr qa` on 7B Q4K due to #1864)
- `README.md` — new `## A Qwen story` section replacing the flat
  `## CLI examples` block. Fixes two README bugs surfaced during dogfood:
  `apr profile --roofline` (no such flag; just `apr profile <file>`)
  and `apr bench --assert-tps` (flag is on `apr qa`, not `bench`).
- `.github/workflows/qwen-story-daily.yml` — self-hosted GPU runner,
  04:17 UTC cron + workflow_dispatch, uploads pmat manifest + story log
  artifacts, files tracking issue when story regresses or manifest grows.
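
For readers who have not opened the script, a `run_cmd`-style helper built on the `OUT=$(cmd); EC=$?` pattern could be sketched as follows. The body is an assumption for illustration; only the helper's name and the capture pattern come from the description above, and the demo commands are `true` and `false` rather than `apr`.

```bash
PASS=0; FAIL=0

# run_cmd LABEL CMD [ARGS...]
# Runs the command, captures combined output in OUT and the real exit
# code in EC, then updates the PASS/FAIL tally.
run_cmd() {
  label=$1; shift
  OUT=$("$@" 2>&1); EC=$?
  if [ "$EC" -eq 0 ]; then
    PASS=$((PASS + 1)); echo "PASS  $label"
  else
    FAIL=$((FAIL + 1)); echo "FAIL  $label  -  exit=$EC"
  fi
  return 0    # keep going so every beat is exercised
}

run_cmd "demo true"  true
run_cmd "demo false" false
echo "$PASS PASS / $FAIL FAIL"
```

Because the command runs inside a plain assignment rather than a pipeline, `EC` always reflects the command itself, which is what lets the tally be trusted.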

## Verification

    $ bash scripts/qwen-story.sh   # local smoke
    -- Beat 1: Discover (Registry) --
    ✓ PASS  B1 list
    -- Beat 2: Trust (QA gates) --
    ✓ PASS  B2 apr qa
    ✗ FAIL  B2 apr validate --quality  -  exit=5 (after #1866 fix this should be 0)
    -- Beat 3: Explore (Inspection) --
    ✓ PASS  B3 apr inspect --json (arch=qwen2)
    ✓ PASS  B3 apr tensors --json (339 tensors)
    ✓ PASS  B3 apr tree
    -- Beat 4: Adapt (Model ops) --
    ✗ FAIL  B4 apr export  -  PANIC (exit=101)  -  #1865 regression
    -- Beat 5: Use (Inference) --
    ✓ PASS  B5 apr run (Rust code completion)
    ✓ PASS  B5 apr code -p
    -- Beat 6: Serve (REST API) --
    ✓ PASS  B6 apr serve run (port=22915)
    ✓ PASS  B6 /v1/chat/completions (got OK...)
    -- Beat 7: Operate (Profiling) --
    ✓ PASS  B7 apr profile
    ✓ PASS  B7 apr gpu --json
    ✓ PASS  B7 apr serve plan -- 7B VRAM budget
    -- Beat 8: Scale (MoE introspection) --
    ✓ PASS  B8 apr inspect --json (arch=qwen3moe)
    ✓ PASS  B8 apr tensors --json (579 tensors)
    14 PASS / 2 FAIL / 0 SKIP

The 2 FAILs are EXPECTED until the in-flight fixes land:
- B2 validate --quality: closed by #1870
- B4 export panic: closed by #1868

Once those PRs merge, this story will be 16 PASS / 0 FAIL / 0 SKIP on a
host with all 4 Qwen models cached.

## Follow-up

A separate PR will add `/dogfood` Gate 18 that invokes this script (kept
separate to avoid conflict with PR #1872 which is already adding Gates
13-17 to the dogfood skill).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>