Friction (observed across the last few dogfooding-loop iterations)
Main CI runs sit in the queued state for hours and never give a timely signal. The root cause is runner capacity: Mutation Testing (rivet-core) is a 16-shard matrix (0/16…15/16), each shard with a 45-minute budget, running on the self-hosted lean-mem pool on every push. That is 16 × 45 min = 720 min, or up to ~12 runner-hours per push. This starves the gating jobs (Test, Clippy, Docs, Proptest, Playwright): they queue behind the matrix and don't conclude promptly. Concretely, the run for commit cab7f43 was still queued more than 1h after push, so I couldn't pull a job log to diagnose a (separate) proptest failure.
Notably, rivet-core mutation testing is already continue-on-error: true (ci.yml: "rivet-core is NOT a hard gate"). So the job consuming the most runner capacity is non-gating — it blocks nothing yet delays everything.
Suggested fix (maintainer's call on cadence)
Move the non-gating rivet-core mutation matrix off the every-push path:
- run on a schedule (nightly) and/or push-to-main-only, not PR pushes; or
- workflow_dispatch (manual) + nightly; or
- shrink the shard count so fewer concurrent runners are held.
Keep the hard-gate rivet-cli mutation job (single shard, 0 surviving mutants) per-push. This frees the pool so gating checks return promptly. (The Playwright/Test conclusiveness work in #436/#494 only helps if the runs actually run.)
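A minimal sketch of the nightly + manual option, assuming the rivet-core matrix is split out of ci.yml into its own workflow file. The file name, job id, cron time, and runner labels are hypothetical placeholders, not taken from the repo; the step body stands in for whatever the existing job already runs. The hard-gate rivet-cli job would stay in ci.yml untouched.

```yaml
# .github/workflows/mutation-core.yml  (hypothetical file name)
name: Mutation Testing (rivet-core)

on:
  schedule:
    - cron: "0 3 * * *"      # nightly; 03:00 UTC is an arbitrary placeholder
  workflow_dispatch: {}       # allow manual runs from the Actions tab
  # Optional variant: also run on pushes to main, but never on PR pushes.
  # push:
  #   branches: [main]

jobs:
  mutants-core:               # hypothetical job id
    runs-on: [self-hosted, lean-mem]   # assumed labels for the lean-mem pool
    continue-on-error: true   # unchanged: rivet-core is NOT a hard gate
    timeout-minutes: 45       # same per-shard budget as today
    strategy:
      fail-fast: false
      matrix:
        shard: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    steps:
      - uses: actions/checkout@v4
      # Placeholder: move the existing rivet-core mutation steps here as-is.
      - run: echo "run existing rivet-core mutation steps for shard ${{ matrix.shard }}/16"
```

Scheduled workflows only run from the default branch, so this would take effect once merged to main. If a separate file is not wanted, the same effect could probably be had by adding the schedule and workflow_dispatch triggers to ci.yml and guarding the matrix job with an `if:` on `github.event_name`, at the cost of the gating jobs also running nightly.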
Throughput/capacity issue, not correctness. Found in the hourly dogfooding loop.