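Current outage (the trigger)
As of 2026-06-06 ~09:15Z, GET /repos/pulseengine/rivet/actions/runners reports total_count: 0 (zero self-hosted runners registered/online), and 46 workflow runs are queued with 0 in_progress. Several are main push-runs:
2871c97 (#507 "feat(list): --full emits description/tags/fields in JSON for bulk queries (REQ-211, #506)", merged): CI queued, never ran
ecb073a (#505 "fix(init): pre-commit hook runs cargo fmt --check on staged Rust (REQ-210, #438)", merged): CI queued, never ran
2750e9f (#504 "ci: move rivet-core mutation matrix to nightly so it stops saturating runners (#498)", merged): CI queued, never ran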
The #505 Test gate passed ~2h earlier, so the pool went offline recently (not a clean weekend-off pattern). Net effect: the last several main merges have zero CI verification, and nobody is alerted.
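The two numbers above can be reproduced from the REST API responses. Here is a minimal sketch in Python, assuming the documented response shapes (`runners[].status` for the runners endpoint; `workflow_runs[].status`, `event`, `head_branch`, `head_sha` for the runs endpoint). The helper names and the sample payloads are hypothetical and not taken from this repo:

```python
from typing import Any


def online_runner_count(runners_payload: dict[str, Any]) -> int:
    """Count runners reported as "online" in a GET .../actions/runners response."""
    return sum(
        1 for runner in runners_payload.get("runners", [])
        if runner.get("status") == "online"
    )


def stuck_pushes(runs_payload: dict[str, Any], branch: str = "main") -> list[str]:
    """Short SHAs of queued push runs on `branch` in a GET .../actions/runs response."""
    return [
        run["head_sha"][:7]
        for run in runs_payload.get("workflow_runs", [])
        if run.get("status") == "queued"
        and run.get("event") == "push"
        and run.get("head_branch") == branch
    ]


# Hypothetical payloads, trimmed to the fields used above.
runners = {"total_count": 0, "runners": []}
runs = {
    "total_count": 2,
    "workflow_runs": [
        {"status": "queued", "event": "push",
         "head_branch": "main", "head_sha": "abc1234" + "0" * 33},
        {"status": "queued", "event": "pull_request",
         "head_branch": "feature", "head_sha": "f" * 40},
    ],
}

print(online_runner_count(runners), stuck_pushes(runs))
```

Only the push run on main is reported; the pull-request run is queued too, but it is not one of the unverified main merges.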
The durable problem (why this is an issue, not just a blip)
Every gating job is runs-on: [self-hosted, linux, x64, …]. The self-hosted pool is therefore a single point of failure with two compounding gaps:
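No fallback. When the pool is down, all gates, even the ~1s ones (fmt, yaml-lint), silently queue indefinitely. There is no path to even basic verification.
A run that sits queued for hours looks identical to a slow run. Combined with #436 ("Playwright E2E is a silently-broken gate: non-required + main runs cancelled = never conclusively green"; gates that are red/inconclusive go unnoticed), a maintainer can admin-merge, or an automated loop can, believing "CI will catch it," when CI is not running at all.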
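To see how much of the pipeline depends on the pool, one could scan the workflow files for `runs-on:` values that name `self-hosted`. This is a rough sketch under stated assumptions: it handles only the single-line form of `runs-on:` (a multi-line YAML list would be missed), the function name is hypothetical, and the sample workflow text is invented for illustration rather than copied from this repo:

```python
import re

# Matches a single-line "runs-on: <value>" entry; multi-line lists are not handled.
RUNS_ON = re.compile(r"^\s*runs-on:\s*(?P<value>.+?)\s*$")


def self_hosted_lines(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line_number, value) for each runs-on entry that names self-hosted."""
    hits = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        match = RUNS_ON.match(line)
        if match and "self-hosted" in match.group("value"):
            hits.append((number, match.group("value")))
    return hits


# Invented sample, not the repo's actual workflow.
sample = """\
jobs:
  fmt:
    runs-on: [self-hosted, linux, x64]
  docs:
    runs-on: ubuntu-latest
"""

print(self_hosted_lines(sample))
```

Every job this reports would queue indefinitely during an outage like the current one.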
Suggested fixes (pick per cost/policy — these are choices for the maintainer)
1. Liveness alert (cheap, high-value). A tiny scheduled job on ubuntu-latest (GitHub-hosted, so it runs even when self-hosted is down) that calls the runners API and the queued-runs API and fails loudly (issue comment / notification) if online runners == 0 or any run has been queued > N minutes. This is the missing smoke alarm.
2. Route the fast core gates to GitHub-hosted. Move fmt, yaml-lint, validate, and maybe clippy to ubuntu-latest so a self-hosted outage still leaves the cheap correctness gates working. This keeps the heavy jobs (mutation, Kani, Playwright) on self-hosted. Trade-off: GitHub-hosted minutes cost.
3. Document the operational runbook: how to confirm a runner outage (the two API calls above) and bring the pool back, so a stalled queue is diagnosable in seconds.
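The liveness rule in the first fix (no online runners, or any run queued longer than N minutes) could be expressed roughly as follows. This is a sketch, not a finished job. It assumes the documented response fields `runners[].status`, `workflow_runs[].id` and `workflow_runs[].created_at`, and it uses `created_at` as a proxy for when a run entered the queue. The function names, the 30-minute default, and the sample payloads are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def parse_github_time(stamp: str) -> datetime:
    """Parse a GitHub timestamp such as "2026-06-06T09:15:00Z" as an aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def liveness_problems(
    runners_payload: dict[str, Any],
    queued_payload: dict[str, Any],
    now: datetime,
    max_queue_minutes: int = 30,  # hypothetical threshold (the "N" in the text)
) -> list[str]:
    """Return human-readable problems; an empty list means the pool looks healthy."""
    problems = []
    online = [r for r in runners_payload.get("runners", []) if r.get("status") == "online"]
    if not online:
        problems.append("no self-hosted runners online")
    limit = timedelta(minutes=max_queue_minutes)
    for run in queued_payload.get("workflow_runs", []):
        waited = now - parse_github_time(run["created_at"])
        if waited > limit:
            minutes = int(waited.total_seconds() // 60)
            problems.append(f"run {run['id']} queued for {minutes} min")
    return problems


# Hypothetical payloads, trimmed to the fields used above.
now = datetime(2026, 6, 6, 9, 15, tzinfo=timezone.utc)
runners = {"total_count": 0, "runners": []}
queued = {"workflow_runs": [
    {"id": 1, "created_at": "2026-06-06T07:15:00Z"},
    {"id": 2, "created_at": "2026-06-06T09:10:00Z"},
]}
problems = liveness_problems(runners, queued, now)
print(problems)
```

A scheduled job would fetch the two payloads, call `liveness_problems`, and exit non-zero (or post a comment) when the list is non-empty. Note that the runners endpoint is documented as requiring admin access to the repository, so such a job would likely need a dedicated token rather than the default workflow token.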
I hit this directly in the hourly dogfooding loop: I could not get CI to verify #505/#507, and had to rely entirely on local cargo test --test cli_commands + clippy --all-targets + fmt --check. That local battery is a decent stand-in, but it isn't the gate, and it shouldn't be the only thing standing between an agent's merge and main.
Related: #436 (silently-broken/inconclusive gates). cc maintainer — the immediate operational half (bring runners online) is outside this repo; the resilience half (1–3 above) is in it.