CI has a single point of failure: when the self-hosted runner pool goes offline, every gate queues forever with no fallback and no liveness alert #509

@avrabe

Current outage (the trigger)

As of 2026-06-06 ~09:15Z, GET /repos/pulseengine/rivet/actions/runners reports total_count: 0 (zero self-hosted runners registered/online), and 46 workflow runs are queued with 0 in_progress. Several of the queued runs are push runs on main.

The #505 Test gate passed ~2h earlier, so the pool went offline recently (not a clean weekend-off pattern). Net effect: the last several main merges have zero CI verification, and nobody is alerted.

The durable problem (why this is an issue, not just a blip)

Every gating job is runs-on: [self-hosted, linux, x64, …]. The self-hosted pool is therefore a single point of failure with two compounding gaps:

  1. No fallback. When the pool is down, all gates — even the ~1s ones (fmt, yaml-lint) — silently queue indefinitely. There is no path to even basic verification.
  2. No liveness signal. A run stuck queued for hours looks identical to a slow run. Combined with #436 ("Playwright E2E is a silently-broken gate: non-required + main runs cancelled = never conclusively green", where gates that are red/inconclusive go unnoticed), a maintainer or an automated loop can admin-merge believing "CI will catch it," when CI is not running at all.

Suggested fixes (pick per cost/policy — these are choices for the maintainer)

  1. Liveness alert (cheap, high-value). A tiny scheduled job on ubuntu-latest (GitHub-hosted, so it runs even when self-hosted is down) that calls the runners API and the queued-runs API and fails loudly (issue comment / notification) if online runners == 0 or any run has been queued > N minutes. This is the missing smoke alarm.
  2. Route the fast core gates to GitHub-hosted. Move fmt, yaml-lint, validate, and maybe clippy to ubuntu-latest so a self-hosted outage still leaves the cheap correctness gates working. Keeps the heavy jobs (mutation, Kani, Playwright) on self-hosted. Trade-off: GitHub-hosted minutes cost.
  3. Document the operational runbook: how to confirm a runner outage (the two API calls above) and bring the pool back, so a stalled queue is diagnosable in seconds.
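
For fix 1 (and the two API calls in the runbook of fix 3), the check could look roughly like the sketch below. It is a minimal illustration, not a tested implementation for this repo. The 30-minute threshold stands in for "N minutes", the secret name RUNNER_CHECK_TOKEN is hypothetical, and using created_at as the queue start time is an approximation. The sketch also assumes the token is allowed to list runners; that endpoint requires admin-level access to the repository, so the default workflow token will likely not be enough.

```python
import json
import os
import urllib.request
from datetime import datetime, timedelta, timezone

API = "https://api.github.com"


def parse_ts(ts: str) -> datetime:
    """Parse a GitHub timestamp such as '2026-06-06T09:15:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def liveness_problems(runners: dict, queued: dict, now: datetime,
                      max_queue: timedelta = timedelta(minutes=30)) -> list[str]:
    """Return human-readable problems; an empty list means the pool looks alive.

    `runners` is the JSON body of GET /repos/{repo}/actions/runners and
    `queued` is the body of GET /repos/{repo}/actions/runs?status=queued.
    The 30-minute default is an assumed value for "N minutes".
    """
    problems = []
    online = [r for r in runners.get("runners", []) if r.get("status") == "online"]
    if not online:
        problems.append("no self-hosted runners online")
    for run in queued.get("workflow_runs", []):
        # created_at approximates when the run entered the queue.
        waited = now - parse_ts(run["created_at"])
        if waited > max_queue:
            minutes = int(waited.total_seconds() // 60)
            problems.append(f"run {run['id']} queued for {minutes} min")
    return problems


def fetch(path: str, token: str) -> dict:
    """GET a GitHub REST API path and decode the JSON response."""
    req = urllib.request.Request(API + path, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def main() -> int:
    """Exit code 1 (job fails loudly) if the pool looks dead, else 0."""
    repo = os.environ["GITHUB_REPOSITORY"]    # set by GitHub Actions
    token = os.environ["RUNNER_CHECK_TOKEN"]  # hypothetical secret name
    runners = fetch(f"/repos/{repo}/actions/runners", token)
    queued = fetch(f"/repos/{repo}/actions/runs?status=queued&per_page=100", token)
    problems = liveness_problems(runners, queued, datetime.now(timezone.utc))
    for problem in problems:
        print(f"::error::{problem}")  # surfaces as an error annotation on the run
    return 1 if problems else 0
```

A scheduled workflow on ubuntu-latest would call sys.exit(main()) from a small wrapper; the failing job is then the alert. Keeping the decision logic in liveness_problems, separate from the HTTP calls, means it can be unit-tested against canned API responses without a token.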

I hit this directly in the hourly dogfooding loop: I could not get CI to verify #505/#507, and had to rely entirely on local cargo test --test cli_commands + clippy --all-targets + fmt --check. That local battery is a decent stand-in, but it isn't the gate, and it shouldn't be the only thing standing between an agent's merge and main.

Related: #436 (silently-broken/inconclusive gates). cc maintainer — the immediate operational half (bring runners online) is outside this repo; the resilience half (1–3 above) is in it.
