CI has a single point of failure: when the self-hosted runner pool goes offline, every gate queues forever with no fallback and no liveness alert

## Current outage (the trigger)

As of 2026-06-06 ~09:15Z, **`GET /repos/pulseengine/rivet/actions/runners` reports `total_count: 0`** — zero self-hosted runners registered/online — and **46 workflow runs are `queued` with 0 `in_progress`**. Several are `main` push-runs:

- `2871c97` (#507, merged) — CI queued, never ran
- `ecb073a` (#505, merged) — CI queued, never ran
- `2750e9f` (#504, merged) — CI queued, never ran

The `#505` Test gate passed ~2h earlier, so the pool went offline recently (not a clean weekend-off pattern). Net effect: **the last several `main` merges have zero CI verification**, and nobody is alerted.

## The durable problem (why this is an issue, not just a blip)

Every gating job is `runs-on: [self-hosted, linux, x64, …]`. The self-hosted pool is therefore a **single point of failure** with two compounding gaps:

1. **No fallback.** When the pool is down, *all* gates — even the ~1s ones (`fmt`, `yaml-lint`) — silently queue indefinitely. There is no path to even basic verification.
2. **No liveness signal.** A run stuck `queued` for hours looks identical to a slow run. Combined with #436 (gates that are red/inconclusive go unnoticed), a maintainer or an automated loop can admin-merge believing "CI will catch it" when CI is not running at all.

## Suggested fixes (pick per cost/policy — these are choices for the maintainer)

1. **Liveness alert (cheap, high-value).** A tiny scheduled job on `ubuntu-latest` (GitHub-hosted, so it runs even when self-hosted is down) that calls the runners API and the queued-runs API and **fails loudly** (issue comment / notification) if `online runners == 0` or any run has been `queued > N` minutes. This is the missing smoke alarm.
2. **Route the fast core gates to GitHub-hosted.** Move `fmt`, `yaml-lint`, `validate`, and maybe `clippy` to `ubuntu-latest` so that a self-hosted outage still leaves the cheap correctness gates working, and keep the heavy jobs (mutation, Kani, Playwright) on self-hosted. Trade-off: this consumes GitHub-hosted minutes.
3. **Document the operational runbook**: how to confirm a runner outage (the two API calls above) and bring the pool back, so a stalled queue is diagnosable in seconds.
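
For fix 1 (and the runbook check in fix 3), here is a minimal sketch of what the liveness check could look like, not a finished implementation. It assumes the two REST endpoints mentioned above return their usual shapes (`runners[].status` and `workflow_runs[].created_at`), and it treats `created_at` as a proxy for "time spent queued". The names `find_problems`, `fetch`, `main`, the 30-minute threshold, and the `RUNNER_CHECK_TOKEN` variable are all hypothetical. As far as I know, the runners endpoint needs a token with admin-level read access to the repo, so the default workflow token probably won't be enough.

```python
import json
import os
import urllib.request
from datetime import datetime, timedelta, timezone

API = "https://api.github.com/repos/pulseengine/rivet/actions"


def find_problems(runners, queued_runs, now, max_queued=timedelta(minutes=30)):
    """Pure decision logic: return a list of problems (empty list means healthy).

    `runners` and `queued_runs` are the decoded JSON bodies of the two API calls.
    The 30-minute default is an illustrative threshold, not a recommendation.
    """
    problems = []

    # A registered-but-offline runner still appears in the list, so check status.
    online = [r for r in runners.get("runners", []) if r.get("status") == "online"]
    if not online:
        total = runners.get("total_count", 0)
        problems.append(f"no online self-hosted runners (total_count={total})")

    # Flag any run that has been sitting in the queue longer than the threshold.
    for run in queued_runs.get("workflow_runs", []):
        created = datetime.fromisoformat(run["created_at"].replace("Z", "+00:00"))
        age = now - created
        if age > max_queued:
            sha = run.get("head_sha", "?")[:7]
            minutes = int(age.total_seconds() // 60)
            problems.append(f"run {run['id']} on {sha} queued for {minutes} min")

    return problems


def fetch(path, token):
    """GET one endpoint under the repo's Actions API and decode the JSON body."""
    req = urllib.request.Request(
        f"{API}/{path}",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def main():
    """Entry point for a scheduled job: a non-zero return marks the job failed."""
    token = os.environ["RUNNER_CHECK_TOKEN"]  # hypothetical secret name
    problems = find_problems(
        fetch("runners", token),
        fetch("runs?status=queued&per_page=100", token),
        datetime.now(timezone.utc),
    )
    for problem in problems:
        print(f"CI liveness: {problem}")
    return 1 if problems else 0
```

A scheduled workflow on `ubuntu-latest` could run `main()` and exit with its return value, so the job goes red whenever the pool is empty or the queue is stalled. Keeping `find_problems` free of network calls means the thresholds can be unit-tested without touching the API.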

I hit this directly in the hourly dogfooding loop: I could not get CI to verify #505/#507, and had to rely entirely on local `cargo test --test cli_commands` + `clippy --all-targets` + `fmt --check`. That local battery is a decent stand-in, but it isn't the gate, and it shouldn't be the only thing standing between an agent's merge and `main`.

Related: #436 (silently-broken/inconclusive gates). cc maintainer — the immediate operational half (bring runners online) is outside this repo; the resilience half (1–3 above) is in it.
