Non-gating 16-shard rivet-core mutation matrix runs on every push — saturates self-hosted runners, CI queues for hours #498

@avrabe

Friction (observed across the last few dogfooding-loop iterations)

Main CI runs sit in queued for hours and never give a timely signal. The root cause is runner capacity: Mutation Testing (rivet-core) is a 16-shard matrix (shards 0/16 through 15/16), each shard with a 45-minute budget, running on the self-hosted lean-mem pool on every push. That is up to 16 × 45 min = 12 runner-hours per push. The matrix starves the gating jobs (Test, Clippy, Docs, Proptest, Playwright): they queue behind it and don't conclude promptly. Concretely, the run for commit cab7f43 was still queued more than an hour after the push, so I couldn't pull a job log to diagnose a (separate) proptest failure.

Notably, rivet-core mutation testing is already continue-on-error: true (ci.yml: "rivet-core is NOT a hard gate"). So the job consuming the most runner capacity is non-gating — it blocks nothing yet delays everything.

Suggested fix (maintainer's call on cadence)

Move the non-gating rivet-core mutation matrix off the every-push path:

  • run on a schedule (nightly) and/or push-to-main-only, not PR pushes; or
  • workflow_dispatch (manual) + nightly; or
  • shrink the shard count so fewer concurrent runners are held.

Keep the hard-gate rivet-cli mutation job (single shard, 0 surviving mutants) per-push. This frees the pool so gating checks return promptly. (The Playwright/Test conclusiveness work in #436/#494 only helps if the runs actually run.)
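
As a rough illustration of the "nightly + manual + push-to-main" option, the matrix could move into its own workflow along these lines. This is only a sketch under assumptions: the file name, the job id, the cron time, the runner labels, and the placeholder step are all hypothetical and would need to be replaced with whatever ci.yml uses today.

```yaml
# Hypothetical file: .github/workflows/mutants-core-nightly.yml
name: Mutation Testing (rivet-core, scheduled)

on:
  schedule:
    - cron: "0 3 * * *"      # assumed time: 03:00 UTC, once a day
  workflow_dispatch: {}       # allows a manual run from the Actions tab
  push:
    branches: [main]          # optional: drop this to run nightly/manual only

jobs:
  mutants-core:               # hypothetical job id
    runs-on: [self-hosted, lean-mem]   # assumed labels for the lean-mem pool
    continue-on-error: true   # stays non-gating, as in the current ci.yml
    timeout-minutes: 45       # same per-shard budget as today
    strategy:
      fail-fast: false
      matrix:
        # 16 shards, as today; trimming this list is the "shrink" option
        shard: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    steps:
      - uses: actions/checkout@v4
      # Placeholder only: reuse the existing mutation-testing step from ci.yml
      - name: Run rivet-core mutation shard
        run: echo "run existing rivet-core mutation command for shard ${{ matrix.shard }}/16"
```

With the matrix in a separate workflow, PR pushes would no longer enqueue the 16 shards, while the single-shard rivet-cli job stays in ci.yml and keeps running per push.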

Throughput/capacity issue, not correctness. Found in the hourly dogfooding loop.
