CI: adopt GitHub Merge Queue + tiered CI #770

@danielmeppiel

Summary

Adopt GitHub's native Merge Queue and split CI into two tiers, so that the heavy integration suite runs once at merge time instead of on every PR push, and PRs no longer have to be brought up to date with main by hand.

Problem

Two bottlenecks slow down contributors and reviewers today:

  1. Manual branch updates. Whenever main moves, every open PR must be rebased or updated by hand with the "Update branch" button. With multiple PRs in flight, this serializes contributors and adds round trips that have nothing to do with the change under review.
  2. Per-push integration approval. The integration + release-validation suite runs on every PR push and is gated by an environment with required reviewers. Approval must be granted again for every new push, including WIP commits that will later be rewritten. The approver pool is small relative to the contributor base, so this does not scale.

The result: long, repetitive feedback loops, wasted CI minutes on commits that will be force-pushed away, and a single point of contention for getting anything merged.

Proposed change

Split CI into two tiers, gated by GitHub's native Merge Queue:

  • Tier 1 (every PR push) — unit tests + binary build. Fast, no CI secrets, fork-safe. Stays in ci.yml. Provides quick correctness feedback to contributors.
  • Tier 2 (merge queue only) — integration tests + release validation. Runs against the tentative merge commit on the gh-readonly-queue/main/* ref that the queue creates. Triggered by the merge_group event.
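
A minimal sketch of how the two tiers could map onto workflow triggers. The file names ci.yml and ci-integration.yml are the ones this issue uses; the workflow names, job names, `make` targets, runner label, and secret name are placeholders for illustration, not the repository's actual configuration.

```yaml
# .github/workflows/ci.yml (Tier 1) - sketch only
name: CI
on:
  pull_request:    # every PR push, including forks
  merge_group:     # also run on the queue's tentative merge commit
permissions:
  contents: read   # no write scopes and no secrets referenced anywhere below
jobs:
  unit-and-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test    # hypothetical unit-test entry point
      - run: make build   # hypothetical binary build
```

```yaml
# .github/workflows/ci-integration.yml (Tier 2) - sketch only
name: Integration
on:
  merge_group:     # fires only when a PR enters the merge queue
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      # On merge_group the default checkout ref is the tentative merge commit.
      - uses: actions/checkout@v4
      - run: make integration-test    # hypothetical target
        env:
          API_TOKEN: ${{ secrets.API_TOKEN }}   # hypothetical secret name
  release-validation:
    needs: integration
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make release-validate    # hypothetical target
```

Keeping the secret-consuming jobs in a file that has no pull_request trigger at all makes the fork-safety property easy to audit by reading the `on:` block alone.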

When a reviewer adds a PR to the queue, GitHub:

  1. Builds a tentative merge of the PR onto the latest main (no manual "Update branch" needed).
  2. Runs Tier 2 against that ref.
  3. Auto-merges if checks pass; ejects the PR if they fail.

Multiple PRs can be batched into one Tier 2 run when traffic is high.

Before

flowchart LR
    A[PR push] --> B[ci.yml<br/>unit + build]
    B -->|workflow_run| C{environment<br/>approval}
    C -->|approved| D[smoke]
    D --> E[integration]
    E --> F[release-validation]
    F --> G[commit status<br/>back to PR]
    H[main moves] -.-> I[manual<br/>Update branch] -.-> A
    C -.->|every push<br/>needs re-approval| C

After

flowchart LR
    A[PR push] --> B[ci.yml Tier 1<br/>unit + build<br/>no secrets]
    B --> R[review]
    R --> Q[reviewer adds<br/>to merge queue]
    Q --> M[merge_group event<br/>tentative merge ref]
    M --> T[Tier 2<br/>integration +<br/>release-validation]
    T -->|pass| X[auto-merge to main]
    T -->|fail| Y[eject from queue]

Trust model

The trust boundary moves to write access. PRs from contributors without write access run only Tier 1 and never touch CI secrets, so fork PRs remain safe. Adding a PR to the queue requires write access; when the queue runs Tier 2, the merged code executes with secrets. This is the standard GitHub trust model: write access implies trust to run code with secrets, and access grants are managed in repo settings.

Because Tier 2 only runs in merge_group context (not pull_request), there is no longer a need for the workflow_run indirection or a per-push environment approval.
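
As a hedged illustration of that boundary, a Tier 2 job could carry an explicit event guard and minimal token permissions, so that secrets stay unreachable from PR pushes even if someone later adds another trigger to the workflow. The job name, `make` target, and secret name below are assumptions.

```yaml
# Fragment of a Tier 2 workflow - sketch only
jobs:
  integration:
    # Defensive guard: the job is skipped for any event other than merge_group,
    # regardless of which triggers the workflow's `on:` block lists.
    if: github.event_name == 'merge_group'
    runs-on: ubuntu-latest
    permissions:
      contents: read    # least privilege for the workflow token
    steps:
      - uses: actions/checkout@v4
      - name: Run integration suite
        run: make integration-test    # hypothetical target
        env:
          INTEGRATION_TOKEN: ${{ secrets.INTEGRATION_TOKEN }}   # hypothetical secret
```

The guard is redundant when `on:` lists only merge_group, but it documents the intent next to the step that consumes the secret.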

Benefits

  • Removes the manual "Update branch" loop entirely.
  • The integration suite runs once per merged PR (or once per batched group) instead of once per WIP push, which cuts the CI minutes currently spent on commits that are later force-pushed away.
  • No single approver in the critical path for every PR.
  • Native GitHub feature: no third-party bots, no extra services.
  • Batching support scales throughput when activity is high.

Rollout plan

  1. Additive merge_group trigger — add merge_group to ci.yml so Tier 1 runs in both contexts. Land alongside the existing workflow_run path so nothing changes for PR authors yet.
  2. Rewrite ci-integration.yml to trigger on merge_group instead of workflow_run. Drop the environment approval, the separate smoke job, and the commit-status shim. Keep the legacy workflow_run path running in parallel for a one-week shadow period to compare results.
  3. Verify check names in a real merge_group run and capture them verbatim.
  4. Enable merge queue on main in branch protection settings, with the verified check names as required. Disable "require branches to be up to date" (the queue handles it). Enable repo-wide auto-merge.
  5. Retire the legacy approval environment and remove the old workflow_run path from ci-integration.yml.
  6. Update docs (CONTRIBUTING.md, CI internals docs) and add a CHANGELOG.md entry.

Each step ships as its own PR so it can be reverted independently.
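
One possible shape for ci-integration.yml during the shadow week in step 2, with both paths living in the same file and selected by event name. This is a sketch under assumptions: the workflow display name "CI", the environment name, the job names, and the placeholder steps are invented for illustration.

```yaml
# .github/workflows/ci-integration.yml during the shadow week - sketch only
name: Integration
on:
  workflow_run:              # legacy path, removed in step 5
    workflows: ["CI"]        # assumed display name of ci.yml
    types: [completed]
  merge_group:               # new path
jobs:
  legacy-integration:
    if: github.event_name == 'workflow_run' && github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    environment: integration-approval    # hypothetical name of the approval environment
    steps:
      - run: echo "existing legacy steps stay unchanged here"
  queue-integration:
    if: github.event_name == 'merge_group'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make integration-test       # hypothetical target
```

With this layout, step 5 reduces to deleting the workflow_run trigger and the legacy job, which keeps that PR small and easy to revert.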

Out of scope

  • Third-party merge bots (Mergify, Kodiak, bors) — the native GitHub feature covers this.
  • Required-review policy changes — orthogonal to CI plumbing.
  • Test parallelization, sharding, or runner type changes — separate optimization.
  • Live-inference test scheduling (ci-runtime.yml) — already isolated and stays as-is.

Risks and mitigations

  • Required check names must match exactly in merge_group context. Mitigated by verifying them in a real run before flipping branch protection.
  • A cutover bug at merge time would block all merges. Mitigated by the additive shadow week before the legacy path is removed.
  • A flaky test could stall the queue. Mitigated by the fact that the queue can be disabled in one click from settings, without touching workflow YAML.
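
For the first risk, giving each job an explicit `name:` is one way to keep the reported check names stable. GitHub reports a check under the job's `name`, falls back to the job key when no name is set, and appends matrix values to matrix job names. The job names below are placeholders.

```yaml
# Fragment of a Tier 2 workflow - sketch only
jobs:
  integration:
    name: integration            # this exact string is what branch protection must list
    runs-on: ubuntu-latest
    steps:
      - run: echo "integration steps go here"
  release-validation:
    name: release-validation     # renaming this later would orphan the required check
    needs: integration
    runs-on: ubuntu-latest
    steps:
      - run: echo "release validation steps go here"
```

Whether a check that only ever reports on merge_group can be marked as required without blocking PRs from entering the queue is worth confirming in the same real run.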

Tracking

This issue is the design doc. Implementation PRs will reference it. Discussion welcome — especially from contributors who have hit either bottleneck.
