chore: speed up CI (cancel superseded runs + pre-built runner image) #7

Merged
timzsu merged 7 commits into main from zsu/ci-concurrency on Apr 30, 2026

@timzsu
Collaborator

@timzsu timzsu commented Apr 30, 2026

Purpose

Speed up CI on the self-hosted runners.

  1. Cancel superseded runs. When a new commit lands on a branch with an in-flight CI run, the older run cancels instead of competing for runner capacity.
  2. Pre-built runner image with the uv wheel cache baked in. A custom image based on myoung34/github-runner ships with /root/.cache/uv already populated for the project's --all-extras install. CI workflows that run uv sync --all-extras --frozen warm-hit this cache instead of redownloading every wheel. Published as flowmesh_ci_runner:dev in this repo's container registry.

Changes

  • All six existing workflows — adds:

    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true

    Keying the group on ${{ github.ref }} scopes cancellation to a single branch or PR ref, so two open PRs don't cancel each other.

  • Five workflows that use astral-sh/setup-uv — pin version: "0.11.8". Must match the runner image's UV_VERSION; uv's cache layout changes across minor versions, so a mismatch leaves the image-baked cache unused.

  • .github/runners/Dockerfile — new. Bases on myoung34/github-runner:latest, installs pinned UV_VERSION (default 0.11.8), COPYs the build context into /tmp/flowmesh, runs uv sync --all-extras --frozen to populate /root/.cache/uv, then deletes the source tree. Only the populated cache survives in the image.

  • .dockerignore — drops .python-version from the venv-exclusions block. It is a tool-config file, not venv state, and stripping it from the build context made uv fall back to the latest CPython on the build host. Also adds .claude/ so local agent state never bakes into any image build.
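
The concurrency block in the first bullet can be read as "at most one in-flight run per group". A minimal Python sketch of that behaviour follows. The `Run` and `Scheduler` classes, the second PR ref, and the commit labels are hypothetical and only illustrate how the `${{ github.workflow }}-${{ github.ref }}` key scopes cancellation. It is a toy model, not a description of how GitHub schedules jobs internally.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    workflow: str
    ref: str
    commit: str  # hypothetical label, only used to tell runs apart
    status: str = "in_progress"

    @property
    def group(self) -> str:
        # Mirrors the expression ${{ github.workflow }}-${{ github.ref }}.
        return f"{self.workflow}-{self.ref}"


@dataclass
class Scheduler:
    """Toy model of cancel-in-progress: one active run per group."""

    active: dict = field(default_factory=dict)

    def start(self, run: Run) -> Run:
        previous = self.active.get(run.group)
        if previous is not None and previous.status == "in_progress":
            # A newer run in the same group supersedes the older one.
            previous.status = "cancelled"
        self.active[run.group] = run
        return run


scheduler = Scheduler()
first_push = scheduler.start(Run("unit-tests", "refs/pull/7/merge", "commit-1"))
other_pr = scheduler.start(Run("unit-tests", "refs/pull/8/merge", "commit-x"))
second_push = scheduler.start(Run("unit-tests", "refs/pull/7/merge", "commit-2"))

print(first_push.status, other_pr.status, second_push.status)
```

Only the superseded run on the same ref is cancelled; the run on the other PR ref keeps its runner slot because its group key differs.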

Design

Why a pre-baked image rather than setup-uv's GHA cache. GHA cache is the wrong backend for --all-extras:

  • uv cache prune --ci (run by setup-uv before save) strips the cache from ~10 GB to ~2 MB. Subsequent runs hit a near-empty cache and redownload everything.
  • Disabling prune leaves a ~10 GB cache that hits the 10 GB per-repo GHA cache cap and evicts other caches in LRU order.
  • A shared host-mounted writable cache would be a poisoning vector — PR builds (including from external contributors) writing into a cache later read by main builds.

A pre-baked image side-steps all three: the cache lives in the image's read-only layers and is managed externally (rebuilt on a schedule), so there is no per-run cache management, no concurrent-write race, and no contention for GHA cache space. A stale cache is never unsafe, only slower until the next rebuild: uv falls through to the network for any wheel not in the image.

Why pin UV_VERSION. uv changes its cache layout across minor versions. If the image bakes uv 0.11.8 but CI's setup-uv resolves uv 0.12.x, the cache may not be readable and CI falls through to a cold install. Pinning both sides (image build arg + workflow version:) keeps them aligned.
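
One way to keep the two pins from drifting is a small consistency check. The sketch below is an assumption and not part of this PR: the regexes, the function names, and the sample texts are illustrative, and the `version:` match is a line-based heuristic rather than a real YAML parse.

```python
import re
from typing import Dict, List, Optional

# Matches the build arg as it appears in the Dockerfile: ARG UV_VERSION=0.11.8
ARG_RE = re.compile(r"^ARG\s+UV_VERSION=(\S+)", re.MULTILINE)
# Matches a standalone `version: "x.y.z"` line (quotes optional).
PIN_RE = re.compile(
    r"""^[ \t]*version:[ \t]*["']?([0-9][^"'\s]*)["']?[ \t]*$""", re.MULTILINE
)


def dockerfile_uv_version(dockerfile_text: str) -> Optional[str]:
    match = ARG_RE.search(dockerfile_text)
    return match.group(1) if match else None


def workflow_uv_pins(workflow_text: str) -> List[str]:
    return PIN_RE.findall(workflow_text)


def mismatched_workflows(dockerfile_text: str, workflows: Dict[str, str]) -> List[str]:
    """Return workflow names whose pin differs from the Dockerfile default.

    Workflows without any `version:` line (for example one that does not
    use setup-uv) are ignored.
    """
    expected = dockerfile_uv_version(dockerfile_text)
    return sorted(
        name
        for name, text in workflows.items()
        if any(pin != expected for pin in workflow_uv_pins(text))
    )


# Hypothetical sample inputs standing in for the real files.
DOCKERFILE = "FROM myoung34/github-runner:latest\nARG UV_VERSION=0.11.8\n"
WORKFLOWS = {
    "aligned.yml": 'steps:\n  - uses: astral-sh/setup-uv@v7\n    with:\n      version: "0.11.8"\n',
    "drifted.yml": 'steps:\n  - uses: astral-sh/setup-uv@v7\n    with:\n      version: "0.12.0"\n',
    "no-uv.yml": "steps:\n  - run: echo hi\n",
}

drifted = mismatched_workflows(DOCKERFILE, WORKFLOWS)
print(drifted)
```

In a real checkout the texts would be read from `.github/runners/Dockerfile` and `.github/workflows/*.yml` instead of the inline samples.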

Python version comes from .python-version. The repo's .python-version file is the single source of truth (currently 3.12). uv resolves Python from it automatically when the file is in the build context — no explicit Python pin in the Dockerfile. Bumping the project to a new Python only requires editing .python-version and rebuilding the image.
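
Because this only works while `.python-version` reaches the build context, a guard against re-adding it to `.dockerignore` may be useful. The following is a rough sketch under stated assumptions: `fnmatch` only approximates Docker's pattern matching, the function name is hypothetical, and the two sample `.dockerignore` texts are made up to mirror the before and after states described in this PR.

```python
import fnmatch


def excluded_by_dockerignore(path, dockerignore_text):
    """Approximate .dockerignore evaluation for a top-level path.

    The last matching pattern wins and a leading '!' re-includes a path.
    Directory semantics and '**' handling are deliberately left out.
    """
    excluded = False
    for raw_line in dockerignore_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        negate = line.startswith("!")
        pattern = line.lstrip("!").rstrip("/")
        if fnmatch.fnmatch(path, pattern):
            excluded = not negate
    return excluded


# Hypothetical contents, not copied from the repository.
BEFORE = "# Virtual environments\n.venv/\n.python-version\n"
AFTER = "# Virtual environments\n.venv/\n.claude/\n"

print(excluded_by_dockerignore(".python-version", BEFORE))
print(excluded_by_dockerignore(".python-version", AFTER))
```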

Test Plan

uv run python -c "import yaml, glob; [yaml.safe_load(open(f)) for f in glob.glob('.github/workflows/*.yml')]"

# Build the runner image (from the FlowMesh checkout root, on a linux/amd64 host)
git checkout main && git pull
docker build -t flowmesh_ci_runner:dev -f .github/runners/Dockerfile .

# Sanity-check the cache is non-empty and uv is the pinned version
docker run --rm flowmesh_ci_runner:dev sh -c 'uv --version && du -sh /root/.cache/uv'

Test Result

YAML parses across all workflow files. Image build + runtime verification is the actual measurement; once a runner is using the new image, uv sync --all-extras --frozen finishes in seconds (warm cache) rather than minutes (cold download).
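
The YAML one-liner in the test plan only proves that the files parse. A slightly stronger check, sketched here as an assumption rather than something this PR ships, would confirm that each workflow carries a block-style `concurrency:` section with `cancel-in-progress: true`. The regex is a heuristic: it does not handle the inline `{ ... }` mapping form or blank lines inside the block.

```python
import glob
import re

# Top-level `concurrency:` followed by indented lines, one of which
# sets cancel-in-progress to true.
CANCEL_RE = re.compile(
    r"^concurrency:[ \t]*\n(?:[ \t]+.*\n)*?[ \t]+cancel-in-progress:[ \t]*true[ \t]*$",
    re.MULTILINE,
)


def has_cancel_in_progress(workflow_text):
    return CANCEL_RE.search(workflow_text) is not None


def workflows_missing_block(pattern=".github/workflows/*.yml"):
    """List workflow files that lack the block (run from the repo root)."""
    missing = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            if not has_cancel_in_progress(handle.read()):
                missing.append(path)
    return missing


# Inline sample so the sketch runs without a checkout.
SAMPLE = (
    "name: unit-tests\n"
    "on: [push]\n"
    "concurrency:\n"
    "  group: ${{ github.workflow }}-${{ github.ref }}\n"
    "  cancel-in-progress: true\n"
    "jobs: {}\n"
)

print(has_cancel_in_progress(SAMPLE))
```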


Pre-submission Checklist
  • I have read CONTRIBUTING.md (or AGENTS.md if no CONTRIBUTING.md).
  • I have run uv run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that the test suite passes locally.
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.

timzsu and others added 2 commits April 30, 2026 08:21
Adds the same ``concurrency: { group: ${{ github.workflow }}-${{ github.ref }},
cancel-in-progress: true }`` block to every workflow that currently lacks
one. When a new commit lands on a branch with an in-flight CI run, the
older run cancels instead of running to completion alongside the newer
one. The ``${{ github.ref }}`` keying scopes cancellation to the same
branch — concurrent runs on different branches don't fight each other.

Particularly meaningful for ``unit-tests`` (~12 min full-extras install +
pytest) and ``lint-typecheck`` (~5–10 min) on the limited self-hosted
cuda runner pool, where every superseded push otherwise pins a runner
slot until completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Adds ``with: enable-cache: true`` to every ``astral-sh/setup-uv@v7``
invocation. The action then persists ``~/.cache/uv`` between runs via
GitHub Actions Cache, keyed automatically by uv version + workflow's
detected dependency-glob (``uv.lock`` here). Cache hits skip the
download step of every wheel resolved by ``uv sync --frozen`` /
``uvx`` — meaningful for ``unit-tests`` and ``lint-typecheck`` whose
``--all-extras`` install pulls hundreds of MB of torch / transformers /
etc. The ``check-signoff`` workflow doesn't use setup-uv (just bash +
git), so nothing to add there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu changed the title from "chore: cancel superseded CI runs via concurrency block" to "chore: speed up CI (cancel superseded runs, enable uv cache)" Apr 30, 2026
@timzsu timzsu requested a review from kaiitunnz April 30, 2026 09:02
@kaiitunnz
Collaborator

@timzsu Have you investigated the cache poisoning issue?

Without ``cache-suffix``, all workflows that use ``setup-uv`` share the
same cache key (``cache-dependency-glob`` hashes the same set of repo
files for all of them). The first workflow to run populates the shared
cache; here that's ``check-pr-title``, which runs ``uv run`` against a
trivial script and uploads ~200 KB. Subsequent heavy workflows
(``unit-tests``, ``lint-typecheck``) then hit that key, restore the tiny
cache, redownload everything, and don't re-save (``actions/cache``
treats hit-then-modify as no-op).

``cache-suffix: ${{ github.workflow }}`` keys the cache per workflow.
Each workflow now maintains its own cache sized to its own install — a
fat one for ``--all-extras`` jobs, a tiny one for the script-runners.
First run on each new key is cold (full download); subsequent runs warm
hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
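
The key collision described in this commit message can be reproduced with a toy model. The key format, the `ImmutableCache` class, and the sizes in KB are simplifications chosen for illustration; the real setup-uv key format is not reproduced here.

```python
import hashlib


def cache_key(uv_version, lock_text, suffix=""):
    # Hypothetical key: uv version plus a hash of the dependency-glob files.
    digest = hashlib.sha256(lock_text.encode("utf-8")).hexdigest()[:12]
    key = f"uv-{uv_version}-{digest}"
    return f"{key}-{suffix}" if suffix else key


class ImmutableCache:
    """First writer wins: an existing key is never overwritten."""

    def __init__(self):
        self._entries = {}

    def restore(self, key):
        return self._entries.get(key)

    def save(self, key, size_kb):
        self._entries.setdefault(key, size_kb)


def second_run_restores(use_suffix):
    """Run both workflows twice and report what each restores last."""
    cache = ImmutableCache()
    installs = [("check-pr-title", 200), ("unit-tests", 10_000_000)]
    restored = {}
    for _round in range(2):
        for workflow, size_kb in installs:
            suffix = workflow if use_suffix else ""
            key = cache_key("0.11.8", "uv.lock contents", suffix)
            restored[workflow] = cache.restore(key)
            cache.save(key, size_kb)  # no-op when the key already exists
    return restored


shared = second_run_restores(use_suffix=False)
per_workflow = second_run_restores(use_suffix=True)
print(shared["unit-tests"], per_workflow["unit-tests"])
```

With a shared key the heavy workflow keeps restoring the tiny entry written first; with a per-workflow suffix it restores its own.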
@timzsu
Collaborator Author

timzsu commented Apr 30, 2026

@timzsu Have you investigated the cache poisoning issue?

The cache is stored in GitHub's cache storage as an immutable copy and retrieved via its hash key. As long as a PR is not merged, other PRs are not affected. @kaiitunnz

@timzsu timzsu changed the title from "chore: speed up CI (cancel superseded runs, enable uv cache)" to "chore: speed up CI (cancel superseded runs + pre-built runner image)" Apr 30, 2026
timzsu and others added 2 commits April 30, 2026 10:18
Replaces the ``setup-uv`` GHA-cache approach with a custom self-hosted
runner image that bakes the uv wheel cache at image-build time.

Why the swap. ``enable-cache: true`` via setup-uv routed through
GitHub Actions Cache. For FlowMesh's ``--all-extras`` install:

* ``uv cache prune --ci`` (run by setup-uv before save) stripped the
  cache from ~10 GB to ~2 MB. Subsequent runs hit a near-empty cache
  and re-downloaded everything.
* Disabling prune left a ~10 GB cache that hit the GHA cache 10 GB
  per-repo cap, evicting other caches LRU.
* A shared host-mounted writable cache would have been a poisoning
  vector (PR builds writing into a cache later read by main builds).

A pre-baked runner image side-steps all three: cache lives in the
image's read-only layers, externally managed (rebuild on schedule),
no per-run management, no concurrent-write races, no GHA cache size
contention.

Changes:

* ``.github/runners/Dockerfile`` — bases on ``myoung34/github-runner``,
  installs pinned ``UV_VERSION`` (default ``0.11.8``), ``COPY``s the
  build context (the repo) into a temp dir, runs
  ``uv sync --all-extras --frozen`` to populate ``/root/.cache/uv``,
  then deletes the source tree. Only the populated cache survives in
  the image.
* ``.github/runners/README.md`` — build command, deployment to runner
  hosts (image tag swap in the systemd units + ``systemctl restart``),
  refresh cadence (rebuild on ``uv.lock`` changes or ``UV_VERSION``
  bumps), and the workflow-side requirements.
* ``.github/workflows/{check-pr-title,check-signoff,env-examples,
  lint-typecheck,requirements-sync,unit-tests}.yml`` — drop
  ``enable-cache: true`` and ``cache-suffix:`` from every
  ``setup-uv@v7`` invocation; pin ``version: "0.11.8"`` instead. The
  pin must match Dockerfile's ``UV_VERSION`` so the cache layout
  written at image-build time is what CI's setup-uv reads back.
* ``.dockerignore`` — adds ``.claude/`` and ``.pr-body.md`` so local
  agent state and PR-body drafts never bake into any image build.

The ``concurrency:`` blocks added in this PR's earlier commits stay
unchanged — that's a separate efficiency win independent of the
caching strategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

The build was picking the latest CPython on the build host (3.14 at the
time) because ``requires-python = ">=3.12"`` is a lower bound, not a
pin. uv sync then failed because the resolved lockfile contains wheels
(e.g. ``nvidia-cudnn-frontend==1.15.0``) that only ship for cp312/cp313:

  error: Distribution `nvidia-cudnn-frontend==1.15.0` can't be installed
  because it doesn't have a source distribution or wheel for the current
  platform. You're using CPython 3.14 (`cp314`) but the wheels only have
  ABI tags `cp312`, `cp313`.

``ARG PYTHON_VERSION=3.12`` + ``ENV UV_PYTHON=${PYTHON_VERSION}`` locks
every uv invocation in the image to Python 3.12. The arg is overridable
at build time so we can bump to 3.13 once every transitive dep has
cp313 wheels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
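
The failure in this commit message comes down to a tag comparison. The sketch below is a simplified illustration with hypothetical function names: it only compares the CPython tag against the ABI tags quoted in the error and ignores stable-ABI (`abi3`) and pure-Python wheels.

```python
def interpreter_tag(python_version):
    """Map a version such as '3.12' or '3.12.4' to its CPython tag."""
    major, minor = python_version.split(".")[:2]
    return f"cp{major}{minor}"


def can_install(python_version, wheel_abi_tags, has_sdist=False):
    # Installable if a matching wheel exists or a source build is possible.
    return has_sdist or interpreter_tag(python_version) in wheel_abi_tags


# ABI tags taken from the error message above.
CUDNN_FRONTEND_TAGS = {"cp312", "cp313"}

for version in ("3.12", "3.13", "3.14"):
    print(version, can_install(version, CUDNN_FRONTEND_TAGS))
```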
@timzsu timzsu force-pushed the zsu/ci-concurrency branch from 8392388 to 4f43bc0 Compare April 30, 2026 10:19
Collaborator

@kaiitunnz kaiitunnz left a comment

Left some comments. PTAL.

Comment thread .github/runners/Dockerfile Outdated
# ``uv sync --all-extras --frozen`` warm-hit this cache instead of
# re-downloading every wheel on each run.
#
# Build context MUST be the repository root (FlowMesh is a private repo;
Collaborator

Should remove the mention about "private repo".

Comment thread .github/runners/Dockerfile Outdated
# git checkout main && git pull
# docker build \
# --build-arg UV_VERSION=0.11.8 \
# -t flowmesh-oss-ci-runner:0.11.8-$(date +%Y%m%d) \
Collaborator

Do we need to include oss in the tag? Will it collide with the other runners?

Comment thread .github/runners/Dockerfile Outdated
ARG UV_VERSION=0.11.8

# Pinned Python version. ``pyproject.toml`` says ``requires-python =
# ">=3.12"`` (a lower bound), but the resolved lockfile contains wheels
Collaborator

We have .python-version.

Collaborator Author

It was ignored in .dockerignore. I have removed the line.

Comment thread .github/runners/README.md Outdated
--build-arg UV_VERSION=0.11.8 \
--build-arg PYTHON_VERSION=3.12 \
-t flowmesh-oss-ci-runner:0.11.8-$(date +%Y%m%d) \
-t flowmesh-oss-ci-runner:latest \
Collaborator

Consider the oss tag here.

Comment thread .github/runners/README.md Outdated

The two tags give a date-stamped reference for rollback plus a moving `:latest` for the runner systemd units. Build it on a `linux/amd64` host so the cached wheels are platform-correct for the runners (which are also `linux/amd64`).

`PYTHON_VERSION` defaults to `3.12`. The lockfile includes wheels (e.g. `nvidia-cudnn-frontend`) that only ship for cp312/cp313; `requires-python = ">=3.12"` in `pyproject.toml` is a lower bound, so uv would otherwise pick the latest available CPython (3.14+) and fail. Bump to `3.13` only when every transitive dep has cp313 wheels.
Collaborator

Make it read from .python-version.

Comment thread .github/runners/README.md Outdated
```bash
sudo systemctl daemon-reload
sudo systemctl restart 'flowmesh-oss-ci-cuda-runner@*.service'
sudo systemctl restart 'flowmesh-oss-ci-gpu-runner@*.service'
Collaborator

The naming again.

* Rename image: ``flowmesh-oss-ci-runner`` → ``flowmesh-ci-runner``.
  No non-OSS counterpart in the same namespace; the ``-oss-`` infix was
  noise.
* Drop ``ARG PYTHON_VERSION`` and ``ENV UV_PYTHON``. uv reads
  ``.python-version`` (``3.12``) from the build context automatically,
  so the explicit pin was redundant — once ``.python-version`` is
  actually present in the context.
* Remove ``.python-version`` from ``.dockerignore``. It was misclassified
  under "Virtual environments" and stripped from the build context, which
  is what made uv fall back to the latest CPython on the build host. It's
  a tool-config file, not venv state. Other Dockerfiles in the repo
  (``src/server/Dockerfile``, the worker images) all use base images with
  Python baked in; none of them depend on this exclusion.
* Drop the ``# syntax=docker/dockerfile:1`` directive — no BuildKit-only
  features in this Dockerfile, and no other Dockerfile in the repo uses
  the directive.
* Drop the lengthy header / private-repo justification / inline comments
  in the Dockerfile and remove ``.github/runners/README.md`` entirely.
  The Dockerfile header now carries the single build command anyone
  needs; deploy / refresh details live in operator runbooks, not in the
  repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu requested a review from kaiitunnz April 30, 2026 12:25
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu force-pushed the zsu/ci-concurrency branch from 79e57af to c4329e5 Compare April 30, 2026 12:57
Collaborator

@kaiitunnz kaiitunnz left a comment

LGTM.

@timzsu timzsu merged commit f7ff506 into main Apr 30, 2026
6 checks passed
@timzsu timzsu deleted the zsu/ci-concurrency branch April 30, 2026 13:12