chore: speed up CI (cancel superseded runs + pre-built runner image) by timzsu · Pull Request #7 · mlsys-io/FlowMesh

timzsu · 2026-04-30T08:22:03Z

Purpose

Speed up CI on the self-hosted runners.

Cancel superseded runs. When a new commit lands on a branch with an in-flight CI run, the older run cancels instead of competing for runner capacity.
Pre-built runner image with the uv wheel cache baked in. A custom image based on myoung34/github-runner ships with /root/.cache/uv already populated for the project's --all-extras install. CI workflows that run uv sync --all-extras --frozen warm-hit this cache instead of redownloading every wheel. Published as flowmesh_ci_runner:dev in this repo's container registry.

Changes

All six existing workflows: add the following block.
```
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```
Keying the group on ${{ github.ref }} scopes it to one branch or PR ref, so two open PRs don't cancel each other.
Five workflows that use astral-sh/setup-uv: pin version: "0.11.8". The pin must match the runner image's UV_VERSION; uv's cache layout changes across minor versions, so a mismatch leaves the image-baked cache unused.
.github/runners/Dockerfile: new. Bases on myoung34/github-runner:latest, installs the pinned UV_VERSION (default 0.11.8), COPYs the build context into /tmp/flowmesh, runs uv sync --all-extras --frozen to populate /root/.cache/uv, then deletes the source tree. Only the populated cache survives in the image.
.dockerignore: drops .python-version from the venv-exclusions block. It is a tool-config file, not venv state, and stripping it from the build context made uv fall back to the latest CPython on the build host. Also adds .claude/ so local agent state never bakes into any image build.
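The effect of the concurrency block in the first item can be pictured with a toy model. This is only a sketch of the documented behavior of `cancel-in-progress: true`, not GitHub's scheduler; the `Run` and `Scheduler` names, the short SHAs, and the PR number 8 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Run:
    workflow: str
    ref: str
    sha: str
    status: str = "in_progress"


@dataclass
class Scheduler:
    """Toy model: at most one in-progress run per concurrency group."""

    active: Dict[str, Run] = field(default_factory=dict)

    def start(self, run: Run) -> None:
        # Mirrors group: ${{ github.workflow }}-${{ github.ref }}
        group = f"{run.workflow}-{run.ref}"
        previous = self.active.get(group)
        if previous is not None and previous.status == "in_progress":
            # cancel-in-progress: true cancels the superseded run.
            previous.status = "cancelled"
        self.active[group] = run


if __name__ == "__main__":
    sched = Scheduler()
    old = Run("unit-tests", "refs/pull/7/merge", "aaa1111")
    new = Run("unit-tests", "refs/pull/7/merge", "bbb2222")
    other_pr = Run("unit-tests", "refs/pull/8/merge", "ccc3333")
    for run in (old, new, other_pr):
        sched.start(run)
    for run in (old, new, other_pr):
        print(run.ref, run.sha, run.status)
```

Only the older run on the same workflow and ref is cancelled; the run on the other PR's ref lands in a different group and is left alone.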

Design

Why a pre-baked image rather than setup-uv's GHA cache. GHA cache is the wrong backend for --all-extras:

uv cache prune --ci (run by setup-uv before save) strips the cache from ~10 GB to ~2 MB. Subsequent runs hit a near-empty cache and redownload everything.
Disabling prune leaves a ~10 GB cache that hits the GHA cache 10 GB per-repo cap and evicts other caches LRU.
A shared host-mounted writable cache would be a poisoning vector — PR builds (including from external contributors) writing into a cache later read by main builds.

A pre-baked image side-steps all three: cache lives in the image's read-only layers, externally managed (rebuild on schedule), no per-run management, no concurrent-write races, no GHA cache size contention. Stale cache is never unsafe — uv falls through to network for any wheel not in the image — just slower until the rebuild.
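The "stale cache is never unsafe" point can be sketched as a set difference between what the lockfile needs and what the image already holds. This is an illustrative model only, not how uv resolves wheels internally; `plan_install` is a hypothetical helper and the `example-pkg` entries are made up (only `nvidia-cudnn-frontend==1.15.0` appears in this PR).

```python
from typing import Set, Tuple


def plan_install(locked: Set[str], baked: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split locked wheels into cache hits and network downloads."""
    hits = locked & baked        # served from the image's read-only cache
    downloads = locked - baked   # fall through to the network
    return hits, downloads


if __name__ == "__main__":
    # Image built before the lockfile bumped example-pkg (hypothetical).
    baked = {"nvidia-cudnn-frontend==1.15.0", "example-pkg==1.0.0"}
    locked = {"nvidia-cudnn-frontend==1.15.0", "example-pkg==1.1.0"}
    hits, downloads = plan_install(locked, baked)
    print("from image cache:", sorted(hits))
    print("from network:", sorted(downloads))
```

A stale image only grows the download set; the install still resolves against the lockfile, so the result is the same and only slower.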

Why pin UV_VERSION. uv changes its cache layout across minor versions. If the image bakes uv 0.11.8 but CI's setup-uv resolves uv 0.12.x, the cache may not be readable and CI falls through to a cold install. Pinning both sides (image build arg + workflow version:) keeps them aligned.
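Because the pin lives in two places, a small check could catch drift between them. The following is a rough sketch, not part of this PR: it scans text with regular expressions rather than parsing YAML, assumes the Dockerfile declares `ARG UV_VERSION=<version>` on its own line, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Optional


def dockerfile_uv_version(dockerfile_text: str) -> Optional[str]:
    """Return the default of `ARG UV_VERSION=...`, or None if absent."""
    m = re.search(r"^ARG\s+UV_VERSION=(\S+)", dockerfile_text, flags=re.MULTILINE)
    return m.group(1).strip("\"'") if m else None


def workflow_uv_pins(workflow_text: str) -> List[Optional[str]]:
    """Return one entry per setup-uv step: its `version:` value or None."""
    pins: List[Optional[str]] = []
    lines = workflow_text.splitlines()
    for i, line in enumerate(lines):
        if "astral-sh/setup-uv@" not in line:
            continue
        pin = None
        # Scan forward until the next step (a line starting with "- ").
        for follow in lines[i + 1:]:
            if follow.lstrip().startswith("- "):
                break
            m = re.match(r"\s*version:\s*[\"']?([^\"'\s]+)", follow)
            if m:
                pin = m.group(1)
                break
        pins.append(pin)
    return pins


def mismatched(dockerfile_text: str, workflows: Dict[str, str]) -> List[str]:
    """Names of workflows with a setup-uv step whose pin differs from the image."""
    expected = dockerfile_uv_version(dockerfile_text)
    return [
        name
        for name, text in workflows.items()
        if any(pin != expected for pin in workflow_uv_pins(text))
    ]
```

Run against `.github/runners/Dockerfile` and the files under `.github/workflows/`, an empty result would mean both sides agree; an unpinned setup-uv step is reported as a mismatch as well.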

Python version comes from .python-version. The repo's .python-version file is the single source of truth (currently 3.12). uv resolves Python from it automatically when the file is in the build context — no explicit Python pin in the Dockerfile. Bumping the project to a new Python only requires editing .python-version and rebuilding the image.

Test Plan

uv run python -c "import yaml, glob; [yaml.safe_load(open(f)) for f in glob.glob('.github/workflows/*.yml')]"

# Build the runner image (from the FlowMesh checkout root, on a linux/amd64 host)
git checkout main && git pull
docker build -t flowmesh_ci_runner:dev -f .github/runners/Dockerfile .

# Sanity-check the cache is non-empty and uv is the pinned version
docker run --rm flowmesh_ci_runner:dev sh -c 'uv --version && du -sh /root/.cache/uv'

Test Result

YAML parses across all workflow files. Image build + runtime verification is the actual measurement; once a runner is using the new image, uv sync --all-extras --frozen finishes in seconds (warm cache) rather than minutes (cold download).

Pre-submission Checklist

I have read CONTRIBUTING.md (or AGENTS.md if no CONTRIBUTING.md).
I have run uv run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that the test suite passes locally.
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.

Adds the same ``concurrency: { group: ${{ github.workflow }}-${{ github.ref }}, cancel-in-progress: true }`` block to every workflow that currently lacks one. When a new commit lands on a branch with an in-flight CI run, the older run cancels instead of running to completion alongside the newer one. The ``${{ github.ref }}`` keying scopes cancellation to the same branch, so concurrent runs on different branches don't fight each other.

Particularly meaningful for ``unit-tests`` (~12 min full-extras install + pytest) and ``lint-typecheck`` (~5–10 min) on the limited self-hosted cuda runner pool, where every superseded push otherwise pins a runner slot until completion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Adds ``with: enable-cache: true`` to every ``astral-sh/setup-uv@v7`` invocation. The action then persists ``~/.cache/uv`` between runs via GitHub Actions Cache, keyed automatically by uv version + workflow's detected dependency-glob (``uv.lock`` here). Cache hits skip the download step of every wheel resolved by ``uv sync --frozen`` / ``uvx``, which is meaningful for ``unit-tests`` and ``lint-typecheck`` whose ``--all-extras`` install pulls hundreds of MB of torch / transformers / etc.

The ``check-signoff`` workflow doesn't use setup-uv (just bash + git), so nothing to add there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

kaiitunnz · 2026-04-30T09:11:36Z

@timzsu Have you investigated the cache poisoning issue?

Without ``cache-suffix``, all workflows that use ``setup-uv`` share the same cache key (``cache-dependency-glob`` hashes the same set of repo files for all of them). The first workflow to run populates the shared cache; here that's ``check-pr-title``, which runs ``uv run`` against a trivial script and uploads ~200 KB. Subsequent heavy workflows (``unit-tests``, ``lint-typecheck``) then hit that key, restore the tiny cache, redownload everything, and don't re-save (``actions/cache`` treats hit-then-modify as no-op).

``cache-suffix: ${{ github.workflow }}`` keys the cache per workflow. Each workflow now maintains its own cache sized to its own install: a fat one for ``--all-extras`` jobs, a tiny one for the script-runners. First run on each new key is cold (full download); subsequent runs warm hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
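The key collision described in this commit message can be reproduced with a toy model. This is a sketch under stated assumptions: `cache_key` does not reproduce setup-uv's real key format, `ImmutableCache` only imitates the "existing key is never overwritten" behavior, and the lockfile bytes are placeholders. The ~200 KB and ~10 GB sizes are the ones quoted in this thread.

```python
import hashlib
from typing import Dict, Optional


def cache_key(uv_version: str, lock_contents: bytes, suffix: str = "") -> str:
    """Toy key: uv version + lockfile hash, plus an optional per-workflow suffix."""
    digest = hashlib.sha256(lock_contents).hexdigest()[:12]
    key = f"uv-{uv_version}-{digest}"
    return f"{key}-{suffix}" if suffix else key


class ImmutableCache:
    """First save for a key wins; saving to an existing key is a no-op."""

    def __init__(self) -> None:
        self._entries: Dict[str, int] = {}

    def save(self, key: str, size_bytes: int) -> bool:
        if key in self._entries:
            return False
        self._entries[key] = size_bytes
        return True

    def restore(self, key: str) -> Optional[int]:
        return self._entries.get(key)


if __name__ == "__main__":
    lock = b"placeholder uv.lock contents"
    cache = ImmutableCache()

    # Without a suffix: both workflows compute the same key.
    cache.save(cache_key("0.11.8", lock), 200_000)                 # check-pr-title
    saved = cache.save(cache_key("0.11.8", lock), 10_000_000_000)  # unit-tests
    print("unit-tests save accepted:", saved)

    # With a per-workflow suffix: each workflow owns its key.
    for workflow, size in (("check-pr-title", 200_000), ("unit-tests", 10_000_000_000)):
        print(workflow, cache.save(cache_key("0.11.8", lock, workflow), size))
```

In the shared-key case the heavy workflow keeps restoring the tiny entry; with the suffix, each workflow's first save succeeds under its own key.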

timzsu · 2026-04-30T09:23:27Z

> @timzsu Have you investigated the cache poisoning issue?

The cache is stored in GitHub's storage as an immutable copy and retrieved by its hash key. Until a PR is merged, its cache does not affect other PRs. @kaiitunnz

Replaces the ``setup-uv`` GHA-cache approach with a custom self-hosted runner image that bakes the uv wheel cache at image-build time.

Why the swap. ``enable-cache: true`` via setup-uv routed through GitHub Actions Cache. For FlowMesh's ``--all-extras`` install:

* ``uv cache prune --ci`` (run by setup-uv before save) stripped the cache from ~10 GB to ~2 MB. Subsequent runs hit a near-empty cache and re-downloaded everything.
* Disabling prune left a ~10 GB cache that hit the GHA cache 10 GB per-repo cap, evicting other caches LRU.
* A shared host-mounted writable cache would have been a poisoning vector (PR builds writing into a cache later read by main builds).

A pre-baked runner image side-steps all three: cache lives in the image's read-only layers, externally managed (rebuild on schedule), no per-run management, no concurrent-write races, no GHA cache size contention.

Changes:

* ``.github/runners/Dockerfile``: bases on ``myoung34/github-runner``, installs pinned ``UV_VERSION`` (default ``0.11.8``), ``COPY``s the build context (the repo) into a temp dir, runs ``uv sync --all-extras --frozen`` to populate ``/root/.cache/uv``, then deletes the source tree. Only the populated cache survives in the image.
* ``.github/runners/README.md``: build command, deployment to runner hosts (image tag swap in the systemd units + ``systemctl restart``), refresh cadence (rebuild on ``uv.lock`` changes or ``UV_VERSION`` bumps), and the workflow-side requirements.
* ``.github/workflows/{check-pr-title,check-signoff,env-examples,lint-typecheck,requirements-sync,unit-tests}.yml``: drop ``enable-cache: true`` and ``cache-suffix:`` from every ``setup-uv@v7`` invocation; pin ``version: "0.11.8"`` instead. The pin must match Dockerfile's ``UV_VERSION`` so the cache layout written at image-build time is what CI's setup-uv reads back.
* ``.dockerignore``: adds ``.claude/`` and ``.pr-body.md`` so local agent state and PR-body drafts never bake into any image build.

The ``concurrency:`` blocks added in this PR's earlier commits stay unchanged; that's a separate efficiency win independent of the caching strategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

The build was picking the latest CPython on the build host (3.14 at the time) because ``requires-python = ">=3.12"`` is a lower bound, not a pin. uv sync then failed because the resolved lockfile contains wheels (e.g. ``nvidia-cudnn-frontend==1.15.0``) that only ship for cp312/cp313:

    error: Distribution `nvidia-cudnn-frontend==1.15.0` can't be installed because it doesn't have a source distribution or wheel for the current platform. You're using CPython 3.14 (`cp314`) but the wheels only have ABI tags `cp312`, `cp313`.

``ARG PYTHON_VERSION=3.12`` + ``ENV UV_PYTHON=${PYTHON_VERSION}`` locks every uv invocation in the image to Python 3.12. The arg is overridable at build time so we can bump to 3.13 once every transitive dep has cp313 wheels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
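The failure in this commit message comes down to an ABI-tag membership test. The sketch below is a deliberate simplification: it ignores `abi3`, pure-Python wheels, and source distributions, all of which a real installer also considers, and the helper names are hypothetical.

```python
from typing import Set


def cpython_abi_tag(version: str) -> str:
    """Map a CPython version such as "3.12" or "3.12.4" to its tag, e.g. "cp312"."""
    major, minor = version.split(".")[:2]
    return f"cp{major}{minor}"


def installable(python_version: str, wheel_abi_tags: Set[str]) -> bool:
    """True if the interpreter's tag is among the tags the wheels ship for."""
    return cpython_abi_tag(python_version) in wheel_abi_tags


if __name__ == "__main__":
    # Tags taken from the error message above.
    cudnn_frontend_tags = {"cp312", "cp313"}
    for version in ("3.12", "3.13", "3.14"):
        print(version, installable(version, cudnn_frontend_tags))
```

Both 3.12 and 3.14 satisfy ``requires-python = ">=3.12"``, but only interpreters whose tag is in the wheel's set can install it, which is why a lower bound alone was not enough.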

kaiitunnz

Left some comments. PTAL.

kaiitunnz · 2026-04-30T10:47:25Z

+# ``uv sync --all-extras --frozen`` warm-hit this cache instead of
+# re-downloading every wheel on each run.
+#
+# Build context MUST be the repository root (FlowMesh is a private repo;


We should remove the mention of "private repo".

kaiitunnz · 2026-04-30T10:48:47Z

+#   git checkout main && git pull
+#   docker build \
+#     --build-arg UV_VERSION=0.11.8 \
+#     -t flowmesh-oss-ci-runner:0.11.8-$(date +%Y%m%d) \


Do we need to include oss in the tag? Will it collide with the other runners?

kaiitunnz · 2026-04-30T11:01:03Z

+ARG UV_VERSION=0.11.8
+
+# Pinned Python version. ``pyproject.toml`` says ``requires-python =
+# ">=3.12"`` (a lower bound), but the resolved lockfile contains wheels


We have .python-version.

It was ignored in .dockerignore. I have removed the line.

kaiitunnz · 2026-04-30T11:08:53Z

+  --build-arg UV_VERSION=0.11.8 \
+  --build-arg PYTHON_VERSION=3.12 \
+  -t flowmesh-oss-ci-runner:0.11.8-$(date +%Y%m%d) \
+  -t flowmesh-oss-ci-runner:latest \


Consider the oss tag here.

kaiitunnz · 2026-04-30T11:11:18Z

+
+The two tags give a date-stamped reference for rollback plus a moving `:latest` for the runner systemd units. Build it on a `linux/amd64` host so the cached wheels are platform-correct for the runners (which are also `linux/amd64`).
+
+`PYTHON_VERSION` defaults to `3.12`. The lockfile includes wheels (e.g. `nvidia-cudnn-frontend`) that only ship for cp312/cp313; `requires-python = ">=3.12"` in `pyproject.toml` is a lower bound, so uv would otherwise pick the latest available CPython (3.14+) and fail. Bump to `3.13` only when every transitive dep has cp313 wheels.


Make it read from .python-version.

kaiitunnz · 2026-04-30T11:12:18Z

+```bash
+sudo systemctl daemon-reload
+sudo systemctl restart 'flowmesh-oss-ci-cuda-runner@*.service'
+sudo systemctl restart 'flowmesh-oss-ci-gpu-runner@*.service'


The naming again.

* Rename image: ``flowmesh-oss-ci-runner`` → ``flowmesh-ci-runner``. No non-OSS counterpart in the same namespace; the ``-oss-`` infix was noise.
* Drop ``ARG PYTHON_VERSION`` and ``ENV UV_PYTHON``. uv reads ``.python-version`` (``3.12``) from the build context automatically, so the explicit pin was redundant once ``.python-version`` is actually present in the context.
* Remove ``.python-version`` from ``.dockerignore``. It was misclassified under "Virtual environments" and stripped from the build context, which is what made uv fall back to the latest CPython on the build host. It's a tool-config file, not venv state. Other Dockerfiles in the repo (``src/server/Dockerfile``, the worker images) all use base images with Python baked in; none of them depend on this exclusion.
* Drop the ``# syntax=docker/dockerfile:1`` directive: no BuildKit-only features in this Dockerfile, and no other Dockerfile in the repo uses the directive.
* Drop the lengthy header / private-repo justification / inline comments in the Dockerfile and remove ``.github/runners/README.md`` entirely. The Dockerfile header now carries the single build command anyone needs; deploy / refresh details live in operator runbooks, not in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

kaiitunnz

LGTM.

timzsu and others added 2 commits April 30, 2026 08:21

timzsu changed the title ~~chore: cancel superseded CI runs via concurrency block~~ chore: speed up CI (cancel superseded runs, enable uv cache) Apr 30, 2026

timzsu requested a review from kaiitunnz April 30, 2026 09:02

timzsu changed the title ~~chore: speed up CI (cancel superseded runs, enable uv cache)~~ chore: speed up CI (cancel superseded runs + pre-built runner image) Apr 30, 2026

timzsu and others added 2 commits April 30, 2026 10:18

timzsu force-pushed the zsu/ci-concurrency branch from 8392388 to 4f43bc0 Compare April 30, 2026 10:19

kaiitunnz requested changes Apr 30, 2026

timzsu requested a review from kaiitunnz April 30, 2026 12:25

Update Dockerfile.

c4329e5

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

timzsu force-pushed the zsu/ci-concurrency branch from 79e57af to c4329e5 Compare April 30, 2026 12:57

kaiitunnz approved these changes Apr 30, 2026

timzsu merged commit f7ff506 into main Apr 30, 2026
6 checks passed

timzsu deleted the zsu/ci-concurrency branch April 30, 2026 13:12

timzsu merged 7 commits into main from zsu/ci-concurrency
