chore: speed up CI (cancel superseded runs + pre-built runner image) #7
Adds the same ``concurrency: { group: ${{ github.workflow }}-${{ github.ref }},
cancel-in-progress: true }`` block to every workflow that currently lacks
one. When a new commit lands on a branch with an in-flight CI run, the
older run cancels instead of running to completion alongside the newer
one. The ``${{ github.ref }}`` keying scopes cancellation to the same
branch — concurrent runs on different branches don't fight each other.
Particularly meaningful for ``unit-tests`` (~12 min full-extras install +
pytest) and ``lint-typecheck`` (~5–10 min) on the limited self-hosted
cuda runner pool, where every superseded push otherwise pins a runner
slot until completion.
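Spelled out, the block named above expands to:

```yaml
concurrency:
  # One group per workflow + branch; a newer run in the same group
  # cancels the older in-flight one.
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```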
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Adds ``with: enable-cache: true`` to every ``astral-sh/setup-uv@v7`` invocation. The action then persists ``~/.cache/uv`` between runs via GitHub Actions Cache, keyed automatically by uv version + the workflow's detected dependency glob (``uv.lock`` here). Cache hits skip the download step for every wheel resolved by ``uv sync --frozen`` / ``uvx`` — meaningful for ``unit-tests`` and ``lint-typecheck``, whose ``--all-extras`` install pulls hundreds of MB of torch / transformers / etc. The ``check-signoff`` workflow doesn't use setup-uv (just bash + git), so there's nothing to add there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
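As a sketch, the step shape this describes (step placement within a job is illustrative):

```yaml
- uses: astral-sh/setup-uv@v7
  with:
    # Persist ~/.cache/uv between runs via GitHub Actions Cache.
    enable-cache: true
```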
@timzsu Have you investigated the cache poisoning issue?
Without ``cache-suffix``, all workflows that use ``setup-uv`` share the
same cache key (``cache-dependency-glob`` hashes the same set of repo
files for all of them). The first workflow to run populates the shared
cache; here that's ``check-pr-title``, which runs ``uv run`` against a
trivial script and uploads ~200 KB. Subsequent heavy workflows
(``unit-tests``, ``lint-typecheck``) then hit that key, restore the tiny
cache, redownload everything, and don't re-save (``actions/cache``
treats hit-then-modify as no-op).
``cache-suffix: ${{ github.workflow }}`` keys the cache per workflow.
Each workflow now maintains its own cache sized to its own install — a
fat one for ``--all-extras`` jobs, a tiny one for the script-runners.
First run on each new key is cold (full download); subsequent runs warm
hit.
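With the suffix, the invocation described above becomes (sketch):

```yaml
- uses: astral-sh/setup-uv@v7
  with:
    enable-cache: true
    # Key the cache per workflow so the heavy --all-extras jobs and the
    # tiny script-runner jobs no longer share one key.
    cache-suffix: ${{ github.workflow }}
```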
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
The cache is stored in GitHub's cache storage as an immutable copy and retrieved via the hash key. As long as a PR is not merged, other PRs will not be affected. @kaiitunnz
Replaces the ``setup-uv`` GHA-cache approach with a custom self-hosted
runner image that bakes the uv wheel cache at image-build time.
Why the swap. ``enable-cache: true`` via setup-uv routed through
GitHub Actions Cache. For FlowMesh's ``--all-extras`` install:
* ``uv cache prune --ci`` (run by setup-uv before save) stripped the
cache from ~10 GB to ~2 MB. Subsequent runs hit a near-empty cache
and re-downloaded everything.
* Disabling prune left a ~10 GB cache that hit GHA's 10 GB per-repo
  cache cap, evicting other caches LRU.
* A shared host-mounted writable cache would have been a poisoning
vector (PR builds writing into a cache later read by main builds).
A pre-baked runner image side-steps all three: cache lives in the
image's read-only layers, externally managed (rebuild on schedule),
no per-run management, no concurrent-write races, no GHA cache size
contention.
Changes:
* ``.github/runners/Dockerfile`` — bases on ``myoung34/github-runner``,
installs pinned ``UV_VERSION`` (default ``0.11.8``), ``COPY``s the
build context (the repo) into a temp dir, runs
``uv sync --all-extras --frozen`` to populate ``/root/.cache/uv``,
then deletes the source tree. Only the populated cache survives in
the image.
* ``.github/runners/README.md`` — build command, deployment to runner
hosts (image tag swap in the systemd units + ``systemctl restart``),
refresh cadence (rebuild on ``uv.lock`` changes or ``UV_VERSION``
bumps), and the workflow-side requirements.
* ``.github/workflows/{check-pr-title,check-signoff,env-examples,
lint-typecheck,requirements-sync,unit-tests}.yml`` — drop
``enable-cache: true`` and ``cache-suffix:`` from every
``setup-uv@v7`` invocation; pin ``version: "0.11.8"`` instead. The
pin must match Dockerfile's ``UV_VERSION`` so the cache layout
written at image-build time is what CI's setup-uv reads back.
* ``.dockerignore`` — adds ``.claude/`` and ``.pr-body.md`` so local
agent state and PR-body drafts never bake into any image build.
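A minimal sketch of the Dockerfile described above — the uv install method and exact paths are assumptions here, not the committed file:

```dockerfile
FROM myoung34/github-runner:latest

# Pin uv so the baked cache layout matches what CI's setup-uv reads back.
ARG UV_VERSION=0.11.8
RUN curl -LsSf "https://astral.sh/uv/${UV_VERSION}/install.sh" | sh && \
    cp "$HOME/.local/bin/uv" /usr/local/bin/uv

# Populate /root/.cache/uv from the lockfile, then drop the source tree;
# only the populated cache survives in the image layers.
COPY . /tmp/flowmesh
RUN cd /tmp/flowmesh && \
    uv sync --all-extras --frozen && \
    rm -rf /tmp/flowmesh
```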
The ``concurrency:`` blocks added in this PR's earlier commits stay
unchanged — that's a separate efficiency win independent of the
caching strategy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
The build was picking the latest CPython on the build host (3.14 at the
time) because ``requires-python = ">=3.12"`` is a lower bound, not a
pin. uv sync then failed because the resolved lockfile contains wheels
(e.g. ``nvidia-cudnn-frontend==1.15.0``) that only ship for cp312/cp313:
    error: Distribution `nvidia-cudnn-frontend==1.15.0` can't be installed
    because it doesn't have a source distribution or wheel for the current
    platform. You're using CPython 3.14 (`cp314`) but the wheels only have
    ABI tags `cp312`, `cp313`.
``ARG PYTHON_VERSION=3.12`` + ``ENV UV_PYTHON=${PYTHON_VERSION}`` locks
every uv invocation in the image to Python 3.12. The arg is overridable
at build time so we can bump to 3.13 once every transitive dep has
cp313 wheels.
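As a fragment, the fix described above:

```dockerfile
# Overridable at build time: --build-arg PYTHON_VERSION=3.13 once every
# transitive dep ships cp313 wheels.
ARG PYTHON_VERSION=3.12
# UV_PYTHON makes every uv invocation in the image resolve this
# interpreter instead of the build host's latest CPython.
ENV UV_PYTHON=${PYTHON_VERSION}
```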
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Force-pushed 8392388 to 4f43bc0.
kaiitunnz left a comment:
Left some comments. PTAL.
    # ``uv sync --all-extras --frozen`` warm-hit this cache instead of
    # re-downloading every wheel on each run.
    #
    # Build context MUST be the repository root (FlowMesh is a private repo;
Should remove the mention about "private repo".
    # git checkout main && git pull
    # docker build \
    #   --build-arg UV_VERSION=0.11.8 \
    #   -t flowmesh-oss-ci-runner:0.11.8-$(date +%Y%m%d) \
Do we need to include oss in the tag? Will it collide with the other runners?
    ARG UV_VERSION=0.11.8

    # Pinned Python version. ``pyproject.toml`` says ``requires-python =
    # ">=3.12"`` (a lower bound), but the resolved lockfile contains wheels
We have .python-version.
It was ignored in .dockerignore. I have removed the line.
    --build-arg UV_VERSION=0.11.8 \
    --build-arg PYTHON_VERSION=3.12 \
    -t flowmesh-oss-ci-runner:0.11.8-$(date +%Y%m%d) \
    -t flowmesh-oss-ci-runner:latest \
Consider the oss tag here.
    The two tags give a date-stamped reference for rollback plus a moving `:latest` for the runner systemd units. Build it on a `linux/amd64` host so the cached wheels are platform-correct for the runners (which are also `linux/amd64`).

    `PYTHON_VERSION` defaults to `3.12`. The lockfile includes wheels (e.g. `nvidia-cudnn-frontend`) that only ship for cp312/cp313; `requires-python = ">=3.12"` in `pyproject.toml` is a lower bound, so uv would otherwise pick the latest available CPython (3.14+) and fail. Bump to `3.13` only when every transitive dep has cp313 wheels.
Make it read from .python-version.
```bash
sudo systemctl daemon-reload
sudo systemctl restart 'flowmesh-oss-ci-cuda-runner@*.service'
sudo systemctl restart 'flowmesh-oss-ci-gpu-runner@*.service'
```
* Rename image: ``flowmesh-oss-ci-runner`` → ``flowmesh-ci-runner``. No
  non-OSS counterpart in the same namespace; the ``-oss-`` infix was noise.
* Drop ``ARG PYTHON_VERSION`` and ``ENV UV_PYTHON``. uv reads
  ``.python-version`` (``3.12``) from the build context automatically, so
  the explicit pin was redundant — once ``.python-version`` is actually
  present in the context.
* Remove ``.python-version`` from ``.dockerignore``. It was misclassified
  under "Virtual environments" and stripped from the build context, which
  is what made uv fall back to the latest CPython on the build host. It's
  a tool-config file, not venv state. Other Dockerfiles in the repo
  (``src/server/Dockerfile``, the worker images) all use base images with
  Python baked in; none of them depend on this exclusion.
* Drop the ``# syntax=docker/dockerfile:1`` directive — no BuildKit-only
  features in this Dockerfile, and no other Dockerfile in the repo uses
  the directive.
* Drop the lengthy header / private-repo justification / inline comments
  in the Dockerfile and remove ``.github/runners/README.md`` entirely.
  The Dockerfile header now carries the single build command anyone
  needs; deploy / refresh details live in operator runbooks, not in the
  repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Force-pushed 79e57af to c4329e5.
Purpose
Speed up CI on the self-hosted runners.
A `myoung34/github-runner`-based image ships with `/root/.cache/uv` already populated for the project's `--all-extras` install. CI workflows that run `uv sync --all-extras --frozen` warm-hit this cache instead of redownloading every wheel. Published as `flowmesh_ci_runner:dev` in this repo's container registry.

Changes
* All six existing workflows — adds the `concurrency:` block.
  `${{ github.ref }}` keys per branch so two open PRs don't cancel each
  other.
* Five workflows that use `astral-sh/setup-uv` — pin `version: "0.11.8"`.
  Must match the runner image's `UV_VERSION`; uv's cache layout changes
  across minor versions, so a mismatch leaves the image-baked cache
  unused.
* `.github/runners/Dockerfile` — new. Bases on
  `myoung34/github-runner:latest`, installs pinned `UV_VERSION` (default
  `0.11.8`), `COPY`s the build context into `/tmp/flowmesh`, runs
  `uv sync --all-extras --frozen` to populate `/root/.cache/uv`, then
  deletes the source tree. Only the populated cache survives in the
  image.
* `.dockerignore` — drops `.python-version` from the venv-exclusions
  block (it's a tool-config file, not venv state — it was being stripped
  from the build context, which made uv fall back to the latest CPython
  on the build host). Adds `.claude/` so local agent state never bakes
  into any image build.

Design
Why a pre-baked image rather than `setup-uv`'s GHA cache. GHA cache is the wrong backend for `--all-extras`:

* `uv cache prune --ci` (run by setup-uv before save) strips the cache
  from ~10 GB to ~2 MB. Subsequent runs hit a near-empty cache and
  redownload everything.
* Disabling prune leaves a ~10 GB cache that hits GHA's 10 GB per-repo
  cap, evicting other caches LRU.
* A shared host-mounted writable cache would be a poisoning vector — PR
  builds writing into a cache later read by `main` builds.

A pre-baked image side-steps all three: the cache lives in the image's read-only layers, externally managed (rebuild on schedule), no per-run management, no concurrent-write races, no GHA cache size contention. A stale cache is never unsafe — uv falls through to network for any wheel not in the image — just slower until the rebuild.
Why pin `UV_VERSION`. uv changes its cache layout across minor versions. If the image bakes uv 0.11.8 but CI's setup-uv resolves uv 0.12.x, the cache may not be readable and CI falls through to a cold install. Pinning both sides (image build arg + workflow `version:`) keeps them aligned.

Python version comes from `.python-version`. The repo's `.python-version` file is the single source of truth (currently `3.12`). uv resolves Python from it automatically when the file is in the build context — no explicit Python pin in the Dockerfile. Bumping the project to a new Python only requires editing `.python-version` and rebuilding the image.

Test Plan
Test Result
YAML parses across all workflow files. Image build + runtime verification is the actual measurement; once a runner is using the new image, `uv sync --all-extras --frozen` finishes in seconds (warm cache) rather than minutes (cold download).

Pre-submission Checklist
* Read `CONTRIBUTING.md` (or `AGENTS.md` if no `CONTRIBUTING.md`).
* Ran `uv run pre-commit run --all-files` and fixed any issues.
* Marked breaking changes with `[BREAKING]` and described migration steps above.