chore: add Dockerfile for self-hosted runner with pre-built uv cache by timzsu · Pull Request #8 · mlsys-io/FlowMesh

timzsu · 2026-04-30T09:53:25Z

Purpose

Land a Dockerfile that builds a custom self-hosted runner image with the uv wheel cache pre-populated for the project's --all-extras install. Once the runner hosts use this image, CI workflows that run uv sync --all-extras --frozen warm-hit the cache instead of redownloading every wheel.

The previous attempt at speeding up CI used astral-sh/setup-uv@v7's GitHub Actions cache integration. That doesn't work for a --all-extras install, for three reasons: uv cache prune --ci strips the cache to ~2.7 MB, so almost nothing is left to restore; the GitHub Actions cache caps at 10 GB per repo, and the unpruned cache is ~10.4 GB; and a shared host-mounted writable cache would be a poisoning vector. A pre-baked image sidesteps all three: the cache lives in the image's read-only layers, it is managed externally (rebuilt on a schedule), and it needs no per-run management.

Changes

.github/runners/Dockerfile: builds on myoung34/github-runner:latest, installs a pinned UV_VERSION (default 0.11.8), COPYs the build context into /tmp/flowmesh, runs uv sync --all-extras --frozen to populate /root/.cache/uv, then deletes the source tree. Only the populated cache survives in the image.
.github/runners/README.md: documents the build command, deployment to runner hosts (image tag swap in the systemd units + systemctl restart), refresh cadence (rebuild on uv.lock changes or UV_VERSION bumps), and the workflow-side change required (pin setup-uv's version: to match UV_VERSION, set enable-cache: false).
.dockerignore: adds .claude/ and .pr-body.md so local agent state and PR-body drafts never bake into any image build.
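
As a rough illustration of the cache-population step in the Dockerfile, the same sequence can be written as a plain shell function. This is only a sketch: the function name populate_uv_cache and its argument are hypothetical, and the actual Dockerfile may structure the step differently (for example as a single RUN line).

```shell
#!/bin/sh
# Sketch only: populate_uv_cache is a hypothetical name, not taken from the PR.
# Usage: populate_uv_cache /tmp/flowmesh
populate_uv_cache() {
    src_dir=$1
    # Install the locked --all-extras dependency set. As a side effect uv
    # fills its cache directory (/root/.cache/uv when run as root).
    (cd "$src_dir" && uv sync --all-extras --frozen) || return 1
    # Remove the source tree, including the .venv that uv created inside it,
    # so that only the populated cache remains.
    rm -rf "$src_dir"
}
```

If uv sync fails, the function returns early and leaves the source tree in place, which makes a failed build easier to inspect.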

Design

Why COPY rather than git clone: FlowMesh is private. Cloning at build time would require either an SSH key forwarded via BuildKit or a token in a build secret. Using the build context (the maintainer's local checkout) avoids both, and the pattern is "build it from the same place you'd run uv sync manually."

Why pin UV_VERSION: uv changes its cache layout across minor versions. If the image bakes uv 0.11.8 but CI's setup-uv resolves uv 0.12.x, the cache may not be readable and CI falls through to a cold install. Pinning both sides (image build arg + workflow version:) keeps them aligned.
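
One way to catch a drift between the two pins is a small guard that compares the version uv reports with the version the workflow expects. The sketch below assumes uv --version prints a line of the form "uv <version> ..."; the function names are hypothetical and nothing like this is part of the PR.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Extract the version field from `uv --version` output,
# e.g. "uv 0.11.8 (...)" gives "0.11.8". Assumes that output format.
uv_version_of() {
    printf '%s\n' "$1" | awk '{print $2}'
}

# check_uv_pin <expected-version> <uv --version output>
# Succeeds when the reported version equals the pinned one.
check_uv_pin() {
    expected=$1
    actual=$(uv_version_of "$2")
    if [ "$actual" = "$expected" ]; then
        echo "ok: uv $actual matches the pin"
    else
        echo "mismatch: found uv $actual, expected $expected" >&2
        return 1
    fi
}
```

A workflow step or image smoke test could call it as check_uv_pin 0.11.8 "$(uv --version)" and fail fast on a mismatch instead of silently falling through to a cold install.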

Cache freshness: rebuild when uv.lock changes on main. A stale cache is not unsafe, only slower until the rebuild: uv falls through to the network for any wheel it doesn't find.
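
The rebuild trigger can be approximated by hashing uv.lock and comparing it with the hash recorded at the last image build. This is a minimal sketch assuming sha256sum is available on the build host; the function names and the stamp file are hypothetical, not something the PR ships.

```shell
#!/bin/sh
# Sketch only: needs_rebuild, record_build and the stamp file are hypothetical.

# needs_rebuild <uv.lock path> <stamp file>
# Succeeds (exit 0) when the image should be rebuilt.
needs_rebuild() {
    current=$(sha256sum "$1" | cut -d ' ' -f 1)
    # No stamp file means no build has been recorded yet.
    if [ ! -f "$2" ]; then
        return 0
    fi
    recorded=$(cat "$2")
    [ "$current" != "$recorded" ]
}

# record_build <uv.lock path> <stamp file>
# Store the lockfile hash after a successful image build.
record_build() {
    sha256sum "$1" | cut -d ' ' -f 1 > "$2"
}
```

A scheduled job could run needs_rebuild uv.lock .runner-image.stamp, rebuild the image when it succeeds, and then call record_build.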

Test Plan

# Build locally to verify the Dockerfile is correct
git checkout main && git pull
docker build \
  --build-arg UV_VERSION=0.11.8 \
  -t flowmesh-oss-ci-runner:test \
  -f .github/runners/Dockerfile \
  .

# Sanity-check the cache is non-empty and uv is the pinned version
docker run --rm flowmesh-oss-ci-runner:test sh -c 'uv --version && du -sh /root/.cache/uv'
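
The du -sh check above is meant to be read by eye. For an automated smoke test, a threshold check along the following lines may be more useful. It is a sketch: the function name and the threshold are hypothetical, and the right minimum size depends on what --all-extras resolves to.

```shell
#!/bin/sh
# Sketch only: cache_at_least is a hypothetical helper.
# cache_at_least <dir> <min-size-in-KiB>
# Succeeds when the directory's disk usage is at least the given size.
cache_at_least() {
    # du -sk prints "<size-in-KiB><TAB><path>"; keep the first field.
    size_kib=$(du -sk "$1" | cut -f 1)
    [ "$size_kib" -ge "$2" ]
}
```

Inside the container this could be invoked as cache_at_least /root/.cache/uv 1048576 to require at least 1 GiB of cached wheels.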

Test Result

Build succeeds locally; the image's /root/.cache/uv is on the order of gigabytes (the actual size depends on what --all-extras resolves to at the time of build).

Pre-submission Checklist

I have read CONTRIBUTING.md (or AGENTS.md if no CONTRIBUTING.md).
I have run uv run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that the test suite passes locally.
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.

Adds ``.github/runners/Dockerfile`` that builds a custom GitHub Actions runner image based on ``myoung34/github-runner`` with the uv wheel cache pre-populated for the project's ``--all-extras`` install. CI workflows that run ``uv sync --all-extras --frozen`` warm-hit this cache instead of redownloading every wheel on each run.

Build context is the repo root; the image COPYs the FlowMesh source in (no git-clone-with-auth needed at build time, FlowMesh stays a private repo). Pinned uv version via ``--build-arg UV_VERSION``; the workflows' ``setup-uv@v7`` calls must pin ``version:`` to the same value or the cache layout written here may not match what CI reads back.

``.github/runners/README.md`` documents the build command, deployment to runner hosts (image tag swap in the systemd units), and the refresh cadence (rebuild on ``uv.lock`` changes or ``UV_VERSION`` bumps).

``.dockerignore`` gains ``.claude/`` and ``.pr-body.md`` so local agent state and PR-body drafts never bake into any image build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>


* feat(worker): agent connector for lumid.data NL-driven retrieval

Wire ``data.type == "agent"`` into ``DataRetrievalExecutor`` so a worker task can describe what it wants in natural language plus a schema scope and let lumid.data's ``/retrieve/v1`` plan + replay the chain server-side. Each item carries the materialized DataFrame and the typed access chain so a downstream consumer binds to either via the existing ``path: items.X`` resolver.

Worker delivery wiring: ``analytics`` extra picks up the ``lumid-data-sdk`` git+ pin; ``sync_requirements.py`` regex extended for the PEP 508 ``name @ git+url`` form; the security workflow filters ``@ git+`` deps before ``pip-audit --strict`` since PyPI doesn't carry git-source deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* docs(templates): minimal e2e template for the agent connector

Two-stage workflow: agent retrieval against lumid.data, then a Qwen 1.5B summary node consuming the materialized table and access chain. One ``flowmesh workflow submit`` exercises the connector and downstream consumption end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* Trim comments

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(worker): make schema_scope optional in the agent retrieval branch

Drop ``schema_scope`` from the executor's required-keys validation and let the connector forward ``None`` to the SDK so a workflow can omit the field when it wants lumid.data to default to all visible schemas. The e2e template drops its explicit scope to exercise the new path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* chore(worker): bump lumid-data-sdk to merged main

Track ``a800051`` so the SDK pin reflects PR #8 (optional ``schema_scope``) on lumid.data main. Wire contract is unchanged from the FlowMesh side — the connector already passes ``None`` through ``model_dump(exclude_none=True)``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(server,worker): make manifest writes safe under shared-volume UID mismatch

Single-node deployments can share the results volume between the server (root) and supervisor-spawned workers (appuser). Both call sync_manifest, so the prior direct write_text raced to EACCES on the second writer when the manifest was already owned by the first writer's UID.

prepare_output_dir now chmods each managed directory to 0o0777 (best-effort, tolerant of cross-UID ownership). sync_manifest writes the manifest with write_text and then chmods it to 0o0666 so the next sync_manifest call from a peer UID can overwrite the file directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(worker): keep nested DataFrame cells un-truncated when rendering prompts

Aggregate templates that bind a per-row pd.DataFrame value into the prompt rendered the cell via tabulate's to_markdown, which falls back to pandas' default __str__ on each DataFrame entry. The default 80-col display width clipped middle columns to '...', so the consumer LLM only saw the first and last few columns of any wide retrieval result.

Wrap the to_markdown sites in pd.option_context with max_columns/width/max_colwidth set to None so DataFrames render in full regardless of the calling environment's pandas display defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* refactor(server,worker): address review feedback

Four cleanups raised in code review:

- Move the duplicated ``prepare_output_dir`` / ``sync_manifest`` (and helpers) from ``src/server/utils/manifest.py`` and ``src/worker/utils/manifest.py`` into a single ``src/shared/utils/manifest.py`` exposing everything either side uses (including the worker-only ``scratch_dir`` / ``SCRATCH_DIR``); rewrite every call site to import from ``shared.utils.manifest`` and move the helper tests under ``tests/shared/utils/``.
- Drop two unnecessary ``# type: ignore`` comments on ``self._normalize_params`` calls in ``data_retrieval_executor`` — mypy resolves them cleanly without an escape.
- Replace the ``# type: ignore[import-untyped]`` on ``lumid_data.sdk.Client`` with a ``follow_untyped_imports`` override in ``pyproject.toml`` so the override applies to the whole SDK and goes away as soon as upstream ships type stubs.
- Name the worker-CPU pip-audit input file ``/tmp/requirements-worker-cpu-audit.txt`` so its purpose reads at a glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* chore(deps): pip-audit ignore three new advisories

GHSA-x368-4g9h-fvv4 (vllm 0.18.0, fix 0.19.1) and GHSA-83vm-p52w-f9pw (vllm 0.18.0, fix 0.20.0) join the existing list — both are blocked by the same transformers 4.57 / inference-deps pin that already keeps the other vllm advisories on the ignore list.

GHSA-j7w6-vpvq-j3gm (diffusers 0.36.0, fix 0.38.0) is added separately: diffusers 0.38 requires safetensors>=0.8.0rc0, which uv lock refuses to resolve without an explicit pre-release opt-in. Holding the floor at 0.36 and ignoring until safetensors ships a non-rc 0.8.

Update the upgrade-blocker table in CODE_STYLE.md alongside the workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(server,worker): switch manifest writes to chmod-after-write

Replaces the tempfile + os.replace approach with a direct write_text + chmod 0o0666 on the manifest, and a guarded mkdir + chmod 0o0777 on the output directories. Both rely on the file/dir's owner being the only caller that needs to run chmod, which holds because sync_manifest is the sole writer of the manifest and prepare_output_dir's chmod runs only on creation.

The "best-effort" PermissionError swallow on chmod is gone — a chmod failure now propagates loudly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* refactor(server,worker): atomic write for files written by multiple parties

Promote ``shared.utils.manifest``'s tempfile + os.replace path to a standalone ``shared.utils.atomic.atomic_write_text`` helper and apply it to every file the server and the worker can both write: the per-task ``manifest.json`` and ``results.json``. Each write goes through a tempfile in the same directory, gets chmodded to ``0o0666``, and is swapped in via ``os.replace`` so a peer-UID writer can replace it without permission issues and a crash mid-write leaves either the old file or the new one — never a half-written one.

Drop the manifest-permission/overwrite tests in ``tests/worker/test_task_output.py`` that the shared-utils suite already covers, and add a one-line comment over the two ``pd.option_context`` sites in graph_templates so the intent of the width-cap toggles is obvious.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

---------

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

timzsu closed this Apr 30, 2026

timzsu deleted the zsu/runner-image branch April 30, 2026 09:54

timzsu wants to merge 1 commit into main from zsu/runner-image
