chore: add Dockerfile for self-hosted runner with pre-built uv cache #8

Closed
timzsu wants to merge 1 commit into main from zsu/runner-image

Conversation

Collaborator

@timzsu timzsu commented Apr 30, 2026

Purpose

Land a Dockerfile that builds a custom self-hosted runner image with the uv wheel cache pre-populated for the project's --all-extras install. Once the runner hosts use this image, CI workflows that run uv sync --all-extras --frozen warm-hit the cache instead of redownloading every wheel.

The previous attempt at speeding up CI used the GitHub Actions cache integration in astral-sh/setup-uv@v7. That doesn't work for a --all-extras install, for three reasons: uv cache prune --ci strips the cache to ~2.7 MB, which leaves too little to be worth restoring; the GitHub Actions cache caps at 10 GB per repo, and the unpruned cache is ~10.4 GB; and a shared host-mounted writable cache would be a poisoning vector. A pre-baked image side-steps all three: the cache lives in the image's read-only layers, is managed externally (rebuilt on a schedule), and needs no per-run management.

Changes

  • .github/runners/Dockerfile: based on myoung34/github-runner:latest. Installs a pinned UV_VERSION (default 0.11.8), COPYs the build context into /tmp/flowmesh, runs uv sync --all-extras --frozen to populate /root/.cache/uv, then deletes the source tree. Only the populated cache survives in the image.
  • .github/runners/README.md: documents the build command, deployment to runner hosts (image tag swap in the systemd units, then systemctl restart), the refresh cadence (rebuild on uv.lock changes or UV_VERSION bumps), and the workflow-side change required (pin setup-uv's version: to match UV_VERSION and set enable-cache: false).
  • .dockerignore: adds .claude/ and .pr-body.md so local agent state and PR-body drafts never bake into any image build.

Design

Why COPY rather than git clone: FlowMesh is private. Cloning at build time would require either an SSH key forwarded via BuildKit or a token in a build secret. Using the build context (the maintainer's local checkout) avoids both, and the pattern is "build it from the same place you'd run uv sync manually."

Why pin UV_VERSION: uv changes its cache layout across minor versions. If the image bakes uv 0.11.8 but CI's setup-uv resolves uv 0.12.x, the cache may not be readable and CI falls through to a cold install. Pinning both sides (image build arg + workflow version:) keeps them aligned.
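
A small check can catch drift between the two pins before it shows up as a cold install. The sketch below is a hypothetical helper, not part of this PR. It assumes the Dockerfile declares the default as `ARG UV_VERSION=<version>` and that the workflow passes `version: "<version>"` to setup-uv; the function names and regexes are illustrative only.

```python
import re
from typing import Optional, Set

# Hypothetical helper, not part of this PR. Assumes the Dockerfile declares
# `ARG UV_VERSION=<version>` and the workflow sets `version: "<version>"`.
ARG_RE = re.compile(r"^\s*ARG\s+UV_VERSION=(\S+)\s*$", re.MULTILINE)
WORKFLOW_RE = re.compile(
    r"^\s*version:\s*[\"']?([0-9][^\"'\s]*)[\"']?\s*$", re.MULTILINE
)


def dockerfile_uv_version(dockerfile_text: str) -> Optional[str]:
    """Return the default UV_VERSION build arg, or None if it is absent."""
    match = ARG_RE.search(dockerfile_text)
    return match.group(1) if match else None


def workflow_uv_versions(workflow_text: str) -> Set[str]:
    """Return every numeric `version:` value found in the workflow text."""
    return set(WORKFLOW_RE.findall(workflow_text))


def pins_match(dockerfile_text: str, workflow_text: str) -> bool:
    """True only when the workflow pins exactly the image's uv version."""
    image_version = dockerfile_uv_version(dockerfile_text)
    if image_version is None:
        return False
    return workflow_uv_versions(workflow_text) == {image_version}
```

Run against the real files in a pre-commit hook or a CI step, a check along these lines would fail as soon as one side is bumped without the other. A build started with an explicit `--build-arg UV_VERSION=...` override would bypass the Dockerfile default, so this only guards the default value.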

Cache freshness: rebuild when uv.lock changes on main. Stale cache is not unsafe — uv falls through to network for any wheel it doesn't find — just slower until the rebuild.
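
The rebuild trigger can be reduced to a fingerprint over the two inputs that determine the cache contents, uv.lock and UV_VERSION. The following is a minimal sketch under that assumption. The stamp file that records the last build's fingerprint and both function names are hypothetical and not part of this PR.

```python
import hashlib
from pathlib import Path


def cache_fingerprint(lock_bytes: bytes, uv_version: str) -> str:
    """Hash the two inputs that determine what the baked cache contains."""
    digest = hashlib.sha256()
    digest.update(uv_version.encode("utf-8"))
    digest.update(b"\0")  # separator so the two inputs cannot run together
    digest.update(lock_bytes)
    return digest.hexdigest()


def needs_rebuild(lock_path: Path, uv_version: str, stamp_path: Path) -> bool:
    """True when uv.lock or UV_VERSION changed since the recorded build."""
    current = cache_fingerprint(lock_path.read_bytes(), uv_version)
    # A missing stamp file (hypothetical location) means no build was recorded.
    previous = stamp_path.read_text().strip() if stamp_path.exists() else ""
    return current != previous
```

A scheduled job could call `needs_rebuild` and skip the image build when nothing changed, then write the new fingerprint to the stamp file after a successful build.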

Test Plan

# Build locally to verify the Dockerfile is correct
git checkout main && git pull
docker build \
  --build-arg UV_VERSION=0.11.8 \
  -t flowmesh-oss-ci-runner:test \
  -f .github/runners/Dockerfile \
  .

# Sanity-check the cache is non-empty and uv is the pinned version
docker run --rm flowmesh-oss-ci-runner:test sh -c 'uv --version && du -sh /root/.cache/uv'

Test Result

Build succeeds locally; the image's /root/.cache/uv is on the order of gigabytes (the actual size depends on what --all-extras resolves to at the time of build).


Pre-submission Checklist
  • I have read CONTRIBUTING.md (or AGENTS.md if no CONTRIBUTING.md).
  • I have run uv run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that the test suite passes locally.
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.

Adds ``.github/runners/Dockerfile`` that builds a custom GitHub Actions
runner image based on ``myoung34/github-runner`` with the uv wheel
cache pre-populated for the project's ``--all-extras`` install. CI
workflows that run ``uv sync --all-extras --frozen`` warm-hit this
cache instead of redownloading every wheel on each run.

Build context is the repo root; the image COPYs the FlowMesh source in
(no git-clone-with-auth needed at build time, FlowMesh stays a private
repo). Pinned uv version via ``--build-arg UV_VERSION``; the workflows'
``setup-uv@v7`` calls must pin ``version:`` to the same value or the
cache layout written here may not match what CI reads back.

``.github/runners/README.md`` documents the build command, deployment
to runner hosts (image tag swap in the systemd units), and the refresh
cadence (rebuild on ``uv.lock`` changes or ``UV_VERSION`` bumps).

``.dockerignore`` gains ``.claude/`` and ``.pr-body.md`` so local agent
state and PR-body drafts never bake into any image build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu closed this Apr 30, 2026
@timzsu timzsu deleted the zsu/runner-image branch April 30, 2026 09:54
timzsu added a commit that referenced this pull request May 6, 2026
Track ``a800051`` so the SDK pin reflects PR #8 (optional
``schema_scope``) on lumid.data main. Wire contract is unchanged
from the FlowMesh side — the connector already passes ``None``
through ``model_dump(exclude_none=True)``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
timzsu added a commit that referenced this pull request May 7, 2026
* feat(worker): agent connector for lumid.data NL-driven retrieval

Wire ``data.type == "agent"`` into ``DataRetrievalExecutor`` so a
worker task can describe what it wants in natural language plus a
schema scope and let lumid.data's ``/retrieve/v1`` plan + replay
the chain server-side. Each item carries the materialized DataFrame
and the typed access chain so a downstream consumer binds to either
via the existing ``path: items.X`` resolver.

Worker delivery wiring: ``analytics`` extra picks up the
``lumid-data-sdk`` git+ pin; ``sync_requirements.py`` regex extended
for the PEP 508 ``name @ git+url`` form; the security workflow
filters ``@ git+`` deps before ``pip-audit --strict`` since PyPI
doesn't carry git-source deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
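
The git-source filtering described in the commit above can be sketched as follows. This is an illustration only, not the regex used in ``sync_requirements.py`` or the security workflow; the pattern, the function name, and the example URL are assumptions.

```python
import re

# Matches the PEP 508 direct-reference form `name @ git+<url>`, with optional
# extras. Illustrative pattern, not the one used in the repository.
GIT_DEP_RE = re.compile(
    r"^\s*[A-Za-z0-9][A-Za-z0-9._-]*\s*(\[[^\]]*\])?\s*@\s*git\+"
)


def split_git_deps(lines):
    """Separate git-source requirements from index-hosted ones."""
    auditable, git_sourced = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if GIT_DEP_RE.match(stripped):
            git_sourced.append(stripped)
        else:
            auditable.append(stripped)
    return auditable, git_sourced


# Example input; the git URL is a placeholder, not the real SDK location.
requirements = [
    "pandas==2.2.0",
    "lumid-data-sdk @ git+https://example.invalid/sdk.git@a800051",
    "# a comment",
]
auditable, git_sourced = split_git_deps(requirements)
```

Only the `auditable` list would then be written to the file handed to ``pip-audit``.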

* docs(templates): minimal e2e template for the agent connector

Two-stage workflow: agent retrieval against lumid.data, then a
Qwen 1.5B summary node consuming the materialized table and
access chain. One ``flowmesh workflow submit`` exercises the
connector and downstream consumption end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* Trim comments

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(worker): make schema_scope optional in the agent retrieval branch

Drop ``schema_scope`` from the executor's required-keys validation
and let the connector forward ``None`` to the SDK so a workflow can
omit the field when it wants lumid.data to default to all visible
schemas. The e2e template drops its explicit scope to exercise the
new path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* chore(worker): bump lumid-data-sdk to merged main

Track ``a800051`` so the SDK pin reflects PR #8 (optional
``schema_scope``) on lumid.data main. Wire contract is unchanged
from the FlowMesh side — the connector already passes ``None``
through ``model_dump(exclude_none=True)``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(server,worker): make manifest writes safe under shared-volume UID mismatch

Single-node deployments can share the results volume between the server
(root) and supervisor-spawned workers (appuser). Both call sync_manifest,
so the prior direct write_text failed with EACCES for the second writer
when the manifest was already owned by the first writer's UID.

prepare_output_dir now chmods each managed directory to 0o0777
(best-effort, tolerant of cross-UID ownership). sync_manifest writes the
manifest with write_text and then chmods it to 0o0666 so the next
sync_manifest call from a peer UID can overwrite the file directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(worker): keep nested DataFrame cells un-truncated when rendering prompts

Aggregate templates that bind a per-row pd.DataFrame value into the prompt
rendered the cell via tabulate's to_markdown, which falls back to pandas'
default __str__ on each DataFrame entry. The default 80-col display width
clipped middle columns to '...', so the consumer LLM only saw the first
and last few columns of any wide retrieval result.

Wrap the to_markdown sites in pd.option_context with max_columns/width/
max_colwidth set to None so DataFrames render in full regardless of the
calling environment's pandas display defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* refactor(server,worker): address review feedback

Four cleanups raised in code review:

- Move the duplicated ``prepare_output_dir`` / ``sync_manifest`` (and
  helpers) from ``src/server/utils/manifest.py`` and
  ``src/worker/utils/manifest.py`` into a single
  ``src/shared/utils/manifest.py`` exposing everything either side uses
  (including the worker-only ``scratch_dir`` / ``SCRATCH_DIR``); rewrite
  every call site to import from ``shared.utils.manifest`` and move the
  helper tests under ``tests/shared/utils/``.
- Drop two unnecessary ``# type: ignore`` comments on
  ``self._normalize_params`` calls in ``data_retrieval_executor`` — mypy
  resolves them cleanly without an escape.
- Replace the ``# type: ignore[import-untyped]`` on
  ``lumid_data.sdk.Client`` with a ``follow_untyped_imports`` override
  in ``pyproject.toml`` so the override applies to the whole SDK and
  goes away as soon as upstream ships type stubs.
- Name the worker-CPU pip-audit input file
  ``/tmp/requirements-worker-cpu-audit.txt`` so its purpose reads at a
  glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* chore(deps): pip-audit ignore three new advisories

GHSA-x368-4g9h-fvv4 (vllm 0.18.0, fix 0.19.1) and GHSA-83vm-p52w-f9pw
(vllm 0.18.0, fix 0.20.0) join the existing list — both are blocked by
the same transformers 4.57 / inference-deps pin that already keeps the
other vllm advisories on the ignore list.

GHSA-j7w6-vpvq-j3gm (diffusers 0.36.0, fix 0.38.0) is added separately:
diffusers 0.38 requires safetensors>=0.8.0rc0, which uv lock refuses to
resolve without an explicit pre-release opt-in. Holding the floor at
0.36 and ignoring until safetensors ships a non-rc 0.8.

Update the upgrade-blocker table in CODE_STYLE.md alongside the
workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(server,worker): switch manifest writes to chmod-after-write

Replaces the tempfile + os.replace approach with a direct write_text +
chmod 0o0666 on the manifest, and a guarded mkdir + chmod 0o0777 on the
output directories. Both rely on the file/dir's owner being the only
caller that needs to run chmod, which holds because sync_manifest is
the sole writer of the manifest and prepare_output_dir's chmod runs
only on creation. The "best-effort" PermissionError swallow on chmod
is gone — a chmod failure now propagates loudly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* refactor(server,worker): atomic write for files written by multiple parties

Promote ``shared.utils.manifest``'s tempfile + os.replace path to a
standalone ``shared.utils.atomic.atomic_write_text`` helper and apply it
to every file the server and the worker can both write: the per-task
``manifest.json`` and ``results.json``. Each write goes through a
tempfile in the same directory, gets chmodded to ``0o0666``, and is
swapped in via ``os.replace`` so a peer-UID writer can replace it
without permission issues and a crash mid-write leaves either the old
file or the new one — never a half-written one.

Drop the manifest-permission/overwrite tests in
``tests/worker/test_task_output.py`` that the shared-utils suite
already covers, and add a one-line comment over the two ``pd.option_context``
sites in graph_templates so the intent of the width-cap toggles is
obvious.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
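
The tempfile + ``os.replace`` pattern described in the commit above might look roughly like this. It is a minimal sketch written from the commit message, not the actual ``shared.utils.atomic`` source; the signature, the ``mode`` default, and the error handling are assumptions.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str, mode: int = 0o666) -> None:
    """Write text so readers see either the old file or the new one."""
    path = Path(path)
    # The temp file is created in the target directory so os.replace stays
    # on one filesystem, which is what makes the swap atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.chmod(tmp_name, mode)  # an explicit chmod is not masked by umask
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file; the original file is left untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

On POSIX, ``os.replace`` also needs write permission on the containing directory, so the ``0o666`` file mode is not enough on its own for a peer UID; the directory permissions set up in the earlier commits still matter.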

---------

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>