chore: add Dockerfile for self-hosted runner with pre-built uv cache #8
Closed
timzsu wants to merge 1 commit into
Adds ``.github/runners/Dockerfile`` that builds a custom GitHub Actions runner image based on ``myoung34/github-runner`` with the uv wheel cache pre-populated for the project's ``--all-extras`` install. CI workflows that run ``uv sync --all-extras --frozen`` warm-hit this cache instead of redownloading every wheel on each run.

Build context is the repo root; the image COPYs the FlowMesh source in (no git-clone-with-auth needed at build time, FlowMesh stays a private repo). Pinned uv version via ``--build-arg UV_VERSION``; the workflows' ``setup-uv@v7`` calls must pin ``version:`` to the same value or the cache layout written here may not match what CI reads back.

``.github/runners/README.md`` documents the build command, deployment to runner hosts (image tag swap in the systemd units), and the refresh cadence (rebuild on ``uv.lock`` changes or ``UV_VERSION`` bumps).

``.dockerignore`` gains ``.claude/`` and ``.pr-body.md`` so local agent state and PR-body drafts never bake into any image build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
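The requirement that the image's ``UV_VERSION`` and the workflows' ``setup-uv`` ``version:`` stay aligned could be checked mechanically. The following is a minimal sketch, not part of this PR: the function names are hypothetical, and the regular expressions assume the Dockerfile declares ``ARG UV_VERSION=<value>`` and that every ``version:`` key in a workflow belongs to a ``setup-uv`` step. Neither assumption is confirmed by the text above.

```python
from __future__ import annotations

import re

# Assumed spellings: the Dockerfile and workflow contents are not shown in
# this PR text, so both patterns are guesses at a plausible layout.
ARG_RE = re.compile(r"^\s*ARG\s+UV_VERSION=(\S+)", re.MULTILINE)
VERSION_RE = re.compile(r"^\s*version:\s*[\"']?([0-9][^\"'\s]*)", re.MULTILINE)


def dockerfile_uv_version(dockerfile_text: str) -> str | None:
    """Return the default UV_VERSION declared in the Dockerfile, if any."""
    match = ARG_RE.search(dockerfile_text)
    return match.group(1) if match else None


def workflow_uv_versions(workflow_text: str) -> list[str]:
    """Return every `version:` value found in one workflow file.

    Simplification: any `version:` key is treated as a setup-uv pin.
    """
    return VERSION_RE.findall(workflow_text)


def pins_match(dockerfile_text: str, workflow_text: str) -> bool:
    """True only if the workflow pins at least one version and all equal the image's."""
    image_version = dockerfile_uv_version(dockerfile_text)
    pinned = workflow_uv_versions(workflow_text)
    if image_version is None or not pinned:
        return False
    return all(version == image_version for version in pinned)
```

A check like this could run in pre-commit or CI against the real files; a proper YAML parser would be more robust than the line-based pattern used here.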
timzsu added a commit that referenced this pull request on May 6, 2026
Track ``a800051`` so the SDK pin reflects PR #8 (optional ``schema_scope``) on lumid.data main. Wire contract is unchanged from the FlowMesh side: the connector already passes ``None`` through ``model_dump(exclude_none=True)``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
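To see why an optional ``schema_scope`` leaves the wire contract unchanged, recall that pydantic's ``model_dump(exclude_none=True)`` omits fields whose value is ``None``. The stdlib-only sketch below imitates that behavior with a dataclass. ``RetrieveRequest``, its ``query`` field, and ``dump_exclude_none`` are hypothetical stand-ins, not the SDK's actual model.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class RetrieveRequest:
    """Hypothetical request shape; only `schema_scope` is named in the source."""

    query: str
    schema_scope: list[str] | None = None


def dump_exclude_none(request: RetrieveRequest) -> dict:
    """Imitate `model_dump(exclude_none=True)`: drop keys whose value is None."""
    return {key: value for key, value in asdict(request).items() if value is not None}


# Omitting the scope produces a payload with no `schema_scope` key at all,
# so the server sees the same shape as a client that never knew the field.
without_scope = dump_exclude_none(RetrieveRequest(query="daily active users"))
with_scope = dump_exclude_none(
    RetrieveRequest(query="daily active users", schema_scope=["analytics"])
)
```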
timzsu added a commit that referenced this pull request on May 7, 2026
* feat(worker): agent connector for lumid.data NL-driven retrieval

  Wire ``data.type == "agent"`` into ``DataRetrievalExecutor`` so a worker task can describe what it wants in natural language plus a schema scope and let lumid.data's ``/retrieve/v1`` plan + replay the chain server-side. Each item carries the materialized DataFrame and the typed access chain so a downstream consumer binds to either via the existing ``path: items.X`` resolver.

  Worker delivery wiring: ``analytics`` extra picks up the ``lumid-data-sdk`` git+ pin; ``sync_requirements.py`` regex extended for the PEP 508 ``name @ git+url`` form; the security workflow filters ``@ git+`` deps before ``pip-audit --strict`` since PyPI doesn't carry git-source deps.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* docs(templates): minimal e2e template for the agent connector

  Two-stage workflow: agent retrieval against lumid.data, then a Qwen 1.5B summary node consuming the materialized table and access chain. One ``flowmesh workflow submit`` exercises the connector and downstream consumption end-to-end.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* Trim comments

  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(worker): make schema_scope optional in the agent retrieval branch

  Drop ``schema_scope`` from the executor's required-keys validation and let the connector forward ``None`` to the SDK so a workflow can omit the field when it wants lumid.data to default to all visible schemas. The e2e template drops its explicit scope to exercise the new path.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* chore(worker): bump lumid-data-sdk to merged main

  Track ``a800051`` so the SDK pin reflects PR #8 (optional ``schema_scope``) on lumid.data main. Wire contract is unchanged from the FlowMesh side: the connector already passes ``None`` through ``model_dump(exclude_none=True)``.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(server,worker): make manifest writes safe under shared-volume UID mismatch

  Single-node deployments can share the results volume between the server (root) and supervisor-spawned workers (appuser). Both call sync_manifest, so the prior direct write_text raced to EACCES on the second writer when the manifest was already owned by the first writer's UID.

  prepare_output_dir now chmods each managed directory to 0o0777 (best-effort, tolerant of cross-UID ownership). sync_manifest writes the manifest with write_text and then chmods it to 0o0666 so the next sync_manifest call from a peer UID can overwrite the file directly.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(worker): keep nested DataFrame cells un-truncated when rendering prompts

  Aggregate templates that bind a per-row pd.DataFrame value into the prompt rendered the cell via tabulate's to_markdown, which falls back to pandas' default __str__ on each DataFrame entry. The default 80-col display width clipped middle columns to '...', so the consumer LLM only saw the first and last few columns of any wide retrieval result.

  Wrap the to_markdown sites in pd.option_context with max_columns/width/max_colwidth set to None so DataFrames render in full regardless of the calling environment's pandas display defaults.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* refactor(server,worker): address review feedback

  Four cleanups raised in code review:

  - Move the duplicated ``prepare_output_dir`` / ``sync_manifest`` (and helpers) from ``src/server/utils/manifest.py`` and ``src/worker/utils/manifest.py`` into a single ``src/shared/utils/manifest.py`` exposing everything either side uses (including the worker-only ``scratch_dir`` / ``SCRATCH_DIR``); rewrite every call site to import from ``shared.utils.manifest`` and move the helper tests under ``tests/shared/utils/``.
  - Drop two unnecessary ``# type: ignore`` comments on ``self._normalize_params`` calls in ``data_retrieval_executor``; mypy resolves them cleanly without an escape.
  - Replace the ``# type: ignore[import-untyped]`` on ``lumid_data.sdk.Client`` with a ``follow_untyped_imports`` override in ``pyproject.toml`` so the override applies to the whole SDK and goes away as soon as upstream ships type stubs.
  - Name the worker-CPU pip-audit input file ``/tmp/requirements-worker-cpu-audit.txt`` so its purpose reads at a glance.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* chore(deps): pip-audit ignore three new advisories

  GHSA-x368-4g9h-fvv4 (vllm 0.18.0, fix 0.19.1) and GHSA-83vm-p52w-f9pw (vllm 0.18.0, fix 0.20.0) join the existing list; both are blocked by the same transformers 4.57 / inference-deps pin that already keeps the other vllm advisories on the ignore list.

  GHSA-j7w6-vpvq-j3gm (diffusers 0.36.0, fix 0.38.0) is added separately: diffusers 0.38 requires safetensors>=0.8.0rc0, which uv lock refuses to resolve without an explicit pre-release opt-in. Holding the floor at 0.36 and ignoring until safetensors ships a non-rc 0.8.

  Update the upgrade-blocker table in CODE_STYLE.md alongside the workflow.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* fix(server,worker): switch manifest writes to chmod-after-write

  Replaces the tempfile + os.replace approach with a direct write_text + chmod 0o0666 on the manifest, and a guarded mkdir + chmod 0o0777 on the output directories. Both rely on the file/dir's owner being the only caller that needs to run chmod, which holds because sync_manifest is the sole writer of the manifest and prepare_output_dir's chmod runs only on creation. The "best-effort" PermissionError swallow on chmod is gone; a chmod failure now propagates loudly.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

* refactor(server,worker): atomic write for files written by multiple parties

  Promote ``shared.utils.manifest``'s tempfile + os.replace path to a standalone ``shared.utils.atomic.atomic_write_text`` helper and apply it to every file the server and the worker can both write: the per-task ``manifest.json`` and ``results.json``. Each write goes through a tempfile in the same directory, gets chmodded to ``0o0666``, and is swapped in via ``os.replace`` so a peer-UID writer can replace it without permission issues and a crash mid-write leaves either the old file or the new one, never a half-written one.

  Drop the manifest-permission/overwrite tests in ``tests/worker/test_task_output.py`` that the shared-utils suite already covers, and add a one-line comment over the two ``pd.option_context`` sites in graph_templates so the intent of the width-cap toggles is obvious.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

---------

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
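The tempfile + chmod + ``os.replace`` sequence described in the last commit can be sketched as follows. This is an illustration of the technique under stated assumptions, not the repository's code: the commit names ``atomic_write_text`` but does not show its signature, so the parameters, the temp-file naming, and the ``fsync`` call here are guesses.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text, mode=0o666):
    """Write `text` to `path` so readers see the old or the new file, never a partial one.

    Assumed signature; the real helper may differ.
    """
    path = Path(path)
    # The temp file must live in the same directory as the target so that
    # os.replace stays on one filesystem and remains an atomic rename.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        # mkstemp creates the file as 0o600; widen it so a peer UID can read
        # and write the final file. os.chmod is not affected by the umask.
        os.chmod(tmp_name, mode)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

Calling ``atomic_write_text(output_dir / "manifest.json", payload)`` from either process would then replace the file in one step. Note that on POSIX the rename itself is authorized by write permission on the containing directory, which is where the ``0o0777`` directory mode from the earlier commits comes in.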
Purpose
Land a Dockerfile that builds a custom self-hosted runner image with the uv wheel cache pre-populated for the project's `--all-extras` install. Once the runner hosts use this image, CI workflows that run `uv sync --all-extras --frozen` warm-hit the cache instead of redownloading every wheel.

The previous attempt at speeding up CI used `astral-sh/setup-uv@v7`'s GHA-cache integration. That doesn't work for an `--all-extras` install: `uv cache prune --ci` strips the cache to ~2.7 MB, the GitHub Actions cache caps at 10 GB per repo (and the unpruned cache is ~10.4 GB), and a shared host-mounted writable cache would be a poisoning vector. A pre-baked image side-steps all three: the cache lives in the image's read-only layers, is externally managed (rebuild on schedule), and needs no per-run management.

Changes
- `.github/runners/Dockerfile`: bases on `myoung34/github-runner:latest`, installs a pinned `UV_VERSION` (default `0.11.8`), `COPY`s the build context into `/tmp/flowmesh`, runs `uv sync --all-extras --frozen` to populate `/root/.cache/uv`, then deletes the source tree. Only the populated cache survives in the image.
- `.github/runners/README.md`: build command, deployment to runner hosts (image tag swap in the systemd units + `systemctl restart`), refresh cadence (rebuild on `uv.lock` changes or `UV_VERSION` bumps), and the workflow-side change required (pin `setup-uv`'s `version:` to match `UV_VERSION`, set `enable-cache: false`).
- `.dockerignore`: adds `.claude/` and `.pr-body.md` so local agent state and PR-body drafts never bake into any image build.

Design
Why COPY rather than `git clone`: FlowMesh is private. Cloning at build time would require either an SSH key forwarded via BuildKit or a token in a build secret. Using the build context (the maintainer's local checkout) avoids both, and the pattern is "build it from the same place you'd run `uv sync` manually."

Why pin `UV_VERSION`: uv changes its cache layout across minor versions. If the image bakes uv 0.11.8 but CI's `setup-uv` resolves uv 0.12.x, the cache may not be readable and CI falls through to a cold install. Pinning both sides (image build arg + workflow `version:`) keeps them aligned.

Cache freshness: rebuild when `uv.lock` changes on `main`. Stale cache is not unsafe (uv falls through to network for any wheel it doesn't find), just slower until the rebuild.

Test Plan
Test Result
Build succeeds locally; the image's `/root/.cache/uv` is on the order of GB (the actual size depends on what `--all-extras` resolves to at the time of build).

Pre-submission Checklist
- `CONTRIBUTING.md` (or `AGENTS.md` if no `CONTRIBUTING.md`).
- `uv run pre-commit run --all-files` and fixed any issues.
- `[BREAKING]` and described migration steps above.
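The refresh rule from the Design section (rebuild when `uv.lock` changes or `UV_VERSION` is bumped) amounts to comparing a fingerprint of those two inputs against the one the current image was built from. The sketch below shows one way to compute such a fingerprint. It is not part of this PR: `cache_key` and `needs_rebuild` are hypothetical names, and where the baked key would be stored (for example as an image label) is left open.

```python
import hashlib


def cache_key(lock_bytes: bytes, uv_version: str) -> str:
    """Fingerprint the two inputs that determine the baked cache contents."""
    digest = hashlib.sha256()
    digest.update(uv_version.encode("utf-8"))
    digest.update(b"\0")  # separator so the version and lock content cannot blur together
    digest.update(lock_bytes)
    return digest.hexdigest()


def needs_rebuild(lock_bytes: bytes, uv_version: str, baked_key: str) -> bool:
    """True when the current inputs no longer match the key recorded at build time."""
    return cache_key(lock_bytes, uv_version) != baked_key
```

A scheduled job could read `uv.lock` from `main`, call `needs_rebuild` with the key recorded for the deployed image, and trigger the image build only when it returns `True`. Because a stale cache only costs speed, not correctness, running this check infrequently is acceptable.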