
feat: add CI/GPU runner infrastructure migrated from FlowMesh_dev #15

Open

Qruixuan wants to merge 17 commits into main from claude/migrate-flowmesh-work-E95UG

Conversation

Qruixuan (Collaborator) commented May 4, 2026

Purpose

Migrate the CI/GPU runner infrastructure from mlsys-io/FlowMesh_dev (ci/gpu-runner-setup-v2) and adapt it to FlowMesh's single-server architecture (no separate host/guardian split).

Changes

  • docker/ci.compose.yml — isolated per-run Compose stack; server mounts the Docker socket to spawn workers via DooD; workers use network_mode: host to reach localhost:8000/50051 (overlay stacking is sketched after this list)
  • docker/ci.ports.fixed.yml — fixed port overlay (8000, 50051) for CI and local runs
  • docker/ci.worker.gpu.yml — GPU worker overlay wiring (CUDA devices, HF_TOKEN pass-through)
  • docker/ci.gpu_worker_config.yaml — supervisor worker config for GPU CI worker
  • scripts/ci/run_local.sh — mirrors the full GitHub Actions pipeline locally for pre-push validation; supports --gpu, --task-yaml, --timeout, --no-build, --keep
  • .github/workflows/ci.yml — integration (CPU echo) + GPU smoke tests (vLLM, HF Transformers, LoRA SFT, DAG inference, SSH, conditional echo); actions pinned to commit SHAs; permissions: contents: read; template expansions (github.workspace, github.run_id) moved out of run: blocks into step-level env: to avoid injection risks
  • tests/integration/test_e2e.py — pytest E2E test that submits a workflow and polls until DONE; gracefully SKIPs on unavailable executors (max_attempts_exceeded + log pattern match)
  • templates/n8n/dag_inference.json — replace gated meta-llama/Llama-3.2-1B-Instruct with open Qwen/Qwen2.5-0.5B-Instruct to avoid GatedRepoError when HF_TOKEN is absent
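
For reference, the three compose files above are stacked into a single isolated project per run; a minimal sketch of such an invocation (the exact flags, overlay order, and RUN_ID handling are assumptions, not the verbatim run_local.sh logic):

# assumed manual equivalent of what run_local.sh / ci.yml set up
export COMPOSE_PROJECT_NAME="ci-${RUN_ID:-local-$$}"   # unique project name isolates each run
docker compose -p "$COMPOSE_PROJECT_NAME" \
  -f docker/ci.compose.yml \
  -f docker/ci.ports.fixed.yml \
  -f docker/ci.worker.gpu.yml \
  up -d --build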

Design

Workers are spawned by the server's Docker adapter (DooD via /var/run/docker.sock) with network_mode: host, so they reach the server at localhost:8000 and localhost:50051 directly. The pytest runner executes in a temporary python:3.11-slim container joined to the compose network, accessing the server via service name http://server:8000. Each CI run is isolated by a unique compose project name (ci-$RUN_ID).
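
As an illustration, the E2E step might be launched roughly like this (the dependency install line and exact docker run flags are assumptions; the image, service URL, and env var name come from the description and test above):

# assumed shape of the throwaway pytest runner; test_e2e.py skips when FLOWMESH_HOST_URL is unset
docker run --rm \
  --network "${COMPOSE_PROJECT_NAME}_ci-net" \
  -e FLOWMESH_HOST_URL="http://server:8000" \
  -v "$PWD:/work" -w /work \
  python:3.11-slim \
  bash -c "pip install -q pytest requests pyyaml && pytest -q tests/integration/test_e2e.py"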

The GPU workflow uses a content-hash-tagged builder image (flowmesh-builder:<hash>) to cache the heavy CUDA dependency layer across runs, rebuilding only when Dockerfile.cuda.builder or requirements.gpu.txt change.
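
A minimal sketch of that content-hash scheme, assuming the tag is derived directly from the two input files (file paths and tag length are assumptions):

# rebuild the CUDA builder image only when its inputs change
BUILDER_HASH=$(cat docker/Dockerfile.cuda.builder requirements.gpu.txt | sha256sum | cut -c1-12)
BUILDER_IMAGE="flowmesh-builder:${BUILDER_HASH}"
if ! docker image inspect "$BUILDER_IMAGE" >/dev/null 2>&1; then
  docker build -f docker/Dockerfile.cuda.builder -t "$BUILDER_IMAGE" .
fi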

Test Plan

# CPU integration test
./scripts/ci/run_local.sh

# GPU smoke tests (all templates)
./scripts/ci/run_local.sh --gpu

# Single template override
./scripts/ci/run_local.sh --gpu --task-yaml templates/inference_vllm_tiny.yaml

Test Result

  • CPU integration (echo task): PASSED
  • GPU vLLM inference (TinyLlama-1.1B): PASSED after bumping timeout 300 → 600s to accommodate cold-start download + torch.compile (~247s total)
  • GPU n8n DAG inference: PASSED after replacing gated Llama model with open Qwen model

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-extras --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

Qruixuan and others added 17 commits May 4, 2026 12:03
Migrate the following changes from mlsys-io/FlowMesh_dev (ci/gpu-runner-setup-v2):

- .github/workflows/unit-tests.yml: switch install to --all-extras, add cuda
  runner label [self-hosted, cuda], pin action SHAs with uv version 0.11.8,
  add permissions/concurrency blocks
- src/worker/docker/Dockerfile.cpu: rename SUPERVISOR_GRPC_TARGET ->
  GUARDIAN_GRPC_TARGET, switch shared copy to granular
  (shared/__init__.py + shared/all + shared/host_worker + shared/guardian_worker),
  drop source/url OCI labels
- src/worker/docker/Dockerfile.cuda: same GUARDIAN rename + granular shared copy,
  drop source/url OCI labels
- src/worker/docker/Dockerfile.ssh.cpu: drop source/url OCI labels
- src/worker/docker/Dockerfile.ssh.gpu: drop source/url OCI labels
- src/worker/docker/README.md: rename SUPERVISOR_GRPC_TARGET ->
  GUARDIAN_GRPC_TARGET, update TLS section (guardian naming)

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
…om FlowMesh_dev

Migrates changes from mlsys-io/FlowMesh_dev (main) that were not yet
present in FlowMesh:

- Dockerfile.cuda: rename SUPERVISOR_GRPC_TARGET → GUARDIAN_GRPC_TARGET;
  replace broad `COPY src/shared` with granular copies of shared/__init__.py,
  shared/all, shared/host_worker, shared/guardian_worker; drop extra
  org.opencontainers.image.source/url LABEL lines
- Dockerfile.ssh.cpu: drop org.opencontainers.image.source/url LABELs
- Dockerfile.ssh.gpu: drop org.opencontainers.image.source/url LABELs
- src/worker/docker/README.md: rename SUPERVISOR_GRPC_TARGET →
  GUARDIAN_GRPC_TARGET, generate_server_tls_certs.sh →
  generate_guardian_tls_certs.sh, SERVER_GRPC_TLS_CA_B64 →
  GUARDIAN_GRPC_TLS_CA_B64

templates/n8n/dag_inference.json and CI workflows are already in sync
(identical SHAs / FlowMesh has newer hardened versions).

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Previous agent incorrectly changed SUPERVISOR_GRPC_TARGET to GUARDIAN_GRPC_TARGET
and altered COPY paths/labels. This reverts those files to their correct state.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Dockerfile.cuda: install requirements.gpu.txt in addition to requirements.txt,
and add build-time verification that torch/transformers are importable.

transformers_executor.py: capture import error message in _HF_IMPORT_ERROR,
split PreTrainedModel into a separate fallback import block, add
_require_transformers() helper called from both prepare() and run().

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Migrate ci.compose.yml, ci.worker.gpu.yml, ci.ports.fixed.yml,
ci.worker_config.yaml, and ci.gpu_worker_config.yaml from FlowMesh_dev.
Adapted: guardian service → supervisor, src/guardian/ → src/server/,
/etc/guardian/ → /etc/supervisor/, env var names GUARDIAN_* → SUPERVISOR_*.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Migrate .github/workflows/ci.yml (integration + GPU smoke jobs) and
scripts/ci/setup-runner.md from FlowMesh_dev. Adapted: guardian→supervisor
service names throughout; repo URL updated to mlsys-io/FlowMesh.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
scripts/ci/run_local.sh: migrate from FlowMesh_dev, adapted guardian→supervisor
throughout (service exec, compose override, health checks, log references).

templates: fix output.destination from http to local in conditional_echo_test.yaml
and ssh_noninteractive.yaml; use dev version of echo_three_node_graph.yaml.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
FlowMesh has no separate host/guardian/postgres services. A single
src/server/Dockerfile exposes both HTTP API (8000) and gRPC supervisor
(50051). Updated ci.compose.yml, ci.worker.gpu.yml, ci.ports.fixed.yml:
- server service built from src/server/Dockerfile
- redis only (no postgres)
- WORKER_DOCKER_NETWORK uses ${COMPOSE_PROJECT_NAME}_ci-net interpolation
- SERVER_HOST=server so spawned workers get SUPERVISOR_GRPC_TARGET=server:50051

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Key changes:
- Single "Wait for server" health check (port 8000) instead of separate host + supervisor
- Worker registration check uses docker compose exec -T server (no exposed port needed)
- E2E tests use http://server:8000 (internal compose network name)
- Destroy workers via server API on port 8000
- COMPOSE_PROJECT_NAME exported so ${COMPOSE_PROJECT_NAME}_ci-net interpolation works
- run_local.sh: dc() wrapper exports COMPOSE_PROJECT_NAME; single server port block
  in compose override; step numbering adjusted (no separate supervisor confirm step)
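
A hedged sketch of what that single wait step might look like (endpoint path and retry budget are assumptions):

# wait for the server's HTTP port before checking worker registration
for _ in $(seq 1 60); do
  curl -fsS "http://127.0.0.1:8000/" >/dev/null 2>&1 && break
  sleep 2
done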

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
…s and exposed ports

Workers are spawned by the Docker adapter with network_mode: host (see
supervisor/adapters/docker.py _start()). They connect to the gRPC
supervisor at localhost:50051 and download results via FLOWMESH_BASE_URL.

Four bugs in the previous CI setup:
1. WORKER_DOCKER_NETWORK env var doesn't exist in FlowMesh — removed.
2. FLOWMESH_BASE_URL was "http://server:8000" but workers on host network
   can't resolve "server"; changed to "http://localhost:8000".
3. CI workflow never exposed ports 8000/50051 on the host, so workers
   (network_mode: host) couldn't reach the server container at all;
   added ci.ports.fixed.yml to both build steps.
4. run_local.sh used a dynamic HTTP port, but FLOWMESH_BASE_URL in the
   compose is a static value set before start; changed to fixed
   127.0.0.1:8000:8000 so workers can always reach http://localhost:8000.
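
A quick way to sanity-check that wiring from the host network, mirroring how spawned workers connect (image choice and endpoint path are assumptions):

# verify that a host-network container can reach the server the way spawned workers do
docker run --rm --network host curlimages/curl \
  -sS -o /dev/null -w 'server HTTP status: %{http_code}\n' http://localhost:8000/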

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Migrated from FlowMesh_dev ci/gpu-runner-setup-v2 branch unchanged:
- Submits a workflow YAML to a live server and polls until DONE/FAILED
- Skips automatically when FLOWMESH_HOST_URL is unset (safe for unit test runs)
- Handles n8n JSON and native YAML formats
- Skips (not fails) when executor package is unavailable on the worker

Used by run_local.sh (step 7) and .github/workflows/ci.yml E2E steps.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
The server's _VolumeInitializer runs busybox:1.36.1 to chown the named
Docker volume to UID 10001, but if busybox isn't cached it fails silently
and marks the volume as initialized anyway — so all subsequent workers
also get PermissionError writing to /var/lib/flowmesh-results.

Fix: set RESULTS_DIR to an absolute host path. The docker adapter skips
_VolumeInitializer for absolute paths (see _ensure_volume_access). Workers
receive a bind-mount of a pre-created host dir with chmod 777, which UID
10001 (appuser) can write to without any chown step.

- ci.compose.yml: RESULTS_DIR=/tmp/flowmesh-ci-results
- ci.yml: mkdir + chmod 777 before 'docker compose up' in both jobs,
  rm -rf in teardown
- run_local.sh: per-PID dir /tmp/flowmesh-ci-results-$PROJECT, overridden
  in compose overlay; cleaned up in teardown
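
Sketch of the host-side preparation this implies (variable names and step placement are assumptions; the path and permissions come from the description above):

# pre-create a world-writable results dir so UID 10001 (appuser) can write without any chown step
RESULTS_DIR="/tmp/flowmesh-ci-results-${PROJECT}"   # run_local.sh uses a per-project suffix; CI uses the fixed path
mkdir -p "$RESULTS_DIR"
chmod 777 "$RESULTS_DIR"
export RESULTS_DIR   # consumed by the compose overlay as an absolute bind-mount path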

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
docker volume prune -f (in pre-clean) deleted the named volume
flowmesh_server_hf_cache between runs, forcing TinyLlama to be
re-downloaded every time (~50s) and causing the 300s vLLM test to
time out by a few seconds.

Fix: set HF_CACHE_DIR to the host's ~/.cache/huggingface so workers
receive a bind mount of an absolute path.  _ensure_volume_access skips
_VolumeInitializer for absolute paths; models downloaded on the first
run persist for every subsequent run on the same machine.

- ci.compose.yml: pass HF_CACHE_DIR through from compose env
- run_local.sh: resolve _HF_CACHE_DIR (host ~/.cache/huggingface),
  mkdir+chmod 777, inject into compose override
- ci.yml: set HF_CACHE_DIR=$HOME/.cache/huggingface in project-name
  step; mkdir+chmod 777 in setup step; pass to docker compose env

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
…M timeout

HF_CACHE_DIR bind-mount was reverted — using the named Docker volume
flowmesh_server_hf_cache (identical to FlowMesh_dev) avoids accumulating
model weights on the host between CI runs; docker volume prune cleans it up.

The timeout issue is fixed by bumping the GPU E2E timeout: cold-start
(model download ~50s + load ~53s + compile ~17s + CUDA graphs) takes
~250s, leaving only ~50s for inference at the old 300s limit.
- run_local.sh: GPU default timeout 300 → 600s
- ci.yml: inference_vllm_tiny E2E_TIMEOUT_SEC 300 → 600s

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
…json

meta-llama/Llama-3.2-1B-Instruct requires HF_TOKEN; use the non-gated
Qwen/Qwen2.5-0.5B-Instruct instead, matching FlowMesh_dev's fix.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
- Pin actions/checkout and actions/upload-artifact to commit SHAs
- Add persist-credentials: false to all checkout steps
- Add top-level permissions: contents: read
- Move github.workspace and github.run_id out of run: blocks into
  step-level env: to eliminate template-expansion injection warnings

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
timzsu (Collaborator) commented May 4, 2026

@kaiitunnz This PR runs on self-hosted runners on bare metal, which is not secure. We are thinking about wrapping the runner inside a container, which might require DinD. Do you have any ideas on how to ensure security while keeping the CI simple?

kaiitunnz (Collaborator) commented

> @kaiitunnz This PR runs on self-hosted runners on bare metal, which is not secure. We are thinking about wrapping the runner inside a container, which might require DinD. Do you have any ideas on how to ensure security while keeping the CI simple?

I think with DinD, malicious workflows can still escape into the host. We need a better solution for this before the release.

timzsu (Collaborator) commented May 4, 2026

> @kaiitunnz This PR runs on self-hosted runners on bare metal, which is not secure. We are thinking about wrapping the runner inside a container, which might require DinD. Do you have any ideas on how to ensure security while keeping the CI simple?

> I think with DinD, malicious workflows can still escape into the host. We need a better solution for this before the release.

One alternative that allows Docker inside a container without privileged mode is sysbox, but sysbox does not have stable GPU support.

Regarding malicious workflows, we can make this E2E CI a daily workflow on main. Then we only need to ensure that the workflows in main are secure. Does that help?
