
feat: add CI/GPU runner infrastructure migrated from FlowMesh_dev #15

Open

Qruixuan wants to merge 17 commits into main from claude/migrate-flowmesh-work-E95UG

Conversation

Qruixuan (Collaborator) commented May 4, 2026

Purpose

Migrate the CI/GPU runner infrastructure from mlsys-io/FlowMesh_dev (ci/gpu-runner-setup-v2) and adapt it to FlowMesh's single-server architecture (no separate host/guardian split).

Changes

  • docker/ci.compose.yml — isolated per-run Compose stack; server mounts the Docker socket to spawn workers via DooD; workers use network_mode: host to reach localhost:8000/50051 (overlay stacking is sketched after this list)
  • docker/ci.ports.fixed.yml — fixed port overlay (8000, 50051) for CI and local runs
  • docker/ci.worker.gpu.yml — GPU worker overlay wiring (CUDA devices, HF_TOKEN pass-through)
  • docker/ci.gpu_worker_config.yaml — supervisor worker config for GPU CI worker
  • scripts/ci/run_local.sh — mirrors the full GitHub Actions pipeline locally for pre-push validation; supports --gpu, --task-yaml, --timeout, --no-build, --keep
  • .github/workflows/ci.yml — integration (CPU echo) + GPU smoke tests (vLLM, HF Transformers, LoRA SFT, DAG inference, SSH, conditional echo); actions pinned to commit SHAs; permissions: contents: read; template expansions (github.workspace, github.run_id) moved out of run: blocks into step-level env: to avoid injection risks
  • tests/integration/test_e2e.py — pytest E2E test that submits a workflow and polls until DONE; gracefully SKIPs on unavailable executors (max_attempts_exceeded + log pattern match)
  • templates/n8n/dag_inference.json — replace gated meta-llama/Llama-3.2-1B-Instruct with open Qwen/Qwen2.5-0.5B-Instruct to avoid GatedRepoError when HF_TOKEN is absent
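
For reference, the three compose files above are stacked into a single isolated project per run; a minimal sketch of such an invocation (the exact flags, overlay order, and RUN_ID handling are assumptions, not the verbatim run_local.sh logic):

# assumed manual equivalent of what run_local.sh / ci.yml set up
export COMPOSE_PROJECT_NAME="ci-${RUN_ID:-local-$$}"   # unique project name isolates each run
docker compose -p "$COMPOSE_PROJECT_NAME" \
  -f docker/ci.compose.yml \
  -f docker/ci.ports.fixed.yml \
  -f docker/ci.worker.gpu.yml \
  up -d --build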

Design

Workers are spawned by the server's Docker adapter (DooD via /var/run/docker.sock) with network_mode: host, so they reach the server at localhost:8000 and localhost:50051 directly. The pytest runner executes in a temporary python:3.11-slim container joined to the compose network, accessing the server via service name http://server:8000. Each CI run is isolated by a unique compose project name (ci-$RUN_ID).
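
As an illustration, the E2E step might be launched roughly like this (the dependency install line and exact docker run flags are assumptions; the image, service URL, and env var name come from the description and test above):

# assumed shape of the throwaway pytest runner; test_e2e.py skips when FLOWMESH_HOST_URL is unset
docker run --rm \
  --network "${COMPOSE_PROJECT_NAME}_ci-net" \
  -e FLOWMESH_HOST_URL="http://server:8000" \
  -v "$PWD:/work" -w /work \
  python:3.11-slim \
  bash -c "pip install -q pytest requests pyyaml && pytest -q tests/integration/test_e2e.py"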

The GPU workflow uses a content-hash-tagged builder image (flowmesh-builder:<hash>) to cache the heavy CUDA dependency layer across runs, rebuilding only when Dockerfile.cuda.builder or requirements.gpu.txt change.
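
A minimal sketch of that content-hash scheme, assuming the tag is derived directly from the two input files (file paths and tag length are assumptions):

# rebuild the CUDA builder image only when its inputs change
BUILDER_HASH=$(cat docker/Dockerfile.cuda.builder requirements.gpu.txt | sha256sum | cut -c1-12)
BUILDER_IMAGE="flowmesh-builder:${BUILDER_HASH}"
if ! docker image inspect "$BUILDER_IMAGE" >/dev/null 2>&1; then
  docker build -f docker/Dockerfile.cuda.builder -t "$BUILDER_IMAGE" .
fi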

Test Plan

# CPU integration test
./scripts/ci/run_local.sh

# GPU smoke tests (all templates)
./scripts/ci/run_local.sh --gpu

# Single template override
./scripts/ci/run_local.sh --gpu --task-yaml templates/inference_vllm_tiny.yaml

Test Result

  • CPU integration (echo task): PASSED
  • GPU vLLM inference (TinyLlama-1.1B): PASSED after bumping timeout 300 → 600s to accommodate cold-start download + torch.compile (~247s total)
  • GPU n8n DAG inference: PASSED after replacing gated Llama model with open Qwen model

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-extras --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

Qruixuan and others added 17 commits May 4, 2026 12:03
Migrate the following changes from mlsys-io/FlowMesh_dev (ci/gpu-runner-setup-v2):

- .github/workflows/unit-tests.yml: switch install to --all-extras, add cuda
  runner label [self-hosted, cuda], pin action SHAs with uv version 0.11.8,
  add permissions/concurrency blocks
- src/worker/docker/Dockerfile.cpu: rename SUPERVISOR_GRPC_TARGET ->
  GUARDIAN_GRPC_TARGET, switch shared copy to granular
  (shared/__init__.py + shared/all + shared/host_worker + shared/guardian_worker),
  drop source/url OCI labels
- src/worker/docker/Dockerfile.cuda: same GUARDIAN rename + granular shared copy,
  drop source/url OCI labels
- src/worker/docker/Dockerfile.ssh.cpu: drop source/url OCI labels
- src/worker/docker/Dockerfile.ssh.gpu: drop source/url OCI labels
- src/worker/docker/README.md: rename SUPERVISOR_GRPC_TARGET ->
  GUARDIAN_GRPC_TARGET, update TLS section (guardian naming)

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
…om FlowMesh_dev

Migrates changes from mlsys-io/FlowMesh_dev (main) that were not yet
present in FlowMesh:

- Dockerfile.cuda: rename SUPERVISOR_GRPC_TARGET → GUARDIAN_GRPC_TARGET;
  replace broad `COPY src/shared` with granular copies of shared/__init__.py,
  shared/all, shared/host_worker, shared/guardian_worker; drop extra
  org.opencontainers.image.source/url LABEL lines
- Dockerfile.ssh.cpu: drop org.opencontainers.image.source/url LABELs
- Dockerfile.ssh.gpu: drop org.opencontainers.image.source/url LABELs
- src/worker/docker/README.md: rename SUPERVISOR_GRPC_TARGET →
  GUARDIAN_GRPC_TARGET, generate_server_tls_certs.sh →
  generate_guardian_tls_certs.sh, SERVER_GRPC_TLS_CA_B64 →
  GUARDIAN_GRPC_TLS_CA_B64

templates/n8n/dag_inference.json and CI workflows are already in sync
(identical SHAs / FlowMesh has newer hardened versions).

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Previous agent incorrectly changed SUPERVISOR_GRPC_TARGET to GUARDIAN_GRPC_TARGET
and altered COPY paths/labels. This reverts those files to their correct state.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Dockerfile.cuda: install requirements.gpu.txt in addition to requirements.txt,
and add build-time verification that torch/transformers are importable.

transformers_executor.py: capture import error message in _HF_IMPORT_ERROR,
split PreTrainedModel into a separate fallback import block, add
_require_transformers() helper called from both prepare() and run().

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Migrate ci.compose.yml, ci.worker.gpu.yml, ci.ports.fixed.yml,
ci.worker_config.yaml, and ci.gpu_worker_config.yaml from FlowMesh_dev.
Adapted: guardian service → supervisor, src/guardian/ → src/server/,
/etc/guardian/ → /etc/supervisor/, env var names GUARDIAN_* → SUPERVISOR_*.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Migrate .github/workflows/ci.yml (integration + GPU smoke jobs) and
scripts/ci/setup-runner.md from FlowMesh_dev. Adapted: guardian→supervisor
service names throughout; repo URL updated to mlsys-io/FlowMesh.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
scripts/ci/run_local.sh: migrate from FlowMesh_dev, adapted guardian→supervisor
throughout (service exec, compose override, health checks, log references).

templates: fix output.destination from http to local in conditional_echo_test.yaml
and ssh_noninteractive.yaml; use dev version of echo_three_node_graph.yaml.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
FlowMesh has no separate host/guardian/postgres services. A single
src/server/Dockerfile exposes both HTTP API (8000) and gRPC supervisor
(50051). Updated ci.compose.yml, ci.worker.gpu.yml, ci.ports.fixed.yml:
- server service built from src/server/Dockerfile
- redis only (no postgres)
- WORKER_DOCKER_NETWORK uses ${COMPOSE_PROJECT_NAME}_ci-net interpolation
- SERVER_HOST=server so spawned workers get SUPERVISOR_GRPC_TARGET=server:50051

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Key changes:
- Single "Wait for server" health check (port 8000) instead of separate host + supervisor
- Worker registration check uses docker compose exec -T server (no exposed port needed)
- E2E tests use http://server:8000 (internal compose network name)
- Destroy workers via server API on port 8000
- COMPOSE_PROJECT_NAME exported so ${COMPOSE_PROJECT_NAME}_ci-net interpolation works
- run_local.sh: dc() wrapper exports COMPOSE_PROJECT_NAME; single server port block
  in compose override; step numbering adjusted (no separate supervisor confirm step)
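
A hedged sketch of what that single wait step might look like (endpoint path and retry budget are assumptions):

# wait for the server's HTTP port before checking worker registration
for _ in $(seq 1 60); do
  curl -fsS "http://127.0.0.1:8000/" >/dev/null 2>&1 && break
  sleep 2
done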

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
…s and exposed ports

Workers are spawned by the Docker adapter with network_mode: host (see
supervisor/adapters/docker.py _start()). They connect to the gRPC
supervisor at localhost:50051 and download results via FLOWMESH_BASE_URL.

Four bugs in the previous CI setup:
1. WORKER_DOCKER_NETWORK env var doesn't exist in FlowMesh — removed.
2. FLOWMESH_BASE_URL was "http://server:8000" but workers on host network
   can't resolve "server"; changed to "http://localhost:8000".
3. CI workflow never exposed ports 8000/50051 on the host, so workers
   (network_mode: host) couldn't reach the server container at all;
   added ci.ports.fixed.yml to both build steps.
4. run_local.sh used a dynamic HTTP port, but FLOWMESH_BASE_URL in the
   compose is a static value set before start; changed to fixed
   127.0.0.1:8000:8000 so workers can always reach http://localhost:8000.
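
A quick way to sanity-check that wiring from the host network, mirroring how spawned workers connect (image choice and endpoint path are assumptions):

# verify that a host-network container can reach the server the way spawned workers do
docker run --rm --network host curlimages/curl \
  -sS -o /dev/null -w 'server HTTP status: %{http_code}\n' http://localhost:8000/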

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Migrated from FlowMesh_dev ci/gpu-runner-setup-v2 branch unchanged:
- Submits a workflow YAML to a live server and polls until DONE/FAILED
- Skips automatically when FLOWMESH_HOST_URL is unset (safe for unit test runs)
- Handles n8n JSON and native YAML formats
- Skips (not fails) when executor package is unavailable on the worker

Used by run_local.sh (step 7) and .github/workflows/ci.yml E2E steps.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
The server's _VolumeInitializer runs busybox:1.36.1 to chown the named
Docker volume to UID 10001, but if busybox isn't cached it fails silently
and marks the volume as initialized anyway — so all subsequent workers
also get PermissionError writing to /var/lib/flowmesh-results.

Fix: set RESULTS_DIR to an absolute host path. The docker adapter skips
_VolumeInitializer for absolute paths (see _ensure_volume_access). Workers
receive a bind-mount of a pre-created host dir with chmod 777, which UID
10001 (appuser) can write to without any chown step.

- ci.compose.yml: RESULTS_DIR=/tmp/flowmesh-ci-results
- ci.yml: mkdir + chmod 777 before 'docker compose up' in both jobs,
  rm -rf in teardown
- run_local.sh: per-PID dir /tmp/flowmesh-ci-results-$PROJECT, overridden
  in compose overlay; cleaned up in teardown
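
Sketch of the host-side preparation this implies (variable names and step placement are assumptions; the path and permissions come from the description above):

# pre-create a world-writable results dir so UID 10001 (appuser) can write without any chown step
RESULTS_DIR="/tmp/flowmesh-ci-results-${PROJECT}"   # run_local.sh uses a per-project suffix; CI uses the fixed path
mkdir -p "$RESULTS_DIR"
chmod 777 "$RESULTS_DIR"
export RESULTS_DIR   # consumed by the compose overlay as an absolute bind-mount path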

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
docker volume prune -f (in pre-clean) deleted the named volume
flowmesh_server_hf_cache between runs, forcing TinyLlama to be
re-downloaded every time (~50s) and causing the 300s vLLM test to
time out by a few seconds.

Fix: set HF_CACHE_DIR to the host's ~/.cache/huggingface so workers
receive a bind mount of an absolute path.  _ensure_volume_access skips
_VolumeInitializer for absolute paths; models downloaded on the first
run persist for every subsequent run on the same machine.

- ci.compose.yml: pass HF_CACHE_DIR through from compose env
- run_local.sh: resolve _HF_CACHE_DIR (host ~/.cache/huggingface),
  mkdir+chmod 777, inject into compose override
- ci.yml: set HF_CACHE_DIR=$HOME/.cache/huggingface in project-name
  step; mkdir+chmod 777 in setup step; pass to docker compose env

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
…M timeout

HF_CACHE_DIR bind-mount was reverted — using the named Docker volume
flowmesh_server_hf_cache (identical to FlowMesh_dev) avoids accumulating
model weights on the host between CI runs; docker volume prune cleans it up.

The timeout issue is fixed by bumping the GPU E2E timeout: cold-start
(model download ~50s + load ~53s + compile ~17s + CUDA graphs) takes
~250s, leaving only ~50s for inference at the old 300s limit.
- run_local.sh: GPU default timeout 300 → 600s
- ci.yml: inference_vllm_tiny E2E_TIMEOUT_SEC 300 → 600s

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
…json

meta-llama/Llama-3.2-1B-Instruct requires HF_TOKEN; use the non-gated
Qwen/Qwen2.5-0.5B-Instruct instead, matching FlowMesh_dev's fix.

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
- Pin actions/checkout and actions/upload-artifact to commit SHAs
- Add persist-credentials: false to all checkout steps
- Add top-level permissions: contents: read
- Move github.workspace and github.run_id out of run: blocks into
  step-level env: to eliminate template-expansion injection warnings

Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
Signed-off-by: Qruixuan <121090450@link.cuhk.edu.cn>
timzsu (Collaborator) commented May 4, 2026

@kaiitunnz This PR runs on self-hosted runners on bare metal, which is not secure. We are thinking about wrapping the runner inside a container, which might require DinD. Do you have any ideas on how to ensure security while keeping the CI simple?

kaiitunnz (Collaborator) commented

> @kaiitunnz This PR runs on self-hosted runners on bare metal, which is not secure. We are thinking about wrapping the runner inside a container, which might require DinD. Do you have any ideas on how to ensure security while keeping the CI simple?

I think with DinD, malicious workflows can still escape into the host. We need a better solution for this before the release.

timzsu (Collaborator) commented May 4, 2026

> @kaiitunnz This PR runs on self-hosted runners on bare metal, which is not secure. We are thinking about wrapping the runner inside a container, which might require DinD. Do you have any ideas on how to ensure security while keeping the CI simple?

> I think with DinD, malicious workflows can still escape into the host. We need a better solution for this before the release.

One alternative that allows Docker inside a container without privileged mode is sysbox, but sysbox does not have stable GPU support.

Regarding malicious workflows, we can make this E2E CI a daily workflow on main. Then we only need to ensure that the workflows in main are secure. Does that help?
