fix: make multi-node stack deployment work #35

Merged
kaiitunnz merged 6 commits into main from kaiitunnz/fix/multi-node-deployment
May 12, 2026

Conversation

Collaborator

@kaiitunnz kaiitunnz commented May 12, 2026

Purpose

Two gaps prevent flowmesh stack from supporting multi-node deployments today:

  1. flowmesh stack bundle export bundled locally-built CLI/SDK wheels into the archive. To install on a remote host, the operator had to extract the tarball and pip install ./wheels/*.whl, with no support for installing the published flowmesh[cli] from PyPI.
  2. flowmesh stack up deployed redis_control and redis_telemetry on every node regardless of NODE_ROLE. A worker node ended up running two unused Redis containers while its server still needed to reach the root node's Redis for cross-node coordination. The local containers were dead weight at best, and silently broke coordination at worst (worker connects to its own empty Redis instead of the root's populated one).

This PR addresses both.

Changes

Bundle export defaults to published flowmesh[cli]

  • bundle.py — new _published_cli_spec() resolves to flowmesh[cli]==<version> and rejects dev / pre-release / local versions that PyPI would refuse to install.
  • New --include-wheels flag preserves the old wheel-bundling path for offline installs.
  • Docs (docs/CLI.md) updated.
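
As a rough illustration of the fail-fast check described above, the sketch below accepts only final release versions and builds the pinned requirement string. The function name `published_cli_spec`, the regular expression, and the error text are assumptions made for illustration; the actual `_published_cli_spec()` in bundle.py may be implemented differently.

```python
import re

# Assumption: a "publishable" version is a plain final release such as 1.2.3.
# Dev (.devN), pre-release (a/b/rc), and local (+tag) versions do not match.
_FINAL_RELEASE = re.compile(r"\d+(\.\d+)*")


def published_cli_spec(version: str) -> str:
    """Return the pinned requirement for the published CLI, or raise early."""
    if not _FINAL_RELEASE.fullmatch(version):
        raise ValueError(
            f"version {version!r} is not a final release; "
            "re-run the export with --include-wheels to bundle local wheels"
        )
    return f"flowmesh[cli]=={version}"
```

Under these assumptions, `published_cli_spec("0.1.0")` yields `flowmesh[cli]==0.1.0`, while `0.1.0.dev1` raises at bundle time instead of producing a tarball whose install.sh fails later.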

Worker stack skips local Redis on NODE_ROLE=worker

  • compose.yml: profiles: [root] on both Redis services, plus required: false on the server.depends_on entries.
  • stack.py — new _node_role(env_file) helper; _compose gained a profile= argument. up / pull / restart pass profile="root" only when role is root. down / clean / restart's down step pass profile="root" unconditionally so previously-deployed Redis containers actually get torn down.
  • src/server/main.py — one INFO log line emitting role + both Redis URLs at boot, so operators can verify what each node is talking to.
  • Docs / .env.example: NODE_ROLE documented; root-vs-worker contract spelled out next to the Redis URLs.
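
The role lookup and profile selection described in the list above could look roughly like the following. This is a minimal sketch, not the code in stack.py: the names `node_role` and `profile_flags`, the dotenv parsing, and the default of `root` when NODE_ROLE is unset are all assumptions. Only the `--profile` flag of `docker compose` and the command names come from the PR.

```python
from pathlib import Path


def node_role(env_file: Path, default: str = "root") -> str:
    """Read NODE_ROLE from a dotenv-style file (hypothetical parser)."""
    for raw in env_file.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "NODE_ROLE":
            return value.strip().strip("'\"").lower()
    return default  # Assumption: an unset role is treated as root.


# Stack commands that always include the root profile, so Redis containers
# left over from an earlier root deployment are still torn down.
_ALWAYS_ROOT_PROFILE = {"down", "clean"}


def profile_flags(command: str, role: str) -> list[str]:
    """Return the global docker compose flags for a stack command."""
    if command in _ALWAYS_ROOT_PROFILE or role == "root":
        return ["--profile", "root"]
    return []
```

With this shape, `profile_flags("up", "worker")` is empty, so the Redis services stay out of the worker's project, while `profile_flags("down", "worker")` still carries `--profile root`.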

Design

  • Fail fast on bad bundle versions. Emitting flowmesh[cli]==0.1.0.dev1 ships a bundle whose install.sh will fail downstream on pip install. Validating at bundle time with a directed error is strictly better than a broken tarball.
  • Compose profiles, not overlays. profiles: [root] + depends_on: required: false is exactly what compose was built for — single template, no Python templating, no second compose file.
  • NODE_ROLE is the only source of truth. Read from the same env file the server already reads. No --role CLI flag to drift out of sync.
  • Log Redis URLs, don't validate. A localhost heuristic would false-positive on single-machine multi-node testing (worker's localhost legitimately points at root's Redis on the same host). The boot log makes the resolved config greppable instead.
  • down needs --profile root too. Without it, docker compose down silently skips profile-gated services and leaves stray Redis containers. The flag is a no-op on workers where those services don't exist.
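
To make the "log, don't validate" choice concrete, here is a minimal sketch of a single greppable boot line. The logger name, the `REDIS_CONTROL_URL` / `REDIS_TELEMETRY_URL` variable names, the default role, and the message format are all hypothetical; the PR only states that src/server/main.py emits the role and both Redis URLs at INFO level.

```python
import logging

logger = logging.getLogger("flowmesh.server")  # logger name is an assumption


def boot_summary(env: dict[str, str]) -> str:
    """Build one greppable line with the role and both Redis URLs."""
    role = env.get("NODE_ROLE", "root")  # assumed default
    control = env.get("REDIS_CONTROL_URL", "<unset>")  # hypothetical name
    telemetry = env.get("REDIS_TELEMETRY_URL", "<unset>")  # hypothetical name
    return f"node_role={role} redis_control={control} redis_telemetry={telemetry}"


def log_boot_summary(env: dict[str, str]) -> None:
    """Emit the summary once at startup, at INFO level."""
    logger.info("%s", boot_summary(env))
```

An operator can then grep each node's server log for `node_role=` and compare the Redis URLs across root and worker, which works the same whether the nodes share a host or not.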

Test Plan

uv run pre-commit run --all-files
uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py

# Local two-node deployment from a single `flowmesh stack bundle export`
# tarball, with one root + one worker on the same host (distinct ports
# and stack slugs). TLS (server gRPC + Redis mTLS-style) and Redis ACL
# enabled end-to-end.
flowmesh stack bundle export --include-wheels   # produce bundle.tar.gz
# extract on each node, run install.sh, then `flowmesh stack up` with the
# matching .env (NODE_ROLE=root on one, NODE_ROLE=worker on the other).
flowmesh health                                  # against both nodes
flowmesh node list                               # against root, expect 2 nodes
docker ps --filter name=<root-slug>_redis_       # expect 2
docker ps --filter name=<worker-slug>_redis_     # expect 0
# Workload placement:
flowmesh stack worker up cpu 1                   # on root
flowmesh stack worker up gpu -t <idx>            # on worker
flowmesh workflow submit echo_three_node_graph.yaml      # CPU
flowmesh workflow submit inference_vllm_tiny.yaml        # GPU
# Verify tasks resolved to the expected node via `flowmesh task list -q workflow_id=...`
# + `flowmesh worker info <id>` → node_id.
# Worker should reject `workflow submit` (v1 workflow routers gated behind IS_ROOT_NODE):
FLOWMESH_BASE_URL=http://localhost:<worker-port> flowmesh workflow submit echo_three_node_graph.yaml
# Teardown leaves no stray containers on either node:
flowmesh stack down

Test Result

$ uv run pre-commit run --all-files
# All passed
$ uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
# 759 passed, 18 warnings in 30.46s

Multi-node deployment, in-order highlights from the local run (TLS + Redis ACL on; root on :8010 / :50061 / Redis :6389,:6390, worker on :8011 / :50062, worker server points at root's Redis URLs):

[+] up 4/4
 ✔ Container flowmesh_node_root_redis_telemetry Healthy        6.4s
 ✔ Container flowmesh_node_root_redis_control   Healthy        6.4s
 ✔ Container flowmesh_node_root_server          Healthy       11.3s

[+] up 1/1
 ✔ Container flowmesh_node_worker_server        Healthy        5.6s
== /healthz ==
  PASS  root  /healthz @ :8010
  PASS  worker /healthz @ :8011
== compose profile gating ==
  PASS  root  has 2 redis containers (flowmesh_node_root)
  PASS  worker has 0 redis containers (flowmesh_node_worker)
== cross-node registration ==
  PASS  root's node list contains node-root and node-worker

Profile gating works end-to-end: worker's flowmesh stack up produced no redis_control / redis_telemetry containers; root's stack came up with both. The worker node still registered with the root via Redis pub/sub (flowmesh node list against root returned both node-root and node-worker), confirming the worker's server is talking to root's Redis over TLS + ACL rather than a phantom local Redis.

Workflow dispatch across the two nodes (CPU template → root, GPU template → worker), verified by submitting two workflows to the root server, polling status, and resolving each task's assigned_worker → node_id via flowmesh worker info:

[5/7] wait for workflows to reach terminal status...
  cpu workflow (wfl-39181984-...): DONE
  gpu workflow (wfl-fc04b308-...): DONE
[6/7] verify task placement...
  PASS  CPU workflow ran on node-root (3/3 task(s) on nde-1)
  PASS  GPU workflow ran on node-worker (1/1 task(s) on nde-2)
[7/7] verify worker server rejects workflow submission...
  PASS  worker server rejected workflow submission (NotFoundError)

Teardown is now clean on both nodes: flowmesh stack down on the root removes the server and both Redis containers, confirming the --profile root fix to down / clean.


Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

kaiitunnz added 3 commits May 12, 2026 10:30
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The compose template deployed redis_control and redis_telemetry
unconditionally, so a worker node spun up two unused Redis containers
while its server still needed to reach the root node's Redis for
cross-node coordination. Profile-gate both Redis services on `root` so
worker nodes skip them, thread NODE_ROLE from the env file through the
stack CLI to pass `--profile root` only when applicable, and log the
resolved role + Redis URLs at server startup so misrouted workers are
diagnosable from logs.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/fix/multi-node-deployment branch from bed5b87 to d709ba2 Compare May 12, 2026 10:30
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz marked this pull request as ready for review May 12, 2026 11:37
@kaiitunnz kaiitunnz requested a review from timzsu May 12, 2026 11:44
Collaborator

@timzsu timzsu left a comment

Two comments are worth discussing. They were spotted by Codex and after checking the code, I think they make some sense.

Comment thread src/server/main.py Outdated
Comment thread docs/CLI.md
kaiitunnz added 2 commits May 12, 2026 14:40
`flowmesh stack logs` and `flowmesh stack ps` did not pass
`--profile root` to compose, so profile-gated services (the Redis
control / telemetry containers on a root node) were silently excluded
from log streaming and status output. Threads `profile` through
`compose_logs` / `stream_logs` and pins `profile="root"` at the two
callsites, matching the unconditional pattern used by `down` / `clean`
(harmless on worker nodes since no profile-gated containers exist
there).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz requested a review from timzsu May 12, 2026 14:43
Collaborator

@timzsu timzsu left a comment

LGTM.

@kaiitunnz kaiitunnz merged commit 81532b9 into main May 12, 2026
11 checks passed
2 participants