fix: make multi-node stack deployment work by kaiitunnz · Pull Request #35 · mlsys-io/FlowMesh

kaiitunnz · 2026-05-12T07:30:55Z

Purpose

Two gaps prevent flowmesh stack from supporting multi-node deployments today:

`flowmesh stack bundle export` always bundled locally built CLI/SDK wheels into the archive. To install on a remote host, the operator had to extract the tarball and run `pip install ./wheels/*.whl`; there was no way to install the published `flowmesh[cli]` from PyPI instead.
`flowmesh stack up` deployed `redis_control` and `redis_telemetry` on every node regardless of `NODE_ROLE`. A worker node ended up running two unused Redis containers while its server still needed to reach the root node's Redis for cross-node coordination. The local containers were dead weight at best and silently broke coordination at worst (the worker connects to its own empty Redis instead of the root's populated one).

This PR addresses both.

Changes

Bundle export defaults to published `flowmesh[cli]`

bundle.py: new `_published_cli_spec()` resolves to `flowmesh[cli]==<version>` and rejects dev / pre-release / local versions, which pip would fail to install from PyPI.
New --include-wheels flag preserves the old wheel-bundling path for offline installs.
Docs (docs/CLI.md) updated.
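
As a rough illustration of the bundle-time check described above, here is a minimal sketch. It is not the PR's `_published_cli_spec()`: the function name, the regex, and the error text are assumptions made for this example. It also simplifies by accepting only plain release versions such as `0.1.0`, so it would reject post-releases too.

```python
import re

# Hypothetical stand-in for the PR's _published_cli_spec(); the real helper may differ.
# Simplification: only plain release versions ("0.1.0") are accepted.
_RELEASE_RE = re.compile(r"\d+(\.\d+)*")


def published_cli_spec(version: str) -> str:
    """Return a pinned requirement for the published CLI, or raise ValueError."""
    if not _RELEASE_RE.fullmatch(version):
        # The wording of this message is invented for the sketch.
        raise ValueError(
            f"version {version!r} is not a plain release; "
            "re-run the export with --include-wheels to bundle local wheels"
        )
    return f"flowmesh[cli]=={version}"


if __name__ == "__main__":
    print(published_cli_spec("0.1.0"))
```

Failing here, at export time, is what keeps a dev version from being written into a bundle that would only break later on the remote host.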

Worker stack skips local Redis on `NODE_ROLE=worker`

compose.yml: `profiles: [root]` on both Redis services, plus `required: false` on the `server.depends_on` entries.
stack.py: new `_node_role(env_file)` helper; `_compose` gained a `profile=` argument. `up` / `pull` / `restart` pass `profile="root"` only when the role is root. `down` / `clean` and the down step of `restart` pass `profile="root"` unconditionally so previously deployed Redis containers actually get torn down.
src/server/main.py: one INFO log line emitting the role and both Redis URLs at boot, so operators can verify what each node is talking to.
Docs / .env.example: `NODE_ROLE` documented; root-vs-worker contract spelled out next to the Redis URLs.
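
To make the role lookup concrete, the following is a minimal sketch of reading `NODE_ROLE` from a dotenv-style file. It is not the PR's `_node_role(env_file)`: the parsing rules, the `root` default when the key is absent, and the validation error are all assumptions for illustration.

```python
from pathlib import Path


def node_role(env_file: Path, default: str = "root") -> str:
    """Read NODE_ROLE from a dotenv-style file (hypothetical helper).

    Assumption: a missing NODE_ROLE falls back to `default`.
    """
    role = default
    for raw in env_file.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "NODE_ROLE":
            # The last occurrence in the file wins.
            role = value.strip().strip("'\"").lower()
    if role not in ("root", "worker"):
        raise ValueError(f"unsupported NODE_ROLE: {role!r}")
    return role
```

Reading the role from the same env file the server uses means the CLI and the server cannot disagree about which role a node has.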

Design

Fail fast on bad bundle versions. Emitting flowmesh[cli]==0.1.0.dev1 ships a bundle whose install.sh will fail downstream on pip install. Validating at bundle time with a directed error is strictly better than a broken tarball.
Compose profiles, not overlays. profiles: [root] + depends_on: required: false is exactly what compose was built for — single template, no Python templating, no second compose file.
NODE_ROLE is the only source of truth. Read from the same env file the server already reads. No --role CLI flag to drift out of sync.
Log Redis URLs, don't validate. A localhost heuristic would false-positive on single-machine multi-node testing (worker's localhost legitimately points at root's Redis on the same host). The boot log makes the resolved config greppable instead.
down needs --profile root too. Without it, docker compose down silently skips profile-gated services and leaves stray Redis containers. The flag is a no-op on workers where those services don't exist.
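
The profile rules above can be sketched as a small decision function plus an argv builder. The names `profile_for` and `compose_argv` are hypothetical and do not come from `stack.py`; the only thing taken as given is that `docker compose` accepts `--profile` as a top-level option before the subcommand.

```python
from typing import List, Optional

# Stack actions that pass --profile root regardless of role, per the description above.
ALWAYS_ROOT_PROFILE = {"down", "clean"}


def profile_for(action: str, role: str) -> Optional[str]:
    """Pick the compose profile for a stack action (hypothetical helper)."""
    if action in ALWAYS_ROOT_PROFILE or role == "root":
        return "root"
    return None


def compose_argv(compose_args: List[str], profile: Optional[str]) -> List[str]:
    """Build a `docker compose` command line, adding --profile when one is set."""
    argv = ["docker", "compose"]
    if profile is not None:
        argv += ["--profile", profile]
    return argv + compose_args


if __name__ == "__main__":
    # Worker `up` gets no profile, so the profile-gated Redis services are skipped.
    print(compose_argv(["up", "-d"], profile_for("up", "worker")))
    # Worker `down` still names the profile so stray Redis containers are removed.
    print(compose_argv(["down"], profile_for("down", "worker")))
```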

Test Plan

uv run pre-commit run --all-files
uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py

# Local two-node deployment from a single `flowmesh stack bundle export`
# tarball, with one root + one worker on the same host (distinct ports
# and stack slugs). TLS (server gRPC + Redis mTLS-style) and Redis ACL
# enabled end-to-end.
flowmesh stack bundle export --include-wheels   # produce bundle.tar.gz
# extract on each node, run install.sh, then `flowmesh stack up` with the
# matching .env (NODE_ROLE=root on one, NODE_ROLE=worker on the other).
flowmesh health                                  # against both nodes
flowmesh node list                               # against root, expect 2 nodes
docker ps --filter name=<root-slug>_redis_       # expect 2
docker ps --filter name=<worker-slug>_redis_     # expect 0
# Workload placement:
flowmesh stack worker up cpu 1                   # on root
flowmesh stack worker up gpu -t <idx>            # on worker
flowmesh workflow submit echo_three_node_graph.yaml      # CPU
flowmesh workflow submit inference_vllm_tiny.yaml        # GPU
# Verify tasks resolved to the expected node via `flowmesh task list -q workflow_id=...`
# + `flowmesh worker info <id>` → node_id.
# Worker should reject `workflow submit` (v1 workflow routers gated behind IS_ROOT_NODE):
FLOWMESH_BASE_URL=http://localhost:<worker-port> flowmesh workflow submit echo_three_node_graph.yaml
# Teardown leaves no stray containers on either node:
flowmesh stack down

Test Result

$ uv run pre-commit run --all-files
# All passed
$ uv run pytest tests --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
# 759 passed, 18 warnings in 30.46s

Multi-node deployment, in-order highlights from the local run (TLS + Redis ACL on; root on :8010 / :50061 / Redis :6389,:6390, worker on :8011 / :50062, worker server points at root's Redis URLs):

[+] up 4/4
 ✔ Container flowmesh_node_root_redis_telemetry Healthy        6.4s
 ✔ Container flowmesh_node_root_redis_control   Healthy        6.4s
 ✔ Container flowmesh_node_root_server          Healthy       11.3s

[+] up 1/1
 ✔ Container flowmesh_node_worker_server        Healthy        5.6s

== /healthz ==
  PASS  root  /healthz @ :8010
  PASS  worker /healthz @ :8011
== compose profile gating ==
  PASS  root  has 2 redis containers (flowmesh_node_root)
  PASS  worker has 0 redis containers (flowmesh_node_worker)
== cross-node registration ==
  PASS  root's node list contains node-root and node-worker

Profile gating works end-to-end: worker's flowmesh stack up produced no redis_control / redis_telemetry containers; root's stack came up with both. The worker node still registered with the root via Redis pub/sub (flowmesh node list against root returned both node-root and node-worker), confirming the worker's server is talking to root's Redis over TLS + ACL rather than a phantom local Redis.
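
A check like the "compose profile gating" step could be scripted roughly as follows. This is a sketch, not the script used for the run above: it assumes container names follow the `<slug>_redis_*` pattern visible in the output and that the input is the text produced by `docker ps --format '{{.Names}}'`.

```python
from typing import List


def redis_containers(names_output: str, slug: str) -> List[str]:
    """Return the Redis container names belonging to one stack slug."""
    prefix = f"{slug}_redis_"
    return [name for name in names_output.split() if name.startswith(prefix)]


if __name__ == "__main__":
    # Sample input built from the container names shown in the run above.
    sample = "\n".join(
        [
            "flowmesh_node_root_redis_telemetry",
            "flowmesh_node_root_redis_control",
            "flowmesh_node_root_server",
            "flowmesh_node_worker_server",
        ]
    )
    print(len(redis_containers(sample, "flowmesh_node_root")))
    print(len(redis_containers(sample, "flowmesh_node_worker")))
```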

Workflow dispatch across the two nodes (CPU template → root, GPU template → worker), verified by submitting two workflows to the root server, polling status, and resolving each task's assigned_worker → node_id via flowmesh worker info:

[5/7] wait for workflows to reach terminal status...
  cpu workflow (wfl-39181984-...): DONE
  gpu workflow (wfl-fc04b308-...): DONE
[6/7] verify task placement...
  PASS  CPU workflow ran on node-root (3/3 task(s) on nde-1)
  PASS  GPU workflow ran on node-worker (1/1 task(s) on nde-2)
[7/7] verify worker server rejects workflow submission...
  PASS  worker server rejected workflow submission (NotFoundError)
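
The placement check resolves each task's `assigned_worker` to a `node_id` and counts tasks per node. A minimal sketch of that aggregation is below; the dictionary shapes and the worker IDs (`wrk-a`, `wrk-b`) are made up for illustration and are not the FlowMesh API's actual response format.

```python
from collections import Counter
from typing import Dict, List


def tasks_per_node(tasks: List[dict], worker_to_node: Dict[str, str]) -> Counter:
    """Count tasks per node by resolving assigned_worker to node_id."""
    return Counter(worker_to_node[task["assigned_worker"]] for task in tasks)


if __name__ == "__main__":
    # Hypothetical data: three CPU tasks on one worker, one GPU task on another.
    worker_to_node = {"wrk-a": "nde-1", "wrk-b": "nde-2"}
    cpu_tasks = [{"assigned_worker": "wrk-a"}] * 3
    gpu_tasks = [{"assigned_worker": "wrk-b"}]
    print(tasks_per_node(cpu_tasks, worker_to_node))
    print(tasks_per_node(gpu_tasks, worker_to_node))
```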

Teardown is now clean on both nodes: `flowmesh stack down` on the root removes the server and both Redis containers, confirming the `--profile root` fix to `down` / `clean`.

Pre-submission Checklist

I have read the contribution guidelines.
I have run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that uv run pytest tests/ passes locally.
If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
I have updated documentation or config examples if user-facing behavior changed.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

The compose template deployed redis_control and redis_telemetry unconditionally, so a worker node spun up two unused Redis containers while its server still needed to reach the root node's Redis for cross-node coordination. Profile-gate both Redis services on `root` so worker nodes skip them, thread NODE_ROLE from the env file through the stack CLI to pass `--profile root` only when applicable, and log the resolved role + Redis URLs at server startup so misrouted workers are diagnosable from logs.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

Two comments are worth discussing. They were spotted by Codex, and after checking the code I think they make some sense.

`flowmesh stack logs` and `flowmesh stack ps` did not pass `--profile root` to compose, so profile-gated services (the Redis control / telemetry containers on a root node) were silently excluded from log streaming and status output. Thread `profile` through `compose_logs` / `stream_logs` and pin `profile="root"` at the two callsites, matching the unconditional pattern used by `down` / `clean` (harmless on worker nodes since no profile-gated containers exist there).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

LGTM.

kaiitunnz added 3 commits May 12, 2026 10:30

feat: default stack bundle export to published flowmesh[cli]

c6d9038

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

fix: regenerate .env.example

d709ba2

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

kaiitunnz force-pushed the kaiitunnz/fix/multi-node-deployment branch from bed5b87 to d709ba2 Compare May 12, 2026 10:30

fix: add profile=root to stack teardown commands

9b91f5c

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

kaiitunnz marked this pull request as ready for review May 12, 2026 11:37

kaiitunnz requested a review from timzsu May 12, 2026 11:44

timzsu requested changes May 12, 2026


Comment thread src/server/main.py Outdated

Comment thread docs/CLI.md

kaiitunnz added 2 commits May 12, 2026 14:40

chore: drop server boot log of role and Redis URLs

e04c0d6

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

kaiitunnz requested a review from timzsu May 12, 2026 14:43

timzsu approved these changes May 12, 2026


kaiitunnz merged commit 81532b9 into main May 12, 2026
11 checks passed

kaiitunnz merged 6 commits into main from kaiitunnz/fix/multi-node-deployment
