fix: make multi-node stack deployment work #35
Merged
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
The compose template deployed redis_control and redis_telemetry unconditionally, so a worker node spun up two unused Redis containers while its server still needed to reach the root node's Redis for cross-node coordination. Profile-gate both Redis services on `root` so worker nodes skip them, thread NODE_ROLE from the env file through the stack CLI to pass `--profile root` only when applicable, and log the resolved role + Redis URLs at server startup so misrouted workers are diagnosable from logs.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
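A rough sketch of how the role could be threaded from the env file into the compose invocation. The helper names (`node_role_from_env`, `compose_argv`, `up_argv`), the dotenv parsing, and the fallback to `root` when `NODE_ROLE` is absent are all assumptions for illustration, not the code in `stack.py`. Only `docker compose --env-file` and `--profile` are real compose options.

```python
from __future__ import annotations

from pathlib import Path


def node_role_from_env(env_file: Path, default: str = "root") -> str:
    """Read NODE_ROLE from a dotenv-style file (hypothetical helper).

    The default used when NODE_ROLE is missing is an assumption.
    """
    for raw in env_file.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "NODE_ROLE":
            return value.strip().strip("'\"").lower() or default
    return default


def compose_argv(
    env_file: Path, subcommand: list[str], profile: str | None = None
) -> list[str]:
    """Build a `docker compose` command line, adding --profile only if given."""
    argv = ["docker", "compose", "--env-file", str(env_file)]
    if profile is not None:
        argv += ["--profile", profile]
    return argv + subcommand


def up_argv(env_file: Path) -> list[str]:
    """`up` enables the root profile only on a root node."""
    role = node_role_from_env(env_file)
    return compose_argv(env_file, ["up", "-d"], profile="root" if role == "root" else None)
```

On a worker env file, `up_argv` yields a command without `--profile`, so compose never creates the profile-gated Redis services there.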
bed5b87 to d709ba2
timzsu (Collaborator) requested changes on May 12, 2026
Two comments are worth discussing. They were spotted by Codex and after checking the code, I think they make some sense.
`flowmesh stack logs` and `flowmesh stack ps` did not pass `--profile root` to compose, so profile-gated services (the Redis control / telemetry containers on a root node) were silently excluded from log streaming and status output. Threads `profile` through `compose_logs` / `stream_logs` and pins `profile="root"` at the two callsites, matching the unconditional pattern used by `down` / `clean` (harmless on worker nodes since no profile-gated containers exist there).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
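The resulting rule for which `flowmesh stack` subcommands get the profile can be summarized in a few lines. This is a sketch only: `profile_for` and the two sets are hypothetical names, and the real code pins the value at each callsite rather than through a lookup.

```python
from __future__ import annotations

# Subcommands that must always see profile-gated services, so that Redis
# containers deployed earlier are still listed, streamed, and torn down.
ALWAYS_ROOT_PROFILE = {"down", "clean", "logs", "ps"}

# Subcommands that create or refresh services; these enable the profile
# only on a root node. (Per the PR, the down step inside `restart` is
# unconditional; this entry models its up step.)
ROLE_DEPENDENT = {"up", "pull", "restart"}


def profile_for(subcommand: str, role: str) -> str | None:
    """Return the compose profile to pass for a stack subcommand, or None."""
    if subcommand in ALWAYS_ROOT_PROFILE:
        return "root"
    if subcommand in ROLE_DEPENDENT:
        return "root" if role == "root" else None
    raise ValueError(f"unknown stack subcommand: {subcommand!r}")
```

Passing the profile unconditionally to the read-only and teardown commands is safe on workers because no profile-gated containers exist there.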
Purpose
Two gaps prevent `flowmesh stack` from supporting multi-node deployments today:

1. `flowmesh stack bundle export` bundled locally-built CLI/SDK wheels into the archive. To install on a remote host, the operator had to extract the tarball and `pip install ./wheels/*.whl`, with no support for installing the published `flowmesh[cli]` from PyPI.
2. `flowmesh stack up` deployed `redis_control` and `redis_telemetry` on every node regardless of `NODE_ROLE`. A worker node ended up running two unused Redis containers while its server still needed to reach the root node's Redis for cross-node coordination. The local containers were dead weight at best, and silently broke coordination at worst (the worker connects to its own empty Redis instead of the root's populated one).

This PR addresses both.
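The second failure mode is easiest to catch when each server states at boot which role and Redis endpoints it resolved. A minimal sketch of such a log line follows. The environment variable names `REDIS_CONTROL_URL` and `REDIS_TELEMETRY_URL`, the logger name, and the password masking are assumptions for illustration; the PR does not state whether credentials are masked.

```python
from __future__ import annotations

import logging
import os
from typing import Mapping
from urllib.parse import urlsplit, urlunsplit

logger = logging.getLogger("flowmesh.server")  # hypothetical logger name


def redact(url: str) -> str:
    """Replace any password in a URL with *** (assumed behavior)."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    userinfo, _, hostport = parts.netloc.rpartition("@")
    user = userinfo.partition(":")[0]
    return urlunsplit(parts._replace(netloc=f"{user}:***@{hostport}"))


def startup_summary(env: Mapping[str, str]) -> str:
    """Format role and both Redis URLs as one greppable line."""
    role = env.get("NODE_ROLE", "<unset>")
    control = redact(env.get("REDIS_CONTROL_URL", "<unset>"))
    telemetry = redact(env.get("REDIS_TELEMETRY_URL", "<unset>"))
    return f"node_role={role} redis_control={control} redis_telemetry={telemetry}"


def log_startup(env: Mapping[str, str] = os.environ) -> None:
    """Emit the summary once at INFO level during server boot."""
    logger.info(startup_summary(env))
```

A worker whose summary shows its own host in the Redis URLs, rather than the root's, is misrouted.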
Changes
Bundle export defaults to published `flowmesh[cli]`

- `bundle.py`: new `_published_cli_spec()` resolves to `flowmesh[cli]==<version>` and rejects dev / pre-release / local versions that PyPI would refuse to install.
- `--include-wheels` flag preserves the old wheel-bundling path for offline installs.
- `docs/CLI.md` updated.

Worker stack skips local Redis on `NODE_ROLE=worker`

- `compose.yml`: `profiles: [root]` on both Redis services + `required: false` on the `server.depends_on` entries.
- `stack.py`: new `_node_role(env_file)` helper; `_compose` gained a `profile=` argument. `up` / `pull` / `restart` pass `profile="root"` only when role is root. `down` / `clean` / `restart`'s down step pass `profile="root"` unconditionally so previously-deployed Redis containers actually get torn down.
- `src/server/main.py`: one INFO log line emitting role + both Redis URLs at boot, so operators can verify what each node is talking to.
- `.env.example`: `NODE_ROLE` documented; root-vs-worker contract spelled out next to the Redis URLs.

Design
- `flowmesh[cli]==0.1.0.dev1` ships a bundle whose `install.sh` will fail downstream on `pip install`. Validating at bundle time with a directed error is strictly better than a broken tarball.
- `profiles: [root]` + `depends_on: required: false` is exactly what compose was built for: single template, no Python templating, no second compose file.
- `NODE_ROLE` is the only source of truth. Read from the same env file the server already reads. No `--role` CLI flag to drift out of sync.
- A `localhost` heuristic would false-positive on single-machine multi-node testing (worker's `localhost` legitimately points at root's Redis on the same host). The boot log makes the resolved config greppable instead.
- `down` needs `--profile root` too. Without it, `docker compose down` silently skips profile-gated services and leaves stray Redis containers. The flag is a no-op on workers where those services don't exist.

Test Plan
Test Result
Multi-node deployment, in-order highlights from the local run (TLS + Redis ACL on; root on `:8010` / `:50061` / Redis `:6389`, `:6390`; worker on `:8011` / `:50062`; worker server points at root's Redis URLs):

- Profile gating works end-to-end: worker's `flowmesh stack up` produced no `redis_control` / `redis_telemetry` containers; root's stack came up with both. The worker node still registered with the root via Redis pub/sub (`flowmesh node list` against root returned both `node-root` and `node-worker`), confirming the worker's server is talking to root's Redis over TLS + ACL rather than a phantom local Redis.
- Workflow dispatch across the two nodes (CPU template → root, GPU template → worker), verified by submitting two workflows to the root server, polling status, and resolving each task's `assigned_worker` → `node_id` via `flowmesh worker info`.
- Teardown is now clean on both nodes: `flowmesh stack down` on the root removes all 4 containers (server + both Redis), confirming the `--profile root` fix to `down` / `clean`.

Pre-submission Checklist
- Ran `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- […] (`uv sync --all-packages --group ci --frozen`).
- […] `[BREAKING]` and described migration steps above.
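For reviewers who want a concrete picture of the bundle-time version check listed under Changes, here is a simplified sketch. It is not the `_published_cli_spec()` from `bundle.py`: the function name `published_cli_spec`, the regex, and the error wording are assumptions, and the regex is deliberately conservative (it accepts only plain numeric releases such as `0.1.0`). A real implementation would more likely parse versions with the `packaging` library.

```python
import re

# Accept only plain final releases such as "0.1.0" (simplifying assumption;
# this also rejects post-releases, which a fuller check might allow).
_FINAL_RELEASE = re.compile(r"\d+(\.\d+)*")


def published_cli_spec(version: str) -> str:
    """Return a pinned requirement for the published CLI, or raise.

    Dev, pre-release, and local versions (e.g. "0.1.0.dev1", "0.1.0rc1",
    "0.1.0+local") are rejected up front so the failure happens at bundle
    time rather than inside install.sh on the remote host.
    """
    if not _FINAL_RELEASE.fullmatch(version):
        raise ValueError(
            f"version {version!r} is not a final release; "
            "use --include-wheels to bundle locally built wheels instead"
        )
    return f"flowmesh[cli]=={version}"
```

Failing here gives the operator a directed error on the build machine instead of a broken tarball.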