
fix: re-subscribe and re-home workers on node re-register #93

Merged: tjluyao merged 1 commit into main from fix/resubscribe-on-reregister on Jul 4, 2026


tjluyao (Contributor) commented on Jul 4, 2026

Problem

PR #91 made a node re-register itself when the root registry loses it (e.g. after a flowmesh-host restart), but it re-registers under a new node id. The node kept:

  • its dispatch/command SUBSCRIBE on the old node:<id>:dispatch / node:<id>:cmds channels, and
  • its workers homed under the old node_id in Redis.

So the host published tasks to the new id's channel, where nobody was listening, and the dispatcher routed to workers homed under a stale node id. The fleet ended up registered but dispatch-dead. This is one of the three blockers gating flowmesh-host auto-sync.

Fix

Lifecycle now fires an on_reregister(new_node_id) callback after a successful re-register. The supervisor wires it to three rebinds:

  • TaskListener.rebind / CommandListener.rebind — move the dispatch and command subscriptions to the new node's channels on the live pubsub (subscribe new, unsubscribe old). Registered worker queues are preserved so in-flight streams keep working.
  • GrpcServer.rebind_node — stamp future RegisterWorker calls with the new id AND rewrite every already-registered worker's node_id field in Redis, so the dispatcher routes tasks to them under the new node.
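
For reviewers who want the shape of the listener rebind without opening the diff, here is a minimal sketch. It is not the code in this PR: `FakePubSub` and `TaskListenerSketch` are hypothetical stand-ins (an in-memory set replaces the live Redis pubsub), and only the `node:<id>:dispatch` channel pattern and the subscribe-new-then-unsubscribe-old order come from the description above.

```python
from dataclasses import dataclass, field


class FakePubSub:
    """Hypothetical in-memory stand-in for the live Redis pubsub connection."""

    def __init__(self) -> None:
        self.channels = set()

    def subscribe(self, channel: str) -> None:
        self.channels.add(channel)

    def unsubscribe(self, channel: str) -> None:
        self.channels.discard(channel)


@dataclass
class TaskListenerSketch:
    """Illustrative listener; the real TaskListener may look different."""

    pubsub: FakePubSub
    node_id: str
    # worker_id -> pending items; must survive a rebind untouched
    worker_queues: dict = field(default_factory=dict)

    @staticmethod
    def channel_for(node_id: str) -> str:
        return f"node:{node_id}:dispatch"

    def start(self) -> None:
        self.pubsub.subscribe(self.channel_for(self.node_id))

    def rebind(self, new_node_id: str) -> None:
        old = self.channel_for(self.node_id)
        new = self.channel_for(new_node_id)
        if old == new:
            return  # same id, nothing to move
        # Subscribe to the new channel before dropping the old one, so the
        # listener is never left without a subscription.
        self.pubsub.subscribe(new)
        self.pubsub.unsubscribe(old)
        self.node_id = new_node_id


listener = TaskListenerSketch(FakePubSub(), "old-id", {"w1": ["chunk-0"]})
listener.start()
listener.rebind("new-id")
```

After the rebind, the only subscribed channel is the new node's dispatch channel and `worker_queues` is unchanged. A command listener would presumably follow the same pattern on `node:<id>:cmds`.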

Callback failures are caught + logged so a rebind hiccup can't kill the heartbeat loop.
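
A rough sketch of that guard, assuming a plain callable for the callback. `LifecycleSketch` and its `reregister` method are hypothetical names for illustration; only the `on_reregister(new_node_id)` hook and the catch-and-log behaviour are taken from this description.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("lifecycle_sketch")


class LifecycleSketch:
    """Illustrative only; the real Lifecycle class may be structured differently."""

    def __init__(
        self,
        node_id: str,
        on_reregister: Optional[Callable[[str], None]] = None,
    ) -> None:
        self.node_id = node_id
        self.on_reregister = on_reregister

    def reregister(self, new_node_id: str) -> None:
        """Adopt the id from a successful re-register, then notify the hook."""
        self.node_id = new_node_id
        if self.on_reregister is None:
            return
        try:
            self.on_reregister(new_node_id)
        except Exception:
            # Log with traceback and swallow: the caller (the heartbeat loop)
            # must keep running even if a rebind fails.
            logger.exception("on_reregister failed for node %s", new_node_id)


seen = []


def failing_rebind(new_node_id: str) -> None:
    seen.append(new_node_id)
    raise RuntimeError("simulated rebind failure")


lifecycle = LifecycleSketch("old-id", on_reregister=failing_rebind)
lifecycle.reregister("new-id")  # does not raise
```

The trade-off in this sketch is that a failed rebind is only visible in the logs, so the node keeps heartbeating but may still be undispatchable until the next successful rebind.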

Workers themselves need no change: they dial the local supervisor over gRPC and hold only a worker_id; node association is entirely server-side (worker_meta["node_id"]).
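
To illustrate why a server-side rewrite is enough, here is a sketch of re-homing under the assumption that worker metadata is a per-worker mapping with a `node_id` field. `GrpcServerSketch`, `register_worker`, and the dict-backed `store` are hypothetical; a plain dict stands in for Redis, and the real `GrpcServer.rebind_node` may differ.

```python
class GrpcServerSketch:
    """Illustrative only: tracks locally registered workers and their home node."""

    def __init__(self, node_id: str, store: dict) -> None:
        self.node_id = node_id
        self.store = store  # worker_id -> worker_meta dict (stand-in for Redis)
        self.local_workers = set()

    def register_worker(self, worker_id: str) -> None:
        # New registrations are stamped with whatever id is current.
        self.store[worker_id] = {"node_id": self.node_id}
        self.local_workers.add(worker_id)

    def rebind_node(self, new_node_id: str) -> int:
        """Stamp future registrations and re-home existing local workers."""
        self.node_id = new_node_id
        for worker_id in self.local_workers:
            self.store[worker_id]["node_id"] = new_node_id
        return len(self.local_workers)


# A worker owned by some other node must not be touched by the rebind.
store = {"remote-w": {"node_id": "node-x"}}
server = GrpcServerSketch("old-id", store)
server.register_worker("w1")
server.register_worker("w2")
rehomed = server.rebind_node("new-id")
server.register_worker("w3")  # registered after the rebind
```

Here `w1` and `w2` are rewritten to `new-id`, `w3` is stamped with `new-id` at registration, and `remote-w` keeps `node-x`. The workers never see the change, because they only hold a `worker_id`.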

Tests

New tests/server/test_supervisor_reregister.py (11 tests) covers each rebind in isolation plus an end-to-end case proving that after a simulated registry loss → re-register-under-new-id, the node is subscribed on the new dispatch + command channels AND its workers are homed under the new id (dispatchable). Full server suite green (433 passed). ruff/black/isort clean; mypy clean on changed files.

🤖 Generated with Claude Code

tjluyao requested a review from kaiitunnz as a code owner on July 4, 2026 20:02
tjluyao changed the title from "fix(supervisor): re-subscribe and re-home workers on node re-register" to "fix: re-subscribe and re-home workers on node re-register" on Jul 4, 2026
When the root registry loses a node, Lifecycle re-registers it under a new
node id. Previously the dispatch/command subscriptions and the workers'
homed node id stayed on the OLD id, so the host published tasks to a
channel nobody listened on and routed to workers under a stale node id —
the fleet landed registered-but-dispatch-dead.

Lifecycle now fires an on_reregister(new_node_id) callback after a
successful re-register. The supervisor wires it to:
- TaskListener.rebind / CommandListener.rebind: move the dispatch and
  command subscriptions to the new node's channels on the live pubsub
  (subscribe new, unsubscribe old); worker queues are preserved.
- GrpcServer.rebind_node: stamp future registrations with the new id and
  rewrite every already-registered worker's node_id in Redis so the
  dispatcher routes to them under the new node.

Callback failures are logged, not raised, so a rebind hiccup can't kill
the heartbeat loop.

Signed-off-by: Yao Lu <ylu@yao.lu>
tjluyao force-pushed the fix/resubscribe-on-reregister branch from bbc31fb to 7b73af3 on July 4, 2026 20:51
tjluyao merged commit 53489d9 into main on Jul 4, 2026
11 checks passed