fix: re-subscribe and re-home workers on node re-register #93
Merged
When the root registry loses a node, Lifecycle re-registers it under a new node id. Previously the dispatch/command subscriptions and the workers' homed node id stayed on the OLD id, so the host published tasks to a channel nobody listened on and routed to workers under a stale node id; the fleet landed registered-but-dispatch-dead.

Lifecycle now fires an on_reregister(new_node_id) callback after a successful re-register. The supervisor wires it to:
- TaskListener.rebind / CommandListener.rebind: move the dispatch and command subscriptions to the new node's channels on the live pubsub (subscribe new, unsubscribe old); worker queues are preserved.
- GrpcServer.rebind_node: stamp future registrations with the new id and rewrite every already-registered worker's node_id in Redis so the dispatcher routes to them under the new node.

Callback failures are logged, not raised, so a rebind hiccup can't kill the heartbeat loop.

Signed-off-by: Yao Lu <ylu@yao.lu>
Problem
PR #91 made a node re-register itself when the root registry loses it (e.g. after a flowmesh-host restart), but it re-registers under a new node id. The node kept:
- its subscriptions on the old node:<id>:dispatch / node:<id>:cmds channels, and
- its workers homed under the old node id.

So the host published tasks to the new id's channel (nobody listening) and the dispatcher routed to workers under a stale node → the fleet landed registered-but-dispatch-dead. This is one of the three blockers gating flowmesh-host auto-sync.
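The failure mode above can be reproduced with a toy in-memory pubsub. This is only an illustrative sketch: ToyPubSub, dispatch_channel, and the node ids "old-node" / "new-node" are hypothetical names, not the project's code; only the node:<id>:dispatch channel shape comes from the description.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable


class ToyPubSub:
    """In-memory stand-in for the real pubsub (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        """Deliver to the channel's subscribers and return how many received it."""
        handlers = self._subs.get(channel, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


def dispatch_channel(node_id: str) -> str:
    # Channel shape taken from the PR description: node:<id>:dispatch
    return f"node:{node_id}:dispatch"


pubsub = ToyPubSub()
received: list[str] = []

# The node subscribed when it first registered as "old-node".
pubsub.subscribe(dispatch_channel("old-node"), received.append)

# After the registry loss it re-registers as "new-node", but the subscription
# is never moved, so the host's publish reaches no listener.
receivers = pubsub.publish(dispatch_channel("new-node"), "task-1")
```

Here the publish call reports zero receivers and the node's handler never fires, which is the registered-but-dispatch-dead state.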
Fix
Lifecycle now fires an on_reregister(new_node_id) callback after a successful re-register. The supervisor wires it to three rebinds:
- TaskListener.rebind / CommandListener.rebind: move the dispatch and command subscriptions to the new node's channels on the live pubsub (subscribe new, unsubscribe old). Registered worker queues are preserved so in-flight streams keep working.
- GrpcServer.rebind_node: stamp future RegisterWorker calls with the new id AND rewrite every already-registered worker's node_id field in Redis, so the dispatcher routes tasks to them under the new node.

Callback failures are caught + logged so a rebind hiccup can't kill the heartbeat loop.
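The wiring described above might look roughly like the following sketch. All Toy* classes, handle_reregistered, and rebind_all are hypothetical stand-ins for illustration; a plain set and dict replace the live pubsub and the Redis worker metadata, and the real classes in this PR will differ.

```python
from __future__ import annotations

import logging
from typing import Callable, Optional

logger = logging.getLogger("reregister-sketch")


class ToyPubSub:
    """Tracks subscribed channel names only; stands in for the live pubsub."""

    def __init__(self) -> None:
        self.channels: set[str] = set()

    def subscribe(self, channel: str) -> None:
        self.channels.add(channel)

    def unsubscribe(self, channel: str) -> None:
        self.channels.discard(channel)


class ToyListener:
    """Stand-in for TaskListener / CommandListener; suffix is 'dispatch' or 'cmds'."""

    def __init__(self, pubsub: ToyPubSub, node_id: str, suffix: str) -> None:
        self.pubsub, self.node_id, self.suffix = pubsub, node_id, suffix
        self.worker_queues: dict[str, list[str]] = {}  # untouched by rebind
        pubsub.subscribe(self._channel(node_id))

    def _channel(self, node_id: str) -> str:
        return f"node:{node_id}:{self.suffix}"

    def rebind(self, new_node_id: str) -> None:
        if new_node_id == self.node_id:
            return
        # Order as in the PR description: subscribe new, then unsubscribe old.
        self.pubsub.subscribe(self._channel(new_node_id))
        self.pubsub.unsubscribe(self._channel(self.node_id))
        self.node_id = new_node_id


class ToyGrpcServer:
    """Stand-in for GrpcServer; a dict replaces the Redis worker metadata."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.worker_meta: dict[str, dict[str, str]] = {}

    def register_worker(self, worker_id: str) -> None:
        self.worker_meta[worker_id] = {"node_id": self.node_id}

    def rebind_node(self, new_node_id: str) -> None:
        self.node_id = new_node_id  # stamps future registrations
        for meta in self.worker_meta.values():  # re-homes existing workers
            meta["node_id"] = new_node_id


class ToyLifecycle:
    def __init__(self, on_reregister: Optional[Callable[[str], None]] = None) -> None:
        self.on_reregister = on_reregister

    def handle_reregistered(self, new_node_id: str) -> None:
        if self.on_reregister is None:
            return
        try:
            self.on_reregister(new_node_id)
        except Exception:
            # Logged, not raised: a failed rebind must not stop the heartbeat loop.
            logger.exception("on_reregister callback failed for %s", new_node_id)


pubsub = ToyPubSub()
tasks = ToyListener(pubsub, "old-node", "dispatch")
commands = ToyListener(pubsub, "old-node", "cmds")
server = ToyGrpcServer("old-node")
server.register_worker("w1")


def rebind_all(new_node_id: str) -> None:
    tasks.rebind(new_node_id)
    commands.rebind(new_node_id)
    server.rebind_node(new_node_id)


lifecycle = ToyLifecycle(on_reregister=rebind_all)
lifecycle.handle_reregistered("new-node")
```

After the call, the toy node is subscribed only on the new node's two channels, and worker w1 is homed under the new id without the worker itself doing anything.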
Workers themselves need no change: they dial the local supervisor over gRPC and hold only a worker_id; node association is entirely server-side (worker_meta["node_id"]).
Tests
New tests/server/test_supervisor_reregister.py (11 tests) covers each rebind in isolation plus an end-to-end case proving that after a simulated registry loss → re-register-under-new-id, the node is subscribed on the new dispatch + command channels AND its workers are homed under the new id (dispatchable). Full server suite green (433 passed). ruff/black/isort clean; mypy clean on changed files.
🤖 Generated with Claude Code