
fix: prevent stale disconnect from marking reconnected broker offline #133

Closed
scion-gteam[bot] wants to merge 2 commits into main from scion/dev-issue-131

@scion-gteam scion-gteam Bot commented Jun 3, 2026

Summary

  • Root cause: When a broker's control channel disconnects and reconnects rapidly (within the same second), the old connection's deferred removeConnection races with the new connection. The stale cleanup removes the new connection's entry from the map and fires onDisconnect, leaving onlineProviders=0 despite a live WebSocket session.
  • Fix: removeConnection now takes a *BrokerConnection pointer and only removes the map entry / fires the disconnect callback when the passed connection is still the active one for that brokerID. Superseded connections are silently skipped.
  • Adds two new tests reproducing the exact race scenario described in the issue.
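The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the struct fields, `NewControlChannelManager`, `register`, and the boolean return value are assumptions made for the example. Only the names `removeConnection`, `BrokerConnection`, `onDisconnect`, and the pointer comparison come from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// BrokerConnection stands in for the real connection type. Only its
// pointer identity matters here; the brokerID field is an assumption.
type BrokerConnection struct {
	brokerID string
}

// ControlChannelManager is a pared-down, hypothetical manager.
type ControlChannelManager struct {
	mu           sync.Mutex
	connections  map[string]*BrokerConnection
	onDisconnect func(brokerID string) // may be nil
}

// NewControlChannelManager is a hypothetical constructor for the sketch.
func NewControlChannelManager(onDisconnect func(string)) *ControlChannelManager {
	return &ControlChannelManager{
		connections:  make(map[string]*BrokerConnection),
		onDisconnect: onDisconnect,
	}
}

// register installs conn as the active connection for its broker,
// replacing any earlier one (the reconnect case).
func (m *ControlChannelManager) register(conn *BrokerConnection) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.connections[conn.brokerID] = conn
}

// removeConnection cleans up conn only if it is still the active
// connection for its broker. It reports whether the entry was removed.
func (m *ControlChannelManager) removeConnection(conn *BrokerConnection) bool {
	m.mu.Lock()
	current, ok := m.connections[conn.brokerID]
	if !ok || current != conn {
		// Superseded by a reconnect (or already removed): skip silently.
		m.mu.Unlock()
		return false
	}
	delete(m.connections, conn.brokerID)
	cb := m.onDisconnect
	m.mu.Unlock()

	// Invoke the callback outside the lock so it can call back into the manager.
	if cb != nil {
		cb(conn.brokerID)
	}
	return true
}

func main() {
	disconnects := 0
	m := NewControlChannelManager(func(string) { disconnects++ })

	oldConn := &BrokerConnection{brokerID: "broker-1"}
	newConn := &BrokerConnection{brokerID: "broker-1"}
	m.register(oldConn)
	m.register(newConn) // rapid reconnect replaces the map entry

	removedStale := m.removeConnection(oldConn) // stale cleanup arrives late
	fmt.Println("stale removed:", removedStale, "disconnects:", disconnects)

	removedReal := m.removeConnection(newConn) // real disconnect
	fmt.Println("real removed:", removedReal, "disconnects:", disconnects)
}
```

Comparing pointers rather than broker IDs is what distinguishes the two sessions: both share a brokerID, so an ID-keyed delete cannot tell which session is being cleaned up.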

Fixes #131

Test plan

  • TestControlChannelManager_ReconnectDoesNotTriggerDisconnect — verifies a stale removeConnection from the old connection does NOT fire onDisconnect or remove the new connection
  • TestControlChannelManager_StaleRemoveAfterReconnect_ThenRealDisconnect — verifies the full lifecycle: stale remove is skipped, then a real disconnect of the current connection fires the callback correctly
  • Existing TestControlChannelManager_OnDisconnectCallback and _NilSafe updated and passing
  • Full pkg/hub/... test suite passes
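The lifecycle the second test covers (stale remove skipped, then a real disconnect) might be exercised along these lines. This is a single-threaded sketch with hypothetical names (`conn`, `manager`, `add`, `remove`, `runLifecycle`); it is not the test code from the PR, and it omits locking because nothing here runs concurrently.

```go
package main

import "fmt"

// conn is a hypothetical stand-in for a broker connection.
type conn struct{ brokerID string }

// manager is a hypothetical, lock-free stand-in for the control channel manager.
type manager struct {
	active       map[string]*conn
	onDisconnect func(brokerID string) // may be nil
}

// add makes c the active connection for its broker.
func (m *manager) add(c *conn) { m.active[c.brokerID] = c }

// remove drops c and fires the callback only if c is still the active entry.
func (m *manager) remove(c *conn) {
	if m.active[c.brokerID] != c {
		return // superseded or already gone
	}
	delete(m.active, c.brokerID)
	if m.onDisconnect != nil {
		m.onDisconnect(c.brokerID)
	}
}

// observations collects the values a test would assert on.
type observations struct {
	callsAfterStale int  // callback count after the stale remove
	newStillActive  bool // new connection still registered after the stale remove
	callsAfterReal  int  // callback count after the real disconnect
	activeAfterReal bool // any entry left after the real disconnect
}

// runLifecycle replays: connect, reconnect, stale remove, real disconnect.
func runLifecycle() observations {
	calls := 0
	m := &manager{
		active:       map[string]*conn{},
		onDisconnect: func(string) { calls++ },
	}
	oldC := &conn{brokerID: "broker-1"}
	newC := &conn{brokerID: "broker-1"}
	m.add(oldC)
	m.add(newC)

	var o observations
	m.remove(oldC) // stale cleanup from the superseded connection
	o.callsAfterStale = calls
	o.newStillActive = m.active["broker-1"] == newC

	m.remove(newC) // real disconnect of the current connection
	o.callsAfterReal = calls
	_, o.activeAfterReal = m.active["broker-1"]
	return o
}

func main() {
	fmt.Printf("%+v\n", runLifecycle())
}
```

A test built on this would expect zero callbacks and an intact new connection after the stale remove, then exactly one callback and an empty map after the real disconnect.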

ptone added 2 commits June 3, 2026 23:37
When a broker's control channel disconnects and reconnects in the same
second, the old connection's deferred removeConnection was deleting the
new connection from the map and firing the onDisconnect callback. This
left onlineProviders=0 despite a live WebSocket session.

Guard removeConnection with a pointer comparison: only remove the map
entry and fire the disconnect callback when the connection being cleaned
up is still the active one for that brokerID. If a reconnect has already
replaced it, skip both — the new session is healthy.

Fixes #131
Development

Successfully merging this pull request may close these issues.

Bug: Runtime broker control channel disconnects after ~4h, does not recover without hub restart
