Retry command center connection indefinitely #76
Merged
Previously squadron gave up after 10 connection attempts (or just 1 when auto_reconnect was off). If the command center was slow to start or briefly unreachable, squadron would bail out and sit there disconnected. Now connectWithRetry loops forever at a 3s interval, only aborting when the client is shut down (Ctrl-C / SIGTERM). Surfaces client.Done() so the retry loop can observe shutdown cleanly.
SIGTERM during the now-indefinite connection retry loop would hard-kill the process before cleanup ran, orphaning the command center subprocess and MCP/plugin children — especially noticeable in background mode, where `squadron disengage` relies on graceful shutdown. Register signal handling before connectWithRetry. The handler closes the client, which closes client.Done() and unblocks the retry loop so the normal cleanup path runs. If shutdown fires during retry, skip the rest of startup and fall through to cleanup immediately.
After a natural websocket drop, Run() flipped c.connected to false and returned an error. The reconnect goroutine then read !IsConnected() as "shutting down" and exited — never reaching the AutoReconnect branch. Reconnect was effectively dead code for the common case. Rewrite the watchdog loop to gate exit on the shutdown channel (the actual shutdown signal), not on IsConnected(). Drop the AutoReconnect check: as long as squadron is running, it should always reconnect. AutoReconnect in config is now effectively always-on. Also stop cancelling c.ctx on register failure — that left the client permanently unable to reconnect once registration hiccuped.
Summary
- `connectWithRetry` now loops forever at a 3s interval instead of giving up after 10 attempts (or 1 with `auto_reconnect` off). Squadron will keep trying to reach the command center until it succeeds or the process is shut down.
- Surfaces `wsbridge.Client.Done()` so the retry loop can observe shutdown cleanly and bail out on Ctrl-C / SIGTERM.
- `auto_reconnect` still gates whether we attempt to reconnect after a connection is lost; it just no longer caps how long we keep trying once we've decided to try.

Test plan
- `./squadron engage` with no command center running: squadron logs repeated retry attempts and connects when the command center comes up.
- `auto_reconnect = true`: squadron reconnects when the command center comes back, no matter how long it was down.