Recover from stale reconnection token after server replacement#317774
Recover from stale reconnection token after server replacement#317774EhabY wants to merge 1 commit into
Conversation
5ca0070 to
6504122
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves VS Code’s remote reconnection resilience when the remote server is replaced (e.g., after long client sleep with --enable-remote-auto-shutdown). Instead of falling through to the “Reload Window” prompt on the Management connection when a reconnection is rejected due to a stale/unknown token, the client now rotates to a fresh reconnection token, resets protocol session state, and performs a fresh handshake (bounded retries).
Changes:
- Introduces structured, shared connection rejection reasons for stale/duplicate reconnection tokens and uses them server-side.
- Adds Management-only stale-token recovery: rotate reconnection token, reset
PersistentProtocolsession state, and re-handshake (up to 3 attempts per reconnect loop). - Adds
PersistentProtocol.resetSessionStateForFreshHandshake()plus a unit test covering queue/id reset behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/vs/workbench/contrib/remote/browser/remote.ts | Adds telemetry/logging for detecting server replacement on reconnect (new wasReplaced field). |
| src/vs/server/node/remoteExtensionHostAgentServer.ts | Replaces string rejection reasons with shared ConnectionRejectionReason values for token-related rejections. |
| src/vs/platform/remote/common/remoteAgentConnection.ts | Implements stale-token recovery path for Management reconnection by rotating token and forcing a fresh handshake on the existing protocol. |
| src/vs/base/parts/ipc/common/ipc.net.ts | Adds PersistentProtocol.resetSessionStateForFreshHandshake() to clear protocol session state for a fresh server. |
| src/vs/base/parts/ipc/test/node/ipc.net.test.ts | Adds a unit test ensuring resetSessionStateForFreshHandshake clears outgoing state and restarts message ids. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
src/vs/platform/remote/common/remoteAgentConnection.ts:896
getErrorFromMessageno longer falls back to a legacymsg.reasonstring. If a client talks to an older server (or any producer still sending{ type:'error', reason: string }), the error message becomesunknown, andreasonCodestays undefined (which also disables the new stale-token recovery control flow). Consider: (1) parsingmsg.reasonwhenreasonCodeis absent, and (2) populating a best-effortreasonCodeonly when it is a known value; otherwise keep it undefined but preserve the human-readable reason.
function getErrorFromMessage(msg: any): Error | null {
if (msg && msg.type === 'error') {
const reasonCode: ConnectionRejectionReason | undefined = msg.reasonCode;
const description = reasonCode ? describeConnectionRejection(reasonCode) : 'unknown';
const detail: string | undefined = msg.detail;
const reason = detail ? `${description}: ${detail}` : description;
const error = new Error(`Connection error: ${reason}`);
// eslint-disable-next-line local/code-no-any-casts
(<any>error).code = 'VSCODE_CONNECTION_ERROR';
// eslint-disable-next-line local/code-no-any-casts
(<any>error).reasonCode = reasonCode;
return error;
}
return null;
}
src/vs/platform/remote/common/remoteAgentConnection.ts:1
- Removing the
reason: stringfield from the wire-level error message is a backwards-incompatible protocol change. Even if commit mismatches are usually rejected, there are still plausible mixed-version/dev scenarios (e.g., commit undefined) where this will degrade diagnostics or break clients expectingreason. A safer approach is to keepreasonas an optional/deprecated field (alongsidereasonCode/detail) for compatibility, and phase it out only once you can guarantee version-gated protocol negotiation.
/*---------------------------------------------------------------------------------------------
src/vs/server/node/remoteExtensionHostAgentServer.ts:1
- To preserve wire compatibility with older clients, consider also sending a deprecated
reasonstring (the preexisting payload shape) in addition toreasonCode/detail. That allows older clients to continue surfacing actionable messages while newer clients usereasonCodefor control flow.
/*---------------------------------------------------------------------------------------------
91c43d5 to
af6be49
Compare
af6be49 to
047f8f3
Compare
When the macOS client is asleep long enough for the remote server's --enable-remote-auto-shutdown to fire, the resolver brings up a fresh server on wake. The management connection's reconnect is then rejected with `Unknown reconnection token (never seen)`, which today falls through to a Reload Window prompt. Follow-up to microsoft#310131 / pr-comment-4508229968. Approach: - Stable, machine-readable rejection codes. `ErrorMessage` now carries a required `reasonCode: ConnectionRejectionReason` plus an optional `detail` string for runtime context. The human-readable text is derived in one place via `describeConnectionRejection`. Recovery matches on the code, never the string. Server-side reject sites all carry a code. - Management-only fresh-handshake recovery. On `UnknownReconnectionTokenNeverSeen`, the management connection rotates `_reconnectionToken` to a fresh UUID, resets the existing `PersistentProtocol` session state, and re-handshakes against the new server as a fresh connection. Bounded to 3 attempts per reconnect loop. Extension Host connections deliberately skip this path so the existing permanent-failure / `onDidDispose` pathway can restart the host process cleanly. - IPC init context is replayed. `IPCClient` sends its serialized context as the first regular message on connect, and the new `IPCServer` waits for that exact message. The fresh handshake queues the cached serialized context as msg microsoft#1 while `_isReconnecting=true` so `endAcceptReconnection` replays it ahead of any concurrent `client.call()` from the handshake window. - IPC client cleanup that survives leaks. `ChannelClient` now tracks promise requests and event subscriptions separately. On stale-token recovery, `cancelAllPending` rejects in-flight promises (so they don't sit in the handlers map waiting for responses that will never arrive) while leaving event subscriptions alive. `replayEventSubscriptions` re-issues `EventListen` for each live subscription on the new transport, preserving the original ids so client-side handlers continue to route correctly. The replay runs synchronously inside the fresh-handshake setup block via a `freshHandshakeOnProtocolReady` hook so the EventListens land in the protocol queue alongside the context and are replayed in order, with no microtask race. - Workbench detects the replacement. The local `tokenAtLossTime` snapshot is captured on `ConnectionLost` (stable across reconnect- loop token rotations) and compared on `ConnectionGain` to mark `wasReplaced` in telemetry and a warning log. - Smaller protocol hygiene. `resetSessionStateForFreshHandshake` clears pending ack timeouts and baselines `_incomingMsgLastTime` to `Date.now()` so ack scheduling uses a sensible window from the reset moment. Test plan: - Unit test for `PersistentProtocol.resetSessionStateForFreshHandshake` verifying state clear and post-reset replay ordering against a fresh partner. - IPC tests for `cancelAllPending` (rejects promises, preserves events), client usability after cancel, and `replayEventSubscriptions` re-issuing `EventListen` for live subscriptions while skipping disposed ones.
047f8f3 to
3a6a9fa
Compare
Follow-up to #310131. Bundle evidence and rationale: #310131 (comment)
After a long client sleep with
--enable-remote-auto-shutdown, the remote server is replaced. The management reconnect is then rejected withUnknown reconnection token (never seen), which today falls through to a Reload Window prompt. This PR recovers transparently on the Management connection.What changes
ErrorMessagekeepsreason: string(older and forked clients still work) and adds optionalreasonCode: ConnectionRejectionReasonanddetail. Control flow matches onreasonCode; if absent, the legacyreasonstring is parsed so recovery still drives against older servers.UnknownReconnectionTokenNeverSeenorUnknownReconnectionTokenSeenBefore,ManagementPersistentConnectionrotates its token, resets thePersistentProtocolsession state, and re-handshakes against the new server. Bounded by 3 attempts and by the existing reconnection grace time.PersistentProtocol.onDidResetSessionfires from inside the reset.ManagementPersistentConnectionlistens, queues the cached IPC init context (msg #1), then re-issuesEventListenfor each live subscription with the same id. They ride the nextendAcceptReconnectionreplay so they land ahead of any concurrentclient.call().ChannelClientseparates promise requests from event subscriptions.cancelAllPendingrejects in-flight promises (so they do not leak in the handlers map); event subscriptions stay alive locally and are re-registered server-side by the replay above.tokenAtLossTimesnapshot onConnectionLost, compared onConnectionGain. Surfaces as awasReplacedtelemetry field and a warning log._incomingMsgLastTimebaselined to now.Scope
Management only. Extension Host connections continue to use the existing permanent-failure pathway, which the workbench already handles by restarting the host on
onDidDispose. The user sees a brief "extensions reloading" blip rather than silent recovery. Transparent Extension Host recovery is tracked in #317980.