Skip to content

Recover from stale reconnection token after server replacement#317774

Open
EhabY wants to merge 1 commit into
microsoft:mainfrom
EhabY:fix/stale-reconnection-token-after-sleep
Open

Recover from stale reconnection token after server replacement#317774
EhabY wants to merge 1 commit into
microsoft:mainfrom
EhabY:fix/stale-reconnection-token-after-sleep

Conversation

@EhabY
Copy link
Copy Markdown
Contributor

@EhabY EhabY commented May 21, 2026

Follow-up to #310131. Bundle evidence and rationale: #310131 (comment)

After a long client sleep with --enable-remote-auto-shutdown, the remote server is replaced. The management reconnect is then rejected with Unknown reconnection token (never seen), which today falls through to a Reload Window prompt. This PR recovers transparently on the Management connection.

What changes

  • Additive reject shape on the wire. ErrorMessage keeps reason: string (older and forked clients still work) and adds optional reasonCode: ConnectionRejectionReason and detail. Control flow matches on reasonCode; if absent, the legacy reason string is parsed so recovery still drives against older servers.
  • Fresh handshake on stale token. On UnknownReconnectionTokenNeverSeen or UnknownReconnectionTokenSeenBefore, ManagementPersistentConnection rotates its token, resets the PersistentProtocol session state, and re-handshakes against the new server. Bounded by 3 attempts and by the existing reconnection grace time.
  • IPC init context and event subscriptions replay. PersistentProtocol.onDidResetSession fires from inside the reset. ManagementPersistentConnection listens, queues the cached IPC init context (msg #1), then re-issues EventListen for each live subscription with the same id. They ride the next endAcceptReconnection replay so they land ahead of any concurrent client.call().
  • Pending IPC promises rejected, event subscriptions preserved. ChannelClient separates promise requests from event subscriptions. cancelAllPending rejects in-flight promises (so they do not leak in the handlers map); event subscriptions stay alive locally and are re-registered server-side by the replay above.
  • Workbench detects the replacement. tokenAtLossTime snapshot on ConnectionLost, compared on ConnectionGain. Surfaces as a wasReplaced telemetry field and a warning log.
  • Protocol hygiene during reset. Pending ack timeouts cleared and _incomingMsgLastTime baselined to now.

Scope

Management only. Extension Host connections continue to use the existing permanent-failure pathway, which the workbench already handles by restarting the host on onDidDispose. The user sees a brief "extensions reloading" blip rather than silent recovery. Transparent Extension Host recovery is tracked in #317980.

Copilot AI review requested due to automatic review settings May 21, 2026 14:03
@EhabY EhabY force-pushed the fix/stale-reconnection-token-after-sleep branch from 5ca0070 to 6504122 Compare May 21, 2026 14:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves VS Code’s remote reconnection resilience when the remote server is replaced (e.g., after long client sleep with --enable-remote-auto-shutdown). Instead of falling through to the “Reload Window” prompt on the Management connection when a reconnection is rejected due to a stale/unknown token, the client now rotates to a fresh reconnection token, resets protocol session state, and performs a fresh handshake (bounded retries).

Changes:

  • Introduces structured, shared connection rejection reasons for stale/duplicate reconnection tokens and uses them server-side.
  • Adds Management-only stale-token recovery: rotate reconnection token, reset PersistentProtocol session state, and re-handshake (up to 3 attempts per reconnect loop).
  • Adds PersistentProtocol.resetSessionStateForFreshHandshake() plus a unit test covering queue/id reset behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/vs/workbench/contrib/remote/browser/remote.ts Adds telemetry/logging for detecting server replacement on reconnect (new wasReplaced field).
src/vs/server/node/remoteExtensionHostAgentServer.ts Replaces string rejection reasons with shared ConnectionRejectionReason values for token-related rejections.
src/vs/platform/remote/common/remoteAgentConnection.ts Implements stale-token recovery path for Management reconnection by rotating token and forcing a fresh handshake on the existing protocol.
src/vs/base/parts/ipc/common/ipc.net.ts Adds PersistentProtocol.resetSessionStateForFreshHandshake() to clear protocol session state for a fresh server.
src/vs/base/parts/ipc/test/node/ipc.net.test.ts Adds a unit test ensuring resetSessionStateForFreshHandshake clears outgoing state and restarts message ids.

Comment thread src/vs/workbench/contrib/remote/browser/remote.ts Outdated
Comment thread src/vs/base/parts/ipc/common/ipc.net.ts
Comment thread src/vs/platform/remote/common/remoteAgentConnection.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread src/vs/platform/remote/common/remoteAgentConnection.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@EhabY EhabY requested a review from Copilot May 22, 2026 09:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment thread src/vs/platform/remote/common/remoteAgentConnection.ts Outdated
Comment thread src/vs/base/parts/ipc/common/ipc.net.ts Outdated
Comment thread src/vs/platform/remote/common/remoteAgentConnection.ts Outdated
Comment thread src/vs/platform/remote/common/remoteAgentConnection.ts Outdated
@EhabY EhabY requested a review from Copilot May 22, 2026 10:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

src/vs/platform/remote/common/remoteAgentConnection.ts:896

  • getErrorFromMessage no longer falls back to a legacy msg.reason string. If a client talks to an older server (or any producer still sending { type:'error', reason: string }), the error message becomes unknown, and reasonCode stays undefined (which also disables the new stale-token recovery control flow). Consider: (1) parsing msg.reason when reasonCode is absent, and (2) populating a best-effort reasonCode only when it is a known value; otherwise keep it undefined but preserve the human-readable reason.
function getErrorFromMessage(msg: any): Error | null {
	if (msg && msg.type === 'error') {
		const reasonCode: ConnectionRejectionReason | undefined = msg.reasonCode;
		const description = reasonCode ? describeConnectionRejection(reasonCode) : 'unknown';
		const detail: string | undefined = msg.detail;
		const reason = detail ? `${description}: ${detail}` : description;
		const error = new Error(`Connection error: ${reason}`);
		// eslint-disable-next-line local/code-no-any-casts
		(<any>error).code = 'VSCODE_CONNECTION_ERROR';
		// eslint-disable-next-line local/code-no-any-casts
		(<any>error).reasonCode = reasonCode;
		return error;
	}
	return null;
}

src/vs/platform/remote/common/remoteAgentConnection.ts:1

  • Removing the reason: string field from the wire-level error message is a backwards-incompatible protocol change. Even if commit mismatches are usually rejected, there are still plausible mixed-version/dev scenarios (e.g., commit undefined) where this will degrade diagnostics or break clients expecting reason. A safer approach is to keep reason as an optional/deprecated field (alongside reasonCode/detail) for compatibility, and phase it out only once you can guarantee version-gated protocol negotiation.
/*---------------------------------------------------------------------------------------------

src/vs/server/node/remoteExtensionHostAgentServer.ts:1

  • To preserve wire compatibility with older clients, consider also sending a deprecated reason string (the preexisting payload shape) in addition to reasonCode/detail. That allows older clients to continue surfacing actionable messages while newer clients use reasonCode for control flow.
/*---------------------------------------------------------------------------------------------

Comment thread src/vs/platform/remote/common/remoteAgentConnection.ts
@EhabY EhabY force-pushed the fix/stale-reconnection-token-after-sleep branch 2 times, most recently from 91c43d5 to af6be49 Compare May 22, 2026 11:49
@EhabY EhabY requested a review from Copilot May 22, 2026 12:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread src/vs/base/parts/ipc/common/ipc.net.ts
Comment thread src/vs/base/parts/ipc/common/ipc.ts Outdated
Comment thread src/vs/workbench/contrib/remote/browser/remote.ts Outdated
@EhabY EhabY force-pushed the fix/stale-reconnection-token-after-sleep branch from af6be49 to 047f8f3 Compare May 22, 2026 12:22
@EhabY EhabY requested a review from Copilot May 22, 2026 12:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread src/vs/platform/remote/common/remoteAgentConnection.ts Outdated
Comment thread src/vs/server/node/remoteExtensionHostAgentServer.ts
When the macOS client is asleep long enough for the remote server's
--enable-remote-auto-shutdown to fire, the resolver brings up a fresh
server on wake. The management connection's reconnect is then
rejected with `Unknown reconnection token (never seen)`, which today
falls through to a Reload Window prompt.

Follow-up to microsoft#310131 / pr-comment-4508229968.

Approach:

- Stable, machine-readable rejection codes. `ErrorMessage` now carries
  a required `reasonCode: ConnectionRejectionReason` plus an optional
  `detail` string for runtime context. The human-readable text is
  derived in one place via `describeConnectionRejection`. Recovery
  matches on the code, never the string. Server-side reject sites all
  carry a code.

- Management-only fresh-handshake recovery. On
  `UnknownReconnectionTokenNeverSeen`, the management connection
  rotates `_reconnectionToken` to a fresh UUID, resets the existing
  `PersistentProtocol` session state, and re-handshakes against the
  new server as a fresh connection. Bounded to 3 attempts per
  reconnect loop. Extension Host connections deliberately skip this
  path so the existing permanent-failure / `onDidDispose` pathway can
  restart the host process cleanly.

- IPC init context is replayed. `IPCClient` sends its serialized
  context as the first regular message on connect, and the new
  `IPCServer` waits for that exact message. The fresh handshake
  queues the cached serialized context as msg microsoft#1 while
  `_isReconnecting=true` so `endAcceptReconnection` replays it
  ahead of any concurrent `client.call()` from the handshake window.

- IPC client cleanup that survives leaks. `ChannelClient` now tracks
  promise requests and event subscriptions separately. On stale-token
  recovery, `cancelAllPending` rejects in-flight promises (so they
  don't sit in the handlers map waiting for responses that will
  never arrive) while leaving event subscriptions alive.
  `replayEventSubscriptions` re-issues `EventListen` for each live
  subscription on the new transport, preserving the original ids so
  client-side handlers continue to route correctly. The replay runs
  synchronously inside the fresh-handshake setup block via a
  `freshHandshakeOnProtocolReady` hook so the EventListens land in
  the protocol queue alongside the context and are replayed in
  order, with no microtask race.

- Workbench detects the replacement. The local `tokenAtLossTime`
  snapshot is captured on `ConnectionLost` (stable across reconnect-
  loop token rotations) and compared on `ConnectionGain` to mark
  `wasReplaced` in telemetry and a warning log.

- Smaller protocol hygiene. `resetSessionStateForFreshHandshake`
  clears pending ack timeouts and baselines `_incomingMsgLastTime`
  to `Date.now()` so ack scheduling uses a sensible window from
  the reset moment.

Test plan:

- Unit test for `PersistentProtocol.resetSessionStateForFreshHandshake`
  verifying state clear and post-reset replay ordering against a
  fresh partner.
- IPC tests for `cancelAllPending` (rejects promises, preserves
  events), client usability after cancel, and
  `replayEventSubscriptions` re-issuing `EventListen` for live
  subscriptions while skipping disposed ones.
@EhabY EhabY force-pushed the fix/stale-reconnection-token-after-sleep branch from 047f8f3 to 3a6a9fa Compare May 22, 2026 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants