Fix remote agent host reconnect hang for SSH and tunnel paths #315552
Merged
roblourens merged 9 commits intoMay 11, 2026
When an SSH or tunnel connection silently dies (TCP half-open before ssh2/dev-tunnels keepalives detect it), the SDK calls used to (re)create the relay would hang forever. The renderer's reconnect await would never settle, leaving the per-host pending flag set and effectively disabling auto-reconnect for the lifetime of the shared process. The user-visible symptom: reloading the window doesn't help, only quitting and restarting the app does.

Fixes:
- sshRemoteAgentHostService: bound _createWebSocketRelay in connect(replaceRelay=true) with raceTimeout. On timeout the existing catch tears down the dead sshClient so the next attempt starts fresh.
- tunnelAgentHostService: bound the four hangable dev-tunnels SDK calls (relay connect, waitForForwardedPort, connectToForwardedPort, ws open) with per-step timeouts; dispose relayClient on failure so we don't leak it.
- remoteAgentHost.contribution: rewrite _reconnectSSHEntries with exponential-backoff retry mirroring the tunnel pattern. Per-host state lives in a single SSHReconnectState with a MutableDisposable timer, owned by a DisposableMap so disposal of the contribution (or removal of a host) cancels pending timers automatically.

Adds a unit test that simulates a stuck relay via a hangRelayCreationOnCall hook and verifies the timeout fires and disposes the SSH client.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
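The timeout bounding described in this commit can be illustrated with a small standalone sketch. This is not the VS Code raceTimeout helper or the PR's code: withStepTimeout, StepTimeoutError, and reconnect are hypothetical names, and the relay and client are stand-ins passed as callbacks. It only shows the general idea of racing a possibly-hanging call against a timer and letting the caller's catch tear down the dead client.

```typescript
// Hypothetical sketch: bound a possibly-hanging call with a timer.
class StepTimeoutError extends Error {
	step: string;
	constructor(step: string, timeoutMs: number) {
		super(`${step} timed out after ${timeoutMs}ms`);
		this.name = 'StepTimeoutError';
		this.step = step;
	}
}

function withStepTimeout<T>(step: string, work: Promise<T>, timeoutMs: number): Promise<T> {
	let timer: ReturnType<typeof setTimeout> | undefined;
	const timeout = new Promise<never>((_resolve, reject) => {
		timer = setTimeout(() => reject(new StepTimeoutError(step, timeoutMs)), timeoutMs);
	});
	// Whichever settles first wins; the timer is cleared either way so none is left behind.
	return Promise.race([work, timeout]).finally(() => {
		if (timer !== undefined) {
			clearTimeout(timer);
		}
	});
}

// Usage sketch: a relay creation that never settles becomes a rejection,
// and the catch block ends the (stand-in) dead client before rethrowing.
async function reconnect(
	createRelay: () => Promise<string>,
	endClient: () => void,
	timeoutMs: number
): Promise<string> {
	try {
		return await withStepTimeout('relay creation', createRelay(), timeoutMs);
	} catch (err) {
		endClient();
		throw err;
	}
}
```

Naming the step in the error is a design choice that makes a hang attributable to a specific SDK call when several steps are bounded in sequence.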
Pull request overview
Fixes cases where remote Agent Host reconnect could hang indefinitely (SSH relay creation and dev-tunnels connect steps) after a silent network drop, which in turn could permanently disable auto-reconnect until the app was restarted.
Changes:
- Add a bounded timeout to SSH relay creation on the replaceRelay reconnect path so a dead SSH client can be torn down and retried.
- Add per-step timeouts (connect, waitForForwardedPort, connectToForwardedPort, WebSocket open) to tunnel connect flow, with relay client cleanup on failure/timeout.
- Rework renderer-side SSH auto-reconnect to use retry w/ exponential backoff and per-host reconnect state; add a unit test covering the SSH hang/timeout scenario.
Show a summary per file
| File | Description |
|---|---|
| src/vs/sessions/contrib/remoteAgentHost/browser/remoteAgentHost.contribution.ts | Adds per-host SSH reconnect state and exponential-backoff retry scheduling in the Agents window contribution. |
| src/vs/platform/agentHost/test/node/sshRemoteAgentHostService.test.ts | Adds a regression test that simulates a stuck relay creation and asserts reconnect times out and ends the SSH client. |
| src/vs/platform/agentHost/node/tunnelAgentHostService.ts | Wraps dev-tunnels SDK connection steps with timeouts and ensures relay client disposal on failure. |
| src/vs/platform/agentHost/node/sshRemoteAgentHostService.ts | Wraps _createWebSocketRelay with raceTimeout on reconnect to avoid indefinite hangs and force cleanup/retry. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 2
- SSHReconnectState.scheduleRetry: clear _timer.value when the timer fires so hasPendingTimer reflects reality after the handler runs.
- tunnelAgentHostService.withTimeout: switch to raceTimeout so the timer is cleared in finally on success (no stray timers per step).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After laptop sleep + network change the SSH/tunnel transport's underlying TCP can be half-open: writes succeed locally but never deliver, and no FIN/RST is ever observed. The agent host protocol client used a plain WebSocket with no keepalive/timeout, so subsequent requests just hung forever. Reloading the renderer didn't help: the dead transport state lived in the shared process.

Add a no-ping watchdog at RemoteAgentHostProtocolClient that mirrors PersistentProtocol's _recvAckCheck mechanism:
- Track a sentAt timestamp per pending request and _lastReadTime for the most recent inbound message of any kind.
- Every 5s, if there's an outstanding request, no inbound traffic for 20s, and the oldest pending request is older than 20s, force-close the connection so the existing reconnect machinery takes over.
- Idle connectio...
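A rough sketch of the watchdog rule in this commit message follows. The 5s check interval and 20s thresholds come from the text; everything else (RequestWatchdog, startWatchdog, the method names, and the exact comparison operators) is assumed for illustration and is not the RemoteAgentHostProtocolClient code.

```typescript
// Hypothetical sketch of a no-ping watchdog; names are illustrative only.
const CHECK_INTERVAL_MS = 5000;
const STALE_AFTER_MS = 20000;

class RequestWatchdog {
	// request id -> time the request was sent
	private readonly pendingSentAt = new Map<number, number>();
	private lastReadTime: number;

	constructor(now: number) {
		this.lastReadTime = now;
	}

	onRequestSent(id: number, now: number): void {
		this.pendingSentAt.set(id, now);
	}

	onResponse(id: number): void {
		this.pendingSentAt.delete(id);
	}

	// Any inbound message counts as proof of life, not just responses.
	onInboundMessage(now: number): void {
		this.lastReadTime = now;
	}

	shouldForceClose(now: number): boolean {
		if (this.pendingSentAt.size === 0) {
			return false; // no outstanding request, so nothing is overdue
		}
		let oldest = Infinity;
		this.pendingSentAt.forEach(sentAt => { oldest = Math.min(oldest, sentAt); });
		const silentFor = now - this.lastReadTime;
		const oldestAge = now - oldest;
		return silentFor >= STALE_AFTER_MS && oldestAge >= STALE_AFTER_MS;
	}
}

// Wiring sketch: poll on an interval and return a function that stops the polling.
function startWatchdog(watchdog: RequestWatchdog, forceClose: () => void): () => void {
	const handle = setInterval(() => {
		if (watchdog.shouldForceClose(Date.now())) {
			forceClose();
		}
	}, CHECK_INTERVAL_MS);
	return () => clearInterval(handle);
}
```

Requiring both conditions (silence and an old pending request) avoids closing a connection that is merely quiet, or one where a slow request is outstanding while other traffic still flows.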
4cbb54c to
8a05153
Compare
On a reconnect to the same address, addManagedConnection disposed the previous entry's store, which included the previous transportDisposable. That disposable calls _mainService.disconnect(connectionId). Because the new entry shares the same connectionId (e.g. ssh:host) with the just-established shared-process tunnel, the disconnect call immediately tore down the brand-new connection.

Track transportDisposable separately from the entry's store so it only runs on true removal (removeRemoteAgentHost, _removeConnection, full service dispose), not when the entry is replaced.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
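The ownership split in this commit can be sketched roughly as below. ManagedConnections, Entry, and the method names are hypothetical, not the remoteAgentHostServiceImpl types. The sketch only shows the shape of the change: replacement disposes the per-entry store, while the transport disposable (the one that would trigger a disconnect) runs only on real removal.

```typescript
// Hypothetical sketch; not the actual service implementation.
interface Disposable {
	dispose(): void;
}

interface Entry {
	store: Disposable;      // per-entry listeners, client, etc.
	transport: Disposable;  // would call disconnect(connectionId) when disposed
}

class ManagedConnections {
	private readonly entries = new Map<string, Entry>();

	// Add or replace. On replacement the old transport disposable is deliberately
	// not run, because the new entry shares the same connectionId.
	add(address: string, entry: Entry): void {
		const existing = this.entries.get(address);
		if (existing) {
			existing.store.dispose();
		}
		this.entries.set(address, entry);
	}

	// True removal: tear down both the entry's store and the transport.
	remove(address: string): void {
		const existing = this.entries.get(address);
		if (!existing) {
			return;
		}
		this.entries.delete(address);
		existing.store.dispose();
		existing.transport.dispose();
	}
}
```

In this sketch the replaced entry's transport disposable is simply dropped, so anything it owns besides the disconnect call would not be cleaned up on replacement.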
When the protocol client is closed (e.g. by the watchdog forcing a close on a silently-dead transport) the client may live on for a moment before being replaced by addManagedConnection. During that window:
- The interval timer would keep ticking pointlessly.
- The shared SSHRelayTransport message source feeds both the old and new transports for the same connectionId, so the old client could see late responses for requests that were already rejected.

Cancel the watchdog inside _handleClose and drop incoming messages in _handleMessage when _isClosed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
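As a minimal illustration of the close-time cleanup described here, the sketch below uses a hypothetical ProtocolClientSketch class (not the real protocol client). Closing is idempotent, stops the interval, and later messages are ignored.

```typescript
// Hypothetical sketch of close-time cleanup; names are illustrative only.
class ProtocolClientSketch {
	private isClosed = false;
	private watchdogHandle: ReturnType<typeof setInterval> | undefined;
	readonly received: string[] = [];
	closeCount = 0;

	constructor(checkIntervalMs: number) {
		// Stand-in for the periodic watchdog check.
		this.watchdogHandle = setInterval(() => { /* watchdog check would go here */ }, checkIntervalMs);
	}

	get watchdogActive(): boolean {
		return this.watchdogHandle !== undefined;
	}

	handleMessage(message: string): void {
		if (this.isClosed) {
			return; // late message from a shared source: drop it
		}
		this.received.push(message);
	}

	handleClose(): void {
		if (this.isClosed) {
			return; // already closed: no double-close
		}
		this.isClosed = true;
		this.closeCount++;
		if (this.watchdogHandle !== undefined) {
			clearInterval(this.watchdogHandle);
			this.watchdogHandle = undefined;
		}
	}
}
```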
Copilot's findings
Comments suppressed due to low confidence (1)
src/vs/platform/agentHost/browser/remoteAgentHostServiceImpl.ts:250
- When replacing an existing managed entry in addManagedConnection, the previous entry’s transportDisposable is intentionally not disposed. If that disposable owns resources (e.g. renderer-side handles/event listeners), this becomes a leak on repeated reconnects/replacements because it is neither disposed nor retained by the new entry. Consider making the transport teardown transferable (e.g. generation-guarded disposable that can be disposed safely on replacement, or a per-address MutableDisposable that updates ownership) so the previous disposable can be cleaned up without disconnecting the freshly-established tunnel.
const existingEntry = this._entries.get(address);
if (existingEntry) {
this._entries.delete(address);
existingEntry.store.dispose();
}
- Files reviewed: 8/8 changed files
- Comments generated: 0 new
- Watchdog cleanup-on-close: no double-close, late inbound messages dropped.
- Tunnel withTimeout helper: success, error passthrough, hang→step-named timeout.
- SSHReconnectState: schedule/fire, cancel, replace-on-reschedule, dispose, resetForResume, and hasPendingTimer clears after the timer fires.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dmitrivMS
approved these changes
May 11, 2026
Problem
When a remote Agent Host SSH or dev-tunnel connection silently dies (TCP half-open before ssh2 / dev-tunnels keepalives detect it), the SDK calls used to (re)create the relay can hang forever:
- client.forwardOut(...) inside _createWebSocketRelay has no timeout and the callback never fires when the SSH client is unresponsive.
- relayClient.connect(), waitForForwardedPort(), connectToForwardedPort(), and the WebSocket 'open' event have no timeouts.

Because connect(replaceRelay=true) reuses the existing dead sshClient and the renderer's _reconnectSSHEntries was single-shot, the per-host pending flag stayed set and auto-reconnect was effectively disabled for the lifetime of the shared process. Reloading the window didn't help: the dead client lives in the shared-process _connections map. Only quitting and restarting the app recovered.

Fix
- sshRemoteAgentHostService: bound _createWebSocketRelay in connect(replaceRelay=true) with raceTimeout (60s, slightly above ssh2's keepalive failure window). On timeout the existing catch ends the dead sshClient and purges it from _connections, so the next attempt starts fresh.
- tunnelAgentHostService: bound the four hangable dev-tunnels SDK calls (relay connect, waitForForwardedPort, connectToForwardedPort, ws open) with per-step timeouts; dispose relayClient on any timeout/failure so we don't leak it.
- remoteAgentHost.contribution: rewrite _reconnectSSHEntries with exponential-backoff retry mirroring the proven tunnel pattern (1s → 30s, max 10 attempts then pause). Resumes on config / connection-change events. Per-host state lives in a single SSHReconnectState with a MutableDisposable timer (disposableTimeout), owned by a DisposableMap so disposing the contribution (or removing a host) cancels pending timers automatically.

Validation
- npm run compile-check-ts-native clean.
- Added a unit test (reconnect rejects with timeout when relay creation hangs) that simulates a stuck relay via a hangRelayCreationOnCall hook and verifies the timeout fires + sshClient.end() is called.

How to verify by hand
The most deterministic synthetic repro for SSH:
Realistic real-world repro: close laptop lid for >10 min, wake, and confirm the host recovers without restarting the app.
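For reference, the retry schedule described in the PR (1s → 30s, max 10 attempts then pause, timer cleared when it fires) might look roughly like the sketch below. The doubling factor is an assumption, since the PR only says "exponential", and ReconnectStateSketch and backoffDelay are hypothetical names. The real SSHReconnectState uses MutableDisposable and disposableTimeout rather than raw timers.

```typescript
// Hypothetical sketch of the backoff schedule; the factor of 2 is assumed.
const INITIAL_DELAY_MS = 1000;
const MAX_DELAY_MS = 30000;
const MAX_ATTEMPTS = 10;

// Delay before the given 0-based attempt, or undefined once attempts are exhausted (paused).
function backoffDelay(attempt: number): number | undefined {
	if (attempt >= MAX_ATTEMPTS) {
		return undefined;
	}
	return Math.min(INITIAL_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
}

class ReconnectStateSketch {
	private attempt = 0;
	private timer: ReturnType<typeof setTimeout> | undefined;

	get hasPendingTimer(): boolean {
		return this.timer !== undefined;
	}

	// Schedules the next retry; returns false when paused after MAX_ATTEMPTS.
	scheduleRetry(run: () => void): boolean {
		const delay = backoffDelay(this.attempt);
		if (delay === undefined) {
			return false;
		}
		this.attempt++;
		this.cancel(); // rescheduling replaces any pending timer
		this.timer = setTimeout(() => {
			this.timer = undefined; // clear first so hasPendingTimer is accurate inside run()
			run();
		}, delay);
		return true;
	}

	cancel(): void {
		if (this.timer !== undefined) {
			clearTimeout(this.timer);
			this.timer = undefined;
		}
	}

	// Called on a resume event (e.g. a config or connection change) to restart the schedule.
	resetForResume(): void {
		this.cancel();
		this.attempt = 0;
	}
}
```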