Fix remote agent host reconnect hang for SSH and tunnel paths by roblourens · Pull Request #315552 · microsoft/vscode

roblourens · 2026-05-10T17:28:02Z

Problem

When a remote Agent Host SSH or dev-tunnel connection silently dies (TCP half-open before ssh2 / dev-tunnels keepalives detect it), the SDK calls used to (re)create the relay can hang forever:

SSH: client.forwardOut(...) inside _createWebSocketRelay has no timeout and the callback never fires when the SSH client is unresponsive.
Tunnel: relayClient.connect(), waitForForwardedPort(), connectToForwardedPort(), and the WebSocket 'open' event have no timeouts.

Because connect(replaceRelay=true) reuses the existing dead sshClient and the renderer's _reconnectSSHEntries was single-shot, the per-host pending flag stayed set and auto-reconnect was effectively disabled for the lifetime of the shared process. Reloading the window didn't help — the dead client lives in the shared-process _connections map. Only quitting and restarting the app recovered.

Fix

Layer	Change
`sshRemoteAgentHostService`	Wrap `_createWebSocketRelay` in `connect(replaceRelay=true)` with `raceTimeout` (60s, slightly above ssh2's keepalive failure window). On timeout the existing catch ends the dead `sshClient` and purges it from `_connections`, so the next attempt starts fresh.
`tunnelAgentHostService`	Wrap the four hangable dev-tunnels SDK calls with per-step timeouts (30s each); dispose `relayClient` on any timeout/failure so we don't leak it.
`remoteAgentHost.contribution`	Rewrite `_reconnectSSHEntries` with exponential-backoff retry mirroring the proven tunnel pattern (1s → 30s, max 10 attempts then pause). Resumes on config / connection-change events. Per-host state lives in a single `SSHReconnectState` with a `MutableDisposable` timer (`disposableTimeout`), owned by a `DisposableMap` so disposing the contribution (or removing a host) cancels pending timers automatically.

Validation

✅ TypeScript: npm run compile-check-ts-native clean.
✅ All 46 SSH main service tests pass, including a new test (reconnect rejects with timeout when relay creation hangs) that simulates a stuck relay via a hangRelayCreationOnCall hook and verifies the timeout fires + sshClient.end() is called.
✅ All 167 RemoteAgentHost (renderer) tests pass.
⚠️ No live repro included — the synthetic-network setup needed to deterministically reproduce the silent-death scenario is non-trivial. The unit test exercises the exact main-process code path that hangs in production.

How to verify by hand

The most deterministic synthetic repro for SSH:

# 1. Connect an SSH agent host (e.g. to a localhost SSH alias).
# 2. Find the per-connection sshd child:
lsof -i tcp:22 -sTCP:ESTABLISHED -P -n
# 3. Freeze it (simulates a dead network without closing the socket):
sudo kill -STOP <sshd-child-pid>
# 4. Trigger any host activity. Watch the agents output channel for:
#      "SSH relay creation timed out after 60000ms"
#      "Scheduling SSH reconnect ... in 1000ms (attempt 2/10)"
# 5. Restore:
sudo kill -CONT <sshd-child-pid>
# Within the next backoff window the host should reconnect.

Realistic real-world repro: close laptop lid for >10 min, wake, and confirm the host recovers without restarting the app.

When an SSH or tunnel connection silently dies (TCP half-open before ssh2/dev-tunnels keepalives detect it), the SDK calls used to (re)create the relay would hang forever. The renderer's reconnect await would never settle, leaving the per-host pending flag set and effectively disabling auto-reconnect for the lifetime of the shared process. The user-visible symptom: reloading the window doesn't help, only quitting and restarting the app does. Fixes: - sshRemoteAgentHostService: bound _createWebSocketRelay in connect(replaceRelay=true) with raceTimeout. On timeout the existing catch tears down the dead sshClient so the next attempt starts fresh. - tunnelAgentHostService: bound the four hangable dev-tunnels SDK calls (relay connect, waitForForwardedPort, connectToForwardedPort, ws open) with per-step timeouts; dispose relayClient on failure so we don't leak it. - remoteAgentHost.contribution: rewrite _reconnectSSHEntries with exponential-backoff retry mirroring the tunnel pattern. Per-host state lives in a single SSHReconnectState with a MutableDisposable timer, owned by a DisposableMap so disposal of the contribution (or removal of a host) cancels pending timers automatically. Adds a unit test that simulates a stuck relay via a hangRelayCreationOnCall hook and verifies the timeout fires and disposes the SSH client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Fixes cases where remote Agent Host reconnect could hang indefinitely (SSH relay creation and dev-tunnels connect steps) after a silent network drop, which in turn could permanently disable auto-reconnect until the app was restarted.

Changes:

Add a bounded timeout to SSH relay creation on the replaceRelay reconnect path so a dead SSH client can be torn down and retried.
Add per-step timeouts (connect, waitForForwardedPort, connectToForwardedPort, WebSocket open) to tunnel connect flow, with relay client cleanup on failure/timeout.
Rework renderer-side SSH auto-reconnect to use retry w/ exponential backoff and per-host reconnect state; add a unit test covering the SSH hang/timeout scenario.

Show a summary per file

File	Description
src/vs/sessions/contrib/remoteAgentHost/browser/remoteAgentHost.contribution.ts	Adds per-host SSH reconnect state and exponential-backoff retry scheduling in the Agents window contribution.
src/vs/platform/agentHost/test/node/sshRemoteAgentHostService.test.ts	Adds a regression test that simulates a stuck relay creation and asserts reconnect times out and ends the SSH client.
src/vs/platform/agentHost/node/tunnelAgentHostService.ts	Wraps dev-tunnels SDK connection steps with timeouts and ensures relay client disposal on failure.
src/vs/platform/agentHost/node/sshRemoteAgentHostService.ts	Wraps `_createWebSocketRelay` with `raceTimeout` on reconnect to avoid indefinite hangs and force cleanup/retry.

Copilot's findings

Files reviewed: 4/4 changed files
Comments generated: 2

- SSHReconnectState.scheduleRetry: clear _timer.value when the timer fires so hasPendingTimer reflects reality after the handler runs. - tunnelAgentHostService.withTimeout: switch to raceTimeout so the timer is cleared in finally on success (no stray timers per step). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

After laptop sleep + network change the SSH/tunnel transport's underlying TCP can be half-open: writes succeed locally but never deliver, and no FIN/RST is ever observed. The agent host protocol client used a plain WebSocket with no keepalive/timeout, so subsequent requests just hung forever. Reloading the renderer didn't help — the dead transport state lived in the shared process. Add a no-ping watchdog at RemoteAgentHostProtocolClient that mirrors PersistentProtocol's _recvAckCheck mechanism: - Track a sentAt timestamp per pending request and _lastReadTime for the most recent inbound message of any kind. - Every 5s, if there's an outstanding request, no inbound traffic for 20s, and the oldest pending request is older than 20s, force-close the connection so the existing reconnect machinery takes over. - Idle connectio After laptop sleep + network change the SSH/tunnel transport's underlying TCP can be half-open: writes succeed locally but never deliver, and no FIN/RST is ever observed. The agent o aTCP can be half-open: writes succeed locally but never deliver, and no FerFIN/RST is ever observed. The agent host protocol client used a plainutWebSocket with no keepalive/timeout, so subsequent requests just huneaforever. Reloading the renderer didn't help — the dead transport s21lived in the shared process. Add a no-ping watchdog cd /Users/roblou/code/vscode.worktrees/agents-vsckb-implement-i-m-having-some-kind-of-2a7030e7 && git log --oneline -3 && git status --short cd /Users/roblou/code/vscode.worktrees/agents-vsckb-implement-i-m-having-some-kind-of-2a7030e7 && git log --oneline -3 cd /Users/roblou/code/vscode.worktrees/agents-vsckb-implement-i-m-having-some-kind-of-2a7030e7 && git log --oneline -3 && echo --- && git status --short tail -200 /var/folders/ss/g5zgxl3j787811nn36my74s80000gn/T/1778448442759-copilot-tool-output-k6w908.txt | head -100 echo hello grep -E watchdog

Copilot

Copilot's findings

Files reviewed: 6/6 changed files
Comments generated: 1

On a reconnect to the same address, addManagedConnection disposed the previous entry's store, which included the previous transportDisposable. That disposable calls _mainService.disconnect(connectionId). Because the new entry shares the same connectionId (e.g. ssh:host) with the just- established shared-process tunnel, the disconnect call immediately tore down the brand-new connection. Track transportDisposable separately from the entry's store so it only runs on true removal (removeRemoteAgentHost, _removeConnection, full service dispose), not when the entry is replaced. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When the protocol client is closed (e.g. by the watchdog forcing a close on a silently-dead transport) the client may live on for a moment before being replaced by addManagedConnection. During that window: - The interval timer would keep ticking pointlessly. - The shared SSHRelayTransport message source feeds both the old and new transports for the same connectionId, so the old client could see late responses for requests that were already rejected. Cancel the watchdog inside _handleClose and drop incoming messages in _handleMessage when _isClosed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…of-2a7030e7

Copilot

Copilot's findings

Comments suppressed due to low confidence (1)

src/vs/platform/agentHost/browser/remoteAgentHostServiceImpl.ts:250

When replacing an existing managed entry in addManagedConnection, the previous entry’s transportDisposable is intentionally not disposed. If that disposable owns resources (e.g. renderer-side handles/event listeners), this becomes a leak on repeated reconnects/replacements because it is neither disposed nor retained by the new entry. Consider making the transport teardown transferable (e.g. generation-guarded disposable that can be disposed safely on replacement, or a per-address MutableDisposable that updates ownership) so the previous disposable can be cleaned up without disconnecting the freshly-established tunnel.

		const existingEntry = this._entries.get(address);
		if (existingEntry) {
			this._entries.delete(address);
			existingEntry.store.dispose();
		}

Files reviewed: 8/8 changed files
Comments generated: 0 new

- Watchdog cleanup-on-close: no double-close, late inbound messages dropped. - Tunnel withTimeout helper: success, error passthrough, hang→step-named timeout. - SSHReconnectState: schedule/fire, cancel, replace-on-reschedule, dispose, resetForResume, and hasPendingTimer clears after the timer fires. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…of-2a7030e7

github-actions · 2026-05-11T00:16:29Z

Base: 37ad80f8 Current: 1eaaa5ab

~~No screenshot changes.~~

…of-2a7030e7

Copilot AI review requested due to automatic review settings May 10, 2026 17:28

Copilot started reviewing on behalf of roblourens May 10, 2026 17:28 View session

Copilot AI reviewed May 10, 2026

View reviewed changes

Comment thread src/vs/sessions/contrib/remoteAgentHost/browser/remoteAgentHost.contribution.ts Outdated

Comment thread src/vs/platform/agentHost/node/tunnelAgentHostService.ts Outdated

vs-code-engineering Bot assigned roblourens May 10, 2026

roblourens force-pushed the agents/vsckb-implement-i-m-having-some-kind-of-2a7030e7 branch from 4cbb54c to 8a05153 Compare May 10, 2026 21:37

roblourens requested a review from Copilot May 10, 2026 21:45

Copilot started reviewing on behalf of roblourens May 10, 2026 21:46 View session

Copilot AI reviewed May 10, 2026

View reviewed changes

Comment thread src/vs/platform/agentHost/browser/remoteAgentHostProtocolClient.ts

roblourens and others added 2 commits May 10, 2026 15:05

roblourens requested a review from Copilot May 10, 2026 22:16

Copilot started reviewing on behalf of roblourens May 10, 2026 22:18 View session

Merge branch 'main' into agents/vsckb-implement-i-m-having-some-kind-…

7d83b55

…of-2a7030e7

Copilot AI reviewed May 10, 2026

View reviewed changes

roblourens marked this pull request as ready for review May 10, 2026 23:04

roblourens enabled auto-merge (squash) May 10, 2026 23:04

roblourens and others added 2 commits May 10, 2026 16:13

Merge branch 'main' into agents/vsckb-implement-i-m-having-some-kind-…

0347ffa

…of-2a7030e7

dmitrivMS approved these changes May 11, 2026

View reviewed changes

Merge branch 'main' into agents/vsckb-implement-i-m-having-some-kind-…

c6ef76e

…of-2a7030e7

roblourens merged commit f91a396 into main May 11, 2026
39 of 40 checks passed

roblourens deleted the agents/vsckb-implement-i-m-having-some-kind-of-2a7030e7 branch May 11, 2026 02:26

vs-code-engineering Bot added this to the 1.120.0 milestone May 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix remote agent host reconnect hang for SSH and tunnel paths#315552

Fix remote agent host reconnect hang for SSH and tunnel paths#315552
roblourens merged 9 commits into
mainfrom
agents/vsckb-implement-i-m-having-some-kind-of-2a7030e7

roblourens commented May 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

github-actions Bot commented May 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

roblourens commented May 10, 2026

Problem

Fix

Validation

How to verify by hand

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Copilot's findings

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Copilot's findings

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented May 11, 2026 •

edited

Loading