Skip to content

Fix remote agent host reconnect hang for SSH and tunnel paths#315552

Merged
roblourens merged 9 commits into
mainfrom
agents/vsckb-implement-i-m-having-some-kind-of-2a7030e7
May 11, 2026
Merged

Fix remote agent host reconnect hang for SSH and tunnel paths#315552
roblourens merged 9 commits into
mainfrom
agents/vsckb-implement-i-m-having-some-kind-of-2a7030e7

Conversation

@roblourens
Copy link
Copy Markdown
Member

Problem

When a remote Agent Host SSH or dev-tunnel connection silently dies (TCP half-open before ssh2 / dev-tunnels keepalives detect it), the SDK calls used to (re)create the relay can hang forever:

  • SSH: client.forwardOut(...) inside _createWebSocketRelay has no timeout and the callback never fires when the SSH client is unresponsive.
  • Tunnel: relayClient.connect(), waitForForwardedPort(), connectToForwardedPort(), and the WebSocket 'open' event have no timeouts.

Because connect(replaceRelay=true) reuses the existing dead sshClient and the renderer's _reconnectSSHEntries was single-shot, the per-host pending flag stayed set and auto-reconnect was effectively disabled for the lifetime of the shared process. Reloading the window didn't help — the dead client lives in the shared-process _connections map. Only quitting and restarting the app recovered.

Fix

Layer Change
sshRemoteAgentHostService Wrap _createWebSocketRelay in connect(replaceRelay=true) with raceTimeout (60s, slightly above ssh2's keepalive failure window). On timeout the existing catch ends the dead sshClient and purges it from _connections, so the next attempt starts fresh.
tunnelAgentHostService Wrap the four hangable dev-tunnels SDK calls with per-step timeouts (30s each); dispose relayClient on any timeout/failure so we don't leak it.
remoteAgentHost.contribution Rewrite _reconnectSSHEntries with exponential-backoff retry mirroring the proven tunnel pattern (1s → 30s, max 10 attempts then pause). Resumes on config / connection-change events. Per-host state lives in a single SSHReconnectState with a MutableDisposable timer (disposableTimeout), owned by a DisposableMap so disposing the contribution (or removing a host) cancels pending timers automatically.

Validation

  • ✅ TypeScript: npm run compile-check-ts-native clean.
  • ✅ All 46 SSH main service tests pass, including a new test (reconnect rejects with timeout when relay creation hangs) that simulates a stuck relay via a hangRelayCreationOnCall hook and verifies the timeout fires + sshClient.end() is called.
  • ✅ All 167 RemoteAgentHost (renderer) tests pass.
  • ⚠️ No live repro included — the synthetic-network setup needed to deterministically reproduce the silent-death scenario is non-trivial. The unit test exercises the exact main-process code path that hangs in production.

How to verify by hand

The most deterministic synthetic repro for SSH:

# 1. Connect an SSH agent host (e.g. to a localhost SSH alias).
# 2. Find the per-connection sshd child:
lsof -i tcp:22 -sTCP:ESTABLISHED -P -n
# 3. Freeze it (simulates a dead network without closing the socket):
sudo kill -STOP <sshd-child-pid>
# 4. Trigger any host activity. Watch the agents output channel for:
#      "SSH relay creation timed out after 60000ms"
#      "Scheduling SSH reconnect ... in 1000ms (attempt 2/10)"
# 5. Restore:
sudo kill -CONT <sshd-child-pid>
# Within the next backoff window the host should reconnect.

Realistic real-world repro: close laptop lid for >10 min, wake, and confirm the host recovers without restarting the app.

When an SSH or tunnel connection silently dies (TCP half-open before
ssh2/dev-tunnels keepalives detect it), the SDK calls used to (re)create
the relay would hang forever. The renderer's reconnect await would never
settle, leaving the per-host pending flag set and effectively disabling
auto-reconnect for the lifetime of the shared process. The user-visible
symptom: reloading the window doesn't help, only quitting and restarting
the app does.

Fixes:

- sshRemoteAgentHostService: bound _createWebSocketRelay in
  connect(replaceRelay=true) with raceTimeout. On timeout the existing
  catch tears down the dead sshClient so the next attempt starts fresh.
- tunnelAgentHostService: bound the four hangable dev-tunnels SDK calls
  (relay connect, waitForForwardedPort, connectToForwardedPort, ws open)
  with per-step timeouts; dispose relayClient on failure so we don't
  leak it.
- remoteAgentHost.contribution: rewrite _reconnectSSHEntries with
  exponential-backoff retry mirroring the tunnel pattern. Per-host
  state lives in a single SSHReconnectState with a MutableDisposable
  timer, owned by a DisposableMap so disposal of the contribution (or
  removal of a host) cancels pending timers automatically.

Adds a unit test that simulates a stuck relay via a hangRelayCreationOnCall
hook and verifies the timeout fires and disposes the SSH client.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 10, 2026 17:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes cases where remote Agent Host reconnect could hang indefinitely (SSH relay creation and dev-tunnels connect steps) after a silent network drop, which in turn could permanently disable auto-reconnect until the app was restarted.

Changes:

  • Add a bounded timeout to SSH relay creation on the replaceRelay reconnect path so a dead SSH client can be torn down and retried.
  • Add per-step timeouts (connect, waitForForwardedPort, connectToForwardedPort, WebSocket open) to tunnel connect flow, with relay client cleanup on failure/timeout.
  • Rework renderer-side SSH auto-reconnect to use retry w/ exponential backoff and per-host reconnect state; add a unit test covering the SSH hang/timeout scenario.
Show a summary per file
File Description
src/vs/sessions/contrib/remoteAgentHost/browser/remoteAgentHost.contribution.ts Adds per-host SSH reconnect state and exponential-backoff retry scheduling in the Agents window contribution.
src/vs/platform/agentHost/test/node/sshRemoteAgentHostService.test.ts Adds a regression test that simulates a stuck relay creation and asserts reconnect times out and ends the SSH client.
src/vs/platform/agentHost/node/tunnelAgentHostService.ts Wraps dev-tunnels SDK connection steps with timeouts and ensures relay client disposal on failure.
src/vs/platform/agentHost/node/sshRemoteAgentHostService.ts Wraps _createWebSocketRelay with raceTimeout on reconnect to avoid indefinite hangs and force cleanup/retry.

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 2

Comment thread src/vs/sessions/contrib/remoteAgentHost/browser/remoteAgentHost.contribution.ts Outdated
Comment thread src/vs/platform/agentHost/node/tunnelAgentHostService.ts Outdated
- SSHReconnectState.scheduleRetry: clear _timer.value when the timer
  fires so hasPendingTimer reflects reality after the handler runs.
- tunnelAgentHostService.withTimeout: switch to raceTimeout so the
  timer is cleared in finally on success (no stray timers per step).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After laptop sleep + network change the SSH/tunnel transport's underlying
TCP can be half-open: writes succeed locally but never deliver, and no
FIN/RST is ever observed. The agent host protocol client used a plain
WebSocket with no keepalive/timeout, so subsequent requests just hung
forever. Reloading the renderer didn't help — the dead transport state
lived in the shared process.

Add a no-ping watchdog at RemoteAgentHostProtocolClient that mirrors
PersistentProtocol's _recvAckCheck mechanism:
- Track a sentAt timestamp per pending request and _lastReadTime for
  the most recent inbound message of any kind.
- Every 5s, if there's an outstanding request, no inbound traffic for
  20s, and the oldest pending request is older than 20s, force-close
  the connection so the existing reconnect machinery takes over.
- Idle connectio
After laptop sleep + network change the SSH/tunnel transport's underlying
TCP can be half-open: writes succeed locally but never deliver, and no
FIN/RST is ever observed. The agent o aTCP can be half-open: writes succeed locally but never deliver, and no
FerFIN/RST is ever observed. The agent host protocol client used a plainutWebSocket with no keepalive/timeout, so subsequent requests just huneaforever. Reloading the renderer didn't help — the dead transport s21lived in the shared process.

Add a no-ping watchdog  cd /Users/roblou/code/vscode.worktrees/agents-vsckb-implement-i-m-having-some-kind-of-2a7030e7 && git log --oneline -3 && git status --short
 cd /Users/roblou/code/vscode.worktrees/agents-vsckb-implement-i-m-having-some-kind-of-2a7030e7 && git log --oneline -3
 cd /Users/roblou/code/vscode.worktrees/agents-vsckb-implement-i-m-having-some-kind-of-2a7030e7 && git log --oneline -3 && echo --- && git status --short
 tail -200 /var/folders/ss/g5zgxl3j787811nn36my74s80000gn/T/1778448442759-copilot-tool-output-k6w908.txt | head -100

 echo hello
 grep -E watchdog
@roblourens roblourens force-pushed the agents/vsckb-implement-i-m-having-some-kind-of-2a7030e7 branch from 4cbb54c to 8a05153 Compare May 10, 2026 21:37
@roblourens roblourens requested a review from Copilot May 10, 2026 21:45
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 1

Comment thread src/vs/platform/agentHost/browser/remoteAgentHostProtocolClient.ts
roblourens and others added 2 commits May 10, 2026 15:05
On a reconnect to the same address, addManagedConnection disposed the
previous entry's store, which included the previous transportDisposable.
That disposable calls _mainService.disconnect(connectionId). Because the
new entry shares the same connectionId (e.g. ssh:host) with the just-
established shared-process tunnel, the disconnect call immediately tore
down the brand-new connection.

Track transportDisposable separately from the entry's store so it only
runs on true removal (removeRemoteAgentHost, _removeConnection, full
service dispose), not when the entry is replaced.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the protocol client is closed (e.g. by the watchdog forcing a close
on a silently-dead transport) the client may live on for a moment before
being replaced by addManagedConnection. During that window:

- The interval timer would keep ticking pointlessly.
- The shared SSHRelayTransport message source feeds both the old and new
  transports for the same connectionId, so the old client could see late
  responses for requests that were already rejected.

Cancel the watchdog inside _handleClose and drop incoming messages in
_handleMessage when _isClosed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

Comments suppressed due to low confidence (1)

src/vs/platform/agentHost/browser/remoteAgentHostServiceImpl.ts:250

  • When replacing an existing managed entry in addManagedConnection, the previous entry’s transportDisposable is intentionally not disposed. If that disposable owns resources (e.g. renderer-side handles/event listeners), this becomes a leak on repeated reconnects/replacements because it is neither disposed nor retained by the new entry. Consider making the transport teardown transferable (e.g. generation-guarded disposable that can be disposed safely on replacement, or a per-address MutableDisposable that updates ownership) so the previous disposable can be cleaned up without disconnecting the freshly-established tunnel.
		const existingEntry = this._entries.get(address);
		if (existingEntry) {
			this._entries.delete(address);
			existingEntry.store.dispose();
		}
  • Files reviewed: 8/8 changed files
  • Comments generated: 0 new

@roblourens roblourens marked this pull request as ready for review May 10, 2026 23:04
@roblourens roblourens enabled auto-merge (squash) May 10, 2026 23:04
roblourens and others added 2 commits May 10, 2026 16:13
- Watchdog cleanup-on-close: no double-close, late inbound messages dropped.
- Tunnel withTimeout helper: success, error passthrough, hang→step-named timeout.
- SSHReconnectState: schedule/fire, cancel, replace-on-reschedule, dispose,
  resetForResume, and hasPendingTimer clears after the timer fires.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 11, 2026

Base: 37ad80f8 Current: 1eaaa5ab

No screenshot changes.

@roblourens roblourens merged commit f91a396 into main May 11, 2026
39 of 40 checks passed
@roblourens roblourens deleted the agents/vsckb-implement-i-m-having-some-kind-of-2a7030e7 branch May 11, 2026 02:26
@vs-code-engineering vs-code-engineering Bot added this to the 1.120.0 milestone May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants