
Improve tunnel connection reliability: malformed-frame hardening, auth short-circuit, host-online resume#311518

Merged
osortega merged 3 commits into main from
agents/connection-stack-comparison-vscode-codespaces
Apr 21, 2026


@osortega osortega commented Apr 20, 2026

Summary

Improves tunnel agent host connection reliability with targeted fixes for specific failure modes identified during a deep comparison of the vscode.dev/agents connection stack vs codespaces-web.

Companion PR: microsoft/vscode-dev#1410 (server-side structured close codes + SDK client pool)

Changes

Transport malformed-frame hardening

All three transports (WebSocketClientTransport, TunnelConnectionTransport, TunnelRelayTransport) now detect and surface malformed JSON frames instead of silently dropping them:

  • Warn-log the first 5 malformed frames per connection, with a data preview for diagnostics
  • Force-close the transport after more than 10 malformed frames, so that a protocol mismatch or corrupt relay surfaces as a reconnect instead of a silent hang
  • Shared constants in new transportConstants.ts
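
The counting behavior described in this list can be sketched roughly as follows. This is an illustration only, not the code from the PR: the class name `MalformedFrameGuard`, the constant names, and the 80-character preview length are assumptions; only the thresholds (5 logged, force-close after more than 10) come from the description above.

```typescript
// Hypothetical constant names; the PR only states that shared constants
// live in transportConstants.ts.
const MALFORMED_FRAME_LOG_LIMIT = 5;        // warn-log the first 5 per connection
const MALFORMED_FRAME_CLOSE_THRESHOLD = 10; // force-close once the count exceeds 10

type FrameResult =
	| { kind: 'message'; value: unknown }
	| { kind: 'dropped' }
	| { kind: 'closed' };

// One guard instance per connection (assumed lifetime).
class MalformedFrameGuard {
	private count = 0;
	private readonly warn: (message: string) => void;
	private readonly forceClose: () => void;

	constructor(warn: (message: string) => void, forceClose: () => void) {
		this.warn = warn;
		this.forceClose = forceClose;
	}

	handle(raw: string): FrameResult {
		try {
			return { kind: 'message', value: JSON.parse(raw) };
		} catch {
			this.count++;
			if (this.count <= MALFORMED_FRAME_LOG_LIMIT) {
				// Preview length of 80 characters is an assumption.
				this.warn(`malformed frame #${this.count}: ${raw.slice(0, 80)}`);
			}
			if (this.count > MALFORMED_FRAME_CLOSE_THRESHOLD) {
				// Surface the problem as a close so the reconnect loop takes over.
				this.forceClose();
				return { kind: 'closed' };
			}
			return { kind: 'dropped' };
		}
	}
}
```

Capping the log output keeps a corrupt relay from flooding the log, while the close threshold turns a persistent mismatch into a visible reconnect.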

Auth-error short-circuit

_categorizeError now distinguishes authExpired (401/403, expired tokens) from generic auth errors. Both immediately pause reconnects instead of burning 10 retry slots with guaranteed-failing attempts. Resume is driven by onDidChangeSessions when a fresh GitHub session appears.
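
As a rough sketch of the short-circuit, the decision after a failed connect might look like the following. The function `decideAfterFailure`, the `ReconnectDecision` shape, and the `'network'` and `'other'` category names are hypothetical; the only facts taken from the description are the two auth categories and the 10 retry slots.

```typescript
// 'authExpired' and 'auth' come from the PR description; the other members
// are placeholders for whatever non-auth categories exist.
type ErrorCategory = 'authExpired' | 'auth' | 'network' | 'other';

const MAX_RECONNECT_ATTEMPTS = 10; // the "10 retry slots" mentioned above

type ReconnectDecision =
	| { action: 'pause'; reason: 'authExpired' | 'auth' }
	| { action: 'retry'; attempt: number }
	| { action: 'giveUp' };

function decideAfterFailure(category: ErrorCategory, attempt: number): ReconnectDecision {
	// Auth failures cannot succeed until a fresh session exists, so pause
	// immediately rather than consuming the retry budget.
	if (category === 'authExpired' || category === 'auth') {
		return { action: 'pause', reason: category };
	}
	if (attempt >= MAX_RECONNECT_ATTEMPTS) {
		return { action: 'giveUp' };
	}
	return { action: 'retry', attempt: attempt + 1 };
}
```

In this sketch a paused tunnel stays paused until something external resumes it, which in the PR is a session change reported by onDidChangeSessions.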

Host-online auto-resume

_silentStatusCheck detects when a tunnel paused for host-offline has its host come back online, and auto-resumes reconnect without requiring a wake/visibility event. Covers the common "laptop came back, remote host came back first" scenario.
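
A minimal sketch of the resume decision, assuming the status check yields a set of online tunnel IDs. The `PausedTunnel` shape, the `'hostOffline'` reason string, and the helper name `tunnelsToResume` are all hypothetical and chosen for illustration.

```typescript
// Hypothetical pause reasons; only the host-offline case is auto-resumed here.
type PauseReason = 'hostOffline' | 'auth' | 'authExpired';

interface PausedTunnel {
	readonly tunnelId: string;
	readonly reason: PauseReason;
}

// Returns the IDs of tunnels that were paused because their host was offline
// and whose host now appears in the online set.
function tunnelsToResume(paused: readonly PausedTunnel[], onlineTunnelIds: ReadonlySet<string>): string[] {
	return paused
		.filter(t => t.reason === 'hostOffline' && onlineTunnelIds.has(t.tunnelId))
		.map(t => t.tunnelId);
}
```

Tunnels paused for auth reasons are deliberately left alone in this sketch, since a host coming back online does not fix a missing or expired session.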

Session-removal cleanup

Reacts to GitHub auth session removal by tearing down matching tunnel state (reconnect timers, backoff, telemetry sessions) and disconnecting on a best-effort basis. Previously, signing out left stale reconnect loops running.
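
The teardown could be sketched along these lines. `TunnelRegistry`, `TunnelState`, and matching tunnels by an `authProvider` string are assumptions made for illustration; the PR does not show how tunnels are matched to the removed session.

```typescript
// Hypothetical per-tunnel state.
interface TunnelState {
	readonly authProvider: string;
	cancelReconnectTimer(): void;
	disconnect(): Promise<void>;
}

class TunnelRegistry {
	private readonly tunnels = new Map<string, TunnelState>();

	add(tunnelId: string, state: TunnelState): void {
		this.tunnels.set(tunnelId, state);
	}

	has(tunnelId: string): boolean {
		return this.tunnels.has(tunnelId);
	}

	// Tears down every tunnel tied to the provider whose session was removed
	// and returns the IDs that were cleaned up.
	handleSessionRemoved(authProvider: string): string[] {
		const tornDown: string[] = [];
		this.tunnels.forEach((state, tunnelId) => {
			if (state.authProvider !== authProvider) {
				return;
			}
			state.cancelReconnectTimer();
			// Best-effort: a failed disconnect must not block the cleanup.
			state.disconnect().catch(() => undefined);
			this.tunnels.delete(tunnelId);
			tornDown.push(tunnelId);
		});
		return tornDown;
	}
}
```

Cancelling the timer before disconnecting ensures that no reconnect attempt fires against a session that no longer exists.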

Telemetry

Added authExpired to TunnelConnectErrorCategory and TunnelConnectFailureReason types.

Validation

  • npm run compile-check-ts-native
  • npm run valid-layers-check

…h short-circuit, host-online resume, session cleanup

Transport layer:
- Add malformed-frame detection across all three transports (WebSocketClientTransport,
  TunnelConnectionTransport, TunnelRelayTransport): warn log first 5 per connection,
  force-close transport after >10 to trigger reconnect loop instead of silently
  dropping corrupt data. Shared constants in new transportConstants.ts.

Reconnect logic (tunnelAgentHost.contribution.ts):
- Auth-error short-circuit: authExpired/auth errors immediately pause reconnects
  instead of burning 10 retry slots, resume driven by onDidChangeSessions.
- Host-online auto-resume: _silentStatusCheck detects when a host-offline-paused
  tunnel comes back online and auto-resumes without needing a wake/visibility event.
- Session-removal cleanup: react to github session removal by tearing down matching
  tunnel state and best-effort disconnect.
- Richer _categorizeError: distinguish authExpired (401/403/token expired) from
  generic auth, add ECONN/ENOTFOUND/ETIMEDOUT to network category.

Telemetry:
- Add authExpired to TunnelConnectErrorCategory and TunnelConnectFailureReason types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 20, 2026 22:24

Copilot AI left a comment


Pull request overview

This PR hardens the tunnel-based remote agent host connection flow in the Agents window by surfacing malformed protocol frames, improving reconnect behavior around auth/host-offline states, and extending telemetry to distinguish auth-expired failures.

Changes:

  • Add malformed JSON frame detection/logging + forced-close thresholds across transports (shared constants in transportConstants.ts).
  • Short-circuit reconnect on auth/authExpired failures and resume on GitHub session changes; add host-online auto-resume for host-offline pauses.
  • Extend tunnel connect telemetry categories/reasons to include authExpired.
| File | Description |
| --- | --- |
| src/vs/sessions/contrib/remoteAgentHost/browser/webTunnelAgentHostService.ts | Adds malformed-frame counting/logging + forced-close for the web tunnel connection transport. |
| src/vs/sessions/contrib/remoteAgentHost/browser/tunnelAgentHost.contribution.ts | Improves reconnect pausing/resuming for auth failures, session removal, and host-offline recovery; adds rate-limiting constant. |
| src/vs/sessions/common/sessionsTelemetry.ts | Extends tunnel connect telemetry types with authExpired. |
| src/vs/platform/agentHost/electron-browser/tunnelRelayTransport.ts | Adds malformed-frame handling + forced disconnect for relay IPC transport. |
| src/vs/platform/agentHost/common/transportConstants.ts | Introduces shared malformed-frame thresholds for consistent behavior across transports. |
| src/vs/platform/agentHost/browser/webSocketClientTransport.ts | Adds malformed-frame handling + forced close for direct WebSocket transport. |

Copilot's findings

Comments suppressed due to low confidence (1)

src/vs/sessions/contrib/remoteAgentHost/browser/tunnelAgentHost.contribution.ts:623

  • _resumeReconnects currently applies the same rate-limit to the 'sessionAdded' trigger. If a user signs in shortly after a wake/visibility resume, the session-added resume can be dropped, leaving auth-paused tunnels stuck until another wake/visibility event. Consider bypassing the rate-limit for 'sessionAdded' (or using a separate timestamp) so auth refresh reliably restarts reconnects immediately.

	private _resumeReconnects(trigger: 'wake' | 'visible' | 'sessionAdded'): void {
		if (!this._configurationService.getValue<boolean>(RemoteAgentHostsEnabledSettingId)) {
			return;
		}

		// Rate-limit rapid wake/visibility events (e.g. alt-tab bursts or
		// flaky Wi-Fi toggling online/offline) so we don't hammer the relay
		// with immediate retries. This is an event-smoothing gate, not an
		// error-backoff — that's handled by `_scheduleReconnect`.
		const now = Date.now();
		if (now - this._lastResumeAt < RESUME_RATE_LIMIT_MS) {
			return;

  • Files reviewed: 6/6 changed files
  • Comments generated: 3
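
One way to read the suppressed suggestion about 'sessionAdded' is sketched below. Whether the PR adopted it is not stated here. The `ResumeGate` class and the 5000 ms value are hypothetical; only the trigger names and the `RESUME_RATE_LIMIT_MS` identifier appear in the quoted code.

```typescript
type ResumeTrigger = 'wake' | 'visible' | 'sessionAdded';

// Placeholder value; the actual constant's value is not shown in the PR text.
const RESUME_RATE_LIMIT_MS = 5000;

class ResumeGate {
	private lastResumeAt = Number.NEGATIVE_INFINITY;

	// Returns true when the caller should proceed with resuming reconnects.
	shouldResume(trigger: ResumeTrigger, now: number): boolean {
		if (trigger === 'sessionAdded') {
			// A fresh session is a real state change, not event noise,
			// so it bypasses the smoothing gate.
			return true;
		}
		if (now - this.lastResumeAt < RESUME_RATE_LIMIT_MS) {
			return false;
		}
		this.lastResumeAt = now;
		return true;
	}
}
```

Here 'sessionAdded' neither consults nor updates the shared timestamp, which corresponds to the "bypass" option; the "separate timestamp" option would instead keep a second field for that trigger.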

Comment thread src/vs/sessions/contrib/remoteAgentHost/browser/tunnelAgentHost.contribution.ts Outdated
Comment thread src/vs/sessions/common/sessionsTelemetry.ts
Comment thread src/vs/platform/agentHost/browser/webSocketClientTransport.ts Outdated
- _categorizeError: remove \btoken\b from auth regex to avoid matching
  'connection token' protocol errors. Use 'auth.*(fail|error|invalid)'
  instead, which catches real auth failures without over-matching.
- _silentStatusCheck: pass 'github' authProvider to cacheTunnel() so
  auto-discovered tunnels are properly matched by _handleSessionsChange
  for teardown on session removal.
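
The over-matching problem from the first item can be illustrated with a small comparison. The case-insensitive flag and the sample messages are assumptions; the two pattern bodies are the ones quoted in the commit message.

```typescript
// The alternative that was removed: matches any message containing "token".
const removedPattern = /\btoken\b/i;

// The replacement quoted in the commit message.
const replacementPattern = /auth.*(fail|error|invalid)/i;

// Hypothetical sample messages for illustration.
const protocolError = 'Invalid connection token';
const authFailure = 'Authentication failed for tunnel';

// The removed pattern treats a protocol error as an auth failure...
const oldMatchesProtocolError = removedPattern.test(protocolError);
// ...while the replacement ignores it and still catches real auth failures.
const newMatchesProtocolError = replacementPattern.test(protocolError);
const newMatchesAuthFailure = replacementPattern.test(authFailure);
```

With these sample strings, only the removed pattern matches the connection-token message, which is the misclassification the commit describes.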

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Screenshot Changes

Base: 641cbea2 Current: d6c69637

Changed (2)

chat/aiCustomizations/aiCustomizationManagementEditor/McpBrowseMode/Light
editor/inlineCompletions/other/JumpToHint/Dark

- Pass actual errorCategory ('auth'|'authExpired') to _pauseReconnect
  instead of always using 'authExpired'. Add 'auth' to
  TunnelConnectFailureReason type.
- Update telemetry classification comments to list all current enum
  members (authExpired for errorCategory, auth/authExpired for
  failureReason).
- Log actual data type (ArrayBuffer/Blob) and byte length for non-string
  WebSocket frames instead of coercing to empty string.
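
The last item might look roughly like this. The helper `describeFrame`, the structural `BlobLike` interface (used so the sketch does not depend on DOM typings), and the exact message format are hypothetical.

```typescript
// Minimal structural stand-in for Blob: only the byte size is needed here.
interface BlobLike {
	readonly size: number;
}

// Produces a short diagnostic description of a frame payload instead of
// coercing non-string data to an empty string.
function describeFrame(data: string | ArrayBuffer | BlobLike): string {
	if (typeof data === 'string') {
		return `string (${data.length} chars): ${data.slice(0, 80)}`;
	}
	if (data instanceof ArrayBuffer) {
		return `ArrayBuffer (${data.byteLength} bytes)`;
	}
	return `Blob (${data.size} bytes)`;
}
```

Logging the type and length makes it clear when a peer is sending binary frames to a transport that expects JSON text.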

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@osortega osortega marked this pull request as ready for review April 20, 2026 23:44
@osortega osortega enabled auto-merge (squash) April 20, 2026 23:44
@osortega osortega merged commit 8ca93cd into main Apr 21, 2026
26 checks passed
@osortega osortega deleted the agents/connection-stack-comparison-vscode-codespaces branch April 21, 2026 00:13
@vs-code-engineering vs-code-engineering Bot added this to the 1.118.0 milestone Apr 21, 2026