Skip to content

CDP telemetry fixes#251

Merged
Sayan- merged 5 commits into
mainfrom
sayan/kernel-1301-add-more-events--cdp
May 28, 2026
Merged

CDP telemetry fixes#251
Sayan- merged 5 commits into
mainfrom
sayan/kernel-1301-add-more-events--cdp

Conversation

@Sayan-
Copy link
Copy Markdown
Contributor

@Sayan- Sayan- commented May 22, 2026

Overview

Bug

When chromium restarted mid-session (PATCH /chromium/flags, /configure, etc., or a crash + supervisord autorestart), the CDP proxy emitted cdp_disconnect.reason = "client_close" instead of "upstream_changed". Live repro confirmed: ~1.6s after the WS opened, reason was client_close even though chromium clearly went away.

Root cause

A goroutine watched mgr.Subscribe() for the new chromium URL and stamped reasonOverride = upstream_changed before triggering pump teardown. But the upstream's TCP EOF (chromium dying) reaches the pump before the new "DevTools listening on" log line is emitted (~5–8 s gap). So cleanup always ran first, saw reasonOverride == nil, and fell through to client_close.

Fix

  • wsproxy.Pump's onClose callback now receives a PumpExitCause (client / upstream / context).
  • Deleted the URL-watcher goroutine + reasonOverride in devtoolsproxy. Reason resolution moved into cleanup as a small helper, resolveDisconnectReason, which:
    • returns client_close / context_cancelled immediately from the cause,
    • on upstream cause, checks mgr.Current() and otherwise waits up to restartConfirmWait (10 s) on urlCh for a new URL — new URL ⇒ upstream_changed, timeout ⇒ upstream_error.

Tests

  • TestResolveDisconnectReason — 6 table cases covering all four reasons (sub-100 ms).
  • TestWebSocketProxyHandler_EmitsUpstreamChangedOnMidStreamRestart — integration through the handler with restartConfirmWait shortened.
  • Existing dial-time upstream_error + client_close tests continue to pass.

Verification

Repro'd live against the headful image. Before fix: reason=client_close, duration_ms=1620. After fix: reason=upstream_changed, duration_ms=10388.


Note

Medium Risk
Changes CDP WebSocket proxy disconnect attribution and timing used by telemetry; behavior affects automation clients during Chromium restarts but not auth or data paths.

Overview
Fixes CDP proxy telemetry when Chromium restarts mid-session: cdp_disconnect was often reported as client_close instead of upstream_changed because the old URL-watcher set reasonOverride only after the new DevTools URL appeared, while the upstream TCP EOF arrived first.

wsproxy.Pump now passes a PumpExitCause (client / upstream / context) into its onClose callback so callers know which side failed first. The DevTools proxy drops reasonOverride and uses resolveDisconnectReason: immediate mapping for client/context; on upstream exit it polls UpstreamManager.Current() for up to restartConfirmWait (10s) to distinguish upstream_changed vs upstream_error. A URL-change watcher closes the stale upstream WebSocket (instead of only cancelling the pump) so disconnects classify correctly. cdp_disconnect duration_ms and event timestamp use a pinned disconnectedAt so restart polling does not inflate session length.

Adds unit/integration tests for reason resolution, mid-stream restart, and proactive client kick on URL change. Remaining diff is mostly import/formatting cleanup across unrelated packages.

Reviewed by Cursor Bugbot for commit f43d049. Bugbot is set up for automated code reviews on this repo. Configure here.

@firetiger-agent
Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR title 'CDP telemetry fixes' does not indicate changes to kernel API endpoints or Temporal workflows; please add a comment to opt in to deploy monitoring if this affects those areas.

To monitor this PR anyway, reply with @firetiger monitor this.

@Sayan- Sayan- requested review from archandatta and hiroTamada May 22, 2026 21:49
Comment thread server/lib/devtoolsproxy/proxy.go Outdated
} else if r.Context().Err() != nil {
reason = oapi.ContextCancelled
}
reason := resolveDisconnectReason(cause, r.Context(), mgr, urlCh, upstreamURL, restartConfirmWait, logger)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be a behavioral change, not just telemetry. Previously the URL watcher canceled the pump as soon as UpstreamManager published a different DevTools URL, forcing clients off stale upstream sessions. With this change, urlCh is only checked during dial/reason resolution after the pump exits, so a session can stay attached to the old upstream until the websocket itself errors or the request context is canceled. Could we preserve the watcher for stale-session cancellation while keeping this new cause-based reason resolution?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I am wrong, but just double-checking here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope this is totally fair and the initial impl was supes wrong. had opus do a deep dive against main and rebuild this segment. subsequent before / after breakdown:

Confirmed the key invariant on line 94 vs 98–116: setCurrent stores currentURL before broadcasting on urlCh. So mgr.Current() is always ≥ as fresh as anything any subscriber will read from urlCh. The polling design relies on exactly this.

Semantic correctness walkthrough

Origin/main columns reflect the actual behavior of pre-PR code, not just intent. New columns reflect the current uncommitted state.

A. Pump-running scenarios (cleanup path)

# Scenario Origin/main reason Origin/main correctness New reason New correctness
1 Client closes WS cleanly client_close (default fall-through) client_close (PumpExitClient → switch)
2 Request context cancelled (server shutdown, client aborted HTTP) context_cancelled (r.Context().Err() != nil) context_cancelled (reqCtx.Err() != nil early return; wins even when cause is Client/Upstream due to ctx-cancelled reads)
3 Chromium hard-crashes → TCP dies first, new URL appears within 10s (the PR's target bug) client_close ❌ — reasonOverride is nil because TCP EOF beats the "DevTools listening on" log; default fall-through misattributes ❌ BUG upstream_changedPumpExitUpstream → poll mgr.Current() → sees new URL within 100ms tick ✅ FIX
4 Chromium hard-crashes; no recovery within 10s client_close ❌ — same default fall-through; can never emit upstream_error from this path ❌ wrong telemetry upstream_error — poll exhausts restartConfirmWait ✅ FIX
5 URL changes while old upstream TCP still alive (watcher beats TCP — defense-in-depth) upstream_changed — watcher CASes reasonOverride + pumpCancel() upstream_changed — watcher closes upstreamConn; pump exits PumpExitUpstream; cleanup sees mgr.Current() != dialedURL immediately (no 100ms wait, because setCurrent updated Current before the broadcast that woke the watcher)
6 Concurrent: chromium crashes AND new URL appears, both during a single cleanup window client_close if TCP wins the race (often) ❌ or upstream_changed if URL log line wins (rare) ⚠️ race-dependent, wrong half the time upstream_changed deterministically (poll catches Current update regardless of order)
7 Concurrent: URL change watcher fires AND reqCtx cancelled race: upstream_changed (if watcher CASes first) or context_cancelled (if reqCtx wins) ⚠️ race-dependent context_cancelled deterministically (reqCtx check is first) ✅ — request termination dominates
8 Stale/duplicate URL broadcast (newURL == currentUpstreamURL) Watcher continues loop; no spurious teardown Watcher continues loop; resolveDisconnectReason poll also skips by equality check
9 Empty URL broadcast (defensive, shouldn't happen) Watcher continues loop Watcher continues loop; resolver's newest != "" guard handles it

B. Pre-pump scenarios (unchanged code path — dialUpstreamWithRetry + early returns)

# Scenario Origin/main reason New reason Note
10 Dial fails: upstream truly down upstream_error upstream_error unchanged
11 Dial fails: r.Context() cancelled mid-dial context_cancelled context_cancelled unchanged
12 Dial succeeds after retrying on a newer URL from urlCh (no disconnect event yet) (no disconnect event yet) dialedURL becomes the URL that succeeded, which is what the watcher and resolver compare against
13 mgr.Current() empty before WS accept HTTP 503, no events HTTP 503, no events unchanged

C. Goroutine / lifecycle invariants

# Invariant Origin/main New
14 Watcher goroutine exits when pump exits ✅ via pumpCtx.Done() (cleanup eventually calls pumpCancel) ✅ via pumpCtx.Done() (cleanup calls pumpCancel first thing)
15 Watcher goroutine exits when it closes upstream on URL change ✅ via explicit return after pumpCancel() ✅ via explicit return after upstreamConn.Close
16 cleanup runs exactly once sync.Once sync.Once
17 urlCh subscriber leaked? defer unsub() defer unsub()
18 Double-close of upstreamConn (watcher closes, then cleanup closes again) n/a (origin/main didn't close from watcher) benign — coder/websocket's Close is idempotent (returns error on 2nd call but does not panic)
19 pumpCancel called twice (cleanup + reqCtx cascade)? ✅ benign — context.WithCancel's cancel func is idempotent ✅ same
20 dialUpstreamWithRetry and watcher both consume from urlCh? sequential — dial returns before watcher starts; no overlap sequential — same
21 Watcher and resolver compete on urlCh? n/a — resolver never reads urlCh in origin/main eliminated — resolver now polls mgr.Current()

D. Unreachable / defensive cases

# Scenario Behavior
22 cause == PumpExitContext with reqCtx.Err() == nil Unreachable in practice: pumpCtx = WithCancel(r.Context()), and the only callers of pumpCancel are (a) cleanup itself which runs after the pump exits, and (b) r.Context() cancellation cascading down (which sets reqCtx.Err() != nil). Switch case returns context_cancelled defensively.
23 mgr.Current() == "" mid-stream Can't happen post-dial — setCurrent only stores non-empty, and there's no method that clears it. Defensive newest != "" guard catches it harmlessly.

Summary

  • All 3 real bugs in origin/main's reason classification (rows 3, 4, 6) are fixed by the new design.
  • All 2 race-dependent reason determinations (rows 6, 7) become deterministic.
  • The defense-in-depth invariant Hiro flagged (row 5) is preserved.
  • No goroutine leaks, no double-close panics, no urlCh contention.
  • The 100 ms polling tick adds at most one tick of latency to upstream_changed classification when the URL update lands during a poll sleep; given restartConfirmWait = 10s, that's a 1% worst-case delay on what is already a rare path.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit fb15932. Configure here.

Comment thread server/lib/devtoolsproxy/proxy.go
@Sayan- Sayan- requested a review from hiroTamada May 26, 2026 21:47
…-more-events--cdp

# Conflicts:
#	server/lib/cdpclient/cdpclient_test.go
@Sayan- Sayan- merged commit 2279cd0 into main May 28, 2026
8 of 9 checks passed
@Sayan- Sayan- deleted the sayan/kernel-1301-add-more-events--cdp branch May 28, 2026 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants