CDP telemetry fixes#251
Conversation
|
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: PR title 'CDP telemetry fixes' does not indicate changes to kernel API endpoints or Temporal workflows; please add a comment to opt in to deploy monitoring if this affects those areas. To monitor this PR anyway, reply with |
| } else if r.Context().Err() != nil { | ||
| reason = oapi.ContextCancelled | ||
| } | ||
| reason := resolveDisconnectReason(cause, r.Context(), mgr, urlCh, upstreamURL, restartConfirmWait, logger) |
There was a problem hiding this comment.
this seems to be a behavioral change, not just telemetry. Previously the URL watcher canceled the pump as soon as UpstreamManager published a different DevTools URL, forcing clients off stale upstream sessions. With this change, urlCh is only checked during dial/reason resolution after the pump exits, so a session can stay attached to the old upstream until the websocket itself errors or the request context is canceled. Could we preserve the watcher for stale-session cancellation while keeping this new cause-based reason resolution?
There was a problem hiding this comment.
maybe I am wrong, but just double-checking here
There was a problem hiding this comment.
nope this is totally fair and the initial impl was supes wrong. had opus do a deep dive against main and rebuild this segment. subsequent before / after breakdown:
Confirmed the key invariant on line 94 vs 98–116: setCurrent stores currentURL before broadcasting on urlCh. So mgr.Current() is always ≥ as fresh as anything any subscriber will read from urlCh. The polling design relies on exactly this.
Semantic correctness walkthrough
Origin/main columns reflect the actual behavior of pre-PR code, not just intent. New columns reflect the current uncommitted state.
A. Pump-running scenarios (cleanup path)
| # | Scenario | Origin/main reason | Origin/main correctness | New reason | New correctness |
|---|---|---|---|---|---|
| 1 | Client closes WS cleanly | client_close (default fall-through) |
✅ | client_close (PumpExitClient → switch) |
✅ |
| 2 | Request context cancelled (server shutdown, client aborted HTTP) | context_cancelled (r.Context().Err() != nil) |
✅ | context_cancelled (reqCtx.Err() != nil early return; wins even when cause is Client/Upstream due to ctx-cancelled reads) |
✅ |
| 3 | Chromium hard-crashes → TCP dies first, new URL appears within 10s (the PR's target bug) | client_close ❌ — reasonOverride is nil because TCP EOF beats the "DevTools listening on" log; default fall-through misattributes |
❌ BUG | upstream_changed — PumpExitUpstream → poll mgr.Current() → sees new URL within 100ms tick |
✅ FIX |
| 4 | Chromium hard-crashes; no recovery within 10s | client_close ❌ — same default fall-through; can never emit upstream_error from this path |
❌ wrong telemetry | upstream_error — poll exhausts restartConfirmWait |
✅ FIX |
| 5 | URL changes while old upstream TCP still alive (watcher beats TCP — defense-in-depth) | upstream_changed — watcher CASes reasonOverride + pumpCancel() |
✅ | upstream_changed — watcher closes upstreamConn; pump exits PumpExitUpstream; cleanup sees mgr.Current() != dialedURL immediately (no 100ms wait, because setCurrent updated Current before the broadcast that woke the watcher) |
✅ |
| 6 | Concurrent: chromium crashes AND new URL appears, both during a single cleanup window | client_close if TCP wins the race (often) ❌ or upstream_changed if URL log line wins (rare) |
upstream_changed deterministically (poll catches Current update regardless of order) |
✅ | |
| 7 | Concurrent: URL change watcher fires AND reqCtx cancelled |
race: upstream_changed (if watcher CASes first) or context_cancelled (if reqCtx wins) |
context_cancelled deterministically (reqCtx check is first) |
✅ — request termination dominates | |
| 8 | Stale/duplicate URL broadcast (newURL == currentUpstreamURL) |
Watcher continues loop; no spurious teardown | ✅ | Watcher continues loop; resolveDisconnectReason poll also skips by equality check | ✅ |
| 9 | Empty URL broadcast (defensive, shouldn't happen) | Watcher continues loop | ✅ | Watcher continues loop; resolver's newest != "" guard handles it |
✅ |
B. Pre-pump scenarios (unchanged code path — dialUpstreamWithRetry + early returns)
| # | Scenario | Origin/main reason | New reason | Note |
|---|---|---|---|---|
| 10 | Dial fails: upstream truly down | upstream_error |
upstream_error |
unchanged |
| 11 | Dial fails: r.Context() cancelled mid-dial |
context_cancelled |
context_cancelled |
unchanged |
| 12 | Dial succeeds after retrying on a newer URL from urlCh |
(no disconnect event yet) | (no disconnect event yet) | dialedURL becomes the URL that succeeded, which is what the watcher and resolver compare against |
| 13 | mgr.Current() empty before WS accept |
HTTP 503, no events | HTTP 503, no events | unchanged |
C. Goroutine / lifecycle invariants
| # | Invariant | Origin/main | New |
|---|---|---|---|
| 14 | Watcher goroutine exits when pump exits | ✅ via pumpCtx.Done() (cleanup eventually calls pumpCancel) |
✅ via pumpCtx.Done() (cleanup calls pumpCancel first thing) |
| 15 | Watcher goroutine exits when it closes upstream on URL change | ✅ via explicit return after pumpCancel() |
✅ via explicit return after upstreamConn.Close |
| 16 | cleanup runs exactly once |
✅ sync.Once |
✅ sync.Once |
| 17 | urlCh subscriber leaked? |
✅ defer unsub() |
✅ defer unsub() |
| 18 | Double-close of upstreamConn (watcher closes, then cleanup closes again) |
n/a (origin/main didn't close from watcher) | benign — coder/websocket's Close is idempotent (returns error on 2nd call but does not panic) |
| 19 | pumpCancel called twice (cleanup + reqCtx cascade)? |
✅ benign — context.WithCancel's cancel func is idempotent |
✅ same |
| 20 | dialUpstreamWithRetry and watcher both consume from urlCh? |
sequential — dial returns before watcher starts; no overlap | sequential — same |
| 21 | Watcher and resolver compete on urlCh? |
n/a — resolver never reads urlCh in origin/main |
eliminated — resolver now polls mgr.Current() |
D. Unreachable / defensive cases
| # | Scenario | Behavior |
|---|---|---|
| 22 | cause == PumpExitContext with reqCtx.Err() == nil |
Unreachable in practice: pumpCtx = WithCancel(r.Context()), and the only callers of pumpCancel are (a) cleanup itself which runs after the pump exits, and (b) r.Context() cancellation cascading down (which sets reqCtx.Err() != nil). Switch case returns context_cancelled defensively. |
| 23 | mgr.Current() == "" mid-stream |
Can't happen post-dial — setCurrent only stores non-empty, and there's no method that clears it. Defensive newest != "" guard catches it harmlessly. |
Summary
- All 3 real bugs in origin/main's reason classification (rows 3, 4, 6) are fixed by the new design.
- All 2 race-dependent reason determinations (rows 6, 7) become deterministic.
- The defense-in-depth invariant Hiro flagged (row 5) is preserved.
- No goroutine leaks, no double-close panics, no
urlChcontention. - The 100 ms polling tick adds at most one tick of latency to
upstream_changedclassification when the URL update lands during a poll sleep; givenrestartConfirmWait = 10s, that's a 1% worst-case delay on what is already a rare path.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit fb15932. Configure here.
…-more-events--cdp # Conflicts: # server/lib/cdpclient/cdpclient_test.go

Overview
Bug
When chromium restarted mid-session (
PATCH /chromium/flags,/configure, etc., or a crash + supervisord autorestart), the CDP proxy emittedcdp_disconnect.reason = "client_close"instead of"upstream_changed". Live repro confirmed: ~1.6s after the WS opened, reason wasclient_closeeven though chromium clearly went away.Root cause
A goroutine watched
mgr.Subscribe()for the new chromium URL and stampedreasonOverride = upstream_changedbefore triggering pump teardown. But the upstream's TCP EOF (chromium dying) reaches the pump before the new "DevTools listening on" log line is emitted (~5–8 s gap). Socleanupalways ran first, sawreasonOverride == nil, and fell through toclient_close.Fix
wsproxy.Pump'sonClosecallback now receives aPumpExitCause(client/upstream/context).reasonOverrideindevtoolsproxy. Reason resolution moved into cleanup as a small helper,resolveDisconnectReason, which:client_close/context_cancelledimmediately from the cause,mgr.Current()and otherwise waits up torestartConfirmWait(10 s) onurlChfor a new URL — new URL ⇒upstream_changed, timeout ⇒upstream_error.Tests
TestResolveDisconnectReason— 6 table cases covering all four reasons (sub-100 ms).TestWebSocketProxyHandler_EmitsUpstreamChangedOnMidStreamRestart— integration through the handler withrestartConfirmWaitshortened.upstream_error+ client_close tests continue to pass.Verification
Repro'd live against the headful image. Before fix:
reason=client_close, duration_ms=1620. After fix:reason=upstream_changed, duration_ms=10388.Note
Medium Risk
Changes CDP WebSocket proxy disconnect attribution and timing used by telemetry; behavior affects automation clients during Chromium restarts but not auth or data paths.
Overview
Fixes CDP proxy telemetry when Chromium restarts mid-session:
cdp_disconnectwas often reported asclient_closeinstead ofupstream_changedbecause the old URL-watcher setreasonOverrideonly after the new DevTools URL appeared, while the upstream TCP EOF arrived first.wsproxy.Pumpnow passes aPumpExitCause(client/upstream/context) into itsonClosecallback so callers know which side failed first. The DevTools proxy dropsreasonOverrideand usesresolveDisconnectReason: immediate mapping for client/context; on upstream exit it pollsUpstreamManager.Current()for up torestartConfirmWait(10s) to distinguishupstream_changedvsupstream_error. A URL-change watcher closes the stale upstream WebSocket (instead of only cancelling the pump) so disconnects classify correctly.cdp_disconnectduration_msand event timestamp use a pinneddisconnectedAtso restart polling does not inflate session length.Adds unit/integration tests for reason resolution, mid-stream restart, and proactive client kick on URL change. Remaining diff is mostly import/formatting cleanup across unrelated packages.
Reviewed by Cursor Bugbot for commit f43d049. Bugbot is set up for automated code reviews on this repo. Configure here.