feat: route browser telemetry directly to the VM by default #108

Merged
rgarcia merged 3 commits into next from raf/telemetry-default-routing on Jun 3, 2026


@rgarcia rgarcia commented Jun 3, 2026

What

  • Telemetry method path changed from GET /browsers/{id}/telemetry to GET /browsers/{id}/telemetry/stream (both sync and async) so it mirrors the browser VM endpoint. When the request is rewritten for direct-to-VM routing, the pure path passthrough yields {base_url}/telemetry/stream — the SSE stream on the VM. (The VM's {base_url}/telemetry is a different, non-stream JSON config endpoint, so the /stream suffix is required for direct routing to hit the stream.)
  • telemetry added to the default KERNEL_BROWSER_ROUTING_SUBRESOURCES allowlist alongside curl, so telemetry streams route directly to the VM with no env var configuration.
  • Updated api.md and the routing unit tests; added a unit test asserting a telemetry.stream(...) call rewrites to {base_url}/telemetry/stream with the jwt query param appended and the Authorization header stripped.
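
The rewrite described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual implementation: the function name `rewrite_for_direct_routing`, its parameters, and the example hosts are hypothetical, and it assumes the session `base_url` already carries any VM path prefix.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def rewrite_for_direct_routing(url, headers, *, browser_id, base_url, jwt):
    """Hypothetical sketch: rewrite a control-plane URL to target the VM.

    Everything after /browsers/{id}/ is passed through unchanged, the jwt is
    appended as a query param, and the Authorization header is dropped.
    """
    parts = urlsplit(url)
    prefix = f"/browsers/{browser_id}/"
    if not parts.path.startswith(prefix):
        # Not a browser subresource request: leave it untouched.
        return url, dict(headers)

    tail = parts.path[len(prefix):]  # e.g. "telemetry/stream"
    base = urlsplit(base_url)
    new_path = base.path.rstrip("/") + "/" + tail

    # Keep any existing query params and append the jwt.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("jwt", jwt))

    new_url = urlunsplit((base.scheme, base.netloc, new_path, urlencode(query), ""))
    new_headers = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    return new_url, new_headers
```

Because the path tail is passed through verbatim, the old SDK path would have landed on `{base_url}/telemetry` (the JSON config endpoint) rather than the stream, which is why the `/stream` suffix matters.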

Dependency / rollout note

This DEPENDS ON the control-plane PR renaming /browsers/{id}/telemetry -> /browsers/{id}/telemetry/stream. That rename is not yet deployed to prod. Until it deploys, a non-routed call to the new SDK path 404s in prod, while a direct-routed call (rewritten to {base_url}/telemetry/stream on the VM) works today. Direct routing is the default for telemetry, so the common path works now; this also independently proves the rewrite is doing the work.

Live smoke evidence (prod)

Created a headless browser with telemetry enabled, opened browsers.telemetry.stream(...), generated activity via browsers.curl(...), and instrumented the httpx layer to record the final outbound request:

  • Events observed: 1 (within ~25s; category="api", type="api_call", operation_id="BrowserCurl")
  • Routed URL host: proxy.jfk-focused-mirzakhani.onkernel.com:8443 (the session base_url host) — NOT api.onkernel.com
  • Path ends with /telemetry/stream
  • Authorization header stripped; jwt query param present
  • Browser deleted in finally

Tests

  • uv run pytest tests/test_browser_routing.py — 21 passed
  • uv run ruff check on changed paths — clean
  • uv run python -c "import kernel" — ok

🤖 Generated with Claude Code


Note

Medium Risk
Changes default request routing and auth stripping for telemetry SSE, with a control-plane path dependency noted in the PR; the streaming decoder fix is localized but affects all SSE consumers.

Overview
Browser telemetry is now included in the default direct-to-VM routing allowlist (curl and telemetry), so browsers.telemetry.stream rewrites to the session VM with JWT query auth and no Authorization header unless routing is disabled via env. A new example and routing test cover streaming telemetry from a session.
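
One way to picture the allowlist check is the sketch below. It is illustrative only: the helper names are hypothetical, and it assumes the env var holds a comma-separated list and that an empty value disables direct routing, neither of which is confirmed by this PR.

```python
import os

# Default allowlist as described in this PR.
DEFAULT_SUBRESOURCES = ("curl", "telemetry")
ENV_VAR = "KERNEL_BROWSER_ROUTING_SUBRESOURCES"


def routing_subresources(environ=None):
    """Return the set of subresources eligible for direct-to-VM routing.

    Assumption: the env var is a comma-separated list; unset means defaults,
    and an empty string means no subresource is routed directly.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None:
        return frozenset(DEFAULT_SUBRESOURCES)
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def is_direct_routed(path, environ=None):
    """True if /browsers/{id}/{subresource}/... names an allowlisted subresource."""
    segments = [s for s in path.split("/") if s]
    if len(segments) < 3 or segments[0] != "browsers":
        return False
    return segments[2] in routing_subresources(environ)
```

Under these assumptions, `is_direct_routed("/browsers/abc/telemetry/stream", {})` is true with no configuration, while setting the variable to an empty string turns direct routing off.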

The SSE decoder no longer treats sticky last_event_id as a reason to emit an event on comment-only keepalive blocks (:\n\n). That fixes idle telemetry streams dying with JSONDecodeError on empty frames after the first event with an id. A sync/async regression test was added.

Reviewed by Cursor Bugbot for commit a0ee2b2.


Update — idle-stream SSE keepalive fix now included in this PR

This PR also fixes a crash on idle SSE streams. The SSE decoder's empty-block guard (src/kernel/_streaming.py) included last_event_id, which is sticky across events per the SSE spec — so once any event carried an id, every subsequent comment-only block (the server's :\n\n keepalive, sent every ~15s) dispatched an empty-data event that the typed Stream wrapper then .json()-decoded, raising JSONDecodeError and ending the stream. Dropped last_event_id from the guard; added a regression test (tests/test_streaming.py).
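
The guard behavior can be reproduced with a simplified decoder. This is a standalone sketch, not the code in src/kernel/_streaming.py: the `decode_sse` function and the `sticky_id_in_guard` flag are hypothetical and exist only to contrast the old and new guard.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class ServerSentEvent:
    event: Optional[str]
    data: str
    id: Optional[str]
    retry: Optional[int]


def decode_sse(lines: Iterable[str], *, sticky_id_in_guard: bool = False) -> Iterator[ServerSentEvent]:
    """Decode SSE lines; sticky_id_in_guard=True reproduces the old guard."""
    event: Optional[str] = None
    data: List[str] = []
    retry: Optional[int] = None
    last_event_id: Optional[str] = None  # sticky across events per the SSE spec

    for line in lines:
        if line == "":  # a blank line terminates the current block
            has_fields = bool(event or data or retry is not None)
            if sticky_id_in_guard:
                # Old behavior: a previously seen id makes every block dispatch.
                has_fields = has_fields or last_event_id is not None
            if has_fields:
                yield ServerSentEvent(event, "\n".join(data), last_event_id, retry)
            event, data, retry = None, [], None  # last_event_id is NOT reset
            continue
        if line.startswith(":"):
            continue  # comment line, e.g. the ":" keepalive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        elif field == "id" and "\0" not in value:
            last_event_id = value
        elif field == "retry" and value.isdigit():
            retry = int(value)
```

Feeding it `["id: 1", 'data: {"a": 1}', "", ":", ""]` yields one event with the fixed guard and two with the old one; the second has empty `data`, which is the frame a JSON decode would reject.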

Note: src/kernel/_streaming.py is Stainless codegen-owned; this is a local patch pending an upstream fix in the generator config.

@firetiger-agent

Created a monitoring plan for this PR.

What this PR does: Fixes the browser telemetry stream so it actually works — the SDK was calling the wrong API path, and now also routes telemetry requests directly to the browser VM (bypassing the central API) for lower latency.

Intended effect:

  • Telemetry stream success rate: baseline ~98% on the old /browsers/{id}/telemetry path (197/201 over 7 days, 0 server errors); confirmed if the new /browsers/{id}/telemetry/stream path shows 200s on first SDK adoption with no 404s
  • Direct-VM routing active: baseline 0 requests on /browsers/{id}/telemetry/stream (path didn't exist in SDK); confirmed if SDK callers show successful SSE event delivery without hitting the central API

Risks:

  • Backend path not yet served: /browsers/{id}/telemetry/stream may not exist on the API or VM; alert if any HTTP 404 appears on this route after SDK adoption
  • VM-side path mismatch — direct-VM routing rewrites the request to /browser/kernel/telemetry/stream; if the VM only serves the old path without /stream, direct calls silently fail with 404 (not visible in API traces); alert if users report broken telemetry after upgrading SDK
  • Old path traffic increase — if GET /browsers/{id}/telemetry traffic unexpectedly rises above ~70 req/day, an integration may be pinned to the old path and breaking; alert if count exceeds baseline

Status updates will be posted automatically on this PR as monitoring progresses.


rgarcia and others added 3 commits June 3, 2026 11:41
Telemetry is now a default direct-to-VM routing subresource. The
telemetry stream method path is changed from /browsers/{id}/telemetry to
/browsers/{id}/telemetry/stream so it mirrors the browser VM endpoint:
when the request is rewritten for direct routing it yields
{base_url}/telemetry/stream, which is the SSE stream on the VM (the VM's
/telemetry is a different, non-stream JSON endpoint).

"telemetry" is added to the default KERNEL_BROWSER_ROUTING_SUBRESOURCES
allowlist alongside "curl".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The SSE decoder's empty-block guard included last_event_id, which is sticky
across events per the SSE spec. Once any event carried an id, every subsequent
comment-only block (e.g. the server's ":\n\n" keepalive, sent every 15s on an
idle stream) fell through the guard and dispatched an empty-data event. The
typed Stream wrapper then calls .json() on it unconditionally, raising
JSONDecodeError and killing the stream.

This made idle browser telemetry streams crash after ~15s. Drop last_event_id
from the guard so dispatch depends only on the current block's event/data/retry
fields. Event-typed empty-data frames still dispatch (unchanged). Adds a
regression test for a keepalive comment following an id-bearing event.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rgarcia force-pushed the raf/telemetry-default-routing branch from ad8d267 to a0ee2b2 on June 3, 2026 15:43
rgarcia requested a review from archandatta on June 3, 2026 16:09
rgarcia merged commit 874641c into next on Jun 3, 2026
7 checks passed