
fix(telemetry): cache_tier streaming-race — read tracer AFTER body consumption #138

Merged

klappy merged 1 commit into main from fix/cache-tier-streaming-race on Apr 26, 2026

Conversation

klappy (Owner) commented on Apr 26, 2026

Summary

Every tool call in production telemetry has been recording cache_tier="none" for the entire E0008.1 lifetime of the field — across all 7 days of available data and all 16 distinct tool names, including oddkit_search (286 calls) and oddkit_orient (20 calls), which definitely call getIndex() and emit index spans.

The bug isn't missing instrumentation. It's a timing race: the outer read of tracer.indexSource in workers/src/index.ts happens before the streaming tool handler has finished populating the tracer.

Diagnostic evidence

Captured during the session that found this bug:

oddkit_search call ──► response.debug.trace.index_source = "cache"
                       (correct — read inside handleUnifiedAction
                       at orchestrate.ts:2806, AFTER action ran)

                       telemetry blob9 = "none"
                       (incorrect — read at index.ts:981, BEFORE
                       the streaming tool handler completed)

Same tracer instance. Two read sites. Two different values. Proves the race.

Root cause

agents/mcp returns a streaming Response. await handler(request, env, ctx) resolves with the Response object once headers are flushed and the body stream is constructed — but before the tool handler closure has finished running. The tool handler is what calls fetcher.getIndex(), which is what records the index span on the tracer.

So at line 981 of workers/src/index.ts, when we read cacheTier = tracer.indexSource:

  • _indexSource is still null
  • The getter returns the fallback "none"
  • That value is captured in the closure for recordTelemetry
  • Production sees cache_tier="none" for every call

The trace embedded in the response's debug envelope had the correct value because that snapshot is captured inside handleUnifiedAction at orchestrate.ts:2806, after the action ran. The outer reader was just too early.
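
To make the timing concrete, here is a minimal, self-contained sketch of the race. The FakeTracer class, the streaming handler, and the span label are simplified stand-ins, not the real tracing.ts or agents/mcp code:

```ts
type IndexTier = "memory" | "cache" | "r2" | "build" | "none";

// Simplified stand-in for the real tracer: the getter falls back to "none"
// until an index span has been recorded.
class FakeTracer {
  private _indexSource: IndexTier | null = null;
  recordIndexSpan(source: IndexTier) { this._indexSource = source; }
  get indexSource(): IndexTier { return this._indexSource ?? "none"; }
}

// Stand-in for the streaming MCP handler: the Response is returned immediately,
// but the body (and the getIndex-style span it records) is produced later.
async function streamingToolHandler(tracer: FakeTracer): Promise<Response> {
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      await new Promise((resolve) => setTimeout(resolve, 0)); // tool handler still running
      tracer.recordIndexSpan("cache");                        // span recorded late
      controller.enqueue(new TextEncoder().encode("{}"));
      controller.close();
    },
  });
  return new Response(body);
}

async function demonstrateRace(): Promise<void> {
  const tracer = new FakeTracer();
  const response = await streamingToolHandler(tracer);

  const tooEarly = tracer.indexSource;   // "none"  <- what production telemetry captured
  await response.clone().text();         // consuming the body lets the tool handler finish
  const afterBody = tracer.indexSource;  // "cache" <- what the debug envelope captured

  console.log({ tooEarly, afterBody });
}

demonstrateRace();
```

Running this prints { tooEarly: "none", afterBody: "cache" }, the same two-read-sites discrepancy shown in the diagnostic above.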

Fix

Move the cacheTier = tracer.indexSource read inside the ctx.waitUntil callback, after await responseClone.text() resolves. Reading the response body to completion forces the streaming tool handler to have finished, which means the tracer is fully populated.

A three-line code change, with a heavy comment explaining why.
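
For orientation, a sketch of the fixed shape of the outer handler. handler, recordTelemetry, tracer, and the ctx interface are simplified assumptions standing in for the real workers/src/index.ts wiring, not the actual code:

```ts
// Sketch only; the real read site is around workers/src/index.ts:981.
interface WaitableCtx { waitUntil(promise: Promise<unknown>): void; }

declare function handler(request: Request): Promise<Response>;                   // streaming MCP handler (assumed)
declare function recordTelemetry(fields: { cacheTier: string }): Promise<void>;  // assumed signature
declare const tracer: { readonly indexSource: string };                          // assumed accessor

async function handleFetch(request: Request, ctx: WaitableCtx): Promise<Response> {
  const response = await handler(request);   // resolves while the tool handler may still be running
  const responseClone = response.clone();

  ctx.waitUntil((async () => {
    // Reading the body to completion forces the streaming tool handler to
    // finish, so the tracer is fully populated before we read it.
    await responseClone.text();

    // The fix: capture cache_tier HERE, inside waitUntil, after body consumption.
    const cacheTier = tracer.indexSource;
    await recordTelemetry({ cacheTier });
  })());

  return response;
}
```

As described above, only the read moves: the deferred ctx.waitUntil path and the responseClone.text() call are already part of the telemetry flow.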

Regression test

The new test, `cache_tier reads must happen after the streaming response body completes`, uses setImmediate to model a streaming tool handler that has not yet recorded its index access at the moment the outer handler's await resolves (a sketch follows below). It asserts:

  • (a) OLD pattern (read immediately after await) returns "none" — reproduces the production bug
  • (b) FIXED pattern (read after body consumption) returns the actual span source
  • (c) Round-trip through recordTelemetry → blob9 carries the correct value

Without this test, a future refactor that hoists the read back out of waitUntil for "performance" would silently revert the fix and CI would not catch it.
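
A rough sketch of what such a test can look like with node:test and a stand-in tracer variable; the real test in telemetry-integration.test.mjs differs in detail, and assertion (c), the recordTelemetry round-trip, is omitted here:

```ts
import { test } from "node:test";
import assert from "node:assert/strict";

test("cache_tier reads must happen after the streaming response body completes", async () => {
  // Minimal stand-in for the tracer: falls back to "none" until a span lands.
  let indexSource: "none" | "cache" = "none";

  // setImmediate models a streaming tool handler that has not yet recorded
  // its index access at the moment the outer handler's await resolves.
  const response = new Response(
    new ReadableStream<Uint8Array>({
      start(controller) {
        setImmediate(() => {
          indexSource = "cache"; // span recorded late, by the tool handler
          controller.enqueue(new TextEncoder().encode("{}"));
          controller.close();
        });
      },
    }),
  );

  // (a) OLD pattern: reading immediately after the await reproduces the bug.
  assert.equal(indexSource, "none");

  // (b) FIXED pattern: reading after body consumption sees the actual span source.
  await response.clone().text();
  assert.equal(indexSource, "cache");
});
```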

Test results

18 passed, 0 failed
(was 17 — added 1)

Open issue (out of scope, follow-up)

getFile emits file:${path} spans, not "index" / "index-build" spans. The tracer's _indexSource setter only recognizes the latter two labels. This means that even with this race-fix, oddkit_get calls will record cache_tier reflecting the index tier (resolved when runGet calls getIndex first) rather than the file tier (where the document was actually fetched from).

Whether oddkit_get's blob9 should track the index tier (current behavior post-fix), the file tier (would need new span label or setter logic), or both is a separate scope decision. Filing as a follow-up.
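
To make the limitation concrete, a toy paraphrase of the label check being described; the function name and example path are hypothetical, not the actual tracing.ts code:

```ts
// Toy paraphrase of the current label check (pre-follow-up).
function feedsIndexSource(spanLabel: string): boolean {
  // Only these two labels ever feed _indexSource today.
  return spanLabel === "index" || spanLabel === "index-build";
}

feedsIndexSource("index");              // true  -> getIndex() span sets the tier
feedsIndexSource("index-build");        // true  -> cold-build span sets the tier
feedsIndexSource("file:docs/guide.md"); // false -> getFile() span is ignored;
                                        //          the file tier never feeds cache_tier
```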

What this means for the cache_tier dashboards

After this lands and deploys, expect to see actual values populating blob9:

  • memory — module-level cache hit (fastest, ~0ms)
  • cache — Cloudflare Cache API edge hit (~1ms)
  • r2 — R2 durable storage read (~40ms)
  • build — cold build from ZIP (worst case, seconds)
  • none — no index loaded (e.g. oddkit_version, oddkit_time)

Aggregate latency will be queryable by tier. Module cache hit rates become visible. The dashboard the field was added for will finally have data.


Note

Medium Risk
Telemetry behavior changes for all MCP tool calls by deferring cache_tier capture until after the streaming body is consumed; risk is mainly around response-cloning/body-consumption timing and potential performance/edge-case impacts in the deferred waitUntil path.

Overview
Fixes a streaming timing race where telemetry always recorded cache_tier="none" by moving the tracer.indexSource read into the deferred ctx.waitUntil callback after responseClone.text() completes.

Extends telemetry-integration.test.mjs to compile/import tracing.ts and adds a regression test that simulates the streaming race and verifies cache_tier (blob9) reflects the post-body tracer value rather than the premature fallback.

Reviewed by Cursor Bugbot for commit 262693c.

fix(telemetry): cache_tier streaming-race — read tracer AFTER body consumption

The MCP handler from agents/mcp returns a streaming Response. `await
handler(request, env, ctx)` resolves with the Response object before
the tool handler closure has finished populating the tracer. Reading
`tracer.indexSource` immediately after that await yields "none"
because the "index" / "index-build" span has not been recorded yet.

This is why every tool call in production telemetry shows
cache_tier="none" across all 7 days of data — even oddkit_search,
oddkit_orient, and oddkit_catalog which definitely call getIndex().

The trace embedded in the response's debug envelope had the correct
value (e.g. index_source="cache") because that snapshot is captured
inside handleUnifiedAction at orchestrate.ts:2806, AFTER the action
ran. But the OUTER read in workers/src/index.ts ran too early and
captured the empty initial state.

Fix: move the `cacheTier = tracer.indexSource` read INSIDE the
ctx.waitUntil callback, AFTER `await responseClone.text()` resolves.
Reading the response body to completion forces the streaming tool
handler to have finished, which means the tracer is fully populated.

Diagnostic evidence captured during this session:

  oddkit_search call → response.debug.trace.index_source = "cache"
                       (correct — read inside handleUnifiedAction)
                       telemetry blob9 = "none"
                       (incorrect — read at index.ts:981 too early)

  Same tracer instance, two read sites, two different values. Proves
  the race.

Regression test added: `cache_tier reads must happen after the
streaming response body completes`. Uses setImmediate to model the
streaming tool handler that has not yet recorded its index access at
the moment the outer handler's await resolves. Asserts:

  (a) OLD pattern (read immediately after await) returns "none" —
      reproduces the production bug
  (b) FIXED pattern (read after body consumption) returns the actual
      span source
  (c) Round-trip through recordTelemetry → blob9 carries the correct
      value when fed the post-body-consumption read

Test count: 17 → 18 passing. tracing.ts now compiled into the test
build alongside telemetry.ts and tokenize.ts.

Open issue (separate fix, not in scope here):

  getFile emits `file:${path}` spans, not `index` / `index-build`
  spans. The tracer's _indexSource setter only matches the latter
  two labels. This means that even with this race-fix, oddkit_get
  may still record cache_tier values that reflect the index tier
  (resolved during the action's getIndex call) rather than the
  document fetch tier. Whether oddkit_get's blob9 should track the
  index tier (current behavior post-fix), the file tier (new span
  label needed), or both is a separate scope decision. Filing as a
  follow-up rather than expanding this PR.
@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

✅ Deployment successful! oddkit, commit 262693c, Apr 26 2026, 02:36 AM (UTC).
klappy merged commit 838d97c into main on Apr 26, 2026 (5 checks passed).
klappy deleted the fix/cache-tier-streaming-race branch on Apr 26, 2026, 02:54.
klappy added a commit that referenced this pull request Apr 26, 2026
…th (#139)

Follow-up to #138. The streaming-race fix corrected the timing bug,
but oddkit_get for klappy:// URIs still recorded cache_tier="none"
because of a separate defect: the tracer's _indexSource setter only
recognized 'index' and 'index-build' labels.

runGet for klappy:// URIs takes a fast path that skips getIndex
entirely and calls getFile directly. The fetcher emits 'file:${path}'
spans, which the setter ignored. ~95% of oddkit_get calls hit this
path, so even after #138, ~95% of get calls had cache_tier='none'.

Fix: extend the setter to also recognize 'file:*' labels alongside
'index' / 'index-build'. First-wins guard preserved:

  - runSearch: 'index' fires first → index tier wins
  - runGet (klappy://): only 'file:*' fires → file tier wins
  - runGet (kb://, odd://): 'index' first → index tier wins (~5%)
  - 'file-r2:*' (R2 miss with source='miss') excluded by guard
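
A hedged sketch of what the extended first-wins setter could look like under these rules; it paraphrases the commit description rather than reproducing the actual tracing.ts diff:

```ts
// Paraphrased sketch; label and field names follow the description above, not the real code.
class TracerSketch {
  private _indexSource: string | null = null;

  recordSpan(label: string, source: string): void {
    const isIndexSpan = label === "index" || label === "index-build";
    // New: "file:<path>" spans also qualify. Note that "file-r2:<path>" does not
    // match the "file:" prefix, which is one reading of how miss-spans stay excluded.
    const isFileSpan = label.startsWith("file:");

    // First-wins guard preserved: whichever qualifying span fires first sets the tier.
    if ((isIndexSpan || isFileSpan) && this._indexSource === null) {
      this._indexSource = source; // e.g. "memory" | "cache" | "r2" | "build"
    }
  }

  // Internal name unchanged; the public envelope key remains index_source in toJSON().
  get indexSource(): string {
    return this._indexSource ?? "none";
  }
}
```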

Internal field/getter names stay unchanged. Public envelope key
'index_source' in tracer.toJSON() is part of the response contract.
Doc-comment updated to reflect broader semantic.

4 regression tests added covering: file:* recognition, index-wins
preserved, miss-spans excluded, original behavior unchanged.

Test count: 18 → 22 passing.

This completes the cache_tier dashboard: every tool with a real data
fetch now records its actual tier.
