fix(telemetry): cache_tier streaming-race — read tracer AFTER body consumption (#138)
Merged
The MCP handler from agents/mcp returns a streaming Response. `await
handler(request, env, ctx)` resolves with the Response object before
the tool handler closure has finished populating the tracer. Reading
`tracer.indexSource` immediately after that await yields "none"
because the "index" / "index-build" span has not been recorded yet.
This is why every tool call in production telemetry shows
cache_tier="none" across all 7 days of data — even oddkit_search,
oddkit_orient, and oddkit_catalog which definitely call getIndex().
The trace embedded in the response's debug envelope had the correct
value (e.g. index_source="cache") because that snapshot is captured
inside handleUnifiedAction at orchestrate.ts:2806, AFTER the action
ran. But the OUTER read in workers/src/index.ts ran too early and
captured the empty initial state.
Fix: move the `cacheTier = tracer.indexSource` read INSIDE the
ctx.waitUntil callback, AFTER `await responseClone.text()` resolves.
Reading the response body to completion forces the streaming tool
handler to have finished, which means the tracer is fully populated.
Diagnostic evidence captured during this session:
- oddkit_search call → response.debug.trace.index_source = "cache"
  (correct — read inside handleUnifiedAction)
- telemetry blob9 = "none"
  (incorrect — read at index.ts:981, too early)
Same tracer instance, two read sites, two different values. Proves
the race.
Regression test added: `cache_tier reads must happen after the
streaming response body completes`. Uses setImmediate to model the
streaming tool handler that has not yet recorded its index access at
the moment the outer handler's await resolves. Asserts:
(a) OLD pattern (read immediately after await) returns "none" —
reproduces the production bug
(b) FIXED pattern (read after body consumption) returns the actual
span source
(c) Round-trip through recordTelemetry → blob9 carries the correct
value when fed the post-body-consumption read
Test count: 17 → 18 passing. tracing.ts now compiled into the test
build alongside telemetry.ts and tokenize.ts.
Open issue (separate fix, not in scope here):
getFile emits `file:${path}` spans, not `index` / `index-build`
spans. The tracer's _indexSource setter only matches the latter
two labels. This means that even with this race-fix, oddkit_get
may still record cache_tier values that reflect the index tier
(resolved during the action's getIndex call) rather than the
document fetch tier. Whether oddkit_get's blob9 should track the
index tier (current behavior post-fix), the file tier (new span
label needed), or both is a separate scope decision. Filing as a
follow-up rather than expanding this PR.
Deploying with

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! (View logs) | oddkit | 262693c | Commit Preview URL · Branch Preview URL | Apr 26 2026, 02:36 AM |
klappy added a commit that referenced this pull request on Apr 26, 2026:
…th (#139)

Follow-up to #138. The streaming-race fix corrected the timing bug, but oddkit_get for klappy:// URIs still recorded cache_tier="none" because of a separate defect: the tracer's _indexSource setter only recognized 'index' and 'index-build' labels. runGet for klappy:// URIs takes a fast path that skips getIndex entirely and calls getFile directly. The fetcher emits 'file:${path}' spans, which the setter ignored. ~95% of oddkit_get calls hit this path, so even after #138, ~95% of get calls had cache_tier='none'.

Fix: extend the setter to also recognize 'file:*' labels alongside 'index' / 'index-build'. First-wins guard preserved:

- runSearch: 'index' fires first → index tier wins
- runGet (klappy://): only 'file:*' fires → file tier wins
- runGet (kb://, odd://): 'index' fires first → index tier wins (~5%)
- 'file-r2:*' (R2 miss with source='miss') excluded by guard

Internal field/getter names stay unchanged. The public envelope key 'index_source' in tracer.toJSON() is part of the response contract. Doc-comment updated to reflect the broader semantics.

4 regression tests added covering: file:* recognition, index-wins preserved, miss-spans excluded, original behavior unchanged. Test count: 18 → 22 passing.

This completes the cache_tier dashboard: every tool with a real data fetch now records its actual tier.
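The broadened setter described above can be sketched like this. The class and method names are illustrative stand-ins (the real code lives in `tracing.ts` and keeps its internal field names); only the label-matching and first-wins semantics mirror the commit text.

```typescript
// Hypothetical model of the tier-recording logic after #139.
class Tracer {
  private _indexSource: string | null = null;

  recordSpan(label: string, source: string): void {
    if (this._indexSource !== null) return; // first-wins guard
    if (source === "miss") return;          // e.g. 'file-r2:*' R2 misses never set a tier
    const isIndex = label === "index" || label === "index-build";
    const isFile = label.startsWith("file:"); // broadened by the follow-up
    if (isIndex || isFile) this._indexSource = source;
  }

  get indexSource(): string {
    return this._indexSource ?? "none";
  }
}

// runSearch-style: 'index' fires first, so the index tier wins.
const search = new Tracer();
search.recordSpan("index", "memory");
search.recordSpan("file:docs/a.md", "r2");
console.log(search.indexSource); // "memory"

// runGet (klappy://) fast path: only 'file:*' fires, so the file tier wins.
const get = new Tracer();
get.recordSpan("file:docs/a.md", "cache");
console.log(get.indexSource); // "cache"

// R2 miss spans are excluded by the source guard.
const miss = new Tracer();
miss.recordSpan("file-r2:docs/a.md", "miss");
console.log(miss.indexSource); // "none"
```

Note that `'file-r2:*'` labels would not match `startsWith("file:")` anyway; the `source === "miss"` guard is what the commit text names as the exclusion mechanism, so both checks are shown.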
Summary

Every tool call in production telemetry has been recording `cache_tier="none"` for the entire E0008.1 lifetime of the field — across all 7 days of available data and all 16 distinct tool names, including `oddkit_search` (286 calls) and `oddkit_orient` (20 calls), which definitely call `getIndex()` and emit `index` spans.

The bug isn't missing instrumentation. It's a timing race: the outer read of `tracer.indexSource` in `workers/src/index.ts` happens before the streaming tool handler has finished populating the tracer.

Diagnostic evidence

Captured during the session that found this bug:

| Read site | Value |
|---|---|
| `response.debug.trace.index_source` (inside `handleUnifiedAction`) | `"cache"` |
| telemetry blob9 (at `index.ts:981`) | `"none"` |

Same tracer instance. Two read sites. Two different values. Proves the race.
Root cause

`agents/mcp` returns a streaming Response. `await handler(request, env, ctx)` resolves with the Response object once headers are flushed and the body stream is constructed — but before the tool handler closure has finished running. The tool handler is what calls `fetcher.getIndex()`, which is what records the `index` span on the tracer.

So at line 981 of `workers/src/index.ts`, when we read `cacheTier = tracer.indexSource`:

- `_indexSource` is still `null`
- the getter falls back to `"none"`
- `recordTelemetry` writes `cache_tier="none"` for every call

The trace embedded in the response's debug envelope had the correct value because that snapshot is captured inside `handleUnifiedAction` at `orchestrate.ts:2806`, after the action ran. The outer reader was just too early.
Move the
cacheTier = tracer.indexSourceread inside thectx.waitUntilcallback, afterawait responseClone.text()resolves. Reading the response body to completion forces the streaming tool handler to have finished, which means the tracer is fully populated.Three lines of code change. Heavy comment explaining why.
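The shape of the relocated read can be sketched as follows. The tracer, streaming handler, and `ctx` here are simplified stand-ins runnable outside the Workers runtime; the assumption is only that the real code clones the Response and defers work via `ctx.waitUntil` as the PR text describes.

```typescript
// Minimal model of a Workers execution context.
type Ctx = { waitUntil(p: Promise<unknown>): void };
const pending: Promise<unknown>[] = [];
const ctx: Ctx = { waitUntil: (p) => pending.push(p) };

const tracer = { indexSource: "none" };

// Streaming handler: the Response exists immediately, but the tool logic
// that populates the tracer runs later, while the body streams.
const response = new Response(
  new ReadableStream({
    start(controller) {
      setImmediate(() => {
        tracer.indexSource = "r2"; // e.g. getIndex() recorded its span
        controller.enqueue(new TextEncoder().encode("{}"));
        controller.close();
      });
    },
  }),
);

const responseClone = response.clone();
let cacheTier = "unset";

ctx.waitUntil(
  (async () => {
    // Consuming the body to completion guarantees the streaming handler
    // has finished, so the tracer is fully populated by the next line.
    await responseClone.text();
    cacheTier = tracer.indexSource;
  })(),
);

await Promise.all(pending);
console.log(`cache_tier=${cacheTier}`); // "cache_tier=r2"
```

Reading `tracer.indexSource` before the `await responseClone.text()` line would still observe `"none"` — the whole fix is the ordering.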
Regression test

Added: `cache_tier reads must happen after the streaming response body completes`. It uses `setImmediate` to model a streaming tool handler that has not yet recorded its index access at the moment the outer handler's `await` resolves. It asserts:

- (a) the OLD pattern (read immediately after `await`) returns `"none"` — reproduces the production bug
- (b) the FIXED pattern (read after body consumption) returns the actual span source
- (c) a round-trip through `recordTelemetry` → blob9 carries the correct value

Without this test, a future refactor that hoists the read back out of `waitUntil` for "performance" would silently revert the fix, and CI would not catch it.

Test results

17 → 18 passing. `tracing.ts` is now compiled into the test build alongside `telemetry.ts` and `tokenize.ts`.
Open issue (out of scope, follow-up)

`getFile` emits `file:${path}` spans, not `"index"` / `"index-build"` spans. The tracer's `_indexSource` setter only recognizes the latter two labels. This means that even with this race fix, `oddkit_get` calls will record a `cache_tier` reflecting the index tier (resolved when `runGet` calls `getIndex` first) rather than the file tier (where the document was actually fetched from).

Whether `oddkit_get`'s blob9 should track the index tier (current behavior post-fix), the file tier (would need a new span label or setter logic), or both is a separate scope decision. Filing as a follow-up.

What this means for the cache_tier dashboards
After this lands and deploys, expect to see actual values populating blob9:

- `memory` — module-level cache hit (fastest, ~0ms)
- `cache` — Cloudflare Cache API edge hit (~1ms)
- `r2` — R2 durable storage read (~40ms)
- `build` — cold build from ZIP (worst case, seconds)
- `none` — no index loaded (e.g. `oddkit_version`, `oddkit_time`)

Aggregate latency will be queryable by tier. Module cache hit rates become visible. The dashboard the field was added for will finally have data.
Note

Medium Risk

Telemetry behavior changes for all MCP tool calls by deferring `cache_tier` capture until after the streaming body is consumed; the risk is mainly around response-cloning/body-consumption timing and potential performance or edge-case impacts in the deferred `waitUntil` path.

Overview

Fixes a streaming timing race where telemetry always recorded `cache_tier="none"` by moving the `tracer.indexSource` read into the deferred `ctx.waitUntil` callback, after `responseClone.text()` completes.

Extends `telemetry-integration.test.mjs` to compile/import `tracing.ts` and adds a regression test that simulates the streaming race and verifies that `cache_tier` (blob9) reflects the post-body tracer value rather than the premature fallback.

Reviewed by Cursor Bugbot for commit 262693c.