
fix(telemetry): cache_tier streaming-race — read tracer AFTER body consumption #138

Merged

klappy merged 1 commit into main from fix/cache-tier-streaming-race on Apr 26, 2026

Conversation

klappy (Owner) commented on Apr 26, 2026

Summary

Every tool call in production telemetry has been recording cache_tier="none" for the entire E0008.1 lifetime of the field — across all 7 days of available data and all 16 distinct tool names, including oddkit_search (286 calls) and oddkit_orient (20 calls), which definitely call getIndex() and emit index spans.

The bug isn't missing instrumentation. It's a timing race: the outer read of tracer.indexSource in workers/src/index.ts happens before the streaming tool handler has finished populating the tracer.

Diagnostic evidence

Captured during the session that found this bug:

oddkit_search call ──► response.debug.trace.index_source = "cache"
                       (correct — read inside handleUnifiedAction
                       at orchestrate.ts:2806, AFTER action ran)

                       telemetry blob9 = "none"
                       (incorrect — read at index.ts:981, BEFORE
                       the streaming tool handler completed)

Same tracer instance. Two read sites. Two different values. Proves the race.

Root cause

agents/mcp returns a streaming Response. await handler(request, env, ctx) resolves with the Response object once headers are flushed and the body stream is constructed — but before the tool handler closure has finished running. The tool handler is what calls fetcher.getIndex(), which is what records the index span on the tracer.

So at line 981 of workers/src/index.ts, when we read cacheTier = tracer.indexSource:

  • _indexSource is still null
  • The getter returns the fallback "none"
  • That value is captured in the closure for recordTelemetry
  • Production sees cache_tier="none" for every call

The trace embedded in the response's debug envelope had the correct value because that snapshot is captured inside handleUnifiedAction at orchestrate.ts:2806, after the action ran. The outer reader was just too early.
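
To make the timing concrete, here is a minimal, self-contained sketch of the race. The FakeTracer class, the streaming handler, and the span label are simplified stand-ins, not the real tracing.ts or agents/mcp code:

```ts
type IndexTier = "memory" | "cache" | "r2" | "build" | "none";

// Simplified stand-in for the real tracer: the getter falls back to "none"
// until an index span has been recorded.
class FakeTracer {
  private _indexSource: IndexTier | null = null;
  recordIndexSpan(source: IndexTier) { this._indexSource = source; }
  get indexSource(): IndexTier { return this._indexSource ?? "none"; }
}

// Stand-in for the streaming MCP handler: the Response is returned immediately,
// but the body (and the getIndex-style span it records) is produced later.
async function streamingToolHandler(tracer: FakeTracer): Promise<Response> {
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      await new Promise((resolve) => setTimeout(resolve, 0)); // tool handler still running
      tracer.recordIndexSpan("cache");                        // span recorded late
      controller.enqueue(new TextEncoder().encode("{}"));
      controller.close();
    },
  });
  return new Response(body);
}

async function demonstrateRace(): Promise<void> {
  const tracer = new FakeTracer();
  const response = await streamingToolHandler(tracer);

  const tooEarly = tracer.indexSource;   // "none"  <- what production telemetry captured
  await response.clone().text();         // consuming the body lets the tool handler finish
  const afterBody = tracer.indexSource;  // "cache" <- what the debug envelope captured

  console.log({ tooEarly, afterBody });
}

demonstrateRace();
```

Running this prints { tooEarly: "none", afterBody: "cache" }, the same two-read-sites discrepancy shown in the diagnostic above.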

Fix

Move the cacheTier = tracer.indexSource read inside the ctx.waitUntil callback, after await responseClone.text() resolves. Reading the response body to completion forces the streaming tool handler to have finished, which means the tracer is fully populated.

A three-line code change, with a heavy comment explaining why.
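
For orientation, a sketch of the fixed shape of the outer handler. handler, recordTelemetry, tracer, and the ctx interface are simplified assumptions standing in for the real workers/src/index.ts wiring, not the actual code:

```ts
// Sketch only; the real read site is around workers/src/index.ts:981.
interface WaitableCtx { waitUntil(promise: Promise<unknown>): void; }

declare function handler(request: Request): Promise<Response>;                   // streaming MCP handler (assumed)
declare function recordTelemetry(fields: { cacheTier: string }): Promise<void>;  // assumed signature
declare const tracer: { readonly indexSource: string };                          // assumed accessor

async function handleFetch(request: Request, ctx: WaitableCtx): Promise<Response> {
  const response = await handler(request);   // resolves while the tool handler may still be running
  const responseClone = response.clone();

  ctx.waitUntil((async () => {
    // Reading the body to completion forces the streaming tool handler to
    // finish, so the tracer is fully populated before we read it.
    await responseClone.text();

    // The fix: capture cache_tier HERE, inside waitUntil, after body consumption.
    const cacheTier = tracer.indexSource;
    await recordTelemetry({ cacheTier });
  })());

  return response;
}
```

As described above, only the read moves: the deferred ctx.waitUntil path and the responseClone.text() call are already part of the telemetry flow.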

Regression test

The new test, `cache_tier reads must happen after the streaming response body completes`, uses setImmediate to model a streaming tool handler that has not yet recorded its index access at the moment the outer handler's await resolves (a sketch follows below). It asserts:

  • (a) OLD pattern (read immediately after await) returns "none" — reproduces the production bug
  • (b) FIXED pattern (read after body consumption) returns the actual span source
  • (c) Round-trip through recordTelemetry → blob9 carries the correct value

Without this test, a future refactor that hoists the read back out of waitUntil for "performance" would silently revert the fix and CI would not catch it.
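
A rough sketch of what such a test can look like with node:test and a stand-in tracer variable; the real test in telemetry-integration.test.mjs differs in detail, and assertion (c), the recordTelemetry round-trip, is omitted here:

```ts
import { test } from "node:test";
import assert from "node:assert/strict";

test("cache_tier reads must happen after the streaming response body completes", async () => {
  // Minimal stand-in for the tracer: falls back to "none" until a span lands.
  let indexSource: "none" | "cache" = "none";

  // setImmediate models a streaming tool handler that has not yet recorded
  // its index access at the moment the outer handler's await resolves.
  const response = new Response(
    new ReadableStream<Uint8Array>({
      start(controller) {
        setImmediate(() => {
          indexSource = "cache"; // span recorded late, by the tool handler
          controller.enqueue(new TextEncoder().encode("{}"));
          controller.close();
        });
      },
    }),
  );

  // (a) OLD pattern: reading immediately after the await reproduces the bug.
  assert.equal(indexSource, "none");

  // (b) FIXED pattern: reading after body consumption sees the actual span source.
  await response.clone().text();
  assert.equal(indexSource, "cache");
});
```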

Test results

18 passed, 0 failed
(was 17 — added 1)

Open issue (out of scope, follow-up)

getFile emits file:${path} spans, not "index" / "index-build" spans. The tracer's _indexSource setter only recognizes the latter two labels. This means that even with this race-fix, oddkit_get calls will record cache_tier reflecting the index tier (resolved when runGet calls getIndex first) rather than the file tier (where the document was actually fetched from).

Whether oddkit_get's blob9 should track the index tier (current behavior post-fix), the file tier (would need new span label or setter logic), or both is a separate scope decision. Filing as a follow-up.
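
To make the limitation concrete, a toy paraphrase of the label check being described; the function name and example path are hypothetical, not the actual tracing.ts code:

```ts
// Toy paraphrase of the current label check (pre-follow-up).
function feedsIndexSource(spanLabel: string): boolean {
  // Only these two labels ever feed _indexSource today.
  return spanLabel === "index" || spanLabel === "index-build";
}

feedsIndexSource("index");              // true  -> getIndex() span sets the tier
feedsIndexSource("index-build");        // true  -> cold-build span sets the tier
feedsIndexSource("file:docs/guide.md"); // false -> getFile() span is ignored;
                                        //          the file tier never feeds cache_tier
```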

What this means for the cache_tier dashboards

After this lands and deploys, expect to see actual values populating blob9:

  • memory — module-level cache hit (fastest, ~0ms)
  • cache — Cloudflare Cache API edge hit (~1ms)
  • r2 — R2 durable storage read (~40ms)
  • build — cold build from ZIP (worst case, seconds)
  • none — no index loaded (e.g. oddkit_version, oddkit_time)

Aggregate latency will be queryable by tier. Module cache hit rates become visible. The dashboard the field was added for will finally have data.


Note

Medium Risk
Telemetry behavior changes for all MCP tool calls by deferring cache_tier capture until after the streaming body is consumed; risk is mainly around response-cloning/body-consumption timing and potential performance/edge-case impacts in the deferred waitUntil path.

Overview
Fixes a streaming timing race where telemetry always recorded cache_tier="none" by moving the tracer.indexSource read into the deferred ctx.waitUntil callback after responseClone.text() completes.

Extends telemetry-integration.test.mjs to compile/import tracing.ts and adds a regression test that simulates the streaming race and verifies cache_tier (blob9) reflects the post-body tracer value rather than the premature fallback.

Reviewed by Cursor Bugbot for commit 262693c.

fix(telemetry): cache_tier streaming-race — read tracer AFTER body consumption

The MCP handler from agents/mcp returns a streaming Response. `await
handler(request, env, ctx)` resolves with the Response object before
the tool handler closure has finished populating the tracer. Reading
`tracer.indexSource` immediately after that await yields "none"
because the "index" / "index-build" span has not been recorded yet.

This is why every tool call in production telemetry shows
cache_tier="none" across all 7 days of data — even oddkit_search,
oddkit_orient, and oddkit_catalog which definitely call getIndex().

The trace embedded in the response's debug envelope had the correct
value (e.g. index_source="cache") because that snapshot is captured
inside handleUnifiedAction at orchestrate.ts:2806, AFTER the action
ran. But the OUTER read in workers/src/index.ts ran too early and
captured the empty initial state.

Fix: move the `cacheTier = tracer.indexSource` read INSIDE the
ctx.waitUntil callback, AFTER `await responseClone.text()` resolves.
Reading the response body to completion forces the streaming tool
handler to have finished, which means the tracer is fully populated.

Diagnostic evidence captured during this session:

  oddkit_search call → response.debug.trace.index_source = "cache"
                       (correct — read inside handleUnifiedAction)
                       telemetry blob9 = "none"
                       (incorrect — read at index.ts:981 too early)

  Same tracer instance, two read sites, two different values. Proves
  the race.

Regression test added: `cache_tier reads must happen after the
streaming response body completes`. Uses setImmediate to model the
streaming tool handler that has not yet recorded its index access at
the moment the outer handler's await resolves. Asserts:

  (a) OLD pattern (read immediately after await) returns "none" —
      reproduces the production bug
  (b) FIXED pattern (read after body consumption) returns the actual
      span source
  (c) Round-trip through recordTelemetry → blob9 carries the correct
      value when fed the post-body-consumption read

Test count: 17 → 18 passing. tracing.ts now compiled into the test
build alongside telemetry.ts and tokenize.ts.

Open issue (separate fix, not in scope here):

  getFile emits `file:${path}` spans, not `index` / `index-build`
  spans. The tracer's _indexSource setter only matches the latter
  two labels. This means that even with this race-fix, oddkit_get
  may still record cache_tier values that reflect the index tier
  (resolved during the action's getIndex call) rather than the
  document fetch tier. Whether oddkit_get's blob9 should track the
  index tier (current behavior post-fix), the file tier (new span
  label needed), or both is a separate scope decision. Filing as a
  follow-up rather than expanding this PR.
@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

✅ Deployment successful! oddkit, commit 262693c, Apr 26 2026, 02:36 AM (UTC).
klappy merged commit 838d97c into main on Apr 26, 2026 (5 checks passed).
klappy deleted the fix/cache-tier-streaming-race branch on Apr 26, 2026, 02:54.
klappy added a commit that referenced this pull request Apr 26, 2026
…th (#139)

Follow-up to #138. The streaming-race fix corrected the timing bug,
but oddkit_get for klappy:// URIs still recorded cache_tier="none"
because of a separate defect: the tracer's _indexSource setter only
recognized 'index' and 'index-build' labels.

runGet for klappy:// URIs takes a fast path that skips getIndex
entirely and calls getFile directly. The fetcher emits 'file:${path}'
spans, which the setter ignored. ~95% of oddkit_get calls hit this
path, so even after #138, ~95% of get calls had cache_tier='none'.

Fix: extend the setter to also recognize 'file:*' labels alongside
'index' / 'index-build'. First-wins guard preserved:

  - runSearch: 'index' fires first → index tier wins
  - runGet (klappy://): only 'file:*' fires → file tier wins
  - runGet (kb://, odd://): 'index' first → index tier wins (~5%)
  - 'file-r2:*' (R2 miss with source='miss') excluded by guard
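
A hedged sketch of what the extended first-wins setter could look like under these rules; it paraphrases the commit description rather than reproducing the actual tracing.ts diff:

```ts
// Paraphrased sketch; label and field names follow the description above, not the real code.
class TracerSketch {
  private _indexSource: string | null = null;

  recordSpan(label: string, source: string): void {
    const isIndexSpan = label === "index" || label === "index-build";
    // New: "file:<path>" spans also qualify. Note that "file-r2:<path>" does not
    // match the "file:" prefix, which is one reading of how miss-spans stay excluded.
    const isFileSpan = label.startsWith("file:");

    // First-wins guard preserved: whichever qualifying span fires first sets the tier.
    if ((isIndexSpan || isFileSpan) && this._indexSource === null) {
      this._indexSource = source; // e.g. "memory" | "cache" | "r2" | "build"
    }
  }

  // Internal name unchanged; the public envelope key remains index_source in toJSON().
  get indexSource(): string {
    return this._indexSource ?? "none";
  }
}
```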

Internal field/getter names stay unchanged. Public envelope key
'index_source' in tracer.toJSON() is part of the response contract.
Doc-comment updated to reflect broader semantic.

4 regression tests added covering: file:* recognition, index-wins
preserved, miss-spans excluded, original behavior unchanged.

Test count: 18 → 22 passing.

This completes the cache_tier dashboard: every tool with a real data
fetch now records its actual tier.
