
Commit 8153745

Claude (drafting for klappy) committed
feat(telemetry): drop tokenize_ms — Workers timer is unmeasurable
Fourth smoke confirmed bytes_in/out and tokens_in/out work in production (357-21319 bytes_out, 142-5398 tokens_out across varied payload sizes). But tokenize_ms remained 0 across every row, even with the Date.now() fix from 279f761.

Root cause discovered by the agent: Cloudflare Workers freezes BOTH performance.now() AND Date.now() during synchronous CPU work. Both timers only advance on network I/O events, as a side-channel mitigation (documented at developers.cloudflare.com/workers/runtime-apis/web-standards/). Tokenization is pure CPU work, so any sub-request timing of it always reads 0 in production. This is a structural runtime constraint, not a bug we can patch.

Workarounds considered and rejected:

- Force artificial I/O between timer reads (KV.list, fetch): adds real latency to telemetry-only paths, grotesque
- Two writeDataPoint calls with start/end timestamps: over-engineered, doubles the write count, complicates queries
- Keep the column as always-0: actively misleading

Decision: drop tokenize_ms entirely from PayloadShape, the doubles array, the schema doc, and the tests. The bench at workers/test/tokenize.test.mjs already characterized the cost curve (cl100k handles 50 KB in ~1.3 ms on Node v22). bytes_out + tokens_out are sufficient signal: a future maintainer can predict tokenize_ms from the bench curve given the observed payload sizes.

Schema before:
  doubles: [count, duration_ms, bytes_in, bytes_out, tokens_in, tokens_out, tokenize_ms] // 7 fields

Schema after:
  doubles: [count, duration_ms, bytes_in, bytes_out, tokens_in, tokens_out] // 6 fields

Companion canon update at klappy/klappy.dev is coming in the next commit on that branch: it drops the tokenize_ms row from the doubles table and removes the tokenize_ms mention in "What This Enables".

Methodology: this is the fourth Workers Runtime != Node behavioral diff caught by live smoke on this branch. Each was unmeasurable from unit tests because Node behaves differently:

1. b94aaa6 (mine, broken): Content-Type filter (MCP returns SSE)
2. 1a555df (mine, broken): clone in waitUntil (body already drained)
3. 279f761 (mine, broken): Date.now() in Workers (frozen too)
4. THIS: drop the unmeasurable column entirely

The release-validation-gate canon doc is the only thing that surfaced each of these: the live preview smoke plus telemetry_public SQL caught what no test setup I could ship would have caught. The Workers-runtime gap was real and the gate worked.

Tests:
- 7/7 unit tests pass (workers/test/tokenize.test.mjs)
- 6/6 integration tests pass (workers/test/telemetry-integration.test.mjs)
- typecheck clean
1 parent 279f761 commit 8153745

5 files changed

Lines changed: 35 additions & 108 deletions
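
Before the diffs, a minimal sketch of the root cause (a hypothetical Worker, not code from this repo): on Node both deltas below are positive, while in a deployed Worker neither performance.now() nor Date.now() advances across the synchronous loop, so both read 0.

// Hypothetical Worker demonstrating the frozen-timer constraint.
// In Node this logs positive deltas; in production Workers both timers
// are frozen during synchronous CPU work and only advance on I/O events.
export default {
  async fetch(_req: Request): Promise<Response> {
    const t0 = Date.now();
    const p0 = performance.now();

    // Pure CPU work standing in for tokenization: no awaits, no I/O.
    let acc = 0;
    for (let i = 0; i < 5_000_000; i++) acc = (acc + i) % 9973;

    const dateDelta = Date.now() - t0;        // reads 0 in production
    const perfDelta = performance.now() - p0; // reads 0 in production
    return new Response(JSON.stringify({ acc, dateDelta, perfDelta }));
  },
};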


workers/src/index.ts

Lines changed: 4 additions & 3 deletions
@@ -958,9 +958,10 @@ export default {
 
     // Phase 1 telemetry — non-blocking, fire-and-forget (E0008)
     // Phase 1.5: cache_tier from tracer feeds blob9 (E0008.1)
-    // Phase 2: payload shape (bytes_in/out, tokens_in/out, tokenize_ms) feeds
-    // doubles 3–7. All measurement happens inside waitUntil so the response
-    // returns to the caller with zero added latency. Response body is
+    // Phase 2: payload shape (bytes_in/out, tokens_in/out) feeds doubles
+    // 3-6. tokenize_ms was tried and dropped — Workers freezes both
+    // performance.now() and Date.now() during synchronous CPU work, making
+    // sub-request timing of pure-CPU tokenization unmeasurable. Response body is
     // measured universally — MCP's Streamable HTTP transport returns SSE,
     // not JSON, so a Content-Type filter would (and did) drop almost every
     // response. The helper handles clone failures safely.
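
The ordering that comment encodes, sketched under stated assumptions: the handler wiring and the names durationMs and cacheTier are illustrative; only measurePayloadShape, recordTelemetry, and the clone-before-waitUntil lesson from 1a555df come from this branch.

// Sketch of the fire-and-forget shape described in the comment above.
// Clone BEFORE returning: cloning inside waitUntil (1a555df) found the
// body already drained. Reading the clone's text happens in background.
let clone: Response | null = null;
try {
  clone = response.clone();
} catch {
  clone = null; // body already used; telemetry records zeros for the response side
}

ctx.waitUntil(
  (async () => {
    // All measurement happens after the response is on the wire.
    const responseText = clone ? await clone.text().catch(() => "") : "";
    const shape = await measurePayloadShape(requestText, responseText);
    recordTelemetry(request, requestText, env, durationMs, cacheTier, shape);
  })(),
);

return response; // caller sees zero added latency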

workers/src/telemetry.ts

Lines changed: 10 additions & 13 deletions
@@ -39,17 +39,16 @@
  *          0 when tokenization was skipped or failed.
  * double6: tokens_out — cl100k_base token count of the response body. 0 for
  *          streamed responses or tokenizer failure.
- * double7: tokenize_ms — Total wall-clock time spent tokenizing both payloads
- *          in the waitUntil() background task. Distinct from
- *          the response trace — tokenization happens after the
- *          response is sent so it never adds user-facing latency.
- *          A value of 0 alongside non-zero bytes indicates the
- *          tokenizer was skipped (load failure or empty payload).
- *          Resolution is 1ms (Date.now), not sub-ms. Cloudflare
- *          Workers' performance.now() does not advance during
- *          synchronous CPU work, so it cannot measure pure-CPU
- *          tokenization. Sub-ms tokenizations round to 0; the
- *          bench-vs-prod comparison is therefore lower-bounded.
+ *
+ * NOTE: a previous iteration shipped a `double7: tokenize_ms` field intended
+ * to capture the wall-clock cost of tokenization for bench-vs-prod
+ * comparison. It is gone. Cloudflare Workers freezes both
+ * `performance.now()` and `Date.now()` between network I/O events as a
+ * timing-side-channel mitigation, so any timing of pure CPU work always
+ * reads 0 in production. The cost was characterized in the bench (workers/
+ * test/tokenize.test.mjs) and bytes_in/out + tokens_in/out are sufficient
+ * to predict per-call cost from that bench curve.
+ *
  * index1: sampling_key — consumer label (for sampling consistency)
  *
  * See: klappy://canon/constraints/telemetry-governance
@@ -258,7 +257,6 @@ export function recordTelemetry(
   const bytesOut = shape?.bytes_out ?? 0;
   const tokensIn = shape?.tokens_in ?? 0;
   const tokensOut = shape?.tokens_out ?? 0;
-  const tokenizeMs = shape?.tokenize_ms ?? 0;
 
   for (const payload of messages) {
     const { label: consumerLabel, source: consumerSource } = parseConsumerLabel(
@@ -296,7 +294,6 @@
         bytesOut, // double4: bytes_out
         tokensIn, // double5: tokens_in
         tokensOut, // double6: tokens_out
-        tokenizeMs, // double7: tokenize_ms
       ],
      indexes: [consumerLabel],
    });
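
Put together, a post-change data point carries the 6-wide doubles array. A sketch for orientation (blob contents other than blob3 tool_name and blob9 cache_tier are elided; this is not the repo's literal call site):

// Illustrative shape of one data point after this commit.
env.ODDKIT_TELEMETRY.writeDataPoint({
  blobs: [/* 9 entries; blob3 = tool_name, blob9 = cache_tier */],
  doubles: [
    1,          // double1: count
    durationMs, // double2: duration_ms
    bytesIn,    // double3: bytes_in
    bytesOut,   // double4: bytes_out
    tokensIn,   // double5: tokens_in
    tokensOut,  // double6: tokens_out
  ],
  indexes: [consumerLabel], // index1: sampling_key
});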

workers/src/tokenize.ts

Lines changed: 14 additions & 26 deletions
@@ -21,7 +21,7 @@
  * Failure mode: if the tokenizer fails to load or throws on a payload,
  * `countTokensSafe` returns null. Telemetry treats null as "not measured"
  * and writes `0` to keep the schema dense; the absence is visible in the
- * tokenize_ms column being 0 alongside non-zero bytes.
+ * tokens columns being 0 alongside non-zero bytes.
  *
  * See: klappy://canon/constraints/telemetry-governance
  */
@@ -67,32 +67,31 @@ export async function countTokensSafe(text: string): Promise<number | null> {
 
 /**
  * Result of measuring a payload pair. All fields default to 0 on failure
- * so the telemetry schema stays dense; the `tokenize_ms` field carries
- * the signal — a value of 0 alongside non-zero bytes indicates the
- * tokenizer was skipped or failed.
+ * so the telemetry schema stays dense; the absence of a real value is
+ * encoded by tokens_in / tokens_out being 0 alongside non-zero bytes
+ * (encoder skipped or failed).
+ *
+ * Note: this struct does NOT carry a tokenize_ms field. Cloudflare Workers
+ * freezes both `performance.now()` and `Date.now()` during synchronous
+ * CPU work as a timing-side-channel mitigation — neither timer advances
+ * unless a network I/O event occurs between reads. Tokenization is pure
+ * CPU work, so any sub-request timing of it would always read 0 in
+ * production. The cost was already characterized in the bench (bench
+ * file at workers/test/tokenize.test.mjs and integration test). We keep
+ * the bytes/tokens shape and trust the bench for the per-payload cost
+ * curve.
  */
 export interface PayloadShape {
   bytes_in: number;
   bytes_out: number;
   tokens_in: number;
   tokens_out: number;
-  tokenize_ms: number;
 }
 
 /**
  * Measure the byte and token shape of a request/response pair. Tokenization
  * is performed once per payload using the lazy-loaded cl100k_base encoder.
  * Bytes are measured via TextEncoder (UTF-8 byte length, the wire size).
- *
- * Timing note: uses `Date.now()` rather than `performance.now()`. Cloudflare
- * Workers' `performance.now()` does not advance during synchronous CPU work
- * (a deterministic-timing mitigation against timing-side-channel attacks —
- * the timer only ticks when the worker performs I/O). Tokenization is pure
- * CPU work, so `performance.now()` returns the same value before and after
- * the encode and `tokenize_ms` always reads 0 in production. `Date.now()`
- * always advances, at 1ms resolution. The bench-vs-prod comparison loses
- * sub-millisecond precision but gains a working signal — payloads that take
- * ≥1ms (8KB and up per the bench) show up as 1ms and above.
 */
 export async function measurePayloadShape(
   requestText: string,
@@ -102,26 +101,15 @@
   const bytes_in = requestText ? encoder.encode(requestText).length : 0;
   const bytes_out = responseText ? encoder.encode(responseText).length : 0;
 
-  const start = Date.now();
   const [tIn, tOut] = await Promise.all([
     countTokensSafe(requestText),
     countTokensSafe(responseText),
   ]);
-  const tokenize_ms = Date.now() - start;
-
-  // A `0` from countTokensSafe on empty text is a trivial short-circuit, not
-  // a real tokenization — only a non-null result on non-empty text proves the
-  // encoder ran. If neither payload was actually tokenized, zero out
-  // tokenize_ms to preserve the documented "skipped/failed" signal.
-  const tokenizerRan =
-    (requestText !== "" && tIn !== null) ||
-    (responseText !== "" && tOut !== null);
 
   return {
     bytes_in,
     bytes_out,
     tokens_in: tIn ?? 0,
     tokens_out: tOut ?? 0,
-    tokenize_ms: tokenizerRan ? tokenize_ms : 0,
   };
 }
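
With the field gone, per-call cost is inferred rather than measured. A sketch of that inference, assuming roughly linear scaling through the single bench point quoted in the commit message (~1.3 ms per 50 KB on Node v22); estimateTokenizeMs is a hypothetical helper, and the real curve in workers/test/tokenize.test.mjs should be preferred.

// Assumes linear scaling through one bench point: ~1.3 ms per 50 KB
// (cl100k_base on Node v22). Hypothetical helper, not a repo function.
const BENCH_MS_PER_BYTE = 1.3 / (50 * 1024);

function estimateTokenizeMs(bytesIn: number, bytesOut: number): number {
  return (bytesIn + bytesOut) * BENCH_MS_PER_BYTE;
}

// The largest smoke payload (21319 bytes_out) works out to ~0.54 ms,
// sub-millisecond even before the frozen-timer constraint, which is
// consistent with every production row reading 0.
estimateTokenizeMs(0, 21319); // ≈ 0.54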

workers/test/telemetry-integration.test.mjs

Lines changed: 6 additions & 59 deletions
@@ -6,14 +6,13 @@
  * recordTelemetry + measurePayloadShape with realistic JSON-RPC payloads.
  *
  * Verifies end-to-end:
- * - The full PayloadShape lands in doubles 3-7
+ * - The full PayloadShape lands in doubles 3-6
  * - bytes_in/out match TextEncoder UTF-8 byte length on the actual payloads
  * - tokens_in/out are positive integers when payloads are non-empty
- * - tokenize_ms is non-negative and finite
  * - Batch JSON-RPC produces one data point per message
  * - SSE simulation (responseText="") records zeros for the response side
  * - Tool-call payloads correctly populate blob3 (tool_name)
- * - The blob array is exactly 9 entries and the doubles array is exactly 7
+ * - The blob array is exactly 9 entries and the doubles array is exactly 6
  *
  * This is the verification that wrangler dev would have done — same code
  * path, same schema, real tokenizer.
@@ -171,7 +170,7 @@ await test("oddkit_time tool call lands a complete telemetry record", async () =
 
   // Schema shape
   assert.equal(point.blobs.length, 9, `blobs should be 9, got ${point.blobs.length}`);
-  assert.equal(point.doubles.length, 7, `doubles should be 7, got ${point.doubles.length}`);
+  assert.equal(point.doubles.length, 6, `doubles should be 6, got ${point.doubles.length}`);
   assert.equal(point.indexes.length, 1, "indexes should be 1");
 
   // Blobs
@@ -190,11 +189,9 @@ await test("oddkit_time tool call lands a complete telemetry record", async () =
   assert.equal(point.doubles[3], shape.bytes_out, "double4 = bytes_out");
   assert.equal(point.doubles[4], shape.tokens_in, "double5 = tokens_in");
   assert.equal(point.doubles[5], shape.tokens_out, "double6 = tokens_out");
-  assert.equal(point.doubles[6], shape.tokenize_ms, "double7 = tokenize_ms");
 
   console.log(` bytes_in=${shape.bytes_in} bytes_out=${shape.bytes_out} ` +
-    `tokens_in=${shape.tokens_in} tokens_out=${shape.tokens_out} ` +
-    `tokenize_ms=${shape.tokenize_ms.toFixed(3)}`);
+    `tokens_in=${shape.tokens_in} tokens_out=${shape.tokens_out}`);
 });
 
 // ─── Test 2: oddkit_search with realistic large response ───────────────────
@@ -227,18 +224,13 @@ await test("oddkit_search with realistic ~8KB response — measurements are sane
   assert.ok(shape.bytes_out > 5000, `bytes_out should be > 5000, got ${shape.bytes_out}`);
   assert.ok(shape.tokens_out > 1000, `tokens_out should be > 1000, got ${shape.tokens_out}`);
 
-  // Tokenization cost should be in the bench-predicted range (1-5ms for ~8KB)
-  assert.ok(shape.tokenize_ms < 100,
-    `tokenize_ms should be < 100ms for ~8KB payload (bench predicted ~1ms), got ${shape.tokenize_ms}`);
-
   console.log(` bytes_out=${shape.bytes_out} (~${(shape.bytes_out/1024).toFixed(1)}KB) ` +
-    `tokens_out=${shape.tokens_out} ` +
-    `tokenize_ms=${shape.tokenize_ms.toFixed(3)} (bench predicted ~1ms for 8KB)`);
+    `tokens_out=${shape.tokens_out}`);
 });
 
 // ─── Test 3: SSE response (empty body) records zeros ───────────────────────
 
-await test("SSE response (empty body) records bytes_out=0, tokens_out=0, tokenize_ms=0", async () => {
+await test("SSE response (empty body) records bytes_out=0 and tokens_out=0", async () => {
   const env = mockEnv();
   const requestBody = JSON.stringify({
     jsonrpc: "2.0",
@@ -256,28 +248,6 @@
   assert.ok(point.doubles[2] > 0, "bytes_in should still be > 0");
 });
 
-// Bugbot's fix (commit c4f5752) — distinguish "encoder ran" from
-// "encoder short-circuited on empty input." If the response is empty (SSE)
-// AND the encoder only ran on the request, that still counts as "ran" and
-// tokenize_ms must reflect the real cost. But if BOTH sides are empty,
-// tokenize_ms must be 0. This case locks both halves of that invariant in.
-await test("Bugbot invariant: tokenize_ms is 0 only when encoder did not actually run", async () => {
-  // Case A: both empty → tokenize_ms must be 0 (no encoder call did meaningful work)
-  const bothEmpty = await measurePayloadShape("", "");
-  assert.equal(bothEmpty.tokenize_ms, 0,
-    `both empty: tokenize_ms must be 0, got ${bothEmpty.tokenize_ms}`);
-
-  // Case B: request only → tokenize_ms can be non-zero (encoder ran on request)
-  const requestOnly = await measurePayloadShape("hello world payload", "");
-  assert.ok(requestOnly.tokenize_ms >= 0, "tokenize_ms must be >= 0");
-  assert.ok(requestOnly.tokens_in > 0, "tokens_in should be > 0 when request has content");
-  // tokenize_ms may be 0 if the call was extremely fast, but it must NOT be
-  // forced to zero just because responseText is empty. Confirming only that
-  // the field is present and finite — the prior bug was a non-zero value
-  // being recorded when nothing ran, not the inverse.
-  assert.ok(Number.isFinite(requestOnly.tokenize_ms), "tokenize_ms must be finite");
-});
-
 // ─── Test 4: Batch JSON-RPC writes one point per message ───────────────────
 
 await test("batch JSON-RPC produces one data point per message", async () => {
@@ -329,28 +299,5 @@ await test("missing env.ODDKIT_TELEMETRY is a graceful no-op", async () => {
   recordTelemetry(mockRequest(), requestBody, env, 5, "memory", shape);
 });
 
-// ─── Test 7: The tokenize_ms warm-vs-cold pattern ──────────────────────────
-
-await test("tokenize_ms cold-call > warm-call (encoder caches across calls)", async () => {
-  const reqA = JSON.stringify({ jsonrpc: "2.0", id: 1, method: "tools/call",
-    params: { name: "oddkit_time", arguments: {} } });
-  const resA = JSON.stringify({ jsonrpc: "2.0", id: 1, result: { x: 1 } });
-
-  const cold = await measurePayloadShape(reqA, resA);
-  const warm = await measurePayloadShape(reqA, resA);
-  const warmer = await measurePayloadShape(reqA, resA);
-
-  console.log(` cold=${cold.tokenize_ms.toFixed(3)}ms ` +
-    `warm=${warm.tokenize_ms.toFixed(3)}ms ` +
-    `warmer=${warmer.tokenize_ms.toFixed(3)}ms`);
-
-  // The warm calls should be bounded — not asserting strict ordering
-  // because timing jitter can flip them, but the median should be tiny.
-  assert.ok(warm.tokenize_ms < 50,
-    `warm tokenize_ms should be < 50ms, got ${warm.tokenize_ms}`);
-  assert.ok(warmer.tokenize_ms < 50,
-    `warmer tokenize_ms should be < 50ms, got ${warmer.tokenize_ms}`);
-});
-
 console.log(`\n${pass} passed, ${fail} failed`);
 process.exit(fail > 0 ? 1 : 0);

workers/test/tokenize.test.mjs

Lines changed: 1 addition & 7 deletions
@@ -94,7 +94,7 @@ await test("countTokensSafe scales with text length", async () => {
 
 await test("measurePayloadShape returns all required fields as numbers", async () => {
   const s = await measurePayloadShape("request", "response");
-  for (const field of ["bytes_in", "bytes_out", "tokens_in", "tokens_out", "tokenize_ms"]) {
+  for (const field of ["bytes_in", "bytes_out", "tokens_in", "tokens_out"]) {
     assert.ok(field in s, `missing field: ${field}`);
     assert.equal(typeof s[field], "number", `${field} must be number, got ${typeof s[field]}`);
   }
@@ -117,12 +117,6 @@ await test("measurePayloadShape produces positive token counts for non-empty inp
   assert.ok(s.tokens_out > 0, "tokens_out should be > 0");
 });
 
-await test("measurePayloadShape tokenize_ms is non-negative and finite", async () => {
-  const s = await measurePayloadShape("a", "b");
-  assert.ok(s.tokenize_ms >= 0, "tokenize_ms must be >= 0");
-  assert.ok(Number.isFinite(s.tokenize_ms), "tokenize_ms must be finite");
-});
-
 await test("measurePayloadShape handles empty response (SSE skipped)", async () => {
   const s = await measurePayloadShape("hello", "");
   assert.equal(s.bytes_out, 0);
