Merged
21 commits
16700ec
canon: add measure-before-you-object + performed-prudence; expand tel…
Apr 23, 2026
95df456
canon(telemetry): document tokenize_ms 1ms resolution caveat
Apr 23, 2026
7cadd94
canon(telemetry): drop tokenize_ms — Workers timer is unmeasurable
Apr 23, 2026
224920a
fix(canon): quote tag strings in frontmatter (schema compliance)
Apr 23, 2026
cfbc3d0
writings: add 'Penny Wise and Pound Foolish — Why I Build the Dream H…
klappy Apr 23, 2026
baad0b8
remove: gauntlet journal from repo — process ephemera, not canon
klappy Apr 23, 2026
78cd5ff
essay(rev9): correct false journey arc — author has fought pre-optimi…
klappy Apr 23, 2026
733cdf3
essay(rev10): hook leads with the feature, not the conflict
klappy Apr 24, 2026
6c88e0d
essay(rev11): correct house-experience timeline — six months ago, not…
klappy Apr 24, 2026
34cefd6
essay(rev12): correct false provenance — principle predates Debbie + …
klappy Apr 24, 2026
fb45858
essay(rev13): remove invented self-implication beats — author pushed …
klappy Apr 24, 2026
189a7c1
essay(rev14): surface operational thesis — test, don't argue + strike…
klappy Apr 24, 2026
06e7f02
essay(rev15): elevate thesis to hook, summary, and thesis position
klappy Apr 24, 2026
66c4c8c
essay(rev16): reword thesis — 'Measuring is cheaper than arguing — ev…
klappy Apr 24, 2026
40004dd
essay(rev17): add 'They Don't Have the Baggage' section — competitive…
klappy Apr 24, 2026
a371704
essay(rev18): correct historical record — there were four objections,…
klappy Apr 24, 2026
8d33b97
essay(rev19): add 'The Discernment Layer' section — what human expert…
klappy Apr 24, 2026
7ba80e1
essay(rev20): length-tightening pass — 27% cut, 11 sections, matches …
klappy Apr 24, 2026
e49d910
essay(rev21): close residual 'Most of the time' pair in Dream House c…
klappy Apr 24, 2026
212203a
essay(rev22): fix derives_from schema violation caught by independent…
klappy Apr 24, 2026
2ef7438
canon: close discoverability gap — two new tier-1 principles + cross-…
klappy Apr 24, 2026
136 changes: 136 additions & 0 deletions canon/constraints/measure-before-you-object.md
@@ -0,0 +1,136 @@
---
uri: klappy://canon/constraints/measure-before-you-object
title: "Measure Before You Object — Theoretical Concerns Require Empirical Falsification"
audience: canon
exposure: nav
tier: 1
voice: neutral
stability: semi_stable
tags: ["canon", "constraint", "engineering", "measurement", "axiom-1", "axiom-4", "optimization", "prudence", "collaboration", "both-audiences"]
epoch: E0008
date: 2026-04-23
derives_from: "canon/values/axioms.md"
complements: "canon/observations/performed-prudence-anti-pattern.md, canon/constraints/mode-discipline-and-bottleneck-respect.md"
governs: "All work where a contributor (model or human) raises a performance, cost, complexity, or scaling concern about a proposal under consideration"
status: active
---

# Measure Before You Object — Theoretical Concerns Require Empirical Falsification

> A concern is a claim. An unverified claim is a liability. When a contributor raises a performance, cost, or complexity worry that could block a proposal, the worry must be measured before it counts as engineering. If a fifteen-minute experiment would resolve the question, the experiment is mandatory. Performed prudence is not prudence.

---

## Summary — The Concern Itself Must Satisfy the Axioms

This constraint applies the project axioms to a category of speech that usually escapes them: the cautionary objection. When a contributor says "this might be too slow" or "the bundle could blow up" or "we'd be picking a tokenizer, which is a domain opinion," the contributor is making truth claims about the world. Per Axiom 1, reality is sovereign over those claims. Per Axiom 4, an unverified claim is not knowledge.

The default in most engineering cultures is the opposite. Raising concerns is rewarded as diligence. Confirming or falsifying them is treated as optional follow-up. The contributor who lists ten possible problems looks more careful than the contributor who measured one and shipped. This pattern is named separately in `canon/observations/performed-prudence-anti-pattern`. This document is the rule that prevents it.

The rule has two halves. First: if a measurement would change the decision and is cheap to obtain, measuring is required before the concern blocks work. Second: when measurement is genuinely impractical, the concern must be labeled as untested and demoted from a blocker to a watch-item. Neither half permits raising a concern, leaving it speculative, and treating it as if it had been investigated.

---

## The Rule — Falsify or Defer

When a contributor identifies a possible problem with a proposal, exactly one of three responses is permitted:

1. **Falsify.** Run the cheapest experiment that would resolve the question. Report numbers. The numbers determine whether the concern is real.
2. **Defer with a label.** State explicitly that the concern is untested, identify what would be required to test it, and demote it from a blocker to a tracked watch-item.
3. **Drop it.** If the experiment is cheap and the contributor is unwilling to run it, the concern was never load-bearing.

The forbidden fourth response is the one this constraint exists to prevent: raise the concern, decline to test it, and let its rhetorical weight influence the decision anyway. That is performed prudence and it costs the project real value.

---

## The Fifteen-Minute Test

Measurement is mandatory when both of the following are true:

- A reasonable measurement would resolve the disagreement.
- The measurement can be obtained in roughly fifteen minutes of work.

Examples of fifteen-minute measurements:

- A microbenchmark of a candidate library on representative payloads.
- A bundle-size delta from adding a dependency to a build.
- A query against existing telemetry to check whether a feared usage pattern actually occurs.
- A diff count or surface-area estimate from a small spike branch.
- A back-of-envelope cost calculation using documented per-unit pricing.

Examples that exceed the test and may legitimately defer:

- A load test requiring production traffic shaping.
- A migration simulation requiring a representative dataset that does not exist.
- A user-study question requiring real users.

When the experiment is not cheap, the concern still cannot be raised as a blocker without a label. The label converts speculation into an honest unknown.
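When the cheap path applies, the measurement rarely needs more than a timing harness. A minimal sketch of such a microbenchmark in Node, using only the standard library: `bench`, the payload sizes, and the `JSON.parse` stand-in are all illustrative, not project code — substitute the candidate library actually under question.

```javascript
// Representative payloads: stand-ins for real request bodies at three sizes.
const payloads = [200, 2_000, 50_000].map((n) =>
  JSON.stringify({ body: "x".repeat(n) })
);

// Time one candidate function across the payloads; report per-call p50/p95.
// Uses the global `performance` (available in Node 16+ and in Workers).
function bench(fn, inputs, iterations = 200) {
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    for (const input of inputs) {
      const start = performance.now();
      fn(input);
      samples.push(performance.now() - start);
    }
  }
  samples.sort((a, b) => a - b);
  const pick = (q) => samples[Math.floor(q * (samples.length - 1))];
  return { p50: pick(0.5), p95: pick(0.95) };
}

// JSON.parse stands in for the candidate library under test.
const result = bench(JSON.parse, payloads);
console.log(`p50=${result.p50.toFixed(3)}ms p95=${result.p95.toFixed(3)}ms`);
```

Reporting the p95 alongside the median matters: a candidate with a fine median but spiky tail latency is exactly the kind of result a list of unmeasured worries would never surface.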

---

## Anti-Laundering Clause

A concern does not become engineering by being prefaced with hedge language. The following constructions are speculation, regardless of how prudent they sound:

- "I'm worried this might..."
- "There's a risk that..."
- "We should be careful about..."
- "This could become a problem if..."
- "Worth flagging that..."

Each of these is a claim about the world dressed as caution. If the underlying question is cheap to test, the hedge does not absolve the contributor of testing it. Hedge language is appropriate when paired with the explicit defer label from the rule above. Hedge language used to launder unmeasured concerns into the decision record is a violation of this constraint.

---

## Application Across Audiences

This constraint binds both model contributors and human contributors.

For model contributors, the failure mode usually appears in execution-adjacent work: the model proposes a solution, then immediately accumulates theoretical objections to it. The objections sound careful. The model has not measured any of them. The result is either reversion to the operator with a cluster of speculative worries, or a watered-down proposal that pre-emptively avoids problems that may not exist. Both outcomes burn the operator's attention on speculation rather than evidence.

For human contributors, the failure mode usually appears in code review, design review, or planning meetings. The reviewer raises a concern that sounds senior. The original author either capitulates or spends days investigating something that a fifteen-minute test could have settled in the original meeting. The reviewer's reputation as careful is reinforced. The author's velocity is taxed. The project pays the cost in calendar time.

The remedy is identical in both cases. The contributor raising the concern owns the falsification, or owns the explicit deferral label. Concerns without one of these are dropped from the decision record.

---

## What This Constraint Does Not Forbid

This constraint is not a license to skip review or to dismiss instincts. Specifically:

- **Pattern-matching from experience is allowed and valuable.** A contributor who has been burned by a similar failure may have low-cost evidence in the form of prior incidents. Citing that evidence is itself measurement.
- **Architectural concerns that resist easy measurement are still legitimate.** A worry about long-term maintainability or about a coupling pattern may not be benchable. Such concerns must be raised with the explicit deferral label and routed through the appropriate planning channel.
- **Stop-the-line concerns about safety, correctness, or governance are not subject to the fifteen-minute test.** A concern that a proposal violates an axiom, breaks a published contract, or risks data loss is a categorical block, not a performance-style worry.

The constraint targets the specific failure of treating speculative cost-and-complexity worries as if they were investigated.

---

## Verification

A contributor's adherence to this constraint can be verified by inspecting the decision record. For each concern raised in a planning or review session, exactly one of the following must be present:

- A measurement result that confirms or falsifies the concern.
- An explicit "untested, watch-item" label with a description of what would be required to test it.
- Removal of the concern from the record once the contributor recognized it could not meet either bar.

A decision record that contains unmeasured, unlabeled concerns is evidence that this constraint was not applied.

---

## Origin

This constraint was written after a session in which the model proposed three theoretical objections to a telemetry instrumentation idea — bundle-size impact, a vodka-architecture violation from picking a tokenizer, and the concern that the tokenizer choice was a domain opinion baked into the server. All three objections sounded prudent. None had been measured. When the operator pushed back and asked for actual numbers, a five-minute benchmark falsified all three, and a "safer" heuristic the model had proposed in their place turned out to be 34% wrong. The case study is documented in `canon/observations/performed-prudence-anti-pattern`.

---

## See Also

- [Performed Prudence Anti-Pattern](klappy://canon/observations/performed-prudence-anti-pattern) — the failure mode this constraint prevents, with the case study
- [The Dream House Principle](klappy://canon/principles/dream-house-principle) — the sequencing principle that pairs with this constraint: draw the full version before cutting, and this rule governs what objections can block the draw
- [The Discernment Layer](klappy://canon/principles/discernment-layer) — the counterpart principle; the operator's expertise is what decides when the BS meter should fire and the test should be called for
- [Capability Is Not Permission](klappy://canon/principles/capability-is-not-permission) — the cost-structure shift that makes the fifteen-minute-test tractable as a default rather than a heroic move
- [Axioms](klappy://canon/values/axioms) — Axiom 1 (Reality Is Sovereign) and Axiom 4 (You Cannot Verify What You Did Not Observe) are the source
- [Mode Discipline and Bottleneck Respect](klappy://canon/constraints/mode-discipline-and-bottleneck-respect) — the operator's attention is the bottleneck; speculative concerns tax it
- [Time Blindness — The Axiom Violation Hiding in Every Model](klappy://canon/observations/time-blindness-axiom-violation) — the same axiom-violation pattern applied to a different dimension of reality
- [The Dream House and Pre-Optimization](klappy://writings/the-dream-house-and-pre-optimization) — sibling essay; the tokenizer session is the worked case and the economic context for why this rule is load-bearing now
27 changes: 23 additions & 4 deletions canon/constraints/telemetry-governance.md
@@ -57,10 +57,26 @@ All tracking occurs on `/mcp` POST envelopes. One data point is written per JSON

### Numeric Values (Doubles)

| Value | What It Records |
|-------|----------------|
| Count | Always `1`, for SUM aggregation |
| Duration | Request processing time in milliseconds |
| # | Value | What It Records |
|---|-------|----------------|
| 1 | Count | Always `1`, for SUM aggregation |
| 2 | `duration_ms` | Full request wall-clock at the worker edge — includes V8 cold-start, KB fetch, MCP SDK overhead, and handler compute |
| 3 | `bytes_in` | UTF-8 byte length of the JSON-RPC request body. Tokenizer-agnostic; exact wire size. `0` when the body could not be read |
| 4 | `bytes_out` | UTF-8 byte length of the response body. `0` for streamed (SSE) responses where the body cannot be measured without consuming the stream |
| 5 | `tokens_in` | `cl100k_base` token count of the request body. See *Tokenizer Choice* below for rationale. `0` when tokenization was skipped or failed |
| 6 | `tokens_out` | `cl100k_base` token count of the response body. `0` for streamed responses or tokenizer failure |
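The byte and token doubles above can be sketched as follows. This is an illustrative shape, not the shipped handler code: `measurePayload` and its signature are hypothetical, the zero-on-failure fallback is modeled on the table's rules, and the whitespace splitter in the usage line is a stand-in for the real `cl100k_base` counter.

```javascript
// Hypothetical sketch of computing bytes_in/out and tokens_in/out for one body.
function measurePayload(body, countTokens) {
  // Exact UTF-8 wire size, tokenizer-agnostic (TextEncoder is a Workers/Node global).
  const bytes = new TextEncoder().encode(body).length;
  let tokens = 0;
  try {
    tokens = countTokens(body); // cl100k_base count in the real system
  } catch {
    tokens = 0; // table rule: 0 when tokenization was skipped or failed
  }
  return { bytes, tokens };
}

// Usage with a stand-in tokenizer (whitespace split) for illustration:
const { bytes, tokens } = measurePayload(
  '{"jsonrpc":"2.0"}',
  (s) => s.split(/\s+/).length
);
console.log(bytes, tokens); // → 17 1
```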

#### Why no `tokenize_ms`

A previous iteration shipped a `double7: tokenize_ms` field intended to capture the wall-clock cost of tokenization for bench-vs-prod comparison. It was removed after live-smoke verification confirmed it always reads `0` in production. The cause is a Cloudflare Workers runtime constraint, not a bug: both `performance.now()` and `Date.now()` are frozen between network I/O events as a timing-side-channel mitigation. Tokenization is pure CPU work, so any sub-request timing of it is unmeasurable from inside a Worker request handler. The bench at `workers/test/tokenize.test.mjs` characterized the cost curve once (cl100k handles 50 KB in ~1.3 ms on Node v22, which is the same V8 the Workers runtime uses); future per-call cost can be predicted from `bytes_out` and `tokens_out` against that curve.
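Given that characterized curve, a rough per-call estimate can be derived arithmetically instead of being measured in-request. A sketch assuming the cost is linear in payload size — the linearity is an assumption, since only the ~1.3 ms / 50 KB point from the bench above was characterized:

```javascript
// ~1.3 ms per 50 KB of payload on Node v22, per the one-time bench.
// Linear extrapolation is an assumption, not something the bench guarantees.
const MS_PER_BYTE = 1.3 / (50 * 1024);

// Predict tokenization wall-clock from the bytes_out already in telemetry.
function estimateTokenizeMs(bytesOut) {
  return bytesOut * MS_PER_BYTE;
}

console.log(estimateTokenizeMs(50 * 1024).toFixed(2)); // → 1.30
```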

#### Tokenizer Choice

Token counts use `gpt-tokenizer/encoding/cl100k_base` (GPT-4's tokenizer) as a tokenizer-agnostic proxy for "payload token shape." This is **not** a billing-accurate measure for any specific consumer model — counts diverge from Claude's tokenizer by approximately 3–4% on English prose. For the question telemetry is trying to answer ("how big are typical payloads, and which tools produce the most output?"), that noise floor is acceptable.

The choice was driven by an empirical bench rather than tokenizer-correctness arguments. On Node v22 (same V8 engine as Workers), `cl100k_base` was roughly six times faster than `@anthropic-ai/tokenizer` (WASM) on representative payload sizes (200 B – 50 KB), with a dramatically more predictable p95 — no WASM memory-grow spikes. Bundle delta measured via esbuild: **432 KB gzipped**, comfortably within paid-tier Workers limits.

The methodology behind this choice is documented separately in `klappy://canon/constraints/measure-before-you-object` — three theoretical objections (bundle bloat, vodka-architecture violation, tokenizer-choice as domain opinion) were falsified by a five-minute bench. See also `klappy://canon/observations/performed-prudence-anti-pattern` for the failure mode that prompted the constraint.

### Automatic Properties

@@ -78,6 +94,7 @@ All tracking occurs on `/mcp` POST envelopes. One data point is written per JSON
- **Daily/hourly trends** — when usage happens, what caused spikes
- **Method breakdown** — protocol-level health
- **Cache health** — module memory hit rate, cold-build frequency, per-tier latency (`GROUP BY cache_tier, AVG(duration)`)
- **Payload shape** — average and percentile bytes/tokens per tool, response-size distribution, token-cost-per-tool ranking. Lets the maintainer see which tools produce verbose responses and where trimming would actually move the needle (`GROUP BY tool_name, AVG(double6) AS avg_tokens_out`)
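The payload-shape bullet maps to a straightforward aggregation. A hedged sketch of the query: only `double6` as `tokens_out` comes from this document; the dataset name and the blob index holding the tool name are placeholders, not the deployed schema.

```sql
-- Token-cost-per-tool ranking (illustrative column mapping).
SELECT
  blob2        AS tool_name,      -- assumption: tool name stored in blob2
  AVG(double6) AS avg_tokens_out, -- double6 = tokens_out per the table above
  COUNT()      AS requests
FROM telemetry_dataset            -- placeholder dataset name
GROUP BY tool_name
ORDER BY avg_tokens_out DESC
```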

### Per-Request Diagnostics (Ephemeral, Not Stored)

@@ -252,4 +269,6 @@ This is not the standard telemetry social contract. The standard contract is: a
- [Maintainability — One Person, Indefinitely](klappy://canon/principles/maintainability-one-person-indefinitely) — The principle telemetry serves
- [KISS — Simplicity Is the Ceiling](klappy://canon/principles/kiss-simplicity-is-the-ceiling) — Why two tools, not twenty
- [Axiom 4: You Cannot Verify What You Did Not Observe](klappy://canon/values/axioms) — The axiom telemetry corrects
- [Measure Before You Object](klappy://canon/constraints/measure-before-you-object) — The methodology that drove the tokenizer choice
- [Performed Prudence Anti-Pattern](klappy://canon/observations/performed-prudence-anti-pattern) — The failure mode that almost killed token tracking before it shipped
- [Aquifer MCP Telemetry Governance](https://github.com/klappy/aquifer-mcp/blob/main/docs/telemetry-governance-snapshot.md) — The proven pattern this derives from