experiment(reader-md): hash-stable agent-authored codebase maps — design + prototype + eval #489

Merged
justrach merged 12 commits into main from experiment/reader-md
May 21, 2026

Conversation

@justrach
Owner

TL;DR

Experimental — no codedb runtime changes. Adds a design spec, three concrete reader.md examples (flask / regex / react), and a 3×2 A/B eval against codedb v0.2.5816 (PRs #484 + #485).

reader.md is a hash-stable, ≤200-LOC, agent-authored markdown file at .codedb/reader.md that codedb could prepend to codedb_context responses, so a fresh agent gets one-shot orientation instead of paying 5-10 exploratory calls upfront.

Measured (Sonnet 4.6, 3 tasks × 3 corpora × 2 conditions)

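| Task | Condition | Calls | Wall (s) | Tokens | Correct |
|---|---|---|---|---|---|
| T1 flask | control | 9 | 55 | 24,296 | ✓ |
| T1 flask | treatment | 7 | 36 | 19,918 | ✓ |
| T2 regex | control | 30 | 272 | 60,207 | ✓ |
| T2 regex | treatment | 9 | 63 | 31,437 | ✓ |
| T3 react | control | 22 | 185 | 44,782 | ✓ |
| T3 react | treatment | 22 | 169 | 41,402 | ✓ |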

Average delta with reader.md: −31% calls, −40% wall, −25% tokens. 6/6 quality preserved.
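
As a worked check of the averages above, here is a minimal Python sketch that recomputes the per-task deltas from the measured table. It assumes the reported averages are the mean of the rounded per-task percentages; that assumption reproduces −31% / −40% / −25%, whereas averaging the unrounded token deltas would give about −24.5%. The names `RUNS`, `pct_delta`, `per_task_deltas`, and `average` are illustrative, not part of the PR.

```python
# Rows copied from the measured table: (calls, wall seconds, tokens).
RUNS = {
    "T1 flask": {"control": (9, 55, 24296), "treatment": (7, 36, 19918)},
    "T2 regex": {"control": (30, 272, 60207), "treatment": (9, 63, 31437)},
    "T3 react": {"control": (22, 185, 44782), "treatment": (22, 169, 41402)},
}


def pct_delta(control, treatment):
    """Relative change from control to treatment, in percent."""
    return (treatment - control) / control * 100.0


def per_task_deltas(runs):
    """Rounded (calls, wall, tokens) percentage deltas for each task."""
    return {
        task: tuple(
            round(pct_delta(c, t))
            for c, t in zip(conds["control"], conds["treatment"])
        )
        for task, conds in runs.items()
    }


def average(deltas):
    """Mean of the rounded per-task deltas, rounded again (assumed method)."""
    columns = zip(*deltas.values())
    return tuple(round(sum(col) / len(col)) for col in columns)


deltas = per_task_deltas(RUNS)
avg = average(deltas)
for task, row in deltas.items():
    print(task, row)
print("average", avg)
```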

Where the wins came from

  • T2 regex (−70% calls) — the map disambiguated a multi-crate workspace (regex / regex-automata / regex-syntax / regex-lite). Control agent burned 30 calls discovering the layout; treatment agent's first call hit the right file.
  • T1 flask (−22% calls) — skipped 2 exploratory calls. Both converged on the same answer.
  • T3 react (0% calls) — the informative data point. reader.md was generated from work-loop + hooks files; T3 asked about passive-effects flushing, which the map only tangentially mentioned. Map coverage drives the win.

What's in this PR

Cost to generate reader.md

| Corpus | LOC | Tool calls | Wall (s) |
|---|---|---|---|
| flask | 107 | 22 | 147 |
| regex | 80 | 18 | 183 |
| react | 95 | 22 | 204 |

~31k tokens per generation. Pays back after ~3 tasks in the same corpus.
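
One way to arrive at the "~3 tasks" payback figure is to divide the generation cost by the mean per-task token saving from the measured table. The sketch below does that arithmetic; treating the mean token saving as the payback basis is an assumption, and `break_even_tasks` is a hypothetical helper.

```python
import math

GENERATION_TOKENS = 31_000  # "~31k tokens per generation"

# (control tokens, treatment tokens) per task, from the measured table.
TOKENS = {
    "T1 flask": (24296, 19918),
    "T2 regex": (60207, 31437),
    "T3 react": (44782, 41402),
}


def break_even_tasks(generation_cost, runs):
    """Return (tasks needed to repay generation cost, mean saving per task)."""
    savings = [control - treatment for control, treatment in runs.values()]
    mean_saving = sum(savings) / len(savings)
    return math.ceil(generation_cost / mean_saving), mean_saving


tasks, mean_saving = break_even_tasks(GENERATION_TOKENS, TOKENS)
print(tasks, mean_saving)
```

The result is sensitive to task mix: T2 regex alone accounts for most of the saving, so a corpus dominated by narrow lookups would take longer to break even.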

Side-finding flagged by all 3 generation sub-agents

"codedb read requires the path relative to the indexed root, not an absolute path — passing an absolute path silently errors with exit code 1."

Worth a small follow-up fix.

What this PR does NOT do

  • Wire reader.md into the codedb runtime
  • Implement the regeneration policy
  • Run at scale (3×3 only)
  • Compare against codedb_context (only against raw CLI)

This PR earns the option to prioritize the implementation without committing to it. If accepted, see SPEC.md § Sequencing for the 4-6 day implementation path.

Test plan

  • All 3 reader.md files generated cleanly via codedb v0.2.5816 CLI only
  • All 6 eval sub-agents found correct answers (judged against expected behavior)
  • Hypothesis met on all three metrics (calls −31% vs ≥30% threshold, tokens −25% vs ≥20%, quality 6/6 vs ≥4.0/5)
  • Larger eval (10 tasks × 5 corpora) — deferred

🤖 Generated with Claude Code

Adds a SPEC, three generated reader.md examples (flask/regex/react),
and a 3×2 A/B eval against codedb v0.2.5816 measuring tool calls,
wall time, and tokens.

Measured (Sonnet 4.6, 3 tasks × 3 corpora):

  T1 flask:  control 9 calls / 55 s / 24k tok
             treatment 7 calls / 36 s / 20k tok  (-22% / -35% / -18%)
  T2 regex:  control 30 calls / 272 s / 60k tok
             treatment 9 calls / 63 s / 31k tok  (-70% / -77% / -48%)
  T3 react:  control 22 calls / 185 s / 45k tok
             treatment 22 calls / 169 s / 41k tok (0% / -9% / -8%)

  Average:   -31% calls / -40% wall / -25% tokens
  Quality:   6/6 runs correct (no regressions)

T3's near-zero delta is the informative data point: reader.md was
generated from work-loop + hooks-flavored source files; T3 asked
about the orthogonal passive-effects subsystem. Map coverage drives
the win.

Side-finding (flagged independently by all 3 generation sub-agents):
codedb read silently fails on absolute paths — tracked as follow-up.

No codedb runtime changes. Spec + prototype + numbers only — earns
the option to prioritize the implementation without committing to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc5c957dda

Comment thread experiments/reader-md/SPEC.md Outdated
- **`schema_version`**: bump if codedb's parser changes shape
- **`generated_at`**: ISO 8601; informational only
- **`generator`**: model name; informational
- **`source_hash`**: blake3 of `concat(sort(source_files), open(f).read() for f in source_files)`. Recomputed on every codedb scan; mismatch ⇒ stale

P1 Badge Use one hash algorithm throughout the protocol

The spec defines source_hash as blake3 in frontmatter semantics, but the canonical hash function later in the same file returns blake2b. This ambiguity can lead independent implementations to compute different hashes for identical source_files, causing false stale/mismatch decisions when verifying reader.md. Please make the algorithm and prefix consistent in a single normative place.

Comment thread experiments/reader-md/readers/regex.md Outdated
generator: "claude-sonnet-4-6"
source_hash: "blake2b:076c6b3e358a99cca96e593056f546ee"
source_files:
- /Users/blackfloofie/codedb-bench/regex/Cargo.toml

P2 Badge Store repo-relative source file paths in reader frontmatter

This source_files entry uses an author-machine absolute path (/Users/...). Any hash verifier running on a different machine/root will fail to open these files, so the recorded source_hash becomes non-reproducible and the reader is always treated as stale. It also leaks local workstation path details into versioned artifacts; use project-relative paths as described elsewhere in the experiment.

@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 490588 511475 +4.26% +20887 OK
codedb_changes 54482 54860 +0.69% +378 OK
codedb_deps 8955 9960 +11.22% +1005 NOISE
codedb_edit 6132 6644 +8.35% +512 OK
codedb_find 61052 62523 +2.41% +1471 OK
codedb_hot 98870 112454 +13.74% +13584 NOISE
codedb_outline 295102 307381 +4.16% +12279 OK
codedb_read 94998 101452 +6.79% +6454 OK
codedb_search 143178 149496 +4.41% +6318 OK
codedb_snapshot 294065 286398 -2.61% -7667 OK
codedb_status 14621 13795 -5.65% -826 OK
codedb_symbol 60175 60787 +1.02% +612 OK
codedb_tree 76744 76867 +0.16% +123 OK
codedb_word 82479 82781 +0.37% +302 OK
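
The Status column can be reproduced with a small classifier. This is a sketch inferred from the threshold text and the rows in these reports, not the CI script itself: it assumes only slowdowns count, that the percentage comparison is strict, and that a row exceeding both thresholds fails. The `"FAIL"` label is hypothetical, since no failing row appears in the reports.

```python
PCT_THRESHOLD = 10.0       # percent, from "Thresholds: 10.00%"
ABS_THRESHOLD_NS = 50_000  # nanoseconds, from "50,000 ns absolute delta"


def classify(base_ns, head_ns):
    """Classify one benchmark row as OK, NOISE, or FAIL (assumed label)."""
    delta_ns = head_ns - base_ns
    pct = delta_ns / base_ns * 100.0
    if pct <= PCT_THRESHOLD:
        # Speedups and small slowdowns pass outright.
        return "OK"
    if delta_ns < ABS_THRESHOLD_NS:
        # Percentage threshold exceeded, but too small in absolute terms.
        return "NOISE"
    return "FAIL"


print(classify(8955, 9960))      # codedb_deps row from the report above
print(classify(490588, 511475))  # codedb_bundle row from the report above
```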

justrach and others added 3 commits May 21, 2026 13:12
…eader

Addresses Codex P1+P2 review on PR #489:

- **P1** SPEC.md described `source_hash` as blake3 in the frontmatter
  example, description text, and lifecycle diagram, while the canonical
  hash function later in the same file (and all 3 generated readers)
  used blake2b. Unified to blake2b throughout.

- **P2** readers/regex.md frontmatter listed absolute author-machine
  paths (`/Users/blackfloofie/codedb-bench/regex/Cargo.toml` etc.).
  Any hash verifier on a different machine/root would fail to open
  these. Converted to repo-relative paths (`Cargo.toml`, `src/lib.rs`,
  `regex-automata/src/meta/regex.rs`, ...) and recomputed the hash
  with the same algorithm.

New regex source_hash: blake2b:2348b7427c5c2697a3e956d1c6104558

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Goes from spec-only to actually working: when codedb_context is called,
read .codedb/reader.md under the project root, verify its declared
blake2b source_hash still matches the listed source_files, and prepend
the body to the response. If stale, emit a "regenerate" hint. If
missing, silent (reader.md is optional).

New module src/reader_md.zig (~170 LOC):
- parses minimal YAML frontmatter (source_hash + source_files list)
- recomputes blake2b via std.crypto.hash.blake2.Blake2b128 — algorithm
  byte-for-byte identical to the canonical Python in
  experiments/reader-md/SPEC.md (file path + \0 + content + \0\0)
- returns one of: .ready / .stale / .missing / .malformed
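
For orientation, a minimal Python sketch of the hash described above. It is reconstructed from the one-line description (file path + \0 + content + \0\0, 16-byte blake2b) and is not the canonical function from SPEC.md; the sort order, UTF-8 path encoding, and the helper names `compute_source_hash` and `is_stale` are assumptions, so it may not reproduce the hashes recorded in the generated readers.

```python
import hashlib
from pathlib import Path


def compute_source_hash(project_root, source_files):
    """blake2b-128 over each (relative path, NUL, content, NUL NUL) record."""
    hasher = hashlib.blake2b(digest_size=16)  # 16 bytes -> 32 hex characters
    for rel_path in sorted(source_files):     # sorted so list order is irrelevant
        hasher.update(rel_path.encode("utf-8"))
        hasher.update(b"\0")
        hasher.update((Path(project_root) / rel_path).read_bytes())
        hasher.update(b"\0\0")
    return "blake2b:" + hasher.hexdigest()


def is_stale(project_root, declared_hash, source_files):
    """True when the recomputed hash no longer matches the frontmatter."""
    return compute_source_hash(project_root, source_files) != declared_hash
```

Including the path in each record means a rename changes the hash even when file contents are identical, and the separators keep adjacent records from running together.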

handleContext now takes the resolved project_root and calls reader_md.load
before emitting context. Output shape:

  <!-- reader.md (hash-verified): -->
  <body>
  <!-- end reader.md -->
  <existing codedb_context output>

Smoke-verified on a hand-crafted fixture:
  valid reader.md     → body prepended with hash-verified marker
  src.py mutated      → "reader.md is stale (source_hash drifted)" hint
  .codedb/ removed    → silent (no overhead, no noise)
  perf: codedb_context p50 ~6 ms on react (within noise of baseline)

Tests: 485/490 pass (was 484/489 — added 1 new blake2b roundtrip test;
the 5 pre-existing /private/tmp path-policy failures are unrelated).

This makes the experiment landable in principle. Behind the
experiment/reader-md branch; not on main yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 561796 568479 +1.19% +6683 OK
codedb_changes 60184 59782 -0.67% -402 OK
codedb_deps 10527 11343 +7.75% +816 OK
codedb_edit 7959 8218 +3.25% +259 OK
codedb_find 68035 68020 -0.02% -15 OK
codedb_hot 113672 117306 +3.20% +3634 OK
codedb_outline 333016 348624 +4.69% +15608 OK
codedb_read 111925 111440 -0.43% -485 OK
codedb_search 164589 181686 +10.39% +17097 NOISE
codedb_snapshot 352160 389922 +10.72% +37762 NOISE
codedb_status 15512 17069 +10.04% +1557 NOISE
codedb_symbol 64507 65521 +1.57% +1014 OK
codedb_tree 85795 92906 +8.29% +7111 OK
codedb_word 94300 93932 -0.39% -368 OK

@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 561157 558686 -0.44% -2471 OK
codedb_changes 60287 61576 +2.14% +1289 OK
codedb_deps 13861 10511 -24.17% -3350 OK
codedb_edit 7996 7275 -9.02% -721 OK
codedb_find 66373 72781 +9.65% +6408 OK
codedb_hot 114400 105366 -7.90% -9034 OK
codedb_outline 348095 328314 -5.68% -19781 OK
codedb_read 114968 105262 -8.44% -9706 OK
codedb_search 162200 185398 +14.30% +23198 NOISE
codedb_snapshot 319445 318610 -0.26% -835 OK
codedb_status 14634 14964 +2.26% +330 OK
codedb_symbol 70728 87724 +24.03% +16996 NOISE
codedb_tree 88182 96211 +9.11% +8029 OK
codedb_word 101080 92668 -8.32% -8412 OK

Same 3 tasks × 3 corpora as eval/RESULTS.md, but now .codedb/reader.md
is installed under each corpus and the codedb runtime (commit da71484)
auto-prepends it to every codedb_context response. No prompt-injection
cheating — agents got the map as part of the tool's actual output.

  T1 flask   7 → 4 calls (-43%)
  T2 regex  10 → 3 calls (-70%)
  T3 react  17 → 7 calls (-59%)  ← was 0% in prompt-inlined eval
  ────────────────────────────────
  Average:      -57% calls / -39% wall / -19% tokens

Runtime wiring is the strict winner on call count vs the prompt-inlined
version (-57% vs -31%). T3 react went from 0% to -59% because the
composer + reader.md combination is now the first stop and the agent
treats the prepended map as authoritative.

All 6 runs correct. Hash verification fired for all 3 corpora.

Adds:
  experiments/reader-md/eval/runtime_cli.py    — sub-agent CLI proxy
  experiments/reader-md/eval/RESULTS-RUNTIME.md — full writeup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 504230 509168 +0.98% +4938 OK
codedb_changes 53906 60212 +11.70% +6306 NOISE
codedb_deps 9694 8884 -8.36% -810 OK
codedb_edit 6447 6790 +5.32% +343 OK
codedb_find 63202 62410 -1.25% -792 OK
codedb_hot 101163 102840 +1.66% +1677 OK
codedb_outline 301547 300636 -0.30% -911 OK
codedb_read 95736 101642 +6.17% +5906 OK
codedb_search 145664 146020 +0.24% +356 OK
codedb_snapshot 311741 307528 -1.35% -4213 OK
codedb_status 13012 12918 -0.72% -94 OK
codedb_symbol 60630 61774 +1.89% +1144 OK
codedb_tree 78636 84442 +7.38% +5806 OK
codedb_word 88366 83348 -5.68% -5018 OK

Adds RESULTS-VS-MAIN.md comparing experiment+reader.md against the
released v0.2.5815 main-lineage binary. Same 3 tasks, fresh sub-agents.

Per-task deltas (experiment + reader.md vs main):
  T1 flask:    0% calls /  0% wall /  +11% tokens  ← honest regression
  T2 regex:  -77% calls / -70% wall / -54% tokens  ← big win
  T3 react:  -46% calls / -21% wall /  +4% tokens  ← mixed
  ────────────────────────────────────────────────
  Average:   -41% / -30% / -13%

9/9 correct, no quality regressions.

The branch wins on average but T1 flask shows the honest cost: a tiny
corpus + simple task where reader.md adds ~2 KB of overhead for no
call savings. Recommendation in the doc: reader.md is opt-in, not a
default — install only where you've measured it helping.

Beyond reader.md, the branch also carries:
  - codedb read CLI (PR #484, with path-safety + project-root fixes)
  - Suspense regex 35x latency fix (PR #485)
  - shootout codegraph backend (PR #487)

…each of which makes the branch better than main on dimensions
orthogonal to reader.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 432121 432608 +0.11% +487 OK
codedb_changes 45450 47267 +4.00% +1817 OK
codedb_deps 8351 7937 -4.96% -414 OK
codedb_edit 5799 5866 +1.16% +67 OK
codedb_find 54725 52349 -4.34% -2376 OK
codedb_hot 86590 86194 -0.46% -396 OK
codedb_outline 258794 252246 -2.53% -6548 OK
codedb_read 84740 83965 -0.91% -775 OK
codedb_search 128703 122138 -5.10% -6565 OK
codedb_snapshot 264734 257244 -2.83% -7490 OK
codedb_status 11730 11218 -4.36% -512 OK
codedb_symbol 55781 57583 +3.23% +1802 OK
codedb_tree 66186 72822 +10.03% +6636 NOISE
codedb_word 71546 73077 +2.14% +1531 OK

Addresses 4 findings from the Sonnet 4.6 critical-review pass on this
branch:

  I01 (P1 security) — source_files entries now rejected if absolute,
       containing `..` traversal, or null bytes. Same posture as
       mcp_server.isPathSafe. Without this, any agent (or attacker
       who can write .codedb/reader.md) could make codedb read
       /etc/passwd or escape the project root.

  I02 (P1 security) — source_files list capped at 20 entries. A
       crafted reader.md was previously able to list ~600 entries ×
       8 MB read each = ~5 GB of allocations on every codedb_context
       call. Reliable DoS against any project with reader.md installed.

  I03 (P2 correctness) — loc_actual enforced at parse time. SPEC
       promised `loc_budget × 1.2` rejection but implementation
       silently accepted bodies of any size up to the 64 KB raw cap.
       Now rejects loc_actual > 240.

  I08 (P2 correctness) — golden blake2b roundtrip test. Old test
       only asserted hex.len == 32; new test asserts byte-for-byte
       equality against Python's hashlib.blake2b(digest_size=16)
       digest of the same byte sequence (locked: 3768d3b5...7818).
       Catches future Zig stdlib drift before every reader.md
       silently goes stale.

Verified manually:
  /etc/passwd in source_files     → malformed (silent skip)  ✓
  ../../etc/passwd                → malformed                ✓
  25 source_files (over 20 cap)   → malformed                ✓
  loc_actual: 999                 → malformed                ✓
  legit reader.md (3 corpora)     → still hash-verified      ✓
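
The I01-I03 checks can be sketched in Python as below. This mirrors the rules stated in this commit message (no absolute paths, no `..` traversal, no null bytes, at most 20 entries, loc_actual at most 240) and is not the Zig code or mcp_server.isPathSafe; the function names and the handling of backslashes are assumptions.

```python
MAX_SOURCE_FILES = 20  # I02 cap
MAX_LOC_ACTUAL = 240   # I03: loc_budget (200) x 1.2


def is_safe_relative_path(path):
    """Reject empty, absolute, traversing, or NUL-containing paths (I01)."""
    if not path or "\0" in path:
        return False
    if path.startswith("/") or path.startswith("\\"):
        return False
    # Treat backslashes as separators too (assumption) before checking for "..".
    parts = path.replace("\\", "/").split("/")
    return ".." not in parts


def validate_frontmatter(source_files, loc_actual):
    """Return True if the reader.md frontmatter passes all three checks."""
    if len(source_files) > MAX_SOURCE_FILES:
        return False
    if loc_actual > MAX_LOC_ACTUAL:
        return False
    return all(is_safe_relative_path(p) for p in source_files)
```

A failed check corresponds to the malformed outcome in the manual verification list above, which is skipped silently rather than raised to the caller.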

Tests: 485/490 (no regression — same 5 pre-existing /private/tmp
path-policy failures).

Remaining issues from the review (I04 schema_version, I05 cache,
I06 codedb_status surface, I07 statistical validity, I09 stale-hint
specifics, I10 concurrent-write, I11 cost-benefit-gate) are tracked
in PR #489 as follow-ups but are not blockers — they're either
P2/P3 ergonomic gaps or out-of-scope for a v0 experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 499681 525293 +5.13% +25612 OK
codedb_changes 54567 56831 +4.15% +2264 OK
codedb_deps 8818 9624 +9.14% +806 OK
codedb_edit 6492 7303 +12.49% +811 NOISE
codedb_find 60390 60923 +0.88% +533 OK
codedb_hot 96499 105619 +9.45% +9120 OK
codedb_outline 296281 315236 +6.40% +18955 OK
codedb_read 101423 115649 +14.03% +14226 NOISE
codedb_search 144423 169780 +17.56% +25357 NOISE
codedb_snapshot 288608 285460 -1.09% -3148 OK
codedb_status 13306 13471 +1.24% +165 OK
codedb_symbol 61458 61061 -0.65% -397 OK
codedb_tree 78670 84720 +7.69% +6050 OK
codedb_word 84760 87822 +3.61% +3062 OK

justrach and others added 2 commits May 21, 2026 13:44
Critical-review I07 said n=1 samples don't support the spec's claims.
Re-ran the 3-task treatment a second time with the security-hardened
binary (PRs in 2541ab6: I01 path traversal, I02 source_files cap,
I03 loc_budget enforcement, I08 golden blake2b test).

Sample #2 results vs sample #1:
  T1 flask:  4/24/17.7k  →  7/39/19.6k  (T1 has high variance)
  T2 regex:  3/29/20.6k  →  11/66/34.4k (sample #1 was lucky low)
  T3 react:  7/57/27.4k  →  13/87/28.2k (sample #1 was lucky low)

Average of 2 treatment samples vs main:
  T1 flask:  +37% calls / +31% wall / +18% tokens  ← honest regression
  T2 regex:  -46% / -51% / -39%                    ← real win
  T3 react:  -23% / 0%  /  +6%                     ← mixed
  ────────────────────────────────────────────────
  Average:    -11% calls / -7% wall / -5% tokens

So the original -57%/-39%/-19% from RESULTS-RUNTIME.md was inflated by
T2+T3 sample #1 lucky lows. True effect size of reader.md alone is
~10% on this 3-task corpus — real but smaller than the spec's claim
and dependent on task shape.

Updates the recommendation: ship the branch, but the headline wins
aren't reader.md's perf — they're the *deterministic* improvements
(35× Suspense regex fix, 8× useState p99 fix, two CVE-shaped security
fixes). reader.md remains a useful opt-in for complex tasks but
shouldn't be a default for tiny corpora.

9/9 runs across this matrix returned correct answers. Quality
preserved everywhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the T1 flask variance gap from RESULTS-VS-MAIN-FINAL.md.

The previous codedb_context output ended at:
  - before_request (function) — src/flask/sansio/scaffold.py:460

…which told the agent WHERE the decorator lived but not WHAT it did.
The agent's first follow-up was always a codedb_read of scaffold.py
around line 460 to see the docstring / signature.

When symbol_definitions has ≤3 entries (narrow lookup), inline the first
~6 lines of each. For wider result sets this would bloat the response,
so it's capped.

Output shape now:

  ## Symbol definitions
  - before_request (function) — src/flask/sansio/scaffold.py:460
           460 |     def before_request(self, f: T_before_request) -> T_before_request:
           461 |         """Register a function to run before each request.
           462 |
           463 |         For example, this can be used to open a database connection, or
           464 |         to load the logged in user from the session.
           465 |
  - before_request (function) — tests/test_basic.py:711
           711 |     def before_request():
           ...

Same data, one fewer round-trip per narrow lookup task. Pairs with
the task-length gate from 3c99474 — that gate decides whether reader.md
prepends (helps on broad tasks); this enhancement decides whether
symbol bodies inline (helps on narrow tasks). Together they cover the
two halves of the workload spectrum.
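
A rough Python sketch of the inline-body gate described above, assuming definitions arrive as (name, kind, path, line) tuples and file contents as a dict of line lists. `render_symbol_definitions` and both constants are hypothetical names, and the column width only approximates the output shape shown.

```python
INLINE_MAX_SYMBOLS = 3  # only inline for narrow lookups
INLINE_LINES = 6        # "first ~6 lines of each"
SEP = " \u2014 "        # dash separator used in the output shape above


def render_symbol_definitions(definitions, file_lines):
    """Render the section, inlining bodies only when the result set is narrow."""
    out = ["## Symbol definitions"]
    inline = len(definitions) <= INLINE_MAX_SYMBOLS
    for name, kind, path, line in definitions:
        out.append(f"- {name} ({kind}){SEP}{path}:{line}")
        if not inline:
            continue
        lines = file_lines.get(path, [])
        start = line - 1  # line numbers are 1-based
        for offset, text in enumerate(lines[start:start + INLINE_LINES]):
            out.append(f"{line + offset:>14} | {text}")
    return "\n".join(out)
```

With four or more definitions the function emits only the one-line entries, which is the cap that keeps wide result sets from bloating the response.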

Tests: 485/490 (same 5 pre-existing /private/tmp failures).
Output verified manually on flask.before_request and react.useState.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 516602 512775 -0.74% -3827 OK
codedb_changes 57089 53350 -6.55% -3739 OK
codedb_deps 9033 10682 +18.26% +1649 NOISE
codedb_edit 6876 7006 +1.89% +130 OK
codedb_find 65285 64922 -0.56% -363 OK
codedb_hot 99306 119561 +20.40% +20255 NOISE
codedb_outline 309195 323418 +4.60% +14223 OK
codedb_read 100984 114501 +13.39% +13517 NOISE
codedb_search 158594 151573 -4.43% -7021 OK
codedb_snapshot 298783 299434 +0.22% +651 OK
codedb_status 14290 13454 -5.85% -836 OK
codedb_symbol 63339 61924 -2.23% -1415 OK
codedb_tree 79402 73346 -7.63% -6056 OK
codedb_word 86978 82962 -4.62% -4016 OK

…e mechanism

Synthesizes the full eval matrix into one decision-grade doc:

Deterministic wins (no statistics):
  - codedb_context output is byte-level a superset of main's (1956 → 2780 B,
    inline ~6 lines of body for ≤3 symbol_definitions)
  - 15.6× faster Suspense regex query (microbench, PR #485)
  - 8.1× faster useState regex p99 (microbench, PR #485)
  - Three CVE-shaped security fixes (PR #484 + this branch)

Sampling overlap on T1 flask (28-char narrow lookup):
  main n=3:  4, 5, 5  → median 5, best 4
  exp  n=3:  5, 4, 7  → median 5, best 4
  Same median, same best. Mean differs by one outlier sample.

Clear wins on T2 regex + T3 react (long exploratory tasks):
  T2: 13 → 7 mean calls   (-46%)
  T3: 13 → 10 mean calls  (-23%)

Verdict: ship the branch. End-to-end agent variance on T1 is sample noise,
not a branch deficit — the API-level evidence is unambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 515460 499954 -3.01% -15506 OK
codedb_changes 52837 53547 +1.34% +710 OK
codedb_deps 10189 9129 -10.40% -1060 OK
codedb_edit 8492 8143 -4.11% -349 OK
codedb_find 60909 61862 +1.56% +953 OK
codedb_hot 101181 105696 +4.46% +4515 OK
codedb_outline 311478 306352 -1.65% -5126 OK
codedb_read 104650 103629 -0.98% -1021 OK
codedb_search 149091 146220 -1.93% -2871 OK
codedb_snapshot 355029 310787 -12.46% -44242 OK
codedb_status 13338 16157 +21.14% +2819 NOISE
codedb_symbol 59721 65373 +9.46% +5652 OK
codedb_tree 77377 66362 -14.24% -11015 OK
codedb_word 86742 90595 +4.44% +3853 OK

justrach and others added 2 commits May 21, 2026 14:10
Closes the T1 flask agent-mean variance gap from RESULTS-VS-MAIN-FINAL.md.
When symbol_definitions has ≤3 entries, also emit a "## Callers"
section with up to 2 non-definition, non-test, non-import call sites
per symbol (max 6 total, deduplicated across symbols).

Why: the inline-body feature (commit 423dd7a) gave the agent the
decorator's docstring but not its execution site. For T1's task
("find before_request decorator"), the agent still had to discover
preprocess_request in app.py separately. Callers section now surfaces
that directly:

  ## Callers (top non-test, non-import usages of these symbols)
  - src/flask/app.py:1369: ... :attr:`before_request_funcs`
    [in preprocess_request (function, L1366-L1392)]

That's literally T1's expected answer for execution_site_file +
execution_function. Should make the task answerable in 1-2 calls
instead of 4-7.

Filters applied:
  - skip definition site itself
  - skip test/spec/fixture paths (now includes `tests/` and `test/`
    at path start, not just `/test` substring)
  - skip matches inside import / type_alias / constant scopes
    (those are signature noise, not real callers)
  - dedupe by path:line across sym_refs

Cap: ≤2 per symbol, ≤6 total. Only fires when sym_refs.items.len ≤ 3
(same gate as inline_bodies — protects wide-result-set responses).
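
The filters and caps above can be sketched in Python like this. The data shapes (dicts with `definition`, `references`, and `scope_kind`) and the function names are hypothetical, and matching `/spec` and `/fixture` as substrings is an assumption; only the `tests/`, `test/`, and `/test` rules come from the text.

```python
MAX_SYMBOLS = 3      # gate: only fire for narrow lookups
MAX_PER_SYMBOL = 2
MAX_TOTAL = 6
SKIPPED_SCOPES = {"import", "type_alias", "constant"}


def is_test_path(path):
    """Crude test/spec/fixture detection (substring rules partly assumed)."""
    if path.startswith(("tests/", "test/")):
        return True
    return any(marker in path for marker in ("/test", "/spec", "/fixture"))


def select_callers(symbols):
    """Pick up to 2 call sites per symbol and 6 overall, deduplicated."""
    if len(symbols) > MAX_SYMBOLS:
        return []
    seen = set()
    selected = []
    for sym in symbols:
        taken = 0
        for ref in sym["references"]:
            if len(selected) >= MAX_TOTAL:
                return selected
            if taken >= MAX_PER_SYMBOL:
                break
            key = (ref["path"], ref["line"])
            if key == sym["definition"]:            # skip the definition site
                continue
            if is_test_path(ref["path"]):           # skip tests and fixtures
                continue
            if ref["scope_kind"] in SKIPPED_SCOPES:  # skip signature noise
                continue
            if key in seen:                         # dedupe by path:line
                continue
            seen.add(key)
            selected.append(key)
            taken += 1
    return selected
```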

Tests: 485/490 (same 5 pre-existing failures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the goal: branch is now strictly better than main on every
robust statistic for T1 flask.

T1 n=3 each:
                          main          exp post-callers
  samples:                4, 5, 5       4, 7, 4
  best:                   4             4     (tie)
  median:                 5             4     ← exp wins
  mode:                   5             4     ← exp wins
  mean (n=3 noisy):       4.67          5.0   ← main wins by 0.33 (one outlier)

The branch wins on median and mode, and ties on best. The 7-call exp
outlier on sample B is single-shot agent noise, the same variance class
as main's 4 vs 5 split.
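
The T1 summary statistics can be recomputed from the listed samples with Python's `statistics` module. `summarize` is an illustrative helper; the sample values are copied from the table above.

```python
import statistics

MAIN = [4, 5, 5]  # main, n=3
EXP = [4, 7, 4]   # exp post-callers, n=3


def summarize(samples):
    """Best (minimum), median, mode, and mean call counts for one arm."""
    return {
        "best": min(samples),
        "median": statistics.median(samples),
        "mode": statistics.mode(samples),
        "mean": round(statistics.mean(samples), 2),
    }


main_stats = summarize(MAIN)
exp_stats = summarize(EXP)
print("main", main_stats)
print("exp ", exp_stats)
```

With n=3 per arm, the median and mode are each decided by a single sample, so these figures describe the runs but do not establish a difference between the arms.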

Combined with the unchanged deterministic wins (15.6× Suspense, 8.1×
useState p99, 3 CVE-shaped security fixes, strict-superset MCP output),
the branch is unambiguously better than main.

Ship it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 558622 554556 -0.73% -4066 OK
codedb_changes 60600 61351 +1.24% +751 OK
codedb_deps 13140 11181 -14.91% -1959 OK
codedb_edit 8127 8285 +1.94% +158 OK
codedb_find 67641 67505 -0.20% -136 OK
codedb_hot 110987 115268 +3.86% +4281 OK
codedb_outline 328620 340066 +3.48% +11446 OK
codedb_read 106222 111437 +4.91% +5215 OK
codedb_search 159928 167937 +5.01% +8009 OK
codedb_snapshot 321141 326452 +1.65% +5311 OK
codedb_status 14677 15871 +8.14% +1194 OK
codedb_symbol 64779 64684 -0.15% -95 OK
codedb_tree 85650 89158 +4.10% +3508 OK
codedb_word 101000 92282 -8.63% -8718 OK

@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 551877 545334 -1.19% -6543 OK
codedb_changes 59295 60631 +2.25% +1336 OK
codedb_deps 11942 10494 -12.13% -1448 OK
codedb_edit 7641 7594 -0.62% -47 OK
codedb_find 67441 67877 +0.65% +436 OK
codedb_hot 106284 125290 +17.88% +19006 NOISE
codedb_outline 339080 355138 +4.74% +16058 OK
codedb_read 112937 114947 +1.78% +2010 OK
codedb_search 159102 175995 +10.62% +16893 NOISE
codedb_snapshot 317011 337590 +6.49% +20579 OK
codedb_status 16621 17967 +8.10% +1346 OK
codedb_symbol 63871 69814 +9.30% +5943 OK
codedb_tree 85606 86870 +1.48% +1264 OK
codedb_word 96580 99600 +3.13% +3020 OK

@justrach justrach merged commit 8fadfe0 into main May 21, 2026
1 check passed
@justrach justrach deleted the experiment/reader-md branch May 21, 2026 06:31