chunker: range-tighten clip + opt-in row-group splitter (no recall move; correctness only)#11

Merged
arnav2 merged 1 commit into bench/harness-regex-and-in-scope-filter from chunker/range-tighten-and-size-cap
May 20, 2026

arnav2 commented May 20, 2026

Summary

Two chunker-side changes. Neither moves retrieval recall on the full 912 SpreadsheetBench v0.1 with either BGE-small or text-embedding-3-large. I'm proposing this for correctness / citation-grade output, not for the bench number, and I want to be upfront about that going in.

Stacked on #10 (recall-90 harness PR) since the bench instrumentation it adds is what I used to measure the (lack of) impact below.

What changed

A · Range-tightening clip in _block_to_chunk

_tight_content_bbox(block, sheet) walks the cells inside block.cell_range and returns the bounding box (bbox) of the cells that hold content, meaning raw_value is not None or display_value is non-empty. The chunk's claimed (top_left_cell, bottom_right_cell) is then clipped to that bbox before emission. The renderer continues to iterate the original block.cell_range, and the clip only drops cells with no content, so the narrowed claim is still a superset of the cells that contributed to render_text: invariant preserved.

Concrete pathology this fixes: on the SpreadsheetBench corpus, several sheets carry styled-empty cells stretching across the full XFD width (16,384 columns). The segmenter sees them, and the chunker dutifully emits a chunk claiming A1:XFD4 even though there is no actual data outside the upper-left corner. Without the clip, citation UIs would highlight the entire sheet width as the "source" of any retrieved chunk.
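
To make the clip concrete, here is a minimal sketch of the idea, not the code in this PR. The `Cell` dataclass, the sparse-dict sheet, the `(top, left, bottom, right)` tuple, and the helper names are all assumptions made for illustration; only the `raw_value` and `display_value` attribute names come from the description above. It also assumes whitespace-only display values count as empty.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Cell:
    # Hypothetical stand-in for the real cell type.
    raw_value: object = None
    display_value: str = ""


# (row, col) -> Cell, both 1-based. A sparse dict stands in for the sheet.
Sheet = Dict[Tuple[int, int], Cell]
# (top_row, left_col, bottom_row, right_col), all inclusive.
Range = Tuple[int, int, int, int]


def has_content(cell: Cell) -> bool:
    # Assumption: a whitespace-only display value is treated as empty.
    return cell.raw_value is not None or bool(cell.display_value.strip())


def tight_content_bbox(cell_range: Range, sheet: Sheet) -> Optional[Range]:
    """Smallest range inside cell_range that covers every cell with content."""
    top, left, bottom, right = cell_range
    rows, cols = [], []
    for (r, c), cell in sheet.items():
        if top <= r <= bottom and left <= c <= right and has_content(cell):
            rows.append(r)
            cols.append(c)
    if not rows:
        return None  # nothing but styled-empty cells in this block
    return (min(rows), min(cols), max(rows), max(cols))


def col_letters(col: int) -> str:
    """1-based column index to A1-style letters (1 -> A, 16384 -> XFD)."""
    out = ""
    while col > 0:
        col, rem = divmod(col - 1, 26)
        out = chr(ord("A") + rem) + out
    return out


def to_a1(rng: Range) -> str:
    top, left, bottom, right = rng
    return f"{col_letters(left)}{top}:{col_letters(right)}{bottom}"


# Data in the upper-left corner, plus styled-empty cells out at column XFD.
sheet: Sheet = {
    (1, 1): Cell(raw_value="Region", display_value="Region"),
    (4, 3): Cell(raw_value=42, display_value="42"),
    (1, 16384): Cell(),  # styled but empty
    (4, 16384): Cell(),  # styled but empty
}
block_range: Range = (1, 1, 4, 16384)
tight = tight_content_bbox(block_range, sheet)
```

In this toy sheet the block claims `A1:XFD4` while the clipped claim is `A1:C4`. Every cell with content still falls inside the clipped range, which is the superset property the invariant relies on.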

B · _split_block_by_rows + KS_CHUNK_BUDGET_CHARS env var

Row-group splitter for oversize blocks. Each child has a tight, non-overlapping A1 range over its data rows; together the siblings cover exactly the parent's rows. Default is KS_CHUNK_BUDGET_CHARS=100000, which is effectively OFF for any reasonable workbook. Available behind the env var for downstream consumers with strict per-chunk token economy.
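
A rough sketch of how a budget-driven row-group split could work, under stated assumptions: the greedy grouping, the one-character-per-row separator cost, and the function names are illustrative guesses, not the implementation in `_split_block_by_rows`. Only the `KS_CHUNK_BUDGET_CHARS` name and its 100000 default come from this PR.

```python
import os
from typing import List, Tuple

DEFAULT_BUDGET_CHARS = 100_000  # default stated in this PR


def chunk_budget_chars() -> int:
    """Read the budget from the environment, falling back to the default."""
    return int(os.environ.get("KS_CHUNK_BUDGET_CHARS", DEFAULT_BUDGET_CHARS))


def split_rows_by_budget(
    row_texts: List[Tuple[int, str]], budget: int
) -> List[Tuple[int, int]]:
    """Greedily group consecutive rows; return (first_row, last_row) per child.

    row_texts is a list of (row_number, rendered_text) in ascending row order.
    A single row longer than the budget still gets its own group.
    """
    groups: List[Tuple[int, int]] = []
    start = None
    prev = None
    used = 0
    for row, text in row_texts:
        cost = len(text) + 1  # assumption: one separator character per row
        if start is not None and used + cost > budget:
            groups.append((start, prev))
            start, used = None, 0
        if start is None:
            start = row
        used += cost
        prev = row
    if start is not None:
        groups.append((start, prev))
    return groups


rows = [(1, "a" * 10), (2, "b" * 10), (3, "c" * 10), (4, "d" * 10)]
small_budget_groups = split_rows_by_budget(rows, budget=25)
default_groups = split_rows_by_budget(rows, budget=DEFAULT_BUDGET_CHARS)
```

With the default budget the four rows stay in one group, which mirrors the "default off" behaviour. With a budget of 25 they split into two children whose row ranges do not overlap and together cover rows 1 through 4.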

Empirical numbers (full 912 / text-embedding-3-large)

| Metric | Before this PR | After this PR | Δ |
|---|---|---|---|
| recall_text@5 | 0.750 | 0.750 | 0 |
| recall_text@5_in_scope | 0.838 | 0.838 | 0 |
| recall_geometric@5 | 0.482 | 0.482 | 0 |
| recall_geometric@5_in_scope | 0.538 | 0.538 | 0 |
| mean_parse_ms | 156 | 174 | +18 ms |
| net instance flips | | | 0 miss→hit, 0 hit→miss |

Why no retrieval delta despite a real bug fix

The over-claims happen on sheets where the GT cells live in other blocks. So the over-claim was already a false negative for geom scoring, not a false positive that needed correction — the metric just didn't see the lie. Citation UIs that highlight the chunk's claimed range are the actual beneficiaries. Without a citation-accuracy metric we can't surface that gain numerically; that metric is follow-up work (see #issue-when-i-file-it).
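
A toy illustration of that argument. The scoring rule here is an assumption (a chunk counts as a geometric hit when its claimed range contains the ground-truth cell); the bench harness may define it differently, and the cell coordinates are made up.

```python
from typing import Tuple

# (top_row, left_col, bottom_row, right_col), all inclusive and 1-based.
Range = Tuple[int, int, int, int]


def contains(claim: Range, cell: Tuple[int, int]) -> bool:
    # Assumed hit rule: the claimed range must contain the ground-truth cell.
    top, left, bottom, right = claim
    row, col = cell
    return top <= row <= bottom and left <= col <= right


over_claim: Range = (1, 1, 4, 16384)  # A1:XFD4, before the clip
tight_claim: Range = (1, 1, 4, 3)     # A1:C4, after the clip
gt_cell = (12, 2)                     # B12, which lives in a different block

hit_before = contains(over_claim, gt_cell)
hit_after = contains(tight_claim, gt_cell)
```

Both `hit_before` and `hit_after` are `False` here: the ground-truth cell sits outside the block either way, so tightening the claim cannot flip the instance. A citation UI, by contrast, would highlight 16,384 columns before the clip and 3 after it.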

Type of change

  • 🐞 Bug fix (chunk over-claims A1:XFD4-style ranges)
  • ✨ New feature (opt-in row-group splitter behind env var, default off)
  • 🧪 Parser edge case / new regression test (8 new test cases)

Test plan

  • make test passes — 1071 → 1079 tests
  • Full 912 SpreadsheetBench v0.1 with text-embedding-3-large: 0 instance flips, recall unchanged
  • tests/test_chunker_range_tighten.py — 3 cases including a corpus-fixture invariant
  • tests/test_chunker_size_cap.py — 5 cases including the env-var-driven split + default-no-split

Notes for reviewers

🤖 Generated with Claude Code

…p splitter

Two chunker-side changes. Neither moves retrieval recall: explicit
calibration on the full 912 SpreadsheetBench v0.1 with both BGE-small
and text-embedding-3-large shows 0 instance flips for either change.
Shipping for correctness, not recall.

A. _tight_content_bbox + clip in _block_to_chunk

   Walks `block.cell_range`, finds the bbox of cells whose `raw_value`
   or non-empty `display_value` is set, and clips the chunk's claimed
   `(top_left_cell, bottom_right_cell)` to that bbox. Fixes the over-
   claim pathology where a sheet with styled-empty cells across XFD
   width produces a chunk claiming `A1:XFD4` despite the actual data
   sitting in a 5×3 corner. The renderer already iterates the original
   range, so the narrowed claim is always a superset of cells that
   contributed to render_text — invariant preserved.

   Bench (full 912 / text-embedding-3-large):
     recall_text@5:           0.750 → 0.750  (no change)
     recall_geometric@5:      0.482 → 0.482  (no change)
     recall_text@5_in_scope:  0.838 → 0.838  (no change)
     mean parse_ms:           156   → 174    (+18 ms; bbox walk)
     net instance flips:      0 miss→hit, 0 hit→miss

   Why no recall change despite fixing a real over-claim: the over-
   claims happen on sheets with empty-XFD blocks; the GT cells on
   those sheets are in OTHER blocks (proper data regions). So the
   over-claim was already a false negative for geom scoring, not a
   false positive that needed correction. The pathology lives in the
   dead zone of the retrieval metric. Citation UIs that highlight the
   chunk's claimed range still benefit — that's the actual value here.

B. _split_block_by_rows + KS_CHUNK_BUDGET_CHARS env var

   When a block's render_text exceeds a configurable budget, split it
   into row-group sub-blocks with tight A1 ranges and non-overlapping
   coverage. Default budget is 100,000 chars, which is effectively OFF
   for any reasonable workbook on this corpus. Calibration on the
   50-sample subset showed every smaller budget (2k, 4k, 8k) was
   net-zero or net-negative on retrieval recall because the embedding
   cannot discriminate between same-shape row-group children. Lower it
   via `KS_CHUNK_BUDGET_CHARS=2000` only if your downstream consumer
   has a strict per-chunk token economy that demands fragmentation;
   bench any such change against your own corpus first.

Tests: tests/test_chunker_range_tighten.py (3 cases) + tests/test_chunker_size_cap.py
(5 cases) — 8 added, 1071 → 1079 total passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arnav2 arnav2 merged commit e4625d8 into bench/harness-regex-and-in-scope-filter May 20, 2026