chunker: range-tighten clip + opt-in row-group splitter (no recall move; correctness only) by arnav2 · Pull Request #11 · knowledgestack/excel-parser

arnav2 · 2026-05-20T06:55:15Z

Summary

Two chunker-side changes. Neither moves retrieval recall on the full 912 SpreadsheetBench v0.1 with either BGE-small or text-embedding-3-large. I'm proposing this for correctness and citation-grade output, not for the bench number, and I want to be upfront about that.

Stacked on #10 (recall-90 harness PR) since the bench instrumentation it adds is what I used to measure the (lack of) impact below.

What changed

A · Range-tightening clip in `_block_to_chunk`

`_tight_content_bbox(block, sheet)` walks the cells inside `block.cell_range` and returns the bounding box of the cells whose `raw_value` is non-None or whose `display_value` is non-empty. The chunk's claimed `(top_left_cell, bottom_right_cell)` is then clipped to that bbox before emission. The renderer continues iterating the original `block.cell_range`, so the narrowed claim is always a superset of the cells that contributed to `render_text`, and the invariant is preserved.
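A minimal sketch of the clipping idea, not the code in this PR. The `Cell` dataclass, the `(top, left, bottom, right)` tuple layout, and the choice to strip whitespace from `display_value` before testing it are all assumptions made for illustration; only the field names `raw_value` and `display_value` come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for a parsed cell (1-based row/col)."""
    row: int
    col: int
    raw_value: object = None
    display_value: str = ""


def tight_content_bbox(cells, claimed):
    """Return the (top, left, bottom, right) bbox of non-empty cells
    inside `claimed`, or None when the claimed range holds no content."""
    top, left, bottom, right = claimed
    rows, cols = [], []
    for cell in cells:
        inside = top <= cell.row <= bottom and left <= cell.col <= right
        if not inside:
            continue
        # Assumption: whitespace-only display text counts as empty.
        has_content = (
            cell.raw_value is not None
            or bool((cell.display_value or "").strip())
        )
        if has_content:
            rows.append(cell.row)
            cols.append(cell.col)
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


cells = [
    Cell(1, 1, "Region", "Region"),
    Cell(2, 3, 42, "42"),
    Cell(3, 2, None, "n/a"),      # no raw value, but visible text
    Cell(4, 16384, None, ""),     # styled-empty cell at column XFD
]
bbox = tight_content_bbox(cells, (1, 1, 4, 16384))
print(bbox)  # (1, 1, 3, 3)
```

Returning `None` for an all-empty range leaves the caller to decide what to emit in that case; how the PR handles it is not stated here.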

Concrete pathology this fixes: on the SpreadsheetBench corpus, several sheets carry styled-empty cells stretching across the full XFD width (16,384 columns). The segmenter sees them, and the chunker dutifully emits a chunk claiming A1:XFD4 even though there is no actual data outside the upper-left corner. Without the clip, citation UIs would highlight the entire sheet width as the "source" of any retrieved chunk.
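For a sense of scale, the arithmetic behind that over-claim can be checked directly. The helper name `column_index` is hypothetical; the conversion is the standard base-26 reading of A1 column letters, and the 15-cell figure uses the 5×3 corner mentioned in the commit message.

```python
def column_index(letters: str) -> int:
    """Convert A1-style column letters to a 1-based column number."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


xfd = column_index("XFD")
claimed_cells = 4 * xfd          # A1:XFD4 spans 4 rows
data_cells = 5 * 3               # corner size quoted in the commit message
print(xfd)                       # 16384
print(claimed_cells)             # 65536
print(claimed_cells // data_cells)  # 4369
```

In other words, the unclipped claim covers several thousand times more cells than actually carry data.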

B · `_split_block_by_rows` + `KS_CHUNK_BUDGET_CHARS` env var

Row-group splitter for oversize blocks. Each child has a tight, non-overlapping A1 range over its data rows; siblings sum to the parent's row coverage exactly. The default is `KS_CHUNK_BUDGET_CHARS=100000`, which is effectively OFF for any reasonable workbook. The splitter is available behind the env var for downstream consumers with a strict per-chunk token economy.
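A rough sketch of how a budget-driven row split could work, under stated assumptions: the greedy grouping, the one-character newline cost per row, and the helper names `chunk_budget_chars` and `split_rows_by_budget` are illustrative and not taken from the PR. Only the env var name and the 100,000 default come from the description above.

```python
import os

DEFAULT_BUDGET_CHARS = 100_000  # default stated in the PR description


def chunk_budget_chars() -> int:
    """Read the per-chunk character budget from the environment."""
    return int(os.environ.get("KS_CHUNK_BUDGET_CHARS", DEFAULT_BUDGET_CHARS))


def split_rows_by_budget(row_texts, first_row, budget):
    """Greedily group consecutive rows so each group's rendered length
    stays within `budget`. Returns inclusive (start_row, end_row) pairs.
    A single row larger than the budget becomes its own group."""
    groups = []
    start = first_row
    size = 0
    for offset, text in enumerate(row_texts):
        row = first_row + offset
        cost = len(text) + 1  # assumption: one newline per rendered row
        if size and size + cost > budget:
            groups.append((start, row - 1))
            start, size = row, 0
        size += cost
    if row_texts:
        groups.append((start, first_row + len(row_texts) - 1))
    return groups


rows = ["x" * 10] * 5  # five data rows starting at sheet row 2
print(split_rows_by_budget(rows, first_row=2, budget=25))
# [(2, 3), (4, 5), (6, 6)]
print(split_rows_by_budget(rows, first_row=2, budget=100_000))
# [(2, 6)]
```

Because each group starts on the row after the previous group ends, the children are non-overlapping and their row counts add up to the parent's, which is the property the description calls out.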

Empirical numbers (full 912 / text-embedding-3-large)

| Metric | Before this PR | After this PR | Δ |
|---|---|---|---|
| `recall_text@5` | 0.750 | 0.750 | 0.000 |
| `recall_text@5_in_scope` | 0.838 | 0.838 | 0.000 |
| `recall_geometric@5` | 0.482 | 0.482 | 0.000 |
| `recall_geometric@5_in_scope` | 0.538 | 0.538 | 0.000 |
| `mean_parse_ms` | 156 | 174 | +18 ms |
| net instance flips | n/a | n/a | 0 miss→hit, 0 hit→miss |

Why no retrieval delta despite a real bug fix

The over-claims happen on sheets where the GT cells live in other blocks. So the over-claim was already a false negative for geom scoring, not a false positive that needed correction — the metric just didn't see the lie. Citation UIs that highlight the chunk's claimed range are the actual beneficiaries. Without a citation-accuracy metric we can't surface that gain numerically; that metric is follow-up work (see #issue-when-i-file-it).
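A toy illustration of that argument. It assumes geometric scoring counts a hit when a ground-truth cell falls inside a retrieved chunk's claimed range; the harness's actual definition is not shown in this PR, and every range below is made up.

```python
def contains(claimed, cell):
    """True when (row, col) lies inside (top, left, bottom, right)."""
    top, left, bottom, right = claimed
    row, col = cell
    return top <= row <= bottom and left <= col <= right


def geometric_hit(retrieved_ranges, gt_cells):
    """Assumed scoring rule: any GT cell inside any retrieved range."""
    return any(contains(r, c) for r in retrieved_ranges for c in gt_cells)


gt_cells = [(10, 3)]               # C10, which lives in the data block
over_claimed = (1, 1, 4, 16384)    # A1:XFD4 before the clip
tightened = (1, 1, 3, 5)           # same chunk after the clip
data_block = (8, 1, 20, 6)         # separate block holding the GT cell

# Only the header chunk is retrieved: a miss before and after the clip.
miss_before = geometric_hit([over_claimed], gt_cells)
miss_after = geometric_hit([tightened], gt_cells)

# The data block is retrieved too: a hit before and after the clip.
hit_before = geometric_hit([over_claimed, data_block], gt_cells)
hit_after = geometric_hit([tightened, data_block], gt_cells)

print(miss_before, miss_after, hit_before, hit_after)
# False False True True
```

Under this rule the clip could only change the score if a ground-truth cell sat inside the over-claimed but empty region, which is the case the paragraph above says does not occur on this corpus.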

Type of change

🐞 Bug fix (chunk over-claims A1:XFD4-style ranges)
✨ New feature (opt-in row-group splitter behind env var, default off)
🧪 Parser edge case / new regression test (8 new test cases)

Test plan

make test passes — 1071 → 1079 tests
Full 912 SpreadsheetBench v0.1 with text-embedding-3-large: 0 instance flips, recall unchanged
tests/test_chunker_range_tighten.py — 3 cases including a corpus-fixture invariant
tests/test_chunker_size_cap.py — 5 cases including the env-var-driven split + default-no-split

Notes for reviewers

Honest framing on the value: this PR is a correctness fix. If your review heuristic is "does it move the bench number," it doesn't. If it's "is the chunk's claimed range honest about what cells it covers," yes.
B is plumbing, not active code. The default keeps tables whole. Lower the budget at your own risk, and validate against your own corpus.
Stacks on #10 (bench(harness): make malformed GT specs scorable + add in-scope recall filter); review #10 first.

🤖 Generated with Claude Code

…p splitter

Two chunker-side changes, neither moves retrieval recall: explicit calibration on the full 912 SpreadsheetBench v0.1 with both BGE-small and text-embedding-3-large shows 0 instance flips for either change. Shipping for correctness, not recall.

A. _tight_content_bbox + clip in _block_to_chunk

Walks `block.cell_range`, finds the bbox of cells whose `raw_value` or non-empty `display_value` is set, and clips the chunk's claimed `(top_left_cell, bottom_right_cell)` to that bbox. Fixes the over-claim pathology where a sheet with styled-empty cells across XFD width produces a chunk claiming `A1:XFD4` despite the actual data sitting in a 5×3 corner. The renderer already iterates the original range, so the narrowed claim is always a superset of cells that contributed to render_text; invariant preserved.

Bench (full 912 / text-embedding-3-large):
  recall_text@5: 0.750 → 0.750 (no change)
  recall_geometric@5: 0.482 → 0.482 (no change)
  recall_text@5_in_scope: 0.838 → 0.838 (no change)
  mean parse_ms: 156 → 174 (+18 ms; bbox walk)
  net instance flips: 0 miss→hit, 0 hit→miss

Why no recall change despite fixing a real over-claim: the over-claims happen on sheets with empty-XFD blocks; the GT cells on those sheets are in OTHER blocks (proper data regions). So the over-claim was already a false negative for geom scoring, not a false positive that needed correction. The pathology lives in the dead zone of the retrieval metric. Citation UIs that highlight the chunk's claimed range still benefit; that's the actual value here.

B. _split_block_by_rows + KS_CHUNK_BUDGET_CHARS env var

When a block's render_text exceeds a configurable budget, split it into row-group sub-blocks with tight A1 ranges and non-overlapping coverage. Default budget is 100,000 chars, effectively OFF for any reasonable workbook on this corpus. Calibration on the 50-sample showed every smaller budget (2k, 4k, 8k) was net-zero or net-negative on retrieval recall because the embedding cannot discriminate between same-shape row-group children. Lower it via `KS_CHUNK_BUDGET_CHARS=2000` only if your downstream consumer has a strict per-chunk token economy that demands fragmentation; bench any such change against your own corpus first.

Tests: tests/test_chunker_range_tighten.py (3 cases) + tests/test_chunker_size_cap.py (5 cases): 8 added, 1071 → 1079 total passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

arnav2 merged commit e4625d8 into bench/harness-regex-and-in-scope-filter May 20, 2026

arnav2 mentioned this pull request May 20, 2026

renderer(tier-1): row anchors + number-format expansion + merged-cell propagation #12

Merged

6 tasks

arnav2 merged 1 commit into bench/harness-regex-and-in-scope-filter from chunker/range-tighten-and-size-cap
