bench(harness): make malformed GT specs scorable + add in-scope recall filter by arnav2 · Pull Request #10 · knowledgestack/excel-parser

arnav2 · 2026-05-20T02:48:40Z

Summary

Two harness-side improvements to the SpreadsheetBench retrieval-recall benchmark, plus regression-test guards that lock in already-correct parser behavior. No src/ks_xlsx_parser/* changes — this PR is entirely scripts/ + tests/.

The motivating goal was the recall-90 roadmap in docs/planning/recall-90/. This PR doesn't reach 0.90; it ships the work that was actually completed and surfaces the metric the roadmap gates on (in-scope recall) so future parser work has a clean signal.

What changed

1. `parse_position_spec` tolerates malformed dataset entries

Eight to fourteen instances in SpreadsheetBench v0.1 carry GT strings the harness regex never matched:

Shape	Example	Count
unmatched single quote	`Sheet1'!A1`, `Dashboard'!B8`, `Output'!B11:B17`	5
quote past separator	`'RAWDATA!'A1:P6,'OUTPUT!'A1:P6'`	2
fullwidth colon	`G12：J15`	1
whitespace around `:`	`'A1: A89'`, `'C1: C7'`	6

All now parse via a normalize-then-fallback pass in scripts/eval_retrieval.py::parse_position_spec + parse_range. These instances were scored as misses despite the parser being correct.

Full 912 v0.1 impact: geom@5 0.369 → 0.478 (+10.9 pp), text@5 0.704 → 0.754 (+5.0 pp).

2. `classify_execution_required` + dual recall numbers (cluster 05)

~33% of SpreadsheetBench instances ask the system to compute and write a value that doesn't exist in the input (e.g. "Round B2 to nearest $5 and put it in C2"). A retrieval parser cannot satisfy these. They drag headline recall down ~16pp without representing parser quality.

summary.json / history.jsonl / summary.md now emit BOTH:

recall_*@5 — denominator = all 912 (long-standing number)
recall_*@5_in_scope — denominator excludes execution-required instances (the metric the roadmap gates on)

scripts/triage_recall.py default-filters out-of-scope rows; opt-in via --include-out-of-scope. An audit log out_of_scope.txt is written per run so the classifier's behavior is diffable.

Full 912 v0.1 in-scope: text@5 0.841, geom@5 0.534 (denominator 667 in-scope, 245 OOS).

3. `enrich_failures.py`: two real bug fixes

When the dataset entry has no explicit answer_sheet (561/912 instances) the old code skipped the first-worksheet fallback that the scoring path uses. Result: hundreds of spurious flags.

Flag	Before	After
`instruction_requires_execution`	498	193
`gt_range_empty_in_workbook`	477	172

The corrected counts now agree with the new classify_execution_required in eval_retrieval.py.

4. Regression-test guards (parser code unchanged)

Four new test files lock in behavior that's already correct on main but was undertested:

test_eval_retrieval_spec_parser.py (17 cases) — spec-parser tolerance for every malformed shape observed in the corpus
test_eval_retrieval_classify.py (8 cases) — classifier semantics + dual-recall aggregator math
test_array_formula_rendering.py (3 cases) — <ArrayFormula object …> repr must not leak into chunk render_text
test_formula_uncached_rendering.py (2 cases) — uncached formula cells render their formula source (not None or repr)

Type of change

🐞 Bug fix (enrich_failures false positives; harness regex misses)
✨ New feature (in-scope recall metric, dual numbers, out_of_scope.txt audit)
🧪 Parser edge case / new regression test (4 new test files, +17 cases)

Test plan

make test passes locally — 1071 passed (1054 → 1071, +17 new)
Full 912 SpreadsheetBench v0.1 bench run completed; history.jsonl row appended (commit cf633d1, n=912)
scripts/enrich_failures.py re-runs cleanly on the corrected output; flag counts match the classifier
ruff check clean on every file this PR touches (pre-existing F401/F841 issues in untouched code remain; not widened)
Reviewer: spot-check the 14 malformed-spec instances flip miss → hit
Reviewer: confirm recall_*@5_in_scope is the metric the recall-90 roadmap should gate on

Notes for reviewers

No parser internals changed, only the eval harness. Tests-4 above guard behavior the parser already has so future refactors don't regress it.
Recall-90 target not reached: in-scope text@5 is 0.841 (target 0.90), in-scope geom@5 is 0.534 (target 0.70). The residual gap (cluster 02 range-bookkeeping + the cluster-04 chunk-size cap) is left to a follow-up. Cluster docs in docs/planning/recall-90/ (gitignored) carry status snapshots.
The dual recall_*_all / recall_*_in_scope schema is additive — existing consumers of summary.json keep working.

🤖 Generated with Claude Code

…l filter Two harness-side changes to ks-xlsx-parser's SpreadsheetBench eval, plus regression-test guards locking already-shipped parser behavior. 1. parse_position_spec tolerates malformed dataset entries: - unmatched single quotes (`Sheet1'!A1`, `'Sheet1!'A1`) - fullwidth colon (`G12：J15`) - whitespace around `:` (`A1: A89`) 14 of 912 instances were scored as misses even when the parser was correct; they now score normally. On the full 912 v0.1: geom@5 0.369 → 0.478 (+10.9 pp), text@5 0.704 → 0.754 (+5.0 pp). 2. classify_execution_required + dual recall numbers. ~33% of SpreadsheetBench instances ask the system to compute and write a value that doesn't exist in the input — a retrieval parser cannot satisfy these. summary.json / history.jsonl / summary.md now emit BOTH `recall_*@5` (all 912) and `recall_*@5_in_scope` (denominator excludes execution-required instances). triage_recall default-filters out-of-scope rows; opt-in via --include-out-of-scope. `out_of_scope.txt` audit log written per run. On the full 912: text@5 in-scope 0.841, geom@5 in-scope 0.534. 3. enrich_failures: fix two `answer_sheet=None` false-positive bugs. When the dataset entry has no explicit sheet name (561/912 instances) the old code skipped the first-worksheet fallback used by the scoring path. Result: spurious flags on hundreds of instances. `instruction_requires_execution` 498 → 193 `gt_range_empty_in_workbook` 477 → 172 The remaining counts now match the corrected classifier in eval_retrieval.py. 4. Regression-test guards for parser behavior that is already correct on `main` (no parser code changed, just locked in tests): - test_eval_retrieval_spec_parser.py (17 cases) — spec-parser tolerance for every malformed shape observed in the corpus - test_eval_retrieval_classify.py (8 cases) — classifier semantics + dual-recall aggregator math - test_array_formula_rendering.py (3 cases) — ArrayFormula repr must not leak into chunk render_text - test_formula_uncached_rendering.py (2 cases) — uncached formula cells render their formula source (not None or repr) Tests: 1071 passed (+17 new). No parser internals changed; all gains come from harness scorability and from surfacing a metric that was hidden inside the eval output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

arnav2 merged commit 1f852a3 into main May 20, 2026
10 checks passed

arnav2 mentioned this pull request May 20, 2026

chunker: range-tighten clip + opt-in row-group splitter (no recall move; correctness only) #11

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bench(harness): make malformed GT specs scorable + add in-scope recall filter#10

bench(harness): make malformed GT specs scorable + add in-scope recall filter#10
arnav2 merged 1 commit into
mainfrom
bench/harness-regex-and-in-scope-filter

arnav2 commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arnav2 commented May 20, 2026

Summary

What changed

1. parse_position_spec tolerates malformed dataset entries

2. classify_execution_required + dual recall numbers (cluster 05)

3. enrich_failures.py: two real bug fixes

4. Regression-test guards (parser code unchanged)

Type of change

Test plan

Notes for reviewers

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `parse_position_spec` tolerates malformed dataset entries

2. `classify_execution_required` + dual recall numbers (cluster 05)

3. `enrich_failures.py`: two real bug fixes