bench(harness): make malformed GT specs scorable + add in-scope recall filter#10
Merged
Merged
Conversation
…l filter
Two harness-side changes to ks-xlsx-parser's SpreadsheetBench eval,
plus regression-test guards locking already-shipped parser behavior.
1. parse_position_spec tolerates malformed dataset entries:
- unmatched single quotes (`Sheet1'!A1`, `'Sheet1!'A1`)
- fullwidth colon (`G12:J15`)
- whitespace around `:` (`A1: A89`)
14 of 912 instances were scored as misses even when the parser was
correct; they now score normally. On the full 912 v0.1:
geom@5 0.369 → 0.478 (+10.9 pp), text@5 0.704 → 0.754 (+5.0 pp).
2. classify_execution_required + dual recall numbers.
~33% of SpreadsheetBench instances ask the system to compute and
write a value that doesn't exist in the input — a retrieval parser
cannot satisfy these. summary.json / history.jsonl / summary.md now
emit BOTH `recall_*@5` (all 912) and `recall_*@5_in_scope`
(denominator excludes execution-required instances). triage_recall
default-filters out-of-scope rows; opt-in via --include-out-of-scope.
`out_of_scope.txt` audit log written per run.
On the full 912: text@5 in-scope 0.841, geom@5 in-scope 0.534.
3. enrich_failures: fix two `answer_sheet=None` false-positive bugs.
When the dataset entry has no explicit sheet name (561/912 instances)
the old code skipped the first-worksheet fallback used by the
scoring path. Result: spurious flags on hundreds of instances.
`instruction_requires_execution` 498 → 193
`gt_range_empty_in_workbook` 477 → 172
The remaining counts now match the corrected classifier in
eval_retrieval.py.
4. Regression-test guards for parser behavior that is already correct
on `main` (no parser code changed, just locked in tests):
- test_eval_retrieval_spec_parser.py (17 cases) — spec-parser
tolerance for every malformed shape observed in the corpus
- test_eval_retrieval_classify.py (8 cases) — classifier semantics
+ dual-recall aggregator math
- test_array_formula_rendering.py (3 cases) — ArrayFormula repr
must not leak into chunk render_text
- test_formula_uncached_rendering.py (2 cases) — uncached formula
cells render their formula source (not None or repr)
Tests: 1071 passed (+17 new). No parser internals changed; all gains
come from harness scorability and from surfacing a metric that was
hidden inside the eval output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merged
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two harness-side improvements to the SpreadsheetBench retrieval-recall benchmark, plus regression-test guards that lock in already-correct parser behavior. No
src/ks_xlsx_parser/*changes — this PR is entirelyscripts/+tests/.The motivating goal was the recall-90 roadmap in
docs/planning/recall-90/. This PR doesn't reach 0.90; it ships the work that was actually completed and surfaces the metric the roadmap gates on (in-scope recall) so future parser work has a clean signal.What changed
1.
parse_position_spectolerates malformed dataset entriesEight to fourteen instances in SpreadsheetBench v0.1 carry GT strings the harness regex never matched:
Sheet1'!A1,Dashboard'!B8,Output'!B11:B17'RAWDATA!'A1:P6,'OUTPUT!'A1:P6'G12:J15:'A1: A89','C1: C7'All now parse via a normalize-then-fallback pass in
scripts/eval_retrieval.py::parse_position_spec+parse_range. These instances were scored as misses despite the parser being correct.Full 912 v0.1 impact:
geom@50.369 → 0.478 (+10.9 pp),text@50.704 → 0.754 (+5.0 pp).2.
classify_execution_required+ dual recall numbers (cluster 05)~33% of SpreadsheetBench instances ask the system to compute and write a value that doesn't exist in the input (e.g. "Round B2 to nearest $5 and put it in C2"). A retrieval parser cannot satisfy these. They drag headline recall down ~16pp without representing parser quality.
summary.json/history.jsonl/summary.mdnow emit BOTH:recall_*@5— denominator = all 912 (long-standing number)recall_*@5_in_scope— denominator excludes execution-required instances (the metric the roadmap gates on)scripts/triage_recall.pydefault-filters out-of-scope rows; opt-in via--include-out-of-scope. An audit logout_of_scope.txtis written per run so the classifier's behavior is diffable.Full 912 v0.1 in-scope:
text@50.841,geom@50.534 (denominator 667 in-scope, 245 OOS).3.
enrich_failures.py: two real bug fixesWhen the dataset entry has no explicit
answer_sheet(561/912 instances) the old code skipped the first-worksheet fallback that the scoring path uses. Result: hundreds of spurious flags.instruction_requires_executiongt_range_empty_in_workbookThe corrected counts now agree with the new
classify_execution_requiredineval_retrieval.py.4. Regression-test guards (parser code unchanged)
Four new test files lock in behavior that's already correct on
mainbut was undertested:test_eval_retrieval_spec_parser.py(17 cases) — spec-parser tolerance for every malformed shape observed in the corpustest_eval_retrieval_classify.py(8 cases) — classifier semantics + dual-recall aggregator mathtest_array_formula_rendering.py(3 cases) —<ArrayFormula object …>repr must not leak into chunk render_texttest_formula_uncached_rendering.py(2 cases) — uncached formula cells render their formula source (not None or repr)Type of change
Test plan
make testpasses locally — 1071 passed (1054 → 1071, +17 new)history.jsonlrow appended (commitcf633d1, n=912)scripts/enrich_failures.pyre-runs cleanly on the corrected output; flag counts match the classifierruff checkclean on every file this PR touches (pre-existing F401/F841 issues in untouched code remain; not widened)recall_*@5_in_scopeis the metric the recall-90 roadmap should gate onNotes for reviewers
text@5is 0.841 (target 0.90), in-scopegeom@5is 0.534 (target 0.70). The residual gap (cluster 02 range-bookkeeping + the cluster-04 chunk-size cap) is left to a follow-up. Cluster docs indocs/planning/recall-90/(gitignored) carry status snapshots.recall_*_all/recall_*_in_scopeschema is additive — existing consumers ofsummary.jsonkeep working.🤖 Generated with Claude Code