fix(packaging): nest everything under ks_xlsx_parser + add CI/bench infra by arnav2 · Pull Request #9 · knowledgestack/excel-parser

arnav2 · 2026-05-19T21:36:10Z

Why

error: Import "ks_xlsx_parser.pipeline" could not be resolved (reportMissingImports)

Building the v0.2.0 wheel locally and unzipping it confirmed two distinct packaging bugs:

pipeline.py and api.py were missing from the wheel — top-level modules under src/, and setuptools.packages.find only picks up packages, so they were silently dropped.
Namespace pollution — top_level.txt listed 13 generic siblings of ks_xlsx_parser (models, utils, parsers, …). Installing the package would shadow any other models in site-packages.

CI didn't catch this because the matrix test job uses an editable install — src/ is on sys.path and the wheel-packaging defect is invisible.

What this PR does

Restructures into a proper nested package — git mv of 13 packages + pipeline.py + api.py + py.typed under src/ks_xlsx_parser/; mechanical rewrite of 47 files' imports; pyproject.toml constrains packages.find to ks_xlsx_parser*; drops PYTHONPATH=src from the Makefile.
Adds the regression guard CI was missing — scripts/verify_wheel.py builds the wheel, installs in a clean venv, asserts top_level.txt == ['ks_xlsx_parser'] and the public imports resolve. Wired into a new wheel-check CI job and a Verify wheel step in release.yml.
ks-backend-style CI — split ci.yml into test / wheel-check / lint / typecheck jobs, uv-backed installs. Lint and typecheck are non-blocking with TODO markers (45 pre-existing ruff, 59 mypy — separate cleanup PR).
Docker benchmark + accuracy tracking — Dockerfile.bench reproduces SpreadsheetBench retrieval scoring. .github/workflows/benchmark.yml runs a 60-instance sample on PRs and the full 912-instance corpus weekly. scripts/append_bench_history.py appends each run to tests/benchmarks/reports/history.jsonl tagged with the git commit.
Recall failure triage — eval_retrieval.py --emit-failures dumps per-miss diagnostics; scripts/triage_recall.py produces a ranked bucket histogram + exemplar failures so "recall@5 = X" becomes a ranked worklist.
make install-dev alias (Frank pattern-matched this off ks-backend).

What this PR deliberately does NOT do

No version bump. Cutting v0.2.1 is a separate step (see docs/RELEASE_PROCESS.md).
No recall improvements. docs/recall-investigation.md documents the diagnosis framework for the follow-up PRs.
No fix for the pre-existing ruff/mypy findings — separate cleanup PR.

…infra Frank's report on the v0.2.0 PyPI release was correct: the wheel was missing pipeline.py and api.py (top-level modules under src/, which setuptools packages.find skips), AND it was leaking 13 generic top-level packages (models, utils, parsers, ...) into site-packages. Confirmed by building the wheel and inspecting top_level.txt + the file list. Packaging - Move 13 packages + pipeline.py + api.py + py.typed under src/ks_xlsx_parser/ via git mv (history preserved). - Mechanical rewrite of 47 files: `from pipeline import` → `from ks_xlsx_parser.pipeline import`, etc. Dead `xlsx_parser.` refs in docstrings and examples updated to `ks_xlsx_parser.`. - pyproject.toml: packages.find constrained to ks_xlsx_parser*, py.typed declared as package data, `xlsx-parser-api` console script fixed. - Drop PYTHONPATH=src from Makefile — package is now importable. Regression guard - scripts/verify_wheel.py builds the wheel, installs in a clean venv, asserts top_level.txt = ['ks_xlsx_parser'] and the public import surface resolves. Wired into a new wheel-check CI job and the release workflow. The matrix `test` job uses an editable install, which structurally cannot catch a broken wheel — this is the gap that let v0.2.0 ship. CI (ks-backend-style) - Split ci.yml into test/lint/typecheck/wheel-check jobs, uv-backed installs, lint+typecheck non-blocking with TODO until the cleanup PR lands (114 → 45 ruff findings after safe autofix; 59 mypy unchanged). - Add wheel verification step to release.yml before PyPI publish. Benchmark + accuracy tracking - Dockerfile.bench reproduces the SpreadsheetBench retrieval benchmark. Pre-warms BAAI/bge-small-en-v1.5; entrypoint ensures corpus, runs eval, appends to history, triages. - .github/workflows/benchmark.yml: 60-instance sample on PRs, full 912-instance corpus weekly + on manual dispatch. Uploads reports as artifact and posts recall summary to the job step summary. - scripts/append_bench_history.py appends one row per run to tests/benchmarks/reports/history.jsonl tagged with the git commit and prints the row-over-row delta. Goal documented: text recall@5 > 0.90 (currently 0.704). Recall failure triage (turns "recall is low" into a ranked worklist) - eval_retrieval.py --emit-failures dumps every recall@5 miss with the top-8 ranked chunks + ground-truth values + a failure_bucket label (answer_absent_from_chunks / present_but_ranked_low / wrong_sheet / geometric_no_overlap / no_chunks / parse_error / unparseable_ground_truth). - summary.json gains a failure_buckets histogram per parser; summary.md prints the bucket table. - scripts/triage_recall.py reads failures.ndjson, ranks buckets by count, prints 3 exemplar failures per bucket so a human or agent can see WHY each miss happened without re-running the benchmark. - docs/recall-investigation.md documents the diagnosis framework and three named hypotheses (chunk-size dilution, formula-expression rendering, range-bookkeeping drift) for the next investigation pass. - .claude/skills/recall-failure-triage/SKILL.md — agent guide. Misc - New `bench` optional-dependency group (sentence-transformers, numpy). - make install-dev alias (Frank pattern-matched this verb off ks-backend). - New make targets: wheel-check, bench-track, docker-bench. Verification - 1041 tests pass with no PYTHONPATH=src. - Wheel built + verified in clean venv. - mypy: 59 pre-existing findings, no new ones. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verified locally on a real 60-instance SpreadsheetBench sample: recall_text@1 = 0.550, recall_text@5 = 0.633, geometric@5 = 0.283 failure_buckets = {answer_absent_from_chunks: 11, ...all others 0} Producer side of --emit-failures runs end-to-end on the real corpus. The doc captures the exact loop (install → corpus → eval → triage → history) plus Docker parity and troubleshooting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

arnav2 · 2026-05-19T21:39:36Z

Local setup verified end-to-end. Doc added: docs/benchmark-local-setup.md.

Ran python scripts/eval_retrieval.py --parsers ks --sample 60 --emit-failures against the real SpreadsheetBench corpus (downloaded via make corpus-download, 91 MB tarball → 2,726 .xlsx files). Producer side of --emit-failures runs clean. Result:

recall_text@1 = 0.550   recall_text@3 = 0.617   recall_text@5 = 0.633
recall_geometric@5 = 0.283
failure_buckets:
  answer_absent_from_chunks: 11   ← 100% of misses
  present_but_ranked_low:     0
  wrong_sheet:                0
  geometric_no_overlap:       0
  ...all others:              0

Surprising and actionable: 11/11 recall@5 misses fall in answer_absent_from_chunks. Refutes H1 (chunk-size dilution / present_but_ranked_low) and confirms H2 (the parser is dropping cells — extraction gap, not chunking). The next investigation PR should focus on parsers/cell_parser.py + rendering/text_renderer.py, NOT on chunk size tuning.

Concrete example surfaced by triage_recall.py (instance 55260):

Q: "round the amount in the second column to the nearest $5… round the value 367.5…"
answer_position: C1:C2, ground-truth values: ['365', '500']
n_chunks: 1
Top chunk: Sheet2 (1,1):(2,2) — only columns A:B, column C entirely missing from the chunk's range and render_text.

Other observations the doc covers:

Sample sizes matter — 10-instance run gave recall_text@5 = 0.80 (1 miss); 60-instance run gave 0.633 (11 misses). Decision-grade comparisons need --sample 60 minimum, full corpus for final numbers.
Mean parse time on the sample: 54 ms (P50 10 ms) — well below the 251 ms full-corpus number in the screenshot, because the sample over-represents tiny instances.

tests/benchmarks/reports/history.jsonl (gitignored) is now populated with both runs locally, demonstrating the row-over-row delta path also works.

The doc covers install → corpus → eval → triage → history → Docker parity → "adding a new bucket" → troubleshooting. Anyone can reproduce this in ~5 minutes after the first corpus + embedding-model download.

…script Ran the 200-instance seed=1337 sample with --emit-failures locally, enriched the failure rows with per-instance diagnostics (input.xlsx vs chunk bbox vs answer.xlsx vs sheet enumeration), and clustered the actionable failures by observable pattern. Key empirical finding: 127 of 200 instances are `instruction_requires_execution` — the benchmark asks the system to COMPUTE and WRITE the answer; a retrieval parser cannot satisfy them. The headline recall_text@5 = 0.635 includes those instances in the denominator, dragging the metric down by ~25 pp. Real in-scope recall is **0.59**, and getting to 0.90 means closing 30 actionable failures. Adds: - docs/planning/recall-90/README.md — overview, in-scope vs out-of-scope framing, status table, the iteration loop, parallelism gates by file scope. - docs/planning/recall-90/00..05.md — one cluster per file. Each TODO is self-contained with: cluster pattern, repro instance IDs, diagnostic signature (jq query), file scope, acceptance criteria stated against the 200-sample seed=1337 rerun, a failing-test sketch. Deliberately omits proposed fixes and effort estimates (the picking agent does that work). - scripts/enrich_failures.py — post-hoc enricher. Reads results.ndjson, re-parses each failed instance with openpyxl + ks-xlsx-parser, emits per-row diagnostic columns (gt_sheet, gt_cell_raw, gt_cell_formula, cached value, chunked sheets vs workbook sheets, chunk bbox vs GT range, ten heuristic flags). Avoids re-running the slow benchmark to add diagnostic columns. - .claude/skills/recall-failure-triage/SKILL.md — adds the in-scope filter step (Step 0). Cluster summary (200-sample seed=1337): 00 benchmark spec parser bugs 8 instances 01 array-formula rendering 2 instances 02 chunk range vs text mismatch 7 instances 03 cell-drop / uncached formula ≤10 instances (some unscorable) 04 single-chunk-per-sheet dilution 2 instances 05 out-of-scope execution instances 127 instances (scoring fix) Parallelism: each TODO scoped to a different file slice; 00 and 05 both touch eval_retrieval.py so they're flagged for serialization. Everything else can land independently. Updates docs/recall-investigation.md with the post-200-sample reality: H1 (chunk-size dilution) was largely refuted; H3 (range bookkeeping) landed as TODO 02; H2 (formula rendering) extended into TODO 03 covering both cached-value gaps and array formulas. 1041 tests still passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The recall→0.90 cluster TODO files were coordination notes for internal agents, not material the open-source community needs. Keep them locally (disk-side) but stop tracking them so they don't leak via the public repo. docs/recall-investigation.md + docs/benchmark-local-setup.md stay — those have value for external contributors reproducing benchmark results. Removed two now-broken cross-references from recall-investigation.md that pointed at the planning dir. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bumps version. Adds RELEASE_NOTES_v0.2.1.md (consumed by release.yml). Includes two follow-ups that got lost in the squash of PR #9: - gitignore docs/planning/ (internal recall worklist) - drop two stale docs/planning/ links from recall-investigation.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

arnav2 and others added 2 commits May 19, 2026 17:35

arnav2 and others added 2 commits May 19, 2026 17:58

arnav2 added bug Something isn't working enhancement New feature or request labels May 19, 2026

arnav2 merged commit 1328894 into main May 19, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(packaging): nest everything under ks_xlsx_parser + add CI/bench infra#9

fix(packaging): nest everything under ks_xlsx_parser + add CI/bench infra#9
arnav2 merged 4 commits into
mainfrom
arnav2/fix-xlsx-packaging-and-ci

arnav2 commented May 19, 2026 •

edited

Loading

Uh oh!

arnav2 commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arnav2 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What this PR does

What this PR deliberately does NOT do

Uh oh!

arnav2 commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arnav2 commented May 19, 2026 •

edited

Loading