fix(packaging): nest everything under ks_xlsx_parser + add CI/bench infra#9
Conversation
…infra Frank's report on the v0.2.0 PyPI release was correct: the wheel was missing pipeline.py and api.py (top-level modules under src/, which setuptools packages.find skips), AND it was leaking 13 generic top-level packages (models, utils, parsers, ...) into site-packages. Confirmed by building the wheel and inspecting top_level.txt + the file list. Packaging - Move 13 packages + pipeline.py + api.py + py.typed under src/ks_xlsx_parser/ via git mv (history preserved). - Mechanical rewrite of 47 files: `from pipeline import` → `from ks_xlsx_parser.pipeline import`, etc. Dead `xlsx_parser.` refs in docstrings and examples updated to `ks_xlsx_parser.`. - pyproject.toml: packages.find constrained to ks_xlsx_parser*, py.typed declared as package data, `xlsx-parser-api` console script fixed. - Drop PYTHONPATH=src from Makefile — package is now importable. Regression guard - scripts/verify_wheel.py builds the wheel, installs in a clean venv, asserts top_level.txt = ['ks_xlsx_parser'] and the public import surface resolves. Wired into a new wheel-check CI job and the release workflow. The matrix `test` job uses an editable install, which structurally cannot catch a broken wheel — this is the gap that let v0.2.0 ship. CI (ks-backend-style) - Split ci.yml into test/lint/typecheck/wheel-check jobs, uv-backed installs, lint+typecheck non-blocking with TODO until the cleanup PR lands (114 → 45 ruff findings after safe autofix; 59 mypy unchanged). - Add wheel verification step to release.yml before PyPI publish. Benchmark + accuracy tracking - Dockerfile.bench reproduces the SpreadsheetBench retrieval benchmark. Pre-warms BAAI/bge-small-en-v1.5; entrypoint ensures corpus, runs eval, appends to history, triages. - .github/workflows/benchmark.yml: 60-instance sample on PRs, full 912-instance corpus weekly + on manual dispatch. Uploads reports as artifact and posts recall summary to the job step summary. - scripts/append_bench_history.py appends one row per run to tests/benchmarks/reports/history.jsonl tagged with the git commit and prints the row-over-row delta. Goal documented: text recall@5 > 0.90 (currently 0.704). Recall failure triage (turns "recall is low" into a ranked worklist) - eval_retrieval.py --emit-failures dumps every recall@5 miss with the top-8 ranked chunks + ground-truth values + a failure_bucket label (answer_absent_from_chunks / present_but_ranked_low / wrong_sheet / geometric_no_overlap / no_chunks / parse_error / unparseable_ground_truth). - summary.json gains a failure_buckets histogram per parser; summary.md prints the bucket table. - scripts/triage_recall.py reads failures.ndjson, ranks buckets by count, prints 3 exemplar failures per bucket so a human or agent can see WHY each miss happened without re-running the benchmark. - docs/recall-investigation.md documents the diagnosis framework and three named hypotheses (chunk-size dilution, formula-expression rendering, range-bookkeeping drift) for the next investigation pass. - .claude/skills/recall-failure-triage/SKILL.md — agent guide. Misc - New `bench` optional-dependency group (sentence-transformers, numpy). - make install-dev alias (Frank pattern-matched this verb off ks-backend). - New make targets: wheel-check, bench-track, docker-bench. Verification - 1041 tests pass with no PYTHONPATH=src. - Wheel built + verified in clean venv. - mypy: 59 pre-existing findings, no new ones. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified locally on a real 60-instance SpreadsheetBench sample:
recall_text@1 = 0.550, recall_text@5 = 0.633, geometric@5 = 0.283
failure_buckets = {answer_absent_from_chunks: 11, ...all others 0}
Producer side of --emit-failures runs end-to-end on the real corpus.
The doc captures the exact loop (install → corpus → eval → triage →
history) plus Docker parity and troubleshooting.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Local setup verified end-to-end. Doc added: Ran Surprising and actionable: 11/11 recall@5 misses fall in Concrete example surfaced by
Other observations the doc covers:
The doc covers install → corpus → eval → triage → history → Docker parity → "adding a new bucket" → troubleshooting. Anyone can reproduce this in ~5 minutes after the first corpus + embedding-model download. |
…script Ran the 200-instance seed=1337 sample with --emit-failures locally, enriched the failure rows with per-instance diagnostics (input.xlsx vs chunk bbox vs answer.xlsx vs sheet enumeration), and clustered the actionable failures by observable pattern. Key empirical finding: 127 of 200 instances are `instruction_requires_execution` — the benchmark asks the system to COMPUTE and WRITE the answer; a retrieval parser cannot satisfy them. The headline recall_text@5 = 0.635 includes those instances in the denominator, dragging the metric down by ~25 pp. Real in-scope recall is **0.59**, and getting to 0.90 means closing 30 actionable failures. Adds: - docs/planning/recall-90/README.md — overview, in-scope vs out-of-scope framing, status table, the iteration loop, parallelism gates by file scope. - docs/planning/recall-90/00..05.md — one cluster per file. Each TODO is self-contained with: cluster pattern, repro instance IDs, diagnostic signature (jq query), file scope, acceptance criteria stated against the 200-sample seed=1337 rerun, a failing-test sketch. Deliberately omits proposed fixes and effort estimates (the picking agent does that work). - scripts/enrich_failures.py — post-hoc enricher. Reads results.ndjson, re-parses each failed instance with openpyxl + ks-xlsx-parser, emits per-row diagnostic columns (gt_sheet, gt_cell_raw, gt_cell_formula, cached value, chunked sheets vs workbook sheets, chunk bbox vs GT range, ten heuristic flags). Avoids re-running the slow benchmark to add diagnostic columns. - .claude/skills/recall-failure-triage/SKILL.md — adds the in-scope filter step (Step 0). Cluster summary (200-sample seed=1337): 00 benchmark spec parser bugs 8 instances 01 array-formula rendering 2 instances 02 chunk range vs text mismatch 7 instances 03 cell-drop / uncached formula ≤10 instances (some unscorable) 04 single-chunk-per-sheet dilution 2 instances 05 out-of-scope execution instances 127 instances (scoring fix) Parallelism: each TODO scoped to a different file slice; 00 and 05 both touch eval_retrieval.py so they're flagged for serialization. Everything else can land independently. Updates docs/recall-investigation.md with the post-200-sample reality: H1 (chunk-size dilution) was largely refuted; H3 (range bookkeeping) landed as TODO 02; H2 (formula rendering) extended into TODO 03 covering both cached-value gaps and array formulas. 1041 tests still passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recall→0.90 cluster TODO files were coordination notes for internal agents, not material the open-source community needs. Keep them locally (disk-side) but stop tracking them so they don't leak via the public repo. docs/recall-investigation.md + docs/benchmark-local-setup.md stay — those have value for external contributors reproducing benchmark results. Removed two now-broken cross-references from recall-investigation.md that pointed at the planning dir. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps version. Adds RELEASE_NOTES_v0.2.1.md (consumed by release.yml). Includes two follow-ups that got lost in the squash of PR #9: - gitignore docs/planning/ (internal recall worklist) - drop two stale docs/planning/ links from recall-investigation.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
Building the v0.2.0 wheel locally and unzipping it confirmed two distinct packaging bugs:
pipeline.pyandapi.pywere missing from the wheel — top-level modules undersrc/, andsetuptools.packages.findonly picks up packages, so they were silently dropped.top_level.txtlisted 13 generic siblings ofks_xlsx_parser(models,utils,parsers, …). Installing the package would shadow any othermodelsin site-packages.CI didn't catch this because the matrix
testjob uses an editable install —src/is onsys.pathand the wheel-packaging defect is invisible.What this PR does
git mvof 13 packages +pipeline.py+api.py+py.typedundersrc/ks_xlsx_parser/; mechanical rewrite of 47 files' imports;pyproject.tomlconstrainspackages.findtoks_xlsx_parser*; dropsPYTHONPATH=srcfrom the Makefile.scripts/verify_wheel.pybuilds the wheel, installs in a clean venv, assertstop_level.txt == ['ks_xlsx_parser']and the public imports resolve. Wired into a newwheel-checkCI job and aVerify wheelstep inrelease.yml.ci.ymlintotest/wheel-check/lint/typecheckjobs,uv-backed installs. Lint and typecheck are non-blocking with TODO markers (45 pre-existing ruff, 59 mypy — separate cleanup PR).Dockerfile.benchreproduces SpreadsheetBench retrieval scoring..github/workflows/benchmark.ymlruns a 60-instance sample on PRs and the full 912-instance corpus weekly.scripts/append_bench_history.pyappends each run totests/benchmarks/reports/history.jsonltagged with the git commit.eval_retrieval.py --emit-failuresdumps per-miss diagnostics;scripts/triage_recall.pyproduces a ranked bucket histogram + exemplar failures so "recall@5 = X" becomes a ranked worklist.make install-devalias (Frank pattern-matched this off ks-backend).What this PR deliberately does NOT do
v0.2.1is a separate step (seedocs/RELEASE_PROCESS.md).docs/recall-investigation.mddocuments the diagnosis framework for the follow-up PRs.