Skip to content

fix(packaging): nest everything under ks_xlsx_parser + add CI/bench infra#9

Merged
arnav2 merged 4 commits into
mainfrom
arnav2/fix-xlsx-packaging-and-ci
May 19, 2026
Merged

fix(packaging): nest everything under ks_xlsx_parser + add CI/bench infra#9
arnav2 merged 4 commits into
mainfrom
arnav2/fix-xlsx-packaging-and-ci

Conversation

@arnav2
Copy link
Copy Markdown
Collaborator

@arnav2 arnav2 commented May 19, 2026

Why

error: Import "ks_xlsx_parser.pipeline" could not be resolved (reportMissingImports)

Building the v0.2.0 wheel locally and unzipping it confirmed two distinct packaging bugs:

  1. pipeline.py and api.py were missing from the wheel — top-level modules under src/, and setuptools.packages.find only picks up packages, so they were silently dropped.
  2. Namespace pollutiontop_level.txt listed 13 generic siblings of ks_xlsx_parser (models, utils, parsers, …). Installing the package would shadow any other models in site-packages.

CI didn't catch this because the matrix test job uses an editable install — src/ is on sys.path and the wheel-packaging defect is invisible.

What this PR does

  • Restructures into a proper nested packagegit mv of 13 packages + pipeline.py + api.py + py.typed under src/ks_xlsx_parser/; mechanical rewrite of 47 files' imports; pyproject.toml constrains packages.find to ks_xlsx_parser*; drops PYTHONPATH=src from the Makefile.
  • Adds the regression guard CI was missingscripts/verify_wheel.py builds the wheel, installs in a clean venv, asserts top_level.txt == ['ks_xlsx_parser'] and the public imports resolve. Wired into a new wheel-check CI job and a Verify wheel step in release.yml.
  • ks-backend-style CI — split ci.yml into test / wheel-check / lint / typecheck jobs, uv-backed installs. Lint and typecheck are non-blocking with TODO markers (45 pre-existing ruff, 59 mypy — separate cleanup PR).
  • Docker benchmark + accuracy trackingDockerfile.bench reproduces SpreadsheetBench retrieval scoring. .github/workflows/benchmark.yml runs a 60-instance sample on PRs and the full 912-instance corpus weekly. scripts/append_bench_history.py appends each run to tests/benchmarks/reports/history.jsonl tagged with the git commit.
  • Recall failure triageeval_retrieval.py --emit-failures dumps per-miss diagnostics; scripts/triage_recall.py produces a ranked bucket histogram + exemplar failures so "recall@5 = X" becomes a ranked worklist.
  • make install-dev alias (Frank pattern-matched this off ks-backend).

What this PR deliberately does NOT do

  • No version bump. Cutting v0.2.1 is a separate step (see docs/RELEASE_PROCESS.md).
  • No recall improvements. docs/recall-investigation.md documents the diagnosis framework for the follow-up PRs.
  • No fix for the pre-existing ruff/mypy findings — separate cleanup PR.

arnav2 and others added 2 commits May 19, 2026 17:35
…infra

Frank's report on the v0.2.0 PyPI release was correct: the wheel was
missing pipeline.py and api.py (top-level modules under src/, which
setuptools packages.find skips), AND it was leaking 13 generic top-level
packages (models, utils, parsers, ...) into site-packages. Confirmed by
building the wheel and inspecting top_level.txt + the file list.

Packaging
- Move 13 packages + pipeline.py + api.py + py.typed under
  src/ks_xlsx_parser/ via git mv (history preserved).
- Mechanical rewrite of 47 files: `from pipeline import` →
  `from ks_xlsx_parser.pipeline import`, etc. Dead `xlsx_parser.` refs
  in docstrings and examples updated to `ks_xlsx_parser.`.
- pyproject.toml: packages.find constrained to ks_xlsx_parser*, py.typed
  declared as package data, `xlsx-parser-api` console script fixed.
- Drop PYTHONPATH=src from Makefile — package is now importable.

Regression guard
- scripts/verify_wheel.py builds the wheel, installs in a clean venv,
  asserts top_level.txt = ['ks_xlsx_parser'] and the public import
  surface resolves. Wired into a new wheel-check CI job and the release
  workflow. The matrix `test` job uses an editable install, which
  structurally cannot catch a broken wheel — this is the gap that let
  v0.2.0 ship.

CI (ks-backend-style)
- Split ci.yml into test/lint/typecheck/wheel-check jobs, uv-backed
  installs, lint+typecheck non-blocking with TODO until the cleanup PR
  lands (114 → 45 ruff findings after safe autofix; 59 mypy unchanged).
- Add wheel verification step to release.yml before PyPI publish.

Benchmark + accuracy tracking
- Dockerfile.bench reproduces the SpreadsheetBench retrieval benchmark.
  Pre-warms BAAI/bge-small-en-v1.5; entrypoint ensures corpus, runs
  eval, appends to history, triages.
- .github/workflows/benchmark.yml: 60-instance sample on PRs, full
  912-instance corpus weekly + on manual dispatch. Uploads reports as
  artifact and posts recall summary to the job step summary.
- scripts/append_bench_history.py appends one row per run to
  tests/benchmarks/reports/history.jsonl tagged with the git commit
  and prints the row-over-row delta. Goal documented:
  text recall@5 > 0.90 (currently 0.704).

Recall failure triage (turns "recall is low" into a ranked worklist)
- eval_retrieval.py --emit-failures dumps every recall@5 miss with the
  top-8 ranked chunks + ground-truth values + a failure_bucket label
  (answer_absent_from_chunks / present_but_ranked_low / wrong_sheet /
  geometric_no_overlap / no_chunks / parse_error / unparseable_ground_truth).
- summary.json gains a failure_buckets histogram per parser; summary.md
  prints the bucket table.
- scripts/triage_recall.py reads failures.ndjson, ranks buckets by
  count, prints 3 exemplar failures per bucket so a human or agent can
  see WHY each miss happened without re-running the benchmark.
- docs/recall-investigation.md documents the diagnosis framework and
  three named hypotheses (chunk-size dilution, formula-expression
  rendering, range-bookkeeping drift) for the next investigation pass.
- .claude/skills/recall-failure-triage/SKILL.md — agent guide.

Misc
- New `bench` optional-dependency group (sentence-transformers, numpy).
- make install-dev alias (Frank pattern-matched this verb off ks-backend).
- New make targets: wheel-check, bench-track, docker-bench.

Verification
- 1041 tests pass with no PYTHONPATH=src.
- Wheel built + verified in clean venv.
- mypy: 59 pre-existing findings, no new ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified locally on a real 60-instance SpreadsheetBench sample:
  recall_text@1 = 0.550, recall_text@5 = 0.633, geometric@5 = 0.283
  failure_buckets = {answer_absent_from_chunks: 11, ...all others 0}

Producer side of --emit-failures runs end-to-end on the real corpus.
The doc captures the exact loop (install → corpus → eval → triage →
history) plus Docker parity and troubleshooting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arnav2
Copy link
Copy Markdown
Collaborator Author

arnav2 commented May 19, 2026

Local setup verified end-to-end. Doc added: docs/benchmark-local-setup.md.

Ran python scripts/eval_retrieval.py --parsers ks --sample 60 --emit-failures against the real SpreadsheetBench corpus (downloaded via make corpus-download, 91 MB tarball → 2,726 .xlsx files). Producer side of --emit-failures runs clean. Result:

recall_text@1 = 0.550   recall_text@3 = 0.617   recall_text@5 = 0.633
recall_geometric@5 = 0.283
failure_buckets:
  answer_absent_from_chunks: 11   ← 100% of misses
  present_but_ranked_low:     0
  wrong_sheet:                0
  geometric_no_overlap:       0
  ...all others:              0

Surprising and actionable: 11/11 recall@5 misses fall in answer_absent_from_chunks. Refutes H1 (chunk-size dilution / present_but_ranked_low) and confirms H2 (the parser is dropping cells — extraction gap, not chunking). The next investigation PR should focus on parsers/cell_parser.py + rendering/text_renderer.py, NOT on chunk size tuning.

Concrete example surfaced by triage_recall.py (instance 55260):

Q: "round the amount in the second column to the nearest $5… round the value 367.5…"
answer_position: C1:C2, ground-truth values: ['365', '500']
n_chunks: 1
Top chunk: Sheet2 (1,1):(2,2) — only columns A:B, column C entirely missing from the chunk's range and render_text.

Other observations the doc covers:

  • Sample sizes matter — 10-instance run gave recall_text@5 = 0.80 (1 miss); 60-instance run gave 0.633 (11 misses). Decision-grade comparisons need --sample 60 minimum, full corpus for final numbers.
  • Mean parse time on the sample: 54 ms (P50 10 ms) — well below the 251 ms full-corpus number in the screenshot, because the sample over-represents tiny instances.

tests/benchmarks/reports/history.jsonl (gitignored) is now populated with both runs locally, demonstrating the row-over-row delta path also works.

The doc covers install → corpus → eval → triage → history → Docker parity → "adding a new bucket" → troubleshooting. Anyone can reproduce this in ~5 minutes after the first corpus + embedding-model download.

arnav2 and others added 2 commits May 19, 2026 17:58
…script

Ran the 200-instance seed=1337 sample with --emit-failures locally,
enriched the failure rows with per-instance diagnostics (input.xlsx vs
chunk bbox vs answer.xlsx vs sheet enumeration), and clustered the
actionable failures by observable pattern.

Key empirical finding: 127 of 200 instances are
`instruction_requires_execution` — the benchmark asks the system to
COMPUTE and WRITE the answer; a retrieval parser cannot satisfy them.
The headline recall_text@5 = 0.635 includes those instances in the
denominator, dragging the metric down by ~25 pp. Real in-scope recall
is **0.59**, and getting to 0.90 means closing 30 actionable failures.

Adds:
- docs/planning/recall-90/README.md  — overview, in-scope vs out-of-scope
  framing, status table, the iteration loop, parallelism gates by file
  scope.
- docs/planning/recall-90/00..05.md   — one cluster per file. Each TODO
  is self-contained with: cluster pattern, repro instance IDs, diagnostic
  signature (jq query), file scope, acceptance criteria stated against
  the 200-sample seed=1337 rerun, a failing-test sketch. Deliberately
  omits proposed fixes and effort estimates (the picking agent does
  that work).
- scripts/enrich_failures.py — post-hoc enricher. Reads results.ndjson,
  re-parses each failed instance with openpyxl + ks-xlsx-parser, emits
  per-row diagnostic columns (gt_sheet, gt_cell_raw, gt_cell_formula,
  cached value, chunked sheets vs workbook sheets, chunk bbox vs GT
  range, ten heuristic flags). Avoids re-running the slow benchmark to
  add diagnostic columns.
- .claude/skills/recall-failure-triage/SKILL.md — adds the in-scope
  filter step (Step 0).

Cluster summary (200-sample seed=1337):
  00  benchmark spec parser bugs              8 instances
  01  array-formula rendering                 2 instances
  02  chunk range vs text mismatch           7 instances
  03  cell-drop / uncached formula          ≤10 instances (some unscorable)
  04  single-chunk-per-sheet dilution        2 instances
  05  out-of-scope execution instances    127 instances (scoring fix)

Parallelism: each TODO scoped to a different file slice; 00 and 05 both
touch eval_retrieval.py so they're flagged for serialization. Everything
else can land independently.

Updates docs/recall-investigation.md with the post-200-sample reality:
H1 (chunk-size dilution) was largely refuted; H3 (range bookkeeping)
landed as TODO 02; H2 (formula rendering) extended into TODO 03 covering
both cached-value gaps and array formulas.

1041 tests still passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recall→0.90 cluster TODO files were coordination notes for internal
agents, not material the open-source community needs. Keep them locally
(disk-side) but stop tracking them so they don't leak via the public
repo. docs/recall-investigation.md + docs/benchmark-local-setup.md
stay — those have value for external contributors reproducing benchmark
results.

Removed two now-broken cross-references from recall-investigation.md
that pointed at the planning dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arnav2 arnav2 added bug Something isn't working enhancement New feature or request labels May 19, 2026
@arnav2 arnav2 merged commit 1328894 into main May 19, 2026
10 checks passed
arnav2 added a commit that referenced this pull request May 19, 2026
Bumps version. Adds RELEASE_NOTES_v0.2.1.md (consumed by release.yml).
Includes two follow-ups that got lost in the squash of PR #9:
  - gitignore docs/planning/ (internal recall worklist)
  - drop two stale docs/planning/ links from recall-investigation.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant