feat: pdftable integration - table-aware ingest (Phase 1.3.F) #20
Replace the ledongthuc/pdf glyph-reassembly path with pdftable's positioned-word extractor and add a new table-aware extraction stage that emits each detected table as its own Section flagged with Metadata["table"]="true".
- pdftable.Page.Words() handles intra-word glyph reassembly, letter-spacing collapse, and ligature expansion natively. The bespoke collapseLetterSpacing / looksLetterSpaced / multiSpaceRe helpers are deleted (handled by pdftable's WordOpts).
- The engine still uses ledongthuc/pdf solely for /Outlines access, because pdftable does not yet expose the outline dictionary. Outline-driven parsing degrades gracefully when ledongthuc fails on a PDF that pdftable accepts.
- Encrypted PDFs are detected via pdftable.ErrEncrypted and routed through pdfcpu's empty-password decryptor as before.
- Table extraction runs after section building; tables are wrapped under a synthetic "Tables" container at the document root so the prose outline order stays untouched. Markdown rendering escapes pipes and collapses embedded newlines to keep GFM well-formed.
- Resilience: every page.ExtractTables() call is wrapped in safeExtractTables (recover()) and errors are logged-and-swallowed. pdftable cannot break ingest under any condition.

On the 3M 2023Q2 10-Q this surfaces 62 table sections across 38 distinct pages: content that previously collapsed into space-joined runs and was effectively unsearchable.
Surface pdftable's table-extraction knobs through the engine config so
operators can flip strategies / minima / kill-switch without code
changes.
- IngestConfig.Tables (yaml: ingest.tables) with Enabled (default
true), VerticalStrategy / HorizontalStrategy ("lines" defaults),
MinTableRows / MinTableCols (2 / 2 floor).
- VLE_INGEST_TABLES_ENABLED, VLE_INGEST_TABLES_VERTICAL_STRATEGY,
VLE_INGEST_TABLES_HORIZONTAL_STRATEGY, VLE_INGEST_TABLES_MIN_ROWS,
VLE_INGEST_TABLES_MIN_COLS env overrides following the existing
pattern.
- Validate() rejects unknown strategy values and negative minima.
- ingest.RegistryFromTableOpts() constructs a parser.Registry with a
table-aware PDF parser; DefaultRegistry stays compatible for tests.
- cmd/engine + cmd/server wire the config block through and log the
enabled / disabled state at startup so the operator can see the
active configuration in the journal.
- config.example.yaml documents the block alongside its sibling
HyDE / global LLM concurrency knobs.
Adds the regression gate for the integration: a small (13 KB) two-table PDF copied from pdftable's golden fixtures asserts the parser actually emits table sections with the expected metadata, the synthetic "Tables" container is in place, the kill-switch works, and corrupt input never panics.
- pkg/parser/pdf_tables_test.go: TestPDFParserEmitsTableSections asserts pages, GFM rendering, known cell substrings, and rows/cols metadata; TestPDFParserTablesContainerHidesUnderParent verifies the container wrapping; TestPDFParserDisabledTables verifies the rollback path; TestPDFParserCorruptInputReturnsCleanError pins the error contract; TestPDFParser10KSmokeOptional is gated on VLE_TEST_FILING_PDF for manual benchmark validation against real 10-Ks.
- pkg/parser/testdata/tables-example.pdf: the issue-466 two-table golden fixture from pdftable. Small enough to commit.
- pkg/config/config_test.go: TestTablesDefaults / TestTablesEnvOverride / TestTablesValidateRejectsBadStrategy round-trip the new config block through YAML + env + Validate.
pdftable v0.3.0 is now tagged on the remote. Resolves cleanly via go get and removes the only blocker for clean external go-module fetch — CI can now build the engine without a sibling pdftable checkout.
Summary
- Moves word extraction from `ledongthuc/pdf` glyph reassembly to `pdftable` v0.3.0, which already ships pdfplumber-parity word grouping, letter-spacing collapse, and ligature expansion.
- Emits each detected table as its own `Section` flagged with `Metadata["table"]="true"` and content rendered as GitHub-flavoured Markdown. Tables are wrapped under a synthetic "Tables" container at the document root so the prose outline order stays untouched.
- Adds `ingest.tables.{enabled,vertical_strategy,horizontal_strategy,min_table_rows,min_table_cols}` with matching `VLE_INGEST_TABLES_*` env overrides. Default on; one flip disables the integration entirely.

Design rationale
PDF is a layout format: every numeric question in a 10-K balance sheet lives in a ruled table that text-only extraction collapses into a space-joined run. pdftable's `lines` strategy reconstructs those tables from drawn rules; the `text` strategy handles borderless / narrative tables; both axes can be configured independently. The parser keeps the existing font-size + bold heuristics and outline-driven structure recovery (the latter still backed by `ledongthuc/pdf` since pdftable doesn't surface `/Outlines` yet).

The `ledongthuc/pdf` dependency stays for outline access only, and it fails gracefully: if the secondary reader can't open a PDF that pdftable accepts, the engine falls through to the heuristic path without failing ingest.

Risk envelope
- `tables.enabled: false` is a one-line rollback.
- `safeExtractTables` wraps every `page.ExtractTables` in `recover()` plus error logging at `warn`. Ingest produces text-only output in the worst case.
- Encrypted PDFs are detected via `errors.Is(err, pdftable.ErrEncrypted)`.

Test plan
- `go build ./...` clean
- `go vet ./...` clean
- `go test ./...` all green (incl. five new parser tests, three new config tests)
- `3M_2023Q2_10Q.pdf` via `VLE_TEST_FILING_PDF` -> 62 table sections across 38 distinct pages ingested cleanly
- Corrupt input returns a `pdf: open:` error rather than panicking
- `tables.enabled: false` produces no table sections and no "Tables" container; all other sections still emit

Before / after snippet
A 3x4 ruled table on the issue-466 fixture used to leak into the document outline as one prose run:
After the integration it surfaces as its own section:
Notes
- `go.mod` carries a `replace github.com/hallelx2/pdftable => ../pdftable` directive until pdftable v0.3.0 is tagged on the remote. Strip the directive in a follow-up commit once `go get github.com/hallelx2/pdftable@v0.3.0` resolves cleanly.
- The fixture (`pkg/parser/testdata/tables-example.pdf`, 13 KB) is the golden two-table PDF from pdftable's testdata.