fix(ingest): total-parse timeout + working section cap + pdftable v0.3.1 (robust full-feature ingest)#31

Merged: hallelx2 merged 3 commits into main from feat/parse-robustness on May 29, 2026
Conversation

@hallelx2
Owner

@hallelx2 hallelx2 commented May 29, 2026

Summary

Hardens PDF ingest so the full feature set stays ON (LLM-built TOC, table extraction, summarize, HyDE, multi-axis — all of it) but can no longer hang. Nothing is disabled; the pipeline is bounded and the runaway-section path is fixed.

Three changes:

1. Bump pdftable v0.3.0 → v0.3.1

v0.3.1 ships the grid-indexed cell finder (the O(n²/n³)→O(cells) fix), so table extraction no longer degrades pathologically on dense financial pages. Table extraction stays enabled; it's just faster/bounded.

2. Total-parse timeout (the key robustness fix)

A 10-K (ACTIVISIONBLIZZARD_2019_10K) was observed hanging for 600s+ in parsing, even in minimal mode. The hang is in ledongthuc row extraction (extractPDFRows → reader.Page(n).Content()), which is pure Go and pre-LLM, so none of the existing per-stage table-extraction budgets bound it.

The entire PDF parse (row extraction → table extraction → section building → leaf cap) is now wrapped in a configurable deadline. The work runs on a goroutine; on timeout or ctx cancellation Parse returns a clear error (pdf: parse exceeded <timeout> — document too complex or malformed) and abandons the goroutine (buffered result channel, no leak on send; a panic in the work is recovered). Same abandon-on-deadline pattern safeExtractTables already uses for one table page, lifted to cover the whole parse. The ingest pipeline already treats a parse error as a doc-level failure, so a doc that can't parse fast goes to failed and is visible to ops/bench instead of wedging forever.
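The abandon-on-deadline pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/parser/pdf.go: the names runWithDeadline, result, timesOutFast and recoversPanic are hypothetical, the parsed document is reduced to a string, the error text is illustrative, and a non-positive timeout is simply treated as "run inline".

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// result carries the outcome of the parse work back to the caller.
type result struct {
	doc string // stand-in for the parsed document
	err error
}

// runWithDeadline runs work on a goroutine and waits at most timeout for it.
// A non-positive timeout runs the work inline, with no bound (sketch assumption).
func runWithDeadline(ctx context.Context, timeout time.Duration, work func() (string, error)) (string, error) {
	if timeout <= 0 {
		return work()
	}
	// Capacity 1: the goroutine can always send and exit, even if the
	// caller has already returned on timeout or cancellation.
	ch := make(chan result, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				ch <- result{err: fmt.Errorf("parse panicked: %v", r)}
			}
		}()
		doc, err := work()
		ch <- result{doc: doc, err: err}
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case res := <-ch:
		return res.doc, res.err
	case <-timer.C:
		// The goroutine is abandoned here; it is not cancelled.
		return "", fmt.Errorf("parse exceeded %s", timeout)
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// timesOutFast reports whether a 2s job under a 50ms deadline fails
// well before the job itself would have finished.
func timesOutFast() bool {
	start := time.Now()
	_, err := runWithDeadline(context.Background(), 50*time.Millisecond, func() (string, error) {
		time.Sleep(2 * time.Second)
		return "late", nil
	})
	return err != nil && time.Since(start) < time.Second
}

// recoversPanic reports whether a panic inside the work surfaces as an error.
func recoversPanic() bool {
	_, err := runWithDeadline(context.Background(), time.Second, func() (string, error) {
		panic("boom")
	})
	return err != nil
}

func main() {
	fmt.Println("timed out fast:", timesOutFast())
	fmt.Println("panic recovered:", recoversPanic())
}
```

Note the trade-off in this pattern: the abandoned goroutine keeps running until the underlying work returns, so the deadline bounds how long the caller waits, not the CPU or memory the stuck parse consumes.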

  • Config: ingest.parse_timeout_seconds (default 120). Env VLE_INGEST_PARSE_TIMEOUT_SECONDS; the server binary forwards VLS_/VLE_INGEST_PARSE_TIMEOUT_SECONDS.
  • Also threads the existing ingest.max_sections through to the parser (previously dropped on the floor) via RegistryFromIngestParams, so both robustness valves are operator-tunable.
  • A negative value disables the bound (escape hatch / legacy unbounded behaviour).
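As a rough sketch of how the parser-side timeout value might be resolved (zero falls back to the 120s default, a negative value disables the bound), assuming hypothetical helper names resolveParseTimeout and secondsFromEnv rather than the actual functions in this PR:

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// defaultParseTimeoutSeconds mirrors the documented default of 120.
const defaultParseTimeoutSeconds = 120

// resolveParseTimeout maps a configured seconds value to the effective bound:
// zero means "unset, use the default"; negative means "no bound".
func resolveParseTimeout(seconds int) (timeout time.Duration, bounded bool) {
	switch {
	case seconds < 0:
		return 0, false
	case seconds == 0:
		return defaultParseTimeoutSeconds * time.Second, true
	default:
		return time.Duration(seconds) * time.Second, true
	}
}

// secondsFromEnv parses the raw string an env var such as
// VLE_INGEST_PARSE_TIMEOUT_SECONDS would carry. An empty string keeps the
// current value; a non-integer is reported as an error.
func secondsFromEnv(raw string, current int) (int, error) {
	if raw == "" {
		return current, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return current, fmt.Errorf("invalid parse timeout %q: %w", raw, err)
	}
	return n, nil
}

func main() {
	for _, seconds := range []int{0, 30, -1} {
		timeout, bounded := resolveParseTimeout(seconds)
		fmt.Printf("configured=%d timeout=%s bounded=%v\n", seconds, timeout, bounded)
	}
}
```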

3. Fix the section-cap merge bug

capLeafSections only merged adjacent leaf siblings, but the real 10-K explosion is hundreds of single-leaf parents (heading → one body leaf) with no adjacent leaf-sibling pairs — so the cap silently did nothing and a 92-page filing sailed past the 400 cap at 463-1465 leaves (each one a summarize + HyDE + multi-axis LLM call, the throughput killer).

Now reduces any tree shape:

  • Phase 1: collapseSingleLeafParents flattens heading→lone-leaf chains (bottom-up) so only-children become adjacent siblings (count unchanged; parent absorbs the leaf's content + page range).
  • Phase 2: the existing smallest-first adjacent-pair merge reduces to the cap, with a defensive collapse for any pair a table leaf blocked.
  • Top-level sections are wrapped under a synthetic root so the merge step can shrink the top-level sibling list too (a bare slice parameter would not propagate the shrink back; this was also why the naive fix spun for 90-140s on the guard).
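The slice-parameter point is plain Go semantics and can be shown in isolation. The Section type and the two helper functions below are hypothetical stand-ins, not the parser's real types:

```go
package main

import "fmt"

// Section is a minimal stand-in for a node in the section tree.
type Section struct {
	Title    string
	Children []*Section
}

// dropFirst re-slices its parameter. Only the local copy of the slice
// header changes, so the caller still sees the original length.
func dropFirst(list []*Section) {
	list = list[1:]
}

// dropFirstChild rewrites parent.Children through the pointer, so the
// shrink is visible to every holder of parent.
func dropFirstChild(parent *Section) {
	parent.Children = parent.Children[1:]
}

func main() {
	top := []*Section{{Title: "a"}, {Title: "b"}, {Title: "c"}}

	dropFirst(top)
	fmt.Println(len(top)) // 3: the shrink did not propagate back

	root := &Section{Children: top} // synthetic root
	dropFirstChild(root)
	fmt.Println(len(root.Children)) // 2: the shrink is visible via the root
}
```

A loop that checks the caller's list length after calling something like dropFirst never sees progress, which is consistent with the spinning behaviour noted above.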

Invariant: for any tree with > N mergeable leaves, capLeafSections drives the leaf count to ≤ N. Content is always preserved (concatenated, page ranges unioned); table sections (Metadata["table"]=="true") are never merged or collapsed.
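To make the two phases and the invariant concrete, here is a simplified sketch under stated assumptions: the Section type, the Table flag (standing in for Metadata["table"]=="true"), and the names collapse, smallestPair and capLeaves are all hypothetical, page ranges are omitted, and "size" is just content length. It is not the capLeafSections implementation from this PR.

```go
package main

import "fmt"

type Section struct {
	Content  string
	Table    bool // stand-in for Metadata["table"]=="true"
	Children []*Section
}

func isLeaf(s *Section) bool { return len(s.Children) == 0 }

func countLeaves(s *Section) int {
	if isLeaf(s) {
		return 1
	}
	n := 0
	for _, c := range s.Children {
		n += countLeaves(c)
	}
	return n
}

// collapse folds, bottom-up, every parent whose only child is a
// non-table leaf: the parent absorbs the content and becomes the leaf.
func collapse(s *Section) {
	for _, c := range s.Children {
		collapse(c)
	}
	if len(s.Children) == 1 && isLeaf(s.Children[0]) && !s.Children[0].Table {
		s.Content += s.Children[0].Content
		s.Children = nil
	}
}

// smallestPair returns the parent and index of the adjacent pair of
// non-table leaf siblings with the least combined content (idx -1 if none).
func smallestPair(s *Section) (parent *Section, idx, size int) {
	idx, size = -1, -1
	for i, c := range s.Children {
		if i+1 < len(s.Children) {
			a, b := c, s.Children[i+1]
			if isLeaf(a) && isLeaf(b) && !a.Table && !b.Table {
				if sz := len(a.Content) + len(b.Content); size < 0 || sz < size {
					parent, idx, size = s, i, sz
				}
			}
		}
		if p, j, sz := smallestPair(c); j >= 0 && (size < 0 || sz < size) {
			parent, idx, size = p, j, sz
		}
	}
	return parent, idx, size
}

// capLeaves reduces the leaf count to at most max where mergeable pairs exist.
func capLeaves(top []*Section, max int) []*Section {
	root := &Section{Children: top} // synthetic root: top-level list can shrink
	for _, c := range root.Children {
		collapse(c) // phase 1
	}
	for countLeaves(root) > max { // phase 2
		p, i, _ := smallestPair(root)
		if i < 0 {
			break // only table leaves or unmergeable shapes remain
		}
		p.Children[i].Content += p.Children[i+1].Content
		p.Children = append(p.Children[:i+1], p.Children[i+2:]...)
		for _, c := range root.Children {
			collapse(c) // a merge may have created a new single-leaf parent
		}
	}
	return root.Children
}

func main() {
	var top []*Section
	for i := 0; i < 10; i++ {
		top = append(top, &Section{Children: []*Section{{Content: "x"}}})
	}
	fmt.Println(countLeaves(&Section{Children: capLeaves(top, 4)}))
}
```

The sketch rescans the whole tree on every merge for clarity; a production version would likely track candidate pairs incrementally.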

Nothing is turned off. The full enrichment pipeline (TOC, tables, summarize, HyDE, multi-axis) still runs end to end — parse is merely bounded and the section count is correctly capped.

Test plan

  • go build ./... green (both binaries)
  • go vet ./... clean
  • go test ./... green (all packages)
  • Total-parse-timeout tests: the deadline mechanism returns a timeout error in ~the timeout (work sleeps 10s, returns in ~0.05s — proves it does NOT wait out the hang), passes fast work through, propagates real parse errors, runs inline when disabled, honours ctx cancel, recovers panics; plus a full-Parse test with a 1ms deadline on a real PDF.
  • Cap tests: 1000 single-leaf-parents → ≤ 400 with no content loss (the exact case the old cap failed and that let 1465 through), deep heading→subheading→body chains, mixed flat/single-leaf/multi-child tree, and a table-protection test asserting table leaves survive verbatim.
  • Config: default (120) / env-override / validate-rejects-negative coverage for parse_timeout_seconds.
  • pdftable v0.3.1 resolves cleanly and pins in go.mod/go.sum.
  • config.example.yaml documents ingest.parse_timeout_seconds (and ingest.max_sections).

Summary by CodeRabbit

  • New Features

    • Added configurable parse timeout limit for PDF documents (default 120 seconds)
    • Added configurable maximum section limit to manage document processing complexity (default 400 sections)
  • Chores

    • Updated dependencies

Review Change Stack

hallelx2 added 3 commits May 29, 2026 17:13
v0.3.1 ships the grid-indexed cell finder (the O(n^2/n^3) -> O(cells)
fix), so table extraction no longer degrades pathologically on dense
financial pages. Full table-extraction feature stays on; this only
makes it faster/bounded.
A 10-K was observed hanging 600s+ in `parsing` even in minimal mode: the
hang is in ledongthuc row extraction (extractPDFRows ->
reader.Page(n).Content()), which is pure-Go and pre-LLM, so none of the
existing per-stage table-extraction budgets bound it.

Wrap the entire PDF parse (row extraction, table extraction, section
building, leaf cap) in a deadline. The work runs on a goroutine; on
timeout or ctx cancellation Parse returns a clear error and abandons the
goroutine (buffered result channel, so no leak on send; a panic in the
work is recovered). This is the same abandon-on-deadline pattern
safeExtractTables already uses for a single table page, lifted to cover
the whole parse so ANY parse pathology fails fast and cleanly instead of
wedging ingest. The ingest pipeline already treats a parse error as a
doc-level failure, so the document goes to `failed` and is visible to
ops/bench rather than hanging forever.

Nothing is disabled: the full feature set (LLM TOC, tables, summarize,
HyDE, multi-axis) still runs — parse is merely bounded.

Config: IngestConfig.ParseTimeoutSeconds (default 120). Env
VLE_INGEST_PARSE_TIMEOUT_SECONDS; the server binary forwards
VLS_/VLE_INGEST_PARSE_TIMEOUT_SECONDS. Also threads the existing
ingest.max_sections through to the parser (previously dropped on the
floor) via RegistryFromIngestParams, so both robustness valves are
operator-tunable. A negative value disables the bound (escape hatch).

Tests: the deadline mechanism returns a timeout error in ~the timeout
(not after a 10s sleep), passes fast work through, propagates real parse
errors, runs inline when disabled, honours ctx cancel, and recovers
panics; plus config default/env-override/validate coverage.
The cap only merged ADJACENT leaf siblings, but the real 10-K explosion
is hundreds of SINGLE-LEAF PARENTS (heading -> one body leaf) that have
no adjacent leaf-sibling pairs at all — so the cap silently did nothing
and a 92-page filing sailed past the 400 cap at 463-1465 leaves, each
costing a summarize + HyDE + multi-axis LLM call (the throughput killer
for full ingest).

Fix it in two phases:
  1. collapseSingleLeafParents flattens every heading -> lone-leaf chain
     (bottom-up, so deep chains fold in one pass) so the formerly
     only-children become adjacent leaf siblings. Count is unchanged; the
     parent absorbs the child's content + page range and becomes the leaf.
  2. The existing smallest-first adjacent-pair merge then reduces the
     count to the cap, with a defensive single-leaf-parent collapse for
     any pair a table leaf blocked.

The top-level sections are wrapped under a synthetic root so the merge
step — which shrinks a sibling list by rewriting parent.Children — can
shrink the TOP-level list too (a bare slice parameter would not
propagate the shrink back). This is what makes the invariant hold for a
flat list of single-leaf parents.

Invariant: for any tree with > N mergeable leaves, capLeafSections drives
the leaf count to <= N. Content is always preserved (concatenated,
page ranges unioned) and table sections (Metadata["table"]=="true") are
never merged or collapsed. Nothing is disabled — the full section tree is
still produced; it's just bounded.

Tests: 1000 single-leaf-parents -> <= 400 with no content loss (the case
the old cap failed and that let 1465 through), deep heading->subheading->
body chains, a mixed flat/single-leaf/multi-child tree, and a
table-protection test asserting table leaves survive verbatim.
Copilot AI review requested due to automatic review settings May 29, 2026 16:58
coderabbitai Bot commented May 29, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38fa60f0-21f9-4328-9fd1-5922ec905bb4

📥 Commits

Reviewing files that changed from the base of the PR and between dfc1c45 and 9d8c7b4.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (11)
  • cmd/engine/main.go
  • cmd/server/main.go
  • config.example.yaml
  • go.mod
  • internal/config/config.go
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/ingest/ingest.go
  • pkg/parser/cap_test.go
  • pkg/parser/pdf.go
  • pkg/parser/pdf_parse_timeout_test.go

📝 Walkthrough

This PR introduces configurable parse-timeout and leaf-section limits to the PDF ingest pipeline. Configuration adds ParseTimeoutSeconds (default 120s) and MaxSections (default 400) with environment override support; the pipeline routes these to a new RegistryFromIngestParams constructor that wires the PDF parser with deadline enforcement and capping. The parser implements timeout via goroutine-based deadline wrapper with panic recovery; leaf-capping is rewritten to reliably handle complex outline trees by collapsing single-leaf parents then merging smallest pairs, while preserving table-marked sections.

Changes

Parse Timeout Configuration & Defaults

| Layer / File(s) | Summary |
| --- | --- |
| Configuration schema and environment handling: pkg/config/config.go, internal/config/config.go, go.mod | IngestConfig adds ParseTimeoutSeconds field with 120s default and YAML binding; applyEnvOverrides handles VLE_INGEST_PARSE_TIMEOUT_SECONDS environment variable; validation rejects negative values; pdftable dependency updated to v0.3.1. |
| Configuration tests and documentation: pkg/config/config_test.go, config.example.yaml | TestDefaultValues asserts 120s timeout and 400 max-sections defaults; new tests validate environment override and validation error paths; example YAML documents both fields with enforcement semantics. |

Ingest Pipeline Wiring

| Layer / File(s) | Summary |
| --- | --- |
| Registry factory and parameterized construction: pkg/ingest/ingest.go | New RegistryFromIngestParams(opts, maxSections, parseTimeout) builds parser.Registry with PDF parser created via NewPDFWithConfig; RegistryFromTableOpts delegates to it with zero values for backward compatibility. |
| Application entry-point pipeline wiring: cmd/engine/main.go, cmd/server/main.go | Ingest pipeline switches from RegistryFromTableOpts to RegistryFromIngestParams, passing config-derived MaxSections and ParseTimeoutSeconds converted to time.Duration. |

PDF Parser Timeout Implementation

| Layer / File(s) | Summary |
| --- | --- |
| ParseTimeout field and timeout resolution: pkg/parser/pdf.go | PDF struct adds ParseTimeout field; resolvedParseTimeout() selects 120s default when zero, disables when negative; NewPDFWithConfig(opts, maxSections, parseTimeout) constructor wires both bounds explicitly. |
| Deadline wrapper and Parse rewrite: pkg/parser/pdf.go | Parse method wraps full read/parse work in runParseWithDeadline; goroutine executes parseDoc, select on result/timeout/context-cancel, abandons goroutine on timeout using buffered channel, recovers panics as errors. |
| Timeout and deadline tests: pkg/parser/pdf_parse_timeout_test.go | Tests cover timeout enforcement (error, no hang), fast completion, work-error propagation, disabled timeout (inline), context cancellation, panic recovery, timeout resolution logic, and end-to-end integration with 1ms timeout on real PDF fixture. |

Leaf-Section Capping Algorithm Rewrite

| Layer / File(s) | Summary |
| --- | --- |
| Two-phase leaf-capping and helpers: pkg/parser/pdf.go | capLeafSections rewritten to first collapse single-leaf parent chains into mergeable siblings (skipping table leaves), then iteratively merge smallest adjacent leaf pairs until within cap; new helpers detect table leaves, absorb collapsed content/ranges, exclude table leaves from merge eligibility. |
| Leaf-capping test suite: pkg/parser/cap_test.go | singleLeafParentTree helper builds deterministic outline models with single-leaf parents; new tests validate collapse behavior, deep chain flattening, mixed-shape reduction, and table-leaf preservation even under restrictive caps. |

Possibly related PRs

  • hallelx2/vectorless-engine#30: Modifies ingest pipeline wiring to support minimal mode with altered parser registry setup and disabled table extraction.
  • hallelx2/vectorless-engine#23: Changes PDF.Parse pipeline at the primitive glyph-to-row extraction and reader error handling layer.
  • hallelx2/vectorless-engine#12: Modifies (*PDF).Parse and downstream section/leaf post-processing for heading detection and oversized leaf splitting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 A Parse Timeout Tale

With bounds to check and deadlines near,
Leaf sections cap—no runaway fear!
Goroutines race, then timeout calls,
Panic's caught before it falls. ⏰

@hallelx2 hallelx2 merged commit cbd46f5 into main May 29, 2026
3 of 9 checks passed
@hallelx2 hallelx2 deleted the feat/parse-robustness branch May 29, 2026 17:00