Skip to content

fix(parser): time-box pdftable table extraction (un-hang parse)#29

Merged
hallelx2 merged 1 commit into
mainfrom
fix/table-extraction-timeout
May 28, 2026
Merged

fix(parser): time-box pdftable table extraction (un-hang parse)#29
hallelx2 merged 1 commit into
mainfrom
fix/table-extraction-timeout

Conversation

@hallelx2
Copy link
Copy Markdown
Owner

@hallelx2 hallelx2 commented May 28, 2026

A 1.27MB 3M 10-K sat in parsing 20+ min during a bench run — pdftable's geometry finder has a pathological case on dense financial pages. Parse is pre-LLM so the ingest LLM-call timeout doesn't cover it. Adds a 15s/page timeout (goroutine + select) in safeExtractTables and a 90s/document budget in extractPDFTables; a slow page falls back to text-only (the page text, incl. the table, still lives in prose sections — what the page-based strategy reads). go build/vet/test green.

Summary by Sourcery

Time-box PDF table extraction to prevent parser hangs on pathological pages and allow ingest to fall back to text-only when limits are exceeded.

Bug Fixes:

  • Prevent PDF parsing from hanging indefinitely on dense financial pages by enforcing per-page and per-document time limits around pdftable extraction.

Enhancements:

  • Log table extraction timeouts and failures at warn level while allowing documents to continue ingesting without table sections.

Summary by CodeRabbit

Bug Fixes

  • Improved PDF table extraction reliability with time-based processing limits for both individual pages and entire documents.
  • Enhanced error handling for slow or problematic PDF extraction operations to ensure consistent performance.

Review Change Stack

A 1.27MB 3M 10-K sat in `parsing` for 20+ minutes during a bench run:
pdftable's geometry-based table finder has a pathological case on dense
financial pages (a balance sheet's hundreds of ruling segments blow up
the intersection-grid construction). Parse is pure-geometry / pre-LLM,
so the ingest per-LLM-call timeout (the earlier fix) doesn't cover it,
and there was no other bound — the document hung indefinitely.

Two bounds, both in pkg/parser/pdf.go:
  - safeExtractTables now runs page.ExtractTables on a goroutine and
    abandons it after tableExtractPageTimeout (15s). The slow page is
    skipped (text-only); a single bad page can't stall the parse.
  - extractPDFTables tracks a doc-wide tableExtractDocBudget (90s);
    once spent, remaining pages are skipped and ingest proceeds.

A document that loses its table sections still indexes fine — the page
text (including the table's text) lives in the prose sections, which is
exactly what the page-based retrieval strategy reads. The proper fix is
upstream in pdftable's finder; this keeps ingest robust meanwhile.

go build/vet/test green.
Copilot AI review requested due to automatic review settings May 28, 2026 16:41
@hallelx2 hallelx2 merged commit 8d292cb into main May 28, 2026
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 28, 2026

Reviewer's Guide

Adds strict time-bounding to PDF table extraction so pathological pdftable pages and documents can't hang parsing, using per-page timeouts and a document-wide budget while preserving ingest via text-only fallback.

File-Level Changes

Change Details Files
Enforce a document-wide time budget for PDF table extraction and stop table detection once it is exhausted.
  • Introduce a doc-level deadline using tableExtractDocBudget and time.Now().Add in extractPDFTables.
  • Before processing each page, check the deadline; if exceeded, log a warning with pages processed and break out of the loop, allowing ingest to continue without further table extraction.
pkg/parser/pdf.go
Introduce bounded, panic-safe per-page table extraction with a goroutine + timeout wrapper.
  • Replace the previous recover-based safeExtractTables with a goroutine that calls page.ExtractTables and returns through a buffered channel.
  • Add a select between the goroutine result and time.After(tableExtractPageTimeout) to enforce a per-page timeout; on timeout, log a warning and return nil tables.
  • Keep panic protection inside the goroutine, logging panics and sending an empty result to the channel; log non-nil errors from ExtractTables and suppress tables on failures.
  • Change the safeExtractTables signature to return tables directly (no named return) and route all exit paths through the select logic.
pkg/parser/pdf.go
Define constants and imports required to support time-bounded table extraction.
  • Import the time package for deadlines and timeouts.
  • Add tableExtractPageTimeout and tableExtractDocBudget constants with explanatory comments describing the motivation and behavior of time-bounded extraction.
pkg/parser/pdf.go

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@hallelx2 hallelx2 deleted the fix/table-extraction-timeout branch May 28, 2026 16:41
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e8ad446a-4e4e-46ce-8a2b-58139c143585

📥 Commits

Reviewing files that changed from the base of the PR and between 678bcf0 and 856b0f4.

📒 Files selected for processing (1)
  • pkg/parser/pdf.go

📝 Walkthrough

Walkthrough

This PR adds time-based limits to PDF table extraction in a single file. It introduces per-page timeouts and a document-wide budget that halt further extraction when exceeded. The previous panic-recovery wrapper is replaced with a goroutine-based timeout guard that logs warnings and returns nil when extraction stalls.

Changes

PDF Table Extraction Timeout System

Layer / File(s) Summary
Time bounds constants and import
pkg/parser/pdf.go
time package imported; tableExtractPageTimeout (per-page limit) and tableExtractDocBudget (document-wide limit) constants defined.
Per-page timeout mechanism
pkg/parser/pdf.go
New safeExtractTables implementation uses goroutine + channel + select to enforce per-page timeout; panics and errors are logged and return no tables, while timeouts produce nil.
Document-wide budget enforcement
pkg/parser/pdf.go
extractPDFTables computes a deadline before the page loop and breaks with warning logging once the budget is exhausted.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A rabbit's tale of patient extraction,
Pages timed with precise action.
No goroutine shall linger too long,
Deadlines keep the PDFs strong!
Hop, extract, then flee the scene—
Fastest parsing ever seen! 🎯

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/table-extraction-timeout

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In safeExtractTables, consider replacing time.After(tableExtractPageTimeout) with a time.NewTimer and defer timer.Stop() so the timer can be stopped when the done case wins, avoiding the extra timer goroutine/allocation for every page.
  • The doc-level budget check in extractPDFTables uses time.Now() inside the loop; if you expect very large documents, you might want to precompute the deadline once (as you do) and compare against a monotonic clock (time.Since(start)) to avoid potential issues if the system clock is adjusted.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `safeExtractTables`, consider replacing `time.After(tableExtractPageTimeout)` with a `time.NewTimer` and `defer timer.Stop()` so the timer can be stopped when the `done` case wins, avoiding the extra timer goroutine/allocation for every page.
- The doc-level budget check in `extractPDFTables` uses `time.Now()` inside the loop; if you expect very large documents, you might want to precompute the deadline once (as you do) and compare against a monotonic clock (`time.Since(start)`) to avoid potential issues if the system clock is adjusted.

## Individual Comments

### Comment 1
<location path="pkg/parser/pdf.go" line_range="1341-1350" />
<code_context>
+		tables []*pdftable.Table
+		err    error
+	}
+	done := make(chan result, 1)
+	go func() {
+		defer func() {
+			if r := recover(); r != nil {
+				slog.Warn("pdf: table extraction panicked",
+					"page", pageNum, "panic", fmt.Sprintf("%v", r))
+				done <- result{}
+			}
+		}()
+		t, err := page.ExtractTables(settings)
+		done <- result{tables: t, err: err}
 	}()
-	tables, err := page.ExtractTables(settings)
-	if err != nil {
-		slog.Warn("pdf: table extraction failed",
-			"page", pageNum,
-			"err", err)
+
+	select {
+	case <-time.After(tableExtractPageTimeout):
+		// The goroutine is abandoned (it will finish on its own and the
</code_context>
<issue_to_address>
**issue (performance):** Current timeout approach abandons the goroutine instead of canceling it, which could tie up resources longer than necessary.

Since the goroutine keeps running after the select times out, `page.ExtractTables` can still consume CPU and memory for its full worst-case duration, and many such timeouts could lead to a buildup of in-flight goroutines. If pdftable supports contexts, consider wrapping the call in a `context.WithTimeout` and passing it through so the work is actually canceled. If not, consider limiting concurrent extractions (e.g., worker pool or semaphore) so callers can’t create unbounded abandoned goroutines on problematic documents.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread pkg/parser/pdf.go
Comment on lines +1341 to +1350
done := make(chan result, 1)
go func() {
defer func() {
if r := recover(); r != nil {
slog.Warn("pdf: table extraction panicked",
"page", pageNum, "panic", fmt.Sprintf("%v", r))
done <- result{}
}
}()
t, err := page.ExtractTables(settings)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (performance): Current timeout approach abandons the goroutine instead of canceling it, which could tie up resources longer than necessary.

Since the goroutine keeps running after the select times out, page.ExtractTables can still consume CPU and memory for its full worst-case duration, and many such timeouts could lead to a buildup of in-flight goroutines. If pdftable supports contexts, consider wrapping the call in a context.WithTimeout and passing it through so the work is actually canceled. If not, consider limiting concurrent extractions (e.g., worker pool or semaphore) so callers can’t create unbounded abandoned goroutines on problematic documents.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds strict time bounds to PDF table extraction to prevent pathological PDFs from stalling ingest.

Changes:

  • Introduces a document-wide table extraction budget to skip remaining pages once exceeded.
  • Wraps per-page table extraction in a goroutine with a per-page timeout and panic recovery.
  • Adds constants for page timeout and document budget.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/parser/pdf.go
Comment on lines +1354 to 1369
select {
case <-time.After(tableExtractPageTimeout):
// The goroutine is abandoned (it will finish on its own and the
// buffered channel lets it send without blocking, then get GC'd).
// A single slow page is not worth stalling the whole ingest.
slog.Warn("pdf: table extraction timed out; skipping page",
"page", pageNum, "timeout", tableExtractPageTimeout)
return nil
case res := <-done:
if res.err != nil {
slog.Warn("pdf: table extraction failed",
"page", pageNum, "err", res.err)
return nil
}
return res.tables
}
Comment thread pkg/parser/pdf.go
Comment on lines +1341 to 1352
done := make(chan result, 1)
go func() {
defer func() {
if r := recover(); r != nil {
slog.Warn("pdf: table extraction panicked",
"page", pageNum, "panic", fmt.Sprintf("%v", r))
done <- result{}
}
}()
t, err := page.ExtractTables(settings)
done <- result{tables: t, err: err}
}()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants