fix(ingest): per-LLM-call timeout + leaf-section cap (un-stick big PDFs) #27
Two ingest bugs that froze FinanceBench ingests and are real product defects on any large filing:

1. No per-LLM-call timeout. A single hung summarize / HyDE / multi-axis / TOC-build call blocked the stage's errgroup `Wait()` forever; a doc was observed stuck in `summarizing` for 13+ hours. Fix: `completeWithTimeout` wraps every individual `LLM.Complete` in a `context.WithTimeout` (default 90s, `ingest.llm_call_timeout_seconds` / `VLE_INGEST_LLM_CALL_TIMEOUT_SECONDS`). On timeout the call is logged and skipped: the section keeps its existing/empty summary and the document still reaches `ready`. One bad call can no longer freeze a whole document.

2. Leaf-section explosion. `chunkOversizedLeaves` splits any leaf over 2400 chars into ~900-char pieces, so a 45K-char "Notes to Financial Statements" section shattered into ~50 chunks; a 92-page 10-K produced ~1500 leaves, each costing a summarize+HyDE+multi-axis LLM call, hence the slow/stalled ingest. Fix: `capLeafSections` enforces a ceiling (default 400, `ingest.max_sections` / `VLE_INGEST_MAX_SECTIONS`) by merging the smallest adjacent leaf siblings under a shared parent until the count is within budget. Content is preserved (blank-line joined), page ranges are unioned, and table sections, which are attached after this pass, are never merged. Applied in both the heuristic and outline parse paths.

The cap runs at its default (400) through the existing `RegistryFromTableOpts` → `NewPDFWithTables` path, so the fix is active on the deployed binary without a cmd wiring change. Making `ingest.max_sections` operator-tunable end-to-end is a small follow-up in the cmd binaries.

Tests: a hung-call mock proves the pipeline still completes and other sections summarize; cap tests prove merge-down to budget, smallest-pair ordering, content preservation, and the disabled (<=0) escape hatch. go build/vet/test all green.
Summary
Two ingest bugs that blocked FinanceBench from completing and are real defects on any large filing.
1. No per-LLM-call timeout → infinite ingest hangs
A single hung summarize / HyDE / multi-axis / TOC-build call blocked the stage's errgroup `Wait()` forever. A doc was observed stuck in `summarizing` for 13+ hours. Fix: `completeWithTimeout` wraps every individual `LLM.Complete` in a `context.WithTimeout` (default 90s, tunable via `ingest.llm_call_timeout_seconds` / `VLE_INGEST_LLM_CALL_TIMEOUT_SECONDS`). On timeout the call is logged and skipped; the document still reaches `ready`.
2. Leaf-section explosion → ~1500 sections per 10-K
`chunkOversizedLeaves` splits any leaf >2400 chars into ~900-char pieces, so a 45K-char "Notes to Financial Statements" section shattered into ~50 chunks. A 92-page 10-K produced ~1500 leaves, each costing a summarize + HyDE + multi-axis LLM call, which is the root of the slow/stalled ingest. Fix: `capLeafSections` enforces a ceiling (default 400, `ingest.max_sections` / `VLE_INGEST_MAX_SECTIONS`) by merging the smallest adjacent leaf siblings under a shared parent until within budget. Content is preserved (blank-line joined), page ranges are unioned, and table sections are never merged. Applied in both the heuristic and outline parse paths.
Reachability
The 400-leaf cap is active on the deployed binary today via the existing `RegistryFromTableOpts` → `NewPDFWithTables` path (`MaxSections:0` resolves to the default). Making `ingest.max_sections` operator-tunable end-to-end is a 2-line follow-up in the `cmd/` binaries (left out here to avoid colliding with the in-flight cmd/server consolidation PR).
Test plan
`go build ./...`, `go vet ./...`, `go test ./...` all green
Expected impact
A 92-page 10-K should now cap at ~400 leaves (down from ~1500), cutting ingest LLM calls ~3-4x, and no single hung call can freeze a document.
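The merge-down behind that leaf cap can be sketched as below. This is an illustrative approximation under stated assumptions, not the PR's `capLeafSections`: the flat `Section` type, the `capLeaves` name, and the use of combined character length as the "smallest pair" metric are all hypothetical. Table sections are absent from the input because, per the description, they are attached after this pass.

```go
package main

import "fmt"

// Section is a hypothetical flat leaf record; the real tree type may differ.
type Section struct {
	ParentID  int
	Content   string
	PageStart int
	PageEnd   int
}

// capLeaves repeatedly merges the adjacent sibling pair (same ParentID) with
// the smallest combined content length until at most limit leaves remain.
// A limit <= 0 disables the cap, mirroring the escape hatch the tests mention.
func capLeaves(leaves []Section, limit int) []Section {
	out := append([]Section(nil), leaves...) // work on a copy
	if limit <= 0 {
		return out
	}
	for len(out) > limit {
		best, bestSize := -1, 0
		for i := 0; i+1 < len(out); i++ {
			if out[i].ParentID != out[i+1].ParentID {
				continue // only siblings under a shared parent may merge
			}
			size := len(out[i].Content) + len(out[i+1].Content)
			if best == -1 || size < bestSize {
				best, bestSize = i, size
			}
		}
		if best == -1 {
			break // no mergeable pair left; stop above budget
		}
		a, b := out[best], out[best+1]
		a.Content = a.Content + "\n\n" + b.Content // blank-line join
		if b.PageStart < a.PageStart {
			a.PageStart = b.PageStart
		}
		if b.PageEnd > a.PageEnd {
			a.PageEnd = b.PageEnd
		}
		out[best] = a
		out = append(out[:best+1], out[best+2:]...)
	}
	return out
}

func main() {
	leaves := []Section{
		{ParentID: 1, Content: "aaaa", PageStart: 1, PageEnd: 1},
		{ParentID: 1, Content: "b", PageStart: 2, PageEnd: 2},
		{ParentID: 1, Content: "c", PageStart: 3, PageEnd: 3},
		{ParentID: 2, Content: "d", PageStart: 4, PageEnd: 4},
	}
	got := capLeaves(leaves, 3)
	fmt.Println(len(got), got[1].PageStart, got[1].PageEnd) // prints: 3 2 3
}
```

In this sketch the two smallest siblings ("b" and "c") merge first, while "d" stays separate because it sits under a different parent. If no sibling pair remains, the loop stops even when the count is still above the limit; whether the real implementation behaves the same way is not stated in the PR.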