fix(pdf): harden large document batching#61
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22c553a1eb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ? 'text' | ||
| : info.recommendedStrategy; | ||
|
|
||
| if (options?.mergePages === false) { |
There was a problem hiding this comment.
Treat default mergePages as page-preserving in batched mode
extractTextBatched only switches to page-wise extraction when mergePages === false. For the default case where callers omit mergePages, large PDFs still go through batch mode, and each batch is extracted with mergePages: true, so separators are only preserved between batches (e.g., every 25 pages) rather than between pages. This changes default output structure for large documents and can break downstream page-based chunking/indexing.
Useful? React with 👍 / 👎.
| const provider = isRuntimeProvider(config.provider ?? 'auto') | ||
| ? (config.provider ?? 'auto') | ||
| : 'auto'; |
There was a problem hiding this comment.
Reject unknown provider values instead of falling back to auto
This coercion maps any unrecognized provider string to 'auto', so a typo like provider: 'unpfd' no longer fails fast and instead silently selects a different backend. That hides misconfiguration and can unexpectedly change provider behavior (OCR path, limits, dependencies) at runtime. The previous behavior surfaced this as an Unknown PDF provider error.
Useful? React with 👍 / 👎.
| */ | ||
|
|
||
| import { promises as fs } from 'node:fs'; | ||
| import type { ExtractionConfig, ExtractionResult } from '@kreuzberg/node'; |
There was a problem hiding this comment.
Avoid build-time type imports from optional Kreuzberg module
@kreuzberg/node is declared as an optional dependency, but this top-level type import makes TypeScript resolution require that package during typecheck/build. In environments where optional dependencies are skipped or unsupported, this turns a runtime-optional provider into a build blocker, bypassing the intended graceful runtime fallback via dynamic import.
Useful? React with 👍 / 👎.
willgriffin
left a comment
There was a problem hiding this comment.
Review of the large-PDF batching changes. The batching logic and broader maxFileSize propagation are moving in a good direction, and the new tests add real coverage. A few concerns below that aren't already covered by the Codex review (which separately flagged the default-mergePages regression, the silent auto fallback for unknown providers, and the top-level @kreuzberg/node type import from an optional dependency — those are all valid).
Main things I'd like to see addressed:
- Every
extractText()now performs a full PDF parse viagetInfo()before deciding whether to batch, which is a perf regression for the common small/text-PDF case. maxFileSizeis enforced after that parse, which defeats the primary reason to have the limit.- Error routing relies on substring matching of error messages rather than typed error classes, which is fragile.
Inline notes on each.
| } | ||
|
|
||
| try { | ||
| const info = await this.getInfo(source); |
There was a problem hiding this comment.
P1 — getInfo() is now unconditionally called on every extractText().
UnpdfProvider.getInfo() loads the PDF, calls getDocumentProxy, getMetadata, and then for each of up to 3 sample pages runs getPage, getTextContent, and getOperatorList. That's a full parse of the document before we even know whether batching applies.
For the common case — a small/medium text-based PDF that would have hit the fast direct-extraction path — this doubles the parse work (getInfo parses the PDF, then unpdfProvider.extractText(source, options) on line 325 parses it again). unpdf doesn't cache across calls that take a raw source, so this is real duplicated work on the hot path.
Suggest only triggering the batching analysis when there's a cheap signal that batching might matter: check getSourceByteLength(source) (stat for paths, .byteLength for in-memory) before calling getInfo(). If the source is well under LARGE_DOCUMENT_BATCH_BYTES and no explicit pages filter is set, fall straight through to the original direct-extraction path.
|
|
||
| try { | ||
| const info = await this.getInfo(source); | ||
| await this.assertWithinConfiguredMaxFileSize(source, info.fileSize); |
There was a problem hiding this comment.
P2 — maxFileSize is enforced after getInfo() has already loaded and parsed the PDF.
The whole point of maxFileSize is to prevent oversized files from being brought into memory/parsed. But on line 307 we call this.getInfo(source), which runs normalizeSource (reads the full file into a Buffer) and then getDocumentProxy before we ever check the limit. A 1GB PDF with maxFileSize: 100*1024*1024 will still fully load and parse before throwing.
Recommend running assertWithinConfiguredMaxFileSize(source) first using the cheap path — stat() for file paths, byteLength for in-memory sources — and only then calling getInfo(). The Kreuzberg provider already does it in that order (line 185-192 of kreuzberg.ts), so the two providers will be consistent.
| } catch (error) { | ||
| if ( | ||
| error instanceof Error && | ||
| (error.message.includes('Large PDF extraction failed') || |
There was a problem hiding this comment.
P1 — String-match on error messages to decide whether to rethrow is fragile.
The outer catch uses error.message.includes('Large PDF extraction failed') / 'configured maxFileSize' to decide whether to propagate. If either prefix is ever reworded — localized, prefixed for logging, changed to include page numbers differently — both oversized-PDF errors and batch failures will silently fall into console.error(...); return null; again, quietly reverting the explicit-failure guarantee the changeset advertises.
Worth introducing typed error subclasses (e.g., PDFBatchExtractionError, PDFFileSizeError extending PDFError) and using instanceof here. That also gives callers something to catch programmatically without regex-ing error messages.
| return ocrText; | ||
| } | ||
|
|
||
| return ocrText.length > directText.length * 1.2 ? ocrText : directText; |
There was a problem hiding this comment.
P2 — The ocrText.length > directText.length * 1.2 heuristic is unjustified.
Length is a weak proxy for OCR quality: OCR output commonly has more characters than the truth (inserted whitespace, repeated noise characters, garbage on image borders) while being lower-quality than the embedded text. Under this rule, an image-heavy page whose embedded text is actually correct can still be overridden by noisy OCR as long as the OCR output is 20% longer.
At minimum add a comment explaining why 1.2 was chosen. A more robust signal would be OCR confidence from performOCR (already returned as ocrResult.confidence on line 204) — only pick OCR if it's both longer and clears a confidence threshold.
There was a problem hiding this comment.
Pull request overview
This PR hardens PDF large-document handling by introducing batched extraction in the Node combined provider, ensuring page-boundary/skip-OCR behaviors are preserved, and propagating configured maxFileSize through provider capabilities and enforcement paths.
Changes:
- Add large-document batching in
CombinedNodeProviderwith strategy-aware batch sizes and explicit failures on mid-batch errors. - Propagate and report configured
maxFileSizeacross factory-selected providers (Combined + Kreuzberg) and capabilities. - Add regression tests covering batching order, boundary preservation, capability reporting, and Kreuzberg in-memory handling.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/shared/factory.ts | Validates provider strings at runtime, improves browser detection, and forwards maxFileSize into Node providers. |
| src/node/kreuzberg.ts | Adds maxFileSize enforcement + capability reporting and avoids double-normalizing in-memory sources. |
| src/node/combined.ts | Implements batched extraction for large PDFs and enforces/report maxFileSize for the combined Node provider. |
| src/node/combined.test.ts | Adds regression tests for large-PDF batching behaviors and max file size capability reporting. |
| src/factory.test.ts | Adds tests for max file size propagation and Kreuzberg in-memory normalization behavior. |
| .changeset/large-pdf-batching.md | Patch changeset describing the new batching and max file size behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return ( | ||
| !source || | ||
| (typeof source === 'string' && source.trim() === '') || | ||
| (typeof source === 'object' && | ||
| Object.keys(source).length === 0 && | ||
| !(source instanceof Buffer) && | ||
| !(source instanceof Uint8Array)) | ||
| ); |
There was a problem hiding this comment.
PDFSource includes ArrayBuffer, but isInvalidSource() treats any ArrayBuffer as an “empty object” (because Object.keys(new ArrayBuffer(...)) is always empty) and returns true, causing extractText() to return null for valid in-memory PDFs. Update the invalid-source check to explicitly allow ArrayBuffer (and preferably avoid using Object.keys() for binary sources, using byteLength checks instead).
| return ( | |
| !source || | |
| (typeof source === 'string' && source.trim() === '') || | |
| (typeof source === 'object' && | |
| Object.keys(source).length === 0 && | |
| !(source instanceof Buffer) && | |
| !(source instanceof Uint8Array)) | |
| ); | |
| if (!source) { | |
| return true; | |
| } | |
| if (typeof source === 'string') { | |
| return source.trim() === ''; | |
| } | |
| if (source instanceof Buffer || source instanceof Uint8Array) { | |
| return source.byteLength === 0; | |
| } | |
| if (source instanceof ArrayBuffer) { | |
| return source.byteLength === 0; | |
| } | |
| return typeof source === 'object' && Object.keys(source).length === 0; |
| try { | ||
| const info = await this.getInfo(source); | ||
| await this.assertWithinConfiguredMaxFileSize(source, info.fileSize); | ||
|
|
There was a problem hiding this comment.
extractText() now calls getInfo() unconditionally, but getInfo() (via unpdfProvider.getInfo) fully reads/parses the PDF, and unpdfProvider.extractText() will parse it again. This adds a guaranteed second full parse for every extraction (even small PDFs). Consider deferring getInfo() until you’ve established batching is needed (e.g., based on file size / page selection) or reusing analysis results to avoid double work.
| try { | ||
| const info = await this.getInfo(source); | ||
| await this.assertWithinConfiguredMaxFileSize(source, info.fileSize); | ||
|
|
There was a problem hiding this comment.
The configured maxFileSize ceiling is enforced after getInfo(), but getInfo() reads the full file into memory for path sources (see UnpdfProvider.normalizeSource). For oversized PDFs this defeats the purpose of the ceiling and can still cause unnecessary memory/IO. Enforce maxFileSize before calling getInfo() for string sources (via stat) and/or before any full-buffer normalization/parsing occurs.
|
CI investigation summary: This does not look Renovate-specific. The currently open Renovate PRs I checked ( Root cause from the GitHub Actions logs:
Important detail:
Suggested fix path in this PR:
|
Summary
Verification