Skip to content

fix(pdf): harden large document batching#61

Merged
willgriffin merged 3 commits into
mainfrom
codex/large-pdf-batching
Apr 20, 2026
Merged

fix(pdf): harden large document batching#61
willgriffin merged 3 commits into
mainfrom
codex/large-pdf-batching

Conversation

@willgriffin
Copy link
Copy Markdown
Contributor

Summary

  • preserve page boundaries and skip-OCR behavior in large-document batching
  • enforce configured file-size ceilings consistently across combined and Kreuzberg readers
  • add regressions for batching order, capability reporting, and Kreuzberg in-memory handling

Verification

  • pnpm typecheck
  • pnpm test

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22c553a1eb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/node/combined.ts Outdated
? 'text'
: info.recommendedStrategy;

if (options?.mergePages === false) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat default mergePages as page-preserving in batched mode

extractTextBatched only switches to page-wise extraction when mergePages === false. For the default case where callers omit mergePages, large PDFs still go through batch mode, and each batch is extracted with mergePages: true, so separators are only preserved between batches (e.g., every 25 pages) rather than between pages. This changes default output structure for large documents and can break downstream page-based chunking/indexing.

Useful? React with 👍 / 👎.

Comment thread src/shared/factory.ts Outdated
Comment on lines +146 to +148
const provider = isRuntimeProvider(config.provider ?? 'auto')
? (config.provider ?? 'auto')
: 'auto';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject unknown provider values instead of falling back to auto

This coercion maps any unrecognized provider string to 'auto', so a typo like provider: 'unpfd' no longer fails fast and instead silently selects a different backend. That hides misconfiguration and can unexpectedly change provider behavior (OCR path, limits, dependencies) at runtime. The previous behavior surfaced this as an Unknown PDF provider error.

Useful? React with 👍 / 👎.

Comment thread src/node/kreuzberg.ts Outdated
*/

import { promises as fs } from 'node:fs';
import type { ExtractionConfig, ExtractionResult } from '@kreuzberg/node';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid build-time type imports from optional Kreuzberg module

@kreuzberg/node is declared as an optional dependency, but this top-level type import makes TypeScript resolution require that package during typecheck/build. In environments where optional dependencies are skipped or unsupported, this turns a runtime-optional provider into a build blocker, bypassing the intended graceful runtime fallback via dynamic import.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

@willgriffin willgriffin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the large-PDF batching changes. The batching logic and broader maxFileSize propagation are moving in a good direction, and the new tests add real coverage. A few concerns below that aren't already covered by the Codex review (which separately flagged the default-mergePages regression, the silent auto fallback for unknown providers, and the top-level @kreuzberg/node type import from an optional dependency — those are all valid).

Main things I'd like to see addressed:

  • Every extractText() now performs a full PDF parse via getInfo() before deciding whether to batch, which is a perf regression for the common small/text-PDF case.
  • maxFileSize is enforced after that parse, which defeats the primary reason to have the limit.
  • Error routing relies on substring matching of error messages rather than typed error classes, which is fragile.

Inline notes on each.

Comment thread src/node/combined.ts Outdated
}

try {
const info = await this.getInfo(source);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — getInfo() is now unconditionally called on every extractText().

UnpdfProvider.getInfo() loads the PDF, calls getDocumentProxy, getMetadata, and then for each of up to 3 sample pages runs getPage, getTextContent, and getOperatorList. That's a full parse of the document before we even know whether batching applies.

For the common case — a small/medium text-based PDF that would have hit the fast direct-extraction path — this doubles the parse work (getInfo parses the PDF, then unpdfProvider.extractText(source, options) on line 325 parses it again). unpdf doesn't cache across calls that take a raw source, so this is real duplicated work on the hot path.

Suggest only triggering the batching analysis when there's a cheap signal that batching might matter: check getSourceByteLength(source) (stat for paths, .byteLength for in-memory) before calling getInfo(). If the source is well under LARGE_DOCUMENT_BATCH_BYTES and no explicit pages filter is set, fall straight through to the original direct-extraction path.

Comment thread src/node/combined.ts Outdated

try {
const info = await this.getInfo(source);
await this.assertWithinConfiguredMaxFileSize(source, info.fileSize);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — maxFileSize is enforced after getInfo() has already loaded and parsed the PDF.

The whole point of maxFileSize is to prevent oversized files from being brought into memory/parsed. But on line 307 we call this.getInfo(source), which runs normalizeSource (reads the full file into a Buffer) and then getDocumentProxy before we ever check the limit. A 1GB PDF with maxFileSize: 100*1024*1024 will still fully load and parse before throwing.

Recommend running assertWithinConfiguredMaxFileSize(source) first using the cheap path — stat() for file paths, byteLength for in-memory sources — and only then calling getInfo(). The Kreuzberg provider already does it in that order (line 185-192 of kreuzberg.ts), so the two providers will be consistent.

Comment thread src/node/combined.ts Outdated
} catch (error) {
if (
error instanceof Error &&
(error.message.includes('Large PDF extraction failed') ||
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — String-match on error messages to decide whether to rethrow is fragile.

The outer catch uses error.message.includes('Large PDF extraction failed') / 'configured maxFileSize' to decide whether to propagate. If either prefix is ever reworded — localized, prefixed for logging, changed to include page numbers differently — both oversized-PDF errors and batch failures will silently fall into console.error(...); return null; again, quietly reverting the explicit-failure guarantee the changeset advertises.

Worth introducing typed error subclasses (e.g., PDFBatchExtractionError, PDFFileSizeError extending PDFError) and using instanceof here. That also gives callers something to catch programmatically without regex-ing error messages.

Comment thread src/node/combined.ts Outdated
return ocrText;
}

return ocrText.length > directText.length * 1.2 ? ocrText : directText;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — The ocrText.length > directText.length * 1.2 heuristic is unjustified.

Length is a weak proxy for OCR quality: OCR output commonly has more characters than the truth (inserted whitespace, repeated noise characters, garbage on image borders) while being lower-quality than the embedded text. Under this rule, an image-heavy page whose embedded text is actually correct can still be overridden by noisy OCR as long as the OCR output is 20% longer.

At minimum add a comment explaining why 1.2 was chosen. A more robust signal would be OCR confidence from performOCR (already returned as ocrResult.confidence on line 204) — only pick OCR if it's both longer and clears a confidence threshold.

Comment thread src/node/combined.ts
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens PDF large-document handling by introducing batched extraction in the Node combined provider, ensuring page-boundary/skip-OCR behaviors are preserved, and propagating configured maxFileSize through provider capabilities and enforcement paths.

Changes:

  • Add large-document batching in CombinedNodeProvider with strategy-aware batch sizes and explicit failures on mid-batch errors.
  • Propagate and report configured maxFileSize across factory-selected providers (Combined + Kreuzberg) and capabilities.
  • Add regression tests covering batching order, boundary preservation, capability reporting, and Kreuzberg in-memory handling.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/shared/factory.ts Validates provider strings at runtime, improves browser detection, and forwards maxFileSize into Node providers.
src/node/kreuzberg.ts Adds maxFileSize enforcement + capability reporting and avoids double-normalizing in-memory sources.
src/node/combined.ts Implements batched extraction for large PDFs and enforces/report maxFileSize for the combined Node provider.
src/node/combined.test.ts Adds regression tests for large-PDF batching behaviors and max file size capability reporting.
src/factory.test.ts Adds tests for max file size propagation and Kreuzberg in-memory normalization behavior.
.changeset/large-pdf-batching.md Patch changeset describing the new batching and max file size behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/node/combined.ts Outdated
Comment on lines +49 to +56
return (
!source ||
(typeof source === 'string' && source.trim() === '') ||
(typeof source === 'object' &&
Object.keys(source).length === 0 &&
!(source instanceof Buffer) &&
!(source instanceof Uint8Array))
);
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PDFSource includes ArrayBuffer, but isInvalidSource() treats any ArrayBuffer as an “empty object” (because Object.keys(new ArrayBuffer(...)) is always empty) and returns true, causing extractText() to return null for valid in-memory PDFs. Update the invalid-source check to explicitly allow ArrayBuffer (and preferably avoid using Object.keys() for binary sources, using byteLength checks instead).

Suggested change
return (
!source ||
(typeof source === 'string' && source.trim() === '') ||
(typeof source === 'object' &&
Object.keys(source).length === 0 &&
!(source instanceof Buffer) &&
!(source instanceof Uint8Array))
);
if (!source) {
return true;
}
if (typeof source === 'string') {
return source.trim() === '';
}
if (source instanceof Buffer || source instanceof Uint8Array) {
return source.byteLength === 0;
}
if (source instanceof ArrayBuffer) {
return source.byteLength === 0;
}
return typeof source === 'object' && Object.keys(source).length === 0;

Copilot uses AI. Check for mistakes.
Comment thread src/node/combined.ts Outdated
Comment on lines +306 to +309
try {
const info = await this.getInfo(source);
await this.assertWithinConfiguredMaxFileSize(source, info.fileSize);

Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractText() now calls getInfo() unconditionally, but getInfo() (via unpdfProvider.getInfo) fully reads/parses the PDF, and unpdfProvider.extractText() will parse it again. This adds a guaranteed second full parse for every extraction (even small PDFs). Consider deferring getInfo() until you’ve established batching is needed (e.g., based on file size / page selection) or reusing analysis results to avoid double work.

Copilot uses AI. Check for mistakes.
Comment thread src/node/combined.ts Outdated
Comment on lines +306 to +309
try {
const info = await this.getInfo(source);
await this.assertWithinConfiguredMaxFileSize(source, info.fileSize);

Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configured maxFileSize ceiling is enforced after getInfo(), but getInfo() reads the full file into memory for path sources (see UnpdfProvider.normalizeSource). For oversized PDFs this defeats the purpose of the ceiling and can still cause unnecessary memory/IO. Enforce maxFileSize before calling getInfo() for string sources (via stat) and/or before any full-buffer normalization/parsing occurs.

Copilot uses AI. Check for mistakes.
Comment thread src/node/kreuzberg.ts
@willgriffin
Copy link
Copy Markdown
Contributor Author

CI investigation summary:

This does not look Renovate-specific. The currently open Renovate PRs I checked (#60, #59, and #47) all fail the same Test / Run Tests job, and this PR is failing that same job too.

Root cause from the GitHub Actions logs:

  • The failing step is Lint in .github/workflows/test.yml.
  • The blocking error is a Biome formatting error in package.json.
  • Specifically, Biome wants the files array formatted on one line instead of:
    • "files": [
    • "dist",
    • "README.md",
    • "LICENSE"
    • ]
  • The log shows Biome would rewrite that to:
    • "files": ["dist", "README.md", "LICENSE"],

Important detail:

  • The other diagnostics in src/browser/pdfjs.ts, src/node/kreuzberg.ts, and src/shared/factory.ts are warnings, not the reason CI is exiting non-zero.
  • main advanced to 70913e9 (chore(release): v0.62.23 [skip ci]) on April 9, 2026, so this base-branch issue appears to have landed without the test workflow running on that commit.
  • That means every PR merging against current main is inheriting the same red CI until we fix the formatting issue.

Suggested fix path in this PR:

  • Update package.json formatting to satisfy Biome.
  • Re-run CI here.
  • Once lint is green again, we can see whether any Renovate PR still has a dependency-specific failure after the base issue is removed.

@willgriffin willgriffin merged commit 15fc514 into main Apr 20, 2026
2 checks passed
@willgriffin willgriffin deleted the codex/large-pdf-batching branch April 20, 2026 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants