fix(pdf): harden large document batching by willgriffin · Pull Request #61 · happyvertical/pdf

willgriffin · 2026-04-20T15:28:28Z

Summary

preserve page boundaries and skip-OCR behavior in large-document batching
enforce configured file-size ceilings consistently across combined and Kreuzberg readers
add regressions for batching order, capability reporting, and Kreuzberg in-memory handling

Verification

pnpm typecheck
pnpm test

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22c553a1eb

chatgpt-codex-connector · 2026-04-20T15:34:23Z

+      ? 'text'
+      : info.recommendedStrategy;
+
+    if (options?.mergePages === false) {


Treat default mergePages as page-preserving in batched mode

extractTextBatched only switches to page-wise extraction when mergePages === false. For the default case where callers omit mergePages, large PDFs still go through batch mode, and each batch is extracted with mergePages: true, so separators are only preserved between batches (e.g., every 25 pages) rather than between pages. This changes default output structure for large documents and can break downstream page-based chunking/indexing.
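One way to read this suggestion is that only an explicit `mergePages: true` should opt out of page boundaries. The sketch below is a minimal illustration of that predicate, not the library's code: the `ExtractTextOptions` shape and the `preservesPageBoundaries` helper are hypothetical, and it assumes (as the comment title suggests) that an omitted `mergePages` should behave like `false`.

```typescript
// Hypothetical options shape; only `mergePages` is taken from the review comment.
interface ExtractTextOptions {
  mergePages?: boolean; // undefined means the caller did not specify a value
}

// Hypothetical helper: page boundaries are kept unless the caller
// explicitly asks for merged output with `mergePages: true`.
function preservesPageBoundaries(options?: ExtractTextOptions): boolean {
  return options?.mergePages !== true;
}
```

With a check like this, both `{}` and `{ mergePages: false }` take the page-wise path, whereas the `=== false` comparison in the diff only catches the second.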

chatgpt-codex-connector · 2026-04-20T15:34:23Z

+  const provider = isRuntimeProvider(config.provider ?? 'auto')
+    ? (config.provider ?? 'auto')
+    : 'auto';


Reject unknown provider values instead of falling back to auto

This coercion maps any unrecognized provider string to 'auto', so a typo like provider: 'unpfd' no longer fails fast and instead silently selects a different backend. That hides misconfiguration and can unexpectedly change provider behavior (OCR path, limits, dependencies) at runtime. The previous behavior surfaced this as an Unknown PDF provider error.
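A fail-fast variant could look roughly like the following. This is a sketch under assumptions: the provider list is illustrative (the real set of runtime providers is not shown in this thread), `resolveProvider` is a hypothetical name, and only the `Unknown PDF provider` wording comes from the comment above.

```typescript
// Assumed provider list for illustration; the package's real list may differ.
const RUNTIME_PROVIDERS = ['auto', 'unpdf', 'kreuzberg'] as const;
type RuntimeProvider = (typeof RUNTIME_PROVIDERS)[number];

function isRuntimeProvider(value: string): value is RuntimeProvider {
  return (RUNTIME_PROVIDERS as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: default to 'auto' only when nothing was configured,
// and throw on any value that is set but not recognized.
function resolveProvider(value?: string): RuntimeProvider {
  const provider = value ?? 'auto';
  if (!isRuntimeProvider(provider)) {
    throw new Error(`Unknown PDF provider: ${provider}`);
  }
  return provider;
}
```

The distinction is between "not configured" (fall back to `'auto'`) and "configured incorrectly" (throw), which the coercion in the diff collapses into one case.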

chatgpt-codex-connector · 2026-04-20T15:34:23Z

 */

 import { promises as fs } from 'node:fs';
+import type { ExtractionConfig, ExtractionResult } from '@kreuzberg/node';


Avoid build-time type imports from optional Kreuzberg module

@kreuzberg/node is declared as an optional dependency, but this top-level type import makes TypeScript resolution require that package during typecheck/build. In environments where optional dependencies are skipped or unsupported, this turns a runtime-optional provider into a build blocker, bypassing the intended graceful runtime fallback via dynamic import.
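One possible way to keep the dependency optional at build time is to load it through a non-literal specifier and describe only the needed shape with local types. The helper below is a generic sketch; `loadOptionalModule` is a hypothetical name and nothing here reflects the actual `@kreuzberg/node` API.

```typescript
// Hypothetical loader: because the specifier is a variable rather than a
// string literal, TypeScript does not try to resolve the package's types
// during typecheck, and a missing package becomes `null` at runtime.
async function loadOptionalModule<T = unknown>(specifier: string): Promise<T | null> {
  try {
    return (await import(specifier)) as T;
  } catch {
    // Package not installed or failed to load; the caller can fall back.
    return null;
  }
}
```

A caller would pass its own locally declared interface as `T`, so the build never depends on the optional package being present.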

willgriffin

Review of the large-PDF batching changes. The batching logic and broader maxFileSize propagation are moving in a good direction, and the new tests add real coverage. A few concerns below that aren't already covered by the Codex review (which separately flagged the default-mergePages regression, the silent auto fallback for unknown providers, and the top-level @kreuzberg/node type import from an optional dependency — those are all valid).

Main things I'd like to see addressed:

Every extractText() now performs a full PDF parse via getInfo() before deciding whether to batch, which is a perf regression for the common small/text-PDF case.
maxFileSize is enforced after that parse, which defeats the primary reason to have the limit.
Error routing relies on substring matching of error messages rather than typed error classes, which is fragile.

Inline notes on each.

willgriffin · 2026-04-20T15:39:12Z

+    }
+
    try {
+      const info = await this.getInfo(source);


P1 — getInfo() is now unconditionally called on every extractText().

UnpdfProvider.getInfo() loads the PDF, calls getDocumentProxy, getMetadata, and then for each of up to 3 sample pages runs getPage, getTextContent, and getOperatorList. That's a full parse of the document before we even know whether batching applies.

For the common case — a small/medium text-based PDF that would have hit the fast direct-extraction path — this doubles the parse work (getInfo parses the PDF, then unpdfProvider.extractText(source, options) on line 325 parses it again). unpdf doesn't cache across calls that take a raw source, so this is real duplicated work on the hot path.

Suggest only triggering the batching analysis when there's a cheap signal that batching might matter: check getSourceByteLength(source) (stat for paths, .byteLength for in-memory) before calling getInfo(). If the source is well under LARGE_DOCUMENT_BATCH_BYTES and no explicit pages filter is set, fall straight through to the original direct-extraction path.
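As a rough sketch of that gate, assuming the byte length has already been obtained cheaply: the threshold value, the `pages` option shape, and the `needsBatchAnalysis` name are all assumptions for illustration, since only the `LARGE_DOCUMENT_BATCH_BYTES` identifier appears in this thread.

```typescript
// Assumed value; the real constant's value is not shown in this thread.
const LARGE_DOCUMENT_BATCH_BYTES = 25 * 1024 * 1024;

// Hypothetical options shape with an explicit page filter.
interface ExtractOptions {
  pages?: number[];
}

// Hypothetical gate: run the expensive getInfo() analysis only when the
// source is large or the caller asked for specific pages.
function needsBatchAnalysis(sourceBytes: number, options?: ExtractOptions): boolean {
  const hasPageFilter = (options?.pages?.length ?? 0) > 0;
  return hasPageFilter || sourceBytes >= LARGE_DOCUMENT_BATCH_BYTES;
}
```

When this returns `false`, the caller can go straight to the direct-extraction path and the PDF is parsed once.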

willgriffin · 2026-04-20T15:39:12Z

+
    try {
+      const info = await this.getInfo(source);
+      await this.assertWithinConfiguredMaxFileSize(source, info.fileSize);


P2 — maxFileSize is enforced after getInfo() has already loaded and parsed the PDF.

The whole point of maxFileSize is to prevent oversized files from being brought into memory/parsed. But on line 307 we call this.getInfo(source), which runs normalizeSource (reads the full file into a Buffer) and then getDocumentProxy before we ever check the limit. A 1GB PDF with maxFileSize: 100*1024*1024 will still fully load and parse before throwing.

Recommend running assertWithinConfiguredMaxFileSize(source) first using the cheap path (stat() for file paths, byteLength for in-memory sources) and only then calling getInfo(). The Kreuzberg provider already does it in that order (lines 185-192 of kreuzberg.ts), so the two providers will be consistent.
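A cheap pre-check along these lines might look like the sketch below. The `PDFSource` union is simplified, the function bodies are assumptions rather than the provider's actual implementation, and the error is a plain `Error` purely to keep the example short.

```typescript
import { promises as fs } from 'node:fs';

// Simplified source union for illustration (Buffer is a Uint8Array subclass).
type PDFSource = string | Uint8Array | ArrayBuffer;

// Size lookup that never reads file contents: stat() for paths,
// byteLength for sources that are already in memory.
async function getSourceByteLength(source: PDFSource): Promise<number> {
  if (typeof source === 'string') {
    return (await fs.stat(source)).size;
  }
  return source.byteLength;
}

// Hypothetical guard meant to run before any normalization or parsing.
async function assertWithinMaxFileSize(source: PDFSource, maxFileSize?: number): Promise<void> {
  if (maxFileSize === undefined) {
    return; // no ceiling configured
  }
  const size = await getSourceByteLength(source);
  if (size > maxFileSize) {
    throw new Error(`PDF is ${size} bytes, which exceeds the configured maxFileSize of ${maxFileSize} bytes`);
  }
}
```

Called before `getInfo()`, a guard like this rejects an oversized path after a single `stat()` instead of after a full read and parse.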

willgriffin · 2026-04-20T15:39:12Z

    } catch (error) {
+      if (
+        error instanceof Error &&
+        (error.message.includes('Large PDF extraction failed') ||


P1 — String-match on error messages to decide whether to rethrow is fragile.

The outer catch uses error.message.includes('Large PDF extraction failed') / 'configured maxFileSize' to decide whether to propagate. If either prefix is ever reworded — localized, prefixed for logging, changed to include page numbers differently — both oversized-PDF errors and batch failures will silently fall into console.error(...); return null; again, quietly reverting the explicit-failure guarantee the changeset advertises.

Worth introducing typed error subclasses (e.g., PDFBatchExtractionError, PDFFileSizeError extending PDFError) and using instanceof here. That also gives callers something to catch programmatically without regex-ing error messages.
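To make the suggestion concrete, here is a minimal sketch. `PDFError` below is a local stand-in (the package's real base class is not shown in this thread), and `shouldPropagate` is a hypothetical helper name.

```typescript
// Stand-in base class; the real PDFError in the package may carry more fields.
class PDFError extends Error {
  constructor(message: string) {
    super(message);
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
  }
}

class PDFFileSizeError extends PDFError {}
class PDFBatchExtractionError extends PDFError {}

// Decide whether the outer catch should rethrow, based on type, not wording.
function shouldPropagate(error: unknown): boolean {
  return error instanceof PDFFileSizeError || error instanceof PDFBatchExtractionError;
}
```

With this in place the message text can be reworded freely, and a plain `Error` that happens to contain the same phrase is no longer mistaken for a batching failure.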

willgriffin · 2026-04-20T15:39:12Z

+        return ocrText;
+      }
+
+      return ocrText.length > directText.length * 1.2 ? ocrText : directText;


P2 — The ocrText.length > directText.length * 1.2 heuristic is unjustified.

Length is a weak proxy for OCR quality: OCR output commonly has more characters than the truth (inserted whitespace, repeated noise characters, garbage on image borders) while being lower-quality than the embedded text. Under this rule, an image-heavy page whose embedded text is actually correct can still be overridden by noisy OCR as long as the OCR output is 20% longer.

At minimum add a comment explaining why 1.2 was chosen. A more robust signal would be OCR confidence from performOCR (already returned as ocrResult.confidence on line 204) — only pick OCR if it's both longer and clears a confidence threshold.
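A combined length-and-confidence rule could be sketched as follows. The `OcrResult` shape, the `pickText` name, the empty-text fallback, and the caller-supplied `minConfidence` (expressed in whatever scale `performOCR` reports) are assumptions for illustration; only the 1.2 ratio comes from the diff.

```typescript
// Hypothetical OCR result shape: text plus a confidence score.
interface OcrResult {
  text: string;
  confidence: number;
}

// Prefer OCR only when it is both noticeably longer than the embedded text
// and clears a confidence threshold in the same scale as `confidence`.
function pickText(
  directText: string,
  ocr: OcrResult,
  minConfidence: number,
  lengthRatio = 1.2,
): string {
  if (directText.trim() === '') {
    return ocr.text; // nothing embedded, so OCR is the only candidate
  }
  const isLonger = ocr.text.length > directText.length * lengthRatio;
  return isLonger && ocr.confidence >= minConfidence ? ocr.text : directText;
}
```

Under this rule, long but low-confidence OCR output no longer overrides embedded text that is already correct.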

Copilot

Pull request overview

This PR hardens PDF large-document handling by introducing batched extraction in the Node combined provider, ensuring page-boundary/skip-OCR behaviors are preserved, and propagating configured maxFileSize through provider capabilities and enforcement paths.

Changes:

Add large-document batching in CombinedNodeProvider with strategy-aware batch sizes and explicit failures on mid-batch errors.
Propagate and report configured maxFileSize across factory-selected providers (Combined + Kreuzberg) and capabilities.
Add regression tests covering batching order, boundary preservation, capability reporting, and Kreuzberg in-memory handling.
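The batching idea summarized above can be illustrated with a small sketch. This is not the code from `CombinedNodeProvider`: `planBatches`, `extractInBatches`, and the `extractRange` callback are hypothetical, and only the `Large PDF extraction failed` wording and the 25-page example come from this thread.

```typescript
interface PageRange {
  start: number; // 1-based, inclusive
  end: number; // 1-based, inclusive
}

// Split pages 1..totalPages into consecutive ranges of at most batchSize pages.
function planBatches(totalPages: number, batchSize: number): PageRange[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: PageRange[] = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    batches.push({ start, end: Math.min(start + batchSize - 1, totalPages) });
  }
  return batches;
}

// Run batches sequentially so output order matches page order, and fail
// loudly (naming the page range) if any batch throws.
async function extractInBatches(
  totalPages: number,
  batchSize: number,
  extractRange: (range: PageRange) => Promise<string>,
): Promise<string[]> {
  const results: string[] = [];
  for (const range of planBatches(totalPages, batchSize)) {
    try {
      results.push(await extractRange(range));
    } catch (cause) {
      const reason = cause instanceof Error ? cause.message : String(cause);
      throw new Error(`Large PDF extraction failed for pages ${range.start}-${range.end}: ${reason}`);
    }
  }
  return results;
}
```

For example, `planBatches(60, 25)` yields the ranges 1-25, 26-50, and 51-60. A strategy-aware caller would pick a smaller `batchSize` for OCR-heavy documents than for text-only ones.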

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

File	Description
src/shared/factory.ts	Validates provider strings at runtime, improves browser detection, and forwards `maxFileSize` into Node providers.
src/node/kreuzberg.ts	Adds `maxFileSize` enforcement + capability reporting and avoids double-normalizing in-memory sources.
src/node/combined.ts	Implements batched extraction for large PDFs and enforces/reports `maxFileSize` for the combined Node provider.
src/node/combined.test.ts	Adds regression tests for large-PDF batching behaviors and max file size capability reporting.
src/factory.test.ts	Adds tests for max file size propagation and Kreuzberg in-memory normalization behavior.
.changeset/large-pdf-batching.md	Patch changeset describing the new batching and max file size behavior.

Copilot · 2026-04-20T15:40:12Z

+    return (
+      !source ||
+      (typeof source === 'string' && source.trim() === '') ||
+      (typeof source === 'object' &&
+        Object.keys(source).length === 0 &&
+        !(source instanceof Buffer) &&
+        !(source instanceof Uint8Array))
+    );


PDFSource includes ArrayBuffer, but isInvalidSource() treats any ArrayBuffer as an “empty object” (because Object.keys(new ArrayBuffer(...)) is always empty) and returns true, causing extractText() to return null for valid in-memory PDFs. Update the invalid-source check to explicitly allow ArrayBuffer (and preferably avoid using Object.keys() for binary sources, using byteLength checks instead).

Suggested change

-    return (
-      !source ||
-      (typeof source === 'string' && source.trim() === '') ||
-      (typeof source === 'object' &&
-        Object.keys(source).length === 0 &&
-        !(source instanceof Buffer) &&
-        !(source instanceof Uint8Array))
-    );
+    if (!source) {
+      return true;
+    }
+    if (typeof source === 'string') {
+      return source.trim() === '';
+    }
+    if (source instanceof Buffer || source instanceof Uint8Array) {
+      return source.byteLength === 0;
+    }
+    if (source instanceof ArrayBuffer) {
+      return source.byteLength === 0;
+    }
+    return typeof source === 'object' && Object.keys(source).length === 0;

Copilot · 2026-04-20T15:40:13Z

    try {
+      const info = await this.getInfo(source);
+      await this.assertWithinConfiguredMaxFileSize(source, info.fileSize);
+


extractText() now calls getInfo() unconditionally, but getInfo() (via unpdfProvider.getInfo) fully reads/parses the PDF, and unpdfProvider.extractText() will parse it again. This adds a guaranteed second full parse for every extraction (even small PDFs). Consider deferring getInfo() until you’ve established batching is needed (e.g., based on file size / page selection) or reusing analysis results to avoid double work.

Copilot · 2026-04-20T15:40:13Z

    try {
+      const info = await this.getInfo(source);
+      await this.assertWithinConfiguredMaxFileSize(source, info.fileSize);
+


The configured maxFileSize ceiling is enforced after getInfo(), but getInfo() reads the full file into memory for path sources (see UnpdfProvider.normalizeSource). For oversized PDFs this defeats the purpose of the ceiling and can still cause unnecessary memory/IO. Enforce maxFileSize before calling getInfo() for string sources (via stat) and/or before any full-buffer normalization/parsing occurs.

willgriffin · 2026-04-20T15:40:42Z

CI investigation summary:

This does not look Renovate-specific. The currently open Renovate PRs I checked (#60, #59, and #47) all fail the same Test / Run Tests job, and this PR is failing that same job too.

Root cause from the GitHub Actions logs:

The failing step is Lint in .github/workflows/test.yml.
The blocking error is a Biome formatting error in package.json.
Specifically, Biome wants the files array formatted on one line instead of:

  "files": [
    "dist",
    "README.md",
    "LICENSE"
  ]

The log shows Biome would rewrite that to:

  "files": ["dist", "README.md", "LICENSE"],

Important detail:

The other diagnostics in src/browser/pdfjs.ts, src/node/kreuzberg.ts, and src/shared/factory.ts are warnings, not the reason CI is exiting non-zero.
main advanced to 70913e9 (chore(release): v0.62.23 [skip ci]) on April 9, 2026, so this base-branch issue appears to have landed without the test workflow running on that commit.
That means every PR merging against current main is inheriting the same red CI until we fix the formatting issue.

Suggested fix path in this PR:

Update package.json formatting to satisfy Biome.
Re-run CI here.
Once lint is green again, we can see whether any Renovate PR still has a dependency-specific failure after the base issue is removed.

fix(pdf): harden large document batching

22c553a

chatgpt-codex-connector Bot reviewed Apr 20, 2026

willgriffin requested a review from Copilot April 20, 2026 15:34

Copilot started reviewing on behalf of willgriffin April 20, 2026 15:35 View session

willgriffin commented Apr 20, 2026

Copilot AI reviewed Apr 20, 2026

willgriffin added 2 commits April 20, 2026 09:46

fix(pdf): address review feedback

715e697

fix(pdf): address lint failures

6046d3a

willgriffin merged commit 15fc514 into main Apr 20, 2026
2 checks passed

willgriffin deleted the codex/large-pdf-batching branch April 20, 2026 16:09
