fix(pdf): harden OCR fallback and image extraction#72

Merged: willgriffin merged 3 commits into main from codex/pdf-70-71-release on Apr 25, 2026

Conversation

@willgriffin
Contributor

Summary

  • Make OCR fallback fail explicitly with PDFOCRFallbackError when page rendering fails or OCR returns no recognized text, while preserving graceful null behavior for invalid/missing/corrupt inputs.
  • Add a collected-image byte ceiling with PDFImageCollectionLimitError so large image-heavy PDFs use the existing batched onBatch path instead of retaining unbounded image buffers.
  • Document the large-PDF image extraction path and add a patch changeset for one release covering pdf#70 and pdf#71.
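
The byte ceiling described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: `extractImagesSketch`, the `batches` input, and the local `PDFImageCollectionLimitError` class are hypothetical stand-ins, and only the names `collect`, `maxCollectedBytes`, and `onBatch` come from this PR.

```typescript
// Hypothetical stand-in for the error class named in this PR.
class PDFImageCollectionLimitError extends Error {
  collectedBytes: number;
  maxCollectedBytes: number;
  constructor(collectedBytes: number, maxCollectedBytes: number) {
    super(`Collected ${collectedBytes} image bytes; limit is ${maxCollectedBytes}`);
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on older targets
    this.name = "PDFImageCollectionLimitError";
    this.collectedBytes = collectedBytes;
    this.maxCollectedBytes = maxCollectedBytes;
  }
}

// Assumed option shape; the real options type may differ.
interface CollectOptionsSketch {
  collect?: boolean;
  maxCollectedBytes?: number;
  onBatch?: (batch: Uint8Array[]) => void;
}

// `batches` simulates images arriving page by page from an extractor.
function extractImagesSketch(
  batches: Uint8Array[][],
  options: CollectOptionsSketch = {},
): Uint8Array[] {
  const { collect = true, maxCollectedBytes = Infinity, onBatch } = options;
  const collected: Uint8Array[] = [];
  let collectedBytes = 0;
  for (const batch of batches) {
    onBatch?.(batch); // streaming consumers see every batch
    if (!collect) continue; // nothing is retained on the batched path
    for (const image of batch) {
      collectedBytes += image.byteLength;
      if (collectedBytes > maxCollectedBytes) {
        // Fail explicitly instead of growing memory without bound.
        throw new PDFImageCollectionLimitError(collectedBytes, maxCollectedBytes);
      }
      collected.push(image);
    }
  }
  return collected;
}
```

Under these assumptions, a caller handling an image-heavy PDF would pass `collect: false` with an `onBatch` handler, so the ceiling is never reached and each batch can be written out and released.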

Fixes #70
Fixes #71

Validation

  • pnpm exec tsc --noEmit
  • CI=true pnpm test
  • pnpm run build
  • Biome check limited to the touched files
  • live Bentley March 24, 2026 minutes PDF extracted 6063 characters through the built package

@willgriffin willgriffin marked this pull request as ready for review April 25, 2026 19:58
@willgriffin willgriffin requested a review from Copilot April 25, 2026 19:59

Copilot AI left a comment

Pull request overview

Hardens Node PDF OCR fallback and image extraction to fail explicitly in “OCR attempted but produced nothing / rendering failed” cases, and to prevent unbounded memory growth when collecting extracted image buffers.

Changes:

  • Add explicit PDFOCRFallbackError on OCR fallback rendering failures and on “OCR returned no text”.
  • Add maxCollectedBytes guard with PDFImageCollectionLimitError to cap retained image buffers when collect is enabled.
  • Update docs, tests, and changeset to reflect the new failure modes and large-PDF image extraction guidance.
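
To make the two failure modes concrete, here is a rough sketch of the control flow. It is an assumption-laden illustration rather than the code in src/node/combined.ts: `ocrFallbackSketch`, the `render` and `recognize` callbacks, and the `reason` field are hypothetical, and the functions are synchronous only for brevity (a real OCR pipeline would almost certainly be asynchronous).

```typescript
type OCRFailureReason = "render-failed" | "no-text";

// Hypothetical stand-in for the error class named in this PR.
class PDFOCRFallbackError extends Error {
  reason: OCRFailureReason;
  constructor(reason: OCRFailureReason, message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "PDFOCRFallbackError";
    this.reason = reason;
  }
}

function ocrFallbackSketch(
  pdf: Uint8Array | null | undefined,
  render: (pdf: Uint8Array) => Uint8Array[],
  recognize: (pageImage: Uint8Array) => string,
): string | null {
  // Missing or empty input keeps the graceful null behavior.
  if (!pdf || pdf.byteLength === 0) return null;

  let pages: Uint8Array[];
  try {
    pages = render(pdf);
  } catch (cause) {
    // OCR was attempted but rendering failed: surface it explicitly.
    throw new PDFOCRFallbackError("render-failed", `Page rendering failed: ${String(cause)}`);
  }

  const text = pages.map((page) => recognize(page)).join("\n").trim();
  if (text.length === 0) {
    // OCR ran but recognized nothing: also an explicit failure.
    throw new PDFOCRFallbackError("no-text", "OCR returned no recognized text");
  }
  return text;
}
```

The point of the split is that callers can tell "there was nothing to process" (null) apart from "OCR ran and failed" (a thrown error), which an empty string alone cannot express.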

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/shared/types.ts | Adds throwOnError + maxCollectedBytes options and new error types. |
| src/node/unpdf.ts | Enforces collected-image byte ceiling and supports throwing OCR-render errors via throwOnError. |
| src/node/combined.ts | Makes OCR fallback failures explicit and propagates PDFOCRFallbackError. |
| src/node/combined.test.ts | Adds coverage for explicit OCR-fallback failures + image collection limit behavior. |
| src/node/child-extraction.ts | Revives PDFOCRFallbackError across child-process boundary. |
| README.md | Documents byte-capped image collection and new error handling branches. |
| .changeset/pdf-70-71-release.md | Adds patch changeset for issues #70 and #71. |
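
The "revives PDFOCRFallbackError across child-process boundary" entry refers to a general problem: an error sent between processes is serialized, so it arrives as plain data and `instanceof` checks fail. A minimal sketch of one way to handle this is shown below. The `serializeError` and `reviveError` helpers and the payload shape are hypothetical and are not taken from src/node/child-extraction.ts.

```typescript
// Hypothetical stand-in for the error class named in this PR.
class PDFOCRFallbackError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "PDFOCRFallbackError";
  }
}

// Assumed wire format: only plain, serializable fields.
interface SerializedErrorSketch {
  name: string;
  message: string;
}

// Child side: flatten the error before sending it to the parent.
function serializeError(error: Error): SerializedErrorSketch {
  return { name: error.name, message: error.message };
}

// Parent side: rebuild a typed error so `instanceof` works again.
function reviveError(payload: SerializedErrorSketch): Error {
  if (payload.name === "PDFOCRFallbackError") {
    return new PDFOCRFallbackError(payload.message);
  }
  const generic = new Error(payload.message);
  generic.name = payload.name;
  return generic;
}

// Simulate the process boundary with a JSON round trip.
const wire = JSON.stringify(serializeError(new PDFOCRFallbackError("OCR returned no recognized text")));
const revived = reviveError(JSON.parse(wire) as SerializedErrorSketch);
console.log(revived instanceof PDFOCRFallbackError);
```

Without the revival step, a parent-side `catch` that branches on `instanceof PDFOCRFallbackError` would silently fall through to its generic error path.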

Comment thread src/node/combined.ts Outdated
Comment thread src/node/combined.ts
Comment thread src/node/unpdf.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a92611af28

Comment thread src/node/combined.ts
@willgriffin willgriffin changed the title from "[codex] Harden OCR fallback and image extraction" to "fix(pdf): harden OCR fallback and image extraction" Apr 25, 2026
@github-actions
Contributor

Release Preview

When this PR is squash-merged, @happyvertical/pdf will receive a patch version bump based on the PR title's conventional commit format.

What happens on merge?

  1. Tests run on main branch
  2. The squash commit message is validated
  3. Version is bumped automatically
  4. Package is published to GitHub Packages
  5. Git tag is created

No manual intervention needed.
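
As an illustration of why the PR title matters here, the sketch below maps a conventional-commit title to a version bump. The mapping (breaking change to major, `feat` to minor, `fix` to patch) follows the usual Conventional Commits and semver convention; whether this repository's release workflow uses exactly these rules is an assumption, and `bumpFromTitle` is a hypothetical helper.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Parses titles shaped like "type(scope)!: description".
function bumpFromTitle(title: string): Bump {
  const match = /^(\w+)(\([^)]*\))?(!)?: .+/.exec(title);
  if (!match) return "none"; // not a conventional-commit title
  const type = match[1];
  const breaking = match[3];
  if (breaking) return "major";
  if (type === "feat") return "minor";
  if (type === "fix") return "patch";
  return "none";
}

console.log(bumpFromTitle("fix(pdf): harden OCR fallback and image extraction"));
console.log(bumpFromTitle("[codex] Harden OCR fallback and image extraction"));
```

Under this mapping the final title of this PR yields a patch bump, while the original "[codex] ..." title would not parse as a conventional commit at all.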

@willgriffin willgriffin merged commit bd1d45e into main Apr 25, 2026
4 checks passed
@willgriffin willgriffin deleted the codex/pdf-70-71-release branch April 25, 2026 20:38

Development

Successfully merging this pull request may close these issues.

  • Prevent extractImages from OOMing on large PDFs
  • Investigate OCR fallback returning empty text for scanned Bentley minutes PDF

2 participants