fix(pdf): harden OCR fallback and image extraction by willgriffin · Pull Request #72 · happyvertical/pdf

willgriffin · 2026-04-25T18:30:29Z

Summary

Make OCR fallback fail explicitly with PDFOCRFallbackError when page rendering fails or OCR returns no recognized text, while preserving graceful null behavior for invalid/missing/corrupt inputs.
Add a collected-image byte ceiling with PDFImageCollectionLimitError so large image-heavy PDFs use the existing batched onBatch path instead of retaining unbounded image buffers.
Document the large-PDF image extraction path and add a patch changeset for one release covering pdf#70 and pdf#71.
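
The two outcomes described above (a graceful `null` for unusable input versus an explicit `PDFOCRFallbackError` once OCR has actually been attempted) can be sketched as follows. This is a minimal illustration, not the package's implementation: the error class here is a local stand-in, and `extractWithOcrFallback`, `Renderer`, `Recognizer`, and the `reason` field are hypothetical names introduced only for this example. Corrupt-input detection is not modelled.

```typescript
// Local stand-in for the library's error type (assumption: the real class may differ).
class PDFOCRFallbackError extends Error {
  reason: "render-failed" | "no-text"; // hypothetical field, for illustration only

  constructor(message: string, reason: "render-failed" | "no-text") {
    super(message);
    this.name = "PDFOCRFallbackError";
    this.reason = reason;
  }
}

// Hypothetical injected steps: render a PDF to page images, then OCR one image.
type Renderer = (input: Uint8Array) => Uint8Array[];
type Recognizer = (pageImage: Uint8Array) => string;

function extractWithOcrFallback(
  input: Uint8Array | null | undefined,
  render: Renderer,
  recognize: Recognizer,
): string | null {
  // Missing or empty input: stay graceful and return null.
  if (!input || input.length === 0) return null;

  let pages: Uint8Array[];
  try {
    pages = render(input);
  } catch (cause) {
    // OCR was attempted but rendering failed: fail explicitly.
    throw new PDFOCRFallbackError(
      `page rendering failed: ${String(cause)}`,
      "render-failed",
    );
  }

  const text = pages.map(recognize).join("\n").trim();
  if (text.length === 0) {
    // OCR ran but recognized nothing: also an explicit failure, not null.
    throw new PDFOCRFallbackError("OCR returned no recognized text", "no-text");
  }
  return text;
}
```

With this split, a caller can treat `null` as "nothing to extract" and reserve `catch` blocks for cases where the fallback ran and failed.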

Fixes #70
Fixes #71

Validation

pnpm exec tsc --noEmit
CI=true pnpm test
pnpm run build
Biome check scoped to the touched files
Live run: the Bentley minutes PDF for March 24, 2026 yielded 6063 extracted characters through the built package

Copilot

Pull request overview

Hardens Node PDF OCR fallback and image extraction to fail explicitly in “OCR attempted but produced nothing / rendering failed” cases, and to prevent unbounded memory growth when collecting extracted image buffers.

Changes:

Add explicit PDFOCRFallbackError on OCR fallback rendering failures and on “OCR returned no text”.
Add maxCollectedBytes guard with PDFImageCollectionLimitError to cap retained image buffers when collect is enabled.
Update docs, tests, and changeset to reflect the new failure modes and large-PDF image extraction guidance.
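
The `maxCollectedBytes` guard can be pictured as a running byte total that is checked before each image is retained. The sketch below is an assumption-laden illustration rather than the code in `src/node/unpdf.ts`: the option names `collect`, `maxCollectedBytes`, and `onBatch` come from this PR, but their exact types, the `batchSize` option, the `ExtractedImage` shape, and the `collectImages` function are hypothetical.

```typescript
// Local stand-in for the library's error type (assumption).
class PDFImageCollectionLimitError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PDFImageCollectionLimitError";
  }
}

// Hypothetical image shape used only in this sketch.
interface ExtractedImage {
  page: number;
  data: Uint8Array;
}

interface ImageCollectionOptions {
  collect?: boolean;
  maxCollectedBytes?: number;
  onBatch?: (batch: ExtractedImage[]) => void;
  batchSize?: number; // hypothetical; the real batching rule may differ
}

function collectImages(
  images: Iterable<ExtractedImage>,
  options: ImageCollectionOptions = {},
): ExtractedImage[] {
  const {
    collect = true,
    maxCollectedBytes = Number.POSITIVE_INFINITY,
    onBatch,
    batchSize = 2,
  } = options;
  const collected: ExtractedImage[] = [];
  let collectedBytes = 0;
  let batch: ExtractedImage[] = [];

  for (const image of images) {
    if (collect) {
      // Check the ceiling before retaining the buffer, so memory stays bounded.
      const nextTotal = collectedBytes + image.data.byteLength;
      if (nextTotal > maxCollectedBytes) {
        throw new PDFImageCollectionLimitError(
          `collected images would reach ${nextTotal} bytes, above the ` +
            `${maxCollectedBytes}-byte ceiling; consider collect: false with onBatch`,
        );
      }
      collectedBytes = nextTotal;
      collected.push(image);
    }
    if (onBatch) {
      // Batched path: hand images to the caller and drop the reference.
      batch.push(image);
      if (batch.length >= batchSize) {
        onBatch(batch);
        batch = [];
      }
    }
  }
  if (onBatch && batch.length > 0) onBatch(batch);
  return collected;
}
```

For an image-heavy PDF, passing `collect: false` together with `onBatch` keeps at most one batch in memory at a time in this sketch, which is the behaviour the limit error is meant to steer callers toward.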

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/shared/types.ts | Adds `throwOnError` + `maxCollectedBytes` options and new error types. |
| src/node/unpdf.ts | Enforces collected-image byte ceiling and supports throwing OCR-render errors via `throwOnError`. |
| src/node/combined.ts | Makes OCR fallback failures explicit and propagates `PDFOCRFallbackError`. |
| src/node/combined.test.ts | Adds coverage for explicit OCR-fallback failures + image collection limit behavior. |
| src/node/child-extraction.ts | Revives `PDFOCRFallbackError` across child-process boundary. |
| README.md | Documents byte-capped image collection and new error handling branches. |
| .changeset/pdf-70-71-release.md | Adds patch changeset for issues #70 and #71. |
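
"Reviving" an error across a child-process boundary is needed because an error sent between processes arrives as plain data, so `instanceof` checks in the parent no longer match. The sketch below shows one common way to handle that, assuming a JSON-like payload; it is not the code in `src/node/child-extraction.ts`, and `SerializedError`, `serializeError`, and `reviveError` are hypothetical names.

```typescript
// Local stand-in for the library's error type (assumption).
class PDFOCRFallbackError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PDFOCRFallbackError";
  }
}

// Hypothetical wire format for an error crossing the process boundary.
interface SerializedError {
  name: string;
  message: string;
  stack?: string;
}

// Child side: flatten the error into plain data that survives serialization.
function serializeError(error: unknown): SerializedError {
  if (error instanceof Error) {
    return { name: error.name, message: error.message, stack: error.stack };
  }
  return { name: "Error", message: String(error) };
}

// Parent side: rebuild a typed error so `instanceof` works again.
function reviveError(payload: SerializedError): Error {
  const revived =
    payload.name === "PDFOCRFallbackError"
      ? new PDFOCRFallbackError(payload.message)
      : new Error(payload.message);
  if (payload.stack) revived.stack = payload.stack;
  return revived;
}
```

The design choice here is to key the revival on the error's `name`, which travels as a plain string, instead of relying on the prototype chain, which does not.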


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a92611af28


github-actions · 2026-04-25T20:06:23Z

Release Preview

When this PR is squash-merged, @happyvertical/pdf will receive a patch version bump based on the PR title's conventional commit format.

What happens on merge?

Tests run on main branch
The squash commit message is validated
Version is bumped automatically
Package is published to GitHub Packages
Git tag is created

No manual intervention needed.
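
How a PR title maps to a version bump can be sketched with the usual Conventional Commits convention (`fix` for patch, `feat` for minor, `!` for a breaking change). This is only an illustration of that convention; the repository's actual release workflow is not shown on this page, and `bumpFromTitle`, the `Bump` type, and the treatment of other commit types as `"none"` are assumptions.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Returns the bump implied by a conventional-commit title,
// or null when the title does not follow the format at all.
function bumpFromTitle(title: string): Bump | null {
  // type, optional (scope), optional "!", then ": " and a description
  const match = /^([a-z]+)(\([^)]+\))?(!)?: \S/.exec(title);
  if (!match) return null;

  const [, type, , breaking] = match;
  if (breaking) return "major";
  if (type === "feat") return "minor";
  if (type === "fix") return "patch";
  return "none"; // assumption: other types (docs, chore, ...) do not release
}
```

Under these assumptions, `bumpFromTitle("fix(pdf): harden OCR fallback and image extraction")` yields `"patch"`, while a title without a `type:` prefix yields `null` and would fail validation.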

fix(pdf): harden OCR fallback and image extraction

a92611a

willgriffin marked this pull request as ready for review April 25, 2026 19:58

willgriffin requested a review from Copilot April 25, 2026 19:59

Copilot started reviewing on behalf of willgriffin April 25, 2026 19:59

Copilot AI reviewed Apr 25, 2026


Comment thread src/node/combined.ts Outdated

Comment thread src/node/combined.ts

Comment thread src/node/unpdf.ts

chatgpt-codex-connector Bot reviewed Apr 25, 2026


Comment thread src/node/combined.ts

willgriffin changed the title from "[codex] Harden OCR fallback and image extraction" to "fix(pdf): harden OCR fallback and image extraction" Apr 25, 2026

fix(pdf): address OCR batching edge cases

7a328a9

fix(pdf): accept arraybuffer render sources

9fae730

willgriffin merged commit bd1d45e into main Apr 25, 2026
4 checks passed

willgriffin deleted the codex/pdf-70-71-release branch April 25, 2026 20:38


willgriffin merged 3 commits into main from codex/pdf-70-71-release


2 participants
