fix(pdf): harden OCR fallback and image extraction #72
Pull request overview
Hardens Node PDF OCR fallback and image extraction to fail explicitly in “OCR attempted but produced nothing / rendering failed” cases, and to prevent unbounded memory growth when collecting extracted image buffers.
Changes:
- Add explicit `PDFOCRFallbackError` on OCR fallback rendering failures and on “OCR returned no text”.
- Add `maxCollectedBytes` guard with `PDFImageCollectionLimitError` to cap retained image buffers when `collect` is enabled.
- Update docs, tests, and changeset to reflect the new failure modes and large-PDF image extraction guidance.
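
The byte ceiling described in the second bullet can be illustrated with a minimal sketch. The `BoundedImageCollector` class and the error's constructor arguments are hypothetical, written only for illustration; the PR's actual implementation in `src/node/unpdf.ts` may be structured differently.

```typescript
// Hypothetical stand-in for the error type named in this PR; the real class
// ships with the package and may carry different fields.
class PDFImageCollectionLimitError extends Error {
  readonly limit: number;
  readonly attempted: number;

  constructor(limit: number, attempted: number) {
    super(`Collected image bytes (${attempted}) exceed maxCollectedBytes (${limit})`);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "PDFImageCollectionLimitError";
    this.limit = limit;
    this.attempted = attempted;
  }
}

// Retains image buffers only while the running total stays within the cap.
class BoundedImageCollector {
  private total = 0;
  private readonly maxCollectedBytes: number;
  readonly images: Uint8Array[] = [];

  constructor(maxCollectedBytes: number) {
    this.maxCollectedBytes = maxCollectedBytes;
  }

  add(image: Uint8Array): void {
    const next = this.total + image.byteLength;
    // Check before retaining, so the rejected buffer is never stored.
    if (next > this.maxCollectedBytes) {
      throw new PDFImageCollectionLimitError(this.maxCollectedBytes, next);
    }
    this.total = next;
    this.images.push(image);
  }

  get collectedBytes(): number {
    return this.total;
  }
}
```

Checking the total before pushing means a caller that catches the error still holds a collector whose contents are within the limit.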
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/shared/types.ts | Adds throwOnError + maxCollectedBytes options and new error types. |
| src/node/unpdf.ts | Enforces collected-image byte ceiling and supports throwing OCR-render errors via throwOnError. |
| src/node/combined.ts | Makes OCR fallback failures explicit and propagates PDFOCRFallbackError. |
| src/node/combined.test.ts | Adds coverage for explicit OCR-fallback failures + image collection limit behavior. |
| src/node/child-extraction.ts | Revives PDFOCRFallbackError across child-process boundary. |
| README.md | Documents byte-capped image collection and new error handling branches. |
| .changeset/pdf-70-71-release.md | Adds patch changeset for issues #70 and #71. |
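
The `src/node/child-extraction.ts` row mentions reviving `PDFOCRFallbackError` across the child-process boundary. A typed error loses its class once it is serialized into a message, so the parent has to rebuild it. The sketch below shows one way that could work; `SerializedError`, `serializeError`, and `reviveError` are hypothetical names, and the local `PDFOCRFallbackError` is a stand-in, not the package's class.

```typescript
// Stand-in for the PR's error class, defined here so the sketch is self-contained.
class PDFOCRFallbackError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "PDFOCRFallbackError";
  }
}

// Plain-data shape that survives JSON or IPC serialization.
interface SerializedError {
  name: string;
  message: string;
}

// Child side: flatten any thrown value into plain data.
function serializeError(err: unknown): SerializedError {
  if (err instanceof Error) {
    return { name: err.name, message: err.message };
  }
  return { name: "Error", message: String(err) };
}

// Parent side: rebuild the typed error so `instanceof` checks work again.
function reviveError(payload: SerializedError): Error {
  if (payload.name === "PDFOCRFallbackError") {
    return new PDFOCRFallbackError(payload.message);
  }
  const generic = new Error(payload.message);
  generic.name = payload.name;
  return generic;
}
```

Keying the revival on `name` keeps the wire format as plain data while still letting callers in the parent process branch on the error class.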
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a92611af28
Release Preview
When this PR is squash-merged, @happyvertical/pdf will receive a patch version bump based on the PR title's conventional commit format.
No manual intervention needed.
Summary
- Add explicit `PDFOCRFallbackError` when page rendering fails or OCR returns no recognized text, while preserving graceful `null` behavior for invalid/missing/corrupt inputs.
- Add `PDFImageCollectionLimitError` so large image-heavy PDFs use the existing batched `onBatch` path instead of retaining unbounded image buffers.

Fixes #70
Fixes #71
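
From a caller's point of view, the summary implies three outcomes: extracted text, a graceful `null` for unreadable input, and a thrown `PDFOCRFallbackError` when OCR was attempted but failed. The sketch below shows how a caller might separate them. The `classifyExtraction` helper, the `Outcome` type, and the `extract` callback are hypothetical and stand in for whatever extraction call the package exposes; the error class is a local stand-in as well.

```typescript
// Stand-in for the PR's error class, defined here so the sketch is self-contained.
class PDFOCRFallbackError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "PDFOCRFallbackError";
  }
}

type Outcome =
  | { kind: "text"; text: string }
  | { kind: "unreadable" } // invalid, missing, or corrupt input: graceful null
  | { kind: "ocr-failed"; reason: string }; // OCR attempted but produced nothing

// `extract` is a hypothetical callback wrapping the real extraction call.
async function classifyExtraction(
  extract: () => Promise<string | null>,
): Promise<Outcome> {
  try {
    const text = await extract();
    return text === null ? { kind: "unreadable" } : { kind: "text", text };
  } catch (err) {
    if (err instanceof PDFOCRFallbackError) {
      return { kind: "ocr-failed", reason: err.message };
    }
    // Anything else is unexpected and should not be swallowed.
    throw err;
  }
}
```

Separating `unreadable` from `ocr-failed` lets a caller retry or alert on OCR failures without treating every bad input file as an incident.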
Validation
- `pnpm exec tsc --noEmit`
- `CI=true pnpm test`
- `pnpm run build`
- `biome check`