fix(pdf): node unpdf asset loading for image extraction #68
Release Preview
When this PR is squash-merged, @happyvertical/pdf will receive a patch version bump based on the PR title's conventional commit format. No manual intervention needed.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8182208310
Pull request overview
This PR hardens Node-side unpdf document loading to reliably resolve PDF.js standard fonts and OpenJPEG/JPEG2000 wasm assets during image extraction (issue #67).
Changes:
- Configure `unpdf` to use `pdfjs-dist/legacy` in Node and ensure the PDF.js worker is set up consistently.
- Resolve and pass Node-friendly `standardFontDataUrl`, `wasmUrl`, and `useWorkerFetch: false` through all `getDocumentProxy(...)` calls, and centralize document cleanup.
- Add regression tests covering runtime selection and asset-path options being passed during batched image extraction.
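The "centralize document cleanup" item can be pictured as a small wrapper that always destroys the document, even when extraction throws. This is only a sketch, not the code in `src/node/unpdf.ts`: `withDocument` and the `Destroyable` interface are hypothetical names, and the only assumption about the real API is that a loaded PDF.js document exposes an async `destroy()` method.

```typescript
// Hypothetical shape: anything with an async destroy(), as a PDF.js document proxy is assumed to have.
interface Destroyable {
  destroy(): Promise<void>;
}

// Hypothetical helper: load a document, run `use` on it, and always release it afterwards.
async function withDocument<D extends Destroyable, R>(
  load: () => Promise<D>,
  use: (doc: D) => Promise<R>,
): Promise<R> {
  const doc = await load();
  try {
    return await use(doc);
  } finally {
    // Runs on success and on failure, so no code path skips cleanup.
    await doc.destroy();
  }
}
```

Routing every load through one helper like this keeps the cleanup rule in a single place instead of repeating `try`/`finally` at each call site.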
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/node/unpdf.ts` | Adds PDF.js runtime/asset configuration for Node, threads document options through `getDocumentProxy`, and ensures consistent document cleanup. |
| `src/node/combined.test.ts` | Adds tests asserting the legacy runtime is selected and Node asset/document options are applied during extraction. |
Summary
This fixes the remaining Node-side asset loading gap behind issue #67.
`unpdf` image extraction in Node was still vulnerable to missing PDF.js assets during document loading, which could surface as standard font warnings and JPEG2000/OpenJPEG decode failures during `extractImages()` runs.
Root Cause
In Node, the `unpdf` provider was still relying on default document loading behavior without explicitly configuring a Node-friendly PDF.js runtime and asset paths.
That left standard font and wasm asset resolution too implicit for the large-document image extraction path, which is exactly where issue #67 was still reproducing.
What Changed
- Configure `unpdf` to use the official `pdfjs-dist/legacy` runtime in Node
- Resolve `standardFontDataUrl` and `wasmUrl` from installed `pdfjs-dist` asset directories
- Pass those options through every `getDocumentProxy(...)` path used by the provider
Impact
Node `extractImages()` is now much more reliable for PDFs that depend on standard fonts or OpenJPEG/JPEG2000 assets, especially large batched extractions like the issue #67 repro document.
Fixes #67.
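The asset resolution described under What Changed can be sketched roughly as follows. This is an illustration under assumptions, not the code in `src/node/unpdf.ts`: `buildNodeDocumentOptions` and `asDirectoryUrl` are hypothetical helpers, the `standard_fonts` and `wasm` directory names assume the layout of a recent `pdfjs-dist` install, and the caller is assumed to already know the package root. Only `standardFontDataUrl`, `wasmUrl`, and `useWorkerFetch` come from this PR.

```typescript
import * as path from "node:path";

// Option names are taken from this PR; the interface itself is hypothetical.
interface NodeDocumentOptions {
  standardFontDataUrl: string;
  wasmUrl: string;
  useWorkerFetch: false;
}

// PDF.js appends file names directly to these values,
// so each directory path needs a trailing separator.
function asDirectoryUrl(dir: string): string {
  return dir.endsWith(path.sep) ? dir : dir + path.sep;
}

// Hypothetical helper: derive Node-friendly asset locations from the
// installed pdfjs-dist package root (directory names are assumptions).
function buildNodeDocumentOptions(pdfjsRoot: string): NodeDocumentOptions {
  return {
    standardFontDataUrl: asDirectoryUrl(path.join(pdfjsRoot, "standard_fonts")),
    wasmUrl: asDirectoryUrl(path.join(pdfjsRoot, "wasm")),
    useWorkerFetch: false,
  };
}
```

An object like this would then be merged into the options argument of each `getDocumentProxy(...)` call, which is where the PR says these values are threaded.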
Validation
- `pnpm test`
- Extraction run against the repro PDF (`20260224_RCM_Agenda.pdf`)
- No `Unable to load font data` warnings
- No `OpenJPEG failed to initialize` warnings
Notes
`pnpm lint` still reports an unrelated pre-existing formatting issue in `package.json` that is not part of this change.