Skip to content

fix(pdf): node unpdf asset loading for image extraction#68

Merged
willgriffin merged 2 commits into
mainfrom
codex/issue-67-unpdf-node-assets
Apr 23, 2026
Merged

fix(pdf): node unpdf asset loading for image extraction#68
willgriffin merged 2 commits into
mainfrom
codex/issue-67-unpdf-node-assets

Conversation

@willgriffin
Copy link
Copy Markdown
Contributor

Summary

This fixes the remaining Node-side asset loading gap behind issue #67.

unpdf image extraction in Node was still vulnerable to missing PDF.js assets during document loading, which could surface as standard font warnings and JPEG2000/OpenJPEG decode failures during extractImages() runs.

Root Cause

In Node, the unpdf provider was still relying on default document loading behavior without explicitly configuring a Node-friendly PDF.js runtime and asset paths.

That left standard font and wasm asset resolution too implicit for the large-document image extraction path, which is exactly where issue #67 was still reproducing.

What Changed

  • configure unpdf to use the official pdfjs-dist/legacy runtime in Node
  • resolve standardFontDataUrl and wasmUrl from installed pdfjs-dist asset directories
  • disable worker fetch for those document assets so Node uses filesystem-backed loading
  • pass the document options through every getDocumentProxy(...) path used by the provider
  • add regression coverage for runtime selection and Node asset-path behavior during image extraction
  • centralize PDF document cleanup for the affected paths

Impact

Node extractImages() is now much more reliable for PDFs that depend on standard fonts or OpenJPEG/JPEG2000 assets, especially large batched extractions like the issue #67 repro document.

Fixes #67.

Validation

Notes

  • pnpm lint still reports an unrelated pre-existing formatting issue in package.json that is not part of this change.

@willgriffin willgriffin marked this pull request as ready for review April 23, 2026 19:19
@willgriffin willgriffin changed the title [codex] Fix Node unpdf asset loading for image extraction fix(pdf): node unpdf asset loading for image extraction Apr 23, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Release Preview

When this PR is squash-merged, @happyvertical/pdf will receive a patch version bump based on the PR title's conventional commit format.

What happens on merge?

  1. Tests run on main branch
  2. The squash commit message is validated
  3. Version is bumped automatically
  4. Package is published to GitHub Packages
  5. Git tag is created

No manual intervention needed.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8182208310

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/node/unpdf.ts Outdated
Comment thread src/node/unpdf.ts
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens Node-side unpdf document loading to reliably resolve PDF.js standard fonts and OpenJPEG/JPEG2000 wasm assets during image extraction (issue #67).

Changes:

  • Configure unpdf to use pdfjs-dist/legacy in Node and ensure the PDF.js worker is set up consistently.
  • Resolve and pass Node-friendly standardFontDataUrl, wasmUrl, and useWorkerFetch: false through all getDocumentProxy(...) calls, and centralize document cleanup.
  • Add regression tests covering runtime selection and asset-path options being passed during batched image extraction.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/node/unpdf.ts Adds PDF.js runtime/asset configuration for Node, threads document options through getDocumentProxy, and ensures consistent document cleanup.
src/node/combined.test.ts Adds tests asserting the legacy runtime is selected and Node asset/document options are applied during extraction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/node/unpdf.ts Outdated
@willgriffin willgriffin merged commit 4dfba17 into main Apr 23, 2026
4 checks passed
@willgriffin willgriffin deleted the codex/issue-67-unpdf-node-assets branch April 23, 2026 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Node unpdf provider still fails to load standard fonts/OpenJPEG assets for image extraction

2 participants