feat(pdf): web-safe image output + canonical mime types (#73, #74) by willgriffin · Pull Request #75 · happyvertical/pdf

willgriffin · 2026-04-27T00:59:22Z

Closes #73, closes #74. Companion to happyvertical/ocr#75 (open PR happyvertical/ocr#76).

Summary

extractImages now accepts outputFormat: 'webp' | 'png' | 'jpeg' | 'original' and re-encodes raw image streams via @napi-rs/canvas (already a dep — no new deps). Default is 'webp' so callers can store/serve image.data directly.
renderPages accepts the same option; default stays 'original' so the internal OCR fallback path stays zero-copy.
PDFImage.format is now a canonical lowercase IANA mime type (image/jpeg, image/png, image/webp, image/tiff, image/x-rgb, image/x-rgba, image/x-grayscale, image/x-cmyk, application/octet-stream).
PDFImage.bitsPerComponent is populated on raw outputs (always 8 from unpdf), forward-compatible with feat(ocr): add bitsPerComponent to OCRImage ocr#76.
New exports: canonicalizeImageFormat (pure) and encodePDFImage (lazy Node-only re-encoder).
Encoded outputs drop channels and bitsPerComponent so the buffer cannot be misread as raw pixel data by OCR providers that key on width && height && channels. width/height stay set as metadata.

Migration

Callers that consumed raw RGB bytes (format === 'rgb') → pass outputFormat: 'original' and check image.format === 'image/x-rgb'.
Callers that re-encoded extracted images downstream (e.g. via sharp) → drop the re-encoder; image.data is now WebP by default.
extractImages(source, { onBatch }) streaming callers receive WebP unless they pass outputFormat: 'original'.
OCR-on-extracted-images flows that fed extractImages() → performOCR() → pass outputFormat: 'original' (raw RGB is the OCR fast path). Internal OCR fallback (extractText() / renderPages() → OCR) is unaffected.

Full migration notes are in .changeset/web-safe-image-extraction.md.

Notable design decisions

Default for extractImages is 'webp' (issue extractImages should produce web-safe encoded output (or expose an encoder helper) #73's preferred design); default for renderPages is 'original' so the OCR hot path stays zero-copy. Verified in combined.ts that no internal call site passes raw extractImages() output to OCR.
bitsPerComponent is declared on PDFImage (not OCRImage) so this PR doesn't block on feat(ocr): add bitsPerComponent to OCRImage ocr#76 publishing. When that ships and the OCR dep is bumped, both interfaces will declare the same optional field — TypeScript merges cleanly.
canonicalizeImageFormat lives in src/shared/ (no Node deps) so the shared entry point stays browser-safe; encodePDFImage is exposed via await import('./node/image-encoding') so importing @happyvertical/pdf doesn't pull in @napi-rs/canvas until you actually call it.
The encoder's isRaw8BitPixelImage guard is gated on bitsPerComponent ?? 8 === 8 plus exact width * height * channels byte-length match, so it cannot misinterpret a 16-bit grayscale or 2-channel buffer as raw 8-bit (the exact ambiguity ocr#75 calls out). Covered by a unit test.

Test plan

npm test — 113/113 passing (added 11 new tests).
npx tsc --noEmit — clean.
npx biome check src — clean.
npm run build — clean (8.08 kB index, 3.15 kB lazy image-encoding chunk, 12.40 kB image-format chunk).
Independent code review pass (1 finding, fixed before push: encoded output now also drops channels).
Round-trip on a real scanned PDF: WebP-encoded extractImages output verified by RIFF/WEBP magic bytes; OCR-on-extracted-images flow with outputFormat: 'original' verified end-to-end.

Closes #73 and #74; companion to happyvertical/ocr#75 (PR happyvertical/ocr#76). - extractImages defaults outputFormat to 'webp' so callers can store/serve image.data directly. Pass outputFormat: 'original' for the previous raw RGB behavior (still the OCR fast path). - renderPages defaults outputFormat to 'original' so the OCR fallback pipeline stays zero-copy. - PDFImage.format is now a canonical lowercase IANA mime type (image/jpeg, image/png, image/webp, image/tiff, image/x-rgb, image/x-rgba, image/x-grayscale, image/x-cmyk, application/octet-stream). - PDFImage.bitsPerComponent is populated for raw outputs (8 from unpdf), forward-compatible with happyvertical/ocr#76. - New exports: canonicalizeImageFormat (pure) and encodePDFImage (lazy Node-only re-encoder using @napi-rs/canvas, an existing dep). - Encoded outputs drop channels and bitsPerComponent so the buffer is not misread as raw pixel data by OCR providers that key on width && height && channels.

github-actions · 2026-04-27T01:00:48Z

Release Preview

When this PR is squash-merged, @happyvertical/pdf will receive a patch version bump based on the PR title's conventional commit format.

What happens on merge?

Tests run on main branch
The squash commit message is validated
Version is bumped automatically
Package is published to GitHub Packages
Git tag is created

No manual intervention needed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4c9f4c9e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-27T01:05:26Z

+    // Canonicalize `format` to a lowercase IANA mime type per #74 so
+    // downstream consumers don't re-implement the same normalizer.
+    const format = canonicalizeImageFormat(undefined, image.channels);


Avoid inferring raw mime for non-raw extracted buffers

When the extracted payload is not a raw pixel buffer (rawData.length !== width * height * channels), this still canonicalizes using only channels, which produces raw mime types like image/x-rgb/image/x-rgba. In those cases applyOutputFormat cannot safely re-encode and returns the original bytes, so downstream consumers can receive encoded/opaque bytes mislabeled as raw pixels and mis-decode them. This is particularly risky for 4-channel CMYK streams, which get labeled as RGBA and can be color-corrupted when converted.

Useful? React with 👍 / 👎.

Fixed in f02af1e. convertExtractedImage now does byte-length-confirms-raw-layout first, falls back to magic-byte sniffing (PNG/JPEG/WebP/TIFF), then application/octet-stream. Channel-based raw inference only fires when the layout is byte-length-confirmed, so a 4-channel CMYK stream is no longer mis-labelled image/x-rgba. Also fixed an idempotency bug where canonicalizeImageFormat was re-running channel inference on application/octet-stream and upgrading opaque blobs back to image/x-rgb. Regression test added in combined.test.ts.

Copilot

Pull request overview

Adds web-safe image outputs and canonical MIME typing to the PDF image extraction/rendering pipeline, so extracted images are directly storable/servable by default while preserving a raw-pixel escape hatch for OCR hot paths.

Changes:

Add outputFormat + encodingOptions to extractImages (default webp) and renderPages (default original), with Node-side re-encoding via @napi-rs/canvas.
Standardize PDFImage.format to canonical lowercase IANA MIME types and add bitsPerComponent on raw outputs.
Introduce canonicalizeImageFormat (shared/pure) and encodePDFImage (lazy Node-only) helpers, plus unit/integration test updates and a changeset.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
src/shared/types.ts	Extends options/types (`outputFormat`, `encodingOptions`, canonical MIME types, `bitsPerComponent`).
src/shared/image-format.ts	Adds browser-safe canonical MIME normalizer.
src/node/unpdf.ts	Applies output formatting/encoding defaults and canonicalizes image metadata during extraction/rendering.
src/node/image-encoding.ts	Implements Node-only raw-pixel → web-safe encoding helpers and batch application.
src/node/image-encoding.test.ts	Adds unit coverage for canonicalization + encoding behavior/guards.
src/extraction.test.ts	Verifies default WebP extraction + raw-output metadata guarantees.
src/ocr-integration.test.ts	Pins OCR-on-extracted-images to `outputFormat: 'original'`.
src/node/combined.test.ts	Adjusts tests to pin raw output where assertions depend on mocked bytes.
src/index.ts	Exports `canonicalizeImageFormat` and a lazy Node-only `encodePDFImage` wrapper.
.changeset/web-safe-image-extraction.md	Documents the behavior change and migration guidance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-27T01:07:35Z

+    // Canonicalize `format` to a lowercase IANA mime type per #74 so
+    // downstream consumers don't re-implement the same normalizer.
+    const format = canonicalizeImageFormat(undefined, image.channels);
+


convertExtractedImage currently sets format via canonicalizeImageFormat(undefined, image.channels), which will label any 3-channel image as image/x-rgb even when the buffer is not a raw pixel buffer (e.g., an encoded JPEG/PNG stream where rawData.length !== width*height*channels). This can cause downstream consumers/OCR to misinterpret encoded bytes as raw pixels. Consider only inferring image/x-* formats when the byte-length check confirms a raw 8-bit layout; otherwise set format to an encoded mime type via magic-byte sniffing (jpeg/png/webp/tiff) or fall back to application/octet-stream.

Fixed in f02af1e — same approach as you suggested: byte-length raw-layout check is now authoritative, with magic-byte sniffing as the fallback for non-raw streams. Test in combined.test.ts covers JPEG, PNG, raw RGB (with confirmed layout), and opaque-garbage cases.

Copilot · 2026-04-27T01:07:35Z

+  if (canonicalFormat === targetMime) {
+    return { ...image, format: targetMime };


In encodePDFImage, the pass-through branch for canonicalFormat === targetMime returns { ...image, format: targetMime }, which preserves channels / bitsPerComponent if they were present on the input. The PR’s contract/docs say encoded outputs should drop these fields to avoid consumers/OCR heuristics misreading the encoded buffer as raw pixel data. Consider clearing channels and bitsPerComponent (and any other raw-pixel-only metadata) in this branch as well so all encoded outputs behave consistently.

Suggested change

if (canonicalFormat === targetMime) {

return { ...image, format: targetMime };

// Even though the bytes are unchanged, this is still an encoded output,

// so drop raw-pixel-only metadata for consistency with the re-encoded path.

if (canonicalFormat === targetMime) {

return {

...image,

format: targetMime,

bitsPerComponent: undefined,

channels: undefined,

};

Fixed in f02af1e — applied your exact suggestion. Pass-through branch now drops channels and bitsPerComponent for consistency with the re-encoded path. Test added at image-encoding.test.ts ("drops raw-pixel metadata in the pass-through (already-encoded) branch").

Resolves Codex+Copilot review on PR #75: - convertExtractedImage: only infer raw image/x-* when byte-length confirms a raw 8-bit layout. Otherwise sniff magic bytes (PNG, JPEG, WebP, TIFF) and fall back to application/octet-stream. Prevents encoded streams embedded in PDFs (or 4-channel CMYK) from being mis-labelled as raw RGB. - encodePDFImage pass-through (when source already matches target) now drops channels and bitsPerComponent for consistency with the re-encoded path. Prevents OCR providers from mis-reading an encoded buffer as raw pixel data via width && height && channels heuristics. - canonicalizeImageFormat treats application/octet-stream as idempotent (regression: applyOutputFormat's 'original' branch was re-running channel inference on intentionally-opaque blobs and upgrading them back to image/x-rgb). - New shared helper detectImageMimeFromMagicBytes for PNG/JPEG/WebP/TIFF. - Format package.json (CI lint requirement). 3 new tests added (sniffing dispatch, pass-through metadata drop, octet-stream idempotency). 118/118 passing.

willgriffin requested a review from Copilot April 27, 2026 01:02

Copilot started reviewing on behalf of willgriffin April 27, 2026 01:03 View session

chatgpt-codex-connector Bot reviewed Apr 27, 2026

View reviewed changes

Copilot AI reviewed Apr 27, 2026

View reviewed changes

willgriffin merged commit 68680e5 into main Apr 27, 2026
4 checks passed

willgriffin deleted the claude/eloquent-yalow-d4aa30 branch April 27, 2026 01:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(pdf): web-safe image output + canonical mime types (#73, #74)#75

feat(pdf): web-safe image output + canonical mime types (#73, #74)#75
willgriffin merged 2 commits into
mainfrom
claude/eloquent-yalow-d4aa30

willgriffin commented Apr 27, 2026

Uh oh!

github-actions Bot commented Apr 27, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 27, 2026

Uh oh!

willgriffin Apr 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

willgriffin Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

willgriffin Apr 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if (canonicalFormat === targetMime) {
		return { ...image, format: targetMime };

Conversation

willgriffin commented Apr 27, 2026

Summary

Migration

Notable design decisions

Test plan

Uh oh!

github-actions Bot commented Apr 27, 2026

Release Preview

What happens on merge?

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

willgriffin Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

willgriffin Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

willgriffin Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants