Skip to content

feat(pdf): web-safe image output + canonical mime types (#73, #74)#75

Merged
willgriffin merged 2 commits into
mainfrom
claude/eloquent-yalow-d4aa30
Apr 27, 2026
Merged

feat(pdf): web-safe image output + canonical mime types (#73, #74)#75
willgriffin merged 2 commits into
mainfrom
claude/eloquent-yalow-d4aa30

Conversation

@willgriffin
Copy link
Copy Markdown
Contributor

Closes #73, closes #74. Companion to happyvertical/ocr#75 (open PR happyvertical/ocr#76).

Summary

  • extractImages now accepts outputFormat: 'webp' | 'png' | 'jpeg' | 'original' and re-encodes raw image streams via @napi-rs/canvas (already a dep — no new deps). Default is 'webp' so callers can store/serve image.data directly.
  • renderPages accepts the same option; default stays 'original' so the internal OCR fallback path stays zero-copy.
  • PDFImage.format is now a canonical lowercase IANA mime type (image/jpeg, image/png, image/webp, image/tiff, image/x-rgb, image/x-rgba, image/x-grayscale, image/x-cmyk, application/octet-stream).
  • PDFImage.bitsPerComponent is populated on raw outputs (always 8 from unpdf), forward-compatible with feat(ocr): add bitsPerComponent to OCRImage ocr#76.
  • New exports: canonicalizeImageFormat (pure) and encodePDFImage (lazy Node-only re-encoder).
  • Encoded outputs drop channels and bitsPerComponent so the buffer cannot be misread as raw pixel data by OCR providers that key on width && height && channels. width/height stay set as metadata.

Migration

  • Callers that consumed raw RGB bytes (format === 'rgb') → pass outputFormat: 'original' and check image.format === 'image/x-rgb'.
  • Callers that re-encoded extracted images downstream (e.g. via sharp) → drop the re-encoder; image.data is now WebP by default.
  • extractImages(source, { onBatch }) streaming callers receive WebP unless they pass outputFormat: 'original'.
  • OCR-on-extracted-images flows that fed extractImages()performOCR() → pass outputFormat: 'original' (raw RGB is the OCR fast path). Internal OCR fallback (extractText() / renderPages() → OCR) is unaffected.

Full migration notes are in .changeset/web-safe-image-extraction.md.

Notable design decisions

  • Default for extractImages is 'webp' (issue extractImages should produce web-safe encoded output (or expose an encoder helper) #73's preferred design); default for renderPages is 'original' so the OCR hot path stays zero-copy. Verified in combined.ts that no internal call site passes raw extractImages() output to OCR.
  • bitsPerComponent is declared on PDFImage (not OCRImage) so this PR doesn't block on feat(ocr): add bitsPerComponent to OCRImage ocr#76 publishing. When that ships and the OCR dep is bumped, both interfaces will declare the same optional field — TypeScript merges cleanly.
  • canonicalizeImageFormat lives in src/shared/ (no Node deps) so the shared entry point stays browser-safe; encodePDFImage is exposed via await import('./node/image-encoding') so importing @happyvertical/pdf doesn't pull in @napi-rs/canvas until you actually call it.
  • The encoder's isRaw8BitPixelImage guard is gated on bitsPerComponent ?? 8 === 8 plus exact width * height * channels byte-length match, so it cannot misinterpret a 16-bit grayscale or 2-channel buffer as raw 8-bit (the exact ambiguity ocr#75 calls out). Covered by a unit test.

Test plan

  • npm test — 113/113 passing (added 11 new tests).
  • npx tsc --noEmit — clean.
  • npx biome check src — clean.
  • npm run build — clean (8.08 kB index, 3.15 kB lazy image-encoding chunk, 12.40 kB image-format chunk).
  • Independent code review pass (1 finding, fixed before push: encoded output now also drops channels).
  • Round-trip on a real scanned PDF: WebP-encoded extractImages output verified by RIFF/WEBP magic bytes; OCR-on-extracted-images flow with outputFormat: 'original' verified end-to-end.

Closes #73 and #74; companion to happyvertical/ocr#75 (PR happyvertical/ocr#76).

- extractImages defaults outputFormat to 'webp' so callers can store/serve
  image.data directly. Pass outputFormat: 'original' for the previous raw
  RGB behavior (still the OCR fast path).
- renderPages defaults outputFormat to 'original' so the OCR fallback
  pipeline stays zero-copy.
- PDFImage.format is now a canonical lowercase IANA mime type
  (image/jpeg, image/png, image/webp, image/tiff, image/x-rgb,
  image/x-rgba, image/x-grayscale, image/x-cmyk, application/octet-stream).
- PDFImage.bitsPerComponent is populated for raw outputs (8 from unpdf),
  forward-compatible with happyvertical/ocr#76.
- New exports: canonicalizeImageFormat (pure) and encodePDFImage (lazy
  Node-only re-encoder using @napi-rs/canvas, an existing dep).
- Encoded outputs drop channels and bitsPerComponent so the buffer is not
  misread as raw pixel data by OCR providers that key on width && height
  && channels.
@github-actions
Copy link
Copy Markdown
Contributor

Release Preview

When this PR is squash-merged, @happyvertical/pdf will receive a patch version bump based on the PR title's conventional commit format.

What happens on merge?

  1. Tests run on main branch
  2. The squash commit message is validated
  3. Version is bumped automatically
  4. Package is published to GitHub Packages
  5. Git tag is created

No manual intervention needed.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4c9f4c9e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/node/unpdf.ts Outdated
Comment on lines +555 to +557
// Canonicalize `format` to a lowercase IANA mime type per #74 so
// downstream consumers don't re-implement the same normalizer.
const format = canonicalizeImageFormat(undefined, image.channels);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid inferring raw mime for non-raw extracted buffers

When the extracted payload is not a raw pixel buffer (rawData.length !== width * height * channels), this still canonicalizes using only channels, which produces raw mime types like image/x-rgb/image/x-rgba. In those cases applyOutputFormat cannot safely re-encode and returns the original bytes, so downstream consumers can receive encoded/opaque bytes mislabeled as raw pixels and mis-decode them. This is particularly risky for 4-channel CMYK streams, which get labeled as RGBA and can be color-corrupted when converted.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f02af1e. convertExtractedImage now does byte-length-confirms-raw-layout first, falls back to magic-byte sniffing (PNG/JPEG/WebP/TIFF), then application/octet-stream. Channel-based raw inference only fires when the layout is byte-length-confirmed, so a 4-channel CMYK stream is no longer mis-labelled image/x-rgba. Also fixed an idempotency bug where canonicalizeImageFormat was re-running channel inference on application/octet-stream and upgrading opaque blobs back to image/x-rgb. Regression test added in combined.test.ts.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds web-safe image outputs and canonical MIME typing to the PDF image extraction/rendering pipeline, so extracted images are directly storable/servable by default while preserving a raw-pixel escape hatch for OCR hot paths.

Changes:

  • Add outputFormat + encodingOptions to extractImages (default webp) and renderPages (default original), with Node-side re-encoding via @napi-rs/canvas.
  • Standardize PDFImage.format to canonical lowercase IANA MIME types and add bitsPerComponent on raw outputs.
  • Introduce canonicalizeImageFormat (shared/pure) and encodePDFImage (lazy Node-only) helpers, plus unit/integration test updates and a changeset.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/shared/types.ts Extends options/types (outputFormat, encodingOptions, canonical MIME types, bitsPerComponent).
src/shared/image-format.ts Adds browser-safe canonical MIME normalizer.
src/node/unpdf.ts Applies output formatting/encoding defaults and canonicalizes image metadata during extraction/rendering.
src/node/image-encoding.ts Implements Node-only raw-pixel → web-safe encoding helpers and batch application.
src/node/image-encoding.test.ts Adds unit coverage for canonicalization + encoding behavior/guards.
src/extraction.test.ts Verifies default WebP extraction + raw-output metadata guarantees.
src/ocr-integration.test.ts Pins OCR-on-extracted-images to outputFormat: 'original'.
src/node/combined.test.ts Adjusts tests to pin raw output where assertions depend on mocked bytes.
src/index.ts Exports canonicalizeImageFormat and a lazy Node-only encodePDFImage wrapper.
.changeset/web-safe-image-extraction.md Documents the behavior change and migration guidance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/node/unpdf.ts Outdated
Comment on lines +555 to +558
// Canonicalize `format` to a lowercase IANA mime type per #74 so
// downstream consumers don't re-implement the same normalizer.
const format = canonicalizeImageFormat(undefined, image.channels);

Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convertExtractedImage currently sets format via canonicalizeImageFormat(undefined, image.channels), which will label any 3-channel image as image/x-rgb even when the buffer is not a raw pixel buffer (e.g., an encoded JPEG/PNG stream where rawData.length !== width*height*channels). This can cause downstream consumers/OCR to misinterpret encoded bytes as raw pixels. Consider only inferring image/x-* formats when the byte-length check confirms a raw 8-bit layout; otherwise set format to an encoded mime type via magic-byte sniffing (jpeg/png/webp/tiff) or fall back to application/octet-stream.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f02af1e — same approach as you suggested: byte-length raw-layout check is now authoritative, with magic-byte sniffing as the fallback for non-raw streams. Test in combined.test.ts covers JPEG, PNG, raw RGB (with confirmed layout), and opaque-garbage cases.

Comment thread src/node/image-encoding.ts Outdated
Comment on lines +131 to +132
if (canonicalFormat === targetMime) {
return { ...image, format: targetMime };
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In encodePDFImage, the pass-through branch for canonicalFormat === targetMime returns { ...image, format: targetMime }, which preserves channels / bitsPerComponent if they were present on the input. The PR’s contract/docs say encoded outputs should drop these fields to avoid consumers/OCR heuristics misreading the encoded buffer as raw pixel data. Consider clearing channels and bitsPerComponent (and any other raw-pixel-only metadata) in this branch as well so all encoded outputs behave consistently.

Suggested change
if (canonicalFormat === targetMime) {
return { ...image, format: targetMime };
// Even though the bytes are unchanged, this is still an encoded output,
// so drop raw-pixel-only metadata for consistency with the re-encoded path.
if (canonicalFormat === targetMime) {
return {
...image,
format: targetMime,
bitsPerComponent: undefined,
channels: undefined,
};

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f02af1e — applied your exact suggestion. Pass-through branch now drops channels and bitsPerComponent for consistency with the re-encoded path. Test added at image-encoding.test.ts ("drops raw-pixel metadata in the pass-through (already-encoded) branch").

Resolves Codex+Copilot review on PR #75:

- convertExtractedImage: only infer raw image/x-* when byte-length confirms
  a raw 8-bit layout. Otherwise sniff magic bytes (PNG, JPEG, WebP, TIFF)
  and fall back to application/octet-stream. Prevents encoded streams
  embedded in PDFs (or 4-channel CMYK) from being mis-labelled as raw RGB.
- encodePDFImage pass-through (when source already matches target) now
  drops channels and bitsPerComponent for consistency with the re-encoded
  path. Prevents OCR providers from mis-reading an encoded buffer as raw
  pixel data via width && height && channels heuristics.
- canonicalizeImageFormat treats application/octet-stream as idempotent
  (regression: applyOutputFormat's 'original' branch was re-running
  channel inference on intentionally-opaque blobs and upgrading them
  back to image/x-rgb).
- New shared helper detectImageMimeFromMagicBytes for PNG/JPEG/WebP/TIFF.
- Format package.json (CI lint requirement).

3 new tests added (sniffing dispatch, pass-through metadata drop,
octet-stream idempotency). 118/118 passing.
@willgriffin willgriffin merged commit 68680e5 into main Apr 27, 2026
4 checks passed
@willgriffin willgriffin deleted the claude/eloquent-yalow-d4aa30 branch April 27, 2026 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants