feat(pdf): web-safe image output + canonical mime types (#73, #74)#75
Conversation
Closes #73 and #74; companion to happyvertical/ocr#75 (PR happyvertical/ocr#76). - extractImages defaults outputFormat to 'webp' so callers can store/serve image.data directly. Pass outputFormat: 'original' for the previous raw RGB behavior (still the OCR fast path). - renderPages defaults outputFormat to 'original' so the OCR fallback pipeline stays zero-copy. - PDFImage.format is now a canonical lowercase IANA mime type (image/jpeg, image/png, image/webp, image/tiff, image/x-rgb, image/x-rgba, image/x-grayscale, image/x-cmyk, application/octet-stream). - PDFImage.bitsPerComponent is populated for raw outputs (8 from unpdf), forward-compatible with happyvertical/ocr#76. - New exports: canonicalizeImageFormat (pure) and encodePDFImage (lazy Node-only re-encoder using @napi-rs/canvas, an existing dep). - Encoded outputs drop channels and bitsPerComponent so the buffer is not misread as raw pixel data by OCR providers that key on width && height && channels.
Release PreviewWhen this PR is squash-merged, @happyvertical/pdf will receive a patch version bump based on the PR title's conventional commit format. What happens on merge?
No manual intervention needed. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b4c9f4c9e3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Canonicalize `format` to a lowercase IANA mime type per #74 so | ||
| // downstream consumers don't re-implement the same normalizer. | ||
| const format = canonicalizeImageFormat(undefined, image.channels); |
There was a problem hiding this comment.
Avoid inferring raw mime for non-raw extracted buffers
When the extracted payload is not a raw pixel buffer (rawData.length !== width * height * channels), this still canonicalizes using only channels, which produces raw mime types like image/x-rgb/image/x-rgba. In those cases applyOutputFormat cannot safely re-encode and returns the original bytes, so downstream consumers can receive encoded/opaque bytes mislabeled as raw pixels and mis-decode them. This is particularly risky for 4-channel CMYK streams, which get labeled as RGBA and can be color-corrupted when converted.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in f02af1e. convertExtractedImage now does byte-length-confirms-raw-layout first, falls back to magic-byte sniffing (PNG/JPEG/WebP/TIFF), then application/octet-stream. Channel-based raw inference only fires when the layout is byte-length-confirmed, so a 4-channel CMYK stream is no longer mis-labelled image/x-rgba. Also fixed an idempotency bug where canonicalizeImageFormat was re-running channel inference on application/octet-stream and upgrading opaque blobs back to image/x-rgb. Regression test added in combined.test.ts.
There was a problem hiding this comment.
Pull request overview
Adds web-safe image outputs and canonical MIME typing to the PDF image extraction/rendering pipeline, so extracted images are directly storable/servable by default while preserving a raw-pixel escape hatch for OCR hot paths.
Changes:
- Add
outputFormat+encodingOptionstoextractImages(defaultwebp) andrenderPages(defaultoriginal), with Node-side re-encoding via@napi-rs/canvas. - Standardize
PDFImage.formatto canonical lowercase IANA MIME types and addbitsPerComponenton raw outputs. - Introduce
canonicalizeImageFormat(shared/pure) andencodePDFImage(lazy Node-only) helpers, plus unit/integration test updates and a changeset.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/shared/types.ts | Extends options/types (outputFormat, encodingOptions, canonical MIME types, bitsPerComponent). |
| src/shared/image-format.ts | Adds browser-safe canonical MIME normalizer. |
| src/node/unpdf.ts | Applies output formatting/encoding defaults and canonicalizes image metadata during extraction/rendering. |
| src/node/image-encoding.ts | Implements Node-only raw-pixel → web-safe encoding helpers and batch application. |
| src/node/image-encoding.test.ts | Adds unit coverage for canonicalization + encoding behavior/guards. |
| src/extraction.test.ts | Verifies default WebP extraction + raw-output metadata guarantees. |
| src/ocr-integration.test.ts | Pins OCR-on-extracted-images to outputFormat: 'original'. |
| src/node/combined.test.ts | Adjusts tests to pin raw output where assertions depend on mocked bytes. |
| src/index.ts | Exports canonicalizeImageFormat and a lazy Node-only encodePDFImage wrapper. |
| .changeset/web-safe-image-extraction.md | Documents the behavior change and migration guidance. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Canonicalize `format` to a lowercase IANA mime type per #74 so | ||
| // downstream consumers don't re-implement the same normalizer. | ||
| const format = canonicalizeImageFormat(undefined, image.channels); | ||
|
|
There was a problem hiding this comment.
convertExtractedImage currently sets format via canonicalizeImageFormat(undefined, image.channels), which will label any 3-channel image as image/x-rgb even when the buffer is not a raw pixel buffer (e.g., an encoded JPEG/PNG stream where rawData.length !== width*height*channels). This can cause downstream consumers/OCR to misinterpret encoded bytes as raw pixels. Consider only inferring image/x-* formats when the byte-length check confirms a raw 8-bit layout; otherwise set format to an encoded mime type via magic-byte sniffing (jpeg/png/webp/tiff) or fall back to application/octet-stream.
There was a problem hiding this comment.
Fixed in f02af1e — same approach as you suggested: byte-length raw-layout check is now authoritative, with magic-byte sniffing as the fallback for non-raw streams. Test in combined.test.ts covers JPEG, PNG, raw RGB (with confirmed layout), and opaque-garbage cases.
| if (canonicalFormat === targetMime) { | ||
| return { ...image, format: targetMime }; |
There was a problem hiding this comment.
In encodePDFImage, the pass-through branch for canonicalFormat === targetMime returns { ...image, format: targetMime }, which preserves channels / bitsPerComponent if they were present on the input. The PR’s contract/docs say encoded outputs should drop these fields to avoid consumers/OCR heuristics misreading the encoded buffer as raw pixel data. Consider clearing channels and bitsPerComponent (and any other raw-pixel-only metadata) in this branch as well so all encoded outputs behave consistently.
| if (canonicalFormat === targetMime) { | |
| return { ...image, format: targetMime }; | |
| // Even though the bytes are unchanged, this is still an encoded output, | |
| // so drop raw-pixel-only metadata for consistency with the re-encoded path. | |
| if (canonicalFormat === targetMime) { | |
| return { | |
| ...image, | |
| format: targetMime, | |
| bitsPerComponent: undefined, | |
| channels: undefined, | |
| }; |
There was a problem hiding this comment.
Fixed in f02af1e — applied your exact suggestion. Pass-through branch now drops channels and bitsPerComponent for consistency with the re-encoded path. Test added at image-encoding.test.ts ("drops raw-pixel metadata in the pass-through (already-encoded) branch").
Resolves Codex+Copilot review on PR #75: - convertExtractedImage: only infer raw image/x-* when byte-length confirms a raw 8-bit layout. Otherwise sniff magic bytes (PNG, JPEG, WebP, TIFF) and fall back to application/octet-stream. Prevents encoded streams embedded in PDFs (or 4-channel CMYK) from being mis-labelled as raw RGB. - encodePDFImage pass-through (when source already matches target) now drops channels and bitsPerComponent for consistency with the re-encoded path. Prevents OCR providers from mis-reading an encoded buffer as raw pixel data via width && height && channels heuristics. - canonicalizeImageFormat treats application/octet-stream as idempotent (regression: applyOutputFormat's 'original' branch was re-running channel inference on intentionally-opaque blobs and upgrading them back to image/x-rgb). - New shared helper detectImageMimeFromMagicBytes for PNG/JPEG/WebP/TIFF. - Format package.json (CI lint requirement). 3 new tests added (sniffing dispatch, pass-through metadata drop, octet-stream idempotency). 118/118 passing.
Closes #73, closes #74. Companion to happyvertical/ocr#75 (open PR happyvertical/ocr#76).
Summary
extractImagesnow acceptsoutputFormat: 'webp' | 'png' | 'jpeg' | 'original'and re-encodes raw image streams via@napi-rs/canvas(already a dep — no new deps). Default is'webp'so callers can store/serveimage.datadirectly.renderPagesaccepts the same option; default stays'original'so the internal OCR fallback path stays zero-copy.PDFImage.formatis now a canonical lowercase IANA mime type (image/jpeg,image/png,image/webp,image/tiff,image/x-rgb,image/x-rgba,image/x-grayscale,image/x-cmyk,application/octet-stream).PDFImage.bitsPerComponentis populated on raw outputs (always8from unpdf), forward-compatible with feat(ocr): add bitsPerComponent to OCRImage ocr#76.canonicalizeImageFormat(pure) andencodePDFImage(lazy Node-only re-encoder).channelsandbitsPerComponentso the buffer cannot be misread as raw pixel data by OCR providers that key onwidth && height && channels.width/heightstay set as metadata.Migration
format === 'rgb') → passoutputFormat: 'original'and checkimage.format === 'image/x-rgb'.sharp) → drop the re-encoder;image.datais now WebP by default.extractImages(source, { onBatch })streaming callers receive WebP unless they passoutputFormat: 'original'.extractImages()→performOCR()→ passoutputFormat: 'original'(raw RGB is the OCR fast path). Internal OCR fallback (extractText()/renderPages()→ OCR) is unaffected.Full migration notes are in
.changeset/web-safe-image-extraction.md.Notable design decisions
extractImagesis'webp'(issue extractImages should produce web-safe encoded output (or expose an encoder helper) #73's preferred design); default forrenderPagesis'original'so the OCR hot path stays zero-copy. Verified incombined.tsthat no internal call site passes rawextractImages()output to OCR.bitsPerComponentis declared onPDFImage(notOCRImage) so this PR doesn't block on feat(ocr): add bitsPerComponent to OCRImage ocr#76 publishing. When that ships and the OCR dep is bumped, both interfaces will declare the same optional field — TypeScript merges cleanly.canonicalizeImageFormatlives insrc/shared/(no Node deps) so the shared entry point stays browser-safe;encodePDFImageis exposed viaawait import('./node/image-encoding')so importing@happyvertical/pdfdoesn't pull in@napi-rs/canvasuntil you actually call it.isRaw8BitPixelImageguard is gated onbitsPerComponent ?? 8 === 8plus exactwidth * height * channelsbyte-length match, so it cannot misinterpret a 16-bit grayscale or 2-channel buffer as raw 8-bit (the exact ambiguity ocr#75 calls out). Covered by a unit test.Test plan
npm test— 113/113 passing (added 11 new tests).npx tsc --noEmit— clean.npx biome check src— clean.npm run build— clean (8.08 kB index, 3.15 kB lazy image-encoding chunk, 12.40 kB image-format chunk).channels).outputFormat: 'original'verified end-to-end.