Prevent extractImages from OOMing on large PDFs

## Summary

Anytown production can now complete large-PDF text extraction with the child-process path, but the follow-up embedded-image extraction path can still OOM the caller process/container.

Observed against Blackfalds' public Jan. 27, 2026 agenda PDF:

- `https://www.blackfalds.ca/uploads/dm/110249/20260127_RCM_Agenda.pdf`
- Praeco/Anytown saved extracted text successfully: `186,485` chars and one `source_document` asset.
- Immediately after that, while Praeco called `reader.extractImages(pdfPath)` to attach document-image assets, the dashboard container was `OOMKilled` at its `2Gi` memory limit.
- SMRT then recovered the job as stale because the pod died mid-job.

This means the large `extractText()` path is improved, but `extractImages()` is still not safe for arbitrarily large PDFs.

## Likely Cause

The Node `unpdf` reader currently:

- iterates every page,
- appends every extracted image to one `allImages` array,
- preserves raw RGB image buffers when `channels === 3`, which can be very large,
- returns the full array only after the entire PDF has been processed.

That API shape forces callers like Praeco to hold the entire extracted image set in memory before they can store any image or clean up derived assets.
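A rough back-of-the-envelope sketch of why accumulating raw RGB buffers is risky. The page dimensions below (an A4 scan at 300 dpi) are an illustrative assumption, not measurements from the Blackfalds PDF, and `rawImageBytes` / `imagesThatFit` are hypothetical helper names used only for this estimate.

```typescript
// Uncompressed size of one image buffer: one byte per channel per pixel.
function rawImageBytes(width: number, height: number, channels = 3): number {
  return width * height * channels;
}

// How many such buffers fit under a memory limit, ignoring all other
// process overhead (so the real number is lower).
function imagesThatFit(
  limitBytes: number,
  width: number,
  height: number,
  channels = 3,
): number {
  return Math.floor(limitBytes / rawImageBytes(width, height, channels));
}

// Assumed example: 2480 x 3508 px full-page scans against a 2Gi limit.
const perImage = rawImageBytes(2480, 3508); // 26,099,520 bytes, roughly 25 MiB
const fits = imagesThatFit(2 * 1024 ** 3, 2480, 3508); // 82
console.log({ perImage, fits });
```

Under those assumptions, fewer than a hundred full-page scans held in one array would be enough to exhaust the container, before counting the PDF itself or any encoding work.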

## Desired Direction

Make PDF image extraction bounded-memory upstream rather than requiring app-local PDF-size gates.

Possible implementation directions:

- Add page-batched extraction support to `extractImages(source, options?)`, similar to the large text path.
- Add an async iterator / callback API for extracted images so callers can consume and persist one batch/image at a time.
- Support file-backed image outputs for large images, with explicit cleanup ownership.
- Avoid returning raw RGB buffers by default for asset extraction use cases, or expose a mode that encodes/compresses images into standard asset-safe formats.
- Keep failure explicit for catastrophic extraction failures; do not silently return a partial success as if extraction fully succeeded.
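One possible shape for the page-batched / async-iterator direction, as a minimal sketch only. Everything here is hypothetical: `ExtractedImage`, `PageImageExtractor`, and `iterateImageBatches` are illustrative names, not the reader's current API, and the per-page extractor is injected so the example stays self-contained.

```typescript
// Hypothetical image record; the real reader's fields may differ.
interface ExtractedImage {
  page: number;
  width: number;
  height: number;
  channels: number;
  data: Uint8Array;
}

// Hypothetical per-page extractor supplied by the reader implementation.
type PageImageExtractor = (page: number) => Promise<ExtractedImage[]>;

// Yields images a few pages at a time, so only one batch is alive at once.
// Errors from extractPage propagate to the consumer instead of being
// swallowed, which keeps catastrophic failures explicit.
async function* iterateImageBatches(
  pageCount: number,
  extractPage: PageImageExtractor,
  pagesPerBatch = 4,
): AsyncGenerator<ExtractedImage[]> {
  if (!Number.isInteger(pagesPerBatch) || pagesPerBatch < 1) {
    throw new RangeError("pagesPerBatch must be a positive integer");
  }
  for (let start = 1; start <= pageCount; start += pagesPerBatch) {
    const end = Math.min(start + pagesPerBatch - 1, pageCount);
    const batch: ExtractedImage[] = [];
    for (let page = start; page <= end; page++) {
      for (const image of await extractPage(page)) batch.push(image);
    }
    yield batch;
  }
}

// Usage with a fake extractor that returns one tiny image per page.
async function batchSizesDemo(): Promise<number[]> {
  const fakeExtract: PageImageExtractor = async (page) => [
    { page, width: 1, height: 1, channels: 3, data: new Uint8Array(3) },
  ];
  const sizes: number[] = [];
  for await (const batch of iterateImageBatches(10, fakeExtract, 4)) {
    // A real caller would persist the batch here, then drop the reference.
    sizes.push(batch.length);
  }
  return sizes; // [4, 4, 2]
}
```

With this shape, the existing array-returning `extractImages(source)` could in principle be kept as a thin wrapper that collects all batches, which would preserve compatibility for small PDFs.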

## Acceptance Criteria

- A large PDF with many embedded images can be processed without holding all images in parent process memory.
- Callers can attach/store images incrementally and clean up temporary artifacts.
- Existing `extractImages(source)` remains compatible for small PDFs.
- Tests cover a many-page/many-image PDF path and prove that memory use stays bounded and temporary files are cleaned up.
- Praeco can use the upstream capability without reintroducing app-local size gates.
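The cleanup criterion could be exercised with a scoped temp-directory helper along these lines. This is a sketch under assumptions: `withImageTempDir` and the file name are hypothetical, and it only uses standard Node `fs`, `os`, and `path` calls rather than anything from the reader itself.

```typescript
import { existsSync } from "node:fs";
import { mkdtemp, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: the callback owns the directory only while it runs.
// The directory is removed afterwards, whether the callback resolves or throws.
async function withImageTempDir<T>(
  use: (dir: string) => Promise<T>,
): Promise<T> {
  const dir = await mkdtemp(join(tmpdir(), "pdf-images-"));
  try {
    return await use(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}

// Demo: write one file-backed "image", then confirm the directory is gone.
async function cleanupDemo(): Promise<{ existedDuring: boolean; existsAfter: boolean }> {
  let seenDir = "";
  const existedDuring = await withImageTempDir(async (dir) => {
    seenDir = dir;
    const file = join(dir, "page-1-image-0.bin");
    await writeFile(file, new Uint8Array([1, 2, 3]));
    return existsSync(file);
  });
  return { existedDuring, existsAfter: existsSync(seenDir) };
}
```

A test for the many-image path could wrap each batch in such a scope and assert that nothing remains on disk after the run, including when extraction throws partway through.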

