Summary
Anytown production can now complete large-PDF text extraction with the child-process path, but the follow-up embedded-image extraction path can still OOM the caller process/container.
Observed against Blackfalds' public Jan. 27, 2026 agenda PDF:
https://www.blackfalds.ca/uploads/dm/110249/20260127_RCM_Agenda.pdf
- Praeco/Anytown saved extracted text successfully: 186,485 chars and one source_document asset.
- Immediately after that, while Praeco called reader.extractImages(pdfPath) to attach document-image assets, the dashboard container was OOMKilled at its 2Gi memory limit.
- SMRT then recovered the job as stale because the pod died mid-job.
This means the large extractText() path is improved, but extractImages() is still not safe for arbitrarily large PDFs.
Likely Cause
The Node unpdf reader currently:
- iterates every page,
- appends every extracted image to one allImages array,
- preserves raw RGB image buffers when channels === 3, which can be very large,
- returns the full array only after the entire PDF has been processed.
That API shape forces callers like Praeco to hold the entire extracted image set in memory before they can store derived assets or clean them up.
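To give a sense of scale, here is a rough back-of-the-envelope sketch of how quickly raw RGB buffers add up. The page dimensions (2480 x 3508 pixels, roughly A4 at 300 dpi) are an assumption chosen for illustration, not a measurement from the Blackfalds PDF, and the helper names are hypothetical. The 2Gi figure is the container limit from the report above.

```typescript
// Bytes needed for one uncompressed image: one byte per channel per pixel.
function rawImageBytes(width: number, height: number, channels: number): number {
  const valid = [width, height, channels].every((n) => Number.isInteger(n) && n > 0);
  if (!valid) {
    throw new RangeError("width, height and channels must be positive integers");
  }
  return width * height * channels;
}

// How many buffers of a given size fit under a memory limit.
// This is an upper bound: it ignores everything else the process holds.
function imagesThatFit(limitBytes: number, bytesPerImage: number): number {
  return Math.floor(limitBytes / bytesPerImage);
}

const GIB = 1024 ** 3;

// Assumed example: a full-page scan at 2480 x 3508 pixels, 3 channels (RGB).
const perImage = rawImageBytes(2480, 3508, 3);
console.log(perImage); // 26099520 bytes, about 24.9 MiB
console.log(imagesThatFit(2 * GIB, perImage)); // 82
```

Under these assumptions, fewer than a hundred full-page scans held in a single array would be enough to reach the limit, before counting the PDF itself or any other allocations in the process.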
Desired Direction
Make PDF image extraction bounded-memory upstream rather than requiring app-local PDF-size gates.
Possible implementation directions:
- Add page-batched extraction support to extractImages(source, options?), similar to the large text path.
- Add an async iterator / callback API for extracted images so callers can consume and persist one batch/image at a time.
- Support file-backed image outputs for large images, with explicit cleanup ownership.
- Avoid returning raw RGB buffers by default for asset extraction use cases, or expose a mode that encodes/compresses images into standard asset-safe formats.
- Keep failure explicit for catastrophic extraction failures; do not silently return a partial success as if extraction fully succeeded.
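As one possible shape for the async-iterator direction, the sketch below yields images one page batch at a time. Everything here is hypothetical: extractImagesByPage, the ExtractedImage fields, and the PageExtractor callback are illustrative names, not the reader's current API. The per-page extraction step is passed in as a function so the sketch makes no claims about how the unpdf reader does that work.

```typescript
// Hypothetical image record; field names are assumptions for this sketch.
interface ExtractedImage {
  page: number;
  width: number;
  height: number;
  channels: number;
  data: Uint8Array;
}

// Hypothetical per-page extractor supplied by the reader implementation.
type PageExtractor = (page: number) => Promise<ExtractedImage[]>;

// Yields one batch of images per group of pages instead of building a
// single array for the whole document. Once the caller is done with a
// batch, nothing in this generator keeps a reference to it.
async function* extractImagesByPage(
  totalPages: number,
  extractPage: PageExtractor,
  pagesPerBatch = 1,
): AsyncGenerator<ExtractedImage[]> {
  if (!Number.isInteger(pagesPerBatch) || pagesPerBatch < 1) {
    throw new RangeError("pagesPerBatch must be a positive integer");
  }
  for (let start = 1; start <= totalPages; start += pagesPerBatch) {
    const end = Math.min(start + pagesPerBatch - 1, totalPages);
    const batch: ExtractedImage[] = [];
    for (let page = start; page <= end; page++) {
      // Errors from extractPage propagate to the consumer, so a failed
      // page is never reported as a successful (partial) extraction.
      for (const image of await extractPage(page)) {
        batch.push(image);
      }
    }
    yield batch;
  }
}
```

A caller could then persist each batch before asking for the next one, for example `for await (const batch of extractImagesByPage(pageCount, extractPage, 4)) { await storeAssets(batch); }`, where pageCount, extractPage, and storeAssets are likewise placeholders. The existing array-returning extractImages(source) could stay as a thin wrapper that collects the batches, which would keep small-PDF callers working unchanged.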
Acceptance Criteria
- A large PDF with many embedded images can be processed without holding all images in parent process memory.
- Callers can attach/store images incrementally and clean up temporary artifacts.
- Existing extractImages(source) remains compatible for small PDFs.
- Tests cover a many-page/many-image PDF path and prove memory/temp cleanup remains bounded.
- Praeco can use the upstream capability without reintroducing app-local size gates.
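The "remains bounded" criterion could be checked in a test along the lines of the sketch below. It consumes an async stream of image batches and records the largest number of bytes held in any single batch, next to the total that passed through. This is only a proxy for resident memory, not a measurement of it, and all names (measureStream, syntheticBatches, ImageLike) are hypothetical helpers written for this sketch.

```typescript
// Minimal shape needed for measurement; an assumption for this sketch.
interface ImageLike {
  data: Uint8Array;
}

interface StreamStats {
  batchCount: number;
  peakBatchBytes: number; // largest batch seen at one time
  totalBytes: number; // everything that passed through the stream
}

// Consumes the stream one batch at a time and never retains a batch.
async function measureStream(batches: AsyncIterable<ImageLike[]>): Promise<StreamStats> {
  const stats: StreamStats = { batchCount: 0, peakBatchBytes: 0, totalBytes: 0 };
  for await (const batch of batches) {
    let batchBytes = 0;
    for (const image of batch) {
      batchBytes += image.data.byteLength;
    }
    stats.batchCount += 1;
    stats.totalBytes += batchBytes;
    stats.peakBatchBytes = Math.max(stats.peakBatchBytes, batchBytes);
  }
  return stats;
}

// Synthetic stand-in for a many-page, many-image PDF.
async function* syntheticBatches(
  batchCount: number,
  imagesPerBatch: number,
  bytesPerImage: number,
): AsyncGenerator<ImageLike[]> {
  for (let i = 0; i < batchCount; i++) {
    const batch: ImageLike[] = [];
    for (let j = 0; j < imagesPerBatch; j++) {
      batch.push({ data: new Uint8Array(bytesPerImage) });
    }
    yield batch;
  }
}
```

A test could assert that peakBatchBytes stays at or below imagesPerBatch * bytesPerImage while totalBytes grows with the number of batches. A real test would presumably pair this with a generated PDF fixture and a check that no temporary files remain afterward.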