Prevent extractImages from OOMing on large PDFs #71

@willgriffin

Summary

Anytown production can now complete large-PDF text extraction with the child-process path, but the follow-up embedded-image extraction path can still OOM the caller process/container.

Observed against Blackfalds' public Jan. 27, 2026 agenda PDF:

  • https://www.blackfalds.ca/uploads/dm/110249/20260127_RCM_Agenda.pdf
  • Praeco/Anytown saved extracted text successfully: 186,485 chars and one source_document asset.
  • Immediately after that, while Praeco called reader.extractImages(pdfPath) to attach document-image assets, the dashboard container was OOMKilled at its 2Gi memory limit.
  • SMRT then recovered the job as stale because the pod died mid-job.

This means the large-PDF extractText() path is improved, but extractImages() is still not safe for arbitrarily large PDFs.

Likely Cause

The Node unpdf reader currently:

  • iterates every page,
  • appends every extracted image to one allImages array,
  • preserves raw RGB image buffers when channels === 3, which can be very large,
  • returns the full array only after the entire PDF has been processed.

That API shape forces callers like Praeco to hold the entire extracted image set in memory before they can store or clean up any derived assets.
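To give a rough sense of scale for the raw RGB point, here is a back-of-the-envelope sketch. The image dimensions are an assumption (a full-page A4 scan at roughly 300 dpi), not measurements from the Blackfalds PDF, and the calculation ignores every other allocation in the process.

```typescript
// Bytes needed for an uncompressed image: one byte per channel per pixel.
function rawImageBytes(width: number, height: number, channels: number): number {
  return width * height * channels;
}

// Assumed example: an A4 page scanned at about 300 dpi is roughly 2480 x 3508 pixels.
const perImage = rawImageBytes(2480, 3508, 3);

// Container memory limit from the report above: 2Gi.
const limitBytes = 2 * 1024 ** 3;

// How many such raw buffers fit under the limit if nothing else used memory.
const imagesUntilLimit = Math.floor(limitBytes / perImage);

console.log(perImage, imagesUntilLimit); // 26099520 82
```

Under those assumptions, fewer than a hundred page-sized raw RGB buffers held in one array would be enough to reach the limit on their own.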

Desired Direction

Make PDF image extraction bounded-memory upstream rather than requiring app-local PDF-size gates.

Possible implementation directions:

  • Add page-batched extraction support to extractImages(source, options?), similar to the large text path.
  • Add an async iterator / callback API for extracted images so callers can consume and persist one batch/image at a time.
  • Support file-backed image outputs for large images, with explicit cleanup ownership.
  • Avoid returning raw RGB buffers by default for asset extraction use cases, or expose a mode that encodes/compresses images into standard asset-safe formats.
  • Keep failure explicit for catastrophic extraction failures; do not silently return a partial success as if extraction fully succeeded.
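As one possible shape for the async iterator direction, here is a minimal sketch. Every name in it (ExtractedImage, PageImageLoader, iterateImages, the fake loader) is hypothetical and only illustrates the idea; none of it is taken from the existing reader or from unpdf.

```typescript
// Hypothetical image shape; the real reader's fields may differ.
interface ExtractedImage {
  page: number;
  width: number;
  height: number;
  channels: number;
  data: Uint8Array;
}

// Hypothetical per-page loader, standing in for the real PDF decoding step.
type PageImageLoader = (page: number) => Promise<ExtractedImage[]>;

// Yields images one page at a time, so this generator never holds more
// than one page's worth of images.
async function* iterateImages(
  pageCount: number,
  loadPage: PageImageLoader,
): AsyncGenerator<ExtractedImage, void, undefined> {
  for (let page = 1; page <= pageCount; page++) {
    // Errors from loadPage propagate to the consumer; nothing is swallowed.
    const images = await loadPage(page);
    for (const image of images) {
      yield image;
    }
  }
}

// Fake loader used only for this illustration: one tiny 2x2 RGB image per page.
const fakeLoader: PageImageLoader = async (page) => [
  { page, width: 2, height: 2, channels: 3, data: new Uint8Array(12) },
];

// Consumes images one by one and returns the total number of bytes seen.
async function persistAll(): Promise<number> {
  let stored = 0;
  for await (const image of iterateImages(3, fakeLoader)) {
    // A real caller would write image.data to storage here, then drop it.
    stored += image.data.byteLength;
  }
  return stored;
}
```

With this shape, a caller like Praeco could attach each asset inside the for await loop and let the buffer be garbage-collected before the next page is decoded. A loader failure surfaces as a thrown error in the loop rather than as a shortened result.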

Acceptance Criteria

  • A large PDF with many embedded images can be processed without holding all images in parent process memory.
  • Callers can attach/store images incrementally and clean up temporary artifacts.
  • Existing extractImages(source) remains compatible for small PDFs.
  • Tests cover a many-page/many-image PDF path and prove that memory use stays bounded and temporary files are cleaned up.
  • Praeco can use the upstream capability without reintroducing app-local size gates.
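A test for the cleanup criterion might look roughly like the sketch below. It assumes a file-backed mode where images are spooled to a temporary directory that the caller owns; spoolImages, FileBackedImages, and the fake image source are all hypothetical names used for illustration, and only the Node fs, os, and path calls are real.

```typescript
import { mkdtemp, readdir, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical handle: the caller owns the directory and must call cleanup().
interface FileBackedImages {
  dir: string;
  paths: string[];
  cleanup: () => Promise<void>;
}

// Writes each buffer to its own file as it arrives, so only one buffer
// needs to be in memory at a time.
async function spoolImages(
  images: AsyncIterable<Uint8Array>,
): Promise<FileBackedImages> {
  const dir = await mkdtemp(join(tmpdir(), "pdf-images-"));
  const cleanup = () => rm(dir, { recursive: true, force: true });
  const paths: string[] = [];
  try {
    let index = 0;
    for await (const data of images) {
      const path = join(dir, `image-${index++}.bin`);
      await writeFile(path, data);
      paths.push(path);
    }
  } catch (error) {
    // Fail explicitly: remove partial output and rethrow instead of
    // returning a partial result as if extraction had succeeded.
    await cleanup();
    throw error;
  }
  return { dir, paths, cleanup };
}

// Fake image source for the illustration: `count` small buffers.
async function* fakeImages(count: number): AsyncGenerator<Uint8Array> {
  for (let i = 0; i < count; i++) yield new Uint8Array(16);
}

// Reports how many files were spooled and whether the directory is gone after cleanup.
async function demo(): Promise<{ written: number; removed: boolean }> {
  const spooled = await spoolImages(fakeImages(4));
  const written = (await readdir(spooled.dir)).length;
  await spooled.cleanup();
  const removed = await readdir(spooled.dir).then(() => false, () => true);
  return { written, removed };
}
```

A real test would swap the fake source for a many-image PDF fixture and could additionally sample process.memoryUsage() during the run to check that memory stays bounded.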
