[Bug]: Flood of "Recursion depth exceeded in _find_image_xrefs_page" #1321

Closed
user1584 opened this issue May 30, 2024 · 5 comments

@user1584

Describe the bug

I am trying to OCR a large number of documents. Some of them trigger a flood of "Recursion depth exceeded in _find_image_xrefs_page" warnings (more than 1M warnings per document!). If I understand the problem correctly, this happens when OCRmyPDF tries to optimize embedded images. These documents seem to contain a deep (and broad) structure of xrefs. The scan takes very long and stalls further processing.
Unfortunately, I cannot share the documents for legal reasons.
Is there any way to identify documents that might lead to this problem before running the optimization?

Steps to reproduce

This is a minimal Python script that I used to dig deeper (with little success):

import pathlib
from ocrmypdf import PdfContext
import ocrmypdf.optimize
from pikepdf import ObjectStreamMode


infile = pathlib.Path("input.pdf")
tmpdir = "temp/"


class OptimizeOptions:
    """Emulate ocrmypdf's options."""

    def __init__(self):
        self.input_file = infile
        self.jobs = 1
        self.optimize = 1
        self.jpeg_quality = 0
        self.png_quality = 0
        self.jbig2_page_group_size = 0
        self.jbig2_lossy = False
        self.jbig2_threshold = 0.85
        self.quiet = True
        self.progress_bar = False


context = PdfContext(OptimizeOptions(), tmpdir, infile, None, None)
ocrmypdf.optimize.optimize(
    input_file=infile,
    output_file=pathlib.Path("optimized.pdf"),
    context=context,
    save_settings=dict(
        compress_streams=True,
        preserve_pdfa=True,
        object_stream_mode=ObjectStreamMode.generate,
    ),
)


The log output is limited to the first occurrence of the recursion limit warning.

Files

No response

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.0.3

Relevant log output

DEBUG:ocrmypdf.optimize:xref 338: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 2: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 1
DEBUG:ocrmypdf.optimize:xref 10: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 2
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 3
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 4
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 5
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 6
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 7
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 8
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 9
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 10
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 11
DEBUG:ocrmypdf.optimize:xref 15: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 12
DEBUG:ocrmypdf.optimize:xref 38: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 13
DEBUG:ocrmypdf.optimize:xref 39: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 40: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm1 in page 14
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 14
DEBUG:ocrmypdf.optimize:xref 48: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 52: skipping image because it is an SMask
DEBUG:ocrmypdf.optimize:xref 51: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 50: skipping image because it is an SMask
DEBUG:ocrmypdf.optimize:xref 49: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm1 in page 15
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 15
DEBUG:ocrmypdf.optimize:xref 48: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 52: skipping image because it is an SMask
DEBUG:ocrmypdf.optimize:xref 51: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 50: skipping image because it is an SMask
DEBUG:ocrmypdf.optimize:xref 49: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm1 in page 16
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 16
DEBUG:ocrmypdf.optimize:xref 48: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 52: skipping image because it is an SMask
DEBUG:ocrmypdf.optimize:xref 51: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 50: skipping image because it is an SMask
DEBUG:ocrmypdf.optimize:xref 49: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 69: skipping image because it is an SMask
DEBUG:ocrmypdf.optimize:xref 68: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm0 in page 17
DEBUG:ocrmypdf.optimize:xref 71: skipping image because it is an SMask
DEBUG:ocrmypdf.optimize:xref 70: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:xref 72: treating as an optimization candidate
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm18 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm10 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm25 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm27 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm21 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm22 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm24 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm1 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm30 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm8 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm7 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm23 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm17 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm20 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm4 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm15 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm26 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Fm12 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi37 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi41 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi48 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi45 in page 18
DEBUG:ocrmypdf.optimize:Recursing into Form XObject /Xi9 in page 18
WARNING:ocrmypdf.optimize:Recursion depth exceeded in _find_image_xrefs_page
@jbarlow83
Collaborator

On page 18, /Xi9 appears to be in a circular reference loop that includes itself over and over. I can guard against it, but the file is likely malformed at this location. Try using Ghostscript to rewrite the PDF.

If the structure just happens to be really deep, Python has a call to raise the recursion limit (sys.setrecursionlimit), which you could use to force the file through. I doubt it would help, though.
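
To spot such files before optimizing (the question raised in the report), one option is to walk the Form XObject graph and look for an XObject that is its own ancestor. The following is a minimal sketch, not OCRmyPDF code: `has_cycle` and `pdf_has_cyclic_form_xobjects` are hypothetical helper names, and the pikepdf adapter assumes that Form XObjects are indirect objects reachable through `Resources.XObject`, as in the function discussed in this thread.

```python
from typing import Callable, Hashable, Iterable


def has_cycle(
    root: Hashable, children: Callable[[Hashable], Iterable[Hashable]]
) -> bool:
    """Return True if a cycle is reachable from root (iterative DFS)."""
    in_progress = {root}  # nodes on the current DFS path
    done = set()          # nodes whose whole subtree was already checked
    stack = [(root, iter(children(root)))]
    while stack:
        node, pending = stack[-1]
        for child in pending:
            if child in in_progress:
                return True  # child is one of its own ancestors
            if child not in done:
                in_progress.add(child)
                stack.append((child, iter(children(child))))
                break
        else:
            # All children handled: this node is finished.
            in_progress.discard(node)
            done.add(node)
            stack.pop()
    return False


def pdf_has_cyclic_form_xobjects(path: str) -> bool:
    """Hypothetical pre-check: does any page contain a Form XObject cycle?"""
    import pikepdf  # imported lazily so has_cycle() works without pikepdf

    with pikepdf.open(path) as pdf:
        objects = {}  # node key -> pikepdf object

        def form_children(key):
            try:
                xobjs = objects[key].Resources.XObject
            except AttributeError:
                return []
            found = []
            for _name, xobj in dict(xobjs).items():
                if pikepdf.Name.Subtype in xobj and xobj.Subtype == pikepdf.Name.Form:
                    # Assumption: Form XObjects are indirect, so objgen is unique.
                    objects[xobj.objgen] = xobj
                    found.append(xobj.objgen)
            return found

        for index, page in enumerate(pdf.pages):
            key = ("page", index)
            objects[key] = page.obj
            if has_cycle(key, form_children):
                return True
    return False
```

Because every node is marked as done once its subtree has been checked, the walk stays linear in the number of references per page, even when many Form XObjects share children.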

@user1584
Author

Thanks. For my test document, Ghostscript detected the problem and repaired the file.
Changing the Python recursion limit does not seem to matter for this issue, since the function contains a fixed recursion limit of 10.

@jbarlow83
Collaborator

I added a guard that should prevent recursion into previously visited XObjects. Please test and let me know if it works.

It's still probably best to use Ghostscript to repair these files.
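
The thread does not say which Ghostscript options were used for the repair. As a sketch under that caveat, a plain rewrite through the `pdfwrite` device can be scripted as follows; the helper names are hypothetical, and the executable is assumed to be called `gs` (on Windows it is typically `gswin64c`).

```python
import shutil
import subprocess
from pathlib import Path


def ghostscript_rewrite_command(src: Path, dst: Path, gs: str = "gs") -> list[str]:
    """Build a command line that rewrites src into dst via pdfwrite."""
    # "-o" sets the output file and implies batch mode without pauses.
    return [gs, "-o", str(dst), "-sDEVICE=pdfwrite", str(src)]


def rewrite_with_ghostscript(src: Path, dst: Path) -> None:
    """Run Ghostscript; raises if it is missing or exits with an error."""
    gs = shutil.which("gs")
    if gs is None:
        raise RuntimeError("Ghostscript executable 'gs' not found on PATH")
    subprocess.run(ghostscript_rewrite_command(src, dst, gs), check=True)
```

Calling `rewrite_with_ghostscript(Path("input.pdf"), Path("repaired.pdf"))` and then passing `repaired.pdf` to OCRmyPDF keeps the original file untouched.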

@user1584
Author

user1584 commented Jun 3, 2024

The guard does not seem to work. If I understand it correctly, you check against the included/excluded xrefs before the recursion, but the xrefs are only added to those sets after the recursion returns. Thus, the function can cycle through the circular references forever before the problematic xrefs are added.
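
The ordering problem described above can be reproduced on a toy graph. This is an illustrative sketch only: `walk`, the dictionary-based graph, and the node names are made up for the example, and only the depth limit of 10 is taken from the thread.

```python
from typing import Dict, List, Set

MAX_DEPTH = 10  # the fixed limit mentioned in this thread


def walk(
    graph: Dict[str, List[str]],
    node: str,
    seen: Set[str],
    mark_before: bool,
    depth: int = 0,
) -> int:
    """Return how often the depth limit is hit while walking from node."""
    if depth > MAX_DEPTH:
        return 1
    hits = 0
    for child in graph[node]:
        if child in seen:
            continue  # the guard
        if mark_before:
            seen.add(child)  # mark first, then recurse
        hits += walk(graph, child, seen, mark_before, depth + 1)
        seen.add(child)  # marking here alone is too late for a cycle
    return hits


# A Form XObject that lists itself in its own resources.
graph = {"page": ["fm"], "fm": ["fm"]}

hits_mark_after = walk(graph, "page", set(), mark_before=False)
hits_mark_before = walk(graph, "page", set(), mark_before=True)
print(hits_mark_after, hits_mark_before)
```

With marking after the recursion, the guard never sees `fm` during the descent, so the walk only stops at the depth limit. With marking before the recursion, the second encounter of `fm` is skipped immediately.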

I tried a different approach by passing all the layers of the recursion:

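# Imports needed if this function is tried outside ocrmypdf/optimize.py.
from typing import MutableSet

from ocrmypdf.optimize import Xref, log
from pikepdf import Name, Object, Pdf


def _find_image_xrefs_container(
    pdf: Pdf,
    container: Object,
    pageno: int,
    include_xrefs: MutableSet[Xref],
    exclude_xrefs: MutableSet[Xref],
    pageno_for_xref: dict[Xref, int],
    recursion_layers: tuple[Xref, ...] = (),
):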
    """Find all image XRefs or Form XObject and add to the include/exclude sets."""
    if len(recursion_layers) > 10:
        log.warning("Recursion depth exceeded in _find_image_xrefs_page")
        return
    try:
        xobjs = container.Resources.XObject
    except AttributeError:
        return
    for _imname, image in dict(xobjs).items():
        if image.objgen[1] != 0:
            continue  # Ignore images in an incremental PDF
        xref = Xref(image.objgen[0])
        if xref in recursion_layers:
            log.warning(f"Skipping {_imname} in page {pageno} since it seems to be recursive.")
            continue  # This XObject is one of its own ancestors (cycle)
        if Name.Subtype in image and image.Subtype == Name.Form:
            # Recurse into Form XObjects
            log.debug(f"Recursing into Form XObject {_imname} in page {pageno}")
            _find_image_xrefs_container(
                pdf,
                image,
                pageno,
                include_xrefs,
                exclude_xrefs,
                pageno_for_xref,
                recursion_layers=(*recursion_layers, xref),
            )
            continue
        if Name.SMask in image:
            # Ignore soft masks
            smask_xref = Xref(image.SMask.objgen[0])
            exclude_xrefs.add(smask_xref)
            log.debug(f"xref {smask_xref}: skipping image because it is an SMask")
        include_xrefs.add(xref)
        log.debug(f"xref {xref}: treating as an optimization candidate")
        if xref not in pageno_for_xref:
            pageno_for_xref[xref] = pageno

This seems to work with our problematic PDFs.

@user1584
Author

@jbarlow83: We found a PDF that could not even be processed with the changes posted above. The new (and so far working) version is this:

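# Imports needed by this standalone version of the function.
import typing
from typing import MutableSet

import ocrmypdf.optimize
from pikepdf import Name, Object, Pdf


def _find_image_xrefs_container(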
    pdf: Pdf,
    container: Object,
    pageno: int,
    include_xrefs: MutableSet[ocrmypdf.optimize.Xref],
    exclude_xrefs: MutableSet[ocrmypdf.optimize.Xref],
    pageno_for_xref: dict[ocrmypdf.optimize.Xref, int],
    depth: int = 0,
    already_processed_xrefs: typing.Optional[MutableSet[ocrmypdf.optimize.Xref]] = None,
    strange_xrefs: typing.Optional[MutableSet[ocrmypdf.optimize.Xref]] = None,
):
    """Find all image XRefs or Form XObject and add to the include/exclude sets."""
    if depth > 10:
        ocrmypdf.optimize.log.warning(
            "Recursion depth exceeded in _find_image_xrefs_page"
        )
        return

    try:
        xobjs = container.Resources.XObject
    except AttributeError:
        return

    # If no xrefs were passed, an empty set is created.
    _already_processed_xrefs: MutableSet[ocrmypdf.optimize.Xref]
    if already_processed_xrefs is None:
        _already_processed_xrefs = set()
    else:
        _already_processed_xrefs = already_processed_xrefs

    # If no strange xrefs were passed, an empty set is created.
    _strange_xrefs: MutableSet[ocrmypdf.optimize.Xref]
    if strange_xrefs is None:
        _strange_xrefs = set()
    else:
        _strange_xrefs = strange_xrefs

    # This maps each object's xref to its name and the object itself.
    to_be_processed_objects = {
        ocrmypdf.optimize.Xref(image.objgen[0]): (_imname, image)
        for _imname, image in dict(xobjs).items()
    }

    # Only the xrefs of the to-be-processed objects.
    to_be_processed_xrefs: set[ocrmypdf.optimize.Xref] = set(
        to_be_processed_objects.keys()
    )

    # These are the xrefs that have not been processed before.
    unprocessed_xrefs = to_be_processed_xrefs - _already_processed_xrefs

    # Collecting "strange" xrefs (probably involved in cyclic xrefs?) for later logging.
    _strange_xrefs |= to_be_processed_xrefs - unprocessed_xrefs

    # This will be passed to further function calls. Thus, we can consider
    # these xrefs processed.
    to_be_already_processed_xrefs: set[ocrmypdf.optimize.Xref] = (
        set(_already_processed_xrefs) | unprocessed_xrefs
    )

    for xref in unprocessed_xrefs:
        _imname, image = to_be_processed_objects[xref]
        if image.objgen[1] != 0:
            continue  # Ignore images in an incremental PDF

        if Name.Subtype in image and image.Subtype == Name.Form:
            # Recurse into Form XObjects
            ocrmypdf.optimize.log.debug(
                f"Recursing into Form XObject {_imname} in page {pageno}"
            )
            _find_image_xrefs_container(
                pdf=pdf,
                container=image,
                pageno=pageno,
                include_xrefs=include_xrefs,
                exclude_xrefs=exclude_xrefs,
                pageno_for_xref=pageno_for_xref,
                depth=depth + 1,
                already_processed_xrefs=to_be_already_processed_xrefs,
                strange_xrefs=_strange_xrefs,
            )
            continue

        if Name.SMask in image:
            # Ignore soft masks
            smask_xref = ocrmypdf.optimize.Xref(image.SMask.objgen[0])
            exclude_xrefs.add(smask_xref)
            ocrmypdf.optimize.log.debug(
                f"xref {smask_xref}: skipping image because it is an SMask"
            )

        # Everything that passed the previous filters is relevant for optimization.
        include_xrefs.add(xref)
        ocrmypdf.optimize.log.debug(
            f"xref {xref}: treating as an optimization candidate"
        )
        if xref not in pageno_for_xref:
            pageno_for_xref[xref] = pageno

    # Notifying about strange xrefs.
    if depth == 0 and _strange_xrefs:
        ocrmypdf.optimize.log.warning(
            f"{len(_strange_xrefs)} xrefs on page {pageno} were skipped because "
            "they had already been seen (possible circular references)."
        )

As it turns out, finding suitable variable names is even harder when writing recursive functions. :)
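
The thread does not explain why one PDF defeated the first variant. One plausible cause, stated here as an assumption, is a set of Form XObjects that share a single resources dictionary, so that each one lists all the others. Excluding only the current ancestors then still explores every ordering of the siblings, while marking all siblings as processed before descending (as the second variant does) visits each one once. The sketch below counts recursions on such a toy graph; the function names and the graph are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

MAX_DEPTH = 10  # same fixed limit as in the functions above


def count_path_tracking(
    graph: Dict[str, List[str]], node: str, path: Tuple[str, ...] = (), depth: int = 0
) -> int:
    """Recursions performed when only the current ancestors are excluded."""
    if depth > MAX_DEPTH:
        return 0
    calls = 0
    for child in graph[node]:
        if child in path:
            continue
        calls += 1 + count_path_tracking(graph, child, (*path, child), depth + 1)
    return calls


def count_sibling_marking(
    graph: Dict[str, List[str]],
    node: str,
    processed: FrozenSet[str] = frozenset(),
    depth: int = 0,
) -> int:
    """Recursions performed when all siblings are marked before descending."""
    if depth > MAX_DEPTH:
        return 0
    fresh = [child for child in graph[node] if child not in processed]
    passed_down = processed | set(fresh)
    calls = 0
    for child in fresh:
        calls += 1 + count_sibling_marking(graph, child, passed_down, depth + 1)
    return calls


# Six forms that all reference each other through a shared resources dict.
forms = [f"Fm{i}" for i in range(6)]
graph = {"page": forms, **{name: forms for name in forms}}

print(count_path_tracking(graph, "page"))    # 1956
print(count_sibling_marking(graph, "page"))  # 6
```

The first count grows roughly factorially with the number of mutually referencing forms, which would be consistent with a page that lists a couple of dozen Form XObjects taking practically forever to scan.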
