RecursionError: process_element / process_tag mutual recursion (infinite loop) #256

@benmcmechan

Description

When converting HTML that contains circular parent-child references in the parsed BeautifulSoup tree (produced by certain PDF-to-HTML pipelines), process_element and process_tag recurse infinitely, crashing with a RecursionError.

Encountered when using markdownify via marker-pdf to convert PDF documents. The HTML produced by the PDF parser contained structures that caused BeautifulSoup's html.parser to create a non-tree graph in which a descendant node held a reference back to an ancestor. Traversing that graph never terminates, so the call stack grows until Python's recursion limit is hit.

The mutual recursion introduced in 1.2.2 between process_element and process_tag has no cycle guard:

  • process_tag iterates node.children and calls process_element for each child
  • process_element calls process_tag for any Tag node

RecursionError: maximum recursion depth exceeded

File "markdownify/__init__.py", line 232, in process_element
    return self.process_tag(node, parent_tags=parent_tags)
File "markdownify/__init__.py", line 287, in process_tag
    child_strings = [
File "markdownify/__init__.py", line 288, in <listcomp>
    self.process_element(el, parent_tags=parent_tags_for_children)
File "markdownify/__init__.py", line 232, in process_element
    return self.process_tag(node, parent_tags=parent_tags)
... (repeating until stack exhausted)
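
The failure mode can be illustrated without markdownify or BeautifulSoup. The following is a minimal sketch, not the library's code: the Node class, build_cyclic_graph and crashes_with_recursion_error are hypothetical names invented for this example, and the two process_* functions only mimic the shape of the mutual recursion described above.

```python
class Node:
    """Hypothetical stand-in for a parsed tag: a name plus a list of children."""

    def __init__(self, name):
        self.name = name
        self.children = []


def process_element(node):
    # Plain strings are leaves; anything else is handed to process_tag.
    if isinstance(node, str):
        return node
    return process_tag(node)


def process_tag(node):
    # No cycle guard: every child is processed unconditionally.
    return "".join(process_element(child) for child in node.children)


def build_cyclic_graph():
    # Build div -> span -> div, i.e. a descendant that points back to its ancestor.
    root = Node("div")
    child = Node("span")
    root.children.append("text")
    root.children.append(child)
    child.children.append(root)
    return root


def crashes_with_recursion_error(node):
    # Report whether the unguarded traversal exhausts the call stack.
    try:
        process_element(node)
    except RecursionError:
        return True
    return False
```

Calling crashes_with_recursion_error(build_cyclic_graph()) exercises the same alternating call pattern as the traceback above, while an ordinary tree is converted normally.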

Suggested fix:

Pass a visited set of node ids through the call chain to detect and break cycles:

def process_element(self, node, parent_tags=None, _visited=None):
    if isinstance(node, NavigableString):
        return self.process_text(node, parent_tags=parent_tags)
    else:
        return self.process_tag(node, parent_tags=parent_tags, _visited=_visited)

def process_tag(self, node, parent_tags=None, _visited=None):
    if parent_tags is None:
        parent_tags = set()

    # Cycle detection: return an empty string for any node already seen in this traversal
    if _visited is None:
        _visited = set()
    node_id = id(node)
    if node_id in _visited:
        return ''
    _visited.add(node_id)

    # ... rest of method unchanged, but pass _visited= to process_element calls
    child_strings = [
        self.process_element(el, parent_tags=parent_tags_for_children, _visited=_visited)
        for el in children_to_convert
    ]
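
To show the guard terminating on a cyclic graph, here is a self-contained sketch that applies the same visited-set idea to a toy structure. The Node class and the graph are assumptions made for illustration only; the real methods take additional arguments and do far more work per node.

```python
class Node:
    """Hypothetical stand-in for a parsed tag: a name plus a list of children."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])


def process_element(node, _visited=None):
    # Plain strings are leaves; anything else goes through the guarded process_tag.
    if isinstance(node, str):
        return node
    return process_tag(node, _visited=_visited)


def process_tag(node, _visited=None):
    # The outermost call creates the set; nested calls share it.
    if _visited is None:
        _visited = set()
    node_id = id(node)
    if node_id in _visited:
        # Node already seen in this traversal: break the cycle.
        return ""
    _visited.add(node_id)
    return "".join(
        process_element(child, _visited=_visited) for child in node.children
    )


# div contains "a" and span; span contains "b" and a reference back to div.
root = Node("div", ["a"])
child = Node("span", ["b"])
root.children.append(child)
child.children.append(root)

result = process_element(root)  # terminates; the back-reference contributes ""
```

One design point worth weighing: a set that is only ever added to suppresses any second visit to the same object, not just visits that form a cycle. If the same node could legitimately appear under two different parents, discarding its id once its children have been processed would limit the guard to ancestors on the current path.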
