IndexError: pop from empty list in read_fld_char on unmatched w:fldChar end/separate #168

@pramodavansaber

Description

mammoth.convert_to_html (and any other entry point that exercises body_xml.read_fld_char) crashes with IndexError: pop from empty list when it encounters a <w:fldChar w:fldCharType="end"/> (or "separate") that has no matching prior "begin" element.

The root cause is in mammoth/docx/body_xml.py:

def read_fld_char(element):
    fld_char_type = element.attributes.get("w:fldCharType")
    if fld_char_type == "begin":
        complex_field_stack.append(...)
        ...
    elif fld_char_type == "end":
        complex_field = complex_field_stack.pop()      # <-- line 206
        ...
    elif fld_char_type == "separate":
        complex_field_separate = complex_field_stack.pop()  # <-- line 214
        ...

Both .pop() calls assume the stack is non-empty, which is true for well-formed documents but not guaranteed for arbitrary input. If an end (or separate) fldChar appears while no field is open, for example as the first fldChar in the document, the IndexError leaks to the caller. Such a document could come from a buggy DOCX generator, a hand edit, recovery of a partially corrupted file, or a fragment carved out of a larger document.
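
The failure mode can be shown without mammoth at all. This is a minimal sketch, assuming a simplified stand-in for the reader: `read_fld_char_types` and `crashes` are hypothetical names, and the function only models the stack handling on a list of fldCharType strings, not mammoth's real element objects.

```python
def read_fld_char_types(fld_char_types):
    """Hypothetical, simplified model of the unguarded stack handling.

    Takes w:fldCharType values in document order and returns the stack
    depth left at the end. This is not mammoth's actual code.
    """
    complex_field_stack = []
    for fld_char_type in fld_char_types:
        if fld_char_type == "begin":
            complex_field_stack.append("field")
        elif fld_char_type == "end":
            # Unguarded: raises IndexError if no "begin" came first.
            complex_field_stack.pop()
        elif fld_char_type == "separate":
            # Unguarded pop, as in the snippet above. Pushing the entry
            # back is an assumption of this model so that a later "end"
            # still finds its field.
            complex_field = complex_field_stack.pop()
            complex_field_stack.append(complex_field)
    return len(complex_field_stack)


def crashes(fld_char_types):
    """Return True if the model raises IndexError for this sequence."""
    try:
        read_fld_char_types(fld_char_types)
    except IndexError:
        return True
    return False
```

With this model, `crashes(["end"])` and `crashes(["separate"])` are both true, while a balanced `["begin", "separate", "end"]` sequence runs cleanly.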

This is similar in shape to #158 ('w:ilvl' when parsing malformed docx numbering), which you accepted and fixed by hardening the malformed-input path.

Reproduction

Minimal standalone repro (no template needed — builds the .docx in memory):

import io, zipfile, mammoth

DOCUMENT_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:fldChar w:fldCharType="end"/></w:r>
      <w:r><w:t>Hello</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

CONTENT_TYPES_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

PACKAGE_RELS = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

# Assemble the three package parts into an in-memory .docx (a zip archive).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("[Content_Types].xml", CONTENT_TYPES_XML)
    z.writestr("_rels/.rels", PACKAGE_RELS)
    z.writestr("word/document.xml", DOCUMENT_XML)
buf.seek(0)
mammoth.convert_to_html(buf)  # raises IndexError: pop from empty list

Traceback (HEAD as of 2026-05-24, commit on master):

  File ".../mammoth/docx/body_xml.py", line 206, in read_fld_char
    complex_field = complex_field_stack.pop()
IndexError: pop from empty list

Switching the fldCharType in the repro to "separate" hits the matching crash on line 214.
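
To make switching between the two variants less error-prone, the repro can be wrapped in a small builder. This is a sketch only: `build_docx` is a hypothetical helper name, and it reuses the same three package parts as the repro above with the fldCharType made a parameter.

```python
import io
import zipfile

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

PACKAGE_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)


def build_docx(fld_char_type):
    """Return an in-memory .docx whose only fldChar has the given type.

    Hypothetical helper for this issue; fld_char_type is expected to be
    "end" or "separate" to produce an unmatched field character.
    """
    document_xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="{ns}"><w:body><w:p>'
        '<w:r><w:fldChar w:fldCharType="{fld_type}"/></w:r>'
        '<w:r><w:t>Hello</w:t></w:r>'
        '</w:p></w:body></w:document>'
    ).format(ns=WORD_NS, fld_type=fld_char_type)

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES_XML)
        z.writestr("_rels/.rels", PACKAGE_RELS)
        z.writestr("word/document.xml", document_xml)
    buf.seek(0)
    return buf
```

With mammoth installed, passing `build_docx("end")` or `build_docx("separate")` to `mammoth.convert_to_html` should hit the two crashes described above.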

Suggested fix

Guard both pops:

elif fld_char_type == "end":
    if not complex_field_stack:
        return _empty_result
    complex_field = complex_field_stack.pop()
    ...
elif fld_char_type == "separate":
    if not complex_field_stack:
        return _empty_result
    complex_field_separate = complex_field_stack.pop()
    ...
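
The effect of the guards can be checked on a standalone model of the stack handling. This is a sketch under assumptions, not a patch against mammoth: `read_fld_char_types_guarded` is a hypothetical name, it works on a plain list of fldCharType strings, and it counts skipped elements where the real fix would return `_empty_result`.

```python
def read_fld_char_types_guarded(fld_char_types):
    """Hypothetical model of the guarded stack handling.

    Returns (remaining_stack_depth, skipped_count), where skipped_count is
    the number of unmatched "end"/"separate" values that were ignored.
    """
    complex_field_stack = []
    skipped = 0
    for fld_char_type in fld_char_types:
        if fld_char_type == "begin":
            complex_field_stack.append("field")
        elif fld_char_type in ("end", "separate"):
            if not complex_field_stack:
                # Guard: nothing to match against, so ignore the element
                # instead of letting pop() raise IndexError.
                skipped += 1
                continue
            complex_field = complex_field_stack.pop()
            if fld_char_type == "separate":
                # Assumption of this model: the field stays open until
                # its "end", so the entry is pushed back.
                complex_field_stack.append(complex_field)
    return len(complex_field_stack), skipped
```

An unmatched `["end"]` or `["separate"]` is then skipped rather than raising, and a well-formed `["begin", "separate", "end"]` sequence behaves as before with nothing skipped.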

Happy to send a PR if useful.

Context

Found by tailtest, an adversarial test generator I'm building. Filing on behalf of the run; the issue is reproduced and confirmed independently against current master.
