BadZipFile crash on .docx files with case-mismatched zip entry names #1812

@patrick-kidger-bot

Bug

Some .docx files have case mismatches between the zip central directory and local file headers. For example, the central directory lists customXml/item2.xml but the local file header contains customXML/item2.xml. This is technically a violation of the zip spec (filenames are case-sensitive), but it's produced by certain versions of Microsoft Word and other .docx producers, and most zip tools handle it fine.

Python's zipfile module validates this strictly: opening the affected entry raises BadZipFile, which surfaces as:

markitdown._exceptions.FileConversionException: File conversion failed after 1 attempts:
 - DocxConverter threw BadZipFile with message: File name in directory 'customXml/item2.xml' and header b'customXML/item2.xml' differ.
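To check whether a given file is affected, the local header names can be compared against the central directory directly. The following is a minimal diagnostic sketch, not part of markitdown; the helper name `find_name_mismatches` is hypothetical. It relies on `zipfile.ZipFile` parsing only the central directory on construction, and on the standard local file header layout (name length at byte 26, name at byte 30).

```python
import io
import struct
import zipfile


def find_name_mismatches(data: bytes) -> list[tuple[str, bytes]]:
    """Return (central_name, local_name_bytes) for entries whose names differ."""
    mismatches = []
    # Constructing ZipFile only reads the central directory, so this does not raise
    # on archives whose local headers disagree with it.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            off = info.header_offset
            if data[off:off + 4] != b"PK\x03\x04":
                continue  # not a local file header signature
            (name_len,) = struct.unpack_from("<H", data, off + 26)
            local = data[off + 30:off + 30 + name_len]
            # Flag bit 11 (0x800) marks UTF-8 names; otherwise zipfile uses cp437.
            encoding = "utf-8" if info.flag_bits & 0x800 else "cp437"
            if local != info.orig_filename.encode(encoding):
                mismatches.append((info.orig_filename, local))
    return mismatches
```

Calling it on the raw bytes of a failing .docx should list each offending entry, which makes it easier to confirm that the difference really is case-only before patching.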

Suggested fix

In converter_utils/docx/pre_process.py, add a step at the start of pre_process_docx that reads the zip into a bytearray, iterates the central directory entries via zipfile.ZipFile.infolist() (which parses fine, because a local header is only read when its entry is opened), and patches any local file header whose name differs only in case so that it matches the central directory. This is a safe in-memory fix: the central directory is authoritative, and the patch only applies when the names have the same byte length (which is always true for a case-only difference in ASCII paths), so no offsets shift.

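import struct
import zipfile
from io import BytesIO
from typing import BinaryIO

def _fix_zip_name_casing(input_docx: BinaryIO) -> BinaryIO:
    """Patch local-header names that differ only in case from the central directory."""
    input_docx.seek(0)
    raw = bytearray(input_docx.read())
    patched = False
    # ZipFile() parses only the central directory, so this succeeds on affected files.
    with zipfile.ZipFile(BytesIO(raw), mode="r") as zf:
        for info in zf.infolist():
            offset = info.header_offset
            if raw[offset:offset + 4] != b"PK\x03\x04":
                continue  # not a local file header; leave untouched
            # Name length is a little-endian uint16 at byte 26; the name starts at byte 30.
            fname_len = struct.unpack_from("<H", raw, offset + 26)[0]
            local_name = bytes(raw[offset + 30:offset + 30 + fname_len])
            # Flag bit 11 (0x800) marks UTF-8 names; otherwise zipfile decodes as cp437.
            encoding = "utf-8" if info.flag_bits & 0x800 else "cp437"
            # orig_filename is the un-normalized name that zipfile compares against.
            central_name = info.orig_filename.encode(encoding)
            if (local_name != central_name
                    and len(local_name) == len(central_name)
                    and local_name.lower() == central_name.lower()):  # ASCII case-only
                raw[offset + 30:offset + 30 + fname_len] = central_name
                patched = True
    if patched:
        return BytesIO(bytes(raw))
    input_docx.seek(0)
    return input_docx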

Then call input_docx = _fix_zip_name_casing(input_docx) as the first line of pre_process_docx.

Reproduction

Any .docx file where the OPC package has inconsistent casing between local headers and the central directory will trigger this. This is common with files produced by certain legal document systems.
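If no affected file is at hand, the failure mode can be reproduced with a synthetic archive. The sketch below is an assumption-laden stand-in rather than a real .docx: it writes a one-entry zip in memory, then overwrites the name in the local file header so it differs in case from the central directory. The helper names `make_case_mismatched_zip` and `try_read` are hypothetical.

```python
import io
import zipfile


def make_case_mismatched_zip() -> bytes:
    """Build a one-entry zip whose local header name differs in case from the central directory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("customXml/item2.xml", "<item/>")
    raw = bytearray(buf.getvalue())
    good = b"customXml/item2.xml"
    bad = b"customXML/item2.xml"
    # The only local file header starts at offset 0; its name starts at byte 30.
    assert raw[30:30 + len(good)] == good
    raw[30:30 + len(bad)] = bad
    return bytes(raw)


def try_read(data: bytes) -> str:
    """Return the BadZipFile message raised when reading the entry, or "ok"."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:  # central directory parses fine
        try:
            zf.read("customXml/item2.xml")
        except zipfile.BadZipFile as exc:
            return str(exc)
    return "ok"


if __name__ == "__main__":
    print(try_read(make_case_mismatched_zip()))
```

Because the archive contains none of the parts a Word document needs, it only exercises the zipfile error itself; a regression test for DocxConverter would need the same header edit applied to a real .docx.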
