Bug
Some .docx files have case mismatches between the zip central directory and local file headers. For example, the central directory lists customXml/item2.xml but the local file header contains customXML/item2.xml. This is technically a violation of the zip spec (filenames are case-sensitive), but it's produced by certain versions of Microsoft Word and other .docx producers, and most zip tools handle it fine.
Python's zipfile module strictly validates this and raises BadZipFile:
markitdown._exceptions.FileConversionException: File conversion failed after 1 attempts:
- DocxConverter threw BadZipFile with message: File name in directory 'customXml/item2.xml' and header b'customXML/item2.xml' differ.
Suggested fix
In converter_utils/docx/pre_process.py, add a step at the start of pre_process_docx that reads the zip into a bytearray, iterates the central directory entries via zipfile.ZipFile.infolist() (which parses fine), and patches any local file headers whose names differ only in case to match the central directory. This is a safe in-memory fix — the central directory is authoritative, and the patch only applies when the names have the same byte length (which is always true for a case-only difference in ASCII paths).
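import struct
import zipfile
from io import BytesIO
from typing import BinaryIO


def _fix_zip_name_casing(input_docx: BinaryIO) -> BinaryIO:
    """Patch local header names that differ only in case from the central directory."""
    input_docx.seek(0)
    raw = bytearray(input_docx.read())
    patched = False
    with zipfile.ZipFile(BytesIO(raw), mode="r") as zf:
        for info in zf.infolist():
            offset = info.header_offset
            # Skip entries whose offset does not point at a local file header.
            if raw[offset:offset + 4] != b"PK\x03\x04":
                continue
            # Name length is a little-endian uint16 at byte 26; the name starts at byte 30.
            fname_len = struct.unpack_from("<H", raw, offset + 26)[0]
            local_name = bytes(raw[offset + 30:offset + 30 + fname_len])
            central_name = info.filename.encode("utf-8")
            # Patch only equal-length names that match once ASCII case is ignored.
            if (
                local_name != central_name
                and len(local_name) == len(central_name)
                and local_name.lower() == central_name.lower()
            ):
                raw[offset + 30:offset + 30 + fname_len] = central_name
                patched = True
    if patched:
        return BytesIO(bytes(raw))
    input_docx.seek(0)
    return input_docx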
Then call input_docx = _fix_zip_name_casing(input_docx) as the first line of pre_process_docx.
Reproduction
Any .docx file where the OPC package has inconsistent casing between local headers and the central directory will trigger this. This is common with files produced by certain legal document systems.
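If no affected .docx is at hand, the failure can probably be reproduced with a synthetic archive. The sketch below is not a real .docx; it is a one-entry zip whose local header name is edited after writing so that it differs in case from the central directory. The helper names make_mismatched_zip and try_read are hypothetical and exist only for this illustration.

```python
import io
import zipfile


def make_mismatched_zip() -> bytes:
    """Build a one-entry zip, then change the case of its local header name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w") as zf:
        zf.writestr("customXml/item2.xml", "<root/>")
    raw = bytearray(buf.getvalue())
    # The first local file header sits at offset 0; its name starts at byte 30.
    name = b"customXml/item2.xml"
    assert raw[30:30 + len(name)] == name
    # Same length, different case; the central directory is left untouched.
    raw[30:30 + len(name)] = b"customXML/item2.xml"
    return bytes(raw)


def try_read(data: bytes) -> str:
    """Return 'ok' if the first entry can be read, else the BadZipFile message."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # infolist() only consults the central directory, so it succeeds.
        info = zf.infolist()[0]
        try:
            zf.read(info)
        except zipfile.BadZipFile as exc:
            return f"BadZipFile: {exc}"
    return "ok"


if __name__ == "__main__":
    print(try_read(make_mismatched_zip()))
```

Listing the entries succeeds, and the error only appears when an entry is opened. That is why the suggested fix can rely on infolist() to locate the headers it needs to patch.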