Incorrect parsing of Unicode smart quotes from .docx files #1219

@josh-b-2210

Bug: Incorrect parsing of Unicode smart quotes from .docx files

When using MarkItDown to convert .docx files created by Microsoft Word (default settings, smart quotes enabled), Unicode characters such as:

  • Apostrophes (’ U+2019)
  • Left double quotes (“ U+201C)
  • Right double quotes (” U+201D)

are incorrectly parsed and appear in the Markdown output as corrupted characters like Æ, ô, ö.
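
One possible explanation, offered as a hypothesis and not a confirmed diagnosis: Æ, ô, ö are what you get when these three characters are encoded as Windows-1252 and the resulting bytes are read back as code page 437, a legacy OEM console code page on Windows. The sketch below checks that mapping with Python's built-in codecs only. The helper name `mojibake` and the choice of code pages are assumptions made for illustration; nothing here inspects MarkItDown itself.

```python
# Hypothesis check: do the reported characters match Windows-1252 bytes
# misread as code page 437? Uses only Python's built-in codecs.

SMART_PUNCTUATION = {
    "apostrophe": "\u2019",
    "left double quote": "\u201C",
    "right double quote": "\u201D",
}


def mojibake(ch: str, written_as: str = "cp1252", read_as: str = "cp437") -> str:
    """Encode `ch` with one code page and decode the bytes with another."""
    return ch.encode(written_as).decode(read_as)


for name, ch in SMART_PUNCTUATION.items():
    # Show the code point next to the character produced by the mismatch.
    print(f"{name}: U+{ord(ch):04X} -> {mojibake(ch)!r}")
```

If the hypothesis holds, the three printed results are Æ, ô and ö, which would point to an encoding mismatch when the output is written or displayed and not to a fault in reading the .docx contents.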

Steps to Reproduce:

  1. Create a new .docx in Word with smart quotes enabled (default setting).
  2. Add text such as: It’s important to “quote” text properly.
  3. Run MarkItDown to convert the .docx to .md.
  4. Observe corrupted characters in the output.

Expected Behavior:
Smart punctuation should either:

  • Be preserved correctly as Unicode characters, or
  • Be flattened gracefully to ASCII equivalents (' and ").
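
The second option (flattening to ASCII) could look roughly like the following. This is a minimal sketch, not MarkItDown code: the function name `flatten_smart_quotes` is hypothetical, and the left single quote (U+2018) is included as an assumption since it usually accompanies the three characters listed above.

```python
# Sketch of an ASCII fallback for smart punctuation.
# The table covers the characters named in this issue plus U+2018 (assumed).
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quote (assumption, not listed in the issue)
    "\u2019": "'",   # apostrophe / right single quote
    "\u201C": '"',   # left double quote
    "\u201D": '"',   # right double quote
})


def flatten_smart_quotes(text: str) -> str:
    """Replace smart quotes with their plain ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)


sample = "It\u2019s important to \u201Cquote\u201D text properly."
print(flatten_smart_quotes(sample))
```

All other characters, including other non-ASCII text, pass through unchanged.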

Actual Behavior:
Corrupted non-ASCII characters (Æ, ô, ö) appear in the Markdown output in place of the smart punctuation.

Workarounds:

  • Disabling smart quotes in Word avoids the issue.
  • Alternative tools like Pandoc handle .docx smart punctuation correctly.
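
Another workaround that may be worth trying, assuming the corruption happens when the converted text is written out or displayed (unverified here): write the Markdown to a file with an explicit UTF-8 encoding instead of relying on the platform default. The helper names `save_markdown` and `load_markdown` are hypothetical and the sample string stands in for real converter output.

```python
import tempfile
from pathlib import Path


def save_markdown(text: str, path: str) -> None:
    """Write Markdown as UTF-8 regardless of the platform default encoding."""
    Path(path).write_text(text, encoding="utf-8")


def load_markdown(path: str) -> str:
    """Read the file back with the same explicit encoding."""
    return Path(path).read_text(encoding="utf-8")


# Stand-in for converter output; a real run would pass the converted text.
sample = "It\u2019s important to \u201Cquote\u201D text properly."

with tempfile.TemporaryDirectory() as tmp:
    out_path = str(Path(tmp) / "output.md")
    save_markdown(sample, out_path)
    # The smart quotes should survive the round trip unchanged.
    print(load_markdown(out_path) == sample)
```

In practice the text passed to `save_markdown` would be the converter's result, for example the `text_content` of a `MarkItDown().convert(...)` call. If the file written this way contains the correct quotes, the problem lies in the output encoding and not in the .docx parsing.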

Environment:

  • MarkItDown version: 0.1.1
  • Python version: 3.12
  • OS: Windows 11
