fix: write UTF-8 to stdout.buffer to avoid UnicodeEncodeError on non-UTF-8 systems #1790

Open
octo-patch wants to merge 1 commit into microsoft:main from octo-patch:fix/issue-1788-unicode-stdout-encoding

@octo-patch
Fixes #1788

Problem

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese Windows), running:

markitdown file.pdf > output.md

raises a UnicodeEncodeError:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 140: illegal multibyte sequence

The previous approach of encoding to sys.stdout.encoding with errors='replace' had two remaining issues:

  1. sys.stdout.encoding can be None when stdout is a raw pipe, causing a TypeError instead of a graceful failure.
  2. Characters are silently replaced with '?' (lossy output), which is undesirable when redirecting to a file.
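Both issues, and the original error, can be reproduced on any platform by encoding a string by hand. The snippet below is only an illustrative sketch: it uses Python's built-in `gbk` codec to stand in for a GBK-locale stdout, and the sample string is made up for the demonstration.

```python
text = "\u2022 item"  # sample string containing the bullet from the traceback

# Strict encoding to GBK fails, which is the error reported in the issue.
try:
    text.encode("gbk")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Issue 1: a None encoding raises TypeError rather than failing gracefully.
try:
    text.encode(None, errors="replace")
    none_failed = False
except TypeError:
    none_failed = True

# Issue 2: errors='replace' avoids the exception but loses the character.
lossy = text.encode("gbk", errors="replace")

# UTF-8 can represent the character, so nothing is lost.
utf8 = text.encode("utf-8")

print(strict_failed, none_failed, lossy, utf8)
```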

Solution

Write UTF-8 encoded bytes directly to sys.stdout.buffer when available. This produces lossless UTF-8 output regardless of the system locale encoding, consistent with the behavior of the -o/--output flag (which already writes UTF-8 explicitly).

A safe fallback handles the rare case where stdout.buffer is not available (e.g. some embedded or wrapped stdout objects), using the locale encoding with errors='replace' and guarding against None encoding.
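The approach described above might look roughly like the following. This is a minimal sketch, not the actual diff: the helper name `write_text_utf8`, its `stream` parameter, and the choice of UTF-8 as the default when the encoding is None are assumptions made for illustration.

```python
import io
import sys


def write_text_utf8(text, stream=None):
    """Hypothetical helper: write text as UTF-8 bytes when the stream allows it."""
    stream = sys.stdout if stream is None else stream
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        # Flush pending text first so it is not reordered behind the raw bytes.
        stream.flush()
        buffer.write(text.encode("utf-8"))
        buffer.flush()
        return
    # Fallback: no binary buffer, so write text in the stream's own encoding.
    # Guard against a None encoding (assumed default here: UTF-8).
    encoding = getattr(stream, "encoding", None) or "utf-8"
    safe = text.encode(encoding, errors="replace").decode(encoding)
    stream.write(safe)


# A text stream configured for GBK still receives lossless UTF-8 bytes.
gbk_stream = io.TextIOWrapper(io.BytesIO(), encoding="gbk")
write_text_utf8("\u2022 item", gbk_stream)
raw = gbk_stream.buffer.getvalue()

# io.StringIO has no .buffer and its encoding is None, so it takes the fallback.
plain = io.StringIO()
write_text_utf8("\u2022 item", plain)
```

Flushing the text layer before writing to the underlying buffer keeps earlier `print` output in order relative to the raw bytes.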

Testing

  • Verified the fix handles the sys.stdout.encoding is None case without raising TypeError
  • Verified lossless UTF-8 output when redirecting (> file.md) on systems with non-UTF-8 locale encoding

…UTF-8 systems

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese
Windows), running `markitdown file.pdf > output.md` raises:

  UnicodeEncodeError: 'gbk' codec can't encode character '\u2022'

Two problems existed in the previous approach of encoding to
sys.stdout.encoding with errors='replace':
1. sys.stdout.encoding can be None when stdout is a raw pipe,
   causing a TypeError.
2. Characters are silently replaced with '?' (lossy output), which
   is undesirable when redirecting to a file.

Fix by writing UTF-8 encoded bytes directly to sys.stdout.buffer
when available. This produces lossless UTF-8 output regardless of
the system locale, matching the behaviour of the -o/--output flag.
A safe fallback handles the rare case where stdout.buffer is absent.

Fixes microsoft#1788