fix: write UTF-8 to stdout.buffer to avoid UnicodeEncodeError on non-UTF-8 systems by octo-patch · Pull Request #1790 · microsoft/markitdown

octo-patch · 2026-04-17T03:32:22Z

Problem

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese Windows), running:

markitdown file.pdf > output.md

raises a UnicodeEncodeError:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 140: illegal multibyte sequence

The previous approach of encoding to sys.stdout.encoding with errors='replace' had two remaining issues:

1. sys.stdout.encoding can be None when stdout is a raw pipe, causing a TypeError instead of a graceful failure.
2. Characters are silently replaced with '?' (lossy output), which is undesirable when redirecting to a file.
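
The failure modes above can be reproduced with plain str.encode, independent of markitdown. This is only an illustrative sketch: the previous markitdown code is not shown in this PR, so the calls below are an assumption about the kind of encoding step involved, not a copy of it.

```python
# Illustration only: reproduces the described failure modes with str.encode.
text = "\u2022 item"  # U+2022 BULLET, the character from the traceback

# Strict encoding to GBK fails, as in the reported traceback.
strict_error = None
try:
    text.encode("gbk")
except UnicodeEncodeError as exc:
    strict_error = exc

# errors='replace' avoids the exception but substitutes '?' for the bullet.
lossy = text.encode("gbk", errors="replace")

# A None encoding (as a stream may report) raises TypeError.
none_error = None
try:
    text.encode(None, errors="replace")
except TypeError as exc:
    none_error = exc
```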

Solution

Write UTF-8 encoded bytes directly to sys.stdout.buffer when available. This produces lossless UTF-8 output regardless of the system locale encoding, consistent with the behavior of the -o/--output flag (which already writes UTF-8 explicitly).

A safe fallback handles the rare case where stdout.buffer is not available (e.g. some embedded or wrapped stdout objects), using the locale encoding with errors='replace' and guarding against None encoding.
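
A minimal sketch of the approach described above, not the actual diff from this PR. The helper name write_utf8 and its stream parameter are hypothetical, introduced only for illustration; defaulting to 'utf-8' when the encoding is None is likewise an assumption about how the guard might look.

```python
import io
import sys


def write_utf8(text, stream=None):
    """Hypothetical helper: emit text as lossless UTF-8 bytes when possible."""
    stream = sys.stdout if stream is None else stream

    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        # Flush pending text first so byte and text writes stay in order.
        stream.flush()
        buffer.write(text.encode("utf-8"))
        buffer.flush()
        return

    # Fallback for stdout objects without .buffer: use the stream's own
    # encoding, guarding against None (assumed default: utf-8), and replace
    # characters that encoding cannot represent.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    stream.write(text.encode(encoding, errors="replace").decode(encoding))
```

Calling write_utf8(markdown_text) with no stream writes to sys.stdout. Passing an io.TextIOWrapper configured for GBK still yields UTF-8 bytes in the underlying buffer, while an io.StringIO (which has no buffer and reports encoding None) goes through the fallback path.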

Testing

Verified the fix handles the sys.stdout.encoding is None case without raising TypeError
Verified lossless UTF-8 output when redirecting (> file.md) on systems with non-UTF-8 locale encoding

…UTF-8 systems

On Windows systems with a non-UTF-8 locale (e.g. GBK on Chinese Windows), running `markitdown file.pdf > output.md` raises:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022'

Two problems existed in the previous approach of encoding to sys.stdout.encoding with errors='replace':

1. sys.stdout.encoding can be None when stdout is a raw pipe, causing a TypeError.
2. Characters are silently replaced with '?' (lossy output), which is undesirable when redirecting to a file.

Fix by writing UTF-8 encoded bytes directly to sys.stdout.buffer when available. This produces lossless UTF-8 output regardless of the system locale, matching the behaviour of the -o/--output flag. A safe fallback handles the rare case where stdout.buffer is absent.

Fixes microsoft#1788

octo-patch wants to merge 1 commit into microsoft:main from octo-patch:fix/issue-1788-unicode-stdout-encoding
