fix: handle UnicodeDecodeError in PlainTextConverter when charset detection is inaccurate (fixes #1505) by hanhan761 · Pull Request #1938 · microsoft/markitdown

hanhan761 · 2026-05-30T07:48:16Z

Summary

stream_info.charset is detected from partial file content (the first 4096 bytes). If that sample is pure ASCII, the detector returns 'ascii' even when the rest of the file contains non-ASCII characters (e.g., Spanish accents encoded as UTF-8). Decoding the full content with .decode('ascii') then raises UnicodeDecodeError.
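The failure mode can be reproduced with the standard library alone. The snippet below is a minimal sketch: the sample bytes, `PROBE_SIZE`, and `decodes_as_ascii` are hypothetical names for illustration and are not taken from markitdown.

```python
# Assumption: the detector only sees the first 4096 bytes, as described above.
PROBE_SIZE = 4096

# Hypothetical sample: more than 4096 bytes of ASCII, then Spanish text in UTF-8.
content = b"plain ascii line\n" * 300 + "canción, año, niño".encode("utf-8")


def decodes_as_ascii(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# The probe window looks like ASCII, but the full content does not decode.
probe_ok = decodes_as_ascii(content[:PROBE_SIZE])
full_ok = decodes_as_ascii(content)
```

Here `probe_ok` is true while `full_ok` is false, which is the mismatch that leads to the reported error.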

This fix adds a fallback: if decoding with the detected charset fails, the converter falls back to charset_normalizer, which runs automatic charset detection on the full file content.
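One way such a fallback could look is sketched below. This is not the PR's actual diff: `decode_with_fallback` is a hypothetical helper, the guarded import and the final `errors="replace"` step are assumptions added so the sketch runs even without the package installed. The only library calls used are `charset_normalizer.from_bytes(...)` and `.best()`.

```python
from typing import Optional

try:
    from charset_normalizer import from_bytes
except ImportError:  # assumption: keep the sketch runnable without the package
    from_bytes = None


def decode_with_fallback(data: bytes, detected_charset: Optional[str]) -> str:
    """Decode with the detected charset; re-detect on the full content if that fails."""
    if detected_charset:
        try:
            return data.decode(detected_charset)
        except UnicodeDecodeError:
            pass  # the guess from partial content was wrong; fall through

    if from_bytes is not None:
        best = from_bytes(data).best()  # detection over the full byte string
        if best is not None:
            return str(best)

    # Last resort (an assumption of this sketch): do not raise, mark bad bytes.
    return data.decode("utf-8", errors="replace")


# Hypothetical sample: ASCII beyond the probe window, then UTF-8 Spanish text.
sample = b"plain ascii line\n" * 300 + "canción, año, niño".encode("utf-8")
text = decode_with_fallback(sample, "ascii")
```

Pure ASCII input never reaches the fallback branch, so its behaviour is unchanged; only the failing decode pays the cost of a second detection pass.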

Issue

Fixes #1505

Verification

UTF-8 files with Spanish characters now decode correctly
Pure ASCII files continue to work as before
Files where charset detection from partial content returns 'ascii' but full content has UTF-8 characters fall back gracefully

Commit 83f82d4: fix: handle UnicodeDecodeError in PlainTextConverter when charset detection is inaccurate (fixes microsoft#1505)

hanhan761 wants to merge 1 commit into microsoft:main from hanhan761:fix-1505-plaintext-unicode-decode
