fix: fall back to charset_normalizer when detected charset fails to decode (fixes #1949) by hanhan761 · Pull Request #1956 · microsoft/markitdown

hanhan761 · 2026-05-30T08:05:37Z

Summary

Fixes UnicodeDecodeError in CsvConverter and PlainTextConverter when stream_info.charset reports a charset (e.g., 'ascii') that doesn't match the actual file content.

Root Cause

Both converters call .decode(stream_info.charset) on the full file content. Charset detection, however, runs only on a prefix of the file (the first 4096 bytes). If that prefix happens to be ASCII-only while the rest of the file contains non-ASCII bytes (e.g., UTF-8 sequences), the reported charset is 'ascii' and decoding the full content raises UnicodeDecodeError.
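The failure mode can be reproduced without markitdown at all. The following is a minimal sketch, not the project's detection code: the `prefix.isascii()` check is a hypothetical stand-in for whatever detector produces stream_info.charset, and the sample content is made up for illustration.

```python
# Content whose first 4096 bytes are pure ASCII, followed by UTF-8 encoded
# non-ASCII text. The sample data is illustrative only.
PREFIX_LEN = 4096
content = b"a" * PREFIX_LEN + "señal,año\n".encode("utf-8")

# Hypothetical stand-in for charset detection that only sees the prefix.
prefix = content[:PREFIX_LEN]
guessed_charset = "ascii" if prefix.isascii() else "utf-8"

# Decoding the FULL content with the prefix-based guess fails, because the
# non-ASCII bytes sit beyond the region the detector looked at.
try:
    content.decode(guessed_charset)
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(f"decode failed: {exc}")
```

The guess is correct for the bytes that were inspected and wrong for the file as a whole, which is why the error only shows up on larger inputs.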

Changes

_csv_converter.py: Wrap .decode() in try/except, fall back to charset_normalizer on failure
_plain_text_converter.py: Same fix (same class of bug as #1505, "Bug with Spanish symbols: PlainTextConverter threw UnicodeDecodeError with message: 'ascii' codec can't decode byte")

Both files also now read raw bytes once instead of calling file_stream.read() twice.
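The shape of the change described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `decode_with_fallback` and `_normalizer_fallback` are hypothetical names, the behaviour when no charset is reported is assumed, and the fallback is injectable only so the example can be exercised without the third-party charset_normalizer package installed.

```python
from typing import Callable, Optional


def _normalizer_fallback(raw: bytes) -> str:
    """Decode by letting charset_normalizer inspect the full content."""
    # Third-party dependency, imported lazily so the module loads without it.
    from charset_normalizer import from_bytes

    match = from_bytes(raw).best()
    if match is None:
        raise ValueError("charset_normalizer could not identify an encoding")
    return str(match)


def decode_with_fallback(
    raw: bytes,
    charset: Optional[str],
    fallback: Callable[[bytes], str] = _normalizer_fallback,
) -> str:
    """Try the reported charset first; fall back if it cannot decode `raw`."""
    if charset:
        try:
            # Fast path: the reported charset is usually right.
            return raw.decode(charset)
        except UnicodeDecodeError:
            # The reported charset did not match the full content.
            pass
    # Assumption: a missing charset is handled the same way as a failed one.
    return fallback(raw)
```

A converter would then read the stream once, for example `raw = file_stream.read()`, and pass `raw` together with `stream_info.charset` to a helper like this, rather than reading the stream a second time for the fallback.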

Issue

Fixes #1949

Verification

Reproduction case from the issue now succeeds
ASCII and known-charset files continue to use the fast path
Same fix pattern verified in PR #1938 ("fix: handle UnicodeDecodeError in PlainTextConverter when charset detection is inaccurate (fixes #1505)") for PlainTextConverter (issue #1505)

Commit d29a3da: fix: fall back to charset_normalizer when detected charset fails to decode CSV/plain text (fixes microsoft#1949)

hanhan761 wants to merge 1 commit into microsoft:main from hanhan761:fix-1949-csv-charset-fallback
