fix: handle UnicodeDecodeError in CsvConverter when charset detection is inaccurate (fixes #1949) by hanhan761 · Pull Request #1954 · microsoft/markitdown

hanhan761 · 2026-05-30T08:04:38Z

Summary

CsvConverter.decode() decodes CSV files using stream_info.charset, which is detected from only part of the file content. When that detection is inaccurate (e.g., 'ascii' is inferred from the first 4096 bytes, but the rest of the file contains UTF-8 characters), decoding fails with a UnicodeDecodeError.
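
A minimal reproduction of this failure mode, using only the standard library, might look like the sketch below. The sample data and variable names are made up for illustration and are not taken from the issue or from markitdown itself; the 4096-byte prefix mirrors the example above.

```python
# Build a CSV whose first 4096 bytes are pure ASCII, with non-ASCII text later.
header = "id,name\n"
ascii_rows = "".join(f"{i},user{i}\n" for i in range(1000))
data = (header + ascii_rows + "1000,José\n").encode("utf-8")

# A detector that only looks at the prefix sees nothing but ASCII.
prefix = data[:4096]
prefix.decode("ascii")  # succeeds, so 'ascii' looks like a safe guess

# Decoding the whole file with that guess fails once the UTF-8 bytes appear.
try:
    data.decode("ascii")
    failed = False
    error_position = None
except UnicodeDecodeError as exc:
    failed = True
    error_position = exc.start  # byte offset of the first undecodable byte
```

Here the failing byte sits well past the sampled prefix, which is why a charset guessed from the prefix alone cannot be trusted for the full file.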

This fix adds a fallback: if decoding with the detected charset fails, use charset_normalizer for automatic charset detection on the full file content.
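
The fallback described above can be sketched roughly as follows. This is not the PR's actual diff: `decode_with_fallback` is a hypothetical standalone helper, the optional import and the final `errors="replace"` step are assumptions added so the sketch runs even without charset_normalizer installed, and the real change lives inside CsvConverter.

```python
from typing import Optional

try:
    from charset_normalizer import from_bytes
except ImportError:  # assumption: keep the sketch runnable without the dependency
    from_bytes = None


def decode_with_fallback(data: bytes, detected_charset: Optional[str]) -> str:
    """Hypothetical helper: try the detected charset, then re-detect on all bytes."""
    if detected_charset:
        try:
            return data.decode(detected_charset)
        except (UnicodeDecodeError, LookupError):
            # The detected charset was wrong (or unknown); fall through.
            pass

    if from_bytes is not None:
        # Run detection on the full content rather than a prefix.
        best = from_bytes(data).best()
        if best is not None:
            return str(best)

    # Last resort (an assumption of this sketch): never raise, replace bad bytes.
    return data.decode("utf-8", errors="replace")
```

For example, `decode_with_fallback("a,b\n1,2\n".encode("ascii"), "ascii")` takes the fast path and never reaches detection, while a UTF-8 payload labelled 'ascii' drops into the fallback instead of raising.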

This is the same class of bug as #1505 (PlainTextConverter), which was fixed in PR #1938.

Issue

Fixes #1949

Verification

UTF-8 CSV files with non-ASCII characters now decode correctly
Pure ASCII CSV files continue to work as before

fix: handle UnicodeDecodeError in CsvConverter when charset detection…

f429138

… is inaccurate (fixes microsoft#1949)

hanhan761 wants to merge 1 commit into microsoft:main from hanhan761:fix-1949-csv-unicode-decode
