fix: handle UnicodeDecodeError in CsvConverter when charset detection is inaccurate (fixes #1949) #1954

Open
hanhan761 wants to merge 1 commit into microsoft:main from hanhan761:fix-1949-csv-unicode-decode

@hanhan761

Summary

CsvConverter.decode() decodes CSV files using stream_info.charset, which is detected from only part of the file content. When that detected charset is wrong (e.g., 'ascii' inferred from the first 4096 bytes of a file that contains non-ASCII UTF-8 characters further in), decoding the full content fails with UnicodeDecodeError.
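The failure mode can be reproduced with the standard library alone. The snippet below is a minimal sketch, not code from this PR: the `decodes_as` helper and the sample data are hypothetical, and the 4096-byte sample size is taken from the description above.

```python
SAMPLE_SIZE = 4096  # sample length mentioned in the PR description

# Hypothetical CSV payload: more than 4096 bytes of pure ASCII,
# followed by one row containing a non-ASCII character.
data = ("id,name\n" + "1,plain\n" * 1000 + "2,café\n").encode("utf-8")
sample = data[:SAMPLE_SIZE]


def decodes_as(raw: bytes, charset: str) -> bool:
    """Return True if raw decodes cleanly with the given charset."""
    try:
        raw.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# The sample looks like ASCII, so a detector that only sees it may report 'ascii'.
sample_is_ascii = decodes_as(sample, "ascii")
# The full content is not ASCII, so decoding all of it with 'ascii' fails.
full_is_ascii = decodes_as(data, "ascii")
full_is_utf8 = decodes_as(data, "utf-8")
```

Here the sample decodes as ASCII while the full content does not, which is the mismatch that triggers the error.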

This fix adds a fallback: if decoding with the detected charset fails, the converter uses charset_normalizer to detect the charset automatically from the full file content.
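The fallback could look roughly like the sketch below. This is an illustration under assumptions, not the PR's actual diff: `decode_csv_bytes` is a hypothetical helper name, and the final `errors="replace"` path (used when charset_normalizer is not installed or finds no match) is an addition made only so the sketch runs anywhere. The `from_bytes(...).best()` call is charset_normalizer's documented entry point for detecting the encoding of a byte string.

```python
from typing import Optional


def decode_csv_bytes(data: bytes, detected_charset: Optional[str]) -> str:
    """Hypothetical helper: try the detected charset first, then re-detect."""
    if detected_charset:
        try:
            return data.decode(detected_charset)
        except (UnicodeDecodeError, LookupError):
            # The detected charset was wrong (or unknown); fall through.
            pass

    try:
        from charset_normalizer import from_bytes
    except ImportError:
        # Assumption for this sketch only: degrade gracefully without the library.
        return data.decode("utf-8", errors="replace")

    # Detect on the full content rather than a partial sample.
    best = from_bytes(data).best()
    if best is None:
        return data.decode("utf-8", errors="replace")
    return str(best)


# Usage: 'ascii' was guessed from a partial sample, but the file holds UTF-8 text.
payload = ("a,b\n" * 2000 + "café,naïve\n").encode("utf-8")
text = decode_csv_bytes(payload, "ascii")
```

Trying the detected charset first keeps the behavior for correctly detected files unchanged, so the extra detection cost is paid only on the failure path.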

This is the same class of bug as #1505 (PlainTextConverter), which was fixed in PR #1938.

Issue

Fixes #1949

Verification

  • UTF-8 CSV files with non-ASCII characters now decode correctly
  • Pure ASCII CSV files continue to work as before

Development

Successfully merging this pull request may close these issues.

CsvConverter throws UnicodeDecodeError when charset detection from partial content returns 'ascii'