fix: fall back to charset_normalizer when detected charset fails to decode (fixes #1949) #1956

Open
hanhan761 wants to merge 1 commit into microsoft:main from hanhan761:fix-1949-csv-charset-fallback

@hanhan761

Summary

Fixes UnicodeDecodeError in CsvConverter and PlainTextConverter when stream_info.charset reports a charset (e.g., 'ascii') that doesn't match the actual file content.

Root Cause

Both converters call .decode(stream_info.charset) on the full file content. When charset detection runs on a partial file prefix (first 4096 bytes) that happens to be ASCII-only but the full file contains non-ASCII bytes (e.g., UTF-8), the decode fails with UnicodeDecodeError.
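The failure mode can be reproduced with a minimal sketch. It does not call the project's real detector: `guess_charset_from_prefix` is a hypothetical stand-in that only inspects the first 4096 bytes, and the sample CSV content is made up for illustration.

```python
# Hypothetical reproduction of the root cause; not the project's actual code.
PREFIX_SIZE = 4096  # prefix length mentioned in the description above


def guess_charset_from_prefix(data: bytes, prefix_size: int = PREFIX_SIZE) -> str:
    """Stand-in detector: reports 'ascii' if the prefix is pure ASCII."""
    prefix = data[:prefix_size]
    try:
        prefix.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        return "utf-8"


# 8000 ASCII bytes followed by one row containing a non-ASCII character.
raw = b"a,b\n" * 2000 + "name,café\n".encode("utf-8")

charset = guess_charset_from_prefix(raw)  # the prefix is ASCII-only

try:
    raw.decode(charset)  # decoding the full content with the prefix's charset
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

Here the detector only ever sees ASCII, so the later `é` bytes trigger the `UnicodeDecodeError` once the full content is decoded.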

Changes

Both files also now read raw bytes once instead of calling file_stream.read() twice.
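A rough sketch of the fallback described by the PR title follows. The diff itself is not shown here, so this is an assumption about its shape rather than the actual change: `decode_with_fallback` is a hypothetical helper name, and the final `errors="replace"` step is added only so the sketch still runs when charset_normalizer is not installed.

```python
import io

try:
    from charset_normalizer import from_bytes
except ImportError:  # keeps the sketch runnable without the package
    from_bytes = None


def decode_with_fallback(file_stream, charset):
    """Hypothetical helper: decode with the reported charset, else re-detect."""
    raw = file_stream.read()  # read the stream once and reuse the bytes

    if charset:
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            pass  # reported charset was wrong or unknown; fall through

    if from_bytes is not None:
        best = from_bytes(raw).best()  # detection over the full content
        if best is not None:
            return str(best)

    # Last resort (an assumption of this sketch, not taken from the PR).
    return raw.decode("utf-8", errors="replace")


data = b"a,b\n" * 2000 + "name,café\n".encode("utf-8")
text = decode_with_fallback(io.BytesIO(data), "ascii")
```

Reading the bytes once matters because a second `read()` on the same stream returns nothing unless the stream is rewound first.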

Issue

Fixes #1949

Verification

Development

Successfully merging this pull request may close these issues.

CsvConverter throws UnicodeDecodeError when charset detection from partial content returns 'ascii'