fix: handle UnicodeDecodeError in PlainTextConverter when charset detection is inaccurate (fixes #1505) #1938

Open
hanhan761 wants to merge 1 commit into
microsoft:mainfrom
hanhan761:fix-1505-plaintext-unicode-decode


@hanhan761

Summary

When stream_info.charset is detected from partial file content (the first 4096 bytes), detection can return 'ascii' even though the rest of the file contains non-ASCII bytes (e.g., UTF-8-encoded Spanish accented characters). Decoding the full content with .decode('ascii') then raises UnicodeDecodeError.
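The failure mode can be reproduced with the standard library alone. The snippet below is a minimal sketch, not code from this PR: the payload and variable names are made up for illustration, and the 4096-byte sample size is taken from the description above.

```python
# Hypothetical payload: 4096 ASCII bytes followed by UTF-8 encoded Spanish text.
payload = b"a" * 4096 + "señor".encode("utf-8")

# A detector that only sees the first 4096 bytes has no non-ASCII evidence,
# so decoding the sample as ASCII succeeds.
sample = payload[:4096]
sample_text = sample.decode("ascii")

# Decoding the full content with the same charset fails on the first
# non-ASCII byte that appears after the sampled prefix.
error_message = None
try:
    payload.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The printed message starts with the same "'ascii' codec can't decode byte" text reported in the linked issue.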

This fix adds a fallback: if decoding with the detected charset raises UnicodeDecodeError, the converter runs charset_normalizer on the full file content to detect the charset automatically.
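The fallback could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual diff: decode_with_fallback is a hypothetical helper name, and the final UTF-8 decode with replacement characters (used when charset_normalizer finds no match) is an added assumption that the PR description does not mention. The charset_normalizer calls used (from_bytes, best, and str() on the match) are part of that library's public API.

```python
from typing import Optional

from charset_normalizer import from_bytes


def decode_with_fallback(data: bytes, charset: Optional[str]) -> str:
    """Hypothetical helper: try the detected charset, then re-detect on all bytes."""
    if charset:
        try:
            return data.decode(charset)
        except UnicodeDecodeError:
            # The charset was guessed from a partial sample and does not
            # fit the full content; fall through to full detection.
            pass

    # Run detection over the complete content instead of a prefix.
    match = from_bytes(data).best()
    if match is not None:
        return str(match)

    # Assumption for this sketch only: never raise, replace undecodable bytes.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    content = ("El niño comió piña por la mañana. " * 200).encode("utf-8")
    print(decode_with_fallback(content, "ascii")[:34])
```

Trying the detected charset first keeps the common case cheap, since pure ASCII files never reach the full-content detection step.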

Issue

Fixes #1505

Verification

  • UTF-8 files with Spanish characters now decode correctly
  • Pure ASCII files continue to work as before
  • Files where charset detection from partial content returns 'ascii' but the full content has UTF-8 characters now fall back to full-content detection instead of raising


Development

Successfully merging this pull request may close these issues.

Bug with Spanish symbols: PlainTextConverter threw UnicodeDecodeError with message: 'ascii' codec can't decode byte
