cp1252 encoding not detected in this case #23997
@skest3qc can you attach the file here? If the file does not contain enough characters that point to cp1252, the detector will not report any encoding and we fall back to UTF-8.
Create a utf8.txt with the content "謋 鰊", save it with encoding GB18030, then reopen it with encoding Windows-1252. Maybe we can use the ANSI encoding on Windows when the file is not UTF-8 without BOM (decoding it as UTF-8 produces �) or another Unicode encoding, and the encoding cannot be guessed.
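The round trip in that repro can be sketched in a few lines (a minimal illustration, not VS Code's code): bytes written as GB18030 and then reinterpreted as Windows-1252 produce mojibake rather than an error, because cp1252 can decode almost any byte sequence.

```python
# The two CJK characters from the repro above.
text = "謋 鰊"

# Bytes as saved by the editor with the GB18030 encoding.
raw = text.encode("gb18030")

# Reopening the same bytes as Windows-1252 silently "succeeds" with
# garbage; the few bytes cp1252 cannot map become replacement chars.
garbled = raw.decode("cp1252", errors="replace")
print(garbled)  # mojibake, not the original characters
```

This is why guessing matters: unlike UTF-8, cp1252 rarely rejects input, so a wrong guess never fails loudly.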
I can reproduce. We are using https://github.com/aadsm/jschardet and I will create bug reports in their repository once we have collected some more data. @katainaka0503 fyi
Actually the issue with #23997 (comment) is that we only use the first 512 bytes to detect the encoding of the file. Maybe we should increase this limit when…
With this fix I am increasing the number of bytes that we send to…
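The failure mode behind the 512-byte limit is easy to demonstrate (a sketch, not VS Code's actual implementation): if the first N bytes of a file are pure ASCII, a detector that only sees that prefix has no evidence against UTF-8, no matter what comes later.

```python
LIMIT = 512  # the original detection window mentioned in this thread

def prefix_is_ascii(data: bytes, limit: int = LIMIT) -> bool:
    """True if every byte in the inspected prefix is 7-bit ASCII."""
    return all(b < 0x80 for b in data[:limit])

# A cp1252 file whose only non-ASCII byte (é = 0xE9) sits past the window:
data = b"a" * 1024 + "é".encode("cp1252")

print(prefix_is_ascii(data))             # True: the scan never saw 0xE9
print(prefix_is_ascii(data, len(data)))  # False: a full scan finds it
```

Raising the limit to 4096 only moves the cliff; any fixed window leaves the same blind spot for files whose non-ASCII content starts later.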
I do not really agree with the status of this fix. There should not be any such limit (it now appears to be the first 4096 bytes). Other tools that actually support files lacking a BOM do not seem to have one. This still causes file corruption from time to time with VS Code, meaning that we should not consider VS Code a tool able to cope with files that have no BOM, and should ban editing ASCII files with VS Code.

We have experienced corruption on this file. It is just a release-notes file, but this still makes VS Code unreliable for editing ASCII files. That release-notes file is Windows-1252 encoded, with some characters specific to this encoding toward its end (starting from here).

I think the detection algorithm should not default to UTF-8 when it reaches the end of the buffer scan without determining an encoding: it should instead load a new chunk of bytes and repeat, defaulting to UTF-8 only on reaching the end of the file. Alternatively, it could flag the file as "undetermined" and, when saving it back to disk or displaying its tail, detect that characters outside the scanned range are about to be corrupted and recover by switching to the right encoding for the file. (That is more elaborate, of course: if any characters outside 7-bit ASCII have already been inserted at the start, they would need to be re-encoded...)
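The first proposal above (keep reading chunks until a non-ASCII byte is found or the end of the file is reached, and only then default to UTF-8) can be sketched as follows. This is a hedged illustration of the commenter's idea, not VS Code's code; the function name and chunk size are assumptions.

```python
import io

CHUNK = 4096  # read size per iteration; illustrative

def scan_for_non_ascii(stream):
    """Return the offset of the first byte >= 0x80, or None at EOF.

    None means the whole file is 7-bit ASCII, so defaulting to UTF-8
    is safe. A non-None offset marks where a real charset detector
    (e.g. jschardet) should start looking instead of giving up after
    a fixed-size prefix.
    """
    offset = 0
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            return None  # reached EOF without seeing non-ASCII bytes
        for i, b in enumerate(chunk):
            if b >= 0x80:
                return offset + i
        offset += len(chunk)

# Pure ASCII: safe to default to UTF-8.
print(scan_for_non_ascii(io.BytesIO(b"a" * 10000)))      # None
# Non-ASCII byte far past any fixed window is still found.
print(scan_for_non_ascii(io.BytesIO(b"a" * 5000 + b"\xe9")))  # 5000
```

The cost is bounded in practice: the loop stops at the first non-ASCII byte, so only all-ASCII files are read to the end.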
Steps to Reproduce:
set `"files.autoGuessEncoding": true` in settings
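For reference, the setting goes in the user or workspace `settings.json`:

```json
{
  "files.autoGuessEncoding": true
}
```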