cp1252 encoding not detected in this case #23997

skest3qc · 2017-04-06T06:43:53Z

VSCode Version: 1.11.0
OS Version: Windows 10 1607

Steps to Reproduce:

Set "files.autoGuessEncoding": true in settings.

Open a file saved with ANSI (Windows-1252) encoding.
VS Code shows the encoding as UTF-8.
Reopen with Encoding -> there is no hint about which encoding was guessed, or whether a guess was attempted at all.

bpasero · 2017-04-06T07:23:30Z

@skest3qc can you attach the file here? If the file does not contain enough characters that point to cp1252, the detector will not report any encoding and we fall back to UTF-8.
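
To illustrate why a file may not contain "sufficient characters" for detection, here is a minimal Python sketch (not VS Code or jschardet code). The helper name `encodings_agree` and the sample strings are made up for this illustration. It shows that pure-ASCII text produces identical bytes in cp1252 and UTF-8, so only non-ASCII characters give a detector anything to work with.

```python
def encodings_agree(text: str) -> bool:
    """Return True when cp1252 and UTF-8 produce identical bytes for text.

    Hypothetical helper for illustration only.
    """
    return text.encode("cp1252") == text.encode("utf-8")


ascii_only = "Release notes for version 5.0"
with_accent = "Résumé of changes"

# ASCII-only text is byte-for-byte identical in both encodings.
print(encodings_agree(ascii_only))   # True
# "é" is 0xE9 in cp1252 but 0xC3 0xA9 in UTF-8, so the bytes differ.
print(encodings_agree(with_accent))  # False
```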

skest3qc · 2017-04-06T07:33:44Z

test.txt

kzhui125 · 2017-04-06T09:18:09Z

Create a utf8.txt with the content "謋鰊", save it with encoding GB18030, then reopen it with encoding Windows 1252...
The same bytes can be interpreted by more than one encoding...

Maybe we can use the ANSI encoding on Windows when the file is not UTF-8 without BOM (it contains �), is not another Unicode encoding, and the encoding can't be guessed.
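
A rough Python sketch of the fallback idea above, under stated assumptions: the function name `decode_with_fallback` is hypothetical, cp1252 stands in for "the ANSI encoding on Windows" (the actual ANSI code page depends on the system locale), and the sample character "中" is chosen only because its GB18030 bytes are easy to check. It also shows the ambiguity: the same two bytes are valid in both GB18030 and cp1252, so the fallback is only a guess.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode as strict UTF-8; if that fails, use a single-byte code page.

    Returns the decoded text and the name of the encoding that was used.
    Hypothetical helper: 'fallback' is an assumed stand-in for the ANSI code page.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Bytes undefined in the code page become U+FFFD instead of raising.
        return data.decode(fallback, errors="replace"), fallback


gb_bytes = "中".encode("gb18030")   # b'\xd6\xd0'
text, used = decode_with_fallback(gb_bytes)
print(used)   # cp1252
print(text)   # ÖÐ  (the GB18030 bytes read as cp1252)
```

The bytes are not valid UTF-8, so the fallback applies, but nothing in the bytes themselves says whether GB18030 or cp1252 was intended.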

bpasero · 2017-04-06T14:05:11Z

I can reproduce. We are using https://github.com/aadsm/jschardet and I will create bug reports in their repository once we have collected some more data.

@katainaka0503 fyi

bpasero · 2017-04-06T16:52:30Z

Actually the issue with #23997 (comment) is that we only use the first 512 bytes to detect the encoding from the file. Maybe we should increase this limit when files.autoGuessEncoding is enabled and specifically when using the encoding selector.

bpasero · 2017-04-07T05:21:29Z

With this fix I am increasing the number of bytes that we send to jschardet for detection and it seems to fix the issue reported.
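
To make the prefix limitation concrete, here is a small Python sketch (not the VS Code implementation). The helper `has_non_ascii` and the synthetic file content are assumptions for illustration; 512 is the limit named above and 4096 is the figure mentioned later in this thread. When the only cp1252-specific byte sits past the sampled prefix, the sample is pure ASCII and carries no signal.

```python
def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte is outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


# Synthetic file: 600 ASCII bytes followed by a word with one cp1252 byte (0xE9).
content = ("a" * 600 + "café").encode("cp1252")

prefix_512 = content[:512]
prefix_4096 = content[:4096]

# The first 512 bytes are indistinguishable from ASCII/UTF-8.
print(has_non_ascii(prefix_512))    # False
# A larger window reaches the 0xE9 byte.
print(has_non_ascii(prefix_4096))   # True
```

A larger window only moves the boundary: a file whose first non-ASCII byte comes after the window still looks like plain ASCII to the detector.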

fredericDelaporte · 2017-09-21T10:03:34Z

I do not really agree with the fixed status here. There should not be any such limit (it now seems to be the first 4096 bytes). Other tools that actually support files lacking a BOM do not seem to have one. This still causes file corruption from time to time with VS Code, meaning that we should not consider VS Code a tool able to cope with files that have no BOM, which rules out editing ASCII files with VS Code.

We have experienced corruption on this file. It is just a release notes file, but still, this makes VS Code unreliable for editing ASCII files.

This release notes file is windows-1252 encoded with some characters specific to this encoding toward its end (starting from here).

I think that the detection algorithm should not default to UTF-8 without encoding when it reaches the end of the scanned buffer: it should probably load a new chunk of bytes and repeat, defaulting to UTF-8 only when it reaches the end of the file. Or maybe flag the file as "undetermined" and, when saving it back to disk or displaying its tail, detect that this is going to corrupt some characters that were not scanned for encoding detection, and then recover by switching to the right encoding for the file. (That is considerably more elaborate, of course: if some characters outside 7-bit ASCII have already been inserted at the start, they will need to be re-encoded...)
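
The "load a new chunk and repeat" idea could look roughly like the following Python sketch. This is an assumption-laden illustration, not a proposal for the actual VS Code code: the function name `is_valid_utf8_stream` and the chunk size are hypothetical, and it only answers "is the whole file valid UTF-8?" rather than guessing which other encoding applies.

```python
import codecs
import io
from typing import BinaryIO


def is_valid_utf8_stream(stream: BinaryIO, chunk_size: int = 4096) -> bool:
    """Scan the stream chunk by chunk and return True only if all of it is UTF-8.

    The incremental decoder keeps partial multi-byte sequences between chunks,
    so a character split across a chunk boundary is handled correctly.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                # End of file: flush, which fails on a truncated sequence.
                decoder.decode(b"", final=True)
                return True
            decoder.decode(chunk)
    except UnicodeDecodeError:
        return False


# A cp1252 file whose only non-ASCII byte is far beyond the first chunk.
cp1252_tail = io.BytesIO(("a" * 10000 + "é").encode("cp1252"))
utf8_tail = io.BytesIO(("a" * 10000 + "é").encode("utf-8"))

print(is_valid_utf8_stream(cp1252_tail))  # False
print(is_valid_utf8_stream(utf8_tail))    # True
```

The scan stops at the first invalid sequence, so it reads the whole file only when the content really is valid UTF-8 (or plain ASCII).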

bpasero self-assigned this Apr 6, 2017

bpasero added the info-needed Issue requires more information from poster label Apr 6, 2017

bpasero added this to the Backlog milestone Apr 6, 2017

bpasero modified the milestones: April 2017, Backlog Apr 6, 2017

bpasero added bug Issue identified by VS Code Team member as probable bug upstream Issue identified as 'upstream' component related (exists outside of VS Code) and removed info-needed Issue requires more information from poster labels Apr 6, 2017

bpasero changed the title ~~autoGuessEncoding does not detect file as ANSI~~ Issues with encoding detection (jschardet) Apr 6, 2017

bpasero changed the title ~~Issues with encoding detection (jschardet)~~ cp1252 encoding not detected in this case Apr 7, 2017

bpasero removed the upstream Issue identified as 'upstream' component related (exists outside of VS Code) label Apr 7, 2017

bpasero modified the milestones: April 2017, Backlog Apr 7, 2017

bpasero closed this as completed in bfe1d2b Apr 7, 2017

sandy081 added the verified Verification succeeded label Apr 28, 2017

fredericDelaporte mentioned this issue Sep 21, 2017

NH-4000 - Prepare release of v5.0 nhibernate/nhibernate-core#693

Merged

vscodebot bot locked and limited conversation to collaborators Nov 17, 2017
