cp1252 encoding not detected in this case #23997

Closed
skest3qc opened this issue Apr 6, 2017 · 7 comments
Labels: bug (Issue identified by VS Code Team member as probable bug), verified (Verification succeeded)

Comments

@skest3qc

skest3qc commented Apr 6, 2017

  • VSCode Version: 1.11.0
  • OS Version: Windows 10 1607

Steps to Reproduce:

Set "files.autoGuessEncoding": true in settings, then:

  1. Open a file saved with ANSI (1252) encoding.
  2. VS Code shows the encoding as UTF-8.
  3. Reopen with Encoding -> there is no hint as to which encoding was guessed, or whether a guess was attempted at all.
@bpasero
Member

bpasero commented Apr 6, 2017

@skest3qc can you attach the file here? If the file does not contain enough characters that point to cp1252, the detection will not report any encoding and we fall back to UTF-8.

@bpasero bpasero self-assigned this Apr 6, 2017
@bpasero bpasero added the info-needed Issue requires more information from poster label Apr 6, 2017
@bpasero bpasero added this to the Backlog milestone Apr 6, 2017
@skest3qc
Author

skest3qc commented Apr 6, 2017

test.txt

@kzhui125

kzhui125 commented Apr 6, 2017

Create a utf8.txt with content "謋 鰊", save it with encoding GB18030, then reopen it with encoding Windows 1252...
The same hex codes can be interpreted by more than one encoding...

Maybe on Windows we could use the ANSI encoding when the file is neither UTF-8 without BOM (decoding produces �) nor another Unicode encoding, and the encoding can't be guessed.
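
The fallback suggested in this comment can be sketched roughly as follows. This is only an illustration, not VS Code's or jschardet's actual logic: the function names and the encoding identifiers "utf8" and "windows1252" are assumptions made for the example, and the UTF-8 check is simplified (it validates lead and continuation bytes but does not reject overlong forms or surrogates).

```typescript
// Hypothetical encoding identifiers, used only for this sketch.
type FallbackEncoding = "utf8" | "windows1252";

// Returns true when `bytes` has the structure of well-formed UTF-8.
// Simplified: overlong forms and surrogate code points are not rejected.
function isValidUtf8(bytes: Uint8Array): boolean {
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let extra: number; // number of continuation bytes expected after `b`
    if (b < 0x80) extra = 0; // 7-bit ASCII
    else if (b >= 0xc2 && b <= 0xdf) extra = 1;
    else if (b >= 0xe0 && b <= 0xef) extra = 2;
    else if (b >= 0xf0 && b <= 0xf4) extra = 3;
    else return false; // stray continuation byte or invalid lead byte

    // The sequence must not run past the end of the data.
    if (i + extra >= bytes.length) return false;
    for (let k = 1; k <= extra; k++) {
      if ((bytes[i + k] & 0xc0) !== 0x80) return false;
    }
    i += extra + 1;
  }
  return true;
}

// When no encoding could be guessed: keep UTF-8 if the bytes decode cleanly,
// otherwise assume the Windows ANSI code page (cp1252 in this sketch).
function fallbackEncoding(bytes: Uint8Array): FallbackEncoding {
  return isValidUtf8(bytes) ? "utf8" : "windows1252";
}
```

For example, the bytes 0x63 0x61 0x66 0xE9 ("café" in cp1252) are not valid UTF-8, so the sketch would pick the ANSI fallback, while 0x63 0x61 0x66 0xC3 0xA9 (the same word in UTF-8) would stay UTF-8. As the comment notes, this cannot distinguish between different legacy encodings that share the same bytes.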

@bpasero bpasero modified the milestones: April 2017, Backlog Apr 6, 2017
@bpasero bpasero added bug Issue identified by VS Code Team member as probable bug upstream Issue identified as 'upstream' component related (exists outside of VS Code) and removed info-needed Issue requires more information from poster labels Apr 6, 2017
@bpasero
Member

bpasero commented Apr 6, 2017

I can reproduce. We are using https://github.com/aadsm/jschardet and I will create bug reports in their repository once we have collected some more data.

@katainaka0503 fyi

@bpasero bpasero changed the title autoGuessEncoding does not detect file as ANSI Issues with encoding detection (jschardet) Apr 6, 2017
@bpasero
Member

bpasero commented Apr 6, 2017

Actually the issue with #23997 (comment) is that we only use the first 512 bytes to detect the encoding from the file. Maybe we should increase this limit when files.autoGuessEncoding is enabled and specifically when using the encoding selector.
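
To illustrate why a fixed sample size matters, here is a minimal sketch. It assumes that only bytes >= 0x80 carry evidence for a non-UTF-8 encoding; both function names are hypothetical and are not part of VS Code or jschardet.

```typescript
// Offset of the first byte >= 0x80, or -1 if the data is pure 7-bit ASCII.
function firstNonAsciiOffset(bytes: Uint8Array): number {
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] >= 0x80) return i;
  }
  return -1;
}

// True when a detector that only sees the first `sampleSize` bytes gets
// nothing but ASCII, although the file contains non-ASCII bytes later on.
function sampleMissesEvidence(bytes: Uint8Array, sampleSize: number): boolean {
  const offset = firstNonAsciiOffset(bytes);
  return offset >= sampleSize;
}
```

With a file whose first non-ASCII byte sits at offset 600, `sampleMissesEvidence(file, 512)` is true, so the detector has nothing to work with, whereas a larger sample that reaches that offset would include it.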

@bpasero bpasero changed the title Issues with encoding detection (jschardet) cp1252 encoding not detected in this case Apr 7, 2017
@bpasero bpasero removed the upstream Issue identified as 'upstream' component related (exists outside of VS Code) label Apr 7, 2017
@bpasero bpasero modified the milestones: April 2017, Backlog Apr 7, 2017
@bpasero
Member

bpasero commented Apr 7, 2017

With this fix I am increasing the number of bytes that we send to jschardet for detection and it seems to fix the issue reported.

@fredericDelaporte

fredericDelaporte commented Sep 21, 2017

I don't really agree with the fixed status here. There should not be any such limit (it now seems to be the first 4096 bytes). Other tools that actually support files without a BOM do not seem to have one. The limit still causes file corruption from time to time with VS Code, which means we cannot consider VS Code a tool able to cope with files that lack a BOM, and that rules out editing ASCII files with it.

We have experienced corruption on this file. It is just a release notes file, but still, this makes VS Code unreliable for editing ASCII files.

This release notes file is windows-1252 encoded, with some characters specific to this encoding toward its end (starting from here).

I think the detection algorithm should not simply default to UTF-8 when it reaches the end of the scanned buffer: it should probably load a new chunk of bytes and repeat, defaulting to UTF-8 only when it reaches the end of the file. Or it could flag the file as "undetermined" and, when saving it back to disk or displaying its tail, detect that it is going to corrupt some characters that were not scanned for encoding detection, and then recover by switching to the right encoding for the file. (That is more elaborate, of course: if some non-7-bit-ASCII characters have already been inserted at the start, they will need to be re-encoded...)
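
The first proposal above (keep loading chunks, default to UTF-8 only at end of file) could look roughly like this. It is a sketch under stated assumptions, not how VS Code works: `detectByChunks`, the `Guess` type and the `guess` callback are hypothetical, and the callback stands in for whatever detector is used and is assumed to return `null` when a chunk gives it no basis for a decision.

```typescript
// Hypothetical result of a per-chunk guess: an encoding id, or null if undecided.
type Guess = string | null;

// Feeds the data to `guess` one chunk at a time and stops at the first chunk
// that yields a decision. UTF-8 is used as the default only after the whole
// input has been scanned without any decision.
function detectByChunks(
  bytes: Uint8Array,
  chunkSize: number,
  guess: (chunk: Uint8Array) => Guess
): string {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  for (let start = 0; start < bytes.length; start += chunkSize) {
    // subarray clamps the end index, so the last chunk may be shorter.
    const chunk = bytes.subarray(start, start + chunkSize);
    const result = guess(chunk);
    if (result !== null) return result;
  }
  return "utf8";
}
```

A real implementation would also have to deal with multi-byte sequences split across chunk boundaries, and a statistical detector would likely want to accumulate evidence across chunks rather than judge each chunk in isolation.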

@vscodebot vscodebot bot locked and limited conversation to collaborators Nov 17, 2017