Allow to configure a list of encodings to use when guessing #36951

JasonJunMa · 2017-10-26T05:31:42Z

Setting files.autoGuessEncoding=true doesn't work well in some circumstances.

It would be good to add a setting such as files.forceEncoding="encode1:encode2,encode3:encode4".

That way, a file detected as 'encode1' would be forced to open as 'encode2'. I think that would be a solution for wrong encoding detection.

fseasy · 2017-10-27T07:18:21Z

Yes, I totally agree, because auto-guessing is so weak.
Adding a candidate list may be better!
For me, and maybe for many Chinese coders, UTF-8 and GB18030 are the encodings most commonly encountered, but auto-guess gives me Windows 1532??? I think it is easier to detect among the user's candidate encodings.

phobos2077 · 2017-11-15T09:24:06Z

I agree. In my environment we have files in two encodings - UTF-8 and Windows1251 (most popular text file encoding in Russia), so I need to use encoding detection. However, it sometimes detects windows1251-encoded files as "maccyrillic" or "Windows1252" or some other encoding that I've never seen in my life :D
Definitely need a setting like

files.detectEncodings=["utf8","windows1251"]

So instead of just "true", you can specify which encodings you want it to detect from. As far as I know, encoding detection works on probabilities (you can't say with 100% certainty which file is in which encoding, so the software has to pick the most probable answer), so I think this is possible to implement: just filter the list of possible encodings down to those the user selected.
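
A minimal sketch of the filtering idea described above, assuming the detector can expose a list of scored guesses. The function name `pickAllowedEncoding` and the `{ encoding, confidence }` candidate shape are hypothetical illustrations, not an existing VS Code or jschardet API, and the confidence numbers are made up.

```javascript
// Hypothetical helper: keep only the guesses the user allowed, then take
// the most probable one. Returns null when nothing in the list matches.
function pickAllowedEncoding(candidates, allowed) {
  const allowedSet = new Set(allowed.map((name) => name.toLowerCase()));
  let best = null;
  for (const candidate of candidates) {
    if (!allowedSet.has(candidate.encoding.toLowerCase())) {
      continue; // not one of the user's encodings, ignore this guess
    }
    if (best === null || candidate.confidence > best.confidence) {
      best = candidate;
    }
  }
  return best;
}

// Illustrative (invented) scores for a windows1251 file.
const guesses = [
  { encoding: "maccyrillic", confidence: 0.61 },
  { encoding: "windows1251", confidence: 0.58 },
  { encoding: "windows1252", confidence: 0.4 },
];
console.log(pickAllowedEncoding(guesses, ["utf8", "windows1251"]));
```

With the list restricted to utf8 and windows1251, the slightly higher-scoring "maccyrillic" guess is never considered.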

bpasero · 2018-09-13T10:02:30Z

Verification: There is now a files.guessableEncodings setting where you can fill in encodings to support when guessing. From the explanation: If provided, will restrict the list of encodings that can be used when guessing. If the guessed file encoding is not in the list, the default encoding will be used.

Update: I decided to rename the setting to files.guessableEncodings
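
The rule quoted from the setting's explanation can be sketched as follows. This is an illustration only, not VS Code's actual implementation: `resolveGuessedEncoding` is a hypothetical name, and treating an absent or empty list as "no restriction" is an assumption read from the words "If provided".

```javascript
// Hypothetical helper mirroring the described rule: a guess that is not in
// the guessable list is replaced by the default encoding.
function resolveGuessedEncoding(guessed, guessableEncodings, defaultEncoding) {
  // Assumption: no list (or an empty one) means the guess is not restricted.
  if (!Array.isArray(guessableEncodings) || guessableEncodings.length === 0) {
    return guessed ?? defaultEncoding;
  }
  return guessableEncodings.includes(guessed) ? guessed : defaultEncoding;
}

console.log(resolveGuessedEncoding("windows1252", ["gbk"], "utf8"));
console.log(resolveGuessedEncoding("gbk", ["gbk"], "utf8"));
```

Note that under this rule the list only filters what the detector reports; it does not make the detector prefer the listed encodings.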

octref · 2018-09-26T18:56:36Z

@bpasero With these settings:

    "files.autoGuessEncoding": true,
    "files.guessableEncodings": [
      "gbk"
    ]

I still get this file as UTF-8. It is in gbk encoding with two Chinese characters.

foo.txt

bpasero · 2018-09-27T06:03:54Z

@octref you have to use a file that jschardet can detect properly. In your case it tells me:

So it makes sense that UTF-8 is used.

bpasero · 2018-09-27T06:04:33Z

To verify you can use src/vs/base/test/node/encoding/fixtures/some.cp1252.txt with CP1252 encoding!

octref · 2018-09-27T16:45:43Z

@bpasero I see, the logic is

Guessed encoding is not in files.guessableEncodings
Fall back to utf-8

But I would argue this doesn't solve the users' problems. Let's say the user has a bunch of files that they know are gbk-encoded, but jschardet could have guessed either of these:

If the user wants all files to be opened as gbk, this setting would not work for them.
The original request is more about being able to set fallbacks. For example,

If guessed encoding is gb2312, gb18030, fall back to gbk.
Otherwise, fall back to utf-8.

A setting like this would be more useful:

{
  "files.encodingAssociations": {
    "gbk": ["gb2312", "gb18030"],
    "cp950": ["big5hkscs"]
    // Everything else falls back to "utf-8"
  }
}
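
A rough sketch of how such an association table could be applied to a guessed encoding. Both `files.encodingAssociations` and the helper `applyAssociations` are hypothetical (the setting is only a proposal in this thread), and keeping a guess that already equals a target encoding such as "gbk" is an assumption the proposal does not spell out.

```javascript
// The proposed (hypothetical) setting, as a plain object.
const encodingAssociations = {
  gbk: ["gb2312", "gb18030"],
  cp950: ["big5hkscs"],
};

// Map a guessed encoding to its associated target, else use the fallback.
function applyAssociations(guessed, associations, fallback = "utf-8") {
  for (const [target, sources] of Object.entries(associations)) {
    // Assumption: a guess equal to the target itself is kept as-is.
    if (guessed === target || sources.includes(guessed)) {
      return target;
    }
  }
  return fallback; // everything else falls back to "utf-8"
}

console.log(applyAssociations("gb18030", encodingAssociations));
console.log(applyAssociations("windows1252", encodingAssociations));
```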

bpasero · 2018-09-27T17:54:20Z

Maybe someone from this issue could comment if that was the desired solution or not (@JasonJunMa).

irudoy · 2018-09-28T02:09:37Z

@bpasero In the implementation from the original pull request, the encoding fell back to the first one in the list instead of utf-8. It was definitely not a great solution. I think @octref's solution would resolve the issue.

formigoni · 2022-10-06T17:05:36Z

Hi,
I'm using VS Code version 1.71.2 and the issue still happens to me.
As far as I understand, @aadsm has implemented "allowed encodings" on the jschardet side.
What is missing for VS Code to make use of this functionality?

nfrance709 · 2022-10-08T16:09:39Z

> Hi, I'm using VS Code version 1.71.2 and the issue still happens to me. As far as I understood @aadsm has implemented "allowed encodings" on jschardet side. What is missing to work in vscode is to make use of this functionality?

I still have the same issue myself, and unfortunately, I still need to compile from source and add a small modification for VSCode to work correctly with the files I use. See my previous post here with instructions https://github.com/microsoft/vscode/issues/36951#issuecomment-989472107

formigoni · 2022-10-13T13:16:06Z

> > Hi, I'm using VS Code version 1.71.2 and the issue still happens to me. As far as I understood @aadsm has implemented "allowed encodings" on jschardet side. What is missing to work in vscode is to make use of this functionality?
>
> I still have the same issue myself, and unfortunately, I still need to compile from source and add a small modification for VSCode to work correctly with the files I use. See my previous post here with instructions https://github.com/microsoft/vscode/issues/36951#issuecomment-989472107

After reading your response I tried the same solution myself and managed to compile it from source with the same modification. I could successfully enable extensions, and the only thing now preventing me from using the modified/compiled version is that I can't synchronize settings. The "Settings Sync" option doesn't appear in my application. Is this option (Settings Sync) meant to work in the OSS version, or is it a Microsoft-specific customization?

nfrance709 · 2022-10-13T13:23:18Z

> > > Hi, I'm using VS Code version 1.71.2 and the issue still happens to me. As far as I understood @aadsm has implemented "allowed encodings" on jschardet side. What is missing to work in vscode is to make use of this functionality?
>
> > I still have the same issue myself, and unfortunately, I still need to compile from source and add a small modification for VSCode to work correctly with the files I use. See my previous post here with instructions https://github.com/microsoft/vscode/issues/36951#issuecomment-989472107
>
> After reading your response I've tried myself the same solution and could manage to compile it from source with the same modification. I could successfully enable extensions and the only thing that is now preventing me from using the modified/compiled version is that I can't synchronize settings. The option "Settings Sync" doesn't appear on my application. Was this option (Settings Sync) meant to work on OSS version or is it a Microsoft specific customization?

Unfortunately, it's not available in the OSS version. I use this extension https://marketplace.visualstudio.com/items?itemName=Shan.code-settings-sync

nfrance709 · 2022-10-13T13:29:54Z

By the way if you want to run GitHub Copilot in the OSS version you may need to follow the instructions in the link below. It was needed a while back but not sure if that's changed now as I haven't used GitHub Copilot for some time.

https://github.com/community/community/discussions/6629#discussioncomment-1524627

petersladek · 2023-03-30T15:46:53Z

Like many people in this thread, I have to deal with files in utf8 and some other encoding (cp-1250 or iso-8859-2 in my case) for different types of files. VS Code is basically the only editor I have installed that has trouble detecting the encodings correctly. autoGuessEncoding="true" is not working at all for me, and I don't want to change the default file encoding, as I want my new files to be saved in utf-8. Every other editor I tried opened the files correctly: notepad, notepad2, notepad++, sublime text, visual studio, etc. Please try to fix this. Thanks

zinface · 2023-08-21T08:47:16Z

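I think there should be a configuration item that lists which different encodings I believe may exist in the workspace. That would allow this feature to make a better choice.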

formigoni · 2023-10-25T15:40:45Z

Hello!
This issue has been open for 6 years and still has no solution.
I (and others in this thread) still need to compile from the source code, use the OSS version, and accept the limitations that come with it. For example, I can't use Settings Sync because it's available only in the Microsoft version, and I can't use Dev Containers for the same reason.
A text editor that is probably the most widely used in the world by developers, and one that I personally love, shouldn't have seemingly trivial issues taking so long to be resolved.
What if the author of jschardet never releases the version? What if they pass away? Win the lottery?
@bpasero , please help us!

nfrance709 · 2023-10-25T17:12:46Z

I assume that the vast majority of VSCode users fall into one of the following categories:

Use UTF-8 exclusively.
Use another encoding exclusively.
Use either UTF-8 or a single other encoding.

The basic solution provided here: #36951 (comment) addresses this, but it requires compiling from source.

What's needed is either a new configuration option or a way to leverage an existing option. This would specify that if a file is not detected as UTF-8, it should fall back to the specified encoding. This is essentially what the fix linked above does. While UTF-8 detection appears to be reliable, determining other encodings will always involve some guesswork and will never be 100% accurate.
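
One way to picture the "UTF-8 first, otherwise a configured fallback" rule is the sketch below. It is not the code of the fix linked above: `chooseEncoding` is a hypothetical helper, and the UTF-8 check simply relies on the standard `TextDecoder` with `fatal: true` (available as a global in current Node.js builds with ICU), which throws on byte sequences that are not valid UTF-8.

```javascript
// Hypothetical helper: report "utf-8" only if the bytes decode strictly as
// UTF-8; otherwise return the user's configured fallback encoding.
function chooseEncoding(bytes, fallbackEncoding) {
  try {
    // fatal: true makes decode() throw instead of inserting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return "utf-8";
  } catch {
    return fallbackEncoding;
  }
}

// "café!" encoded as UTF-8 and as windows-1252 (é is the single byte 0xE9).
const utf8Bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xc3, 0xa9, 0x21]);
const cp1252Bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xe9, 0x21]);

console.log(chooseEncoding(utf8Bytes, "windows1252"));
console.log(chooseEncoding(cp1252Bytes, "windows1252"));
```

Pure ASCII content is valid UTF-8, so such files are reported as "utf-8" under this sketch, which decodes them identically in most single-byte fallback encodings anyway.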

I'd prefer to use the release version of VSCode, but due to this issue, I'm forced to compile and use the OSS version. It shouldn't be that difficult to fix this via a configuration option.

bigdogs · 2023-11-03T06:59:17Z

Any updates? Still hoping for a solution.

peminator · 2023-11-03T08:08:43Z

I'm waiting too. I need to be able to provide a "priority guess" list: say I often use utf8 + Win1250, so these are tested first, and only if both fail is the classic guess used... Can someone smart enough find a way and propose a commit to them? Pretty pretty please, it would make a proper xmas present for me

Genius-er · 2024-02-23T09:54:23Z

When will the problem of guessing the wrong encoding be solved? I only need a configuration guess list to solve most of my problems. The main encodings I use are utf8 and gb18030.

aadsm · 2024-02-23T16:08:28Z

I have an idea on this that I'm experimenting with right now. Should have some updates soon. To be clear: I'm the jschardet author, I have no affiliation with vscode or MS.

peminator · 2024-03-18T07:00:23Z

@aadsm Hi, any news on jschardet in VS Code?

Whatever I do, if guessing is on, my Windows1250 files open as various ISO-... encodings, seemingly at random depending on the characters they contain.
VS Code should be updated to first check whether the default encoding setting may be the right one, and only if its score is very low go on to detection.

Most people need one encoding + UTF8 afaik, because the one non-UTF encoding is from the pre-UTF era. So if someone knows where the detection is used, just make it check against the default encoding setting first. In my case, as an example, I have problems with accented characters like "ť": in windows1250 it is a small "t" plus a small accent mark at the top right, but with the ISO8859-2 that gets detected it is displayed as some weird character, a square with a question mark inside.

aadsm · 2024-03-18T22:30:51Z

Hey! I've been working on setting up CI for the past 2 weeks first. This has been my biggest issue when it comes to shipping new versions.
I added support for preferred encodings 2 years ago (aadsm/jschardet@9b49243), but I haven't published a new version with it yet.
Once I have the CI up and running (should be pretty soon now), a new version will come out with this new API change.
I'll then look into whether I can support a config file: either a jschardet-specific one (which might be the better way, since many other projects use this library), or a jschardet.* entry in the vscode config file, which could also be useful when working on projects with different encoding requirements.

yutotnh · 2024-03-23T14:05:53Z

@aadsm
Thank you for releasing 3.1.0!

I am trying to add functionality for this issue to VS Code using the detectEncodings option that was added.
However, I found one bug in jschardet.
Specifically, jschardet throws an error if any of the following 6 encodings is specified in detectEncodings:

Shift-JIS
EUC-JP
GB2312
EUC-KR
Big5
EUC-TW

I just submitted a pull request to fix this bug.
Is it possible to import this fix into jschardet and release it as 3.1.1?
aadsm/jschardet#91

I use Shift_JIS and EUC-JP on a daily basis, so fingers crossed that this fix will be incorporated.

aadsm · 2024-03-23T18:20:06Z

I've just published a new patch version to npm. Those hours learning github actions and workflows were really worth it in the end ahah.

nfrance709 · 2024-03-23T23:08:49Z

When using version 3.1.1 of jschardet and testing with my attached charset_test_file.php file, it still does not return the correct encoding, which is windows-1252.

charset_test_file.php.txt

It returns { encoding: 'ISO-8859-2', confidence: 0.8496565744888162 } by default and it returns { encoding: null, confidence: 0 } when using { detectEncodings: ["UTF-8", "windows-1252"] }

const fs = require('fs');
const jschardet = require('jschardet');

jschardet.enableDebug();

const content = fs.readFileSync('charset_test_file.php');
// const result = jschardet.detect(content);
const result = jschardet.detect(content, { detectEncodings: ["UTF-8", "windows-1252"] });
console.log(result);

In my project the files are either UTF-8 or windows-1252. I suspect most projects are either exclusively UTF-8 or UTF-8 plus one other local encoding, so ideally we need an option so that, if UTF-8 is not detected, detection falls back to the local encoding provided in the array.

For now I'll have to continue modifying \src\vs\workbench\services\textfile\common\encoding.ts as described here #36951 (comment)

yutotnh · 2024-03-25T03:48:21Z

@aadsm
Thanks for releasing 3.1.1.
Thanks to you, I was able to create a pull request (#208550).

@nfrance709
This pull request will open with files.encoding if the encoding cannot be guessed, as in charset_test_file.php.txt.
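
The fallback described here can be pictured with a small sketch. It is not the pull request's actual code: `encodingToOpenWith` is a hypothetical helper, the `{ encoding, confidence }` shape is taken from the jschardet results quoted earlier in the thread, and `filesEncoding` stands for the value of the files.encoding setting.

```javascript
// Hypothetical helper: use the detected encoding when there is one,
// otherwise fall back to the configured files.encoding value.
function encodingToOpenWith(detection, filesEncoding) {
  if (detection && detection.encoding) {
    return detection.encoding;
  }
  return filesEncoding; // detection failed, e.g. { encoding: null, confidence: 0 }
}

console.log(encodingToOpenWith({ encoding: null, confidence: 0 }, "windows1252"));
console.log(encodingToOpenWith({ encoding: "UTF-8", confidence: 1 }, "windows1252"));
```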

aadsm · 2024-03-25T04:05:01Z

@nfrance709 thanks for your detailed info. I was able to find and fix a couple of bugs related to detecting windows-1252. I also added a new test case with the exact code that was failing for you. I'm using the file you provided in my test. I hope that's ok with you!
I've released these fixes under version 3.1.2.

aadsm · 2024-03-25T04:10:24Z

@yutotnh sorry about this, but could you update your pr with 3.1.2? 😅

yutotnh · 2024-03-25T04:31:05Z

@aadsm Updated jschardet to 3.1.2 at ff546d5.

nfrance709 · 2024-03-25T04:32:28Z

> @nfrance709 thanks for your detailed info. I was able to find and fix a couple of bugs related to detecting windows-1252. I also added a new test case with the exact code that was failing for you. I'm using the file you provided in my test. I hope that's ok with you! I've released these fixes under version 3.1.2.

@yutotnh thank you for the fixes in 3.1.2 and yes use the file as a test case.

Once @yutotnh's pull request is accepted and a new version of VSCode is released, I can go back to using the release version, as there are a number of features missing from the OSS version that I would like to use.

nfrance709 · 2024-03-25T05:45:44Z

> @aadsm Thanks for releasing 3.1.1. Thanks to you, I was able to create a pull request (#208550).
>
> @nfrance709 This pull request will open with files.encoding if the encoding cannot be guessed, as in charset_test_file.php.txt.

Thank you, I just compiled and tested your latest version using 3.1.2 and it works as expected. I hope your pull request is accepted soon.

vscodebot bot assigned isidorn Oct 26, 2017

vscodebot bot added the workbench label Oct 26, 2017

isidorn assigned bpasero and unassigned isidorn Oct 26, 2017

isidorn added the feature-request Request for new features or functionality label Oct 26, 2017

bpasero removed their assignment Oct 26, 2017

bpasero added the file-explorer Explorer widget issues label Oct 26, 2017

bpasero removed the workbench label Nov 16, 2017

isidorn added file-encoding File encoding type issues and removed file-explorer Explorer widget issues labels Nov 17, 2017

isidorn assigned bpasero Nov 17, 2017

bpasero removed their assignment Nov 18, 2017

bpasero changed the title ~~Request feature in terms of encoding detection~~ Allow to configure a list of encodings to force Sep 11, 2018

bpasero changed the title ~~Allow to configure a list of encodings to force~~ Allow to configure a list of encodings to use when guessing Sep 11, 2018

bpasero mentioned this issue Sep 11, 2018

autoGuessEncoding improvements #30857

Closed

bpasero self-assigned this Sep 13, 2018

bpasero closed this as completed in 9c882c7 Sep 13, 2018

bpasero added this to the September 2018 milestone Sep 13, 2018

bpasero added the verification-needed Verification of issue is requested label Sep 13, 2018

octref added the verification-found Issue verification failed label Sep 26, 2018

bpasero removed the verification-found Issue verification failed label Sep 27, 2018

metablaster mentioned this issue Dec 12, 2022

files.autoGuessEncoding not working for ANSI encoded file #168803

Closed

yutotnh added a commit to yutotnh/vscode that referenced this issue Mar 25, 2024

Allow to configure a list of encodings to use when guessing microsoft…

3d97dde

…#36951

yutotnh added a commit to yutotnh/vscode that referenced this issue Mar 25, 2024

Allow to configure a list of encodings to use when guessing microsoft…

2eeb2ad

…#36951

yutotnh mentioned this issue Mar 25, 2024

Add the ability to specify a list of candidate encodings when guessing encoding (#36951) #208550

Open

yutotnh added a commit to yutotnh/vscode that referenced this issue Mar 25, 2024

Bump up the jschardet version into 3.1.2 microsoft#36951

f3b6ec4

yutotnh added a commit to yutotnh/vscode that referenced this issue Mar 25, 2024

Bump up the jschardet version into 3.1.2 microsoft#36951

ff546d5
