Allow to configure a list of encodings to use when guessing #36951

Open
JasonJunMa opened this issue Oct 26, 2017 · 73 comments
Labels
feature-request Request for new features or functionality file-encoding File encoding type issues
Milestone

Comments

@JasonJunMa

The files.autoGuessEncoding=true doesn't work well in some circumstances.

I think it would be good to add a setting like files.forceEncoding="encode1:encode2,encode3:encode4".

That way it can force 'encode1' to 'encode2'. I think that would be a solution for wrong encoding detection.

@vscodebot vscodebot bot added the workbench label Oct 26, 2017
@isidorn isidorn assigned bpasero and unassigned isidorn Oct 26, 2017
@isidorn isidorn added the feature-request Request for new features or functionality label Oct 26, 2017
@bpasero bpasero removed their assignment Oct 26, 2017
@bpasero bpasero added the file-explorer Explorer widget issues label Oct 26, 2017
@fseasy

fseasy commented Oct 27, 2017

Yes, I totally agree, because auto-guessing is so weak.
Adding a candidate list may be better!
For me, and maybe for many Chinese coders, UTF-8 and GB18030 are the encodings most commonly encountered, but auto-guess gives me Windows 1532??? I think it is easier to detect within the user's encoding candidates.

@phobos2077

I agree. In my environment we have files in two encodings - UTF-8 and Windows1251 (most popular text file encoding in Russia), so I need to use encoding detection. However, it sometimes detects windows1251-encoded files as "maccyrillic" or "Windows1252" or some other encoding that I've never seen in my life :D
Definitely need a setting like

files.detectEncodings=["utf8","windows1251"]

So instead of just "true", you can specify which encodings you want it to detect from. As far as I know, encoding detection works on probabilities (you can't say with 100% certainty which file is in which encoding, so the software has to pick the most probable answer), so I think this is possible to implement: just filter the list of possible encodings down to those the user selected.
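To illustrate the filtering idea in the comment above, here is a minimal sketch. It assumes the detector can hand back a list of scored candidates; the `candidates` shape, the `pickAllowedEncoding` helper, and the confidence numbers are all hypothetical and are not the output format of any particular library.

```javascript
// Hypothetical helper: keep only the guesses the user allows, then pick
// the most probable one. Returns null when no allowed guess is present.
function pickAllowedEncoding(candidates, allowed) {
  // Compare names loosely so "windows-1251" matches "windows1251".
  const normalize = (name) => name.toLowerCase().replace(/[^a-z0-9]/g, '');
  const allowedSet = new Set(allowed.map(normalize));

  let best = null;
  for (const candidate of candidates) {
    if (!allowedSet.has(normalize(candidate.encoding))) continue;
    if (best === null || candidate.confidence > best.confidence) {
      best = candidate;
    }
  }
  return best === null ? null : best.encoding;
}

// Made-up scores for a windows-1251 file that a detector ranks as maccyrillic.
const guesses = [
  { encoding: 'maccyrillic', confidence: 0.62 },
  { encoding: 'windows-1251', confidence: 0.58 },
  { encoding: 'windows-1252', confidence: 0.31 },
];

console.log(pickAllowedEncoding(guesses, ['utf8', 'windows1251'])); // windows-1251
```

With the candidate list restricted to the two encodings actually in use, the lower-ranked but correct guess wins over the higher-ranked one that was filtered out.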

@bpasero bpasero removed the workbench label Nov 16, 2017
@isidorn isidorn added file-encoding File encoding type issues and removed file-explorer Explorer widget issues labels Nov 17, 2017
@bpasero bpasero removed their assignment Nov 18, 2017
@bpasero bpasero changed the title Request feature in terms of encoding detection Allow to configure a list of encodings to force Sep 11, 2018
@bpasero bpasero changed the title Allow to configure a list of encodings to force Allow to configure a list of encodings to use when guessing Sep 11, 2018
@bpasero bpasero self-assigned this Sep 13, 2018
@bpasero bpasero added this to the September 2018 milestone Sep 13, 2018
@bpasero bpasero added the verification-needed Verification of issue is requested label Sep 13, 2018
@bpasero
Member

bpasero commented Sep 13, 2018

Verification: There is now a files.guessableEncodings setting where you can fill in encodings to support when guessing. From the explanation: If provided, will restrict the list of encodings that can be used when guessing. If the guessed file encoding is not in the list, the default encoding will be used.

Update: I decided to rename the setting to files.guessableEncodings
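The rule quoted from the setting's description can be sketched as follows. This is only an illustration of the described behavior, not VS Code's implementation; the function name `resolveGuessedEncoding` is hypothetical, and treating an empty list as "not provided" and a `null` guess as "use the default" are assumptions.

```javascript
// Illustrative only: apply a "guessable encodings" restriction to one guess.
function resolveGuessedEncoding(guessed, guessableEncodings, defaultEncoding) {
  // Assumption: a detector that could not guess at all yields null.
  if (guessed === null) return defaultEncoding;
  // Assumption: an empty list means the setting was not provided.
  if (guessableEncodings.length === 0) return guessed;
  // Guess is accepted only if the user listed it; otherwise use the default.
  return guessableEncodings.includes(guessed) ? guessed : defaultEncoding;
}

console.log(resolveGuessedEncoding('cp1252', ['cp1252'], 'utf8')); // cp1252
console.log(resolveGuessedEncoding('gb2312', ['gbk'], 'utf8'));    // utf8
```

Note that under this rule a guess outside the list never maps to an entry in the list; it always lands on the default encoding.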

@octref
Contributor

octref commented Sep 26, 2018

@bpasero With these settings:

    "files.autoGuessEncoding": true,
    "files.guessableEncodings": [
      "gbk"
    ]

I still get this file as UTF-8. It is in gbk encoding with two Chinese characters.

foo.txt

@octref octref added the verification-found Issue verification failed label Sep 26, 2018
@bpasero
Member

bpasero commented Sep 27, 2018

@octref you have to use a file that jschardet can detect properly. In your case it tells me:

image

So it makes sense that UTF-8 is used.

@bpasero bpasero removed the verification-found Issue verification failed label Sep 27, 2018
@bpasero
Member

bpasero commented Sep 27, 2018

To verify you can use src/vs/base/test/node/encoding/fixtures/some.cp1252.txt with CP1252 encoding!

@octref
Contributor

octref commented Sep 27, 2018

@bpasero I see, the logic is

  • Guessed encoding is not in files.guessableEncodings
  • Fall back to utf-8

But I would argue this doesn't solve the users' problems. Let's say the user has a bunch of files that they know are gbk-encoded, but jschardet could have guessed any of these:

image

If the user wants all of those files to be opened as gbk, this setting would not work for them.
The original request is more about being able to set fallbacks. For example,

  • If guessed encoding is gb2312, gb18030, fall back to gbk.
  • Otherwise, fall back to utf-8.

A setting like this would be more useful:

{
  "files.encodingAssociations": {
    "gbk": ["gb2312", "gb18030"],
    "cp950": ["big5hkscs"]
    // Everything else falls back to "utf-8"
  }
}
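A minimal sketch of how such an association map could be applied to a guessed encoding. `files.encodingAssociations` is only the setting proposed in this comment, not an existing one, and the `applyAssociations` helper is hypothetical. Mapping a target name to itself (a guess of "gbk" stays "gbk") is an assumption the proposal does not spell out.

```javascript
// The proposed map: target encoding -> guesses that should resolve to it.
const encodingAssociations = {
  gbk: ['gb2312', 'gb18030'],
  cp950: ['big5hkscs'],
};

// Hypothetical helper: resolve a guess through the map, else fall back.
function applyAssociations(guessed, associations, fallback = 'utf-8') {
  for (const [target, guesses] of Object.entries(associations)) {
    // Assumption: a guess equal to the target itself is also accepted.
    if (guessed === target || guesses.includes(guessed)) return target;
  }
  return fallback;
}

console.log(applyAssociations('gb2312', encodingAssociations));       // gbk
console.log(applyAssociations('big5hkscs', encodingAssociations));    // cp950
console.log(applyAssociations('windows-1252', encodingAssociations)); // utf-8
```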

@bpasero
Member

bpasero commented Sep 27, 2018

Maybe someone from this issue could comment if that was the desired solution or not (@JasonJunMa).

@irudoy

irudoy commented Sep 28, 2018

@bpasero in the implementation from the original pull request, the encoding falls back to the first one in the list instead of utf-8. It was definitely not a great solution. I think @octref's solution would resolve the issue.

@formigoni

Hi,
I'm using VS Code version 1.71.2 and the issue still happens to me.
As far as I understand, @aadsm has implemented "allowed encodings" on the jschardet side.
What is missing for VS Code to make use of this functionality?

@nfrance709

> Hi, I'm using VS Code version 1.71.2 and the issue still happens to me. As far as I understand, @aadsm has implemented "allowed encodings" on the jschardet side. What is missing for VS Code to make use of this functionality?

I still have the same issue myself, and unfortunately, I still need to compile from source and add a small modification for VSCode to work correctly with the files I use. See my previous post here with instructions https://github.com/microsoft/vscode/issues/36951#issuecomment-989472107

@formigoni

>> Hi, I'm using VS Code version 1.71.2 and the issue still happens to me. As far as I understand, @aadsm has implemented "allowed encodings" on the jschardet side. What is missing for VS Code to make use of this functionality?

> I still have the same issue myself, and unfortunately, I still need to compile from source and add a small modification for VSCode to work correctly with the files I use. See my previous post here with instructions https://github.com/microsoft/vscode/issues/36951#issuecomment-989472107

After reading your response I tried the same solution myself and managed to compile it from source with the same modification. I could successfully enable extensions, and the only thing now preventing me from using the modified/compiled version is that I can't synchronize settings. The "Settings Sync" option doesn't appear in my application. Was this option (Settings Sync) meant to work in the OSS version, or is it a Microsoft-specific customization?

@nfrance709

>>> Hi, I'm using VS Code version 1.71.2 and the issue still happens to me. As far as I understand, @aadsm has implemented "allowed encodings" on the jschardet side. What is missing for VS Code to make use of this functionality?

>> I still have the same issue myself, and unfortunately, I still need to compile from source and add a small modification for VSCode to work correctly with the files I use. See my previous post here with instructions https://github.com/microsoft/vscode/issues/36951#issuecomment-989472107

> After reading your response I tried the same solution myself and managed to compile it from source with the same modification. I could successfully enable extensions, and the only thing now preventing me from using the modified/compiled version is that I can't synchronize settings. The "Settings Sync" option doesn't appear in my application. Was this option (Settings Sync) meant to work in the OSS version, or is it a Microsoft-specific customization?

Unfortunately, it's not available in the OSS version. I use this extension https://marketplace.visualstudio.com/items?itemName=Shan.code-settings-sync

@nfrance709

By the way if you want to run GitHub Copilot in the OSS version you may need to follow the instructions in the link below. It was needed a while back but not sure if that's changed now as I haven't used GitHub Copilot for some time.

https://github.com/community/community/discussions/6629#discussioncomment-1524627

@petersladek

Like many people in this thread, I have to deal with files in utf8 and some other encoding (cp-1250 or iso-8859-2 in my case) for different types of files. VS Code is basically the only editor I have installed that has trouble detecting the encodings correctly. autoGuessEncoding="true" is not working at all for me, and I don't want to change the default file encoding, as I want my new files to be saved in utf-8. Every other editor I tried opened the files correctly: notepad, notepad2, notepad++, sublime text, visual studio etc... Please try to fix this. Thanks

@zinface

zinface commented Aug 21, 2023

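I would like a configuration option that lists the different encodings I think may exist in the workspace. That would let this feature make a better choice.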

@formigoni

Hello!
This Issue has been open for 6 years and still has no solution.
I (and others in this thread) still need to compile from source, use the OSS version, and accept the limitations that come with it. For example, I can't use Settings Sync because it's available only in the Microsoft version, and I can't use Dev Containers for the same reason.
A text editor that is probably the most widely used in the world by developers, and one that I personally love, shouldn't have seemingly trivial issues taking so long to be resolved.
What if the author of jschardet never releases the version? What if they pass away? Win the lottery?
@bpasero , please help us!

@nfrance709

nfrance709 commented Oct 25, 2023

I assume that the vast majority of VSCode users fall into one of the following categories:

  1. Use UTF-8 exclusively.
  2. Use another encoding exclusively.
  3. Use either UTF-8 or a single other encoding.

The basic solution provided here: #36951 (comment) addresses this, but it requires compiling from source.

What's needed is either a new configuration option or a way to leverage an existing option. This would specify that if a file is not detected as UTF-8, it should fall back to the specified encoding. This is essentially what the fix linked above does. While UTF-8 detection appears to be reliable, determining other encodings will always involve some guesswork and will never be 100% accurate.
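The "UTF-8, otherwise one configured fallback" rule described above can be sketched with a strict UTF-8 decode. This is a minimal illustration under stated assumptions, not the patch linked in this comment; the `utf8OrFallback` helper and the sample byte arrays are made up for the example. It relies on the standard `TextDecoder` with `fatal: true`, which throws on byte sequences that are not valid UTF-8.

```javascript
// Hypothetical helper: report 'utf-8' if the bytes decode strictly as UTF-8,
// otherwise report the single fallback encoding the user configured.
function utf8OrFallback(bytes, fallbackEncoding) {
  try {
    // fatal: true makes decode() throw instead of inserting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'utf-8';
  } catch {
    return fallbackEncoding;
  }
}

// "café" encoded as UTF-8 (é = C3 A9) and as windows-1252 (é = E9).
const utf8Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]);
const cp1252Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

console.log(utf8OrFallback(utf8Bytes, 'windows-1252'));   // utf-8
console.log(utf8OrFallback(cp1252Bytes, 'windows-1252')); // windows-1252
```

Pure ASCII content is valid UTF-8, so it is reported as UTF-8 here, which is harmless for single-byte fallbacks such as windows-1252 because their ASCII range decodes identically.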

I'd prefer to use the release version of VSCode, but due to this issue, I'm forced to compile and use the OSS version. It shouldn't be that difficult to fix this via a configuration option.

@bigdogs

bigdogs commented Nov 3, 2023

Any updates? Still hoping for a solution.

@peminator

I'm waiting too. I need a way to provide a "priority guess" list: say I often use utf8 + Win1250, so those are tested first, and only if both fail is the classic guess used... Can someone smart enough find a way and propose a commit to them? Pretty pretty please, it would make a proper Xmas present for me.

@Genius-er

When will the problem of guessing the wrong encoding be solved? I only need a configuration guess list to solve most of my problems. The main encodings I use are utf8 and gb18030.

@aadsm

aadsm commented Feb 23, 2024 via email

@peminator

@aadsm hi, any news on jschardet in VS Code?

Whatever I do, if guessing is on, my Windows1250 files open as various ISO-... encodings, seemingly at random depending on the characters they contain.
VS Code should be updated to first check whether the default encoding setting may be the right one, and only if the score there is very low, go on to detection.

Most people need one encoding + UTF8 afaik, because the one non-UTF encoding is from the pre-UTF era. So if someone knows where the detection is used, just make it check against the default encoding setting first. In my case, as an example, I have problems with accented characters like "ť": in windows1250 it is a small "t" plus a small accent symbol at the top right, while the ISO8859-2 that gets detected displays it as some weird character, a square with a question mark inside.

@aadsm

aadsm commented Mar 18, 2024

Hey! I've spent the past 2 weeks setting up CI first. That has been my biggest obstacle when it comes to shipping new versions.
I added support for preferred encodings 2 years ago (aadsm/jschardet@9b49243), but I haven't published a new version with it yet.
Once I have the CI up and running (should be pretty soon now), a new version will come out with this new API change.
I'll then look into whether I can support a config file: either a jschardet-specific one (which might be the better way, since many other projects use this library), or a jschardet.* entry in the VS Code config file, which could also be useful when working on projects with different encoding demands.

@yutotnh
Contributor

yutotnh commented Mar 23, 2024

@aadsm
Thank you for releasing 3.1.0!

I am trying to add functionality for this issue to VS Code using the detectEncodings option that was added.
However, I found one bug in jschardet.
Namely, jschardet throws an error if any of the following 6 encodings is specified in detectEncodings:

  • Shift-JIS
  • EUC-JP
  • GB2312
  • EUC-KR
  • Big5
  • EUC-TW

I just submitted a pull request to fix this bug.
Is it possible to import this fix into jschardet and release it as 3.1.1?
aadsm/jschardet#91

I use Shift_JIS and EUC-JP on a daily basis, so fingers crossed that this fix will be incorporated.

@aadsm

aadsm commented Mar 23, 2024

I've just published a new patch version to npm. Those hours learning github actions and workflows were really worth it in the end ahah.

@nfrance709

nfrance709 commented Mar 23, 2024

When using version 3.1.1 of jschardet and testing with my attached charset_test_file.php file it still does not return the correct encoding which is windows-1252.

charset_test_file.php.txt

It returns { encoding: 'ISO-8859-2', confidence: 0.8496565744888162 } by default and it returns { encoding: null, confidence: 0 } when using { detectEncodings: ["UTF-8", "windows-1252"] }

const fs = require('fs');
const jschardet = require('jschardet');

jschardet.enableDebug();

const content = fs.readFileSync('charset_test_file.php');
// const result = jschardet.detect(content);
const result = jschardet.detect(content, { detectEncodings: ["UTF-8", "windows-1252"] });
console.log(result);

In my project the files are either UTF-8 or windows-1252. I suspect most projects are either exclusively UTF-8 or UTF-8 plus one other local encoding, so ideally we need an option so that if UTF-8 is not detected, detection falls back to the local encoding provided in the array.

For now I’ll have to continue modifying \src\vs\workbench\services\textfile\common\encoding.ts as described here #36951 (comment)

@yutotnh
Contributor

yutotnh commented Mar 25, 2024

@aadsm
Thanks for releasing 3.1.1.
Thanks to you, I was able to create a pull request (#208550).

@nfrance709
This pull request will open with files.encoding if the encoding cannot be guessed, as in charset_test_file.php.txt.

@aadsm

aadsm commented Mar 25, 2024

@nfrance709 thanks for your detailed info. I was able to find and fix a couple of bugs related to detecting windows-1252. I also added a new test case with the exact code that was failing for you. I'm using the file you provided in my test. I hope that's ok with you!
I've released these fixes under version 3.1.2.

@aadsm

aadsm commented Mar 25, 2024

@yutotnh sorry about this, but could you update your pr with 3.1.2? 😅

yutotnh added a commit to yutotnh/vscode that referenced this issue Mar 25, 2024
@yutotnh
Contributor

yutotnh commented Mar 25, 2024

@aadsm Updated jschardet to 3.1.2 at ff546d5.

@nfrance709

> @nfrance709 thanks for your detailed info. I was able to find and fix a couple of bugs related to detecting windows-1252. I also added a new test case with the exact code that was failing for you. I'm using the file you provided in my test. I hope that's ok with you! I've released these fixes under version 3.1.2.

@yutotnh thank you for the fixes in 3.1.2 and yes use the file as a test case.

Once @yutotnh's pull request is accepted and a new version of VS Code is released, I can go back to using the release version, as there are a number of features missing from the OSS version that I would like to use.

yutotnh added a commit to yutotnh/vscode that referenced this issue Mar 25, 2024
@nfrance709

> @aadsm Thanks for releasing 3.1.1. Thanks to you, I was able to create a pull request (#208550).
>
> @nfrance709 This pull request will open with files.encoding if the encoding cannot be guessed, as in charset_test_file.php.txt.

Thank you, I just compiled and tested your latest version using 3.1.2 and it works as expected. I hope your pull request is accepted soon.
