Allow to configure a list of encodings to use when guessing #36951
Comments
Yes, I totally agree, because auto guess is so weak. |
I agree. In my environment we have files in two encodings - UTF-8 and Windows1251 (most popular text file encoding in Russia), so I need to use encoding detection. However, it sometimes detects windows1251-encoded files as "maccyrillic" or "Windows1252" or some other encoding that I've never seen in my life :D
So instead of just "true", you could specify which encodings you want it to detect from. As far as I know, encoding detection works on probabilities (you can't say with 100% certainty which file is in which encoding, so the software has to pick the most probable answer), so I think it is possible to implement: just filter the list of possible encodings down to those the user selected. |
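The filtering idea in the comment above could be sketched roughly as follows. This is only an illustration, not jschardet's or VS Code's actual code: the `Candidate` shape and the `pickAllowedEncoding` helper are hypothetical names, and it assumes the detector can report several candidates, each with a confidence score.

```typescript
// Hypothetical shape: one guess from a detector, with a probability-like score.
interface Candidate {
  encoding: string;
  confidence: number;
}

// Normalize names so "Windows-1251", "windows1251" and "WINDOWS_1251" compare equal.
function normalizeName(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Keep only the candidates the user allowed, then return the most confident one.
// Returns undefined when none of the candidates is in the allowed list.
function pickAllowedEncoding(
  candidates: Candidate[],
  allowed: string[]
): Candidate | undefined {
  const allowedNames = allowed.map(normalizeName);
  let best: Candidate | undefined;
  for (const c of candidates) {
    if (allowedNames.indexOf(normalizeName(c.encoding)) === -1) continue;
    if (best === undefined || c.confidence > best.confidence) best = c;
  }
  return best;
}
```

With an allowed list of `["UTF-8", "windows1251"]`, a "maccyrillic" guess is dropped even if it scored higher, and the best remaining candidate wins.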
Verification: There is now a Update: I decided to rename the setting to |
@octref you have to use a file that jschardet can detect properly. In your case it tells me: So it makes sense that UTF-8 if used |
To verify you can use |
@bpasero I see, the logic is
But I would argue this doesn't solve the users' problems. Let's say the user has a bunch of files that he knows is If the user wants all files to be opened as
A setting like this would be more useful:
{
  "files.encodingAssociations": {
    "gbk": ["gb2312", "gb18030"],
    "cp950": ["big5hkscs"]
    // Everything else falls back to "utf-8"
  }
} |
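For illustration, here is one way the proposed mapping might be applied to a detector's result. `files.encodingAssociations` is only the proposal from this comment, not an existing setting, and `resolveEncoding` is a hypothetical helper. The sketch assumes each key is the encoding to open the file with, each value lists detected names that should map to it, and a detected name equal to a key is kept as-is.

```typescript
// Proposed (not existing) setting shape: target encoding -> detected names that map to it.
type EncodingAssociations = Record<string, string[]>;

// Map a detected encoding name to the encoding the file should be opened with.
// Anything not mentioned in the associations falls back to `fallback`.
function resolveEncoding(
  detected: string | null,
  associations: EncodingAssociations,
  fallback: string = "utf-8"
): string {
  if (detected === null) return fallback;
  const needle = detected.toLowerCase();
  for (const target of Object.keys(associations)) {
    // Assumption: a detection that already equals the target is left alone.
    if (target.toLowerCase() === needle) return target;
    const aliases = associations[target];
    if (aliases.some((a) => a.toLowerCase() === needle)) return target;
  }
  return fallback;
}
```

With the associations from the comment, a file detected as "gb2312" or "gb18030" would open as "gbk", and any unlisted detection would open as "utf-8".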
Maybe someone from this issue could comment if that was the desired solution or not (@JasonJunMa). |
Hi, |
I still have the same issue myself, and unfortunately, I still need to compile from source and add a small modification for VSCode to work correctly with the files I use. See my previous post here with instructions https://github.com/microsoft/vscode/issues/36951#issuecomment-989472107 |
After reading your response, I tried the same solution myself and managed to compile it from source with the same modification. I could successfully enable extensions, and the only thing now preventing me from using the modified/compiled version is that I can't synchronize settings. The "Settings Sync" option doesn't appear in my application. Was this option (Settings Sync) meant to work in the OSS version, or is it a Microsoft-specific customization? |
Unfortunately, it's not available in the OSS version. I use this extension https://marketplace.visualstudio.com/items?itemName=Shan.code-settings-sync |
By the way if you want to run GitHub Copilot in the OSS version you may need to follow the instructions in the link below. It was needed a while back but not sure if that's changed now as I haven't used GitHub Copilot for some time. https://github.com/community/community/discussions/6629#discussioncomment-1524627 |
Like many people in this thread, I have to deal with files in UTF-8 and some other encoding (cp-1250 or iso-8859-2 in my case) for different types of files. VS Code is basically the only editor I have installed that has trouble detecting the encodings correctly. autoGuessEncoding="true" is not working at all for me, and I don't want to change the default file encoding because I want my new files to be saved in UTF-8. Every other editor I tried opened the files correctly: notepad, notepad2, notepad++, sublime text, visual studio etc. Please try to fix this. Thanks |
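I'd like a configuration option that lists the different encodings I think may exist in this workspace. That would let this feature make a better choice. |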
Hello! |
I assume that the vast majority of VSCode users fall into one of the following categories:
The basic solution provided here: #36951 (comment) addresses this, but it requires compiling from source. What's needed is either a new configuration option or a way to leverage an existing option. This would specify that if a file is not detected as UTF-8, it should fall back to the specified encoding. This is essentially what the fix linked above does. While UTF-8 detection appears to be reliable, determining other encodings will always involve some guesswork and will never be 100% accurate. I'd prefer to use the release version of VSCode, but due to this issue, I'm forced to compile and use the OSS version. It shouldn't be that difficult to fix this via a configuration option. |
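The "UTF-8 first, otherwise a configured fallback" behaviour described above can be sketched like this. It is not the patch from the linked comment and not VS Code's code: `isValidUtf8` and `chooseEncoding` are hypothetical helpers, the fallback name is whatever the user would configure, and a byte order mark is not treated specially.

```typescript
// Returns true when `bytes` is well-formed UTF-8 (rejects overlong forms,
// surrogate code points, and code points above U+10FFFF).
function isValidUtf8(bytes: Uint8Array): boolean {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    let extra: number; // number of continuation bytes expected
    let min: number;   // lowest value allowed for the second byte
    let max: number;   // highest value allowed for the second byte
    if (b0 <= 0x7f) { i++; continue; }
    else if (b0 >= 0xc2 && b0 <= 0xdf) { extra = 1; min = 0x80; max = 0xbf; }
    else if (b0 === 0xe0) { extra = 2; min = 0xa0; max = 0xbf; }
    else if (b0 === 0xed) { extra = 2; min = 0x80; max = 0x9f; }
    else if (b0 >= 0xe1 && b0 <= 0xef) { extra = 2; min = 0x80; max = 0xbf; }
    else if (b0 === 0xf0) { extra = 3; min = 0x90; max = 0xbf; }
    else if (b0 >= 0xf1 && b0 <= 0xf3) { extra = 3; min = 0x80; max = 0xbf; }
    else if (b0 === 0xf4) { extra = 3; min = 0x80; max = 0x8f; }
    else return false; // 0x80..0xc1 and 0xf5..0xff can never start a sequence
    if (i + extra >= bytes.length) return false; // sequence is cut off
    const b1 = bytes[i + 1];
    if (b1 < min || b1 > max) return false;
    for (let k = 2; k <= extra; k++) {
      const b = bytes[i + k];
      if (b < 0x80 || b > 0xbf) return false;
    }
    i += extra + 1;
  }
  return true;
}

// If the content is valid UTF-8, open it as UTF-8; otherwise use the
// user-configured fallback instead of guessing among all encodings.
function chooseEncoding(bytes: Uint8Array, fallbackEncoding: string): string {
  return isValidUtf8(bytes) ? "utf-8" : fallbackEncoding;
}
```

The trade-off is that this never guesses among legacy encodings at all: it relies on the observation that text in a single-byte encoding with accented characters is very unlikely to be well-formed UTF-8.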
Any updates? Still hoping for a solution. |
I'm waiting too. We need a "priority guess" list: say I often use UTF-8 + Win1250, so these would be tested first, and only if both fail would the classic guess be used... Can someone smart enough find a way and propose a commit to them? Pretty pretty please, it would make a proper Xmas present for me |
When will the problem of guessing the wrong encoding be solved? I only need a configuration guess list to solve most of my problems. The main encodings I use are utf8 and gb18030. |
I have an idea on this that I'm experimenting with right now. I should have some updates soon. To be clear: I'm the jschardet author, I have no affiliation with vscode or MS.
|
@aadsm hi, any news on jschardet in vscode? Whatever I do, if guessing is on, my Windows1250 files open as various ISO-... encodings, seemingly at random depending on the characters they contain. Most people need one encoding plus UTF-8 AFAIK, because the one non-UTF encoding comes from the pre-UTF era. So if you know where the detection is used, just make it check against the default encoding setting first. In my case, for example, I have problems with accented characters like "ť": in Windows1250 it is a small "t" with a small accent mark at the top right, while the ISO8859-2 that gets detected displays it as a weird character, a square with a question mark inside. |
Hey! I've spent the past 2 weeks setting up CI first. This has been my biggest issue when it comes to shipping new versions. |
@aadsm I am trying to add functionality for this issue to VS Code using the
I just submitted a pull request to fix this bug. I use Shift_JIS and EUC-JP on a daily basis, so fingers crossed that this fix will be incorporated. |
I've just published a new patch version to npm. Those hours learning github actions and workflows were really worth it in the end ahah. |
When using version 3.1.1 of jschardet and testing with my attached charset_test_file.php file it still does not return the correct encoding which is windows-1252. It returns { encoding: 'ISO-8859-2', confidence: 0.8496565744888162 } by default and it returns { encoding: null, confidence: 0 } when using { detectEncodings: ["UTF-8", "windows-1252"] }
In my project the files are either UTF-8 or windows-1252, and I suspect most projects are either exclusively UTF-8 or UTF-8 with one other local encoding. So ideally we need an option so that, if UTF-8 is not detected, detection falls back to the local encoding provided in the array. For now I’ll have to continue modifying \src\vs\workbench\services\textfile\common\encoding.ts as described here #36951 (comment) |
@aadsm @nfrance709 |
@nfrance709 thanks for your detailed info. I was able to find and fix a couple of bugs related to detecting windows-1252. I also added a new test case with the exact code that was failing for you. I'm using the file you provided in my test. I hope that's ok with you! |
@yutotnh sorry about this, but could you update your pr with 3.1.2? 😅 |
@yutotnh thank you for the fixes in 3.1.2, and yes, use the file as a test case. Once @yutotnh's pull request is accepted and a new version of VSCode is released, I can go back to using the release version, as there are a number of features missing from the OSS version that I would like to use. |
Thank you, I just compiled and tested your latest version using 3.1.2 and it works as expected. I hope your pull request is accepted soon. |
The files.autoGuessEncoding=true setting doesn't work well in some circumstances. I think it would be good if you added a feature like files.forceEncoding="encode1:encode2,encode3:encode4", so that it can force 'encode1' to 'encode2'. I think that would be a solution for wrong encoding detection.
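As a rough sketch of how the proposed string format might be interpreted: `files.forceEncoding` is only the suggestion from this issue, not an existing VS Code option, and both helpers below are hypothetical. It assumes each `from:to` pair means "when `from` is detected, open the file as `to`".

```typescript
// Parses "from1:to1,from2:to2" into a lookup table keyed by lower-cased name.
// Malformed entries (no colon, extra colons, empty names) are skipped.
function parseForceEncoding(value: string): Record<string, string> {
  const map: Record<string, string> = {};
  for (const pair of value.split(",")) {
    const parts = pair.split(":");
    if (parts.length !== 2) continue;
    const from = parts[0].trim().toLowerCase();
    const to = parts[1].trim().toLowerCase();
    if (from !== "" && to !== "") map[from] = to;
  }
  return map;
}

// Replaces a detected encoding with its forced counterpart, if one is configured.
function applyForceEncoding(detected: string, map: Record<string, string>): string {
  const key = detected.toLowerCase();
  return Object.prototype.hasOwnProperty.call(map, key) ? map[key] : detected;
}
```

For example, with the value "maccyrillic:windows1251", a file that the detector labels "maccyrillic" would be opened as "windows1251", while every other detection is left unchanged.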