Try Windows-1252 encoding on failure to decode utf-8 at file load #57

Closed
wants to merge 1 commit into
base: master
from

daloic commented Sep 22, 2018

A lot of Fortran code created with Visual Studio under Windows is encoded as Windows-1252. If decoding a file as UTF-8 fails on load, this change tries Windows-1252 instead. If that also fails, loading simply fails with no further detection attempts.

One could try to detect the encoding with the chardet library or something similar, but that is very slow and not reliable. The problem is that these files are nearly always mostly ASCII, and then somewhere near the end, in the middle of a comment, there is an accented character like è. The chardet library only runs detection on the first characters of the file because it is a very slow process. This is why I am doing just a single "try Windows-1252, and fail if neither works."
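The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request; the function names `decode_source` and `read_source` are hypothetical.

```python
def decode_source(raw):
    """Decode bytes as UTF-8, falling back once to Windows-1252.

    Returns (text, encoding_used). If both decoders fail, the
    UnicodeDecodeError from the Windows-1252 attempt propagates.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Single fallback, no further detection. Python's cp1252 codec
        # leaves a few bytes (e.g. 0x81) undefined, so this can still fail.
        return raw.decode("cp1252"), "cp1252"


def read_source(path):
    """Hypothetical helper: read a file from disk and decode it."""
    with open(path, "rb") as fh:
        return decode_source(fh.read())
```

Reading the file in binary mode first means the bytes are read from disk only once, whichever decoder ends up succeeding.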

Contributor

michaelkonecny commented Sep 22, 2018

I don't think this is a good solution. The encoding isn't always going to be Windows-1252; it depends on your system language and on the language you're writing in as well.
For example, for Czech, the encoding would normally be Windows-1250, and there are many other variants.

If possible, the encoding used here should be the same one the editor uses. Can you check whether it is sent in one of the LSP messages?

If LSP doesn't support this, then I guess the server could detect the encoding in a similar way to how one of the editors does it, but this could create a mess if the algorithms don't match and the editor ends up using a different encoding than the server.

daloic commented Sep 22, 2018

You do not have the encoding information when you load files from disk to parse the workspace. This is the problem. It means you effectively have no idea what you get in your file, and the current assumption is to use utf-8 and simply stop on a decoding error.

Windows-1252 is largely compatible with latin-1, and together they are the most widely used encodings after utf-8. What matters is that both use exactly one byte per character, which is important for the parser. If one character is not recognized, that is not really an issue, since Fortran variables, etc. are all ASCII. But you do want to correctly match the positions in the file.
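To illustrate the point about positions, here is a small sketch with a made-up Fortran line. It only demonstrates the one-byte-per-character property; it makes no claim about how the server's parser actually tracks offsets.

```python
# A made-up source line with an accented character inside a comment.
line = "call foo() ! très important: bar\n"

raw_cp1252 = line.encode("cp1252")  # one byte per character
raw_utf8 = line.encode("utf-8")     # "è" takes two bytes here

# Position of the token "bar" measured in characters and in bytes.
char_col = line.index("bar")
cp1252_col = raw_cp1252.index(b"bar")
utf8_col = raw_utf8.index(b"bar")

# Decoding the same bytes as latin-1 keeps every ASCII token in place too.
latin1_col = raw_cp1252.decode("latin-1").index("bar")
```

With the single-byte encodings the byte offset and the character offset agree, while in UTF-8 the byte offset after the accented character is shifted by one.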

Contributor

michaelkonecny commented Sep 22, 2018

I didn't know Windows-1252 was that popular.

However, I'm still not convinced that preferring one specific encoding over all others is a good approach.
Can't we just do

open(filepath, 'r', encoding="utf-8", errors="replace")

to kill all the birds with one stone?
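As a rough sketch of what that call does with a Windows-1252 file, the example below writes a few made-up bytes to a temporary file and reads them back. The helper name `read_lossy` and the sample content are assumptions for illustration only.

```python
import os
import tempfile


def read_lossy(filepath):
    """Read a file as UTF-8, replacing undecodable bytes with U+FFFD."""
    with open(filepath, "r", encoding="utf-8", errors="replace") as fh:
        return fh.read()


# Windows-1252 bytes: 0xE9 is "é", which is not valid UTF-8 on its own.
raw = b"x = 1 ! caf\xe9\ny = 2\n"

with tempfile.NamedTemporaryFile(suffix=".f90", delete=False) as tmp:
    tmp.write(raw)
try:
    text = read_lossy(tmp.name)
finally:
    os.remove(tmp.name)

# The ASCII code survives; only the accented byte in the comment is replaced.
lines = text.splitlines()
```

In this sample the replaced byte becomes a single U+FFFD, so the character count still matches the byte count. That is not guaranteed in general: two adjacent non-ASCII bytes that happen to form a valid UTF-8 sequence would decode to one character.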

Owner

hansec commented Sep 28, 2018

Thanks for the report and suggestions. I think the best way to go here is what @michaelkonecny suggested: just replacing any characters that cannot be decoded as UTF-8. As far as I know, all actual Fortran source code has to be ASCII anyway, so this will only affect comments, which shouldn't be too bad.

Owner

hansec commented Nov 21, 2018

I just released a new version (0.9.2) that hopefully fixes these errors using the method recommended by @michaelkonecny. Sorry it took so long. I am going to close this pull request, but let me know if this update does not fix this issue for you. Thanks again for the report and input.

hansec closed this Nov 21, 2018
