vswhere.exe uses local code page to emit invalid JSON/XML #146

Closed
Suzumizaki opened this issue Apr 26, 2018 · 17 comments

@Suzumizaki
commented Apr 26, 2018

When I run vswhere -products * -legacy -format json on the Japanese edition of Windows 10 Pro, one of the output lines is:

"description": "学生、オープン ソース、および個々の開発者のための無料で完全な機能を備えた IDE",

The text above is correct, but it is encoded in code page 932 (the default encoding on Japanese Windows).

As described in RFC 8259, section 8.1 "Character Encoding", JSON text exchanged between systems MUST be encoded as UTF-8 (and MUST NOT start with a byte order mark). Please use UTF-8 so that the JSON is valid even when it includes non-ASCII strings like the one above. Otherwise, conforming JSON decoders reject the file as invalid, in particular because they read some bytes as bad JavaScript \ escapes.
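The failure can be reproduced without vswhere. The following is a minimal Python sketch (Python is chosen only for illustration) that encodes the description above as cp932 and hands the bytes to a JSON decoder. The names `document`, `cp932_bytes`, and `try_parse` are hypothetical, and the one-property object is a stand-in for the real output.

```python
import json

# The description value from the output above.
description = "学生、オープン ソース、および個々の開発者のための無料で完全な機能を備えた IDE"

# Simulate what ends up in the redirected file on a Japanese system:
# JSON-shaped text encoded with code page 932 instead of UTF-8.
document = json.dumps({"description": description}, ensure_ascii=False)
cp932_bytes = document.encode("cp932")


def try_parse(raw: bytes):
    """Return the parsed object, or the error raised by the JSON decoder."""
    try:
        return json.loads(raw)
    except ValueError as err:
        # UnicodeDecodeError and json.JSONDecodeError both derive from ValueError.
        return err


result = try_parse(cp932_bytes)
print(type(result).__name__)

# Transcoding to UTF-8 first makes the same text parse cleanly.
utf8_bytes = cp932_bytes.decode("cp932").encode("utf-8")
print(json.loads(utf8_bytes)["description"] == description)
```

Python's `json.loads` assumes UTF-8 for byte input that has no BOM, which is why the cp932 bytes are rejected before any JSON syntax is even examined.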

Almost the same applies to vswhere -products * -legacy -format xml. vswhere.exe uses the local code page (cp932 in my environment) without an encoding declaration at the beginning of the XML file. To keep it simple, just hard-code UTF-8.
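A similar sketch for the XML case, again in Python and under stated assumptions: the element names below are illustrative and not taken from vswhere's actual schema. Without an encoding declaration, an XML parser assumes UTF-8, so cp932 bytes are rejected while the same text encoded as UTF-8 parses.

```python
import xml.etree.ElementTree as ET

description = "学生、オープン ソース"

# XML-shaped text with no <?xml ... encoding="..."?> declaration.
# The element names are illustrative only.
xml_text = f"<instance><description>{description}</description></instance>"


def parse_description(raw: bytes):
    """Return the description text, or the ParseError raised by the parser."""
    try:
        return ET.fromstring(raw).findtext("description")
    except ET.ParseError as err:
        return err


from_cp932 = parse_description(xml_text.encode("cp932"))
from_utf8 = parse_description(xml_text.encode("utf-8"))

print(type(from_cp932).__name__)
print(from_utf8 == description)
```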

On the other hand, I think the default format (-format text, or no -format at all) should keep using the local code page. Otherwise it shows unreadable strings (mojibake) in the cmd.exe window.

@heaths

Member

commented Apr 26, 2018

Please see #92. As a console application, it uses the console code page unless redirected to a file. See https://github.com/Microsoft/vswhere/wiki/Encoding for more information and workarounds for console issues all console applications exhibit.

@Suzumizaki

Author

commented Apr 26, 2018

...No, sorry, I will try reading the page you pointed me to.

[I retract the text below]
Sorry for my poor explanation, but I am talking about the case where the output is redirected to a file, as you say.
Or am I misunderstanding how to redirect to a file?: vswhere -products * -legacy -format json > vswhere.txt

@heaths

Member

commented Apr 26, 2018

Thank you. That helps explain the problem. Console output will always use the current console code page, but when redirecting I could look at explicitly setting UTF-8. The concern would be whether this would break other applications that have worked around this.

@heaths heaths reopened this Apr 26, 2018

@Suzumizaki

Author

commented Apr 27, 2018

Sorry again, but I can't understand the behavior.

Please tell me why chcp 65001 does not resolve my problem. vswhere.exe fails to encode non-ASCII text as UTF-8, but cl.exe and link.exe can do it.

(In fact, cl.exe also fails, but only when it calls link.exe to build an .exe file directly; that is probably not your concern.)

@heaths

Member

commented Apr 27, 2018

There are a number of issues that lead to this problem. The input and output console code pages can affect output sent through either the | or > operators to another program, and the other program may use the console input code page. In some cases, while trying to resolve this, I found characters were getting transcoded again even though they were already valid UTF-8 or UTF-16.

This is going to take some time to resolve (I'm discussing with experts internally) and even longer to publish since this is likely a breaking change we won't take so lightly until we understand the full ramifications. For now, please use the workarounds I previously recommended.

@heaths

Member

commented Apr 28, 2018

@Suzumizaki, @vector-of-bool, @itn3000: vswhere.exe writes localized console output the way a number of experts in this area recommend (i.e. assuming the console code page matches the current language, though I'm considering taking the language for display strings as a parameter), but @Suzumizaki raises a good point that JSON should always be in UTF-8. Even when I wrote the XML formatter, I had considered adding the output encoding to the <?xml?> processing instruction, but getting the XML-approved encoding name is not straightforward across all supported versions of Windows and would require dramatically increasing the size of vswhere. Instead, I could make sure that redirected output - at least for JSON and XML, but perhaps all formats - is always UTF-8. But that would be a breaking change for anyone currently handling output in OEM code pages.

So we're weighing options, but I'd love your input:

  1. Just do it. Most scripts are only interested in the InstallationPath property anyway, which doesn't have localized values (unless specified that way during installation, in which case it would have to match the console code page anyway for most product scenarios to work in a console).
  2. Change the major version to 3, thus following semver rules for breaking changes, and expect people to update their code as necessary.
  3. Add a switch to always force UTF-8 similar to what @itn3000 proposed in PR #132 a while back (though there's an easier way I can do this).
  4. Add -o <file> to output to a file always using UTF-8. This is similar to a handful of other programs.

If people are obtaining vswhere through nuget or similar mechanisms that isolate packages, option 1 would probably be the easiest and most complete; however, since we ship vswhere at "%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" with a guarantee to support that path, breaking changes could be problematic - but only if you use localized text. Most - but certainly not all - users are getting properties like InstallationPath and very little else, from what I've seen in code changes I've been asked to review.

@KindDragon

Contributor

commented Apr 28, 2018

Add -o to output to a file always using UTF-8. This is similar to a handful of other programs.

👍

@vector-of-bool

commented Apr 28, 2018

Option 3 is the safest bet, but I think writing UTF-8 unconditionally in JSON mode is also safe. I highly doubt that many tools out there consuming the JSON data were (or are) able to handle non-UTF-8 JSON.

@heaths

Member

commented Apr 28, 2018

@vector-of-bool, I was thinking of doing the same with XML and sticking the encoding in the processing instruction.

What's troubling is that using the redirection operator > in PowerShell also prepends a UTF8 BOM to the file for any output. That's why I'm personally leaning toward option 2 and then adding a parameter to override the culture. The console code page for printing would still have to be compatible, but redirecting to a file or pipe should always be correct in cmd.exe or powershell.exe.
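For consumers that read such a redirected file, one defensive approach is to check for a BOM before decoding. The following Python sketch is an assumption about how a consumer might cope, not something vswhere provides. Which BOM a shell writes depends on its version and settings, so the sketch accepts UTF-8 and UTF-16 BOMs and otherwise falls back to a caller-supplied code page. `decode_redirected` and its `fallback` parameter are hypothetical names.

```python
import codecs
import json


def decode_redirected(raw: bytes, fallback: str = "cp932") -> str:
    """Decode bytes read from a redirected file, honoring a leading BOM.

    `fallback` is the code page assumed when no BOM is present; cp932 is only
    an example and should match the console code page that produced the file.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The utf-16 codec reads the BOM itself to pick the byte order.
        return raw.decode("utf-16")
    return raw.decode(fallback)


text = '{"description": "オープン ソース"}'
samples = [
    codecs.BOM_UTF8 + text.encode("utf-8"),  # UTF-8 with BOM
    text.encode("utf-16"),                   # Python's utf-16 codec writes a BOM
    text.encode("cp932"),                    # no BOM, local code page
]
for raw in samples:
    print(json.loads(decode_redirected(raw))["description"])
```

The BOM is stripped before parsing because a JSON parser given a string that still begins with U+FEFF will typically reject it.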

@KindDragon, option 4 certainly works around a number of problems but may not be obvious to people (I would explain it in -? but several bugs and many emails prove people aren't reading it), nor does it easily solve the problem of piping unless, perhaps, like wget, I support something like -o- (and/or probably -o -), which would write to stdout. Still not the most discoverable, but at least it solves some of the problem.

@ita1024

commented May 2, 2018

Another option could be to add a command-line switch or environment variable to force vswhere.exe to always output utf8-encoded data?

@heaths

Member

commented May 2, 2018

The switch is option 3, but I'll count that as a vote. :) I wouldn't use environment variables, though. That's far less discoverable and often requires more setup in a programming environment than just passing a switch or taking advantage of inherent behavior. Thanks!

@Suzumizaki

Author

commented May 4, 2018

I like option 3. It would keep the contents of "description" (see the first post) in the local language rather than English. I can also accept option 2, but the PowerShell problem remains, as @heaths said.

@heaths

Member

commented May 4, 2018

@Suzumizaki, for now, you can work around this by using the console encoding that corresponds to the current user culture. While JSON should use UTF8, the text is currently still valid with the right encoding in this case. After all, this was designed as a console program and Windows' support for UTF8 is not like on other platforms (pretty much non-existent for console applications).

@Suzumizaki

Author

commented May 5, 2018

@heaths:

After all, this was designed as a console program

Do you mean that vswhere.exe and other development tools should be used without redirecting or piping? Or are you suggesting something else?

The problem I ran into is not in my own project. See "Adhoc fix where waf cannot run under Japanese version of Windows" on the waf-project.

Anyway, I'll be happy once option 3 or 2 is available, and I'll suggest that the waf-project use it.

@heaths

Member

commented May 5, 2018

I'm saying console programs almost always output using the console code page. That's the default behavior, and since Windows doesn't really support UTF8 output in the console (not very well anyway) the current OEM console code page gets used. The JSON formatting was aimed at scenarios for finding Visual Studio, not really at displaying localized strings in other programs. As mentioned, if your current culture and console code page match (the default, and typical case) there is no problem. And so in applications where you consume the redirected output, matching the console code page will also work if you ignore that the JSON-like text isn't UTF-8 as dictated by the RFC.

@Suzumizaki

Author

commented May 5, 2018

@heaths:

if your current culture and console code page match (the default, and typical case) there is no problem.

Which do you mean?

  1. If trouble occurs even when "your current culture and console code page match", that is a bug in the application that uses vswhere.exe.
    • If so, please tell the developers of the waf-project. Of course I will tell them too.
  2. "matching the console code page will also work if you ignore that the JSON-like text isn't UTF-8 as dictated by the RFC" is always true.
    • That's WRONG. A JSON decoder cannot distinguish the second byte of a DBCS/MBCS/OEM code page character (which can be 0x5c even though it does not mean a backslash) from a backslash escape under the JSON specification.
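
The 0x5c point can be checked concretely. In the following Python sketch (illustrative only), a decoder that treats each byte as one character is modelled by decoding with latin-1; that modelling choice is an assumption, since real byte-oriented decoders differ in detail. The sample text is a shortened piece of the description from the first post.

```python
import json

text = '{"description": "オープン ソース"}'
raw = text.encode("cp932")

# In cp932, ソ is the two bytes 0x83 0x5C; the second byte equals ASCII backslash.
print("ソ".encode("cp932").hex())  # prints 835c

# A decoder that treats every byte as one character (modelled here with latin-1)
# sees a backslash followed by a byte that is not a valid JSON escape.
bytewise = raw.decode("latin-1")
try:
    json.loads(bytewise)
    outcome = "parsed"
except json.JSONDecodeError as err:
    outcome = err.msg
print(outcome)

# Decoding with the matching code page before parsing avoids the problem.
print(json.loads(raw.decode("cp932"))["description"])
```

So whether the text parses depends on the consumer knowing the producing code page and decoding before any JSON parsing happens.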
@heaths

Member

commented May 5, 2018

Depends on the JSON decoder. PowerShell's ConvertFrom-Json handles it fine.

heaths added a commit that referenced this issue Jun 11, 2018

Add -utf8 option to force UTF8 encoding
Attempt to fix #146. The console host and shell's output encoding still play a major factor, however. In cmd.exe, you still need to set chcp to display strings. In powershell.exe, you need to set chcp to display strings and use [Console]::OutputEncoding = 'UTF8' when redirecting to a file (which will itself encode as Unicode). The -utf8 switch does, however, fix the problem in testing with Node's child_process.execFile.
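
As a rough illustration of consuming the new switch from another program, here is a Python sketch analogous to the Node scenario mentioned in the commit message. It captures stdout directly as bytes, so no shell redirection is involved. The helper names `parse_instances` and `find_instances` are hypothetical, the install path is the one quoted earlier in this thread, and the `installationPath` key is an assumption about the JSON property name.

```python
import json
import os
import subprocess

# Path quoted earlier in this thread; expandvars leaves it unchanged off Windows.
VSWHERE = os.path.expandvars(
    r"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe"
)


def parse_instances(raw: bytes) -> list:
    """Decode vswhere's JSON output, assuming it was produced with -utf8."""
    return json.loads(raw.decode("utf-8"))


def find_instances(vswhere: str = VSWHERE) -> list:
    """Run vswhere with -utf8 and return the parsed list of instances."""
    completed = subprocess.run(
        [vswhere, "-utf8", "-products", "*", "-legacy", "-format", "json"],
        stdout=subprocess.PIPE,  # capture raw bytes; no console or shell transcoding
        check=True,
    )
    return parse_instances(completed.stdout)


# Only attempt the call where vswhere is actually installed.
if os.path.exists(VSWHERE):
    for instance in find_instances():
        print(instance.get("installationPath"))
```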

@heaths heaths closed this in #149 Jun 13, 2018

heaths added a commit that referenced this issue Jun 13, 2018

Add -utf8 option to force UTF8 encoding
Attempt to fix #146. The console host and shell's output encoding still play a major factor, however. In cmd.exe, you still need to set chcp to display strings. In powershell.exe, you need to set chcp to display strings and use [Console]::OutputEncoding = 'UTF8' when redirecting to a file (which will itself encode as Unicode). The -utf8 switch does, however, fix the problem in testing with Node's child_process.execFile.