Test for and fix encoding errors on Windows #25322

Open
lptr opened this issue Jun 7, 2023 · 5 comments
Labels: a:chore (Minor issue without significant impact), in:daemon, in:exec-tasks, re:windows (Issue related to using Gradle on Windows)

Comments

@lptr
Member

lptr commented Jun 7, 2023

This is a followup to #25261.

Let's enable the use of non-ASCII paths for Windows integration tests, and fix any discovered problems.

lptr added the @execution label Jun 7, 2023
lptr added this to the 8.3 RC1 milestone Jun 7, 2023
lptr self-assigned this Jun 7, 2023
@lptr
Member Author

lptr commented Jun 7, 2023

One problem discovered: when we spawn a daemon process with cmd /c ..., the daemon parameters passed on the command line get their encoding garbled.

I tried the following:

  • java -cp . -Dvar=teșt Main from PowerShell: special character replaced with ?

    Expected chars: [ u0074 u0065 u00c8 u2122 u0074 ], bytes: [ 74 65 c3 88 e2 84 a2 74 ]
    Received chars: [ u0074 u0065 u003f u0074 ],       bytes: [ 74 65 3f 74 ]
    
  • java -cp . -Dvar=teșt Main from CMD: special character garbled

    Expected chars: [ u0074 u0065 u00c8 u2122 u0074 ], bytes: [ 74 65 c3 88 e2 84 a2 74 ]
    Received chars: [ u0074 u0065 u002b u00d6 u0074 ], bytes: [ 74 65 2b c3 96 74 ]
    
  • Running java -cp . -Dvar=teșt Main from batch file: special character garbled

    Expected chars: [ u0074 u0065 u00c8 u2122 u0074 ], bytes: [ 74 65 c3 88 e2 84 a2 74 ]
    Received chars: [ u0074 u0065 u002b u00d6 u0074 ], bytes: [ 74 65 2b c3 96 74 ]
    
  • cmd /c java -cp . -Dvar=teșt Main: special character replaced with ?

    Expected chars: [ u0074 u0065 u00c8 u2122 u0074 ], bytes: [ 74 65 c3 88 e2 84 a2 74 ]
    Received chars: [ u0074 u0065 u003f u0074 ],       bytes: [ 74 65 3f 74 ]
    
  • BUT: Running java -cp . -Dvar=teșt Main from a PowerShell script (.ps1) works! 🎉

    Expected chars: [ u0074 u0065 u00c8 u2122 u0074 ], bytes: [ 74 65 c3 88 e2 84 a2 74 ]
    Received chars: [ u0074 u0065 u00c8 u2122 u0074 ], bytes: [ 74 65 c3 88 e2 84 a2 74 ]
    

(Not sure if this is important, but Java seems to use NFD (the decomposed normal form) even when the script file is in NFC (the composed form).)

All this was tested on Windows 10.0.18363.2274 with PowerShell 5.1.18362.2212. Most importantly the system encoding is set to CP1252 as shown by querying sun.jnu.encoding.
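The `Main` program used above is not shown in this thread. A minimal sketch of what such a probe could look like, assuming it reads the `var` system property and prints its UTF-16 code units and UTF-8 bytes in the same format as the output above (the helper names `describeChars` and `describeBytes` are hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class Main {
    // Formats each UTF-16 code unit as "uXXXX", mirroring the "chars" output above.
    static String describeChars(String s) {
        StringBuilder sb = new StringBuilder("[ ");
        for (char c : s.toCharArray()) {
            sb.append(String.format("u%04x ", (int) c));
        }
        return sb.append("]").toString();
    }

    // Formats the UTF-8 bytes of the string as two-digit hex values.
    static String describeBytes(String s) {
        StringBuilder sb = new StringBuilder("[ ");
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // Assumption: the value under test arrives as -Dvar=... on the command line.
        String received = System.getProperty("var", "");
        System.out.println("Received chars: " + describeChars(received)
                + ", bytes: " + describeBytes(received));
        // The encoding the JVM uses for platform strings such as arguments.
        System.out.println("sun.jnu.encoding: " + System.getProperty("sun.jnu.encoding"));
    }
}
```

Printing code units rather than the string itself keeps the result independent of the console code page, which is itself a source of garbling here.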

@lptr
Member Author

lptr commented Jun 7, 2023

Actually, those code points stand for something completely different... 🤔 I guess it's my terminal encoding.

But I tried running the child process from Java with ProcessBuilder, using the explicit character code \u0219 and got:

> Sending:  te?t, chars: [ u0074 u0065 u0219 u0074 ], bytes: [ 74 65 c8 99 74 ]
> Received: te?t, chars: [ u0074 u0065 u003f u0074 ], bytes: [ 74 65 3f 74 ]

It doesn't matter if I use cmd:

new ProcessBuilder("cmd", "/u", "/c", "java", "-cp", ".", "-Dvar=" + value, "Main")

(with or without /u)

Or execute Java directly:

new ProcessBuilder("java", "-cp", ".", "-Dvar=" + value, "Main")

One option I can imagine working is to encode the parameters ourselves with some ASCII-only encoding. URI encoding comes to mind, but % would need to be double-escaped...
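A rough sketch of that idea, assuming the parent percent-encodes each parameter as UTF-8 and the child decodes it on startup. The class and method names are hypothetical, and this is not how Gradle handles daemon parameters today:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class ArgCodec {
    // Percent-encodes the value as UTF-8 so that only ASCII reaches the command line.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Reverses encode() on the receiving side.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String encoded = encode("te\u0219t");
        System.out.println(encoded);                               // te%C8%99t
        System.out.println(decode(encoded).equals("te\u0219t"));   // true
    }
}
```

The encoded form still contains `%`, so the escaping concern mentioned above remains whenever the value passes through cmd or a batch file. An encoding without `%`, such as URL-safe Base64, would sidestep that at the cost of readability.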

@lptr
Member Author

lptr commented Jun 7, 2023

Also note that if I try to send non-ASCII characters that are part of CP1252, like tést, the transmission works (though the console output is still garbled, the codepoints are correct):

> Sending:  t??st, chars: [ u0074 u00c3 u00a9 u0073 u0074 ], bytes: [ 74 c3 83 c2 a9 73 74 ]
> Received: t??st, chars: [ u0074 u00c3 u00a9 u0073 u0074 ], bytes: [ 74 c3 83 c2 a9 73 74 ]
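This matches what a lossy conversion through CP1252 would produce. The sketch below only models that conversion in plain Java; that the process launch actually goes through the system code page this way is an assumption based on the `sun.jnu.encoding` value reported earlier, and the names `survives` and `roundTrip` are hypothetical:

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    // True if every character of the value can be represented in the given charset.
    static boolean survives(String value, Charset charset) {
        return charset.newEncoder().canEncode(value);
    }

    // Encodes and decodes the value; unmappable characters become the
    // charset's replacement byte, which is '?' for windows-1252.
    static String roundTrip(String value, Charset charset) {
        return new String(value.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(survives("t\u00e9st", cp1252));   // true: é is part of CP1252
        System.out.println(survives("te\u0219t", cp1252));   // false: ș is not
        System.out.println(roundTrip("te\u0219t", cp1252));  // te?t
    }
}
```

A check like `survives` could also be used in tests to pick path characters that are guaranteed to fall outside the system code page.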

@lptr
Member Author

lptr commented Jun 7, 2023

Using chcp 65001 to switch the console code page to UTF-8 fixes my console, but doesn't fix the process invocation:

> Sending:  teșt, chars: [ u0074 u0065 u0219 u0074 ], bytes: [ 74 65 c8 99 74 ]
> Received: te?t, chars: [ u0074 u0065 u003f u0074 ], bytes: [ 74 65 3f 74 ]

bot-gradle added a commit that referenced this issue Jun 15, 2023
… Linux and macOS

The overarching goal here is to shake out more encoding problems throughout Gradle by running all our integration tests in a way that the tested code has to deal with non-ASCII characters in file paths. This PR takes a step towards that goal by forcing all our non-Windows integration tests to use such a path.

To keep the scope manageable, this PR does not force non-ASCII paths for Windows. That needs to be enabled in a followup PR where we can deal with Windows-specific encoding problems.

The idea is similar to how we add a space to the `build/tmp/test files` directory's name where all the test output is typically located; this time we replace the `s` in `test files` with an `ŝ` (see [U+015D](https://www.compart.com/en/unicode/U+015D)). Importantly this character is not part of ASCII nor any ISO-8859-X codepage, and cannot be represented by a single byte. (See [Wikipedia](https://en.wikipedia.org/wiki/ŝ)).

There is also an escape hatch for tests that for some reason can't support Unicode paths; these need to be tagged with `@DoesNotSupportNonAsciiPaths`. We have such offenders today:

  - Checkstyle fails because of this bug: checkstyle/checkstyle#13012
  - Java 6 wrapper tests fail because Java 6 barfs on non-ASCII characters in the path

Most of the problems that had to be fixed for Unixes come from the fact that `URI.toString()` does not encode non-ASCII characters, and some tools can't parse string representations of URIs with non-ASCII characters in them.

So some of the `URI`s are now converted to strings using `toASCIIString()` instead. This is the canonical form of a URI, and is the intended way to go when the string form is passed to places where we can't ensure that it will be read with the right encoding (see https://www.w3.org/Addressing/URL/3_URI_Choices.html)
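To illustrate the difference, here is a small sketch using a made-up `file:` URI that contains the `ŝ` character; the path itself is only an example:

```java
import java.net.URI;

public class UriStrings {
    public static void main(String[] args) {
        // Hypothetical test directory URI with U+015D in one path segment.
        URI uri = URI.create("file:///build/tmp/te\u015dt%20files/");

        // toString() returns the URI with the non-ASCII character left as-is.
        System.out.println(uri.toString());

        // toASCIIString() percent-encodes it as UTF-8: ...te%C5%9Dt%20files/
        System.out.println(uri.toASCIIString());
    }
}
```

Already-escaped sequences such as `%20` are left untouched by `toASCIIString()`, so the conversion does not double-encode.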

There are followups:
- #25316
- #25322

This PR is a followup to the daemon encoding fix in:
- #25319

Co-authored-by: Lóránt Pintér <lorant@gradle.com>
lptr removed this from the 8.3 RC1 milestone Jul 7, 2023
ov7a added the a:chore, in:building-gradle, and re:windows labels Sep 12, 2023
lptr removed the @execution label Feb 6, 2024
ov7a added the in:daemon and in:exec-tasks labels and removed the in:building-gradle label Mar 21, 2024
@Lolothepro

Certain specific characters in the Windows username break compilation of project (#29213)
When the Windows username contains the "ï" character, it is not possible to compile NeoForge mods.

3 participants