8274329: Fix non-portable HotSpot code in MethodMatcher::parse_method_pattern #5704

Closed
wants to merge 3 commits into from

Conversation

DamonFool
Member

@DamonFool DamonFool commented Sep 26, 2021

Hi all,

I tried to build OpenJDK on Cygwin (Windows 2016 + VS2019).
However, the build failed with C4474 and C4778 warnings, as shown below:

Compiling 100 properties into resource bundles for java.desktop
Compiling 3038 files for java.base
e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): error C2220: the following warning is treated as an error
e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): warning C4778: 'sscanf' : unterminated format string '%255[*\x01\x02\x03\x04\x05\x06\a\b\n\v\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f!\"#$%&'*+,-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\\^_`abcdefghijklmnopqrstuvwxyz{|}~\xe2\x82\xac\xe4\xba\x97\xe5\x84\x8e\xe5\x8e\x97%n'
e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): warning C4474: 'sscanf' : too many arguments passed for format string
e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): note: placeholders and their parameters expect 1 variadic arguments, but 3 were provided
e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(319): warning C4778: 'sscanf' : unterminated format string '%1022[[);/\x01\x02\x03\x04\x05\x06\a\b\n\v\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f!\"#$%&'*+,-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\\^_`abcdefghijklmnopqrstuvwxyz{|}~\xe2\x82\xac\xe4\xba\x97\xe5\x84\x8e\xe5\x8e%n'
e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(319): warning C4474: 'sscanf' : too many arguments passed for format string
e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(319): note: placeholders and their parameters expect 0 variadic arguments, but 2 were provided

The failure is caused by non-ASCII chars in the sscanf format strings [1][2], which are not portable on our Windows platform.
In fact, these non-ASCII bytes also trigger the C4819 warning, which was disabled in JDK-8216154 [3].
I also found an article showing that sscanf may fail with non-ASCII chars in the format string [4].
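
To make the role of those bytes concrete, here is a minimal C++ sketch of a scanset-based name parser. It is not the HotSpot code: the helper `scan_name`, its ASCII character set, and the idea of assembling the `%255[...]%n` format at run time are assumptions made for illustration. The scanset lists bytes 0x80 to 0xef one by one, in the same spirit as the format strings shown in the warnings, and the sketch assumes the C library treats the scanset as plain bytes.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper, not HotSpot code: read a run of "name" bytes from
// `input` into `out` (which must hold at least 256 bytes) and report how many
// bytes were consumed. Returns true if at least one byte matched.
bool scan_name(const char* input, char* out, int* consumed) {
    assert(input != nullptr && out != nullptr && consumed != nullptr);
    // ASCII part of the scanset (an illustrative subset chosen for this sketch).
    std::string set =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_$<>/";
    // Add bytes 0x80..0xef individually so UTF-8 encoded names are accepted.
    for (int b = 0x80; b <= 0xef; b++) {
        set.push_back(static_cast<char>(b));
    }
    // Built at run time, so this source file itself stays pure ASCII.
    std::string format = "%255[" + set + "]%n";
    *consumed = 0;
    int matched = std::sscanf(input, format.c_str(), out, consumed);
    return matched == 1;
}
```

With this sketch, a call such as `scan_name("foo(I)V", buf, &n)` is expected to stop at `(`, because that byte is not in the scanset.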

So it would be nice to remove these non-ASCII chars (\x80 ~ \xef).
And I think it's safe to do so.

This is because:

  1. There are actually no non-ASCII chars in package/class/method/signature names.
  2. I don't think there is a use case in which people pass non-ASCII chars to CompileCommand.

You may argue that the non-ASCII chars may be used by the parser itself.
But I didn't find any such usage. (Please let me know if I missed something.)

So I suggest removing this non-ASCII code to make HotSpot more portable.
If we do so, we can also remove the only PRAGMA_DISABLE_MSVC_WARNING(4819) [5].

Testing:

  • Build tests on Windows
  • tier1~3 on Linux/x64

Thanks.
Best regards,
Jie

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L269
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L319
[3] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-January/032014.html
[4] https://jeffpar.github.io/kbarchive/kb/047/Q47369/
[5] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L246


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8274329: Fix non-portable HotSpot code in MethodMatcher::parse_method_pattern

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5704/head:pull/5704
$ git checkout pull/5704

Update a local copy of the PR:
$ git checkout pull/5704
$ git pull https://git.openjdk.java.net/jdk pull/5704/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5704

View PR using the GUI difftool:
$ git pr show -t 5704

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5704.diff

@bridgekeeper

bridgekeeper bot commented Sep 26, 2021

👋 Welcome back jiefu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@DamonFool DamonFool marked this pull request as ready for review September 26, 2021 10:59
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 26, 2021
@openjdk

openjdk bot commented Sep 27, 2021

@DamonFool The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Sep 27, 2021
@mlbridge

mlbridge bot commented Sep 27, 2021

Webrevs

@JornVernee
Member

FWIW, there was some prior discussion here about this code as well: #3107

tl;dr MSVC uses the system locale's code page to parse this, which must be set to en-us to work. From the fact that you are getting these warnings, I guess you have a different locale set (can check with systeminfo). See also: https://bugs.openjdk.java.net/browse/JDK-8264425

Though, I can't comment on the changes in this PR.

@DamonFool
Member Author

Thanks @JornVernee for your comments.

My system locale is zh-cn.

But changing the locale isn't acceptable, since many of our apps require zh-cn in our country.

According to JBS, the C4819 warning was first observed with VS2017 and was disabled by JDK-8216154.
The C4474 and C4778 warnings are then triggered with VS2019.
So the code is actually non-portable.

If the non-ASCII code is unused, it should be removed to make HotSpot more portable.
Then C4819, C4474 and C4778 would disappear,
and OpenJDK could be built without changing the locale setting.

Thanks.

@JornVernee
Member

> My system locale is zh-cn.
>
> But changing the locale isn't acceptable since many of our Apps require zh-cn in our country.

I understand, and that is totally reasonable to me.

There might be another way to change the locale just for the compilation [1], but I haven't had time to test that (so for now I think the official advice is to use en-us as the system locale). I think something like this should also remove the warnings (after reconfigure clean):

diff --git a/make/autoconf/flags-cflags.m4 b/make/autoconf/flags-cflags.m4
index ba4cb3e9ac7..3078a2e916c 100644
--- a/make/autoconf/flags-cflags.m4
+++ b/make/autoconf/flags-cflags.m4
@@ -496,7 +496,7 @@ AC_DEFUN([FLAGS_SETUP_CFLAGS_HELPER],
     TOOLCHAIN_CFLAGS_JVM="-qtbtable=full -qtune=balanced \
         -qalias=noansi -qstrict -qtls=default -qnortti -qnoeh -qignerrno -qstackprotect"
   elif test "x$TOOLCHAIN_TYPE" = xmicrosoft; then
-    TOOLCHAIN_CFLAGS_JVM="-nologo -MD -MP"
+    TOOLCHAIN_CFLAGS_JVM="-nologo -MD -MP -utf-8"
     TOOLCHAIN_CFLAGS_JDK="-nologo -MD -Zc:wchar_t-"
   fi

In your case the compiler produced some warnings, but I'm wondering if using a different encoding could also silently create subtle behavioral changes. I think it would be good if a specific encoding could be used at build time.

> According to the JBS, C4819 warning was first observed with VS2017 and was disabled by JDK-8216154.
> Then C4474 and C4778 warnings are triggered with VS2019.
> So the code is actually non-portable.
>
> If the non-ascii code is useless, it should be removed to make HotSpot to be more portable.
> Then C4819, C4474 and C4778 would disappear.
> And OpenJDK can be built without changing the locale setting.

I agree with your reasoning, but I can not comment on the contents of the patch, because I'm not a maintainer of this code.

[1] : https://bugs.openjdk.java.net/browse/JDK-8268719

@DamonFool
Member Author

Thanks for your suggestions, @JornVernee .
I tested the above patch, but it still failed. (rm build -rf; bash configure ..; make images)

@JornVernee
Member

Ok, that's unfortunate. Thanks for testing.

@DamonFool
Member Author

Let's see what others think of the change.

Hope the non-ascii code is actually not used.
Then we can safely remove it.
Thanks.

@vnkozlov
Contributor

/label add build

@openjdk openjdk bot added the build build-dev@openjdk.org label Sep 29, 2021
@openjdk

openjdk bot commented Sep 29, 2021

@vnkozlov
The build label was successfully added.

@vnkozlov
Contributor

This should be discussed with the build group, and maybe runtime, to get more comments.

@vnkozlov
Contributor

/label add hotspot-runtime

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Sep 29, 2021
@openjdk

openjdk bot commented Sep 29, 2021

@vnkozlov
The hotspot-runtime label was successfully added.

@vnkozlov
Contributor

RANGEBASE was added by JDK-6500501 and later was modified by JDK-8027829

Note the original comment from 6500501:

// The characters allowed in a class or method name.  All characters > 0x7f
// are allowed in order to handle obfuscated class files (e.g. Volano)

@DamonFool
Member Author

Thanks @vnkozlov for your very helpful comments.

I have one question: how can we specify (non-ascii chars) and (non-printable ascii chars) through -XX:CompileCommand?

I just learned from https://bugs.openjdk.java.net/browse/JDK-8027829 that we can use unicode like \uxxxx.
But it doesn't work in my experiments.

My example was made from: https://bugs.openjdk.java.net/secure/attachment/17128/UnicodeIdentifierTest.java

public class UnicodeIdentifierTest {
    public static void main(String args[]) {
        System.out.println("Can I use \\u0001 in identifier name? " +
                           (Character.isJavaIdentifierPart(1) ? "yes" : "no"));
        for (int i = 0; i < 100000; i++) {
            methodWithUnicode\u0001Char();
        }

        System.out.println("Can I use \\u00aa in identifier name? " +
                           (Character.isJavaIdentifierPart(0xaa) ? "yes" : "no"));
        for (int i = 0; i < 100000; i++) {
            methodWithUnicode\u00aaChar();
        }

        System.out.println("Can I use \\u006b in identifier name? " +
                           (Character.isJavaIdentifierPart(0x6b) ? "yes" : "no"));
        for (int i = 0; i < 100000; i++) {
            methodWithUnicode\u006bChar();
        }
    }

    public static int a = 0;

    public static void methodWithUnicode\u0001Char() {
        a++;
    }

    public static void methodWithUnicode\u00aaChar() {
        a++;
    }

    public static void methodWithUnicode\u006bChar() {
        a++;
    }
}

And I tried to exclude some specific methods like this

${JDK}/bin/java \
   -XX:+PrintCompilation \
   -XX:CompileCommand=exclude,`echo -e "UnicodeIdentifierTest::methodWithUnicode\u0001Char"` \
   -XX:CompileCommand=exclude,`echo -e "UnicodeIdentifierTest.methodWithUnicode\u0001Char"` \
   -XX:CompileCommand=exclude,"UnicodeIdentifierTest.methodWithUnicode\u0001Char" \
   -XX:CompileCommand=exclude,'UnicodeIdentifierTest.methodWithUnicode\u0001Char' \
   -XX:CompileCommand=exclude,UnicodeIdentifierTest.methodWithUnicode\u0001Char \
   -XX:CompileCommand=exclude,`echo -e "UnicodeIdentifierTest::methodWithUnicode\u00aaChar"` \
   -XX:CompileCommand=exclude,`echo -e "UnicodeIdentifierTest.methodWithUnicode\u00aaChar"` \
   -XX:CompileCommand=exclude,"UnicodeIdentifierTest.methodWithUnicode\u00aaChar" \
   -XX:CompileCommand=exclude,'UnicodeIdentifierTest.methodWithUnicode\u00aaChar' \
   -XX:CompileCommand=exclude,UnicodeIdentifierTest.methodWithUnicode\u00aaChar \
   -XX:CompileCommand=exclude,`echo -e "UnicodeIdentifierTest::methodWithUnicode\u006bChar"` \
   -XX:CompileCommand=exclude,`echo -e "UnicodeIdentifierTest.methodWithUnicode\u006bChar"` \
   -XX:CompileCommand=exclude,"UnicodeIdentifierTest.methodWithUnicode\u006bChar" \
   -XX:CompileCommand=exclude,'UnicodeIdentifierTest.methodWithUnicode\u006bChar' \
   -XX:CompileCommand=exclude,UnicodeIdentifierTest.methodWithUnicode\u006bChar \
   ${TEST}

But none of them worked.

So if there is no other way to specify non-ASCII chars, it seems safe to remove the non-ASCII code.

If I missed something, please let me know.
Thanks.

@magicus
Member

magicus commented Sep 30, 2021

Some misc remarks from a build PoV:

  • We count language and region settings as a build environment requirement, not a portability issue.
  • It is really a shame that Microsoft is making changes to these so darned hard. On all other platforms, LC_ALL=C in the make file just fixes the problem. :-(
  • But we do want the JDK to be easy to build, so this means that we might need to support building on more than en_US on Windows, at least until Microsoft gets their act together.

From what I see in the discussion here, there seems to be no clarity about what range of characters the specification allows. This needs to be absolutely clear for any changes here -- we can't filter out legal characters just because they are problematic to build on non-en_US platforms.

However, I'm thinking that you need to take a step back and see what you are really trying to solve. To me, it seems that sscanf is not the right tool for the job, and the fact that it has worked until now is more a lucky coincidence. It seems, from a quick glance, that you should consider the input a byte array, and process it like that, instead of a string, if the encoding is unclear, and the spec is talking about character values (like 0x7f) rather than what characters they are supposed to represent in a specific encoding.
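
As a rough illustration of the byte-array approach, the following C++ sketch scans a pattern byte by byte instead of going through sscanf. Everything in it is hypothetical: `scan_name_bytes`, `is_delimiter` and the delimiter set are assumptions for this example, not HotSpot code. Every byte above 0x7f is accepted as-is, which mirrors the comment from 6500501 quoted earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical delimiter test: the bytes that end a name in this sketch.
// Anything else, including every byte above 0x7f, counts as part of the name.
static bool is_delimiter(unsigned char b) {
    switch (b) {
        case '\0': case ' ': case '\t': case ',':
        case '(': case ')': case '[': case ']':
            return true;
        default:
            return false;
    }
}

// Copies at most `max_len` name bytes from `line` into `out` and returns the
// number of bytes consumed. No locale or code page is involved.
std::size_t scan_name_bytes(const char* line, std::string& out, std::size_t max_len) {
    assert(line != nullptr);
    out.clear();
    std::size_t i = 0;
    while (i < max_len && !is_delimiter(static_cast<unsigned char>(line[i]))) {
        out.push_back(line[i]);
        i++;
    }
    return i;
}
```

Because the loop compares raw byte values, its behavior does not depend on how the compiler or the C library interprets characters above 0x7f.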

@DamonFool
Member Author

Thanks @magicus .

The background is that we want to build CI/CD pipelines for Windows platforms to help OpenJDK development.

We already have enough Linux and macOS pipelines but still do not have one for Windows.
So we plan to set up some Windows pipelines to further improve OpenJDK product quality.

But to my surprise, OpenJDK fails to build on our Windows platforms.

You may suggest changing the locale settings.
But many of our apps don't allow us to change it, since we are in a non-English-speaking country.
It's unfortunate that OpenJDK can't be built on our Windows platforms.

It's not our goal to make CompileCommand work with non-ASCII chars.
If it doesn't make anything worse, we can just remove the non-ASCII code to make it more portable.

@iklam
Member

iklam commented Sep 30, 2021

(The Chinese characters in this comment may not be displayed properly inside an e-mail reader. Please see this comment on GitHub #5704)

-XX:CompileCommand does not process \uxxxx sequences. However, if your shell's locale is UTF8, you can do something like this, by directly entering them on the command-line, without escaping with \u:

public class CJK {
    public static void main(String args[]) {
        \u722a\u54c7();
    }

    static void \u722a\u54c7() { // Chinese word for "Java"
        Thread.dumpStack();
    }
}
=======
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

$ javac CJK.java
$ java -Xcomp -XX:-BackgroundCompilation -XX:CompileCommand='compileonly,*::爪哇' -XX:+PrintCompilation -cp . CJK > log.txt
java.lang.Exception: Stack trace
	at java.base/java.lang.Thread.dumpStack(Thread.java:1380)
	at CJK.爪哇  (CJK.java:7)
	at CJK.main(CJK.java:3)
$ grep '^   ' log.txt
     53    1    b  3       CJK::\u722a\u54c7 (4 bytes)
     53    2    b  4       CJK::\u722a\u54c7 (4 bytes)
     53    1       3       CJK::\u722a\u54c7 (4 bytes)   made not entrant
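
The difference between typing the escape literally and typing the characters themselves can be sketched in C++. `encode_utf8` below is a hypothetical helper that only covers U+0001 to U+FFFF. The point is that the JVM receives either twelve ASCII characters (a backslash, `u`, four hex digits, twice) or six UTF-8 bytes, and only the latter can equal the bytes of the method name stored in the class file.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one code point in U+0001..U+FFFF as UTF-8.
std::string encode_utf8(unsigned int cp) {
    assert(cp >= 0x1 && cp <= 0xFFFF);
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));                          // 1 byte
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));            // 2 bytes
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));           // 3 bytes
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// What arrives when the escape sequence is typed literally on the command line.
std::string escaped_name() { return "\\u722a\\u54c7"; }

// What a UTF-8 shell passes when the two characters are typed directly.
std::string utf8_name() { return encode_utf8(0x722A) + encode_utf8(0x54C7); }
```

Comparing `escaped_name()` with `utf8_name()` shows two different byte strings, which fits the observation that none of the escaped variants matched.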

@iklam
Member

iklam commented Sep 30, 2021

The current limitations of the MethodMatcher class are:

[1] parse_method_pattern(char*& line, ...) requires line to be a UTF8-encoded byte sequence. Essentially, this means that for -XX:CompileCommand to support non-ASCII characters, the JVM (and the shell that runs the JVM) must be using UTF8 character encoding.

Note that a "locale" contains 3 parts: language, country and character encoding. For example,

  • en_US.utf8 (English language, United States, UTF8 character encoding)
  • zh_CN.utf8 (Chinese language, China, UTF8 character encoding)
  • zh_CN.gbk (Chinese language, China, GBK character encoding)

The first two support non-ASCII characters in -XX:CompileCommand, but the third one doesn't.

[2] MethodMatcher uses sscanf. It assumes that the JVM is running with UTF8 character encoding because

  • It uses char* strings returned by sscanf to compare with the bytes stored in Symbols. This requires sscanf to return strings that are encoded in UTF8, because Symbols stores the string with UTF8-encoded bytes.
  • It relies on range checking by sscanf to enforce the following restrictions. However, these restrictions are given with the RANGE macro which is UTF8 encoded bytes (and I suspect that these are incorrect when handling multi-byte UTF8-encoded characters):
// '\0' and 0xf0-0xff are disallowed in constant string values
// 0x20 ' ', 0x09 '\t' and, 0x2c ',' are used in the matching
// 0x5b '[' and 0x5d ']' can not be used because of the matcher
// 0x28 '(' and 0x29 ')' are used for the signature
// 0x2e '.' is always replaced before the matching
// 0x2f '/' is only used in the class name as package separator
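
The restrictions in that comment can be read as a per-byte predicate. The C++ sketch below is one such reading, with hypothetical names (`is_allowed_name_byte`, `is_allowed_name`) and an `in_class_name` flag that is an assumption of this example, not a HotSpot parameter. It checks bytes, not characters, so it says nothing about whether a multi-byte sequence is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the byte-level rules listed in the comment above.
bool is_allowed_name_byte(unsigned char b, bool in_class_name) {
    if (b == 0x00 || b >= 0xf0) return false;             // NUL and 0xf0-0xff
    if (b == ' ' || b == '\t' || b == ',') return false;  // used in the matching
    if (b == '[' || b == ']') return false;               // reserved by the matcher
    if (b == '(' || b == ')') return false;               // signature delimiters
    if (b == '.') return false;                           // replaced before matching
    if (b == '/') return in_class_name;                   // package separator only
    return true;
}

// Returns true if `name` is non-empty and every byte passes the check.
bool is_allowed_name(const char* name, bool in_class_name) {
    assert(name != nullptr);
    if (name[0] == '\0') return false;
    for (std::size_t i = 0; name[i] != '\0'; i++) {
        if (!is_allowed_name_byte(static_cast<unsigned char>(name[i]), in_class_name)) {
            return false;
        }
    }
    return true;
}
```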

==================================
Proposed solutions:

I don't think we can solve [1] easily. To handle non-ASCII characters that are not encoded in UTF8, we need to call NewPlatformString() in src/java.base/share/native/libjli/java.c. However, this executes Java code, but -XX:CompileCommand needs to be processed before any Java code execution. ==> Proposal: IGNORE it for now.

For [2], there are two distinct issues:

(a) The restriction checks are invalid when the JVM is running in a non-UTF8 encoding -- this is a moot point because we can't handle [1] anyway, so the data given to sscanf() is already bad. => Proposal: IGNORE it for now

(b) VC++ compilation warning when methodMatcher.cpp is compiled in non-UTF8 environments

This is just a warning, and (I think .....) it doesn't change the object file at all. I.e., the literal strings in methodMatcher.obj are exactly the same as if methodMatcher.cpp is compiled under a UTF8 environment.

Proposal: use pragma to disable the warning.
Assuming that my analysis for [1] and (a) is correct, there's no reason to fix the sscanf code. Disabling the warnings with pragma is the most painless and easiest way to handle this.
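
A minimal sketch of what such a suppression could look like around a single call site, using the standard MSVC `#pragma warning` push/disable/pop form. The warning numbers are taken from the build log at the top of this thread; the function `scan_ascii_name` and its ASCII-only format are placeholders for illustration, not the HotSpot code (which has its own PRAGMA_DISABLE_MSVC_WARNING macro).

```cpp
#include <cassert>
#include <cstdio>

// Suppress the MSVC format-string diagnostics for the code below only.
// Other compilers skip this block because _MSC_VER is not defined.
#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 4474 4778)
#endif

// Placeholder function: reads an ASCII identifier from `input` into `out`
// (which must hold at least 256 bytes) and returns the bytes consumed.
int scan_ascii_name(const char* input, char* out) {
    assert(input != nullptr && out != nullptr);
    int consumed = 0;
    if (std::sscanf(input, "%255[A-Za-z0-9_$]%n", out, &consumed) != 1) {
        return 0;
    }
    return consumed;
}

#ifdef _MSC_VER
#pragma warning(pop)
#endif
```

Scoping the suppression with push and pop keeps the diagnostics active for the rest of the file, so unrelated format-string mistakes would still be reported.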

@DamonFool could you try this experiment:

  • Implement the pragma and build two JDKs -- one in a Chinese Windows environment, and another in an English Windows environment
  • run strings methodMatcher.obj and see if the output is identical
  • run the "CJK" test example in my previous comment, and see if you get identical results with both JDKs

(If this doesn't work, an alternative is to avoid using sscanf and write our own parser).

Thanks

@DamonFool
Member Author

Thanks @iklam for your excellent analysis.

So HotSpot does support non-ASCII chars as names.
Then we shouldn't simply remove such non-ASCII code.

I will do your experiment next week.
This is because it's already our National Day week and I can't find an English Windows machine until next week.
I'll let you know the result as soon as possible.
Thanks.

@iklam
Member

iklam commented Sep 30, 2021

No need to hurry :-). In case you can't find an English Windows, I think you can use the chcp 65001 command mentioned in https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line to change your command-line window to use the UTF8 codepage.

@DamonFool
Member Author

Okay.

I also note the warning

e:\jiefu\ws\jdk\src\hotspot\share\compiler\methodMatcher.cpp(269): warning C4778: 'sscanf' : unterminated format string '%255[*\x01\x02\x03\x04\x05\x06\a\b\n\v\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f!\"#$%&'*+,-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\\^_`abcdefghijklmnopqrstuvwxyz{|}~\xe2\x82\xac\xe4\xba\x97\xe5\x84\x8e\xe5\x8e\x97%n'

It is already different from the original RANGEBASE sequence [1].

~\xe2\x82\xac\xe4\xba\x97\xe5\x84\x8e\xe5\x8e\x97%n

vs.

    "\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f" \
    "\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f" \
    "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf" \
    "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf" \
    "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf" \
    "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf" \
    "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef"

Not sure if this fact is sufficient to say the literal strings will be different in methodMatcher.obj.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L49

@DamonFool
Member Author

Hi @iklam ,

methodMatcher.obj [1] built with System Locale: zh-cn;Chinese (China)
methodMatcher.obj [2] built with System Locale: en-us;English (United States)

There seems to be no difference between the two when checking with strings methodMatcher.obj.

The warnings disappear when the system locale is en-us;English (United States).
But unfortunately, I can't reproduce the "CJK" test example: non-ASCII chars in CompileCommand still fail with both JDK images (even the one built with the en-us locale, which produced no warnings at all).

So it's far more complicated than I had thought.
I will just close this PR since we can't remove the non-ASCII code, which works in some countries.

Thank you all for your help and valuable comments.

Best regards,
Jie

[1] https://github.com/DamonFool/experiment/blob/main/JDK-8274329/ch-methodMatcher.obj
[2] https://github.com/DamonFool/experiment/blob/main/JDK-8274329/en-methodMatcher.obj

@iklam
Member

iklam commented Oct 5, 2021

My experiments above with -XX:CompileCommand='compileonly,*::爪哇' were done on Linux. I tried the same on Windows. On US-English Windows, the default codepage is 437 (DOS Latin US). If I change it to 65001 (UTF8), Java is able to output CJK characters to the console.

public class CJK {
    public static void main(String args[]) {
        System.out.println(args[0]);
        \u722a\u54c7();
    }

    static void \u722a\u54c7() { // Chinese word for "Java"
        Thread.dumpStack();
    }
}
c:\ade>chcp
Active code page: 437

c:\ade>jdk-17\bin\java -cp . CJK 123
123
java.lang.Exception: Stack trace
        at java.base/java.lang.Thread.dumpStack(Thread.java:1380)
        at CJK.??(CJK.java:8)
        at CJK.main(CJK.java:4)

c:\ade>chcp 65001
Active code page: 65001

c:\ade>jdk-17\bin\java -cp . CJK 爪哇
??
java.lang.Exception: Stack trace
        at java.base/java.lang.Thread.dumpStack(Thread.java:1380)
        at CJK.爪哇(CJK.java:8)
        at CJK.main(CJK.java:4)

As you can see, the CJK characters in the command-line arguments can't even be correctly passed as arguments to the Java main class. If that doesn't work, I can't see how we can get -XX:CompileCommand to work with CJK characters.

@mlbridge

mlbridge bot commented Oct 5, 2021

Mailing list message from Magnus Ihse Bursie on build-dev:

On 2021-10-05 08:41, Ioi Lam wrote:

As you can see, the CJK characters in the command-line arguments can't
even be correctly passed as arguments to the Java main class. If that
doesn't work, I can't see how we can get `-XX:CompileCommand` to work
with CJK characters.

So, what does that mean? That we should explicitly limit
`-XX:CompileCommand` to work with ASCII-only arguments? I accept that we
might not get all characters to work in all circumstances due to
limitations in Windows, but the current state of affairs still feels
unsatisfactory. We should at least have a better failure mode, and
document any limitations.

/Magnus

@DamonFool
Member Author

Thanks @iklam and @magicus for your experiments and comments.

My experiments show that CompileCommand with non-ASCII chars doesn't work on non-US-English Windows.
And @iklam's experiments show that it doesn't work on US-English Windows either.
So I suggest we disable non-ASCII chars for Windows.

The patch has been updated.

  1. On non-Windows platforms, CompileCommand still works as before.
  2. On Windows, it will be limited to work with ASCII-only arguments.
    For non-ASCII chars, the parser will fail like this:
    >java -XX:CompileCommand=compileonly,*::爪哇 -version
    CompileCommand: An error occurred during parsing
    Error: Non-ASCII characters are not supported on Windows.
    Line: 'compileonly,*::爪哇'
    

What do you think?
Thanks.

Member

@iklam iklam left a comment

The idea looks good to me. I just have a suggestion to make the code more readable.

"\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5a\x5c\x5e\x5f" \
"\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f" \
"\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f"
#endif
Member

It's hard to tell what the difference is between these two RANGEBASE definitions. How about doing it like this to make the code more readable?

#define RANGEBASE_ASCII "....."
#define RANGEBASE_NON_ASCII "....."
#ifdef WINDOWS
#define RANGEBASE RANGEBASE_ASCII
#else  
#define RANGEBASE RANGEBASE_ASCII RANGEBASE_NON_ASCII 
#endif

Member Author

Good suggestion!
Updated.
Thanks.

@openjdk

openjdk bot commented Oct 6, 2021

@DamonFool This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8274329: Fix non-portable HotSpot code in MethodMatcher::parse_method_pattern

Reviewed-by: iklam, ihse, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 126 new commits pushed to the master branch:

  • c10de35: 8262944: Improve exception message when automatic module lists provider class not in JAR file
  • b8af6a9: 8273917: Remove 'leaf' ranking for Mutex
  • c80a612: 8273381: Assert in PtrQueueBufferAllocatorTest.stress_free_list_allocator_vm
  • 9759fcb: 8274496: Use String.contains() instead of String.indexOf() in java.desktop
  • cdf8930: 8274625: Search field placeholder behavior
  • df125f6: 8273410: IR verification framework fails with "Should find method name in validIrRulesMap"
  • c74726d: 8274730: AArch64: AES/GCM acceleration is broken by the fix for JDK-8273297
  • 2faced0: 8274738: ZGC: Use relaxed atomic load when reading bits in the live map
  • 8a4d2b4: 8274680: Remove unnecessary conversion to String in java.desktop
  • 986ee5d: 8274670: Improve version string handling in SA
  • ... and 116 more: https://git.openjdk.java.net/jdk/compare/5ec1cdcaf39229a7d2457313600b0dc2bf8c6453...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 6, 2021
Member

@magicus magicus left a comment

I think this was the best possible solution.

Contributor

@vnkozlov vnkozlov left a comment

Looks good to me. Let me test it before approval.

Contributor

@vnkozlov vnkozlov left a comment

Passed my tier1-3 testing

@DamonFool
Member Author

Thanks @iklam @magicus and @vnkozlov .
/integrate

@openjdk

openjdk bot commented Oct 6, 2021

Going to push as commit c833b4d.
Since your change was applied there have been 133 commits pushed to the master branch:

  • d57fb6f: 8274456: Remove jtreg tag manual=yesno java/awt/print/PrinterJob/PageDialogTest.java
  • 734d1fb: 8274211: Test man page that options are documented
  • 9561fea: 8273102: Delete deprecated for removal the empty finalize() in java.desktop module
  • 9945f7a: 8274318: Replace 'for' cycles with iterator with enhanced-for in java.management
  • 754bc82: 8274525: Replace uses of StringBuffer with StringBuilder in java.xml
  • 4e7d7ca: 8273711: Remove redundant stream() call before forEach in jdk.jlink
  • f3cedbe: 8274464: Remove redundant stream() call before forEach in java.* modules
  • c10de35: 8262944: Improve exception message when automatic module lists provider class not in JAR file
  • b8af6a9: 8273917: Remove 'leaf' ranking for Mutex
  • c80a612: 8273381: Assert in PtrQueueBufferAllocatorTest.stress_free_list_allocator_vm
  • ... and 123 more: https://git.openjdk.java.net/jdk/compare/5ec1cdcaf39229a7d2457313600b0dc2bf8c6453...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Oct 6, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 6, 2021
@openjdk

openjdk bot commented Oct 6, 2021

@DamonFool Pushed as commit c833b4d.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@DamonFool DamonFool deleted the JDK-8274329 branch October 6, 2021 23:23
Labels
build build-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
5 participants