-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8274329: Fix non-portable HotSpot code in MethodMatcher::parse_method_pattern #5704
Conversation
👋 Welcome back jiefu! A progress list of the required criteria for merging this PR into |
@DamonFool The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
FWIW, there was some prior discussion here about this code as well: #3107 tl;dr MSVC uses the system locale's code page to parse this, which must be set to Though, I can't comment on the changes in this PR. |
Thanks @JornVernee for your comments. My system local is zh-cn. But changing the locale isn't acceptable since many of our Apps require zh-cn in our country. According to the JBS, C4819 warning was first observed with VS2017 and was disabled by JDK-8216154. If the non-ascii code is useless, it should be removed to make HotSpot to be more portable. Thanks. |
I understand, and that is totally reasonable to me. There might be another way to change the locale just for the compilation [1], but I haven't had time to test that (so for now I think the official advice is to us
In your case the compiler produced some warnings, but I'm wondering if using a different encoding could also silently create subtle behavioral changes. I think it would be good if a specific encoding could be used at build time.
I agree with your reasoning, but I can not comment on the contents of the patch, because I'm not a maintainer of this code. |
Thanks for your suggestions, @JornVernee . |
Ok, that's unfortunate. Thanks for testing. |
Let's see what others think of the change. Hope the non-ascii code is actually not used. |
/label add build |
@vnkozlov |
This should be discussed with |
/label add hotspot-runtime |
@vnkozlov |
Note the original comment from 6500501:
|
Thanks @vnkozlov for your very helpful comments. I have one question: how can we specify (non-ascii chars) and (non-printable ascii chars) through I just learned from https://bugs.openjdk.java.net/browse/JDK-8027829 that we can use unicode like My example was made from: https://bugs.openjdk.java.net/secure/attachment/17128/UnicodeIdentifierTest.java
And I tried to exclude some specific methods like this
But none of them worked. So if there is no other way to specify a non-ascii chars, it seems safe to remove the non-ascii code. If I miss something, please let me know. |
Some misc remarks from a build PoV:
From what I see in the discussion here there seems to be no clarity in what range of character the specification allows. This needs to be absolutely clear for any changes here -- we can't filter out legal characters just because they are problematic to build on non en_US platforms. However, I'm thinking that you need to take a step back and see what you are really trying to solve. To me, it seems that sscanf is not the right tool for the job, and the fact that it has worked until now is more a lucky coincidence. It seems, from a quick glance, that you should consider the input a byte array, and process it like that, instead of a string, if the encoding is unclear, and the spec is talking about character values (like 0x7f) rather than what characters they are supposed to represent in a specific encoding. |
Thanks @magicus . The background is that we want to build CI/CD pipelines for Windows platforms to help the OpenJDK development. We already have enough Linux and MacOS pipelines but still not have one for Windows. But to my surprise, OpenJDK fails to build on our Windows platforms. You may suggest changing the locale settings. It's not our goal to make CompileCommand work with non-ASCII chars. |
(The Chinese characters in this comment may not be displayed properly inside an e-mail reader. Please see this comment on GitHub #5704) -XX:CompileCommand does not process \uxxxx sequences. However, if your shell's locale is UTF8, you can do something like this, by directly entering them on the command-line, without escaping with \u:
|
The current limitations of the MethodMather class are: [1] Note that a "locale" contains 3 parts: language, country and character encoding. For example,
The first two support non-ASCII characters in -XX:CompileCommand, but the third one doesn't. [2] MethodMather uses
================================== I don't think we can solve [1] easily. To handle non-ASCII characters that are non encoded in UTF8, we need to call NewPlatformString() in src/java.base/share/native/libjli/java.c. However, this executes Java code, but -XX:CompileCommand needs to be processed before any Java code execution. ==> Proposal: IGNORE it for now. For [2], there are two distinct issues: (a) The restriction checks are invalid when the JVM is running in an non-UTF8 encoding -- this is a moot point because we can't handle [1] anyway, so the data given to sscanf() is already bad. => Proposal: IGNORE it for now (b) VC++ compilation warning when methodMather.cpp is compiled in non-UTF8 environments This is just a warning, and (I think .....) it doesn't change the object file at all. I.e., the literal strings in methodMatcher.obj are exactly the same as if methodMather.cpp is compiled under a UTF8 environment. Proposal: use pragma to disable the warning. @DamonFool could you try this experiment:
(If this doesn't work, an alternative is to avoid using sscanf and write our own parser). Thanks |
Thanks @iklam for your excellent analysis. So HotSpot does support non-ASCII chars as names. I will do your experiment next week. |
No need to hurry :-). In case you can't find an English Windows, I think you can use the |
Okay. I also note the warning
It is already different with the original
vs.
Not sure if this fact is sufficient to say the literal strings will be different in methodMatcher.obj. |
Hi @iklam , methodMatcher.obj [1] built with There seems no difference when checking with The warnings disappear when the system locale is So it's far more complicated than I had thought. Thank you all for your help and valuable comments. Best regards, [1] https://github.com/DamonFool/experiment/blob/main/JDK-8274329/ch-methodMatcher.obj |
My experiments above with
As you can see, the CJK characters in the command-line arguments can't even be correctly passed as arguments to the Java main class. If that doesn't work, I can't see how we can get |
Mailing list message from Magnus Ihse Bursie on build-dev: On 2021-10-05 08:41, Ioi Lam wrote:
So, what does that mean? That we should explicitly limit /Magnus |
Thanks @iklam and @magicus for your experiments and comments. My experiments show that CompileCommand doesn't work with non-US-English env Windows. The patch has been updated.
What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea looks good to me. I just have a suggestion to make the code more readable.
"\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5a\x5c\x5e\x5f" \ | ||
"\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f" \ | ||
"\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f" | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to tell what's the difference between these two RANGEBASE definitions. How about doing it like this to make the code more readable?
#define RANGEBASE_ASCII "....."
#define RANGEBASE_NON_ASCII "....."
#ifdef WINDOWS
#define RANGEBASE RANGEBASE_ASCII
#else
#define RANGEBASE RANGEBASE_ASCII RANGEBASE_NON_ASCII
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to tell what's the difference between these two RANGEBASE definitions. How about doing it like this to make the code more readable?
#define RANGEBASE_ASCII "....." #define RANGEBASE_NON_ASCII "....." #ifdef WINDOWS #define RANGEBASE RANGEBASE_ASCII #else #define RANGEBASE RANGEBASE_ASCII RANGEBASE_NON_ASCII #endif
Good suggestion!
Updated.
Thanks.
@DamonFool This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 126 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was the best possible solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Let me test it before approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passed my tier1-3 testing
Going to push as commit c833b4d.
Your commit was automatically rebased without conflicts. |
@DamonFool Pushed as commit c833b4d. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Hi all,
I tried to build OpenJDK on Cygwin (Windows 2016 + VS2019).
However, I failed with C4474 and C4778 warnings as below:
The failure is caused by non-ASCII chars in the format string of sscanf [1][2], which is non-portable on our Windows platform.
In fact, these non-ASCII coding also triggers C4819 warning, which had been disabled in JDK-8216154 [3].
And I also found an article showing that sscanf may fail with non-ASCII in the format string [4].
So it would be nice to remove these non-ASCII chars (
\x80 ~ \xef
).And I think it's safe to do so.
This is because:
CompileCommand
.You may argue that the non-ASCII may be used by the parser itself.
But I didn't find that usage at all. (Please let me know if I miss something.)
So I suggest to remove these non-ASCII code to make HotSpot to be more portable.
And if we do so, we can also remove the only one
PRAGMA_DISABLE_MSVC_WARNING(4819)
[5].Testing:
Thanks.
Best regards,
Jie
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L269
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L319
[3] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-January/032014.html
[4] https://jeffpar.github.io/kbarchive/kb/047/Q47369/
[5] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/methodMatcher.cpp#L246
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5704/head:pull/5704
$ git checkout pull/5704
Update a local copy of the PR:
$ git checkout pull/5704
$ git pull https://git.openjdk.java.net/jdk pull/5704/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 5704
View PR using the GUI difftool:
$ git pr show -t 5704
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5704.diff