8310026: [8u] make java_lang_String::hash_code consistent across platforms #336
|
👋 Welcome back zzambers! A progress list of the required criteria for merging this PR into |
Lgtm.
|
@zzambers This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 3 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the ➡️ To integrate this PR with the above commit message to the |
Changes look good
| } | ||
| return h; | ||
| } | ||
|
|
I don't understand this. According to the comment, both of these functions are meant to mimic String.hashCode. But only the jchar variant does, doesn't it?
I assume the jchar variant gets fed UCS2 and the jbyte variant UTF8. Those encodings can differ for the same Java string if it contains non-ASCII chars.
For example, let the string be the single Unicode character "ぁ", aka U+3041, which is encoded as 0x3041 (len 1) in UCS2 and as 0xE38181 (len 3) in UTF8.
The hash for the first would use the jchar* variant with len=1 and return 0x3041. The hash for the UTF8 variant would get, I assume, a byte array of 0xE3 0x81 0x81 and a len of 3, and return 0x36443 ((((0xE3 * 0x1F) + 0x81) * 0x1F) + 0x81).
I must be missing something basic here.
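The arithmetic in the comment above can be checked with a small Java sketch. It is not HotSpot code: the class and method names are hypothetical, and it assumes the byte variant treats each byte as an unsigned 8-bit value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from the JDK sources.
class EncodingHashDemo {
    // 31-based polynomial hash over UTF-16 code units, as String.hashCode computes it.
    static int hashChars(char[] s) {
        int h = 0;
        for (char c : s) {
            h = 31 * h + c;
        }
        return h;
    }

    // The same polynomial over bytes, each treated as an unsigned 8-bit value (assumption).
    static int hashUnsignedBytes(byte[] s) {
        int h = 0;
        for (byte b : s) {
            h = 31 * h + (b & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "\u3041"; // the single character from the example
        // Prints 0x3041
        System.out.printf("0x%X%n", hashChars(s.toCharArray()));
        // Prints 0x36443
        System.out.printf("0x%X%n", hashUnsignedBytes(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The two results differ, which confirms that the byte variant cannot match String.hashCode for non-ASCII input when it is fed UTF8.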
I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the jchar variant matches String.hashCode but I believe only that variant /needs/ to match String.hashCode. The jchar variant is used by all code operating on Java Strings proper. The jbyte variant is only used by the Symbol table and the agent.
The problem this is fixing is the disparity between SymbolTable::hash_symbol and the agent HashTable. That was supposed to have been fixed by JDK-8028623. However, that fix is a hostage to fortune because SymbolTable::hash_symbol accepts, and passes on to java_lang_String::hash_code, a value of C type char* (char may be signed or unsigned depending on the platform) while the agent HashTable code operates on a Java byte[] (which is always signed). This means that the template code may or may not sign-extend the values melded into the hash, causing the SymbolTable and the agent HashTable to compute different results.
This current fix decouples the definitions of hash_code(const jchar* s, int len) and hash_code(const jbyte* s, int len) in order to allow the latter to match the redefined behaviour of the agent HashTable, i.e. it hashes individual unsigned 8-bit values from the input rather than unsigned 16-bit values.
As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field String.value, whether it is considered as 8-bit or 16-bit values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?
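To make the sign-extension point concrete, here is a minimal Java sketch that models the two platform behaviours. The names are hypothetical and the code only imitates the C++ template; Java's byte is always signed, so the "unsigned char" case is simulated with a mask.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical model of the platform difference; not the HotSpot implementation.
class SignednessHashDemo {
    // Models a platform where char is signed: each byte is sign-extended.
    static int hashSignExtended(byte[] s) {
        int h = 0;
        for (byte b : s) {
            h = 31 * h + b;
        }
        return h;
    }

    // Models a platform where char is unsigned: each byte is zero-extended.
    static int hashZeroExtended(byte[] s) {
        int h = 0;
        for (byte b : s) {
            h = 31 * h + (b & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] ascii = "main".getBytes(StandardCharsets.US_ASCII);
        byte[] nonAscii = {(byte) 0xE3, (byte) 0x81, (byte) 0x81};
        // Pure ASCII symbols hash identically under both models.
        System.out.println(hashSignExtended(ascii) == hashZeroExtended(ascii));
        // Bytes with the high bit set do not.
        System.out.println(hashSignExtended(nonAscii) + " vs " + hashZeroExtended(nonAscii));
    }
}
```

Only symbols containing bytes above 0x7F are affected, which is consistent with the mismatch showing up in a single test rather than everywhere.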
The few uses of java_lang_String::hash_code in JDK8 are as follows:
The jbyte* variant is only used to hash symbols (I don't think this affects code in Java):
| java_lang_String::hash_code(s, len); |
Other uses dealing with Strings use the jchar* variant:
| return java_lang_String::hash_code(value->char_at_addr(offset), length); |
| java_lang_String::hash_code(s, len); |
| hash = java_lang_String::hash(java_string); |
In newer JDKs, the jbyte* variant is also used for compact strings (as zero-extended Latin-1 should be equal to UCS2/UTF-16), e.g.:
https://github.com/openjdk/jdk/blob/06a1a15d014f5ca48f62f5f0c8e8682086c4ae0b/src/hotspot/share/classfile/javaClasses.cpp#L522
(but JDK8 does not have compact strings...)
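The remark about compact strings can be illustrated with a short Java sketch. The helper name is hypothetical, and the code only assumes that Latin-1 bytes are zero-extended before hashing, as described above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the JDK's compact-string code.
class Latin1HashDemo {
    // Hashes Latin-1 bytes after zero-extending each one to an int.
    static int hashLatin1(byte[] latin1) {
        int h = 0;
        for (byte b : latin1) {
            h = 31 * h + (b & 0xFF);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "caf\u00E9"; // every char fits in Latin-1
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        // Zero-extended Latin-1 bytes equal the UTF-16 code units, so the hashes agree.
        System.out.println(hashLatin1(latin1) == s.hashCode());
    }
}
```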
Thank you, Andrew, for disentangling this. I get it now. The byte variant only has to match its java counterpart in the agent.
| // Emulate the unsigned int in java_lang_String::hash_code | ||
| while (len-- > 0) { | ||
| h = 31*h + (0xFFFFFFFFL & buf[s]); | ||
| h = 31*h + (0xFFL & buf[s]); |
Byte.toUnsignedInt() would be clearer.
yes, but that would also be different to upstream jdk11u
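For reference, a small Java sketch comparing the two masks from the diff with Byte.toUnsignedInt. The method names oldTerm and newTerm are hypothetical; only the mask expressions come from the diff.

```java
// Hypothetical comparison of the per-byte terms; not the agent's HashTable code.
class UnsignedByteDemo {
    // Term from the old line: the byte is sign-extended, then masked to 32 bits.
    static long oldTerm(byte b) {
        return 0xFFFFFFFFL & b;
    }

    // Term from the new line: only the low 8 bits are kept.
    static long newTerm(byte b) {
        return 0xFFL & b;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE3;
        System.out.printf("old=0x%X new=0x%X%n", oldTerm(b), newTerm(b));
        // The new mask and Byte.toUnsignedInt agree for every possible byte value.
        boolean same = true;
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            same &= newTerm((byte) i) == Byte.toUnsignedInt((byte) i);
        }
        System.out.println(same);
    }
}
```

Both spellings give the same value, so the choice between them is about readability versus staying identical to upstream.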
All good, after @adinn explained it to me.
|
@jerboaa @adinn @tstuefe @phohensee thank you |
|
Going to push as commit 15077ad.
Your commit was automatically rebased without conflicts. |
java_lang_String::hash_code produces inconsistent results on different platforms when s is char*. This is because char is signed on some platforms and unsigned on others, so a char is either sign-extended or zero-extended when cast to unsigned int. This causes one tier1 test failure on aarch64.
Details:
This was discovered by examining one failing tier1 test present on aarch64 builds:
test/serviceability/sa/jmap-hashcode/Test8028623.java
The test was introduced by JDK-8028623. However, the fix made there does not work on aarch64. The code was later fixed in newer JDKs in the hotspot part of JDK-8141132 (JEP 254: Compact Strings).
Fix:
Fixed by backporting a very small portion of JDK-8141132.
Testing:
tier1 (x86, x86_64, aarch64): OK (tested by GH and in rhel-8 aarch64 VM)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk8u-dev.git pull/336/head:pull/336
$ git checkout pull/336
Update a local copy of the PR:
$ git checkout pull/336
$ git pull https://git.openjdk.org/jdk8u-dev.git pull/336/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 336
View PR using the GUI difftool:
$ git pr show -t 336
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk8u-dev/pull/336.diff