Skip to content

Conversation

@zzambers
Copy link
Contributor

@zzambers zzambers commented Jun 14, 2023

java_lang_String::hash_code produces inconsistent results on different platforms, when s is char*. This is because on some platforms char is signed, while on others unsigned (resulting in char to be either zero-extended or sign-extended, when cast to unsigned int). This causes 1 tier1 test failure on aarch64.

Details:
This was discovered by examining one failing test (from tier1) present on aarch64 builds:
test/serviceability/sa/jmap-hashcode/Test8028623.java
Test was introduced by JDK-8028623. However fix done there does not work on aarch64. Code was later fixed (newer jdks) in hotspot part of JDK-8141132 (JEP 254: Compact Strings).

Fix:
Fixed by backporting very small portion of JDK-8141132.

Testing:
tier1 (x86, x86_64, aarch64): OK (tested by GH and in rhel-8 aarch64 VM)


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8310026: [8u] make java_lang_String::hash_code consistent across platforms (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk8u-dev.git pull/336/head:pull/336
$ git checkout pull/336

Update a local copy of the PR:
$ git checkout pull/336
$ git pull https://git.openjdk.org/jdk8u-dev.git pull/336/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 336

View PR using the GUI difftool:
$ git pr show -t 336

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk8u-dev/pull/336.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 14, 2023

👋 Welcome back zzambers! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 14, 2023
@mlbridge
Copy link

mlbridge bot commented Jun 14, 2023

Webrevs

Copy link
Member

@phohensee phohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm.

@openjdk
Copy link

openjdk bot commented Jun 14, 2023

@zzambers This change now passes all automated pre-integration checks.

After integration, the commit message for the final commit will be:

8310026: [8u] make java_lang_String::hash_code consistent across platforms

Reviewed-by: phh, adinn, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 3 new commits pushed to the master branch:

  • 108003e: 8309143: [8u] fix archiving inconsistencies in GHA
  • a2a6873: 8205399: Set node color on pinned HashMap.TreeNode deletion
  • 0944384: 8200468: Port the native GSS-API bridge to Windows

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 14, 2023
@jerboaa
Copy link
Contributor

jerboaa commented Jul 7, 2023

@tstuefe @adinn Could you please help with a second review? Thanks!

Copy link
Contributor

@adinn adinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good

}
return h;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. According to the comment, both of these functions are to mimic String.hashCode. But only the jchar variant does, or?

Assuming the jchar variant gets fed UCS2 and the jbyte variant UTF8. Those encodings could be different for the same java string if we have surrogate chars.

For example, let string be a single unicode "ぁ" character, aka U+3041, which would be encoded as 0x3041 (len 1) with UCS2, 0xE38181 as UTF8.

Hash for the first would use the jchar* variant, len=1, and return 0x3041. Hash for the UTF8 variant would get, I assume, a byte array of 0xE3 0x81 0x81 and a len of 3, and return 0x36443 ((((0xE3 * 0x1F) + 0x81) * 0x1F) + 0x81).

I must be missing something basic here.

Copy link
Contributor

@adinn adinn Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the jchar variant matches String.hashCode but I believe only that variant /needs/ to match String.hashCode. The jchar variant is used by all code operating on Java Strings proper. The jbyte variant is only used by the Symbol table and the agent.

The problem this is fixing is to do with the disparity between SymbolTable::hash_symbol and the agent HashTable. That was supposed to have been fixed by JDK-8028623. However, the fix is a hostage to fortune because SymbolTable::hash_symbol accepts and passes on to java_lang_String::hash_code a value of C type char* (which may be signed or unsigned depending on the OS) while the agent HashTable code operates on a Java byte[] (which is always signed). This means that the template code may or may not sign extend the values melded into the hash causing the SymbolTable and agent HashTable` to compute different results.

This current fix decouples the definitions of hash_code(const jchar* s, int len) and hash_code(const jbyte* s, int len) in order to allow the latter to match the redefined behaviour of the agent HashTable i.e. it sums individual unsigned 8 byte values in the input rather than unsigned 16 byte values.

As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field String.value, whether it is considered as 8 byte or 16 byte values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few uses of java_lang_String::hash_code in JDK8 are following:

jbyte * varaiant is only used to hash symbols (I don't think this affects code in java):

java_lang_String::hash_code(s, len);

other uses dealing with Strings use jchar * variant:

return java_lang_String::hash_code(value->char_at_addr(offset), length);

java_lang_String::hash_code(s, len);

hash = java_lang_String::hash(java_string);

In newer JDKs, jbyte * variant is also used for compact strings (as zero extended latin1 should be equal to UCS2/UTF-16), e.g.:
https://github.com/openjdk/jdk/blob/06a1a15d014f5ca48f62f5f0c8e8682086c4ae0b/src/hotspot/share/classfile/javaClasses.cpp#L522
(but JDK8 does not have compact strings...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the jchar variant matches String.hashCode but I believe only that variant /needs/ to match String.hashCode. The jchar variant is used by all code operating on Java Strings proper. The jbyte variant is only used by the Symbol table and the agent.

The problem this is fixing is to do with the disparity between SymbolTable::hash_symbol and the agent HashTable. That was supposed to have been fixed by JDK-8028623. However, the fix is a hostage to fortune because SymbolTable::hash_symbol accepts and passes on to java_lang_String::hash_code a value of C type char* (which may be signed or unsigned depending on the OS) while the agent HashTable code operates on a Java byte[] (which is always signed). This means that the template code may or may not sign extend the values melded into the hash causing the SymbolTable and agent HashTable` to compute different results.

This current fix decouples the definitions of hash_code(const jchar* s, int len) and hash_code(const jbyte* s, int len) in order to allow the latter to match the redefined behaviour of the agent HashTable i.e. it sums individual unsigned 8 byte values in the input rather than unsigned 16 byte values.

As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field String.value, whether it is considered as 8 byte or 16 byte values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?

Thank you, Andrew, for disentangling this. I get it now. The byte variant only has to match its java counterpart in the agent.

// Emulate the unsigned int in java_lang_String::hash_code
while (len-- > 0) {
h = 31*h + (0xFFFFFFFFL & buf[s]);
h = 31*h + (0xFFL & buf[s]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Byte.toUnsignedInt() would be clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but that would also be different to upstream jdk11u

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, after @adinn explained it to me.

@zzambers
Copy link
Contributor Author

@jerboaa @adinn @tstuefe @phohensee thank you
/integrate

@openjdk
Copy link

openjdk bot commented Jul 10, 2023

Going to push as commit 15077ad.
Since your change was applied there have been 3 commits pushed to the master branch:

  • 108003e: 8309143: [8u] fix archiving inconsistencies in GHA
  • a2a6873: 8205399: Set node color on pinned HashMap.TreeNode deletion
  • 0944384: 8200468: Port the native GSS-API bridge to Windows

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 10, 2023
@openjdk openjdk bot closed this Jul 10, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 10, 2023
@openjdk
Copy link

openjdk bot commented Jul 10, 2023

@zzambers Pushed as commit 15077ad.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

5 participants