Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ protected static long hashSymbol(byte[] buf) {
int len = buf.length;
// Emulate the unsigned int in java_lang_String::hash_code
while (len-- > 0) {
h = 31*h + (0xFFFFFFFFL & buf[s]);
h = 31*h + (0xFFL & buf[s]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Byte.toUnsignedInt() would be clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but that would also be different to upstream jdk11u

s++;
}
return h & 0xFFFFFFFFL;
Expand Down
12 changes: 11 additions & 1 deletion hotspot/src/share/vm/classfile/javaClasses.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -177,14 +177,24 @@ class java_lang_String : AllStatic {
// hash P(31) from Kernighan & Ritchie
//
// For this reason, THIS ALGORITHM MUST MATCH String.hashCode().
template <typename T> static unsigned int hash_code(T* s, int len) {
static unsigned int hash_code(const jchar* s, int len) {
unsigned int h = 0;
while (len-- > 0) {
h = 31*h + (unsigned int) *s;
s++;
}
return h;
}

static unsigned int hash_code(const jbyte* s, int len) {
unsigned int h = 0;
while (len-- > 0) {
h = 31*h + (((unsigned int) *s) & 0xFF);
s++;
}
return h;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. According to the comment, both of these functions are to mimic String.hashCode. But only the jchar variant does, or?

Assuming the jchar variant gets fed UCS2 and the jbyte variant UTF8. Those encodings could be different for the same java string if we have surrogate chars.

For example, let string be a single unicode "ぁ" character, aka U+3041, which would be encoded as 0x3041 (len 1) with UCS2, 0xE38181 as UTF8.

Hash for the first would use the jchar* variant, len=1, and return 0x3041. Hash for the UTF8 variant would get, I assume, a byte array of 0xE3 0x81 0x81 and a len of 3, and return 0x36443 ((((0xE3 * 0x1F) + 0x81) * 0x1F) + 0x81).

I must be missing something basic here.

Copy link
Contributor

@adinn adinn Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the jchar variant matches String.hashCode but I believe only that variant /needs/ to match String.hashCode. The jchar variant is used by all code operating on Java Strings proper. The jbyte variant is only used by the Symbol table and the agent.

The problem this is fixing is to do with the disparity between SymbolTable::hash_symbol and the agent HashTable. That was supposed to have been fixed by JDK-8028623. However, the fix is a hostage to fortune because SymbolTable::hash_symbol accepts and passes on to java_lang_String::hash_code a value of C type char* (which may be signed or unsigned depending on the OS) while the agent HashTable code operates on a Java byte[] (which is always signed). This means that the template code may or may not sign extend the values melded into the hash causing the SymbolTable and agent HashTable` to compute different results.

This current fix decouples the definitions of hash_code(const jchar* s, int len) and hash_code(const jbyte* s, int len) in order to allow the latter to match the redefined behaviour of the agent HashTable i.e. it sums individual unsigned 8 byte values in the input rather than unsigned 16 byte values.

As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field String.value, whether it is considered as 8 byte or 16 byte values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few uses of java_lang_String::hash_code in JDK8 are following:

jbyte * varaiant is only used to hash symbols (I don't think this affects code in java):

java_lang_String::hash_code(s, len);

other uses dealing with Strings use jchar * variant:

return java_lang_String::hash_code(value->char_at_addr(offset), length);

java_lang_String::hash_code(s, len);

hash = java_lang_String::hash(java_string);

In newer JDKs, jbyte * variant is also used for compact strings (as zero extended latin1 should be equal to UCS2/UTF-16), e.g.:
https://github.com/openjdk/jdk/blob/06a1a15d014f5ca48f62f5f0c8e8682086c4ae0b/src/hotspot/share/classfile/javaClasses.cpp#L522
(but JDK8 does not have compact strings...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really sure what you are suggesting is a problem here, Thomas. I /think/ the only problem here is that the comment is wrong. You are right that only the jchar variant matches String.hashCode but I believe only that variant /needs/ to match String.hashCode. The jchar variant is used by all code operating on Java Strings proper. The jbyte variant is only used by the Symbol table and the agent.

The problem this is fixing is to do with the disparity between SymbolTable::hash_symbol and the agent HashTable. That was supposed to have been fixed by JDK-8028623. However, the fix is a hostage to fortune because SymbolTable::hash_symbol accepts and passes on to java_lang_String::hash_code a value of C type char* (which may be signed or unsigned depending on the OS) while the agent HashTable code operates on a Java byte[] (which is always signed). This means that the template code may or may not sign extend the values melded into the hash causing the SymbolTable and agent HashTable` to compute different results.

This current fix decouples the definitions of hash_code(const jchar* s, int len) and hash_code(const jbyte* s, int len) in order to allow the latter to match the redefined behaviour of the agent HashTable i.e. it sums individual unsigned 8 byte values in the input rather than unsigned 16 byte values.

As far as I can tell it doesn't actually matter what interpretation is placed on the data sitting in field String.value, whether it is considered as 8 byte or 16 byte values. What matters here is that they are hashed consistently by whatever code processes the contents. Am I missing something?

Thank you, Andrew, for disentangling this. I get it now. The byte variant only has to match its java counterpart in the agent.

static unsigned int hash_code(oop java_string);

// This is the string hash code used by the StringTable, which may be
Expand Down
2 changes: 1 addition & 1 deletion hotspot/src/share/vm/classfile/symbolTable.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -225,7 +225,7 @@ Symbol* SymbolTable::lookup(int index, const char* name,
unsigned int SymbolTable::hash_symbol(const char* s, int len) {
return use_alternate_hashcode() ?
AltHashing::halfsiphash_32(seed(), (const uint8_t*)s, len) :
java_lang_String::hash_code(s, len);
java_lang_String::hash_code((const jbyte*)s, len);
}


Expand Down