8352075: Perf regression accessing fields #24847
Conversation
Signed-off-by: Radim Vansa <rvansa@azul.com>
@rvansa This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 841 new commits pushed to the target branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project, an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@plummercj, @rose00, @coleenp, @iklam, @jdksjolen, @kimbarrett), but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment.
@fparain Hi, could I bring your attention here? Thanks!
Sure, I'll take a look at it.
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/oops/Field.java
src/hotspot/share/oops/fieldInfo.cpp
Array<u1>* FieldInfoStream::create_FieldInfoStream(GrowableArray<FieldInfo>* fields, int java_fields, int injected_fields,
int FieldInfoStream::compare_symbols(const Symbol *s1, const Symbol *s2) {
  // not lexicographical sort, since we need only total ordering
If only a total ordering is required, why define a new method instead of reusing Symbol::fast_compare()?
The problem is CDS; I actually started with fast_compare(), but after dehydration the pointers changed and the comparison no longer worked. This is also the reason why I could not use the hashcode for the ordering.
If you'd prefer a lexicographical sort (just a few extra lines), I could use that instead...
It's not good to introduce a new symbol comparator in a random place. Symbol comparators should be defined and maintained centrally, in symbol.[ch]pp.
If there's a bug with fast_compare, I want us to fix it eventually. So for now, define a companion function in symbol.hpp called fast_compare_cds_safe. We can work with it from there. You can use your proposed logic but we will probably want to fiddle with it, and probably merge it with fast_compare (per se) if CDS can be adjusted.
This problem should happen only with dynamic CDS archives.
With the static CDS archive (and AOT caches as well; -XX:AOTCache files are basically static CDS archives), Symbols are copied into the archive while preserving the order of their relative addresses.
However, when creating a dynamic CDS archive, we first load a static archive. Let's say we load the static archive at address 0x800000000; a Symbol A in the static archive could have an address of 0x800001000. Now we load a class K that has a method named B, but no such Symbol is in the static archive, so this Symbol will be dynamically allocated from the Metaspace. Since the address is decided by the OS, this symbol could have an address below the static archive, e.g., 0x700000000.
Let's say K defines two methods, A() and B(); they are stored in the InstanceKlass::_methods array for K. This array is sorted by the ascending addresses of the methods' names, so K::B() comes before K::A().
When K is copied into the dynamic archive, the symbol B will also be copied into the dynamic archive. The dynamic archive resides in a memory region above the static archive, so B will have an address like 0x800002000. Therefore, we want the copy of K in the dynamic archive to have a _methods array where K::A() comes before K::B().
This is done in DynamicArchiveBuilder::sort_methods().
My recommendation is to sort your table using fast_compare(), and re-sort this table in the same fashion as DynamicArchiveBuilder::sort_methods().
Thanks for the pointers; in fact, I had the problem with an (intermediate) variant that encoded the symbol addresses directly into the stream. The issue was therefore comparing symbol addresses before CDS persisted them and after dehydration. Since we're comparing only symbols within a single class, I hope that the issue you describe above won't happen. I'll keep the problem with dynamic archives in mind, though.
With the current code I can try using fast_compare() again and get back to you if I find any problems.
Even if the fields are only defined within the same class, the names of the fields are shared among different classes.
To validate that you have sorted the table correctly, you should add a debug-only validation function. This function should be called after set_fieldinfo_search_table(), and also in InstanceKlass::restore_unshareable_info() and InstanceKlass::remove_unshareable_info().
Added, thanks for pointing out those places.
I suspect there's a problem with dynamic archives. I will try to write a test case to demonstrate that.
The current patch triggered some failures in our internal tests. I'm investigating them.
For the record, I'm reviewing and testing this PR also.
Thanks for the test investigation! I noticed that some code could be dependent on the order of fields, though to the best of my knowledge the specification does not guarantee that.
For the record, in the ideal case I would like to backport this into JDK 21 as well. Do you think that the change in iteration order would be problematic for that?
I think that also affects iteration order of Java methods like
@SirYwell Oh, that's correct. I haven't noticed that requirement; I guess that this means that the PR needs to be updated. How does the order of iteration cooperate with |
jdksjolen left a comment
Some more comments.
// The table_length (in bytes) should match number of elements provided
// by the supplier (when Supplier::next() returns false the whole array should
// be filled).
void fill(u1* table, size_t table_length, Supplier &supplier) const;
Let the ampersand hug the type.
Also seems like this can be a static method?
The class constructor calculates the sizes, masks and bitshift for the arguments.
class Supplier {
public:
  // Returns elements with already ordered keys.
  // This function should return true when the key and value were set,
  // and false when there are no more elements.
  // Packed table does NOT support duplicate keys.
  virtual bool next(uint32_t* key, uint32_t* value) = 0;
};
Does it make sense to take the cost of an indirect call for each kv pair? You can't inline it, so the stack frame needs to be popped and pushed, and you're taking 2 registers (16 bytes) to give 8 bytes and 1 bit of information.
We can amortize the cost by implementing this signature instead:
virtual uint32_t next(Pair<uint32_t, uint32_t>* kvs, uint32_t kvs_size);
This was done this way with a "Supplier" because this package would be useful for other Unsigned5 packed types.
But then you'd need to create an array of Pair in create_search_table and copy the data into it. You wouldn't need a Supplier at all; just pass the array and its length to the fill method. If you are worried about the virtual calls, I could do that.
Alternatively, we could replace the virtual calls with a templated fill method. In that case the Comparator should be a template parameter as well, since that one is even more perf-critical?
I think you misunderstand me, as what I'm proposing wouldn't require creating an array in create_search_table.
I'm saying that you do this:
// Note: we require the supplier to provide the elements in the final order as we can't easily sort
// within this method - qsort() accepts only a pure function as comparator.
void PackedTableBuilder::fill(u1* table, size_t table_length, Supplier &supplier) const {
  Pair<uint32_t, uint32_t> kvs[4];
  size_t offset = 0;
  size_t len = 0;
  while ((len = supplier.next(kvs, 4)) != 0) {
    // len tells you how many of the full capacity of kvs was used.
  }
}
Now each call of Supplier::next gives you up to 4 elements, quartering the number of virtual calls necessary.
Alternatively, we could replace virtual calls with templated fill method. In that case the Comparator should be a template method as well, since that one is even more perf-critical?
Sure, you can try that out.
To be clear, I am only asking: are virtual calls expensive enough here that not having them, or amortizing their cost, makes the performance measurably better?
The earlier version of this PR did not abstract out PackedTable*, so there were no virtual calls. In both the CCC.java and the undisclosed customer reproducer I don't see a significant difference in performance - these involve whole-JVM startup, so the efficiency of code with linear complexity is probably under the radar. If we want to optimize the heck out of it - yes, there would be room for that, maybe at the cost of maintainability and/or reusability.
TL;DR: I don't have a (micro)benchmark that would prove one approach is better than the other.
Alright, if you can't observe a difference then don't change a thing :-).
for (; offset + sizeof(uint64_t) <= length && supplier.next(&key, &value); offset += _element_bytes) {
  assert((key & ~_key_mask) == 0, "key out of bounds");
  assert((value & ~_value_mask) == 0, "value out of bounds: %x vs. %x (%x)", value, _value_mask, ~_value_mask);
  *reinterpret_cast<uint64_t*>(data + offset) = static_cast<uint64_t>(key) | (static_cast<uint64_t>(value) << _value_shift);
What's wrong about memcpy, or rather the builtin version?
Doesn't regular memcpy compile into the builtin anyway? Aren't there LE/BE concerns when you do this type of computation?
Hi @rvansa, what about this type of API for dealing with the compressed table? We do the 8-byte accesses as unsigned chars (important so that they're zero-extended and not sign-extended) and write the compressed KV down in little-endian. We use bitwise OR to not squash whatever was there before. Comparing GCC x64 and PPC (power), it looks good.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
// For this hard-coded example we use 2 byte keys
// and 3 byte values -- for an element_bytes of 5
uint32_t key_mask = (1 << 16) - 1;
uint32_t value_mask = (1 << 24) - 1;
uint32_t value_shift = 16;
uint32_t element_bytes = 5;
uint64_t* kv_array;
uint32_t unpack_key(uint64_t kv) {
return kv & key_mask;
}
uint32_t unpack_value(uint64_t kv) { return (kv >> value_shift) & value_mask; }
uint64_t pack_kv(uint64_t k, uint64_t v) {
return k | (v << value_shift);
}
uint32_t align_down(uint32_t x, uint32_t align) { return x & -align; }
uint32_t align_up(uint32_t x, uint32_t align) {
return (x + (align - 1)) & -align;
}
uint64_t u64s_required(int n) {
uint32_t bytes_required = element_bytes * n;
return align_up(bytes_required, 8) / 8;
}
uint64_t read_u8(uint8_t* p) {
uint64_t result = 0;
result |= ((uint64_t)*(p + 0)) << (0 * 8);
result |= ((uint64_t)*(p + 1)) << (1 * 8);
result |= ((uint64_t)*(p + 2)) << (2 * 8);
result |= ((uint64_t)*(p + 3)) << (3 * 8);
result |= ((uint64_t)*(p + 4)) << (4 * 8);
result |= ((uint64_t)*(p + 5)) << (5 * 8);
result |= ((uint64_t)*(p + 6)) << (6 * 8);
result |= ((uint64_t)*(p + 7)) << (7 * 8);
return result;
}
void write_u8(uint8_t* p, uint64_t u8) {
p[0] |= u8 & 0xFF;
p[1] |= (u8 >> 8*1) & 0xFF;
p[2] |= (u8 >> 8*2) & 0xFF;
p[3] |= (u8 >> 8*3) & 0xFF;
p[4] |= (u8 >> 8*4) & 0xFF;
p[5] |= (u8 >> 8*5) & 0xFF;
p[6] |= (u8 >> 8*6) & 0xFF;
p[7] |= (u8 >> 8*7) & 0xFF;
}
uint64_t read(int n) {
uint32_t byte_index = element_bytes * n;
return read_u8(&reinterpret_cast<uint8_t*>(kv_array)[byte_index]);
}
void fill(int n, uint64_t kv) {
uint32_t byte_index = element_bytes * n;
write_u8(&reinterpret_cast<uint8_t*>(kv_array)[byte_index], kv);
return;
}
int main() {
int num_elts = 65536;
uint64_t sz = u64s_required(num_elts);
kv_array = (uint64_t*)malloc(sz * 8);
for (int i = 0; i < sz; i++) {
kv_array[i] = 0;
}
for (int i = 0; i < num_elts; i++) {
uint64_t kv = pack_kv(i, 2 * i);
fill(i, kv);
}
for (int i = 0; i < num_elts; i++) {
uint64_t kv = read(i);
uint32_t k = unpack_key(kv);
uint32_t v = unpack_value(kv);
printf("K: %d, V: %d\n", k, v);
}
}
Shouldn't the Copy package or memcpy do all of this? @jdksjolen I don't think this should be in the packedTable code.
@jdksjolen We'd need to solve the case without padding at the end of the table (or add padding to ensure that we're not accessing past the allocated area). However, I did not really get what problem you are trying to solve.
My worry was that
These files: classFileParser.hpp, fieldStreams.hpp, fieldStreams.inline.hpp and unsigned5.hpp need a copyright update. |
I do not have any further comments. This change looks good and I believe this is needed to resolve the performance problem.
Thank you! I'll mark this for integration, and I would appreciate it if I could get your sponsorship.
/integrate
@rvansa the base version of this PR is quite old. Please merge with the latest JDK repo before we integrate it into the mainline. How much testing have you done on your side? On what platforms? After you merge, someone at Oracle will run it on our testing pipeline and will sponsor it after all tests are clean.
I've been testing this against the mainline source base - with just this change patched in - and it seems fine. I don't know if it's worth merging. Also, if you do merge it, do 'git merge', not 'git rebase'. I'm rerunning tier1-7 now.
@rvansa since Coleen is already testing against the mainline, there's no need for you to merge now. We will sponsor once the tests come out clean. |
Tier1 on Oracle platforms passed, and tier1-7 have passed with yesterday's OpenJDK code and this change.
jdksjolen left a comment
Thanks for the good and patient work.
@rvansa thank you for your work on this change, fixing the performance regression in an understandable way and answering all comments and questions. I think it's ready to go.
Going to push as commit e18277b.
Your commit was automatically rebased without conflicts.
Nice work, Radim! Thanks for persevering with this change.
Thanks to everyone involved! |
This optimization is a follow-up to #24290, trying to reduce the performance regression in some scenarios introduced by https://bugs.openjdk.org/browse/JDK-8292818. Based on both performance and memory consumption, it is a (better) alternative to #24713.
This PR optimizes local field lookup in classes with more than 16 fields; rather than sequentially iterating through all fields during lookup, we sort the fields by field name. The stream includes an extra table after the field information: for the fields at positions 16, 32, ... we record the (variable-length-encoded) offset of the field info in this stream. On field lookup, rather than iterating through all fields, we iterate through this table, resolve the names of the referenced fields, and continue the field-by-field iteration only after the last preceding record (hence through at most 16 fields).
In classes with <= 16 fields, this PR reduces memory consumption by the 1 byte that was left with value 0 at the end of the stream. In classes with > 16 fields, we add an extra 4 bytes with the offset of the table, and the table contains one varint for each 16 fields. The terminal byte is not used either.
My measurements on the attached reproducer
(the jdk25-master above already contains JDK-8353175)
While #24713 returned the performance to previous levels, this PR improves it by 25% compared to JDK 17 (which does not contain the regression)! This time, the undisclosed production-grade reproducer shows an even higher improvement:
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24847/head:pull/24847
$ git checkout pull/24847
Update a local copy of the PR:
$ git checkout pull/24847
$ git pull https://git.openjdk.org/jdk.git pull/24847/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24847
View PR using the GUI difftool:
$ git pr show -t 24847
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24847.diff
Using Webrev
Link to Webrev Comment