Skip to content

Conversation

@rvansa
Copy link

@rvansa rvansa commented Apr 24, 2025

This optimization is a followup to #24290 trying to reduce the performance regression in some scenarios introduced in https://bugs.openjdk.org/browse/JDK-8292818 . Based both on performance and memory consumption it is a (better) alternative to #24713 .

This PR optimizes local field lookup in classes with more than 16 fields; rather than sequentially iterating through all fields during lookup we sort the fields based on the field name. The stream includes extra table after the field information: for field at position 16, 32 ... we record the (variable-length-encoded) offset of the field info in this stream. On field lookup, rather than iterating through all fields, we iterate through this table, resolve names for given fields and continue field-by-field iteration only after the last record (hence at most 16 fields).

In classes with <= 16 fields this PR reduces the memory consumption by 1 byte that was left with value 0 at the end of stream. In classes with > 16 fields we add extra 4 bytes with offset of the table, and the table contains one varint for each 16 fields. The terminal byte is not used either.

My measurements on the attached reproducer

hyperfine -w 50 -r 100 '/path/to/jdk-17/bin/java -cp /tmp CCC'
Benchmark 1: /path/to/jdk-17/bin/java -cp /tmp CCC
  Time (mean ± σ):      51.3 ms ±   2.8 ms    [User: 44.7 ms, System: 13.7 ms]
  Range (min … max):    45.1 ms …  53.9 ms    100 runs

hyperfine -w 50 -r 100 '/path/to/jdk25-master/bin/java -cp /tmp CCC'
Benchmark 1: /path/to/jdk25-master/bin/java -cp /tmp CCC
  Time (mean ± σ):      78.2 ms ±   1.0 ms    [User: 74.6 ms, System: 17.3 ms]
  Range (min … max):    73.8 ms …  79.7 ms    100 runs

(the jdk25-master above already contains JDK-8353175)

hyperfine -w 50 -r 100 '/path/to/jdk25-this-pr/bin/java -cp /tmp CCC'
Benchmark 1: /path/to/jdk25-this-pr/jdk/bin/java -cp /tmp CCC
  Time (mean ± σ):      38.5 ms ±   0.5 ms    [User: 34.4 ms, System: 17.3 ms]
  Range (min … max):    37.7 ms …  42.1 ms    100 runs

While #24713 returned the performance to previous levels, this PR improves it by 25% compared to JDK 17 (which does not contain the regression)! This time, the undisclosed production-grade reproducer shows even higher improvement:

JDK 17: 1.6 s
JDK 21 (no patches): 22 s
JDK25-master: 12.3 s
JDK25-this-pr: 0.5 s

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8352075: Perf regression accessing fields (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24847/head:pull/24847
$ git checkout pull/24847

Update a local copy of the PR:
$ git checkout pull/24847
$ git pull https://git.openjdk.org/jdk.git pull/24847/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24847

View PR using the GUI difftool:
$ git pr show -t 24847

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24847.diff

Using Webrev

Link to Webrev Comment

Signed-off-by: Radim Vansa <rvansa@azul.com>
@bridgekeeper
Copy link

bridgekeeper bot commented Apr 24, 2025

👋 Welcome back rvansa! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Apr 24, 2025

@rvansa This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8352075: Perf regression accessing fields

Reviewed-by: coleenp, iklam, jsjolen

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 841 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@plummercj, @rose00, @coleenp, @iklam, @jdksjolen, @kimbarrett) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 24, 2025
@openjdk
Copy link

openjdk bot commented Apr 24, 2025

@rvansa The following labels will be automatically applied to this pull request:

  • hotspot
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added serviceability serviceability-dev@openjdk.org hotspot hotspot-dev@openjdk.org labels Apr 24, 2025
@mlbridge
Copy link

mlbridge bot commented Apr 24, 2025

@rvansa rvansa marked this pull request as draft April 25, 2025 10:46
@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 25, 2025
@rvansa rvansa marked this pull request as ready for review April 28, 2025 07:40
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 28, 2025
@rvansa
Copy link
Author

rvansa commented Apr 30, 2025

@fparain Hi, could I bring your attention here? Thanks!

@fparain
Copy link
Contributor

fparain commented Apr 30, 2025

@fparain Hi, could I bring your attention here? Thanks!

Sure, I'll take a look at it.
Thanks.


Array<u1>* FieldInfoStream::create_FieldInfoStream(GrowableArray<FieldInfo>* fields, int java_fields, int injected_fields,
int FieldInfoStream::compare_symbols(const Symbol *s1, const Symbol *s2) {
// not lexicographical sort, since we need only total ordering
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only a total ordering is required, why defining a new method instead of reusing Symbol::fast_compare() ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is CDS; I have really started with fast_compare(), but after dehydration the pointers changed and the comparison did not work anymore. This is also a reason why I could not use the hashcode for the ordering.
If you'd prefer lexicographical sort (just a few extra lines) I could use that one...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not good to introduce a new symbol comparator in a random place. Symbol comparators should be defined and maintained centrally, in symbol.[ch]pp.

If there's a bug with fast_compare, I want us to fix it eventually. So for now, define a companion function in symbol.hpp called fast_compare_cds_safe. We can work with it from there. You can use your proposed logic but we will probably want to fiddle with it, and probably merge it with fast_compare (per se) if CDS can be adjusted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem should happen only with dynamic CDS archives.

With the static CDS archive (and AOT caches as well; -XX:AOTCache basically are static CDS archives), Symbols are copied into the archive while preserving the order of their relative addresses.

However, when creating a dynamic CDS archive, we first load a static archive. Let's say we load the static archive at address 0x800000000, a Symbol A in the static archive could have an address of 0x800001000. Now, we load a class K who has a method named B but such a Symbol is not in the static archive, this Symbol will be dynamically allocated from the Metaspace. Since the address is decided by the OS, this symbol could have an address below the static archive. E.g., 0x700000000;

Let's say K defines two methods, A() and B(), they are stored in the InstanceKlass::_methods array for K. This array is sorted by ascending addresses of the method's names. So K::B() comes before K::A().

When K is copied into the dynamic archive, the symbol B will also be copied into the dynamic archive. The dynamic archive resides in a memory region above the static archive, so B will have an address like 0x800002000. Therefore, we want the copy of K in the dynamic archive to have a _methods array where K::A() comes before K::B().

This is done in DynamicArchiveBuilder::sort_methods()

My recommendation is to sort your table using fast_compare(), and re-sort this table in the same fashion as DynamicArchiveBuilder::sort_methods().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointers; in fact I had the problem with a (intermediate) variant that encoded the symbol addresses right into the stream. Therefore the issue was comparing value of symbol values before CDS persisting those and after dehydration. Since we're comparing only symbols within single class, I hope that the issue you describe above won't happen. I'll keep the problem with dynamic archives in mind, though.
With the current code I can try using fast_compare again and get back if I find any problems again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the fields are only defined within the same class, the names of the fields are shared among different classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To validate that you have sorted the table correctly, you should add a debug-only validation function. This function should be called after set_fieldinfo_search_table(), and also in InstanceKlass::restore_unshareable_info() and InstanceKlass::remove_unshareable_info().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks for raising those places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect there's a problem with dynamic archives. I will try to write a test case to demonstrate that.

@fparain
Copy link
Contributor

fparain commented Apr 30, 2025

The current patch triggered some failures in our internal tests. I'm investigating them.

@coleenp
Copy link
Contributor

coleenp commented May 2, 2025

For the record, I'm reviewing and testing this PR also.

@rvansa
Copy link
Author

rvansa commented May 5, 2025

Thanks for test investigation! I noticed that some code could be dependent on the order of fields, though to my best knowledge the specification does not guarantee that.

@rvansa
Copy link
Author

rvansa commented May 5, 2025

For the record, in the ideal case I would like to backport this into JDK 21 as well. Do you think that the change in iteration order would be problematic for that?

@SirYwell
Copy link
Member

SirYwell commented May 5, 2025

For the record, in the ideal case I would like to backport this into JDK 21 as well. Do you think that the change in iteration order would be problematic for that?

I think that also affects iteration order of Java methods like Class#getDeclaredFields() etc.? While the order is unspecified there, I'm pretty certain that it would break existing code. But even worse, JVM TI specifies that GetClassFields returns fields in the order they occur in the class file. I assume this is currently broken?

@rvansa
Copy link
Author

rvansa commented May 5, 2025

@SirYwell Oh, that's correct. I haven't noticed that requirement; I guess that this means that the PR needs to be updated.

How does the order of iteration cooperate with @Contended, though? In FieldLayoutBuilder::regular_field_sorting we separate static and contended fields; doesn't that break the requirement?

Copy link
Contributor

@jdksjolen jdksjolen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

// The table_length (in bytes) should match number of elements provided
// by the supplier (when Supplier::next() returns false the whole array should
// be filled).
void fill(u1* table, size_t table_length, Supplier &supplier) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let the ampersand hug the type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also seems like this can be a static method?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class constructor calculates the sizes, masks and bitshift for the arguments.

Comment on lines +49 to +56
class Supplier {
public:
// Returns elements with already ordered keys.
// This function should return true when the key and value was set,
// and false when there's no more elements.
// Packed table does NOT support duplicate keys.
virtual bool next(uint32_t* key, uint32_t* value) = 0;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to take the cost of an indirect call for each kv pair? You can't inline it, so the stack frame needs to be popped and pushed, and you're taking 2 registers (16 bytes) to give 8 bytes and 1 bit of information.

We can amortize the cost by implementing this signature instead:

virtual uint32_t next(Pair<uint32_t, uint32_t>* kvs, uint32_t kvs_size); 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done this way with a "Supplier" because this package would be useful for other Unsigned5 packed types.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then you'd need create an array of Pair in create_search_table and copy the data into that. You wouldn't need a Supplier at all, just pass the array and its length to the fill method. If you are worried about the virtual calls, I could do that.

Alternatively, we could replace virtual calls with templated fill method. In that case the Comparator should be a template method as well, since that one is even more perf-critical?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you misunderstand me, as what I'm proposing wouldn't requre creating an array in create_search_table.

I'm saying that you do this:

// Note: we require the supplier to provide the elements in the final order as we can't easily sort
// within this method - qsort() accepts only pure function as comparator.
void PackedTableBuilder::fill(u1* table, size_t table_length, Supplier &supplier) const {
  Pair<uint32_t, uint32_t> kvs[4];

  size_t offset = 0;
  size_t len_read = 0;
  while (len = supplier.next(kvs, 4)) {
    // len tells you how many of the full capacity of kvs was used.
  }
}

Now each call of Supplier::next gives you up to 4 elements, quartering the amount of virtual calls necessary.

Alternatively, we could replace virtual calls with templated fill method. In that case the Comparator should be a template method as well, since that one is even more perf-critical?

Sure, you can try that out.

To be clear, I am only asking: Are virtual calls expensive enough here that not having them, or amortizing their cost, something that makes the performance better?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The earlier version of this PR was not abstracting out PackedTable* so there were no virtual calls. In both the CCC.java and undisclosed customer reproducer I don't see a significant difference in performance - these involve whole JVM startup, so the efficiency of code with linear complexity is probably under the radar. If we want to optimize the hack out of it - yes, there would be space for that, maybe at the cost of maintainability and/or reusability.
TLDR: I don't have a (micro)benchmark that would prove one thing is better than the other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, if you can't observe a difference then don't change a thing :-).

for (; offset + sizeof(uint64_t) <= length && supplier.next(&key, &value); offset += _element_bytes) {
assert((key & ~_key_mask) == 0, "key out of bounds");
assert((value & ~_value_mask) == 0, "value out of bounds: %x vs. %x (%x)", value, _value_mask, ~_value_mask);
*reinterpret_cast<uint64_t*>(data + offset) = static_cast<uint64_t>(key) | (static_cast<uint64_t>(value) << _value_shift);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's wrong about memcpy, or rather the builtin version?

Doesn't regular memcpy compile into the builtin anyway? Aren't there LE/BE concerns when you do this type of computation?

@jdksjolen
Copy link
Contributor

Hi @rvansa ,

What about this type of API for dealing with the compressed table? We do the 8 byte accesses as unsigned chars (important so that they're 0-extended and not sign extended) and write the compressed KV down in little-endian. We use bitwise OR to not squash whatever was there before. Comparing GCC x64 and PPC (power), it looks good.

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>


// For this hard-coded example we use 2 byte keys
// and 3 byte values -- for an element_bytes of 5
uint32_t key_mask = (1 << 16) - 1;
uint32_t value_mask = (1 << 24) - 1;
uint32_t value_shift = 16;

uint32_t element_bytes = 5;

uint64_t* kv_array;

uint32_t unpack_key(uint64_t kv) {
  return kv & key_mask;
}
uint32_t unpack_value(uint64_t kv) { return (kv >> value_shift) & value_mask; }

uint64_t pack_kv(uint64_t k, uint64_t v) {
  return k | (v << value_shift);
}


uint32_t align_down(uint32_t x, uint32_t align) { return x & -align; }
uint32_t align_up(uint32_t x, uint32_t align) {
  return (x + (align - 1)) & -align;
}

uint64_t u64s_required(int n) {
  uint32_t bytes_required = element_bytes * n;
  return align_up(bytes_required, 8) / 8;
}

uint64_t read_u8(uint8_t* p) {
  uint64_t result = 0;
  result |= ((uint64_t)*(p + 0)) << (0 * 8);
  result |= ((uint64_t)*(p + 1)) << (1 * 8);
  result |= ((uint64_t)*(p + 2)) << (2 * 8);
  result |= ((uint64_t)*(p + 3)) << (3 * 8);
  result |= ((uint64_t)*(p + 4)) << (4 * 8);
  result |= ((uint64_t)*(p + 5)) << (5 * 8);
  result |= ((uint64_t)*(p + 6)) << (6 * 8);
  result |= ((uint64_t)*(p + 7)) << (7 * 8);
  return result;
}

void write_u8(uint8_t* p, uint64_t u8) {
  p[0] |= u8             & 0xFF;
  p[1] |= (u8 >> 8*1) & 0xFF;
  p[2] |= (u8 >> 8*2) & 0xFF;
  p[3] |= (u8 >> 8*3) & 0xFF;
  p[4] |= (u8 >> 8*4) & 0xFF;
  p[5] |= (u8 >> 8*5) & 0xFF;
  p[6] |= (u8 >> 8*6) & 0xFF;
  p[7] |= (u8 >> 8*7) & 0xFF;
}

uint64_t read(int n) {
  uint32_t byte_index = element_bytes * n;
  return read_u8(&reinterpret_cast<uint8_t*>(kv_array)[byte_index]);
}

void fill(int n, uint64_t kv) {
  uint32_t byte_index = element_bytes * n;
  write_u8(&reinterpret_cast<uint8_t*>(kv_array)[byte_index], kv);
  return;
}

int main() {
  int num_elts = 65536;
  uint64_t sz = u64s_required(num_elts);
  kv_array = (uint64_t*)malloc(sz * 8);
  for (int i = 0; i < sz; i++) {
    kv_array[i] = 0;
  }

  for (int i = 0; i < num_elts; i++) {
    uint64_t kv = pack_kv(i, 2 * i);
    fill(i, kv);
  }
  for (int i = 0; i < num_elts; i++) {
    uint64_t kv = read(i);
    uint32_t k = unpack_key(kv);
    uint32_t v = unpack_value(kv);
    printf("K: %d, V: %d\n", k, v);
  }
}

@coleenp
Copy link
Contributor

coleenp commented Jun 10, 2025

Shouldn't the Copy package or memcpy do all of this? @jdksjolen I don't think this should be in the packedTable code.

@rvansa
Copy link
Author

rvansa commented Jun 10, 2025

@jdksjolen We'd need to solve the case without padding at the end of the table (or add the padding to ensure that we're not accessing past allocated area). However, I did not really get what problem are you trying to solve?

@jdksjolen
Copy link
Contributor

Shouldn't the Copy package or memcpy do all of this? @jdksjolen I don't think this should be in the packedTable code.

@jdksjolen We'd need to solve the case without padding at the end of the table (or add the padding to ensure that we're not accessing past allocated area). However, I did not really get what problem are you trying to solve?

My worry was that memcpy would have incorrect semantics when on a BE system. As the code no longer uses memcpy, we're fine.

@coleenp
Copy link
Contributor

coleenp commented Jun 10, 2025

These files: classFileParser.hpp, fieldStreams.hpp, fieldStreams.inline.hpp and unsigned5.hpp need a copyright update.

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have any further comments. This change looks good and I believe this is needed to resolve the performance problem.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 11, 2025
@rvansa
Copy link
Author

rvansa commented Jun 11, 2025

Thank you! I'll mark this for integration and I would appreciate if I can get your sponsorship.

@rvansa
Copy link
Author

rvansa commented Jun 11, 2025

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Jun 11, 2025
@openjdk
Copy link

openjdk bot commented Jun 11, 2025

@rvansa
Your change (at version 36510e2) is now ready to be sponsored by a Committer.

@iklam
Copy link
Member

iklam commented Jun 11, 2025

@rvansa the base version of this PR is quite old. Please merge with the latest JDK repo before we integrate it into the mainline.

How much testing have you done on your side? On what platforms?

After you merge, someone at Oracle will run it on our testing pipeline and will sponsor it after all tests are clean.

@coleenp
Copy link
Contributor

coleenp commented Jun 11, 2025

I've been testing this against mainline source base -with just this change patched in and it seems fine. I don't know if it's worth merging. Also if you do merge it do 'git merge' not 'git rebase' I'm rerunning tier1-7 now.

@iklam
Copy link
Member

iklam commented Jun 11, 2025

I've been testing this against mainline source base -with just this change patched in and it seems fine. I don't know if it's worth merging. Also if you do merge it do 'git merge' not 'git rebase' I'm rerunning tier1-7 now.

@rvansa since Coleen is already testing against the mainline, there's no need for you to merge now. We will sponsor once the tests come out clean.

@coleenp
Copy link
Contributor

coleenp commented Jun 12, 2025

Tier1 on oracle platforms passsed, and tier1-7 have passed with yesterdays OpenJDK code and this change.

Copy link
Contributor

@jdksjolen jdksjolen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the good and patient work.

@coleenp
Copy link
Contributor

coleenp commented Jun 12, 2025

@rvansa thank you for your work on this change, fixing the performance regression in an understandable way and answering all comments and questions. I think it's ready to go.
/sponsor

@openjdk
Copy link

openjdk bot commented Jun 12, 2025

Going to push as commit e18277b.
Since your change was applied there have been 842 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 12, 2025
@openjdk openjdk bot closed this Jun 12, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Jun 12, 2025
@openjdk
Copy link

openjdk bot commented Jun 12, 2025

@coleenp @rvansa Pushed as commit e18277b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@adinn
Copy link
Contributor

adinn commented Jun 12, 2025

Nice work, Radim! Thanks for persevering with this change.

@rvansa
Copy link
Author

rvansa commented Jun 13, 2025

Thanks to everyone involved!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.

10 participants