8180450: secondary_super_cache does not scale well #18309
So it turns out that even without a POPCNT instruction, this algorithm is still faster than the current linear search in all reasonable cases. I've pushed a change that uses a hand-coded population count.
Seems fine now.
Performance testing results look fine.
I wonder, could you do me a little favour? Please run the performance tests with
/integrate
Going to push as commit f11a496.
Your commit was automatically rebased without conflicts.
@theRealAph Pushed as commit f11a496. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
Sure, I'll let you know once the testing is over.
/backport jdk22u
@theRealAph Could not automatically backport
Please fetch the appropriate branch/commit and manually resolve these conflicts by using the following commands in your personal fork of openjdk/jdk22u. Note: these commands are just some suggestions and you can use other equivalent commands you know.
Once you have resolved the conflicts as explained above continue with creating a pull request towards the openjdk/jdk22u with the title Below you can find a suggestion for the pull request body:
I've filed https://bugs.openjdk.org/browse/JDK-8331117 for PPC64. @bulasevich, @fyang, @amitkumar: You may want to check if it makes sense for your platforms.
@TheRealMDoerr I guess you pinged the wrong Amit 🙂 JBS Issue for s390x: https://bugs.openjdk.org/browse/JDK-8331126
I think it makes sense everywhere. It's even a win on machines without POPCOUNT, which surprised me. Once you have hashed lookups the secondary supers cache doesn't help at all. I want to delete the secondary supers cache soon, because it's an additional unnecessary step. @iwanowww did some measurements (DaCapo, Renaissance, SPECjbb2005, SPECjvm2008 on linux-x64/macos-aarch64), and he saw no significant regressions without secondary supers cache.
Thanks for the information. So, we should probably wait for that. Is there a JBS issue already?
No, don't wait! Every port will benefit from this change, now.
/backport jdk21u
@theRealAph Could not automatically backport
Please fetch the appropriate branch/commit and manually resolve these conflicts by using the following commands in your personal fork of openjdk/jdk21u. Note: these commands are just some suggestions and you can use other equivalent commands you know.
Once you have resolved the conflicts as explained above continue with creating a pull request towards the openjdk/jdk21u with the title Below you can find a suggestion for the pull request body:
This PR is a redesign of subtype checking.
The implementation of subtype checking in the HotSpot JVM is now twenty years old. There have been some performance-related bugs reported, and the only way to fix them is a redesign of the way it works.
So what's changed, so that the old design should be replaced?
Firstly, the computers of today aren't the computers of twenty years ago. It's not merely a matter of speed: the systems are much more parallel, both in having more cores and in each core being able to run many instructions in parallel. Because of this, the gap between the cost of a memory access and the rate at which we can execute instructions has become wider and wider.
The most severe reported problem is to do with the "secondary supers cache". This is a 1-element per-class cache for interfaces (and arrays of interfaces). Unfortunately, if two threads repeatedly update this cache, the result is that a cache line ping-pongs between cores, causing a severe slowdown.
Also, the linear search for an interface that is absent means that the entire list of interfaces has to be scanned. This plays badly with newer language features such as JEP 406, pattern matching for switch.
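To make the two problems above concrete, here is a minimal Java sketch of the old scheme as described: a one-element cache in front of a linear scan. It is a model for illustration, not HotSpot code. The class `LegacySecondarySupers`, its nested `Klass`, the use of strings as type names, and the `probes` counter are all hypothetical.

```java
import java.util.List;

public class LegacySecondarySupers {
    // Hypothetical stand-in for a Klass: names are strings purely for illustration.
    static final class Klass {
        final List<String> secondarySupers; // unsorted, scanned linearly
        String secondarySuperCache;         // one-element cache, shared by every thread using this Klass
        int probes;                         // instrumentation only: elements compared during scans

        Klass(List<String> secondarySupers) {
            this.secondarySupers = secondarySupers;
        }

        boolean isSubtypeOf(String target) {
            if (target.equals(secondarySuperCache)) {
                return true; // cache hit: no scan, no store
            }
            for (String s : secondarySupers) {
                probes++;
                if (s.equals(target)) {
                    // A store to a field shared between threads. Two threads alternating
                    // between different targets each write here on every lookup, which is
                    // the kind of traffic that bounces a cache line between cores.
                    secondarySuperCache = target;
                    return true;
                }
            }
            return false; // negative lookup: the whole list was scanned
        }
    }

    public static void main(String[] args) {
        Klass k = new Klass(List.of("A", "B", "C", "D"));
        System.out.println(k.isSubtypeOf("Z") + " after " + k.probes + " probes");
    }
}
```

In this model a lookup of an absent type always costs as many comparisons as there are secondary supers, and every successful scan ends in a write to shared state.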
However, the computers of today can help us. The very high instruction-per-cycle rate of a Great Big Out-Of-Order (GBOOO) processor allows us to execute many of the instructions of a hash table lookup in parallel, as long as we avoid dependencies between instructions.
The solution
We use a hashed lookup of secondary supers. This is a 64-way hash table, with linear probing for collisions. The table is compressed, in that null entries are removed, and the resulting hash table fits into the same secondary supers array as today's unsorted array of secondary supers. This means that existing code in HotSpot that simply does a linear scan of the secondary supers array does not need to be altered.
We add a bitmap field to each Klass object. This bitmap contains an occupancy bit corresponding to each element of the hash table, with a 1 indicating element presence. As well as allowing the hash table to be decompressed, this bitmap is used as a simple kind of Bloom Filter. To determine whether a superclass is present, we simply have to check a single bit in the bitmap. If the bit is clear, we know that the superclass is not present. If the bit is set, we have to do a little arithmetic and then consult the hash table.
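As a rough sketch of the data layout described above, the following Java code fills a 64-slot table with linear probing, records an occupancy bit per slot, and then packs out the nulls. It is an illustration under assumptions, not the HotSpot implementation: the class and method names are hypothetical, type names are strings, the caller supplies the slot function, and more than 64 entries are simply rejected.

```java
import java.util.function.ToIntFunction;

public class HashedSupersBuild {
    static final int SLOTS = 64;

    // Result of the build: occupancy bitmap plus the table with null slots removed.
    static final class Packed {
        final long bitmap;
        final String[] table;

        Packed(long bitmap, String[] table) {
            this.bitmap = bitmap;
            this.table = table;
        }
    }

    // slotOf is a hypothetical hash function mapping a type name to a home slot.
    static Packed build(String[] supers, ToIntFunction<String> slotOf) {
        if (supers.length > SLOTS) {
            throw new IllegalArgumentException("this sketch handles at most 64 entries");
        }
        String[] sparse = new String[SLOTS];
        long bitmap = 0;
        for (String s : supers) {
            int i = slotOf.applyAsInt(s) & (SLOTS - 1);
            while (sparse[i] != null) {
                i = (i + 1) & (SLOTS - 1); // linear probing, wrapping around at slot 63
            }
            sparse[i] = s;
            bitmap |= 1L << i;             // occupancy bit for this slot
        }
        // Compress: keep the entries in slot order and drop the empty slots.
        String[] packed = new String[supers.length];
        int n = 0;
        for (String s : sparse) {
            if (s != null) {
                packed[n++] = s;
            }
        }
        return new Packed(bitmap, packed);
    }

    public static void main(String[] args) {
        // "B" and "C" share home slot 10, so "C" is pushed to slot 11.
        Packed p = build(new String[] {"A", "B", "C"}, s -> s.equals("A") ? 3 : 10);
        System.out.println(Long.toBinaryString(p.bitmap) + " " + String.join(",", p.table));
    }
}
```

Because the packed array still holds every secondary super exactly once, code that only does a linear scan over it keeps working, which is the compatibility property the description relies on.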
It works like this:
The popcount instruction returns the cardinality of a bitset. By shifting out the bits higher in number than the hash code of the element we're looking for, we leave only the lower bits. popcount, then, gives us the index of the element we're looking for.
If we don't get a match at the first attempt, we test the next bit in the bitset and jump to a fallback stub:
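The lookup described above can be modelled in Java with `Long.bitCount` standing in for the popcount instruction. This is a sketch under assumptions rather than the code C2 generates: the names are hypothetical, type names are strings, the caller passes the target's home slot, and `table` is assumed to hold exactly one entry per set bit of `bitmap`, in slot order. The loop stands in for the fallback stub.

```java
public class HashedSupersLookup {
    // bitmap: occupancy bits of the 64-slot table; table: its entries with nulls removed.
    static boolean contains(long bitmap, String[] table, String target, int slot) {
        if ((bitmap & (1L << slot)) == 0) {
            return false; // home slot empty: definitely absent
        }
        // Shift out every bit above 'slot', then count what is left.
        // That count, minus one, is the position of this slot's entry in the packed table.
        int index = Long.bitCount(bitmap << (63 - slot)) - 1;
        if (table[index].equals(target)) {
            return true;
        }
        // Slow path (the "fallback stub" in this model): walk the probe sequence.
        for (int n = 1; n < table.length; n++) {
            slot = (slot + 1) & 63;
            if ((bitmap & (1L << slot)) == 0) {
                return false; // the probe sequence ends at the first empty slot
            }
            index = (index + 1) % table.length; // the next occupied slot is the next packed entry
            if (table[index].equals(target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Slots 3, 10 and 11 are occupied; "C" has home slot 10 but was displaced to 11.
        long bitmap = (1L << 3) | (1L << 10) | (1L << 11);
        String[] table = {"A", "B", "C"};
        System.out.println(contains(bitmap, table, "C", 10));
        System.out.println(contains(bitmap, table, "Z", 5));
    }
}
```

Note that the common negative case returns after a single bit test, without touching the table at all.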
Collisions are rare. Vladimir Ivanov did a survey of Java benchmark code, and it is very rare to see more than about 20 super-interfaces, and even that gives us a collision rate of only about 0.25.
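One way to get a feel for a figure of that size, assuming hash slots are assigned uniformly and independently (an assumption; the description does not say how the 0.25 was derived), is to compute the chance that a given element shares its home slot with at least one of the others. The class and method names below are hypothetical.

```java
public class CollisionEstimate {
    // Probability that one element's home slot is also the home slot of at least
    // one of the other (supers - 1) elements, under uniform independent hashing.
    static double collisionProbability(int supers, int slots) {
        return 1.0 - Math.pow(1.0 - 1.0 / slots, supers - 1);
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(20, 64));
    }
}
```

For 20 super-interfaces and 64 slots this model gives roughly 0.26, in the same range as the quoted rate.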
The time taken for a positive lookup is somewhere between 3 - 6 cycles, or about 0.9 - 1.8 ns. This is a robust figure, confirmed across current AArch64 and x86 designs, and this rate can be sustained indefinitely. Negative lookups are slightly faster, at about 3 - 5 cycles, because there's usually no need to consult the secondary super cache.
The current secondary super cache lookup is usually slightly faster than a hash table for positive lookups, at about 3 - 4 cycles, but only when the class you're looking for is in the secondary super cache. When it is not, and for negative lookups, it performs badly. For example, a negative lookup in a class with 4 interfaces takes 10 - 47 cycles.
Limitations and disadvantages
This PR is a subset of what is possible. It is only implemented for C2, in the cases where the superclass is a constant, known at compile time. Given that this is almost all uses of subtype checking, it's not really a restriction.
There is less sharing of interface arrays between classes than there was before. In some cases, I have to make copies of the interfaces array and sort the copies, rather than just using the same array at runtime. This is fixable, and will be fixed in a soon-to-come patch.
I haven't removed the secondary super cache. It's still used by the interpreter and C1.
In the future I'd like to delete the secondary super cache, but it is used in many places across the VM. Today is not that day.
Performance testing
Hashing the secondary supers arrays takes a little time. I've added a perf counter for this, so you can see. It's only really a few milliseconds added to startup.
I have run Renaissance and SPECjvm, and whatever speed differences there may be are immeasurably small, down in the noise. The SecondarySupersLookup benchmark in this PR allows you to isolate single instanceof invocations with various sets of secondary superclasses.
Finally
Vladimir Ivanov was very generous with his time and his advice. He explained the problem and went into detail about his experiments, and shared with me his experimental code. This saved me a great deal of time in this work.
Progress
Issue
Reviewers
Contributors
<vlivanov@openjdk.org>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18309/head:pull/18309
$ git checkout pull/18309
Update a local copy of the PR:
$ git checkout pull/18309
$ git pull https://git.openjdk.org/jdk.git pull/18309/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18309
View PR using the GUI difftool:
$ git pr show -t 18309
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18309.diff
Webrev
Link to Webrev Comment