8343933: Add a MemorySegment::fill benchmark with varying sizes #22010
Conversation
👋 Welcome back pminborg! A progress list of the required criteria for merging this PR into the target branch will be added to the body of your pull request.
❗ This change is not yet ready to be integrated.
Webrevs
Thanks @minborg for this :) Please remember to add the misprediction count if you can, and to avoid the bulk methods. Having separate call sites which always observe the same type(s) "could" be too lucky (and gentle) for the runtime (and CHA), and would favour having a single address entry (or a few, if we include any optimization for the fill size) in the Branch Target Buffer of the CPU.
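For illustration, a minimal sketch of the pattern being cautioned against here (HEAP_SEGMENT and NATIVE_SEGMENT are made-up fields, not from this patch): each per-type benchmark method gives fill its own call site, which only ever observes one MemorySegment implementation, so the type profile stays monomorphic and the branch predictor has an easy job.

// Hypothetical illustration only - not code from this PR.
// Each @Benchmark method has a dedicated fill call site that sees
// exactly one segment kind, keeping profiles monomorphic.
@Benchmark
public void heapSegmentFill() {
    HEAP_SEGMENT.fill((byte) 0);   // call site sees only heap segments
}

@Benchmark
public void nativeSegmentFill() {
    NATIVE_SEGMENT.fill((byte) 0); // call site sees only native segments
}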
I've added a "mixed" benchmark. I am not sure I understood all of your comments, but given my changes, maybe you could elaborate a bit more?
@Benchmark
public void mixedSegmentFillUnsafe() {
    for (int i = 0; i < INSTANCES; i++) {
        MIXED_SEGMENTS[i].fill((byte) 0);
    }
}
Loop unrolling can defeat a bit the purpose of messing with branch misprediction here, so one solution is to avoid loop unrolling, OR create a

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
static void fillSegment(MemorySegment ms, byte value) {
    ms.fill(value);
}
which makes sure (at the cost of a very predictable call to the fillSegment method) that:
- the inlining depth of fill is controlled and not dependent on the depth of the JMH infra caller
- the call site for fill is always the same (in terms of the address in the compiled blob)

The last point is slightly inaccurate, because that is what CHA will figure out. Let's say instead that since the caller is not inlined, we make sure the method data of the fill invocation observes different receivers.
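Applied to the mixed benchmark above, the loop would then funnel every segment through that single non-inlined helper; a sketch (the benchmark method name is illustrative):

@Benchmark
public void mixedSegmentFillShared() {
    for (int i = 0; i < INSTANCES; i++) {
        // one shared, non-inlined call site that observes all segment types
        fillSegment(MIXED_SEGMENTS[i], (byte) 0);
    }
}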
IIUC, this benchmark is designed to test what happens when a bulk operation is executed on a segment whose type is not known, e.g. through type profiling (the polluted case). Note that in this case (MIXED) the fact that we can't rely on which segment it is (and hence on whether it has a certain scope or not, and what the base object is) would greatly dominate the cost of branch misprediction, which I saw mentioned here. In extreme cases of profile pollution it is very possible for a method to be optimized under the assumption that e.g. the method is always executed on off-heap segments, and then hit an on-heap segment, which will perform horribly. We have test cases such as these in …
@mcimadamore I've added a suggestion (at #22010 (comment)) on how to make sure that the type profile is polluted and shared by all the …
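For illustration, one way such a polluted, shared profile could be set up (a sketch with assumed field names and sizes, not the code from this PR): interleave heap-backed and native segments in one array, so the single fill call site observes several receiver types.

// Sketch only; assumed names and sizes.
// Requires java.lang.foreign.Arena and java.lang.foreign.MemorySegment.
static final int INSTANCES = 64;
static final MemorySegment[] MIXED_SEGMENTS = new MemorySegment[INSTANCES];

static {
    Arena arena = Arena.ofShared();
    for (int i = 0; i < INSTANCES; i++) {
        MIXED_SEGMENTS[i] = (i % 2 == 0)
                ? MemorySegment.ofArray(new byte[1024]) // heap-backed segment
                : arena.allocate(1024);                 // native segment
    }
}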
@minborg sent me some logs from his machine, and I'm analyzing them now. Basically, I'm trying to see why your Java code is a bit faster than the Loop code.

There seem to be 3 hot regions. Main loop (region has 44.77%):

Post-loops, i.e. the "vectorized post-loop" and the "single-iteration post-loop" (region has 24.43%):

Scalar post-loop:

Not sure why we have this below... probably the check that leads to the post-loop?

Pre-loop (region has 21.80%), with a strange extra add that has some strange-looking percentage (profile inaccuracy?):

Summary: the numbers don't quite add up - but they are still somewhat telling - and I think probably accurate enough to see what happens. Basically: we waste a lot of time in the pre- and post-loops: getting alignment and then finishing off at the end.

And to compare: we have 2 hot regions. Main (58%):

Rest:

... and then the rest of the code, which I speculate is your long-int-short-byte wind-down code.

Conclusion: hmm. This really makes me want to ditch the alignment code - it may hurt more than we gain from it 🤔
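For readers less familiar with the terminology: C2 splits a loop like this into a scalar pre-loop (stores until alignment is reached), a vectorized main loop, and a scalar post-loop for the tail. A rough Java-level sketch of that shape (illustrative only - the real thing is generated machine code, and alignment is computed from addresses rather than indices):

// Illustrative sketch of the pre/main/post loop split.
static void fill(byte[] a, byte v) {
    int i = 0;
    // pre-loop: scalar stores until the position is "aligned" (simplified)
    for (; i < a.length && i % 64 != 0; i++) {
        a[i] = v;
    }
    // main loop: conceptually one 64-byte SIMD store per iteration
    for (; i + 64 <= a.length; i += 64) {
        for (int j = 0; j < 64; j++) {
            a[i + j] = v;
        }
    }
    // post-loop: scalar stores for the remaining tail
    for (; i < a.length; i++) {
        a[i] = v;
    }
}

On small or misaligned segments, the pre- and post-loops can easily dominate, which matches the numbers above.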
@eme64 I'm not an expert with ARM, but profile skid due to modern big pipelined OOO CPUs is rather frequent: you should check some instructions below the reported one to get the real culprit. More info on this topic:

If you use Intel/AMD and PEBS/IBS (if supported by your CPU), you can run perfasm using precise events via …
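For example (a sketch: benchmarks.jar is a placeholder name, and the exact perfasm option syntax should be checked against your JMH version):

$ java -jar benchmarks.jar -prof "perfasm:events=cycles"

On a PEBS-capable CPU, asking perf for a precise variant of the event (e.g. cycles with a precise-sampling modifier) reduces the skid.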
@franz1981 right. That is what I thought. I'm usually working on x64, and am not used to all the skidding on ARM.

But I think the full count is done per block in profiling - and that should be reasonably accurate - though it is hard to attribute exactly where in the block it came from.
@minborg This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@minborg This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the …
This PR proposes to add a new `MemorySegment::fill` benchmark where the byte size of the segments varies.
Progress
Issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22010/head:pull/22010
$ git checkout pull/22010
Update a local copy of the PR:
$ git checkout pull/22010
$ git pull https://git.openjdk.org/jdk.git pull/22010/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 22010
View PR using the GUI difftool:
$ git pr show -t 22010
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22010.diff
Using Webrev
Link to Webrev Comment