-
Notifications
You must be signed in to change notification settings - Fork 63
8311594: Avoid GlobalSession liveness check #844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8311594: Avoid GlobalSession liveness check #844
Conversation
👋 Welcome back jvernee! A progress list of the required criteria for merging this PR into |
Webrevs
|
@Override | ||
@ForceInline | ||
public void checkValidStateRaw() { | ||
// do nothing, avoid liveness check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks ok, although (see comments on MemorySessionImpl::checkValidStateRaw) sometimes having too many versions of this method can cause profile pollution issues (which is why we consolidated the impl at some point). I suggest trying other benchmarks, such as LoopOverPollutedXYZ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Something I also realized over the weekend. I had thought bimorphic inlining would save us (and we override other methods such as acquire/release already). But, we have 3 concrete implementation classes of MemorySessionImpl: confined, shared, and global. This blows up inlining it seems, causing a big regression when all three session kinds are used (see latest push I did):
Before:
Benchmark Mode Cnt Score Error Units
LoopOverPollutedSegments.heap_segment_floats_VH avgt 30 0.311 ± 0.001 ms/op
LoopOverPollutedSegments.heap_segment_floats_instance avgt 30 0.311 ± 0.001 ms/op
LoopOverPollutedSegments.heap_segment_ints_VH avgt 30 0.311 ± 0.001 ms/op
LoopOverPollutedSegments.heap_segment_ints_instance avgt 30 0.311 ± 0.001 ms/op
LoopOverPollutedSegments.heap_unsafe avgt 30 0.217 ± 0.001 ms/op
LoopOverPollutedSegments.native_segment_VH avgt 30 0.237 ± 0.005 ms/op
LoopOverPollutedSegments.native_segment_instance avgt 30 0.245 ± 0.005 ms/op
LoopOverPollutedSegments.native_unsafe avgt 30 0.221 ± 0.007 ms/op
After:
Benchmark Mode Cnt Score Error Units
LoopOverPollutedSegments.heap_segment_floats_VH avgt 30 3.310 ± 0.005 ms/op
LoopOverPollutedSegments.heap_segment_floats_instance avgt 30 3.333 ± 0.004 ms/op
LoopOverPollutedSegments.heap_segment_ints_VH avgt 30 3.305 ± 0.007 ms/op
LoopOverPollutedSegments.heap_segment_ints_instance avgt 30 3.439 ± 0.106 ms/op
LoopOverPollutedSegments.heap_unsafe avgt 30 0.217 ± 0.001 ms/op
LoopOverPollutedSegments.native_segment_VH avgt 30 3.595 ± 0.043 ms/op
LoopOverPollutedSegments.native_segment_instance avgt 30 3.696 ± 0.006 ms/op
LoopOverPollutedSegments.native_unsafe avgt 30 0.204 ± 0.001 ms/op
If I remove the use of one of the other 2 session kinds again (e.g. the shared one), performance is back to normal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also tried another solution which puts an instanceof
check in the existing checkValidStateRaw
. This does improve the case with the special ALL
segment, but it also regresses the polluted case by about just as much it seems.
I think this is a dead end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, the only sane way to add support for this is to have a well-known constant - e.g. MemorySegment::allNative (which should be a restricted factory of course). Then special case this constant w.r.t. bound checks and temporal checks (since there's a well-known constant to compare against, I'm confident that pollution will not rear its ugly head). But that brings in a lot of API questions: for instance, if this segment is unbounded, what happens if you slice (of course one possible answer might be to throw). While we might add something like that in the future, I don't think we have enough evidence yet that something like this is really required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that is not too much different from the instanceof check I tried. The issue I see is that, while when the segment/scope are constant, everything folds away, in every other case we now have to do an additional identity check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on a smaller test at the moment. The critical performance operation overall is a binary search within a b-tree node. Each comparison obtains a 2-byte pointer which references a binary key, which itself starts with a byte field to encode the length. The comparison does a byte-by-byte check of keys.
I've tried to optimize this in the past by doing larger comparisons (8 bytes at a time), but it didn't show much improvement. I assumed that HotSpot was already doing some optimizations that were good enough, and so I didn't bother with any more experiments.
Anyhow, with a test program which just stresses this critical bottleneck, the Panama version is two times slower than the Unsafe version. I'll work on creating something smaller and standalone (it won't suck in the whole project).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stuff like this can be very helpful, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Panama vs. Unsafe: https://gist.github.com/broneill/3a39051635e3cb758d0cca5a963c685e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the benchmark. We've been able to reproduce your result. We are investigating further (that will take some time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed this:
https://bugs.openjdk.org/browse/JDK-8311932
I'd still like to upstream the benchmark changes. I'll create a separate PR for that. |
For liveness check, a solution on the top of my head is to push the branch to the slow path. Instead of having one For ownership check, maybe we can tell the compiler to trust this particular field. Stable does not work here, though maybe we can add a dummy For range check, maybe we can do the same as liveness one, instead of failing on out of bounds, we branch on a constant field to see if it is a global one, and only throw if it is not. Thanks a lot. |
What you suggest are sensible optimization tactics. However, it is unclear how much this path needs to be optimized further. In most cases, when we see a significant performance difference between MS and Unsafe, the liveness check is never the culprit, the bound check is. And, when using tactics to e.g. generate wrapper segments on the fly to avoid bound checks [1], we can observe that liveness check is also gone from the corresponding assembly. So, this optimization seems to be in YAGNI territory, at least for now. [1] - https://mail.openjdk.org/pipermail/panama-dev/2023-July/019487.html |
Ping: does this PR needs to be open? |
No |
This patch overrides
checkValidStateRaw
in GlobalSession, to do nothing. The currentcheckValidStateRaw
in MemorySessionImpl checks the owner thread, and liveness, which are both not needed for the global session.Since the state field is mutable, I found that the JIT can not fold away the liveness check. This has a very marginal effect on performance, but since the fix is also very non-intrusive, I think it's okay to do.
I've added 2 new benchmarks that I've used to measure the difference. One measures the performance of MS::copy, and the other measures MS::get. Both use a special
ALL
segment which spans0
toLong.MAX_VALUE
, which means that the target address can be used as an offset when doing the accesses. This helps the bounds checks being eliminated, and makes the result of overridingcheckValidStateRaw
in GlobalSession easier to see.Here are some of the interesting numbers:
Before:
After:
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/panama-foreign.git pull/844/head:pull/844
$ git checkout pull/844
Update a local copy of the PR:
$ git checkout pull/844
$ git pull https://git.openjdk.org/panama-foreign.git pull/844/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 844
View PR using the GUI difftool:
$ git pr show -t 844
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/panama-foreign/pull/844.diff
Webrev
Link to Webrev Comment