Skip to content

8311594: Avoid GlobalSession liveness check #844

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

JornVernee
Copy link
Member

@JornVernee JornVernee commented Jul 7, 2023

This patch overrides checkValidStateRaw in GlobalSession, to do nothing. The current checkValidStateRaw in MemorySessionImpl checks the owner thread, and liveness, which are both not needed for the global session.

Since the state field is mutable, I found that the JIT can not fold away the liveness check. This has a very marginal effect on performance, but since the fix is also very non-intrusive, I think it's okay to do.

I've added 2 new benchmarks that I've used to measure the difference. One measures the performance of MS::copy, and the other measures MS::get. Both use a special ALL segment which spans 0 to Long.MAX_VALUE, which means that the target address can be used as an offset when doing the accesses. This helps the bounds checks being eliminated, and makes the result of overriding checkValidStateRaw in GlobalSession easier to see.

Here are some of the interesting numbers:

Before:

Benchmark                                  Mode  Cnt  Score   Error  Units
MemorySegmentCopyALL.panama_ALL_to_ALL     avgt   30  7.788 ± 0.029  ns/op
MemorySegmentCopyALL.unsafe_array_to_addr  avgt   30  7.543 ± 0.018  ns/op

Benchmark                               Mode  Cnt  Score   Error  Units
MemorySegmentGet.segment_ALL_unaligned  avgt   30  0.486 ± 0.002  ns/op
MemorySegmentGet.unsafe                 avgt   30  0.398 ± 0.003  ns/op

After:

Benchmark                                  Mode  Cnt  Score   Error  Units
MemorySegmentCopyALL.panama_ALL_to_ALL     avgt   30  7.571 ± 0.017  ns/op
MemorySegmentCopyALL.unsafe_array_to_addr  avgt   30  7.550 ± 0.019  ns/op

Benchmark                               Mode  Cnt  Score   Error  Units
MemorySegmentGet.segment_ALL_unaligned  avgt   30  0.409 ± 0.002  ns/op
MemorySegmentGet.unsafe                 avgt   30  0.397 ± 0.003  ns/op

Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8311594: Avoid GlobalSession liveness check (Enhancement - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/panama-foreign.git pull/844/head:pull/844
$ git checkout pull/844

Update a local copy of the PR:
$ git checkout pull/844
$ git pull https://git.openjdk.org/panama-foreign.git pull/844/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 844

View PR using the GUI difftool:
$ git pr show -t 844

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/panama-foreign/pull/844.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 7, 2023

👋 Welcome back jvernee! A progress list of the required criteria for merging this PR into foreign-memaccess+abi will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@JornVernee JornVernee marked this pull request as ready for review July 7, 2023 19:58
@openjdk openjdk bot added the rfr Ready for review label Jul 7, 2023
@mlbridge
Copy link

mlbridge bot commented Jul 7, 2023

Webrevs

@Override
@ForceInline
public void checkValidStateRaw() {
// do nothing, avoid liveness check
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok, although (see comments on MemorySessionImpl::checkValidStateRaw) sometimes having too many versions of this method can cause profile pollution issues (which is why we consolidated the impl at some point). I suggest trying other benchmarks, such as LoopOverPollutedXYZ

Copy link
Member Author

@JornVernee JornVernee Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Something I also realized over the weekend. I had thought bimorphic inlining would save us (and we override other methods such as acquire/release already). But, we have 3 concrete implementation classes of MemorySessionImpl: confined, shared, and global. This blows up inlining it seems, causing a big regression when all three session kinds are used (see latest push I did):

Before:

Benchmark                                              Mode  Cnt  Score    Error  Units
LoopOverPollutedSegments.heap_segment_floats_VH        avgt   30  0.311 ±  0.001  ms/op
LoopOverPollutedSegments.heap_segment_floats_instance  avgt   30  0.311 ±  0.001  ms/op
LoopOverPollutedSegments.heap_segment_ints_VH          avgt   30  0.311 ±  0.001  ms/op
LoopOverPollutedSegments.heap_segment_ints_instance    avgt   30  0.311 ±  0.001  ms/op
LoopOverPollutedSegments.heap_unsafe                   avgt   30  0.217 ±  0.001  ms/op
LoopOverPollutedSegments.native_segment_VH             avgt   30  0.237 ±  0.005  ms/op
LoopOverPollutedSegments.native_segment_instance       avgt   30  0.245 ±  0.005  ms/op
LoopOverPollutedSegments.native_unsafe                 avgt   30  0.221 ±  0.007  ms/op

After:

Benchmark                                              Mode  Cnt  Score    Error  Units
LoopOverPollutedSegments.heap_segment_floats_VH        avgt   30  3.310 ±  0.005  ms/op
LoopOverPollutedSegments.heap_segment_floats_instance  avgt   30  3.333 ±  0.004  ms/op
LoopOverPollutedSegments.heap_segment_ints_VH          avgt   30  3.305 ±  0.007  ms/op
LoopOverPollutedSegments.heap_segment_ints_instance    avgt   30  3.439 ±  0.106  ms/op
LoopOverPollutedSegments.heap_unsafe                   avgt   30  0.217 ±  0.001  ms/op
LoopOverPollutedSegments.native_segment_VH             avgt   30  3.595 ±  0.043  ms/op
LoopOverPollutedSegments.native_segment_instance       avgt   30  3.696 ±  0.006  ms/op
LoopOverPollutedSegments.native_unsafe                 avgt   30  0.204 ±  0.001  ms/op

If I remove the use of one of the other 2 session kinds again (e.g. the shared one), performance is back to normal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also tried another solution which puts an instanceof check in the existing checkValidStateRaw. This does improve the case with the special ALL segment, but it also regresses the polluted case by about just as much it seems.

I think this is a dead end.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, the only sane way to add support for this is to have a well-known constant - e.g. MemorySegment::allNative (which should be a restricted factory of course). Then special case this constant w.r.t. bound checks and temporal checks (since there's a well-known constant to compare against, I'm confident that pollution will not rear its ugly head). But that brings in a lot of API questions: for instance, if this segment is unbounded, what happens if you slice (of course one possible answer might be to throw). While we might add something like that in the future, I don't think we have enough evidence yet that something like this is really required.

Copy link
Member Author

@JornVernee JornVernee Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that is not too much different from the instanceof check I tried. The issue I see is that, while when the segment/scope are constant, everything folds away, in every other case we now have to do an additional identity check.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on a smaller test at the moment. The critical performance operation overall is a binary search within a b-tree node. Each comparison obtains a 2-byte pointer which references a binary key, which itself starts with a byte field to encode the length. The comparison does a byte-by-byte check of keys.

I've tried to optimize this in the past by doing larger comparisons (8 bytes at a time), but it didn't show much improvement. I assumed that HotSpot was already doing some optimizations that were good enough, and so I didn't bother with any more experiments.

Anyhow, with a test program which just stresses this critical bottleneck, the Panama version is two times slower than the Unsafe version. I'll work on creating something smaller and standalone (it won't suck in the whole project).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stuff like this can be very helpful, thanks!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the benchmark. We've been able to reproduce your result. We are investigating further (that will take some time).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JornVernee
Copy link
Member Author

I'd still like to upstream the benchmark changes. I'll create a separate PR for that.

@merykitty
Copy link
Member

merykitty commented Jul 31, 2023

For liveness check, a solution on the top of my head is to push the branch to the slow path. Instead of having one state field, we can have 2, which can take values according to the state of the session, (0, x): Open, (1, 0): Closing, (1, 1): Closed. These can be marked as stable, since they do not transit from value 1 to value 0. Then, we can special case the global session to have the openState field being 1, and we do type branching when encountering a 1 when reading this state. This has the advantage that it does not affect the execution of other sessions, as well as being able to fold the check for the global ones.

For ownership check, maybe we can tell the compiler to trust this particular field. Stable does not work here, though maybe we can add a dummy Thread object to use as the pivot value instead of null. But comparing to another pointer is a bit tad more expensive than comparing to null, so changing the compiler seems more preferable.

For range check, maybe we can do the same as liveness one, instead of failing on out of bounds, we branch on a constant field to see if it is a global one, and only throw if it is not.

Thanks a lot.

@mcimadamore
Copy link
Collaborator

For liveness check, a solution on the top of my head is to push the branch to the slow path. Instead of having one state field, we can have 2, which can take values according to the state of the session, (0, x): Open, (1, 0): Closing, (1, 1): Closed. These can be marked as stable, since they do not transit from value 1 to value 0. Then, we can special case the global session to have the openState field being 1, and we do type branching when encountering a 1 when reading this state. This have the advantage that it does not affect the execution of other sessions, as well as being able to fold the check for the global ones.

For ownership check, maybe we can tell the compiler to trust this particular field. Stable does not work here, though maybe we can add a dummy Thread object to use as the pivot value instead of null. But comparing to another pointer is a bit tad more expensive than comparing to null, so changing the compiler seems more preferable.

For range check, maybe we can do the same as liveness one, instead of failing on out of bounds, we branch on a constant field to see if it is a global one, and only throw if it is not.

Thanks a lot.

What you suggest are sensible optimization tactics. However, it is unclear how much this path needs to be optimized further. In most cases, when we see a significant performance difference between MS and Unsafe, the liveness check is never the culprit, the bound check is.

And, when using tactics to e.g. generate wrapper segments on the fly to avoid bound checks [1], we can observe that liveness check is also gone from the corresponding assembly. So, this optimization seems to be in YAGNI territory, at least for now.

[1] - https://mail.openjdk.org/pipermail/panama-dev/2023-July/019487.html

@mcimadamore
Copy link
Collaborator

Ping: does this PR needs to be open?

@JornVernee
Copy link
Member Author

No

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfr Ready for review
Development

Successfully merging this pull request may close these issues.

4 participants