
8302209: [Lilliput] Optimize fix-anon monitor owner path #74


Closed
wants to merge 4 commits

Conversation


@rkennke rkennke commented Feb 10, 2023

When trying to exit a monitor, we need to check whether the current owner is anonymous and, if so, fix it before exiting the monitor. So far we have been doing this by calling into the slow path and handling it in the runtime. However, it is fairly easy to do in the fast path and avoid calling into the runtime altogether. I have also included these changes in the upstream fast-locking PR.
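Conceptually, the new fast path does something like the following C++ sketch. The names (ANONYMOUS_OWNER, Monitor, fast_exit) are illustrative assumptions only; the real code is emitted as platform assembly by the MacroAssemblers:

```cpp
#include <atomic>
#include <cassert>

// Illustrative marker value; the real constant lives in the HotSpot sources.
static void* const ANONYMOUS_OWNER = reinterpret_cast<void*>(1);

struct Monitor {
  std::atomic<void*> owner{nullptr};
};

// Fast-path exit: if the owner field still holds the ANONYMOUS marker,
// install the current thread as owner before releasing, instead of
// calling into the runtime slow path.
void fast_exit(Monitor& m, void* current_thread) {
  if (m.owner.load(std::memory_order_relaxed) == ANONYMOUS_OWNER) {
    m.owner.store(current_thread, std::memory_order_relaxed);  // fix-anon
  }
  assert(m.owner.load() == current_thread);
  m.owner.store(nullptr, std::memory_order_release);  // release the monitor
}
```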

Testing:

  • tier1 (x86_64, aarch64)
  • tier2 (x86_64, aarch64)
  • jcstress -t sync -m quick

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JDK-8302209: [Lilliput] Optimize fix-anon monitor owner path

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/lilliput pull/74/head:pull/74
$ git checkout pull/74

Update a local copy of the PR:
$ git checkout pull/74
$ git pull https://git.openjdk.org/lilliput pull/74/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 74

View PR using the GUI difftool:
$ git pr show -t 74

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/lilliput/pull/74.diff


bridgekeeper bot commented Feb 10, 2023

👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 10, 2023

mlbridge bot commented Feb 10, 2023

Webrevs


openjdk bot commented Feb 14, 2023

@rkennke this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in the local repository for your personal fork:

git checkout JDK-8302209
git fetch https://git.openjdk.org/lilliput master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Feb 14, 2023
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Feb 14, 2023

@tstuefe tstuefe left a comment


Questions/Comments:

What is the advantage of using a stub here? The code sections for aarch64 and x64 are different and each is only used in one place, so you don't really save much by reuse here. It seems you could have just written them out as a conditional path in the various xxfastlock() functions. Then you would have saved the jumps into and out of the stub in the owner==ANONYMOUS case.

Anonymous means some other thread T2 inflated my lock while I (T1) was thin-lock-waiting on the lock, right?


tstuefe commented Mar 7, 2023

BTW I found it confusing that we have pre-existing functions called ...fast_unlock and ...fast_lock, since they have nothing to do with your FastLocking, right?

We should rename UseFastLocking to UseRomanLocking :-)


rkennke commented Mar 7, 2023

Hi Thomas,

What is the advantage of using a stub here? The code sections for aarch64 and x64 are different and each is only used in one place, so you don't really save much by reuse here. It seems you could have just written them out as a conditional path in the various xxfastlock() functions. Then you would have saved the jumps into and out of the stub in the owner==ANONYMOUS case.

This is not about saving memory. The very common path here is that the owner is not ANONYMOUS. Using a stub means that we forward-jump only in the very uncommon path, and not branch in the common path. If we were not using a stub, we would have to forward-jump in the common path and not branch in the uncommon path. In my experience, forward-jumping in the common path throws off static branch prediction and performs significantly worse. This is also mentioned as an important consideration in the Intel optimization guide.
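Schematically, the stub layout described above looks like this (illustrative pseudo-assembly only, not the actual generated code; the label names are made up):

```
    tst   owner, ANONYMOUS_OWNER    ; test the anonymous bit
    b.ne  fix_anon_stub             ; uncommon: forward jump, predicted not-taken
continuation:
    ...                             ; common unlock path continues straight-line

    ; emitted out of line, after the method's main code:
fix_anon_stub:
    str   rthread, [monitor, #owner_offset]  ; install current thread as owner
    b     continuation              ; resume the common path
```

With the stub out of line, the common path contains only a single not-taken conditional branch, which static branch prediction handles well.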

Anonymous means some other thread T2 inflated my lock while I (T1) was thin-lock-waiting on the lock, right?

Yes, exactly.

I'll address your other comments later today or tomorrow.

Thank you!
Roman


tstuefe commented Mar 7, 2023

Hi Roman,

What is the advantage of using a stub here? The code sections for aarch64 and x64 are different and each is only used in one place, so you don't really save much by reuse here. It seems you could have just written them out as a conditional path in the various xxfastlock() functions. Then you would have saved the jumps into and out of the stub in the owner==ANONYMOUS case.

This is not about saving memory. The very common path here is that owner is not ANONYMOUS. Using a stub means that we forward-jump only in the very uncommon path, and not branch in the common path.

wait... now I'm confused. Sorry, I'm no compiler guy. But:

__ tst(disp_hdr, (uint64_t)(intptr_t) ANONYMOUS_OWNER);
...
 __ br(Assembler::NE, stub->entry());

Why NE? We jump into the stub if owner != ANONYMOUS?

If we were not using a stub, we would have to forward-jump in the common path and not branch in the uncommon path. In my experience, forward-jumping in the common path throws off static branch prediction and performs significantly worse. This is also mentioned as an important consideration in the Intel optimization guide.

I understand that. But you could have this done in-place too, right? Place a label at the end of the function and jump to it in the uncommon case, before the label do a ret for the common case. This is more to make sure I understand the coding, if you prefer to do it via stub that's fine.

Anonymous means some other thread T2 inflated my lock while I (T1) was thin-lock-waiting on the lock, right?

Yes, exactly.

Okay, non-anonymous is the case where a thread leaves an inflated monitor it inflated itself. And that is the hot path? Interesting. Under which circumstances does a thread inflate its own lock? Is this because your fast-locking does not support recursion yet? So every recursion would inflate the lock right away?

I'll address your other comments later today or tomorrow.

Thank you! Roman


rkennke commented Mar 7, 2023

What is the advantage of using a stub here? The code sections for aarch64 and x64 are different and each is only used in one place, so you don't really save much by reuse here. It seems you could have just written them out as a conditional path in the various xxfastlock() functions. Then you would have saved the jumps into and out of the stub in the owner==ANONYMOUS case.

This is not about saving memory. The very common path here is that owner is not ANONYMOUS. Using a stub means that we forward-jump only in the very uncommon path, and not branch in the common path.

wait... now I'm confused. Sorry, I'm no compiler guy. But:

__ tst(disp_hdr, (uint64_t)(intptr_t) ANONYMOUS_OWNER);
...
 __ br(Assembler::NE, stub->entry());

Why NE? we jump into the stub if owner != ANONYMOUS?

tst is an and instruction in disguise: it masks the value and sets ZF according to the result: ZF=1 if the bit is 0, and ZF=0 if the bit is 1. NE means ZF=0, that is, when the ANONYMOUS bit is set.
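The flag behaviour can be modelled in a few lines of C++ (a schematic model of the AArch64 semantics, not HotSpot code; the function names are made up):

```cpp
#include <cstdint>

// Schematic model of AArch64 `tst x, #mask` (an ANDS that discards the
// result): ZF is set iff (x & mask) == 0.
bool zero_flag_after_tst(uint64_t value, uint64_t mask) {
  return (value & mask) == 0;
}

// `br NE` is taken when ZF == 0, i.e. when the tested bit is set --
// here, when the owner word carries the ANONYMOUS bit.
bool branch_ne_taken(uint64_t owner_word, uint64_t anonymous_bit) {
  return !zero_flag_after_tst(owner_word, anonymous_bit);
}
```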

If we were not using a stub, we would have to forward-jump in the common path and not branch in the uncommon path. In my experience, forward-jumping in the common path throws off static branch prediction and performs significantly worse. This is also mentioned as an important consideration in the Intel optimization guide.

I understand that. But you could have this done in-place too, right? Place a label at the end of the function and jump to it in the uncommon case, before the label do a ret for the common case. This is more to make sure I understand the coding, if you prefer to do it via stub that's fine.

There is no ret here. This whole code section is expanded when C2 emits FastLockNode and FastUnlockNode. (To make it even more confusing, FastLockNode and FastUnlockNode are subclasses of CmpNode, and the code following it will do different things depending on whether we come out with ZF=1 or =0. But this is not relevant for this discussion.) Lacking a ret, I don't see a reasonable way to achieve what I described without using a stub.

Anonymous means some other thread T2 inflated my lock while I (T1) was thin-lock-waiting on the lock, right?

Yes, exactly.

Okay, non-anonymous is the case where a thread leaves an inflated monitor it inflated itself.

It doesn't have to have inflated that monitor itself, it would only have locked the (possibly pre-existing) monitor itself.

And that is the hot path? Interesting. Under which circumstances does a thread inflate its own lock? Is this because your fast-locking does not support recursion yet? So every recursion would inflate the lock right away?

Once 2 or more threads compete for a lock, that lock gets inflated. From there on, all locks and unlocks are done on the monitor. Yes, this is very common and hot (in workloads that use locking properly - as opposed to code that uses uncontended and/or unshared locks, which is what fast-locking is good at). Yes, recursion would inflate the lock immediately, but this is not the point here. (I do have an experimental patch to implement recursive fast-locking in rkennke/jdk#2, I am just not sure if it's worth having this extra complexity for little gain in somewhat bad workloads.)
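The inflation behaviour described here can be sketched as a toy state machine (purely illustrative; the names and states are made up and HotSpot's actual transitions are far more involved):

```cpp
// Toy model: a lock starts out fast-locked when taken uncontended;
// the first competing lock attempt inflates it, and once inflated it
// stays inflated, with all further operations going through the monitor.
enum class LockState { Unlocked, FastLocked, Inflated };

struct ToyLock {
  LockState state = LockState::Unlocked;
  int owner = -1;

  void lock(int thread) {
    if (state == LockState::Unlocked) {
      state = LockState::FastLocked;  // uncontended: cheap fast-lock
    } else {
      state = LockState::Inflated;    // contention (or recursion): inflate
    }
    owner = thread;                   // simplified: blocking is ignored
  }

  void unlock() {
    owner = -1;
    if (state == LockState::FastLocked) state = LockState::Unlocked;
    // Once inflated, the lock stays inflated.
  }
};
```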

Roman


tstuefe commented Mar 8, 2023

wait... now I'm confused. Sorry, I'm no compiler guy. But:

__ tst(disp_hdr, (uint64_t)(intptr_t) ANONYMOUS_OWNER);
...
 __ br(Assembler::NE, stub->entry());

Why NE? we jump into the stub if owner != ANONYMOUS?

tst is an and instruction in disguise: it masks the value and sets ZF according to the result: ZF=1 if the bit is 0, and ZF=0 if the bit is 1. NE means ZF=0, that is, when the ANONYMOUS bit is set.

Ah, okay, thanks for clarifying.

If we were not using a stub, we would have to forward-jump in the common path and not branch in the uncommon path. In my experience, forward-jumping in the common path throws off static branch prediction and performs significantly worse. This is also mentioned as an important consideration in the Intel optimization guide.

I understand that. But you could have this done in-place too, right? Place a label at the end of the function and jump to it in the uncommon case, before the label do a ret for the common case. This is more to make sure I understand the coding, if you prefer to do it via stub that's fine.

There is no ret here. This whole code section is expanded when C2 emits FastLockNode and FastUnlockNode. (To make it even more confusing, FastLockNode and FastUnlockNode are subclasses of CmpNode, and the code following it will do different things depending on whether we come out with ZF=1 or =0. But this is not relevant for this discussion.) Lacking a ret, I don't see a reasonable way to achieve what I described without using a stub.

Okay, I get it. There is no "end of function" to push your uncommon branch to, so the only way to do this without a stub would be to create a branch in place, and you don't want that for performance reasons.

Stupid question: is it guaranteed that the branch op always works with the location the stub is emitted to? Is there a mechanism to check that it is not too far away? Or is the code heap always small enough for every branch instruction used?

Anonymous means some other thread T2 inflated my lock while I (T1) was thin-lock-waiting on the lock, right?

Yes, exactly.

Okay, non-anonymous is the case where a thread leaves an inflated monitor it inflated itself.

It doesn't have to have inflated that monitor itself, it would only have locked the (possibly pre-existing) monitor itself.

And that is the hot path? Interesting. Under which circumstances does a thread inflate its own lock? Is this because your fast-locking does not support recursion yet? So every recursion would inflate the lock right away?

Once 2 or more threads compete for a lock, that lock gets inflated. From there on, all locks and unlocks are done on the monitor. Yes, this is very common and hot (in workloads that use locking properly - as opposed to code that uses uncontended and/or unshared locks, which is what fast-locking is good at). Yes, recursion would inflate the lock immediately, but this is not the point here. (I do have an experimental patch to implement recursive fast-locking in rkennke/jdk#2, I am just not sure if it's worth having this extra complexity for little gain in somewhat bad workloads.)

Again, thanks for explaining.

Reading through this and through the justification for biased-locking removal, it seems that which of those two scenarios is more common today is not so clear cut, right? So modern applications see more contended locking and less locking altogether than old apps?

Roman


tstuefe commented Mar 8, 2023

Reading through fast_unlock() some more...

What I did not understand at first is why you pop the object twice in fast_unlock: once in the stub, then again in fast_unlock_impl. But "Stacked" means thin-locked, right? So the section at c2_MacroAssembler_x86.cpp 948ff is for handling stack-based thin locks?

If that is so, at some point we may want to rename things to make it simpler. E.g. "Stacked"->"Thin" or similar. "Thin" would be a good fit for both new-style locks and old stack-based locks. I understand that you don't want to do this downstream though to keep the delta small.


tstuefe commented Mar 8, 2023

Could we factor out lock stack pushing and popping into some utility functions?

Also, would it make sense to emit an assert in debug builds when popping, so we don't get into negative values, similar to what you do in MacroAssembler::fast_lock_impl?

Compile::current()->output()->add_stub(stub);
jcc(Assembler::notEqual, stub->entry());
bind(stub->continuation());
#else
jcc(Assembler::notEqual, NO_COUNT);
Member


Could you add a comment "// not yet implemented"? Just to show that there is nothing fundamental stopping us from doing this for 32-bit; it's just not yet done.

Collaborator Author


I'll just implement it. shrugs

Collaborator Author


In fact, this is not so easy. See the comment that I added.


rkennke commented Mar 8, 2023

Stupid question: is it guaranteed that the branch op always works with the location the stub is emitted to? Is there a mechanism to check that it is not too far away? Or is the code heap always small enough for every branch instruction used?

There are asserts in the branch instructions that check that the branch can be encoded. The stubs get emitted right after the method code. I am not sure there is any reasonable way this could be too far away, but this is how it has been done since forever in C1 and recently also in C2. I would not be too worried about it.

Reading through this and through the justification for biased-locking removal, it seems that which of those two scenarios is more common today is not so clear cut, right? So modern applications see more contended locking and less locking altogether than old apps?

The scenario in which fast-locking (or stack-locking or biased-locking) really helps is actually pretty narrow. Code would have to:

  • do uncontended locking
    AND
  • use different lock objects

Because if it were using the same lock object over and over, the cost of allocating ('inflating') a true ObjectMonitor would quickly be amortised. The locking/unlocking primitives with an OM should be as fast as with stack-/fast-locking (basically a single CAS plus some small stuff). Only with single-use (or rarely reused) uncontended locks does fast-/stack-(/biased-)locking gain anything compared to pure ObjectMonitor locking. Unfortunately, this still happens in a few scenarios: StringBuffer is one such place that is basically guaranteed to be single-threaded, and StringBuffers are also very often throwaway objects. Collections and I/O streams I would expect to be reused often/long enough to amortise the OM inflation, but some code uses them in a throwaway fashion, too. And then there are cases where synchronized is used in outright nonsensical ways.


rkennke commented Mar 8, 2023

Reading through fast_unlock() some more...

What I did not understand at first is why you pop the object twice in fast_unlock: once in the stub, then again in fast_unlock_impl. But "Stacked" means thin-locked, right? So the section at c2_MacroAssembler_x86.cpp 948ff is for handling stack-based thin locks?

Right. And the other path that handles anonymous owners is dealing with ObjectMonitor based locking.

If that is so, at some point we may want to rename things to make it simpler. E.g. "Stacked"->"Thin" or similar. "Thin" would be a good fit for both new-style locks and old stack-based locks. I understand that you don't want to do this downstream though to keep the delta small.

Yes. That whole code is quite a mess, and I already did some cleanups upstream, but it still warrants some serious work. In particular, I would like to get rid of the implicit dealing with ZF there.


rkennke commented Mar 8, 2023

Could we factor out lock stack pushing and popping into some utility functions?

Also, would it make sense to emit an assert in debug when popping to not get into negative values, similar to what you do in MacroAssembler::fast_lock_impl?

Yes, but that's creeping beyond the scope of this PR. This belongs in the bigger fast-locking PR, IMO: openjdk/jdk#10907


@tstuefe tstuefe left a comment


+1


openjdk bot commented Mar 8, 2023

@rkennke This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8302209: [Lilliput] Optimize fix-anon monitor owner path

Reviewed-by: stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 414 new commits pushed to the master branch:

  • 0fd026d: Merge jdk:jdk-21+12
  • 6e19387: 8303070: Memory leak in DCmdArgument<char*>::parse_value
  • 9fc518f: 8303401: Add a Vector API equalsIgnoreCase micro benchmark
  • 05faa73: 8303232: java.util.Date.parse(String) and java.util.Date(String) don't declare thrown IllegalArgumentException
  • 4c985e5: 8303183: Memory leak in Arguments::init_shared_archive_paths
  • 6af17c1: 8303362: Serial: Move CardTableRS to serial directory
  • c1e77e0: 8303252: G1: Return early from Full-GC if no regions are selected for compaction.
  • 8b86e1e: 8303344: After JDK-8302760, G1 heap verification does not exit VM after errors
  • 4c5d9cf: 8303013: Always do remembered set verification during G1 Full GC
  • 2451c5a: 8303357: [JVMCI] thread is _thread_in_vm when committing JFR compilation event
  • ... and 404 more: https://git.openjdk.org/lilliput/compare/7ed8bfa6c5a9bb957df0c3826c4c42061f062a25...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 8, 2023

rkennke commented Mar 8, 2023

/integrate


openjdk bot commented Mar 8, 2023

Going to push as commit 27bcdb0.
Since your change was applied there have been 415 commits pushed to the master branch:

  • a4b8dbb: 8303823: [Lilliput] Inline initial LockStack stack
  • 0fd026d: Merge jdk:jdk-21+12
  • 6e19387: 8303070: Memory leak in DCmdArgument<char*>::parse_value
  • 9fc518f: 8303401: Add a Vector API equalsIgnoreCase micro benchmark
  • 05faa73: 8303232: java.util.Date.parse(String) and java.util.Date(String) don't declare thrown IllegalArgumentException
  • 4c985e5: 8303183: Memory leak in Arguments::init_shared_archive_paths
  • 6af17c1: 8303362: Serial: Move CardTableRS to serial directory
  • c1e77e0: 8303252: G1: Return early from Full-GC if no regions are selected for compaction.
  • 8b86e1e: 8303344: After JDK-8302760, G1 heap verification does not exit VM after errors
  • 4c5d9cf: 8303013: Always do remembered set verification during G1 Full GC
  • ... and 405 more: https://git.openjdk.org/lilliput/compare/7ed8bfa6c5a9bb957df0c3826c4c42061f062a25...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 8, 2023
@openjdk openjdk bot closed this Mar 8, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 8, 2023

openjdk bot commented Mar 8, 2023

@rkennke Pushed as commit 27bcdb0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
