-
Notifications
You must be signed in to change notification settings - Fork 18
8302209: [Lilliput] Optimize fix-anon monitor owner path #74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into |
Webrevs
|
@rkennke this pull request can not be integrated into git checkout JDK-8302209
git fetch https://git.openjdk.org/lilliput master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Questions/Comments:
What is the advantage of using a stub here? The code section between aarch64 and x64 are different and only used in one place each. So you don't really save much by reuse here. Seems you could have just written them out as a conditional path into the various xxfastlock() functions. Then you would have saved the jumps into and out of the stub in the case owner==ANONYMOUS case.
Anonymous means some other thread T2 inflated my lock while I (T1) was thin-lock-waiting on the lock, right?
BTW I found it confusing that we have pre-existing functions called ...fast_unlock and ...fast_lock, since it has nothing to do with your FastLocking, right? We should rename UseFastLocking to UseRomanLocking :-) |
Hi Thomas,
This is not about saving memory. The very common path here is that owner is not ANONYMOUS. Using a stub means that we forward-jump only in the very uncommon path, and not branch in the common path. If we were not using a stub, we would have to forward-jump in the common path and not branch in the uncommon path. In my experience, forward-jumping in the common path throws off static branch prediction and performs significantly worse. This is also mentioned in the Intel optimization guide as important consideration.
Yes, exactly. I'll address your other comments later today or tomorrow. Thank you! |
Hi Roman,
wait... now I'm confused. Sorry, I'm no compiler guy. But:
Why NE? we jump into the stub if owner != ANONYMOUS?
I understand that. But you could have this done in-place too, right? Place a label at the end of the function and jump to it in the uncommon case, before the label do a ret for the common case. This is more to make sure I understand the coding, if you prefer to do it via stub that's fine.
Okay, non-anonymous is the case where a thread leaves an inflated monitor it inflated itself. And that is the hot path? Interesting. Under which circumstances does a thread inflate its own thread? Is this because your fastlocking does not work with recursion yet, right? So every recursion would inflate the lock right away?
|
tst is an and instruction in disguise: it masks the value and sets ZF according to the result: it is ZF=1 if the bit is 0, and ZF=0 if the bit is 1. NE is ZF=0, that is when the ANONYMOUS bit is set.
There is no ret here. This whole code section is expanded when C2 emits FastLockNode and FastUnlockNode. (To make it even more confusing, FastLockNode and FastUnlockNode are subclasses of CmpNode, and the code following it will do different things depending on whether we come out with ZF=1 or =0. But this is not relevant for this discussion.) Lacking a ret, I don't see a reasonable way to achieve what I described without using a stub.
It doesn't have to have inflated that monitor itself, it would only have locked the (possibly pre-existing) monitor itself.
Once 2 or more threads compete for a lock, that lock gets inflated. From there on, all locks and unlocks are done on the monitor. Yes, this is very common and hot (in workloads that use locking properly - as opposed to code that uses uncontended and/or unshared locks, which is what fast-locking is good at). Yes, recursion would inflate the lock immediately, but this is not the point here. (I do have an experimental patch to implement recursive fast-locking in rkennke/jdk#2, I am just not sure if it's worth to have this extra complexity for little gain in somewhat bad workloads.) Roman |
Ah, okay, thanks for clarifying.
Okay, I get it. There is no "end of function" to push your uncommon branch to, so the only way to do this without a stub would be to create a branch in-place, and that you don't want that for performance reasons. Stupid question, is it guaranteed that the branch op always works with the location the stub is emitted to? Is there a mechanism to check that its is not too far away? Or is the code heap always small enough for every branch instruction used?
Again, thanks for explaining. Reading through this and through the justification for biased-locking removal, it seems that which of those two scenarios are more common today is not so clear cut, right? So modern applications see more contented locking and less locking altogether than old apps?
|
Reading through fast_unlock() some more... What I did not understand at first is why you pop the object twice in fast_unlock: once in the stub, then again in fast_unlock_impl. But "Stacked" means thin-locked, right? So the section at c2_MacroAssembler_x86.cpp 948ff is for handling stack-based thin locks? If that is so, at some point we may want to rename things to make it simpler. E.g. "Stacked"->"Thin" or similar. "Thin" would be a good fit for both new-style locks and old stack-based locks. I understand that you don't want to do this downstream though to keep the delta small. |
Could we factor out lock stack pushing and popping into some utility functions? Also, would it make sense to emit an assert in debug when popping to not get into negative values, similar to what you do in |
Compile::current()->output()->add_stub(stub); | ||
jcc(Assembler::notEqual, stub->entry()); | ||
bind(stub->continuation()); | ||
#else | ||
jcc(Assembler::notEqual, NO_COUNT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment "// not yet implemented" ? Just to show that there is nothing basic stopping us from doing this for 32-bit, its just not yet done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just implement it. shrugs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In-fact, this is not so easy. See the comment that I added.
There are asserts in the branch instructions that check that the branch can be encoded. The stubs get emitted right after the method code. Not sure if there is any reasonable way that this could be too far away, but this is the way it is done since forever in C1 and recently also in C2. I would not be too worried about it.
The scenario in which fast-locking (or stack-locking or biased-locking) really helps is actually pretty narrow. Code would have to:
Because if it would be using the same lock-object over and over, the cost of allocating ('inflating') a true ObjectMonitor would be quickly amortised. The locking/unlocking primitives with OM should be as fast as with stack-/fast-locking (basically a single CAS plus some small stuff). Only in single-use (or relatively few used) uncontented locks is where fast-/stack-(/biased-)locking would gain anything compared to pure ObjectMonitor locking. Unfortunately, this still happens in a few scenarios: StringBuffer is one such place that is basically guaranteed to be single-threaded, and they're also very often throwaway objects. Collections and I/O streams I would expect to be reused often/long enough to amortise the OM inflation, but some code uses them in a throwaway fashion, too. And then there are cases where synchronized is used in outright stupid nonsense ways. |
Right. And the other path that handles anonymous owners is dealing with ObjectMonitor based locking.
Yes. That whole code is quite a mess, and I already did some cleanups in upstream, but it still warrants some serious work. In particular, I would like to get rid of that implicit dealing with ZF there. |
Yes, but it's creeping scope of this PR. This belongs in the bigger fast-locking PR IMO: openjdk/jdk#10907 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@rkennke This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 414 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
/integrate |
Going to push as commit 27bcdb0.
Your commit was automatically rebased without conflicts. |
When trying to exit a monitor, we need to check if the current owner is anonymous, and if so, fix it before exiting the monitor. So far, we have been doing this by calling into the slow-path and handle it in the runtime. However, it is fairly easy to do in the fast-path and avoid calling into the runtime altogether. I also included those changes in upstream fast-locking PR.
Testing:
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/lilliput pull/74/head:pull/74
$ git checkout pull/74
Update a local copy of the PR:
$ git checkout pull/74
$ git pull https://git.openjdk.org/lilliput pull/74/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 74
View PR using the GUI difftool:
$ git pr show -t 74
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/lilliput/pull/74.diff