8319797: Recursive lightweight locking: Runtime implementation #16606
Conversation
Testing
As no port implements this. The
👋 Welcome back aboldtch! A progress list of the required criteria for merging this PR into
Webrevs
Good work! Mostly looks good! I only have some minor comments and a question.
markWord mark = obj()->mark_acquire();
while (mark.is_unlocked()) {
  if (lock_stack.is_full()) {
    // The emitted code always goes into the runtime in case the lock stack is full.
What is the rationale behind this block? Is it beneficial to inflate the top-most lock to make room for the new one, because that might be hotter? If so, then it may be even more useful to inflate the bottom-most entry instead?
I'm also unclear on the rationale, and again on checking for a full-stack upfront like this, when it should be a rare case. If recursion support means the lockStack is no longer big enough then we need to increase its size to accommodate that.
What is the rationale behind this block? Is it beneficial to inflate the top-most lock to make room for the new one, because that might be hotter? If so, then it may be even more useful to inflate the bottom-most entry instead?
The current implementation inflates the bottom (least recently added) entry.
The rationale is that, because the emitted code always goes into the runtime for monitorenter if the lock stack is full, we need to inflate at least one object on the lock stack. Otherwise we could get into a scenario where we constantly go into the runtime because we are in some deeply nested critical sections, entering and exiting in a loop with the lock stack full.
I also have versions of this which go through the lock stack and first evict the entries whose objects are already inflated, only inflating a not-yet-inflated object if the lock stack is still full.
As for inflating the bottom instead of the top, I am unsure what would be best. The idea behind the bottom is that it is furthest away from the currently running code; if the top is in a loop with a different object every time, inflating the top would cause a lot of inflation. But it could obviously also be that the bottom-most object is different every time while the top stays the same.
I can't say that I have seen programs with either of these behaviours. Both can have equally bad worst-case programs (with respect to the number of inflations), but my gut feeling is that the worst case is less likely when inflating the bottom.
If recursion support means the lockStack is no longer big enough then we need to increase its size to accommodate that.
I have not seen it being a problem, but it would be worth looking for programs where this could be an issue and evaluate increasing the lock stack size. Regardless of the capacity, if (and when) the lock stack gets full it needs to be handled in some way.
I'm also unclear on the rationale, and again on checking for a full-stack upfront like this, when it should be a rare case.
The check for a full lock stack is always performed in every code path: emitted C2 code, emitted shared code, and the runtime.
This only adds an escape hatch for the degenerate behaviour we could otherwise end up in.
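To make that escape hatch concrete, here is a minimal standalone sketch (an illustrative toy model under the assumptions of this thread, not the HotSpot sources; ToyLockStack and make_room are invented for the example). On a full lock stack the bottom-most, least recently pushed entry is chosen as the inflation victim, and all of its slots are freed:

#include <cstddef>

using std::size_t;

struct ToyLockStack {
  static constexpr size_t CAPACITY = 8;
  const void* _base[CAPACITY];  // oldest entry at index 0
  size_t _top = 0;              // number of used slots

  bool is_full() const { return _top == CAPACITY; }

  // Drop every slot holding obj (its recursions would move into the inflated
  // monitor's recursion count). Returns the number of slots freed.
  size_t remove(const void* obj) {
    size_t dst = 0, removed = 0;
    for (size_t src = 0; src < _top; src++) {
      if (_base[src] == obj) { removed++; } else { _base[dst++] = _base[src]; }
    }
    _top = dst;
    return removed;
  }
};

// Called only when the stack is full: pick the bottom-most entry, free its
// slots, and hand it back so the caller can inflate it into a monitor.
const void* make_room(ToyLockStack& ls) {
  const void* victim = ls._base[0];  // furthest from the currently hot code
  ls.remove(victim);                 // frees at least one slot
  return victim;
}

int main() {
  ToyLockStack ls;
  int objs[ToyLockStack::CAPACITY];
  for (size_t i = 0; i < ToyLockStack::CAPACITY; i++) {
    ls._base[ls._top++] = &objs[i];  // fill the stack with distinct locks
  }
  const void* victim = make_room(ls);  // one slot is now free
  return victim == &objs[0] ? 0 : 1;
}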
I would expect that we will encounter a full lockstack, of the current size 4, much more often with recursion support, and so we probably should increase it. But the handling of a full stack should still come last IMO.
The current lock stack capacity is 8.
Worth noting is that when we add support for holding monitors in Loom we will probably transfer the lock stack while mounting and unmounting. If this is done via indirection instead of copying, adding a dynamically resizable lock stack would not be much of an effort.
But the handling of a full stack should still come last IMO.
I am unsure what you mean by last here. The idea is to make room for the current object on the lock stack.
We could make this conditional on the state of the object's locking. But the only cases where making room on the lock stack is arguably less useful are when this is a recursive enter on an inflated monitor, or a contended enter on a fast-locked object.
Then there is also what I already described, where you remove already-inflated objects first.
But having a full lock stack is not something I have encountered as a problem in the different benchmarks. So I did not want to add complex logic which tries to reason about when and how to make room on the lock stack, and instead simply say: if it is full, make room.
So if I understand you correctly, you want to inflate the current object's monitor unconditionally if the lock stack is full.
The current lock stack capacity is 8
Okay, my recollection was that they found 4 was sufficient, but I guess that changed.
So if I understand you correctly, you want to inflate the current objects monitor unconditionally if the lock stack is full.
No, I just expected checking for a full lock stack to happen when it is actually needed, not up front: i.e. try to recursively lock, have it fail because the stack is full, then make some room and try again. The fullness check would happen inside the lock stack code, not out in the caller. I'm not asking you to do this, just clarifying what I meant by not checking for it first.
I get the gist of this, but I have a number of concerns about what is getting checked and when. I have a suspicion that recursion support is making the lock stack full more often, thus invalidating the sizing of the lock stack that was based on there being no recursion support. In that case we need to bump the size. A full lock stack should be rare, not something we check for first.
// Make sure the layout of the object is compatible with the emitted code's assumptions.
STATIC_ASSERT(sizeof(_bad_oop_sentinel) == oopSize);
STATIC_ASSERT(sizeof(_base[0]) == oopSize);
STATIC_ASSERT(std::is_standard_layout<LockStack>::value);
What is this? Are we allowed to use it?
There is probably more nuance here w.r.t. offsetof than I know.
My belief was that the reason we did not use offsetof is because we use it on non-standard-layout types, for which it is invalid. But the LockStack is a standard-layout type.
However, reading some of the issues surrounding offsetof (mainly poor compiler support, and it becoming conditionally-supported in C++17), there might be more reasons to avoid it. If that is the case, this property would have to be asserted at runtime instead.
Maybe @kimbarrett has some more insight.
To be clear, I was querying the use of std::is_standard_layout here.
https://en.cppreference.com/w/cpp/language/classes#Standard-layout_class
TIL. Maybe this should be in our allowed list of features?
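For illustration, a minimal compile-time example of what the assertion buys (ToyLockStack here is a made-up stand-in, not the real LockStack): offsetof is only guaranteed to be well-defined for standard-layout classes, and emitted code addresses these fields by fixed byte offsets, so the type property is checked before any offsets are relied upon.

#include <cstddef>
#include <type_traits>

struct ToyLockStack {
  unsigned _top;       // offset of the first free slot
  void*    _base[8];   // fixed-size array of oop-sized entries
};

// Adding a virtual function or mixed access control to ToyLockStack would
// break this assertion, and with it the validity of the offsetof uses below.
static_assert(std::is_standard_layout<ToyLockStack>::value,
              "offsetof is only well-defined for standard-layout classes");

constexpr std::size_t top_offset  = offsetof(ToyLockStack, _top);
constexpr std::size_t base_offset = offsetof(ToyLockStack, _base);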
if (lock_stack.is_full()) {
  // Always go into runtime if the lock stack is full.
  return false;
}
It isn't obvious that it is beneficial to check what should be a rare occurrence. Why do this?
All lightweight enters must check if the lock stack is full; both push and try_recursive_enter have that as a precondition. All code paths (emitted C2 code, emitted shared code, and the runtime) do the is-full check first.
The reason that quick_enter does this without checking the mark word (for a monitor) is that we go into the runtime from the emitted code if the lock stack is full. So we want to enter the runtime to inflate and make room, to not get into a scenario where we have to go into the runtime on every monitor enter because we are locking on new objects in a loop with a full lock stack.
FWIW, when I did the original LW-locking implementation, and when the lock-stack was not yet fixed-size, I experimented with an optimisation for this problem: instead of doing the check for every monitorenter, I let C2 analyse the maximum lock-stack depth that the method is going to require, and do the check (and growing of the lock-stack) at method-entry. However, I haven't seen a case where this was beneficial, and there have been several problems with the approach as well (maybe that's why it wasn't beneficial). But maybe it is worth revisiting at some point? OTOH, with recursive locking we need to load and check the top-offset anyway, which makes the extra cost to check for overflow even smaller.
Isn't checking that the lock stack is full a prerequisite for trying to recursively enter the lock? ie, we have to check first.
} else if (mark.is_fast_locked() && lock_stack.is_recursive(object)) {
  // This lock is recursive but unstructured exit. Just inflate the lock.
Again this seems in the wrong place - this should be a very rare case so we should not be checking it explicitly before the expected cases!
In exit we must always check for recursions first. Unsure what you are proposing here.
Maybe you want to call remove first, and branch on whether the number removed is greater than 1, and in that case inflate and update the recursions field before falling through. Something like this:
// Fast-locking does not use the 'lock' argument.
LockStack& lock_stack = current->lock_stack();
if (mark.is_fast_locked()) {
  if (lock_stack.try_recursive_exit(object)) {
    // Recursively unlocked.
    return;
  }
  size_t recursions = lock_stack.remove(object) - 1;
  if (recursions == 0) {
    while (mark.is_fast_locked()) {
      const markWord new_mark = mark.set_unlocked();
      const markWord old_mark = mark;
      mark = object->cas_set_mark(new_mark, old_mark);
      if (old_mark == mark) {
        return;
      }
    }
  }
  lock_stack.push(object);
  ObjectMonitor* mon = inflate(current, object, inflate_cause_vm_internal);
  if (mon->is_owner_anonymous()) {
    mon->set_owner_from_anonymous(current);
  }
  mon->set_recursions(recursions);
}
This makes the code a little more like the emitted code, except it is conditioned on the mark word's lock bits.
It is hard to believe this will make a measurable difference. But at least to me it is more noisy.
It wasn't the recursions I was querying but the unstructured locking aspect.
To be clear, my concern is that for a simple exit we not only have to first check for a recursive exit (fine) but also this unexpected, rare unstructured-locking recursive case. Thinking it through, part of the problem is that a simple exit does itself allow for unstructured locking. Is it worth adding an additional case to peek at the top of the lock stack and then do an exit with a pop for the most common non-recursive case? That way we in effect handle things in the following order (a sketch follows the list):
- recursive exit
- direct exit
- recursive unstructured exit
- direct unstructured exit
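A standalone sketch of that ordering (toy code, not the HotSpot implementation; cas_unlock and inflate_and_exit are hypothetical stand-ins for the mark-word CAS and the inflation path):

#include <cassert>
#include <cstddef>

using std::size_t;

struct ToyLockStack {
  const void* _base[8];
  size_t _top = 0;

  const void* top() const { return _top == 0 ? nullptr : _base[_top - 1]; }
  void pop() { _top--; }

  // Consecutive recursion: the two topmost slots hold the same object.
  bool try_recursive_exit(const void* obj) {
    if (_top < 2 || _base[_top - 1] != obj || _base[_top - 2] != obj) {
      return false;
    }
    _top--;
    return true;
  }

  // Removes every slot holding obj; also handles buried (unstructured) entries.
  size_t remove(const void* obj) {
    size_t dst = 0, removed = 0;
    for (size_t src = 0; src < _top; src++) {
      if (_base[src] == obj) { removed++; } else { _base[dst++] = _base[src]; }
    }
    _top = dst;
    return removed;
  }
};

static void cas_unlock(const void*) { /* stand-in for the mark-word CAS */ }
static void inflate_and_exit(const void*, size_t) { /* stand-in for inflation + ObjectMonitor::exit() */ }

void exit_fast_locked(ToyLockStack& ls, const void* obj) {
  if (ls.try_recursive_exit(obj)) {
    return;                   // 1. recursive exit
  }
  if (ls.top() == obj) {
    ls.pop();                 // 2. direct (structured) exit
    cas_unlock(obj);
    return;
  }
  // 3./4. unstructured exits: obj is buried in the stack; drop all of its
  // slots and let the inflated monitor perform the real unlock.
  size_t removed = ls.remove(obj);
  assert(removed >= 1 && "thread must hold obj");
  inflate_and_exit(obj, removed - 1);
}

int main() {
  ToyLockStack ls;
  int a, b;
  ls._base[ls._top++] = &a;   // enter a
  ls._base[ls._top++] = &b;   // enter b
  exit_fast_locked(ls, &a);   // case 4: a is buried under b (unstructured)
  exit_fast_locked(ls, &b);   // case 2: b is top-of-stack (structured)
}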
First off, let us note that when reaching this code the unstructured exit is the common case. The normal exit and recursive exit are usually handled in the emitted code (this includes the interpreter). We reach this because either a CAS failed somewhere due to a concurrent hashCode installation, or the exit was unstructured. Inflated monitor exit just jumps past this code (everything is conditioned on mark.is_fast_locked()).
Is this motivated by the structure/layout of the C++ code, or as an optimisation?
If it is motivated by the structure/layout, then we can lay it out as you described; it would add some code duplication.
If it is motivated as an optimisation, then after the recursive exit fails, we should just call remove and act based on the return value.
I would not go so far as to say the unstructured locking case is common. Sure, we are on the slow path, which may be due to unstructured locking, or we may be here through deopt (also a slow path) or through the native method wrapper, or ... but yes, this is not really performance critical.
Perhaps changing this comment will help. Please consider:
// This lock is recursive but is not at the top of the lock stack so we're
// doing an unstructured exit. We have to fall thru to inflation below and
// let ObjectMonitor::exit() do the unlock.
This check on L570 has to be done before the block that begins on L572, because that block assumes that object is not a recursive lock and will remove all instances of object from the lock stack on L578. That would unexpectedly unlock object too many times.
Just as a side note: the block from L572 -> L584 also handles both a structured exit of a non-recursive lock AND an unstructured exit of a non-recursive lock. This is true because we use remove() on L578 instead of a pop(), which assumes top-of-stack/structured locking.
Co-authored-by: Tobias Hartmann <tobias.hartmann@oracle.com>
Great stuff - looks good to me! Thank you!
/Roman
I last reviewed v08 of this PR. This review is for v12 of the PR.
Don't forget to make a pass to update the appropriate copyright years.
Again, I only have nits or questions that need some clarifications.
This change is looking really good. This time I did my crawl thru using the webrev.
Thanks for making the updates. One minor typo.
I last reviewed v15 of this PR so I reviewed the incremental v17.
We have done performance testing of the recursive locking implementation vs the existing lightweight locking implementation. The results were much better on some worst cases, like J2DBench, and reduced monitor inflation; they were neutral on others, with a couple of small regressions on linux-aarch64 (Derby) that we will investigate as further work before making LM_LIGHTWEIGHT the default. We at Oracle feel that this PR is ready for integration. It's a nice implementation and improvement, and great work. Thank you!
Thanks for all the reviews. tier1-7 has been run with both LM_LEGACY and LM_LIGHTWEIGHT on both linux-x86_64 and linux-aarch64, and tier1-3 on windows and macosx.
Going to push as commit 5dbf137.
Your commit was automatically rebased without conflicts.
Implements the runtime part of JDK-8319796.
The different CPU implementations are/will be created as dependent pull requests.
This enhancement proposes introducing the ability for LM_LIGHTWEIGHT to handle consecutive recursive monitor enters. Limiting the implementation to only consecutive monitor enters allows for more efficient emitted code, which only needs to look at the two topmost entries on the lock stack to determine what to do in a monitor exit.
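As a standalone illustration of that rule (a toy model invented for this sketch, not the HotSpot sources), a recursive enter is only accepted when the object already sits on top of the lock stack, so a recursive exit only has to compare the two topmost entries:

#include <cassert>
#include <cstddef>

using std::size_t;

struct ToyLockStack {
  static constexpr size_t CAPACITY = 8;
  const void* _base[CAPACITY];
  size_t _top = 0;

  bool is_full() const { return _top == CAPACITY; }

  // Only a *consecutive* recursion stays fast-locked; anything else falls
  // back to the runtime (and ultimately to an inflated monitor).
  bool try_recursive_enter(const void* obj) {
    if (is_full() || _top == 0 || _base[_top - 1] != obj) return false;
    _base[_top++] = obj;
    return true;
  }

  // The emitted exit code only needs the two topmost entries to decide.
  bool try_recursive_exit(const void* obj) {
    if (_top < 2 || _base[_top - 1] != obj || _base[_top - 2] != obj) return false;
    _top--;
    return true;
  }
};

int main() {
  ToyLockStack ls;
  int a, b;
  ls._base[ls._top++] = &a;             // enter a
  assert(ls.try_recursive_enter(&a));   // a; a  -> consecutive, stays fast-locked
  ls._base[ls._top++] = &b;             // a; a; b
  assert(!ls.try_recursive_enter(&a));  // non-consecutive: needs the runtime
  assert(!ls.try_recursive_exit(&b));   // only one b on top: not recursive
}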
A high level overview:
Progress
Issue
Reviewers
Contributors
<stefank@openjdk.org>
<eosterlund@openjdk.org>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16606/head:pull/16606
$ git checkout pull/16606
Update a local copy of the PR:
$ git checkout pull/16606
$ git pull https://git.openjdk.org/jdk.git pull/16606/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16606
View PR using the GUI difftool:
$ git pr show -t 16606
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16606.diff
Webrev
Link to Webrev Comment