8319797: Recursive lightweight locking: Runtime implementation #16606


Closed
wants to merge 55 commits

Conversation

@xmas92 xmas92 commented Nov 10, 2023

Implements the runtime part of JDK-8319796.
The different CPU implementations are/will be created as dependent pull requests.

This enhancement proposes introducing the ability for LM_LIGHTWEIGHT to handle consecutive recursive monitor enters. Limiting the implementation to consecutive monitor enters allows for more efficient emitted code, which only needs to look at the two topmost entries on the lock stack to determine what to do in a monitor exit.

A high level overview (a code sketch follows this list):

  • Locking is still performed on the mark word
    • Unlocked (0b01) <=> Locked (0b00)
  • Monitor enter on Obj with mark word Unlocked (0b01) is the same as before
    • Transition Obj's mark word Unlocked (0b01) => Locked (0b00)
    • Push Obj onto the lock stack
    • Success
  • Monitor enter on Obj with mark word Locked (0b00) will check the top entry on the lock stack
    • If top entry is Obj
      • Push Obj on the lock stack
      • Success
    • If top entry is not Obj
      • Inflate and call ObjectMonitor::enter
  • Monitor exit on Obj with mark word Locked (0b00) will check the two top entries on the lock stack
    • If just the top entry is Obj
      • Transition Obj's mark word Locked (0b00) => Unlocked (0b01)
      • Pop the entry
      • Success
    • If both entries are Obj
      • Pop the top entry
      • Success
    • Any other case only occurs with unstructured locking; in that case, just inflate and call ObjectMonitor::exit
  • If the monitor has been inflated for an object Obj which is owned by the current thread
    • All corresponding entries for Obj are removed from the lock stack
    • The monitor's recursions field is set to the number of removed entries - 1
    • The owner is changed from anonymous to the current thread
    • The regular ObjectMonitor action (enter or exit) is called
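
As a rough illustration, here is a minimal C++ sketch of the enter and exit fast paths described above. The LockStack helpers (top, second_from_top, push, pop) and the exact markWord accessors are illustrative stand-ins, not the exact HotSpot API:

  // Sketch only: helper names are assumed for illustration.
  bool lightweight_enter(oop obj, LockStack& lock_stack) {
    markWord mark = obj->mark_acquire();
    if (mark.is_unlocked()) {
      // Unlocked (0b01) => Locked (0b00), then push Obj.
      if (obj->cas_set_mark(mark.set_fast_locked(), mark) == mark) {
        lock_stack.push(obj);
        return true;
      }
      return false;  // CAS failed; inflate and call ObjectMonitor::enter.
    }
    if (mark.is_fast_locked() && lock_stack.top() == obj) {
      lock_stack.push(obj);  // Consecutive recursive enter: only the top entry is inspected.
      return true;
    }
    return false;  // Inflate and call ObjectMonitor::enter.
  }

  bool lightweight_exit(oop obj, LockStack& lock_stack) {
    if (lock_stack.top() != obj) {
      return false;  // Unstructured exit: inflate and call ObjectMonitor::exit.
    }
    if (lock_stack.second_from_top() == obj) {
      lock_stack.pop();  // Recursive exit: pop only, mark word stays Locked (0b00).
      return true;
    }
    // Last entry for Obj: transition Locked (0b00) => Unlocked (0b01), then pop.
    markWord mark = obj->mark();
    if (obj->cas_set_mark(mark.set_unlocked(), mark) == mark) {
      lock_stack.pop();
      return true;
    }
    return false;  // Mark word changed concurrently; fall back to the monitor path.
  }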

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8319797: Recursive lightweight locking: Runtime implementation (Sub-task - P4)

Contributors

  • Stefan Karlsson <stefank@openjdk.org>
  • Erik Österlund <eosterlund@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16606/head:pull/16606
$ git checkout pull/16606

Update a local copy of the PR:
$ git checkout pull/16606
$ git pull https://git.openjdk.org/jdk.git pull/16606/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16606

View PR using the GUI difftool:
$ git pr show -t 16606

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16606.diff

Webrev

Link to Webrev Comment


xmas92 commented Nov 10, 2023

Testing

As no port implements this yet, the LM_LIGHTWEIGHT testing here is less interesting: it simply verifies that non-recursive locking still works on a CPU that has not yet implemented recursive lightweight locking. As such, the full gtest is never run here. See the dependent CPU implementation PRs for more extensive testing.

  • linux-x64, linux-x64-debug
    • LM_LEGACY
      • tier 1-7
    • LM_LIGHTWEIGHT
      • tier 1-7
  • GitHub actions


bridgekeeper bot commented Nov 10, 2023

👋 Welcome back aboldtch! A progress list of the required criteria for merging this PR into pr/16603 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 10, 2023

openjdk bot commented Nov 10, 2023

@xmas92 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Nov 10, 2023
@rkennke rkennke left a comment


Good work! Mostly looks good! I only have some minor comments and a question.

markWord mark = obj()->mark_acquire();
while (mark.is_unlocked()) {
  if (lock_stack.is_full()) {
    // The emitted code always goes into the runtime in case the lock stack
Contributor

What is the rationale behind this block? Is it beneficial to inflate the top-most lock to make room for the new one, because that might be hotter? If so, then it may be even more useful to inflate the bottom-most entry instead?

Member

I'm also unclear on the rationale, and again on checking for a full-stack upfront like this, when it should be a rare case. If recursion support means the lockStack is no longer big enough then we need to increase its size to accommodate that.

Member Author

What is the rationale behind this block? Is it beneficial to inflate the top-most lock to make room for the new one, because that might be hotter? If so, then it may be even more useful to inflate the bottom-most entry instead?

The current implementation inflates the bottom (least recently added) entry.

The rationale is that, because the emitted code always goes into the runtime for monitorenter if the lock stack is full, we need to inflate at least one object on the lock stack to avoid a scenario where we are constantly going into the runtime because we are in some deeply nested critical sections, entering and exiting in a loop with the lock stack full.

I also have versions of this which go through the lock stack, first removing entries for already inflated objects, and only inflating a not-yet-inflated object if the lock stack is still full.

As for inflating the bottom instead of the top, I am unsure what would be best. The idea behind the bottom is that it is furthest away from the currently running code; if the top is in a loop with a different object every time, inflating the top would cause a lot of inflation. But it could obviously also be that the stack is in a loop where the bottom-most object is different every time while the top is the same.
I can't say that I have seen programs with either of these behaviours. Both can have equally bad worst-case programs (with respect to the number of inflations), but my gut feeling is that the worst case is less likely when inflating the bottom.

If recursion support means the lockStack is no longer big enough then we need to increase its size to accommodate that.

I have not seen it being a problem, but it would be worth looking for programs where this could be an issue and evaluating an increase in the lock stack size. Regardless of the capacity, if (and when) the lock stack gets full, it needs to be handled in some way.

I'm also unclear on the rationale, and again on checking for a full-stack upfront like this, when it should be a rare case.

The check for a full lock stack is always performed in every code path: emitted C2, emitted shared code, and the runtime.

This only adds an escape hatch for the degenerate behaviour we could otherwise arrive in.
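
For illustration, a rough sketch of this escape hatch under the behaviour described above; LockStack::bottom() and remove() are assumed helper names, not necessarily the PR's actual code:

  // Sketch only: inflate the bottom-most (least recently pushed) entry to
  // make room on a full lock stack.
  void make_room_on_lock_stack(JavaThread* current, LockStack& lock_stack) {
    assert(lock_stack.is_full(), "only needed when the lock stack is full");
    oop victim = lock_stack.bottom();
    ObjectMonitor* mon = inflate(current, victim, inflate_cause_vm_internal);
    size_t removed = lock_stack.remove(victim);
    if (mon->is_owner_anonymous()) {
      mon->set_owner_from_anonymous(current);
    }
    mon->set_recursions(removed - 1);  // Entries beyond the first count as recursions.
  }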

Member

I would expect that we will encounter a full lockstack, of the current size 4, much more often with recursion support, and so we probably should increase it. But the handling of a full stack should still come last IMO.

Member Author

The current lock stack capacity is 8.

Worth noting is that when we add support for holding monitors in Loom, we will probably transfer the lock stack while mounting and unmounting. If this is done via indirection instead of copying, adding a dynamically resizable lock stack would not be that much of an effort.

But the handling of a full stack should still come last IMO.
I am unsure what you mean by last here. The idea is to make room for the current object on the lock stack.

We could make this conditional on the state of the object's locking. But the only cases where making room on the lock stack is arguably less useful are if this is a recursive enter on an inflated monitor, or a contended enter on a fast-locked object.

Then there is also what I already described, where you could remove already inflated objects first.

But having a full lock stack is not something I have encountered as a problem in the different benchmarks. So I did not want to add complex logic which tried to reason about when and how to make room on the lock stack, and instead simply say: if it is full, make room.

So if I understand you correctly, you want to inflate the current object's monitor unconditionally if the lock stack is full.

Member

The current lock stack capacity is 8

Okay, my recollection was that they found 4 was sufficient, but I guess that changed.

So if I understand you correctly, you want to inflate the current objects monitor unconditionally if the lock stack is full.

No, I just expected the check for a full lock stack to happen when it is actually needed, not up front: i.e. try to lock recursively, have it fail because the stack is full, then make some room and try again. The fullness check would happen inside the lock stack code, not out in the caller. I'm not asking you to do this, just clarifying what I meant by not checking for it first.

Member

@dholmes-ora dholmes-ora left a comment

I get the gist of this but I have a number of concerns about what is getting checked and when. I have a suspicion that recursion support is making the lock stack full more often, and thus invalidating the sizing of the lock stack that was based on there being no recursion support. In that case we need to bump the size. A full lock stack should be rare, not something we check for first.

// Make sure the layout of the object is compatible with the emitted code's assumptions.
STATIC_ASSERT(sizeof(_bad_oop_sentinel) == oopSize);
STATIC_ASSERT(sizeof(_base[0]) == oopSize);
STATIC_ASSERT(std::is_standard_layout<LockStack>::value);
Member

What is this? Are we allowed to use it?

Member Author

There is probably more nuance here w.r.t. offsetof than I know.
My belief was that the reason we did not use offsetof is that we use it on non-standard-layout types, for which it is invalid. But the lock stack is a standard-layout type.

However, reading some of the issues surrounding offsetof (mainly poor compiler support, and it becoming conditionally-supported in C++17), there might be more reasons to avoid it. If that is the case, this property would have to be asserted at runtime instead.

Maybe @kimbarrett has some more insight.

Member

To be clear I was querying the use of std::is_standard_layout here.

Contributor

https://en.cppreference.com/w/cpp/language/classes#Standard-layout_class

TIL. Maybe this should be in our allowed list of features?
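
As a concrete illustration of what standard layout buys here (a standalone sketch, not the PR's code):

  #include <cstddef>
  #include <type_traits>

  struct Example {   // standard-layout: no virtuals, single access control
    int _top;
    int _base[8];
  };

  // offsetof is only guaranteed to be well-defined for standard-layout
  // types (conditionally-supported otherwise since C++17), so assert the
  // property at compile time before relying on the offset.
  static_assert(std::is_standard_layout<Example>::value,
                "offsetof requires standard layout");
  constexpr std::size_t top_offset = offsetof(Example, _top);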

Comment on lines +392 to +395
if (lock_stack.is_full()) {
  // Always go into runtime if the lock stack is full.
  return false;
}
Member

It isn't obvious that it is beneficial to check what should be a rare occurrence. Why do this?

Member Author

All lightweight enters must check whether the lock stack is full. Both push and try_recursive_enter have that as a precondition. All code paths (emitted C2, emitted shared code, and the runtime) do the is-full check first.

The reason that quick_enter does this without checking the mark word (for a monitor) is that we go into the runtime from the emitted code if the lock stack is full. So we want to enter the runtime to inflate and make room, to avoid a scenario where we have to go into the runtime on every monitor enter because we are locking new objects in a loop with a full lock stack.
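
A sketch of that precondition, with assumed helper names (is_empty, top) rather than the exact LockStack code:

  // Sketch only: the caller is expected to have handled a full lock stack
  // before calling, e.g. by inflating an entry to make room.
  bool LockStack::try_recursive_enter(oop obj) {
    assert(!is_full(), "caller must check is_full() and make room first");
    if (!is_empty() && top() == obj) {
      push(obj);  // Consecutive recursive enter.
      return true;
    }
    return false;
  }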

Contributor

FWIW, when I did the original LW-locking implementation, and the lock stack was not yet fixed-size, I experimented with an optimisation for this problem: instead of doing the check for every monitorenter, I let C2 analyse the maximum lock-stack depth that the method is going to require, and did the check (and growing of the lock stack) at method entry. However, I haven't seen a case where this was beneficial, and there have been several problems with the approach as well (maybe that's why it wasn't beneficial). But maybe it is worth revisiting at some point? OTOH, with recursive locking we need to load and check the top-offset anyway, which makes the extra cost of checking for overflow even smaller.

Contributor

Isn't checking that the lock stack is full a prerequisite for trying to recursively enter the lock? I.e., we have to check first.


Comment on lines 608 to 609
} else if (mark.is_fast_locked() && lock_stack.is_recursive(object)) {
  // This lock is recursive but unstructured exit. Just inflate the lock.
Member

Again this seems in the wrong place - this should be a very rare case so we should not be checking it explicitly before the expected cases!

Member Author

In exit we must always check for recursions first. I am unsure what you are proposing here.

Maybe you want to call remove first, and branch on whether the number removed is greater than 1, and in that case inflate and update the recursions field before falling through. Something like this:

      // Fast-locking does not use the 'lock' argument.
      LockStack& lock_stack = current->lock_stack();
      if (mark.is_fast_locked()) {
        if (lock_stack.try_recursive_exit(object)) {
          // Recursively unlocked.
          return;
        }

        size_t recursions = lock_stack.remove(object) - 1;

        if (recursions == 0) {
          while (mark.is_fast_locked()) {
            const markWord new_mark = mark.set_unlocked();
            const markWord old_mark = mark;
            mark = object->cas_set_mark(new_mark, old_mark);
            if (old_mark == mark) {
              return;
            }
          }
        }

        lock_stack.push(object);
        ObjectMonitor* mon = inflate(current, object, inflate_cause_vm_internal);
        if (mon->is_owner_anonymous()) {
          mon->set_owner_from_anonymous(current);
        }
        mon->set_recursions(recursions);
      }

This makes the code a little more like the emitted code, except it is conditioned on the mark word lock bits.
It is hard to believe this will make a measurable difference, but at least to me it is noisier.

Member

It wasn't the recursions I was querying but the unstructured locking aspect.

Member

To be clear, my concern is that for a simple exit we not only have to first check for a recursive exit (fine) but also this unexpected, rare unstructured-locking recursive case. Thinking it through, part of the problem is that a simple exit does itself allow for unstructured locking. Is it worth adding an additional case to peek at the top of the lock stack and then do an exit with a pop for the most common non-recursive case? That way we in effect handle things as follows:

  • recursive exit
  • direct exit
  • recursive unstructured exit
  • direct unstructured exit

Member Author

@xmas92 xmas92 Nov 15, 2023

First off, let us note that when we reach this code the unstructured exit is the common case. The normal exit and recursive exit are usually handled in the emitted code (this includes the interpreter). We reach this either because a CAS failed somewhere due to a concurrent hashCode installation, or because the exit was unstructured. Inflated monitor exit just jumps past this code (everything is conditioned on mark.is_fast_locked()).

Is this motivated by the structure/layout of the C++ code, or as an optimisation?

If it is motivated by the structure/layout, then we can lay it out as you described. It would add some code duplication.

If it is motivated as an optimisation, then after the recursive exit fails we should just call remove and act based on the return value.

Member

I would not go so far as to say the unstructured locking case is common. Sure, we are on the slow path, which may be due to unstructured locking, or we may be here through deopt (also a slow path) or through the native method wrapper, or ... but yes, this is not really performance critical.

Member

Perhaps changing this comment will help. Please consider:

  // This lock is recursive but is not at the top of the lock stack so we're
  // doing an unstructured exit. We have to fall thru to inflation below and
  // let ObjectMonitor::exit() do the unlock.

This check on L570 has to be done before the block that begins on L572
because that block assumes that object is not a recursive lock and it
will remove all instances of object from the lock stack on L578. That
would unexpectedly unlock object too many times.

Just as a side note: The block from L572 -> L584 also handles both
a structured exit of a non-recursive lock AND an unstructured exit of a
non-recursive lock. This is true because we use remove() on L578
instead of a pop() which assumes top-of-stack/structured locking.
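
To make the pop() vs remove() distinction concrete, a hedged sketch with illustrative field names (the actual LockStack code may differ):

  // Sketch only: pop() assumes structured locking (the object is at the top
  // of the stack), while remove() scans the whole stack and therefore also
  // handles unstructured exits.
  oop LockStack::pop() {
    assert(_top > 0, "must be non-empty");
    return _base[--_top];
  }

  size_t LockStack::remove(oop obj) {
    size_t removed = 0;
    for (size_t i = 0; i < _top; ) {
      if (_base[i] == obj) {
        for (size_t j = i; j + 1 < _top; j++) {
          _base[j] = _base[j + 1];  // Shift later entries down over the hole.
        }
        _top--;
        removed++;
      } else {
        i++;
      }
    }
    return removed;  // Number of entries removed; recursions = removed - 1.
  }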


xmas92 commented Jan 23, 2024

/contributor add @stefank
/contributor add @fisk


openjdk bot commented Jan 23, 2024

@xmas92
Contributor Stefan Karlsson <stefank@openjdk.org> successfully added.


openjdk bot commented Jan 23, 2024

@xmas92
Contributor Erik Österlund <eosterlund@openjdk.org> successfully added.


@rkennke rkennke left a comment


Great stuff - looks good to me! Thank you!
/Roman


@dcubed-ojdk dcubed-ojdk left a comment


I last reviewed v08 of this PR. This review is for v12 of the PR.
Don't forget to make a pass to update the appropriate copyright years.

Again, I only have nits or questions that need some clarifications.

@dcubed-ojdk
Member

This change is looking really good. This time I did my crawl thru using the webrev which
is likely why I found some things that I missed in previous reviews. Sorry about that.
I'm going to start doing crawl thru reviews of the X64 and aarch64 patches also.


@dcubed-ojdk dcubed-ojdk left a comment


Thanks for making the updates. One minor typo.


@dcubed-ojdk dcubed-ojdk left a comment


I last reviewed v15 of this PR so I reviewed the incremental v17.


coleenp commented Feb 11, 2024

We have done performance testing on the recursive locking implementation vs the existing lightweight locking implementation. The results were much better on some worst cases, like J2DBench, reduced monitor inflation, and were neutral on others, with a couple of small regressions on linux-aarch64 (Derby) that we will investigate as further work before making LM_LIGHTWEIGHT the default. We at Oracle feel that this PR is ready for integration. It's a nice implementation and improvement, and great work. Thank you!
This applies to the x86 and aarch64 implementation PRs as well, and we're looking forward to seeing the riscv and ppc ports and results from those platforms also.


xmas92 commented Feb 13, 2024

Thanks for all the reviews. tier1-7 has been run with both LM_LEGACY and LM_LIGHTWEIGHT on both linux x86_64 and aarch64, and tier1-3 on windows and macosx.
/integrate


openjdk bot commented Feb 13, 2024

Going to push as commit 5dbf137.
Since your change was applied there have been 54 commits pushed to the master branch.

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Feb 13, 2024
@openjdk openjdk bot closed this Feb 13, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Feb 13, 2024

openjdk bot commented Feb 13, 2024

@xmas92 Pushed as commit 5dbf137.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
