Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8274282: Clarify special wait assert #5684

Closed
wants to merge 4 commits into from
Closed

8274282: Clarify special wait assert #5684

wants to merge 4 commits into from

Conversation

coleenp
Copy link
Contributor

@coleenp coleenp commented Sep 24, 2021

This change removes the assert and tests for Mutex::wait() only allowed with greater than nosafepoint ranked held locks.
Tested with tier1-3.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5684/head:pull/5684
$ git checkout pull/5684

Update a local copy of the PR:
$ git checkout pull/5684
$ git pull https://git.openjdk.java.net/jdk pull/5684/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5684

View PR using the GUI difftool:
$ git pr show -t 5684

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5684.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Sep 24, 2021

👋 Welcome back coleenp! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Sep 24, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Sep 24, 2021

@coleenp The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime label Sep 24, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 24, 2021

Webrevs

@coleenp
Copy link
Contributor Author

@coleenp coleenp commented Sep 24, 2021

I apologize I did a forced update. @pchilano helped me work out why this assert exists in the code @dholmes-ora also worked it out but I didn't get it yesterday. The reason for not waiting while holding a non-safepoint checking lock is that the wait could delay safepoints for a blocked thread, when the thread is a JavaThread. This is because holding the nosafepoint lock has a NoSafepointVerifier, so we won't set the state to blocked while waiting on the lower ranked lock.
I fixed the code to do this assert for JavaThread and assert for NonJavaThread if the lock held is below tty, since that could block a lot of NJT operations. Also updated one of the tests to call with NJT.
This should help the heap dumping code if it reverts to non-safepoint checking locking. And also helps solve this mystery of what this code was for.

@coleenp coleenp changed the title 8274282: Remove special wait assert 8274282: Clarify special wait assert Sep 24, 2021
Copy link
Member

@dholmes-ora dholmes-ora left a comment

Hi Coleen,

The rationale is not quite right to me here. See comments below.

Thanks,
David

@@ -384,14 +384,20 @@ void Mutex::check_rank(Thread* thread) {
if (owned_by_self()) {
// wait() case
Mutex* least = get_least_ranked_lock_besides_this(locks_owned);
// We enforce not holding locks of rank nosafepoint or lower while waiting.
// We enforce not holding locks of rank nosafepoint or lower while waiting for JavaThreads,
// because the held lock has a NoSafepointVerifier so waiting on a lock will block out safepoints.
Copy link
Member

@dholmes-ora dholmes-ora Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NSV is not the reason, it is a mechanism to detect violation of the rules. The reason we must not block in a JavaThread is because it will block without a safepoint check and without becoming safepoint-safe, thus blocking out safepoints while the wait is in progress.

Also note that this applies to any wait() on a nosafepoint lock, not just one done while holding a safepoint lock.

Copy link
Contributor

@pchilano pchilano Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably shed some light on the origin of this comment since it came from discussions I had with Coleen. We were looking at cases like the Service_lock, where the ServiceThread waits on a lock with rank <= nosafepoint but within the context of a ThreadBlockInVM. If a JT would call wait() on a nosafepoint lock and we detect it holds some other nosafepoint lock, then that would imply there was no immediate TBIVM, like with the Service_lock case, because that would have asserted due to the NSV and so we know the call will block safepoints (assuming no manual changes to _thread_blocked that would bypass the check_possible_safepoint() call or some weird use of the TBIVM like: TBIVM, lock(a), lock(b), wait(b), where a and b are nosafepoint locks). But as you pointed out the issue of this JT blocking safepoints is a consequence of waiting on nosafepoint locks, regardless of which other locks the JT is holding. It just happens that if the JT is holding other nosafepoint locks then we already know this will block out safepoints (ruling out those behaviors described above).

I also just realized that waiting while holding a nosafepoint lock could also block out safepoints due to other JT's being blocked in a safepoint-unsafe state while trying to lock that monitor. So even if the waiting JT did a manual transition to _thread_blocked or used the TBIVM in some weird way, safepoints could still be blocked out due to those other JT's.

But in any case, if we want to verify if a wait() call might block safepoints, we should verify all cases where this JT will wait in an unsafe state, and that means checking whether the current lock is nosafepoint and the current state of the JT: "if thread->is_Java_thread() && this->rank <= Mutex::nosafepoint && (thread->thread_state() != _thread_blocked)". Today that would assert since we have cases where a JT waits on a nosafepoint lock without being _thread_blocked, mainly os::create_thread() and JfrThreadSampler::on_javathread_suspend() from some tests I run. VMThread::wait_for_vm_thread_exit() and ThreadsSMRSupport::wait_until_not_protected() also assert but in those cases the JT was already removed from the JT list so it wouldn't be blocking safepoints.

Copy link
Member

@dholmes-ora dholmes-ora Oct 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that a JT should ever switch to _thread_blocked when waiting on a non-safepoint monitor! That is totally confusing/conflating the very notions of having safepoint checking and non-safepoint checking mutexes/monitors. If you have to make yourself appear safepoint-safe when waiting on a non-safepoint monitor then that says to me that the monitor should not be a non-safepoint one! I need to look at the history of how we have defined and used the Service_lock to see why we do this.

The thread creation situation is somewhat complex because the newly running thread is not initially able to participate in safepoints and so although a JT it has to wait without a safepoint check. But the creating thread could theoretically wait with safepoint checks. But that would require the monitor to be safepoint-check-sometimes which we removed. Alternatively we may need to do something different with how _thread_new state is treated.

Anyway this is digressing somewhat. :)

I still dislike the wording of the comment - sorry.

Copy link
Contributor Author

@coleenp coleenp Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation and further experiments @pchilano .

VMThread::wait_for_vm_thread_exit() and ThreadsSMRSupport::wait_until_not_protected() also assert but in those cases the JT was already removed from the JT list so it wouldn't be blocking safepoints.

I wonder if the check should be is_active_Java_thread instead?

The Service_lock, EscapeBarrier_lock, MonitorDeflation_lock, Notification_lock, and CodeSweeper_lock are held with a TBIVM (same mechanism). These locks are fairly low level locks and don't check for safepoints.

I think the problematic scenario that Patricio is referring to is:
Thread1: TBIVM (state == _thread_blocked), comes out of wait so owns Service_lock, does things while safepoint safe
Thread2: waiting on Service_lock, so blocking safepoint
But then Thread1 will come back to wait when done with things, Thread2 will proceed then get to it's safepoint check when done with eg. Service_lock.

Copy link
Contributor Author

@coleenp coleenp Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also trying the experiment that Patricio refers to above:

+    } else if (thread->is_active_Java_thread() && rank() <= Mutex::nosafepoint) {
+      // Don't allow JavaThread to wait on non-safepoint checking lock if not in a safe state.
+      JavaThread* current = JavaThread::cast(thread);
+      assert(current->thread_state() == _thread_blocked || current->thread_state() == _thread_in_native,
+             "Cannot wait on active Java thread in unsafe state");

but the os::create_thread() case is an active Java thread.

Copy link
Contributor Author

@coleenp coleenp Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't easily add this new assert. It's going to take a lot more code so should be done in a separate RFE.

src/hotspot/share/runtime/mutex.cpp Outdated Show resolved Hide resolved
@mlbridge
Copy link

@mlbridge mlbridge bot commented Oct 4, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 4/10/2021 11:16 pm, Coleen Phillimore wrote:

On Fri, 1 Oct 2021 16:46:33 GMT, Patricio Chilano Mateo <pchilanomate at openjdk.org> wrote:

src/hotspot/share/runtime/mutex.cpp line 388:

386: Mutex* least = get_least_ranked_lock_besides_this(locks_owned);
387: // We enforce not holding locks of rank nosafepoint or lower while waiting for JavaThreads,
388: // because the held lock has a NoSafepointVerifier so waiting on a lock will block out safepoints.

The NSV is not the reason, it is a mechanism to detect violation of the rules. The reason we must not block in a JavaThread is because it will block without a safepoint check and without becoming safepoint-safe, thus blocking out safepoints while the wait is in progress.

Also note that this applies to any wait() on a nosafepoint lock, not just one done while holding a safepoint lock.

I can probably shed some light on the origin of this comment since it came from discussions I had with Coleen. We were looking at cases like the Service_lock, where the ServiceThread waits on a lock with rank <= nosafepoint but within the context of a ThreadBlockInVM. If a JT would call wait() on a nosafepoint lock and we detect it holds some other nosafepoint lock, then that would imply there was no immediate TBIVM, like with the Service_lock case, because that would have asserted due to the NSV and so we know the call will block safepoints (assuming no manual changes to _thread_blocked that would bypass the check_possible_safepoint() call or some weird use of the TBIVM like: TBIVM, lock(a), lock(b), wait(b), where a and b are nosafepoint locks). But as you pointed out the issue of this JT blocking safepoints is a consequence of waiting on nosafepoint locks, regardless of which other locks the JT is holding. It just happens that if the JT is holding other nosafepoint locks then we
already know this will block out safepoints (ruling out those behaviors described above).

I also just realized that waiting while holding a nosafepoint lock could also block out safepoints due to other JT's being blocked in a safepoint-unsafe state while trying to lock that monitor. So even if the waiting JT did a manual transition to _thread_blocked or used the TBIVM in some weird way, safepoints could still be blocked out due to those other JT's.

But in any case, if we want to verify if a wait() call might block safepoints, we should verify all cases where this JT will wait in an unsafe state, and that means checking whether the current lock is nosafepoint and the current state of the JT: "if thread->is_Java_thread() && this->rank <= Mutex::nosafepoint && (thread->thread_state() != _thread_blocked)". Today that would assert since we have cases where a JT waits on a nosafepoint lock without being _thread_blocked, mainly os::create_thread() and JfrThreadSampler::on_javathread_suspend() from some tests I run. VMThread::wait_for_vm_thread_exit() and ThreadsSMRSupport::wait_until_not_protected() also assert but in those cases the JT was already removed from the JT list so it wouldn't be blocking safepoints.

Thank you for the explanation and further experiments @pchilano .

VMThread::wait_for_vm_thread_exit() and ThreadsSMRSupport::wait_until_not_protected() also assert but in those cases the JT was already removed from the JT list so it wouldn't be blocking safepoints.

I wonder if the check should be is_active_Java_thread instead?

The Service_lock, EscapeBarrier_lock, MonitorDeflation_lock, Notification_lock, and CodeSweeper_lock are held with a TBIVM (same mechanism). These locks are fairly low level locks and don't check for safepoints.

I think the problematic scenario that Patricio is referring to is:
Thread1: TBIVM (state == _thread_blocked), comes out of wait so owns Service_lock, does things while safepoint safe
Thread2: waiting on Service_lock, so blocking safepoint
But then Thread1 will come back to wait when done with things, Thread2 will proceed then get to it's safepoint check when done with eg. Service_lock.

The ServiceThread releases the Service_lock before doing any significant
work and returns to _thread_in_vm.

David
-----

Copy link
Member

@dholmes-ora dholmes-ora left a comment

LGTM :)

Thanks,
David

@openjdk
Copy link

@openjdk openjdk bot commented Oct 5, 2021

@coleenp This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8274282: Clarify special wait assert

Reviewed-by: dholmes, pchilanomate

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 125 new commits pushed to the master branch:

  • bb0bab5: 8274286: Skip null for make_referent_alive in referenceProcessor
  • 7ad74d8: 8274299: Make Method/Constructor/Field accessors @stable
  • 1459180: 8274079: Cleanup unnecessary calls to Throwable.initCause() in java.base module
  • 8609ea5: 8273342: Null pointer dereference in classFileParser.cpp:2817
  • a5080ef: 8272564: Incorrect attribution of method invocations of Object methods on interfaces
  • a914ee7: 8274632: Possible pointer overflow in PretouchTask chunk claiming
  • 8f7a37c: 8274434: move os::get_default_process_handle and os::dll_lookup to os_posix for POSIX platforms
  • 3953e07: 8271459: C2: Missing NegativeArraySizeException when creating StringBuilder with negative capacity
  • 53d7e95: 8274635: Use String.equals instead of String.compareTo in jdk.accessibility
  • e43f540: 8274651: Possible race in FontDesignMetrics.KeyReference.dispose
  • ... and 115 more: https://git.openjdk.java.net/jdk/compare/56b8b35286634f2d2224ca445bc9f32ff284ae74...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Oct 5, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Oct 5, 2021

Mailing list message from coleen.phillimore at oracle.com on hotspot-runtime-dev:

On 10/4/21 7:06 PM, David Holmes wrote:

On 4/10/2021 11:16 pm, Coleen Phillimore wrote:

On Fri, 1 Oct 2021 16:46:33 GMT, Patricio Chilano Mateo
<pchilanomate at openjdk.org> wrote:

src/hotspot/share/runtime/mutex.cpp line 388:

386:???? Mutex* least =
get_least_ranked_lock_besides_this(locks_owned);
387:???? // We enforce not holding locks of rank nosafepoint or
lower while waiting for JavaThreads,
388:???? // because the held lock has a NoSafepointVerifier so
waiting on a lock will block out safepoints.

The NSV is not the reason, it is? a mechanism to detect violation
of the rules. The reason we must not block in a JavaThread is
because it will block without a safepoint check and without
becoming safepoint-safe, thus blocking out safepoints while the
wait is in progress.

Also note that this applies to any wait() on a nosafepoint lock,
not just one done while holding a safepoint lock.

I can probably shed some light on the origin of this comment since
it came from discussions I had with Coleen. We were looking at cases
like the Service_lock, where the ServiceThread waits on a lock with
rank <= nosafepoint but within the context of a ThreadBlockInVM. If
a JT would call wait() on a nosafepoint lock and we detect it holds
some other nosafepoint lock, then that would imply there was no
immediate TBIVM, like with the Service_lock case, because that would
have asserted due to the NSV and so we know the call will block
safepoints (assuming no manual changes to _thread_blocked that would
bypass the check_possible_safepoint() call or some weird use of the
TBIVM like: TBIVM, lock(a), lock(b), wait(b), where a and b are
nosafepoint locks). But as you pointed out the issue of this JT
blocking safepoints is a consequence of waiting on nosafepoint
locks, regardless of which other locks the JT is holding. It just
happens that if the JT is holding other nosafepoint locks then we
?? already know this will block out safepoints (ruling out those
behaviors described above).

I also just realized that waiting while holding a nosafepoint lock
could also block out safepoints due to other JT's being blocked in a
safepoint-unsafe state while trying to lock that monitor. So even if
the waiting JT did a manual transition to _thread_blocked or used
the TBIVM in some weird way, safepoints could still be blocked out
due to those other JT's.

But in any case, if we want to verify if a wait() call might block
safepoints, we should verify all cases where this JT will wait in an
unsafe state, and that means checking whether the current lock is
nosafepoint and the current state of the JT: "if
thread->is_Java_thread() && this->rank <= Mutex::nosafepoint &&
(thread->thread_state() != _thread_blocked)". Today that would
assert since we have cases where a JT waits on a nosafepoint lock
without being _thread_blocked, mainly os::create_thread() and
JfrThreadSampler::on_javathread_suspend() from some tests I run.
VMThread::wait_for_vm_thread_exit() and
ThreadsSMRSupport::wait_until_not_protected() also assert but in
those cases the JT was already removed from the JT list so it
wouldn't be blocking safepoints.

Thank you for the explanation and further experiments @pchilano .

VMThread::wait_for_vm_thread_exit() and
ThreadsSMRSupport::wait_until_not_protected() also assert but in
those cases the JT was already removed from the JT list so it
wouldn't be blocking safepoints.

I wonder if the check should be is_active_Java_thread instead?

The Service_lock, EscapeBarrier_lock, MonitorDeflation_lock,
Notification_lock, and CodeSweeper_lock are held with a TBIVM (same
mechanism).? These locks are fairly low level locks and don't check
for safepoints.

I think the problematic scenario that Patricio is referring to is:
Thread1: TBIVM (state == _thread_blocked), comes out of wait so owns
Service_lock, does things while safepoint safe
Thread2: waiting on Service_lock, so blocking safepoint
But then Thread1 will come back to wait when done with things,
Thread2 will proceed then get to it's safepoint check when done with
eg. Service_lock.

The ServiceThread releases the Service_lock before doing any
significant work and returns to _thread_in_vm.

Yes except in the oops_do functions, the Service_lock is held while
walking the static deferred JVMTI events.? But the Service_lock waiting
in TBIVM is safe.

Coleen

David
-----

Copy link
Contributor

@pchilano pchilano left a comment

LGTM! Only small comment below.

Thanks,
Patricio

src/hotspot/share/runtime/mutex.cpp Outdated Show resolved Hide resolved
@coleenp
Copy link
Contributor Author

@coleenp coleenp commented Oct 5, 2021

Thanks for the reviews @dholmes-ora and @pchilano !
/integrate

@openjdk
Copy link

@openjdk openjdk bot commented Oct 5, 2021

Going to push as commit 90a5ae8.
Since your change was applied there have been 125 commits pushed to the master branch:

  • bb0bab5: 8274286: Skip null for make_referent_alive in referenceProcessor
  • 7ad74d8: 8274299: Make Method/Constructor/Field accessors @stable
  • 1459180: 8274079: Cleanup unnecessary calls to Throwable.initCause() in java.base module
  • 8609ea5: 8273342: Null pointer dereference in classFileParser.cpp:2817
  • a5080ef: 8272564: Incorrect attribution of method invocations of Object methods on interfaces
  • a914ee7: 8274632: Possible pointer overflow in PretouchTask chunk claiming
  • 8f7a37c: 8274434: move os::get_default_process_handle and os::dll_lookup to os_posix for POSIX platforms
  • 3953e07: 8271459: C2: Missing NegativeArraySizeException when creating StringBuilder with negative capacity
  • 53d7e95: 8274635: Use String.equals instead of String.compareTo in jdk.accessibility
  • e43f540: 8274651: Possible race in FontDesignMetrics.KeyReference.dispose
  • ... and 115 more: https://git.openjdk.java.net/jdk/compare/56b8b35286634f2d2224ca445bc9f32ff284ae74...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Oct 5, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Oct 5, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Oct 5, 2021

@coleenp Pushed as commit 90a5ae8.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@coleenp coleenp deleted the test branch Oct 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime integrated
3 participants