
8266557: assert(SafepointMechanism::local_poll_armed(_handshakee)) failed: Must be #3973

Closed · wants to merge 11 commits

Conversation

Contributor

@robehn robehn commented May 11, 2021

Please consider this change-set which addresses the issue at hand.

I identified two problems:

  • is_locked() uses the _owner field, which is unordered (no storestore|storeload) on the store side.
    Fixed by leaving the handshakee being processed in the queue until the operation is completed.
    I also removed the looping, since whenever the queue is empty the handshakee may proceed.
    If we ever want to loop again, we must make sure the queue is not empty before removing the processed handshake operation.
    But there is, at this moment, no performance benefit to that, so I chose the simple, easy-to-reason-about version. (Only some crazy stress test could see a difference.)

Note that I'll do a follow-up and make is_locked() ifdef ASSERT only.

  • have_non_self_executable_operation() only provides a correct acquire if the first item in the queue matches; if the second item matches, its load could be re-ordered with reading the poll state.
    Fixed by adding a loadload.

I could at first reproduce this (~1/50) by checking _active_handshaker in update_poll and increasing the test time by a factor of ten.
(The real reproducer hit ~1/400 with the increased test time.)
If we are updating the poll there should not be an active handshaker.
The fixes above resolved the issue.
But after a rebase, while trying to pinpoint the issue, I could no longer reproduce it even without any changes.

Added Atomic store/load for _active_handshaker since it may be loaded concurrently (it is only ever used to ask whether the current thread is the active handshaker).

Passes stressing relevant test on aarch64 and t1-7.

Thanks, Robbin


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8266557: assert(SafepointMechanism::local_poll_armed(_handshakee)) failed: Must be

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3973/head:pull/3973
$ git checkout pull/3973

Update a local copy of the PR:
$ git checkout pull/3973
$ git pull https://git.openjdk.java.net/jdk pull/3973/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3973

View PR using the GUI difftool:
$ git pr show -t 3973

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3973.diff

@bridgekeeper

bridgekeeper bot commented May 11, 2021

👋 Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented May 11, 2021

@robehn The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label May 11, 2021
@robehn
Contributor Author

robehn commented May 11, 2021

/label remove hotspot
/label add hotspot-runtime

@openjdk openjdk bot removed the hotspot hotspot-dev@openjdk.org label May 11, 2021
@openjdk

openjdk bot commented May 11, 2021

@robehn
The hotspot label was successfully removed.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label May 11, 2021
@openjdk

openjdk bot commented May 11, 2021

@robehn
The hotspot-runtime label was successfully added.

@robehn robehn marked this pull request as ready for review May 17, 2021 12:57
@openjdk openjdk bot added the rfr Pull request is ready for review label May 17, 2021
@mlbridge

mlbridge bot commented May 18, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

Hi Robbin,

On 17/05/2021 11:02 pm, Robbin Ehn wrote:

Please consider this change-set which address the issue on hand.

I identified two problems:

- is_locked() uses the _owner field which is unordered (no storestore|storeload) on store-side.
Fixed by leaving the handshakee being processed in queue until completed.

Sorry but I'm not seeing how leaving or removing the op from the queue
impacts things here. Which thread is reading the queue and seeing a
wrong value?

Thanks,
David
-----

@robehn
Contributor Author

robehn commented May 18, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

Hi Robbin,

On 17/05/2021 11:02 pm, Robbin Ehn wrote:

Please consider this change-set which address the issue on hand.
I identified two problems:

  • is_locked() uses the _owner field which is unordered (no storestore|storeload) on store-side.
    Fixed by leaving the handshakee being processed in queue until completed.

Sorry but I'm not seeing how leaving or removing the op from the queue
impacts things here. Which thread is reading the queue and seeing a
wrong value?

When the poll is armed the handshakee must go to slow path and lock the HandshakeState lock if:

  • 1: There is a handshake operation to be processed
  • 2: Someone else is in the middle of processing a handshake operation for this handshakee

By having a non-empty queue whenever 1 or 2 is true, the handshakee can just ask whether the queue is non-empty.
If so, go to slow path and grab the HandshakeState lock.

Now 1 is checked with the queue head and 2 is checked with is_locked().

But since the locking thread can store _owner unordered, it may store the owner field after it has read our thread state, e.g. blocked.

What we do in three different scenarios are:
Handshaker adds an operation:
store queue
store poll

Handshaker processing do:
store _owner (lock)
load queue
load poll
load thread state

Handshakee polling path do:
store state
load poll
load owner
load queue

############################################
With the unordered characteristics of the _owner field, it looks like it could play out like this:

Handshakee: state blocked
Handshaker A processor: add: store queue OP Z
Handshaker B processor: processes OP Z and store queue => NULL
Handshaker A processor: add: store poll
Handshaker A processor: LOCK HandshakeState lock (_owner is still NULL)
Handshaker C processor: add: store queue OP X
Handshaker A processor: load queue (sees X)
Handshaker A processor: load poll (armed)
Handshaker A processor: load thread state // OK blocked
Handshakee T target : store state // unsafe
Handshakee T target : load poll // armed
Handshakee T target : load owner // NULL
Handshaker A processor: store _owner (non-NULL) // re-ordered here !!
Handshaker A processor: store queue (NULL)
Handshakee T target : load queue // NULL
Handshakee T target : disarm // ERROR !!!!

Thanks,
David

@mlbridge

mlbridge bot commented May 19, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 18/05/2021 6:25 pm, Robbin Ehn wrote:

On Tue, 11 May 2021 11:59:23 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

_Mailing list message from [David Holmes](mailto:david.holmes at oracle.com) on [hotspot-runtime-dev](mailto:hotspot-runtime-dev at mail.openjdk.java.net):_

Hi Robbin,

On 17/05/2021 11:02 pm, Robbin Ehn wrote:

Please consider this change-set which address the issue on hand.
I identified two problems:
- is_locked() uses the _owner field which is unordered (no storestore|storeload) on store-side.
Fixed by leaving the handshakee being processed in queue until completed.

Sorry but I'm not seeing how leaving or removing the op from the queue
impacts things here. Which thread is reading the queue and seeing a
wrong value?

When poll is armed the handshakee must go to slow path and lock the HandshakeState lock if:
- 1: There is handshake operation to be processed
- 2: Some-else is in the middle of processing a handshake operation for this handshakee

By having having a non-empty queue if 1 or 2 is true, the handshakee can just ask if there is a non-empty queue.
If so go to slow path and grab the HandshakeState lock.

Now 1 is checked with queue head and 2 is checked with is_locked().

But since the thread locking, storing _owner, can do this unordered, it may store the owner field after it have read our thread state, e.g. blocked.

Okay ... I'm trying to decide if we have a bug in our Mutex
implementation, or whether the use of is_locked() is incorrect. The
problem is that checking _owner is not sufficient to know if a Mutex is
locked because that is a stand-alone action that happens after the
actual locking takes place. Even if we fixed the ordering issue by
performing a fence() after storing _owner, that would not fix the basic
problem. So any use of Mutex::is_locked() has to be treated with
suspicion as it can return false when the Mutex actually is locked.

So setting that aside I'm happy to see should_process() and the use of
is_locked() disappear, and I will re-examine the PR with a fresh
perspective. :)

Thanks,
David
-----

Member

@dholmes-ora dholmes-ora left a comment

Hi Robbin,

I'm afraid these changes are very unclear to me. Other than not using is_locked() to determine if there is more work to do I really don't follow how the other changes relate to the failure mode.

Thanks,
David

  MutexLocker ml(&_lock, Mutex::_no_safepoint_check_flag);
- HandshakeOperation* op = pop_for_self();
+ HandshakeOperation* op = get_op_for_self();
Member

I still don't understand why we now do a peek here, and then do the pop below (in remove-op), instead of just doing the pop here?

Contributor Author

The reason is, from above:

When poll is armed the handshakee must go to slow path and lock the HandshakeState lock if:

  • 1: There is a handshake operation to be processed
  • 2: Someone else is in the middle of processing a handshake operation for this handshakee

This, in practice, means the first pointer of the queue must be non-NULL, and be seen as non-NULL, until after we have processed the handshake operation.
By doing:
1: get op
2: do op
  • When the handshakee first checks the poll, it is armed.
  • Then it checks the global poll (safepoint) and the queue state.
  • If there is a handshake operation to be processed, go to slow path.
3: remove op
  • The first pointer can now be set to NULL, which means the handshakee may elide the slow path.
  • The handshakee sees the poll armed, but no global poll and nothing in the queue.
  • The handshakee may disarm the poll.

By simply removing the OP only after we have executed it, we know the handshakee must go to slow path.

Contributor Author

Oh, sorry, you mean when you are self-processing?

It's because of symmetry: everyone does the same thing, which is far easier to reason about.

@@ -565,39 +567,31 @@ HandshakeState::ProcessResult HandshakeState::try_process(HandshakeOperation* ma

Thread* current_thread = Thread::current();

HandshakeState::ProcessResult pr_ret = HandshakeState::_processed;
int executed = 0;
HandshakeOperation* op = get_op();
Member

Again why not pop here?

Contributor Author

See above.

If there was only one OP in the queue, the handshakee may now disarm its poll and continue.
If there is more in the queue it will still go to slow path.

Contributor Author

So the comment above applies here: we must not let the queue become empty until the OP is completed.


executed++;
}
} while (have_non_self_executable_operation());
Member

Not at all clear how/why this looping can now be removed.

Contributor Author

@robehn robehn May 24, 2021

It must be removed, since we have concurrent adds:
Thread P: get OP X
Thread P: do OP X
Thread P: remove OP X
Handshakee: is queue empty? Yes, can continue to e.g. Java
Thread X: add OP Z
Thread P: Are there more OPs to execute? Yes => Error, the handshakee did not see that OP.

So Thread P needs to verify that the handshakee did see that operation. To do that we need to make sure the poll is armed and that the handshakee is in a handshake-safe state. Which means we need to redo everything; by just returning from this method we do exactly that.

p2i(current_thread), p2i(_handshakee),
op == match_op ? "including" : "excluding", p2i(match_op));

return op == match_op ? HandshakeState::_succeeded : HandshakeState::_processed;
Member

Pre-existing: what does _processed mean versus _succeeded?

Contributor Author

  • Processed means we made progress, so do not yield.
  • Succeeded means the OP you added to the queue was processed. (For your purposes you don't need to revisit this handshakee.)

@@ -113,6 +113,7 @@ uintptr_t SafepointMechanism::compute_poll_word(bool armed, uintptr_t stack_wate
}

void SafepointMechanism::update_poll_values(JavaThread* thread) {
assert(thread == Thread::current(), "Must not be");
Member

I assume the message should say "Must be".

Contributor Author

Yes, fixing.

@mlbridge

mlbridge bot commented May 24, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 24/05/2021 8:25 pm, Robbin Ehn wrote:

On Mon, 24 May 2021 04:33:18 GMT, David Holmes <dholmes at openjdk.org> wrote:

Please consider this change-set which address the issue on hand.

I identified two problems:

- is_locked() uses the _owner field which is unordered (no storestore|storeload) on store-side.
Fixed by leaving the handshakee being processed in queue until completed.
And remove looping, since if ever the queue is empty the handshakee may processed.
If ever want to loop again, we must make sure queue is not empty before removing the processed handshake.
But there is, at this moment, no performance benefit to that, so I chosse the simple, easy to reason about version. (some crazy stress test can see a difference)

Note that I'll do a follow-up and make is_locked() ifdef ASSERT only.

- have_non_self_executable_operation() only provide correct acquire if first in queue matched, if second item matched it could be re-orderd with reading the poll state.
Fixed by adding a loadload.

I could at first reproduce by checking _active_handshaker in update_poll (~1/50) and an increase in the test time by ten.
(real reprod ~1/400 with increased test time)
If we are updating the poll there should not be an active handshaker.
The above fixed the issue.
But after a rebase when I was trying to pin point the issue I could no longer reproduce even without any changes.

Added Atomic store/load to _active_handshaker since it may be concurrently loaded (it may only be used to ask if current thread is active handshaker).

Passes stressing relevant test on aarch64 and t1-7.

Thanks, Robbin

src/hotspot/share/runtime/handshake.cpp line 473:

471: while (has_operation()) {
472: MutexLocker ml(&_lock, Mutex::_no_safepoint_check_flag);
473: HandshakeOperation* op = get_op_for_self();

I still don't understand why we now do a peek here, and then do the pop below (in remove-op), instead of just doing the pop here?

The reason is, from above:

When poll is armed the handshakee must go to slow path and lock the HandshakeState lock if:
- 1: There is handshake operation to be processed
- 2: Some-else is in the middle of processing a handshake operation for this handshakee

This, in practice, means the first pointer of the queue most be non-NULL, and seen as non-NULL, until after we have processed the handshake operation.
By doing:
1: get op
2: do op
- When handshakee first checks poll, it is armed.
- Then it checks global poll (safepoint) and the queue state.
- If there is a handshake operation to be processed goto slow path.
3: remove op
- First ptr can be set NULL, which means the handshakee may elide the slow path.
- The handshakee sees poll armed, but no global poll and nothing in queue.
- The handshakee may disarm the poll.

By simply removing the OP after we have executed we know the handshakee must go to slow path.

What are you calling the "slow path" here?

I think I need a refresher course on the basics of the handshake
mechanism as I am not understanding this at all sorry.

David
-----


@robehn
Contributor Author

robehn commented May 24, 2021

What are you calling the "slow path" here?

"need_rechecking = thread->handshake_state()->has_operation() && thread->handshake_state()->process_by_self();"

If has_operation() returns false we continue out of the transition into an unsafe state.
Thus has_operation() must return true until it is safe to do so.

It's the same as for global_poll(), which must return true until the safepoint is over.

Thanks, Robbin

I think I need a refresher course on the basics of the handshake
mechanism as I am not understanding this at all sorry.

David

Contributor

@pchilano pchilano left a comment

Hi Robbin,

I think your patch looks good and fixes the issue with is_locked().
Based on the root cause, it seems another, simpler solution would have been to add a fence between claim_handshake() and can_process_handshake() to make sure the target sees the write to _owner before we read a handshake-safe state. If the issue is that _owner should only be used in debug builds, or that the fence should have been in the Mutex class and we don't want to affect all other cases, then we could use _active_handshaker instead of _owner. That way we could also keep the while loop in try_process() to process multiple operations at a time, and not worry about the peek vs pop issue.

Thanks,
Patricio

@@ -327,7 +327,7 @@ void HandshakeOperation::do_handshake(JavaThread* thread) {
// here to make sure memory operations executed in the handshake
// closure are visible to the VMThread/Handshaker after it reads
// that the operation has completed.
- Atomic::dec(&_pending_threads, memory_order_release);
+ Atomic::dec(&_pending_threads);
Contributor

Why do we need a full fence instead of a release as before?

Member

And does the comment on L326 now need to be updated?

Contributor Author

@robehn robehn Jun 2, 2021

We have three options here:
1: "If the handshake operation is not completed it must be on at least one queue."
2: "If the handshake operation is completed it is not on any queue."
3: "It may be on queue while completed and may not be queue when not completed, it depends."

With the polling mechanism saying you must stop while the queue is not empty, and the operation being removed after it is executed, option 1 seemed preferable to me.
Especially if we want a simpler, easier model to avoid any further bugs like this one.

This means removal from the queue happens after the decrement of _pending_threads.
remove_op() -> pop() gives no ordering guarantees; those come from the mutex.

The technical note on the current implementation of the queue is:
only the first element can be read without the lock, and manipulation of the first element is done with a CAS.
So a side-effect of the current implementation is that the only thing that can be seen will be ordered.

Since we are running in the slow path there is no need to reduce ordering to a minimum, and we can thus change the queue implementation, since it gives no guarantee of any ordering, without worrying about someone (ab)using side-effects.

Do you agree?

Contributor

Right, but I don't see why the change of just using the queue to check for pending handshakes matters here. This release was to ensure that if the handshakee executed the closure, the memory operations executed in the handshake would be visible to the VMThread/Handshaker after it sees that the operation has completed (matched with the acquire after is_completed()). What could be the issue if this is not a full fence?

Contributor Author

"Change of just using the queue to check for pending handshakes" also means we must remove the handshake from the queue after it has been executed.
I consider the decrement the marking of "it has been executed".

Therefore anything after the decrement should stay after it, and not float above the dec, to follow the protocol.
Disregarding ordering side-effects when removing the last element in the queue with a CAS: if the removal happens before the decrement, then if the handshakee asks "should I process?" it could say no before we decrement.

The release on the removal of the operation is matched with the acquire in is_empty().
Normally the removal would have used the release from the mutex lock, but here we hold the lock over several operations.
Contributor

I still think we can keep the previous memory_order_release order, but if this stricter order makes it easier to read for others then I'm okay.

Contributor Author

@robehn robehn Jun 4, 2021

Added comment.

But that would be relying on implementation details of a generic queue.
This would be a perfectly fine change-set:

diff --git a/src/hotspot/share/utilities/filterQueue.inline.hpp b/src/hotspot/share/utilities/filterQueue.inline.hpp
index 2fa199f2b04..ecb13e57db7 100644
--- a/src/hotspot/share/utilities/filterQueue.inline.hpp
+++ b/src/hotspot/share/utilities/filterQueue.inline.hpp
@@ -94 +94 @@ E FilterQueue<E>::pop(MATCH_FUNC& match_func) {
-      if (Atomic::cmpxchg(&_first, match, match->_next) == match) {
+      if (Atomic::cmpxchg(&_first, match, match->_next, memory_order_relaxed) == match) {

The CAS will happen after we load _first, since it is loaded with acquire, but both operations may float above the decrement!

Handshakes would then break, because they would be relying on an undocumented side effect.
If we are removing any ordering, we should remove it where it is not needed (the CAS),
not the other way around!

Contributor

As long as we have a release() fence after the handshake closure the reordering of decrementing _pending_threads and removing the operation from the queue shouldn't matter since the target doesn't read _pending_threads.

Contributor Author

"release operation: no reads or writes in the current thread can be reordered after this store."

But stores that happen after this store may be reordered before it. Which means the removal might happen even before the handshake operation was executed.

So we need to stop the store ("the relaxed CAS") from floating above the dec.

Dec with rel+acq gives: "No memory reads or writes in the current thread can be reordered before or after this store."

Contributor

By release() fence I was referring to the definition in orderAccess (LoadStore | StoreStore), sorry for the confusion, which would be enough to prevent any of those writes from floating above the barrier. I now see that dec( ,memory_order_release) might not necessarily provide that, though, so we would need to use OrderAccess::release() + dec( ,memory_order_relaxed) instead. But I see that using a full fence is probably easier to read. Also, looking at the implementations of dec() for arm32/aarch64, it seems the order parameter is ignored (always a full fence) (?), which would defeat the purpose of using release().

@@ -327,7 +327,7 @@ void HandshakeOperation::do_handshake(JavaThread* thread) {
// here to make sure memory operations executed in the handshake
// closure are visible to the VMThread/Handshaker after it reads
// that the operation has completed.
Atomic::dec(&_pending_threads, memory_order_release);
Atomic::dec(&_pending_threads);
Member

And does the comment on L326 now need to be updated?

@dcubed-ojdk
Member

I've stopped my review because I realized that I have gotten confused by
the peek() operation that actually appears to be searching for a match
rather than peeking at the head (or tail) of the queue. My brain is no longer
clear on what your operations are. Sorry.

@robehn
Contributor Author

robehn commented Jun 2, 2021

Hi Robbin,

I think your patch looks good and fixes the issue with is_locked().
Based on the root cause it seems another, simpler solution would have been to add a fence between claim_handshake() and can_process_handshake() to make sure the target sees the write to _owner before we read a handshake safe state. If the issue is that _owner should only be used in debug builds, or that the fence should have been in the Mutex class and we don't want to affect all other cases, then we could use _active_handshaker instead of _owner. That way we could also keep the while loop in try_process() to process multiple operations at a time, and not worry about the peek vs pop issue.

Thanks,
Patricio

Thanks for having a look.

The main reason for not just adding the fence or using another state is that it leaves us with three different variables.
Since keeping track of three states (poll, queue and owner) turns out to be a handful, I hope that reducing it back to two states makes the code easier to reason about.
Thus keeping the queue non-empty while executing (which was also suggested instead of using is_locked when it was introduced) is the simplest model (but a bigger change).
This also mirrors safepoints: poll, global flag => wait barrier
With: poll, queue => HandshakeState lock

There is no performance benefit to looping, except in some tests which do millions of handshakes in a few seconds.
If we ever need looping again: when removing the executed op, if it was removed by CAS:ing _first to NULL we must abort the loop; otherwise we can continue.

But removing the looping also simplifies the model and removes many scenarios which we no longer need to reason about.
Again, this bug shows that we should simplify, so more people can reason about whether this is correct or not.

Thanks Robbin
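The two-state model described above can be sketched as follows. This is a hypothetical, single-threaded simplification (the names try_process_one, Op and the std::deque backing are assumptions, not the HotSpot implementation); the point is only that the operation stays in the queue while it executes, so "queue non-empty" doubles as "a handshake may be in progress" and no separate _owner/is_locked state is needed:

```cpp
#include <deque>

struct Op {
    bool executed = false;
    void do_it() { executed = true; }
};

struct HandshakeQueue {
    std::deque<Op*> _ops;

    // No looping: process at most one operation per call, and only
    // remove it from the queue after it has run to completion. An
    // empty queue therefore means the handshakee may proceed.
    bool try_process_one() {
        if (_ops.empty()) return false;  // empty queue => safe state
        Op* op = _ops.front();           // peek; do NOT pop yet
        op->do_it();
        _ops.pop_front();                // remove only after execution
        return true;
    }
};
```

This does mean the list is traversed twice (once to find the op, once to remove it), which is the small overhead discussed below.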

@robehn
Contributor Author

robehn commented Jun 2, 2021

I've stopped my review because I realized that I have gotten confused by
the peek() operation that actually appears to be searching for a match
rather than peeking at the head (or tail) of the queue. My brain is no longer
clear on what your operations are. Sorry.

Thanks, sorry about that, hopefully something above can clear that out!

@pchilano
Contributor

pchilano commented Jun 2, 2021

Hi Robbin,
I think your patch looks good and fixes the issue with is_locked().
Based on the root cause it seems another, simpler solution would have been to add a fence between claim_handshake() and can_process_handshake() to make sure the target sees the write to _owner before we read a handshake safe state. If the issue is that _owner should only be used in debug builds, or that the fence should have been in the Mutex class and we don't want to affect all other cases, then we could use _active_handshaker instead of _owner. That way we could also keep the while loop in try_process() to process multiple operations at a time, and not worry about the peek vs pop issue.
Thanks,
Patricio

Thanks for having a look.

The main reason for not just adding the fence or using another state is that it leaves us with three different variables.
Since keeping track of three states (poll, queue and owner) turns out to be a handful, I hope that reducing it back to two states makes the code easier to reason about.
Thus keeping the queue non-empty while executing (which was also suggested instead of using is_locked when it was introduced) is the simplest model (but a bigger change).
This also mirrors safepoints: poll, global flag => wait barrier
With: poll, queue => HandshakeState lock

There is no performance benefit to looping, except in some tests which do millions of handshakes in a few seconds.
If we ever need looping again: when removing the executed op, if it was removed by CAS:ing _first to NULL we must abort the loop; otherwise we can continue.

But removing the looping also simplifies the model and removes many scenarios which we no longer need to reason about.
Again, this bug shows that we should simplify, so more people can reason about whether this is correct or not.

Thanks Robbin

Ok, I see. Yes, the code is now simpler at the expense of a little more overhead (no loop, and having to traverse the list twice: once for the get and once for the remove). But I guess if in the future we see it's needed we could always add the loop back with the previous approach.

@dcubed-ojdk
Member

Has there been an update since my last review? I recorded that I
reviewed v02 and that appears to still be the current version.

@dcubed-ojdk
Member

I've re-reviewed v02 and this time I used the full webrev. It was much
easier to wrap my head around this time. It could also be because I'm
re-reviewing much earlier in the day. :-)

@robehn
Contributor Author

robehn commented Jun 4, 2021

I've re-reviewed v02 and this time I used the full webrev. It was much
easier to wrap my head around this time. It could also be because I'm
re-reviewing much earlier in the day. :-)

Good, thanks! Now there is a small update.

Member

@dcubed-ojdk left a comment

I did a crawl thru review of the v05 full webrev.
Thumbs up.

There are a few copyright years that need to be updated.

@openjdk

openjdk bot commented Jun 7, 2021

@robehn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8266557: assert(SafepointMechanism::local_poll_armed(_handshakee)) failed: Must be

Reviewed-by: pchilanomate, dcubed

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the master branch:

  • 61ab4b9: 8267564: JDK-8252971 causes SPECjbb2015 socket exceptions on Windows when MKS is installed

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Jun 7, 2021
@mlbridge

mlbridge bot commented Jun 8, 2021

Mailing list message from David Holmes on hotspot-runtime-dev:

On 7/06/2021 1:49 pm, Patricio Chilano Mateo wrote:

On Fri, 4 Jun 2021 22:46:08 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

As long as we have a release() fence after the handshake closure the reordering of decrementing _pending_threads and removing the operation from the queue shouldn't matter since the target doesn't read _pending_threads.

"release operation: no reads or writes in the current thread can be reordered after this store."

But stores happened after this store may be reordered before. Which means the removal might happen even before the handshake operation was executed.

So we need to stop the store ("the relaxed CAS") from floating above the dec.

Dec with rel+acq gives: "No memory reads or writes in the current thread can be reordered before or after this store."

By release() fence I was referring to the definition in orderAccess (LoadStore | StoreStore), sorry for the confusion,

Those "definitions" are not definitions - the semantics of combining
loadstore|storestore are sufficient to implement release but not
necessary. If all you have is loadstore and storestore style barriers
then that combination is what you need to implement release semantics.
But release semantics (and acquire semantics) are weaker than that.

which would be enough to prevent any of those writes to float above the barrier. I now see that dec( ,memory_order_release) might not necessarily provide that though so we would need to use OrderAccess::release() + dec( ,memory_order_relaxed) instead. But I see that using a full fence is probably easier to read. Also by looking at the implementations of dec() for arm32/aarch64 it seems the order parameter is ignored (always full fence) (?), which would defeat the purpose of using release().

We should always try to use the correct logical memory barrier for a
given situation, without too much concern about whether or not a given
platform can implement "finer-grained" barriers. But on the flip-side if
proving the correctness of a weaker barrier is difficult to reason about
(or easily broken by simple code changes) then stick with a coarser
barrier. Performance issues may sometimes force you in a particular
direction in any case.

Cheers,
David
-----

1 similar comment

@robehn
Contributor Author

robehn commented Jun 8, 2021

I did a crawl thru review of the v05 full webrev.
Thumbs up.

There are a few copyright years that need to be updated.

Thank you!

@pchilano
Contributor

pchilano commented Jun 8, 2021

Hi David,

On 7/06/2021 1:49 pm, Patricio Chilano Mateo wrote:

On Fri, 4 Jun 2021 22:46:08 GMT, Robbin Ehn wrote:

As long as we have a release() fence after the handshake closure the reordering of decrementing _pending_threads and removing the operation from the queue shouldn't matter since the target doesn't read _pending_threads.

"release operation: no reads or writes in the current thread can be reordered after this store."
But stores happened after this store may be reordered before. Which means the removal might happen even before the handshake operation was executed.
So we need to stop the store ("the relaxed CAS") from floating above the dec.
Dec with rel+acq gives: "No memory reads or writes in the current thread can be reordered before or after this store."

By release() fence I was referring to the definition in orderAccess (LoadStore | StoreStore), sorry for the confusion,

Those "definitions" are not definitions - the semantics of combining
loadstore|storestore are sufficient to implement release but not
necessary. If all you have is loadstore and storestore style barriers
then that combination is what you need to implement release semantics.
But release semantics (and acquire semantics) are weaker than that.

I'm talking about standalone fences, not the definition of release semantics, which is weaker as you pointed out. The equivalent of storeFence from Unsafe.java, or std::atomic_thread_fence(std::memory_order_release) with C++ atomics.

Patricio
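The distinction drawn here, a release operation attached to one store versus a standalone release fence, can be illustrated with a small sketch (hypothetical names; C++11 atomics):

```cpp
#include <atomic>

std::atomic<int>  payload{0};
std::atomic<bool> flag_a{false};
std::atomic<bool> flag_b{false};

// Release *operation*: ordering is attached to this one store only.
// Earlier accesses cannot move below the release store, but the
// later store to flag_b is free to move above it.
void publish_with_release_store() {
    payload.store(42, std::memory_order_relaxed);
    flag_a.store(true, std::memory_order_release);
    flag_b.store(true, std::memory_order_relaxed);  // unordered vs flag_a
}

// Standalone release *fence* (the Unsafe.storeFence /
// atomic_thread_fence(memory_order_release) flavor): all earlier
// accesses are ordered before *every* subsequent store, so an
// acquiring reader of either flag is guaranteed to see the payload.
void publish_with_release_fence() {
    payload.store(42, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag_a.store(true, std::memory_order_relaxed);
    flag_b.store(true, std::memory_order_relaxed);
}
```

As with the earlier sketches, single-threaded the two variants behave identically; the fence variant simply constrains more of the surrounding stores.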

@robehn
Contributor Author

robehn commented Jun 9, 2021

/integrate

@openjdk openjdk bot closed this Jun 9, 2021
@openjdk openjdk bot added the integrated label and removed the ready and rfr labels Jun 9, 2021
@openjdk

openjdk bot commented Jun 9, 2021

@robehn Since your change was applied there have been 24 commits pushed to the master branch:

  • 4d1cf51: 8240349: jlink should not leave partial image output directory on failure
  • 07108c9: 8268241: deprecate JVM TI Heap functions 1.0
  • c9dbc4f: 8266891: Provide a switch to force the class space to a specific location
  • 2cc1977: 8268424: JFR tests fail due to GC cause 'G1 Preventive Collection' not in the valid causes after JDK-8257774
  • 58a59e3: 8240997: Remove more "hack" word in security codes
  • 1c3932f: 8264766: ClassCastException during template compilation (Variable cannot be cast to Param)
  • f6f82c3: 8266421: Deadlock in Sound System
  • bcaa2cb: 8264144: Add handling of "--about-url" CLI parameter for RPM/DEB packages
  • ae16052: 8268088: Clarify Method::clear_jmethod_ids() related comments in ClassLoaderData::~ClassLoaderData()
  • 5ad4a91: 8268127: Shenandoah: Heap size may be too small for region to align to large page size
  • ... and 14 more: https://git.openjdk.java.net/jdk/compare/00c88f79b30d7867be4a66317b90b9ba7e947f4f...master

Your commit was automatically rebased without conflicts.

Pushed as commit 2bfd708.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@robehn robehn deleted the handshakee branch June 9, 2021 10:42
Labels: hotspot-runtime, integrated
4 participants