8267842: SIGSEGV in get_current_contended_monitor #4224

Conversation

TheRealMDoerr
Contributor

@TheRealMDoerr TheRealMDoerr commented May 27, 2021

We need a fix for crashes in get_current_contended_monitor due to concurrent modification of memory locations which are not declared volatile. See bug for details.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8267842: SIGSEGV in get_current_contended_monitor

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4224/head:pull/4224
$ git checkout pull/4224

Update a local copy of the PR:
$ git checkout pull/4224
$ git pull https://git.openjdk.java.net/jdk pull/4224/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4224

View PR using the GUI difftool:
$ git pr show -t 4224

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4224.diff

@bridgekeeper

bridgekeeper bot commented May 27, 2021

👋 Welcome back mdoerr! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 27, 2021
@openjdk

openjdk bot commented May 27, 2021

@TheRealMDoerr The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label May 27, 2021
@mlbridge

mlbridge bot commented May 27, 2021

Webrevs

Member

@simonis simonis left a comment

Hi Martin,

your fix looks good but I'm a little concerned because there are other call sites which use a similar pattern. E.g. in jvmtiEnvBase.cpp:

JvmtiEnvBase::get_current_contended_monitor(JavaThread *calling_thread, JavaThread *java_thread, jobject *monitor_ptr) {
  Thread *current_thread = Thread::current();
  assert(java_thread->is_handshake_safe_for(current_thread),
         "call by myself or at handshake");
  oop obj = NULL;
  // The ObjectMonitor* can't be async deflated since we are either
  // at a safepoint or the calling thread is operating on itself so
  // it cannot leave the underlying wait()/enter() call.
  ObjectMonitor *mon = java_thread->current_waiting_monitor();
  if (mon == NULL) {
    // thread is not doing an Object.wait() call
    mon = java_thread->current_pending_monitor();
    if (mon != NULL) {
      // The thread is trying to enter() an ObjectMonitor.
      obj = mon->object();
      assert(obj != NULL, "ObjectMonitor should have a valid object!");
    }
    // implied else: no contended ObjectMonitor
  } else {
    // thread is doing an Object.wait() call
    obj = mon->object();
    assert(obj != NULL, "Object.wait() should have an object");
  }

So I wonder if we shouldn't make current_waiting_monitor()/current_pending_monitor() return volatile pointers to make it clear to the callers that these pointers can change at any time?

I'm also not that deep into ThreadService & al. to understand what happens after your fix. Now you don't reload the waiting monitor but you might use it although it has already been cleared out from the thread (in the case where you previously crashed). Is that still OK?

Member

@stefank stefank left a comment

I wonder if this should be changed to perform the read once by using Atomic::load? That's the guidance we've given the last few years. Some background for this: JDK-8234192.

@TheRealMDoerr
Contributor Author

Hi Volker,
thanks for looking at my proposal.

I had seen the JVMTIEnvBase version of it. The comment says:
// The ObjectMonitor* can't be async deflated since we are either
// at a safepoint or the calling thread is operating on itself so
// it cannot leave the underlying wait()/enter() call.

The ThreadService version's comment says:
// This function can be called on a target JavaThread that is not
// the caller and we are not at a safepoint. So it is possible for
// the waiting or pending condition to be over/stale and for the
// first stage of async deflation to clear the object field in
// the ObjectMonitor. It is also possible for the object to be
// inflated again and to be associated with a completely different
// ObjectMonitor by the time this object reference is processed
// by the caller.

So the affected code is a special usage. I don't know if a more generic fix would be desirable.
Accessing the ObjectMonitor after it was removed from the thread seems to be intended according to this comment. To verify that it's safe, one would have to check the protocol which is described here: https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
(not a trivial task!)
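
The access pattern described in the ThreadService comment can be sketched outside HotSpot as follows. This is only an illustration under stated assumptions: `Object`, `Monitor`, `FakeThread` and `contended_object` are hypothetical names, `std::atomic` stands in for whatever single-read primitive the VM uses, and the sketch assumes the `Monitor` memory stays allocated while the reader runs (it does not model the deflation protocol itself). Each field is read once into a local, and a cleared object field is treated as a valid "stale" answer rather than an error.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for illustration; these are not the HotSpot classes.
struct Object {
  int id;
};

struct Monitor {
  // May be cleared concurrently (loosely modeled on the first stage of deflation).
  std::atomic<Object*> object{nullptr};
};

struct FakeThread {
  // Written by the owning thread, read by observers without a safepoint.
  std::atomic<Monitor*> waiting_monitor{nullptr};
  std::atomic<Monitor*> pending_monitor{nullptr};
};

// Returns the contended object, or nullptr if there is none or the
// information went stale while we were looking.
Object* contended_object(const FakeThread* t) {
  assert(t != nullptr);
  // Snapshot each field once; the null check and the dereference below
  // both use the local copy, never a second read of the field.
  Monitor* mon = t->waiting_monitor.load(std::memory_order_acquire);
  if (mon == nullptr) {
    mon = t->pending_monitor.load(std::memory_order_acquire);
  }
  if (mon == nullptr) {
    return nullptr;  // neither waiting nor trying to enter
  }
  // The object field may already have been cleared; nullptr is acceptable.
  return mon->object.load(std::memory_order_acquire);
}
```

A caller of such a function has to accept `nullptr` in both situations, which mirrors the "over/stale" wording of the comment quoted above.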

@dcubed-ojdk
Member

@TheRealMDoerr - You should also add the Serviceability group for this review.

@TheRealMDoerr
Contributor Author

/label add serviceability

@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label May 27, 2021
@openjdk

openjdk bot commented May 27, 2021

@TheRealMDoerr
The serviceability label was successfully added.

@TheRealMDoerr
Contributor Author

> I wonder if this should be changed to perform the read once by using Atomic::load? That's the guidance we've given the last few years. Some background for this: JDK-8234192.

Hi Stefan, thanks for looking at it and thanks for the pointer. I do remember that Atomic::load should be preferred. But I would have to use it in thread.hpp and I don't know if I should change it in current_waiting_monitor() and current_pending_monitor(). Would it be acceptable to change the general code for this special usage? I'm not against doing it, I just want to double-check. It may improve reliability in general.

@dcubed-ojdk
Member

The code that you're fixing is indeed a special case and I wrote the new test
(serviceability/monitoring/ThreadInfo/GetLockOwnerName/GetLockOwnerName.java)
specifically to stress the crashing code path. Unfortunately, I have never seen the
new test fail in our testing in Mach5 or in my testing in my lab. We did get a single closed
test failure back on 2021.03.20 which is what started me down the path of creating
the new test. That single failure has never reproduced in the original closed test nor
in the targeted test that I wrote (GetLockOwnerName.java).

The similar usage in JVM/TI is safe for exactly the reasons explained in the comment so a
general Atomic::load() solution in current_waiting_monitor() or current_pending_monitor()
is not necessary.

I think your solution of adding volatile to wait_obj and enter_obj is a good solution.
I would like to see a comment added to explain the need for the volatile. I'll add an
embedded comment in the other PR view.

Using either wait_obj or enter_obj after it has been cleared from the JavaThread is
safe. The ObjectMonitor can only have gone through the first stage of async deflation.
The ObjectMonitor's memory cannot be freed until after a handshake with all threads
is completed and that cannot happen while this thread is executing the code that is
using wait_obj or enter_obj.

Member

@dcubed-ojdk dcubed-ojdk left a comment

Thumbs up.

- ObjectMonitor *wait_obj = thread->current_waiting_monitor();
+ ObjectMonitor* volatile wait_obj = thread->current_waiting_monitor();
Member

Please add a comment above this declaration. Something like:

// Using 'volatile' to prevent the compiler from generating code that
// reloads 'wait_obj' from memory when used below.

- ObjectMonitor *enter_obj = thread->current_pending_monitor();
+ ObjectMonitor* volatile enter_obj = thread->current_pending_monitor();
Member

Please add a comment above this declaration. Something like:

// Using 'volatile' to prevent the compiler from generating code that
// reloads 'enter_obj' from memory when used below.
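
The volatile-local variant discussed in this review can be sketched in isolation like this. It is a simplified illustration, not the HotSpot code: `Monitor`, `FakeThread` and `waiting_monitor_id` are hypothetical names, and the field is deliberately a plain (non-atomic, non-volatile) pointer as in the original code. The intent is that the local is initialized from the field once and every later use reads the local, so the compiler should not be able to replace a use with a fresh read of the field.

```cpp
#include <cassert>

// Hypothetical placeholder types; not HotSpot's ObjectMonitor or JavaThread.
struct Monitor {
  int id;
};

struct FakeThread {
  Monitor* waiting_monitor = nullptr;  // plain field, may be changed by its owner
};

// Returns the id of the monitor the thread waits on, or -1 if there is none.
int waiting_monitor_id(const FakeThread* t) {
  assert(t != nullptr);
  // 'volatile' on the local (not on the pointee): accesses to 'wait_obj'
  // are real accesses to this local, not re-reads of t->waiting_monitor.
  Monitor* volatile wait_obj = t->waiting_monitor;
  if (wait_obj == nullptr) {
    return -1;
  }
  return wait_obj->id;
}
```

Note that ISO C++ still treats an unsynchronized concurrent write to the plain field as a data race, so this pattern relies on compiler behavior rather than on a language guarantee.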

@openjdk

openjdk bot commented May 27, 2021

@TheRealMDoerr This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8267842: SIGSEGV in get_current_contended_monitor

Reviewed-by: stefank, dcubed, ysuenaga, dholmes

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 23 new commits pushed to the master branch:

  • 964bac9: 8267706: bin/idea.sh tries to use cygpath on WSL
  • 591b0c3: 8264624: change the guarantee() calls added by JDK-8264123 to assert() calls
  • 0c0ff7f: 8265309: com/sun/jndi/dns/ConfigTests/Timeout.java fails with "Address already in use" BindException
  • 24bf35f: 8265367: [macos-aarch64] 3 java/net/httpclient/websocket tests fail with "IOException: No buffer space available"
  • 1413f9e: 8241423: NUMA APIs fail to work in dockers due to dependent syscalls are disabled by default
  • 1d2c7ac: 8267555: Fix class file version during redefinition after 8238048
  • 97ec5ad: 8265753: Remove manual JavaThread transitions to blocked
  • 6eb9114: 8266877: Missing local debug information when debugging JEP-330
  • 0c9daa7: 8265029: Preserve SIZED characteristics on slice operations (skip, limit)
  • 95b1fa7: 8267529: StringJoiner can create a String that breaks String::equals
  • ... and 13 more: https://git.openjdk.java.net/jdk/compare/7278f56bb6345d7b023516d0f44de71cd74ff264...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 27, 2021
@TheRealMDoerr
Contributor Author

Thanks for the review! Comment added. We have only seen the crash on s390. It depends on the platform if the compiler tends to reload values more or less.

@dholmes-ora
Member

Hi Martin,

I'm with @stefank. This special case is a race condition and the solution for that is Atomic::load/store. Given Atomic::load/store don't actually need to do anything in practice (other than act as marker of a race) I don't have any qualms about using them all the time.

Separately, I'm unclear why we allow this race to exist. I thought we took snapshots when threads were known to be safe and stable. But that is a separate issue.

Cheers,
David

@YaSuenag
Member

I agree with @stefank and @dholmes-ora. It is natural to make the change in thread.hpp to fix this problem.

> Separately, I'm unclear why we allow this race to exist. I thought we took snapshots when threads were known to be safe and stable. But that is a separate issue.

Now we have thread-local handshakes. I think we should use them here.

@TheRealMDoerr
Contributor Author

Thanks, folks, for reviewing!
Changed to use Atomic::load/store. With this version, we use volatile accesses to these two members consistently. This may restrict compiler optimizations a bit, but I wouldn't expect a performance problem.
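
A rough standalone sketch of what "volatile accesses to these two members consistently" can look like is given below. The names `load_once`, `store_once` and `FakeJavaThread` are hypothetical and only loosely modeled on the Atomic::load/store usage mentioned above; they are not the HotSpot implementation. The sketch assumes a platform where an aligned pointer-sized volatile access is performed as a single load or store.

```cpp
#include <cassert>

struct Monitor {};  // hypothetical placeholder, not HotSpot's ObjectMonitor

// Simplified stand-in for a single-read helper: going through a
// volatile-qualified lvalue makes each call one access to *p.
template <typename T>
inline T load_once(const volatile T* p) {
  assert(p != nullptr);
  return *p;
}

// Simplified stand-in for a single-write helper.
template <typename T>
inline void store_once(volatile T* p, T v) {
  assert(p != nullptr);
  *p = v;
}

class FakeJavaThread {
 public:
  Monitor* current_pending_monitor() const {
    return load_once(&_current_pending_monitor);
  }
  void set_current_pending_monitor(Monitor* m) {
    store_once(&_current_pending_monitor, m);
  }
  Monitor* current_waiting_monitor() const {
    return load_once(&_current_waiting_monitor);
  }
  void set_current_waiting_monitor(Monitor* m) {
    store_once(&_current_waiting_monitor, m);
  }

 private:
  // Both fields are volatile so that every access, not only the ones in
  // the accessors above, is treated as a potentially racing access.
  Monitor* volatile _current_pending_monitor = nullptr;
  Monitor* volatile _current_waiting_monitor = nullptr;
};
```

With accessors of this shape, every caller gets the read-once behavior without having to declare its own volatile local.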

Member

@YaSuenag YaSuenag left a comment

Shouldn't we add volatile to both _current_pending_monitor and _current_waiting_monitor?

@TheRealMDoerr
Contributor Author

> Shouldn't we add volatile to both _current_pending_monitor and _current_waiting_monitor?

Right. Done. Thanks!

Member

@dcubed-ojdk dcubed-ojdk left a comment

Thumbs up. Just a couple of suggestions about the comments.

Comment on lines 756 to 757
// Using atomic load to prevent compilers from reloading (ThreadService::get_current_contended_monitor).
// In case of concurrent modification, reloading pointer after NULL check must be prevented.
Member

Perhaps rewrite the comment like this:

// Use Atomic::load() to prevent data race between concurrent modification and
// concurrent readers, e.g., ThreadService::get_current_contended_monitor().

@@ -765,10 +767,11 @@ class JavaThread: public Thread {
     return _current_pending_monitor_is_from_java;
   }
   ObjectMonitor* current_waiting_monitor() {
-    return _current_waiting_monitor;
+    // Using atomic load as in current_pending_monitor.
Member

Perhaps:
// See the comment in current_pending_monitor() above.

@dcubed-ojdk
Member

dcubed-ojdk commented May 28, 2021

> I thought we took snapshots when threads were known to be safe and stable.

When we ask for snapshots with stack traces, we use a safepoint to get all of
the target threads' stack traces at the same time. Obviously that code path is safe.

The "other" code path is when we don't ask for stack traces and that path has
always been carefully coded to return non stack trace information in a safe manner,
but it does not use a safepoint or a handshake.

This duality in code paths is why the new test I wrote:
serviceability/monitoring/ThreadInfo/GetLockOwnerName/GetLockOwnerName.java
makes alternating calls of asking for the stack trace and then not.

@TheRealMDoerr
Contributor Author

Thanks for the review! I have improved the comments.

Member

@dcubed-ojdk dcubed-ojdk left a comment

Thumbs up.

Member

@stefank stefank left a comment

Looks good. Thanks!

Member

@dholmes-ora dholmes-ora left a comment

Looks good.

Thanks,
David

@TheRealMDoerr
Contributor Author

Thanks for the reviews!
/integrate

@openjdk openjdk bot closed this May 31, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 31, 2021
@openjdk

openjdk bot commented May 31, 2021

@TheRealMDoerr Since your change was applied there have been 27 commits pushed to the master branch:

  • 236bd89: 8263583: Emoji rendering on macOS
  • 1ab2776: 8247608: Javadoc: CSS margin is not applied consistently
  • 9031477: 8267945: ZGC: Revert NUMA changes (JDK-8266217 and JDK-8241354) after JDK-8241423
  • 6627432: 8267953: restore 'volatile' to ObjectMonitor::_owner field
  • 964bac9: 8267706: bin/idea.sh tries to use cygpath on WSL
  • 591b0c3: 8264624: change the guarantee() calls added by JDK-8264123 to assert() calls
  • 0c0ff7f: 8265309: com/sun/jndi/dns/ConfigTests/Timeout.java fails with "Address already in use" BindException
  • 24bf35f: 8265367: [macos-aarch64] 3 java/net/httpclient/websocket tests fail with "IOException: No buffer space available"
  • 1413f9e: 8241423: NUMA APIs fail to work in dockers due to dependent syscalls are disabled by default
  • 1d2c7ac: 8267555: Fix class file version during redefinition after 8238048
  • ... and 17 more: https://git.openjdk.java.net/jdk/compare/7278f56bb6345d7b023516d0f44de71cd74ff264...master

Your commit was automatically rebased without conflicts.

Pushed as commit 1e29005.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@TheRealMDoerr TheRealMDoerr deleted the 8267842_get_current_contended_monitor branch May 31, 2021 08:29
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org
6 participants