8267842: SIGSEGV in get_current_contended_monitor by TheRealMDoerr · Pull Request #4224 · openjdk/jdk

TheRealMDoerr · 2021-05-27T09:56:22Z

We need a fix for crashes in get_current_contended_monitor due to concurrent modification of memory locations which are not declared volatile. See bug for details.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Issue

JDK-8267842: SIGSEGV in get_current_contended_monitor

Reviewers

Stefan Karlsson (@stefank - Reviewer)
Daniel D. Daugherty (@dcubed-ojdk - Reviewer)
Yasumasa Suenaga (@YaSuenag - Reviewer)
David Holmes (@dholmes-ora - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4224/head:pull/4224
$ git checkout pull/4224

Update a local copy of the PR:
$ git checkout pull/4224
$ git pull https://git.openjdk.java.net/jdk pull/4224/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4224

View PR using the GUI difftool:
$ git pr show -t 4224

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4224.diff

bridgekeeper · 2021-05-27T10:04:01Z

👋 Welcome back mdoerr! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2021-05-27T10:07:05Z

@TheRealMDoerr The following label will be automatically applied to this pull request:

hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2021-05-27T10:10:14Z

Webrevs

simonis

Hi Martin,

your fix looks good but I'm a little concerned because there are other call sites which use a similar pattern. E.g. in jvmtiEnvBase.cpp:

JvmtiEnvBase::get_current_contended_monitor(JavaThread *calling_thread, JavaThread *java_thread, jobject *monitor_ptr) {
  Thread *current_thread = Thread::current();
  assert(java_thread->is_handshake_safe_for(current_thread),
         "call by myself or at handshake");
  oop obj = NULL;
  // The ObjectMonitor* can't be async deflated since we are either
  // at a safepoint or the calling thread is operating on itself so
  // it cannot leave the underlying wait()/enter() call.
  ObjectMonitor *mon = java_thread->current_waiting_monitor();
  if (mon == NULL) {
    // thread is not doing an Object.wait() call
    mon = java_thread->current_pending_monitor();
    if (mon != NULL) {
      // The thread is trying to enter() an ObjectMonitor.
      obj = mon->object();
      assert(obj != NULL, "ObjectMonitor should have a valid object!");
    }
    // implied else: no contended ObjectMonitor
  } else {
    // thread is doing an Object.wait() call
    obj = mon->object();
    assert(obj != NULL, "Object.wait() should have an object");
  }

So I wonder if we shouldn't make current_waiting_monitor()/current_pending_monitor() return volatile pointers to make it clear to the callers that these pointers can change at any time?

I'm also not that deep into ThreadService & al. to understand what happens after your fix. Now you don't reload the waiting monitor but you might use it although it has already been cleared out from the thread (in the case where you previously crashed). Is that still OK?

stefank

I wonder if this should be changed to perform the read once by using Atomic::load? That's the guidance we've given the last few years. Some background for this: JDK-8234192.
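To illustrate the read-once guidance, here is a minimal sketch in standard C++. It is not HotSpot code: `std::atomic` with relaxed ordering stands in for HotSpot's `Atomic::load`, and `Monitor`, `ThreadState`, `contended_object` and `usage_example` are hypothetical names chosen for the example.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for ObjectMonitor and JavaThread; not HotSpot types.
struct Monitor {
  const char* object;  // stands in for the value returned by object()
};

struct ThreadState {
  // Another thread may overwrite these fields at any time.
  std::atomic<Monitor*> waiting{nullptr};
  std::atomic<Monitor*> pending{nullptr};
};

// Racy shape, shown only as a comment: with a plain (non-atomic) field,
//   if (t.waiting != nullptr) return t.waiting->object;
// permits two loads of the field, and the second one can observe nullptr
// after the check has already passed.

// Read-once shape: each field is loaded at most once into a local, and only
// that local is checked and dereferenced.
inline const char* contended_object(const ThreadState& t) {
  Monitor* mon = t.waiting.load(std::memory_order_relaxed);
  if (mon == nullptr) {
    mon = t.pending.load(std::memory_order_relaxed);
  }
  return mon != nullptr ? mon->object : nullptr;
}

// Single-threaded usage check of the reader.
inline void usage_example() {
  Monitor entering{"lock A"};
  Monitor waiting_on{"lock B"};
  ThreadState t;
  assert(contended_object(t) == nullptr);            // nothing contended
  t.pending.store(&entering, std::memory_order_relaxed);
  assert(contended_object(t) == entering.object);    // falls back to pending
  t.waiting.store(&waiting_on, std::memory_order_relaxed);
  assert(contended_object(t) == waiting_on.object);  // waiting takes precedence
}
```

The sketch only guarantees that the pointer is read once; whether the monitor it points to is still usable is a separate question that depends on the deflation protocol.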

TheRealMDoerr · 2021-05-27T15:29:13Z

Hi Volker,
thanks for looking at my proposal.

I had seen the JVMTIEnvBase version of it. The comment says:
// The ObjectMonitor* can't be async deflated since we are either
// at a safepoint or the calling thread is operating on itself so
// it cannot leave the underlying wait()/enter() call.

The ThreadService version's comment says:
// This function can be called on a target JavaThread that is not
// the caller and we are not at a safepoint. So it is possible for
// the waiting or pending condition to be over/stale and for the
// first stage of async deflation to clear the object field in
// the ObjectMonitor. It is also possible for the object to be
// inflated again and to be associated with a completely different
// ObjectMonitor by the time this object reference is processed
// by the caller.

So the affected code is a special usage. I don't know if a more generic fix would be desirable.
Accessing the ObjectMonitor after it was removed from the thread seems to be intended according to this comment. To verify that it's safe, one would have to check the protocol which is described here: https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
(not a trivial task!)

dcubed-ojdk · 2021-05-27T15:31:19Z

@TheRealMDoerr - You should also add the Serviceability group for this review.

TheRealMDoerr · 2021-05-27T15:35:31Z

/label add serviceability

openjdk · 2021-05-27T15:36:39Z

@TheRealMDoerr
The serviceability label was successfully added.

TheRealMDoerr · 2021-05-27T15:50:01Z

> I wonder if this should be changed to perform the read once by using Atomic::load? That's the guidance we've given the last few years. Some background for this: JDK-8234192.

Hi Stefan, thanks for looking at it and thanks for the pointer. I do remember that Atomic::load should be preferred. But I would have to use it in thread.hpp and I don't know if I should change it in current_waiting_monitor() and current_pending_monitor(). Would it be acceptable to change the general code for this special usage? I'm not against doing it, I just want to double-check. It may improve reliability in general.

dcubed-ojdk · 2021-05-27T16:43:43Z

The code that you're fixing is indeed a special case and I wrote the new test
(serviceability/monitoring/ThreadInfo/GetLockOwnerName/GetLockOwnerName.java)
specifically to stress the crashing code path. Unfortunately, I have never seen the
new test fail in our testing in Mach5 or in my testing in my lab. We did get a single closed
test failure back on 2021.03.20 which is what started me down the path of creating
the new test. That single failure has never reproduced in the original closed test nor
in the targeted test that I wrote (GetLockOwnerName.java).

The similar usage in JVM/TI is safe for exactly the reasons explained in the comment so a
general Atomic::load() solution in current_waiting_monitor() or current_pending_monitor()
is not necessary.

I think your solution of adding volatile to wait_obj and enter_obj is a good solution.
I would like to see a comment added to explain the need for the volatile. I'll add an
embedded comment in the other PR view.

Using either wait_obj or enter_obj after it has been cleared from the JavaThread is
safe. The ObjectMonitor can only have gone through the first stage of async deflation.
The ObjectMonitor's memory cannot be freed until after a handshake with all threads
is completed and that cannot happen while this thread is executing the code that is
using wait_obj or enter_obj.
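The argument above can be sketched as a single-threaded simulation of one interleaving. This is only an illustration under simplifying assumptions: `Monitor`, `ThreadState` and `read_through_stale_snapshot` are hypothetical names, `std::atomic` stands in for HotSpot's own primitives, and the final `delete` merely plays the role of the reclamation that can happen only after the handshake with all threads.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins; the real deflation protocol is far more involved.
struct Monitor {
  std::atomic<const char*> object{nullptr};  // cleared by the first stage
};

struct ThreadState {
  std::atomic<Monitor*> waiting{nullptr};
};

// Simulates one interleaving in a single thread and returns what the reader
// ends up seeing through its stale snapshot.
inline const char* read_through_stale_snapshot() {
  Monitor* mon = new Monitor;
  mon->object.store("lock", std::memory_order_relaxed);
  ThreadState target;
  target.waiting.store(mon, std::memory_order_relaxed);

  // Reader: one load of the field into a local snapshot.
  Monitor* snap = target.waiting.load(std::memory_order_relaxed);

  // Interleaved: the target leaves wait() and the first stage of deflation
  // clears the object field. The monitor's memory is NOT freed yet.
  target.waiting.store(nullptr, std::memory_order_relaxed);
  mon->object.store(nullptr, std::memory_order_relaxed);

  // Reader continues: the snapshot still points at live memory, so the
  // dereference is safe; it simply yields a null object.
  assert(snap == mon);
  const char* seen = snap->object.load(std::memory_order_relaxed);

  // Reclamation happens only after the reader is done (the analogue of the
  // handshake with all threads mentioned above).
  delete mon;
  return seen;
}
```

In this interleaving the reader gets a null object rather than a crash, which is why the caller has to treat a null result as "no contended monitor".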

dcubed-ojdk

Thumbs up.

dcubed-ojdk · 2021-05-27T16:43:13Z

src/hotspot/share/services/threadService.cpp

  // ObjectMonitor by the time this object reference is processed
  // by the caller.
-  ObjectMonitor *wait_obj = thread->current_waiting_monitor();
+  ObjectMonitor* volatile wait_obj = thread->current_waiting_monitor();


Please add a comment above this declaration. Something like:

// Using 'volatile' to prevent the compiler from generating code that
// reloads 'wait_obj' from memory when used below.
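Since the placement of the qualifier matters in this declaration, here is a small sketch of what `ObjectMonitor* volatile` applies to. `Monitor`, `id_or_zero` and `usage_example` are hypothetical names. The sketch does not make the initial read of the shared value atomic; it only shows the declarator syntax and the check-then-use shape on a volatile local.

```cpp
#include <cassert>
#include <type_traits>

struct Monitor { int id; };  // hypothetical stand-in for ObjectMonitor

using VolatilePointer   = Monitor* volatile;  // the pointer variable is volatile
using PointerToVolatile = volatile Monitor*;  // the pointed-to object is volatile

static_assert(std::is_volatile<VolatilePointer>::value,
              "'Monitor* volatile' qualifies the pointer itself");
static_assert(!std::is_volatile<std::remove_pointer<VolatilePointer>::type>::value,
              "'Monitor* volatile' leaves the pointee unqualified");
static_assert(!std::is_volatile<PointerToVolatile>::value,
              "'volatile Monitor*' leaves the pointer unqualified");
static_assert(std::is_volatile<std::remove_pointer<PointerToVolatile>::type>::value,
              "'volatile Monitor*' qualifies the pointee");

// Shape of the suggested declaration: copy the shared value into a volatile
// local, then check and dereference only that local.
inline int id_or_zero(Monitor* const& shared_slot) {
  Monitor* volatile snap = shared_slot;
  if (snap != nullptr) {
    return snap->id;
  }
  return 0;
}

inline void usage_example() {
  Monitor m{7};
  Monitor* slot = &m;
  assert(id_or_zero(slot) == 7);
  slot = nullptr;
  assert(id_or_zero(slot) == 0);
}
```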

dcubed-ojdk · 2021-05-27T16:43:17Z

src/hotspot/share/services/threadService.cpp

    obj = wait_obj->object();
  } else {
-    ObjectMonitor *enter_obj = thread->current_pending_monitor();
+    ObjectMonitor* volatile enter_obj = thread->current_pending_monitor();


Please add a comment above this declaration. Something like:

// Using 'volatile' to prevent the compiler from generating code that
// reloads 'enter_obj' from memory when used below.

openjdk · 2021-05-27T16:45:07Z

@TheRealMDoerr This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8267842: SIGSEGV in get_current_contended_monitor

Reviewed-by: stefank, dcubed, ysuenaga, dholmes

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 23 new commits pushed to the master branch:

964bac9: 8267706: bin/idea.sh tries to use cygpath on WSL
591b0c3: 8264624: change the guarantee() calls added by JDK-8264123 to assert() calls
0c0ff7f: 8265309: com/sun/jndi/dns/ConfigTests/Timeout.java fails with "Address already in use" BindException
24bf35f: 8265367: [macos-aarch64] 3 java/net/httpclient/websocket tests fail with "IOException: No buffer space available"
1413f9e: 8241423: NUMA APIs fail to work in dockers due to dependent syscalls are disabled by default
1d2c7ac: 8267555: Fix class file version during redefinition after 8238048
97ec5ad: 8265753: Remove manual JavaThread transitions to blocked
6eb9114: 8266877: Missing local debug information when debugging JEP-330
0c9daa7: 8265029: Preserve SIZED characteristics on slice operations (skip, limit)
95b1fa7: 8267529: StringJoiner can create a String that breaks String::equals
... and 13 more: https://git.openjdk.java.net/jdk/compare/7278f56bb6345d7b023516d0f44de71cd74ff264...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

TheRealMDoerr · 2021-05-27T16:57:49Z

Thanks for the review! Comment added. We have only seen the crash on s390. How readily the compiler reloads values depends on the platform.

dholmes-ora · 2021-05-28T06:45:49Z

Hi Martin,

I'm with @stefank. This special case is a race condition and the solution for that is Atomic::load/store. Given Atomic::load/store don't actually need to do anything in practice (other than act as a marker of a race) I don't have any qualms about using them all the time.

Separately, I'm unclear why we allow this race to exist. I thought we took snapshots when threads were known to be safe and stable. But that is a separate issue.

Cheers,
David

YaSuenag · 2021-05-28T07:06:50Z

I agree with @stefank and @dholmes-ora. It is natural to make the change in thread.hpp to fix this problem.

> Separately, I'm unclear why we allow this race to exist. I thought we took snapshots when threads were known to be safe and stable. But that is a separate issue.

Now we have Thread-Local Handshakes. I think we should use them here.

TheRealMDoerr · 2021-05-28T10:16:34Z

Thanks, folks, for reviewing!
Changed to use Atomic::load/store. With this version, we use volatile accesses to these two members consistently. This may restrict compiler optimizations a bit, but I wouldn't expect a performance problem.
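A rough sketch of what such accessors can look like, written in standard C++ rather than HotSpot code: `std::atomic` with relaxed ordering stands in for a volatile field accessed through `Atomic::load`/`Atomic::store`. The class `ThreadState`, the setter name and the helper `pending_id` are assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>

struct Monitor { int id; };  // hypothetical stand-in for ObjectMonitor

// Hypothetical miniature of a thread object. Every access to the field goes
// through an accessor, so readers and writers use atomic operations
// consistently.
class ThreadState {
 public:
  Monitor* current_pending_monitor() const {
    return _current_pending_monitor.load(std::memory_order_relaxed);
  }
  void set_current_pending_monitor(Monitor* m) {
    _current_pending_monitor.store(m, std::memory_order_relaxed);
  }

 private:
  std::atomic<Monitor*> _current_pending_monitor{nullptr};
};

// A caller still copies the result into a local once and works on the copy.
inline int pending_id(const ThreadState& t) {
  Monitor* mon = t.current_pending_monitor();  // single atomic load
  return mon != nullptr ? mon->id : -1;
}
```

With the accessor in place, a caller that null-checks and then dereferences its local copy cannot be affected by a later store to the field.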

YaSuenag

Shouldn't we add volatile to both _current_pending_monitor and _current_waiting_monitor?

TheRealMDoerr · 2021-05-28T13:22:06Z

> Shouldn't we add volatile to both _current_pending_monitor and _current_waiting_monitor?

Right. Done. Thanks!

dcubed-ojdk

Thumbs up. Just a couple of suggestions about the comments.

dcubed-ojdk · 2021-05-28T13:25:18Z

src/hotspot/share/runtime/thread.hpp

+    // Using atomic load to prevent compilers from reloading (ThreadService::get_current_contended_monitor).
+    // In case of concurrent modification, reloading pointer after NULL check must be prevented.


Perhaps rewrite the comment like this:

// Use Atomic::load() to prevent data race between concurrent modification and
// concurrent readers, e.g., ThreadService::get_current_contended_monitor().

dcubed-ojdk · 2021-05-28T13:26:33Z

src/hotspot/share/runtime/thread.hpp

  }
  ObjectMonitor* current_waiting_monitor() {
-    return _current_waiting_monitor;
+    // Using atomic load as in current_pending_monitor.


Perhaps:
// See the comment in current_pending_monitor() above.

dcubed-ojdk · 2021-05-28T13:33:45Z

> I thought we took snapshots when threads were known to be safe and stable.

When we ask for snapshots with stack traces, we use a safepoint to get all of
the target threads' stack traces at the same time. Obviously that code path is safe.

The "other" code path is when we don't ask for stack traces and that path has
always been carefully coded to return non stack trace information in a safe manner,
but it does not use a safepoint or a handshake.

This duality in code paths is why the new test I wrote:
serviceability/monitoring/ThreadInfo/GetLockOwnerName/GetLockOwnerName.java
makes alternating calls of asking for the stack trace and then not.

TheRealMDoerr · 2021-05-28T13:39:28Z

Thanks for the review! I have improved the comments.

dcubed-ojdk

Thumbs up.

stefank

Looks good. Thanks!

dholmes-ora

Looks good.

Thanks,
David

TheRealMDoerr · 2021-05-31T08:26:21Z

Thanks for the reviews!
/integrate

openjdk · 2021-05-31T08:28:18Z

@TheRealMDoerr Since your change was applied there have been 27 commits pushed to the master branch:

236bd89: 8263583: Emoji rendering on macOS
1ab2776: 8247608: Javadoc: CSS margin is not applied consistently
9031477: 8267945: ZGC: Revert NUMA changes (JDK-8266217 and JDK-8241354) after JDK-8241423
6627432: 8267953: restore 'volatile' to ObjectMonitor::_owner field
964bac9: 8267706: bin/idea.sh tries to use cygpath on WSL
591b0c3: 8264624: change the guarantee() calls added by JDK-8264123 to assert() calls
0c0ff7f: 8265309: com/sun/jndi/dns/ConfigTests/Timeout.java fails with "Address already in use" BindException
24bf35f: 8265367: [macos-aarch64] 3 java/net/httpclient/websocket tests fail with "IOException: No buffer space available"
1413f9e: 8241423: NUMA APIs fail to work in dockers due to dependent syscalls are disabled by default
1d2c7ac: 8267555: Fix class file version during redefinition after 8238048
... and 17 more: https://git.openjdk.java.net/jdk/compare/7278f56bb6345d7b023516d0f44de71cd74ff264...master

Your commit was automatically rebased without conflicts.

Pushed as commit 1e29005.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

8267842: SIGSEGV in get_current_contended_monitor

6e444fa

openjdk bot added the rfr Pull request is ready for review label May 27, 2021

openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label May 27, 2021

simonis reviewed May 27, 2021

stefank suggested changes May 27, 2021

openjdk bot added the serviceability serviceability-dev@openjdk.org label May 27, 2021

dcubed-ojdk approved these changes May 27, 2021

openjdk bot added the ready Pull request is ready to be integrated label May 27, 2021

Add comments as suggested by Dan.

97a7df6

New solution: Use Atomic::load/store in current_pending/waiting_monit…

f4005ec

…or accessors.

YaSuenag reviewed May 28, 2021

TheRealMDoerr added 2 commits May 28, 2021 15:09

Make the 2 member fields volatile.

932a7df

whitespace fix

c3e5888

dcubed-ojdk approved these changes May 28, 2021

Improve comments.

49354d0

dcubed-ojdk approved these changes May 28, 2021

stefank approved these changes May 28, 2021

YaSuenag approved these changes May 28, 2021

dholmes-ora approved these changes May 28, 2021

openjdk bot closed this May 31, 2021

openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 31, 2021

TheRealMDoerr deleted the 8267842_get_current_contended_monitor branch May 31, 2021 08:29

mlbridge bot mentioned this pull request Jun 7, 2021

8267579: Thread::cooked_allocated_bytes() hits assert(left >= right) failed: avoid underflow #4363

Closed

3 tasks
