8274196: Crashes in VM_HeapDumper::work after JDK-8252842 #5681

Closed
wants to merge 6 commits into from

Contributor

@linzang linzang commented Sep 24, 2021

In ZGC, the crash occurs because the JNI handles are processed before object iteration, while ZGC only updates the JNI handles during object iteration, through the read barrier. The dumper therefore accesses a stale address, which may hold arbitrary data after a ZGC cycle, and crashes.

The lock rank issue can be fixed because the related mutexes are acquired at a safepoint, so _safepoint_check_required can be _safepoint_check_always.

The Epsilon issue is caused by a wrong _num_dumper_thread value being calculated when gang == NULL.
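
The Epsilon case can be pictured with a small model. This is a hypothetical sketch, not the code from the patch: `ToyWorkGang`, `dumper_thread_count`, and the fall-back-to-one-thread policy are all assumptions made for illustration. It only shows the general shape of the problem, namely that a null gang must not be used to derive the dumper thread count.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a GC worker gang; not a HotSpot type.
struct ToyWorkGang {
  unsigned active_workers;
};

// Assumed policy: with no gang, dump serially on one thread;
// otherwise never ask for more threads than the gang has workers.
unsigned dumper_thread_count(const ToyWorkGang* gang, unsigned requested) {
  if (gang == nullptr) {
    return 1;
  }
  return std::max(1u, std::min(requested, gang->active_workers));
}

// Small self-check exercising both branches.
void check_dumper_thread_count() {
  ToyWorkGang gang{4};
  assert(dumper_thread_count(nullptr, 8) == 1);
  assert(dumper_thread_count(&gang, 8) == 4);
  assert(dumper_thread_count(&gang, 2) == 2);
}
```

Calling `check_dumper_thread_count()` from a test driver exercises the null-gang path alongside the normal one.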


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issues

  • JDK-8274196: Crashes in VM_HeapDumper::work after JDK-8252842
  • JDK-8274245: sun/tools/jmap/BasicJMapTest.java Mutex rank failures

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5681/head:pull/5681
$ git checkout pull/5681

Update a local copy of the PR:
$ git checkout pull/5681
$ git pull https://git.openjdk.java.net/jdk pull/5681/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5681

View PR using the GUI difftool:
$ git pr show -t 5681

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5681.diff

@linzang
Contributor Author

@linzang linzang commented Sep 24, 2021

/issue 8274245

@bridgekeeper

@bridgekeeper bridgekeeper bot commented Sep 24, 2021

👋 Welcome back lzang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Sep 24, 2021
@openjdk

@openjdk openjdk bot commented Sep 24, 2021

@linzang
Adding additional issue to issue list: 8274245: sun/tools/jmap/BasicJMapTest.java Mutex rank failures.

@openjdk

@openjdk openjdk bot commented Sep 24, 2021

@linzang The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime label Sep 24, 2021
@mlbridge

@mlbridge mlbridge bot commented Sep 24, 2021

@@ -748,7 +748,7 @@ class ParDumpWriter : public AbstractDumpWriter {

   static void before_work() {
     assert(_lock == NULL, "ParDumpWriter lock must be initialized only once");
-    _lock = new (std::nothrow) PaddedMonitor(Mutex::leaf, "Parallel HProf writer lock", Mutex::_safepoint_check_never);
+    _lock = new (std::nothrow) PaddedMonitor(Mutex::leaf, "ParallelHProfWriter_lock", Mutex::_safepoint_check_always);
Contributor

@coleenp coleenp Sep 24, 2021

If you change these locks to _safepoint_check_always, you have to acquire them without the Mutex::_no_safepoint_check flags so I don't know why you don't get that assert.

Contributor Author

@linzang linzang Sep 24, 2021

I think it may be because this is actually not a JavaThread, so the assert in Mutex::check_no_safepoint_state passes.
Moreover, I tried PaddedMonitor(Mutex::nosafepoint, "ParallelHProfWriter_lock", Mutex::_safepoint_check_never); here, but the slowdebug build reports the errors you mentioned in JDK-8274245.

Contributor Author

@linzang linzang Sep 24, 2021

I agree that the combination of the flag here and the flag used where the lock is acquired looks problematic. I will check whether I can use Mutex::_safepoint_check_never here without triggering the assert.

Contributor

@coleenp coleenp Sep 24, 2021

yes
void Mutex::check_no_safepoint_state(Thread* thread) {
  check_block_state(thread);
  assert(!thread->is_active_Java_thread() || _safepoint_check_required != _safepoint_check_always,
         "This lock should always have a safepoint check for Java threads: %s",
         name());
}

Yes, we exclude the check for non-Java threads, which I thought was an odd exclusion last time I looked. The tests in sun/tools/jmap/BasicJMapTest.java pass for me, so maybe leave it for now?
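
To make the exclusion concrete, here is a minimal stand-alone model of the condition in the assert quoted above. `ToyThread`, `SafepointCheck`, and `no_safepoint_acquire_passes` are hypothetical names introduced for illustration; only the boolean shape of the condition is taken from the quoted code.

```cpp
#include <cassert>

// Hypothetical miniature of the two pieces of state the assert inspects.
enum class SafepointCheck { Never, Always };

struct ToyThread {
  bool is_active_java_thread;
};

// Same boolean shape as the quoted assert condition: it can only fail for an
// active Java thread that takes an "always" lock without a safepoint check.
bool no_safepoint_acquire_passes(const ToyThread& thread, SafepointCheck required) {
  return !thread.is_active_java_thread || required != SafepointCheck::Always;
}

// Small self-check covering the combinations discussed in this thread.
void check_no_safepoint_acquire() {
  ToyThread worker{false};  // e.g. a non-Java worker thread
  ToyThread java{true};
  assert(no_safepoint_acquire_passes(worker, SafepointCheck::Always));
  assert(!no_safepoint_acquire_passes(java, SafepointCheck::Always));
  assert(no_safepoint_acquire_passes(java, SafepointCheck::Never));
}
```

In this model a non-Java thread passes regardless of how the lock was created, which matches the explanation given for why the assert does not fire here.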

Contributor Author

@linzang linzang Sep 24, 2021

I agree, maybe a new issue could be created for tracking this.

Contributor

@coleenp coleenp Sep 24, 2021

👍

@linzang
Contributor Author

@linzang linzang commented Sep 24, 2021

/label add serviceability

@openjdk openjdk bot added the serviceability label Sep 24, 2021
@openjdk

@openjdk openjdk bot commented Sep 24, 2021

@linzang
The serviceability label was successfully added.

@dcubed-ojdk
Member

@dcubed-ojdk dcubed-ojdk commented Sep 24, 2021

@linzang and @coleenp - I've ProblemListed the test via:

JDK-8274294 ProblemList sun/tools/jmap/BasicJMapTest.java

to give you folks time to sort thru the details.

Contributor

@coleenp coleenp left a comment

The lock stuff looks ok, but please have at least one of the original reviewers review the change.

@@ -44,6 +44,3 @@ serviceability/sa/TestJmapCoreMetaspace.java 8268722,8268636
 serviceability/sa/TestJhsdbJstackMixed.java 8248912 generic-all
 serviceability/sa/ClhsdbPstack.java#process 8248912 generic-all
 serviceability/sa/ClhsdbPstack.java#core 8248912 generic-all
-
-serviceability/dcmd/gc/HeapDumpAllTest.java 8274196 linux-all,windows-all
-serviceability/dcmd/gc/HeapDumpTest.java 8274196 linux-all,windows-all
Contributor

@coleenp coleenp Sep 24, 2021

Before you push, you'll need to do a merge from mainline and should un-ProblemList sun/tools/jmap/BasicJMapTest.java.

@openjdk openjdk bot removed the rfr label Sep 24, 2021
@openjdk

@openjdk openjdk bot commented Sep 26, 2021

@linzang This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8274196: Crashes in VM_HeapDumper::work after JDK-8252842
8274245: sun/tools/jmap/BasicJMapTest.java Mutex rank failures

Reviewed-by: coleenp, pliden, cjplummer

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 49 new commits pushed to the master branch:

  • 355356c: 8273435: Remove redundant zero-length check in ClassDesc.of
  • 97385d4: 8274405: Suppress warnings on non-serializable non-transient instance fields in javac and javadoc
  • 79cebe2: 8274050: Unnecessary Vector usage in javax.crypto
  • 97b2874: 8274509: Remove stray * and stylistic . from doc comments
  • b1b6696: 8274453: (sctp) com/sun/nio/sctp/SctpChannel/CloseDescriptors.java test should be resilient to lsof warnings
  • edd9d1c: 8274330: Incorrect encoding of the DistributionPointName object in IssuingDistributionPointExtension
  • 980c50d: 8272562: C2: assert(false) failed: Bad graph detected in build_loop_late
  • 1dc8fa9: 8274340: [BACKOUT] JDK-8271880: Tighten condition for excluding regions from collecting cards with cross-references
  • aaa36cc: 8274242: Implement fast-path for ASCII-compatible CharsetEncoders on x86
  • c4d1157: 8271855: [TESTBUG] Wrong weakCompareAndSet assumption in UnsafeIntrinsicsTest
  • ... and 39 more: https://git.openjdk.java.net/jdk/compare/5ec1cdcaf39229a7d2457313600b0dc2bf8c6453...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added ready rfr labels Sep 26, 2021
Contributor

@pliden pliden left a comment

In ZGC, the crash occurs because the JNI handles are processed before object iteration, while ZGC only updates the JNI handles during object iteration, through the read barrier. The dumper therefore accesses a stale address, which may hold arbitrary data after a ZGC cycle, and crashes.

The fix here should not be to change the order so that heap iteration happens first; that will just hide the real bug. The real bug is that JNIGlobalsDumper::do_oop() is missing a load barrier. In other words, keep the order and just make sure to add a load barrier, like this:

void JNIGlobalsDumper::do_oop(oop* obj_p) {
  oop o = NativeAccess<AS_NO_KEEPALIVE>::oop_load(obj_p);
  ...
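
As a rough illustration of why a load barrier fixes the problem regardless of ordering, the following toy model contrasts a raw load with a barriered load. It is a hypothetical sketch: `ToyHeap`, its forwarding table, and both load functions are assumptions made for illustration and do not reflect how ZGC or the `NativeAccess` API is actually implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Address = std::uintptr_t;

// Hypothetical heap that has relocated some objects; not ZGC's real design.
struct ToyHeap {
  std::unordered_map<Address, Address> forwarding;  // old address -> new address

  // Raw load: returns whatever the slot holds, even if it is stale.
  Address load_raw(const Address* slot) const { return *slot; }

  // Barriered load: if the slot holds a stale address, follow the forwarding
  // entry and heal the slot so later loads see the current address.
  Address load_with_barrier(Address* slot) const {
    Address value = *slot;
    auto it = forwarding.find(value);
    if (it != forwarding.end()) {
      value = it->second;
      *slot = value;
    }
    return value;
  }
};

// Small self-check: the raw load sees the stale value, the barrier fixes it.
void check_toy_barrier() {
  ToyHeap heap;
  heap.forwarding[0x1000] = 0x2000;
  Address slot = 0x1000;  // plays the role of a JNI global handle slot
  assert(heap.load_raw(&slot) == 0x1000);
  assert(heap.load_with_barrier(&slot) == 0x2000);
  assert(slot == 0x2000);
}
```

In this model the barriered load returns a valid address no matter whether the slot was visited before or after other iteration, which is the property the suggested change relies on.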

@linzang
Contributor Author

@linzang linzang commented Sep 27, 2021

Hi Per @pliden ,
Thanks a lot!
Correct! I was puzzled about why the order in which root types are dumped should matter, since the spec has no such requirement, and your suggestion answers that: I took the wrong fix and overlooked the missing read barrier.
I will make the change.

BRs,
Lin

@openjdk openjdk bot added ready rfr and removed ready rfr labels Sep 27, 2021
pliden
pliden approved these changes Sep 28, 2021
Contributor

@pliden pliden left a comment

The load barrier part of this now looks good to me.

@linzang
Contributor Author

@linzang linzang commented Sep 28, 2021

Thanks @pliden for the review and approval.

Dear Chris (@plummercj) and Ralf (@schmelter-sap), may I ask for your help reviewing this follow-up fix for JDK-8252842? Thanks!

@plummercj
Contributor

@plummercj plummercj commented Sep 29, 2021

Yes, I will have a look at it.

Contributor

@plummercj plummercj left a comment

The dumper threads related fix looks fine. I don't know enough to verify the GC fix, but I think Per has that covered sufficiently. Likewise for the lock rank issue which Coleen has reviewed. Also, I tested your changes with our tier2 and tier3 CI runs, which is where the failures initially turned up, and they are passing now.

@linzang
Contributor Author

@linzang linzang commented Sep 30, 2021

Thanks all for your help reviewing this patch. I will integrate it.

@linzang
Contributor Author

@linzang linzang commented Sep 30, 2021

/integrate

@openjdk

@openjdk openjdk bot commented Sep 30, 2021

Going to push as commit bfd6163.
Since your change was applied there have been 56 commits pushed to the master branch:

  • bb95dda: 8248001: javadoc generates invalid HTML pages whose ftp:// links are broken
  • 2f955d6: 8273142: Remove dependancy of TestHttpServer, HttpTransaction, HttpCallback from open/test/jdk/sun/net/www/protocol/http/ tests
  • 94e31e5: 8274506: TestPids.java and TestPidsLimit.java fail with podman run as root
  • a8210c5: 8274401: C2: GraphKit::load_array_element bypasses Access API
  • dfc557c: 8274406: RunThese30M.java failed "assert(!LCA_orig->dominates(pred_block) || early->dominates(pred_block)) failed: early is high enough"
  • c0533ef: 8274522: java/lang/management/ManagementFactory/MXBeanException.java test fails with Shenandoah
  • f8415a9: 8274523: java/lang/management/MemoryMXBean/MemoryTest.java test should handle Shenandoah
  • 355356c: 8273435: Remove redundant zero-length check in ClassDesc.of
  • 97385d4: 8274405: Suppress warnings on non-serializable non-transient instance fields in javac and javadoc
  • 79cebe2: 8274050: Unnecessary Vector usage in javax.crypto
  • ... and 46 more: https://git.openjdk.java.net/jdk/compare/5ec1cdcaf39229a7d2457313600b0dc2bf8c6453...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Sep 30, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Sep 30, 2021
@openjdk

@openjdk openjdk bot commented Sep 30, 2021

@linzang Pushed as commit bfd6163.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@linzang linzang deleted the pd-fix branch Oct 1, 2021
Labels
hotspot-runtime integrated serviceability
5 participants