8264634: CollectCLDClosure collects duplicated CLDs when dumping dynamic archive #3320

Closed
Member

@kelthuzadx kelthuzadx commented Apr 2, 2021

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/qingfeng.yy/openjdk16_so_warning/jdk/src/hotspot/share/classfile/classLoaderData.cpp:316), pid=68929, tid=68930
#  assert(_keep_alive > 0) failed: Invalid keep alive decrement count
#
# JRE version: OpenJDK Runtime Environment (17.0) (slowdebug build 17-internal+0-adhoc.qingfengyy.jdk)
# Java VM: OpenJDK 64-Bit Server VM (slowdebug 17-internal+0-adhoc.qingfengyy.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x781087]  ClassLoaderData::dec_keep_alive()+0x31

Stack: [0x00007f1593072000,0x00007f1593173000],  sp=0x00007f1593171c00,  free space=1023k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x781087]  ClassLoaderData::dec_keep_alive()+0x31
V  [libjvm.so+0xef19e7]  MetaspaceShared::link_and_cleanup_shared_classes(Thread*)+0x181
V  [libjvm.so+0x1260834]  JavaThread::invoke_shutdown_hooks()+0x46
V  [libjvm.so+0x12609e5]  Threads::destroy_vm()+0xe7
V  [libjvm.so+0xbb40ec]  jni_DestroyJavaVM_inner+0x91
V  [libjvm.so+0xbb4147]  jni_DestroyJavaVM+0x1f
C  [libjli.so+0x4b4f]  JavaMain+0xc61
C  [libjli.so+0xad93]  ThreadJavaMain+0x27

We observed a VM crash when dumping a dynamic archive for a simple Spring Boot application (see the JBS attachment for details). After some investigation, I found that in rare cases both of the following paths can be taken while dumping the dynamic archive:

1. SIGINT
at java.lang.Shutdown.beforeHalt(java.base@17-internal/Native Method)
at java.lang.Shutdown.exit(java.base@17-internal/Shutdown.java:172)
- locked <0x00000007fef02040> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Terminator$1.handle(java.base@17-internal/Terminator.java:51)
at jdk.internal.misc.Signal$1.run(java.base@17-internal/Signal.java:219)
at java.lang.Thread.run(java.base@17-internal/Thread.java:831)

2. Normal Exit
JavaThread::invoke_shutdown_hooks()+0x46
Threads::destroy_vm()+0xe7
jni_DestroyJavaVM_inner+0x91
jni_DestroyJavaVM+0x1f
JavaMain+0xc61
ThreadJavaMain+0x27

Both paths call MetaspaceShared::link_and_cleanup_shared_classes, so CollectCLDClosure collects duplicate CLDs into _loaded_cld. _keep_alive is then decremented twice for the same CLD, which drives it negative.
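
The double decrement described above can be reproduced in a toy model. This is only a sketch under stated assumptions: `ToyCLD`, `g_loaded_cld`, and `link_and_cleanup_shared` are hypothetical stand-ins, not HotSpot code, and the two passes run sequentially rather than on separate threads.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for ClassLoaderData: only the keep-alive counter is modelled.
struct ToyCLD {
  int keep_alive = 0;
  void inc_keep_alive() { ++keep_alive; }
  // The real dec_keep_alive() asserts _keep_alive > 0; the toy version just
  // decrements so the negative value can be observed.
  void dec_keep_alive() { --keep_alive; }
};

// Shared across invocations, like the pre-fix _loaded_cld list (assumption).
static std::vector<ToyCLD*> g_loaded_cld;

// Hypothetical model of one pass: collect into the shared list, then release
// everything in the list. The list is never emptied between passes.
void link_and_cleanup_shared(const std::vector<ToyCLD*>& graph) {
  for (ToyCLD* cld : graph) {
    cld->inc_keep_alive();
    g_loaded_cld.push_back(cld);
  }
  // ... class linking would happen here ...
  for (ToyCLD* cld : g_loaded_cld) {
    cld->dec_keep_alive();
  }
}

// Runs the pass twice (think: SIGINT path, then normal exit) and returns the
// final counter of the single CLD in the graph.
int keep_alive_after_two_passes() {
  g_loaded_cld.clear();
  ToyCLD cld;
  std::vector<ToyCLD*> graph{&cld};
  link_and_cleanup_shared(graph);  // one increment, one decrement
  assert(cld.keep_alive == 0);     // the first pass is balanced
  link_and_cleanup_shared(graph);  // one increment, two decrements
  return cld.keep_alive;
}
```

In this model the second pass sees the same CLD twice in the shared list, so `keep_alive_after_two_passes()` ends below zero, which is the condition the `_keep_alive > 0` assert rejects.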

Testing(linux_x64):
[+] test/hotspot/jtreg/runtime/cds
[+] test/hotspot/jtreg/gc


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8264634: CollectCLDClosure collects duplicated CLDs when dumping dynamic archive

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3320/head:pull/3320
$ git checkout pull/3320

Update a local copy of the PR:
$ git checkout pull/3320
$ git pull https://git.openjdk.java.net/jdk pull/3320/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3320

View PR using the GUI difftool:
$ git pr show -t 3320

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3320.diff


@bridgekeeper bridgekeeper bot commented Apr 2, 2021

👋 Welcome back yyang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Apr 2, 2021

@openjdk openjdk bot commented Apr 2, 2021

@kelthuzadx The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.


@mlbridge mlbridge bot commented Apr 2, 2021

Webrevs

@kelthuzadx kelthuzadx force-pushed the kelthuzadx:fix_crash branch from bdc9c72 to 56a47fc Apr 2, 2021
Contributor

@yminqi yminqi left a comment

Hi Yi,
_loaded_cld is a global list, and in this case it appears to contain duplicate CLDs.
The duplication could come from the thread that runs the shutdown hook.
Could you try:
  if (!cld->is_unloading()) {
    cld->inc_keep_alive();
+   if (!_loaded_cld->contains(cld)) {
      _loaded_cld->append(cld);
+   }
  }
Please let us know whether this avoids the crash.

Member

@iklam iklam left a comment

The fix looks reasonable. If MetaspaceShared::link_and_cleanup_shared_classes may be called twice, it's better to isolate the loaded_cld for each invocation. Allocating it locally will also avoid any potential threading issues.

I have some requests for cleaning up the code.

cld->dec_keep_alive();
}
loaded_cld.trunc_to(0);

@iklam

iklam Apr 3, 2021
Member

There's no need for the truncate -- loaded_cld is locally allocated and will be freed after this function returns.

Also, to improve modularity, I think we should move the dec_keep_alive loop into the destructor of CollectCLDClosure.

Also, loaded_cld can be moved as a field into CollectCLDClosure.
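
A rough sketch of the shape suggested here, with the list owned by the closure and the decrement loop in its destructor. `ToyCLD` and `ToyCollectCLDClosure` are hypothetical stand-ins that use `std::vector` instead of HotSpot's `GrowableArray`; this is not the code that was integrated.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for ClassLoaderData (assumption, not the HotSpot class).
struct ToyCLD {
  int keep_alive = 0;
  bool unloading = false;
  bool is_unloading() const { return unloading; }
  void inc_keep_alive() { ++keep_alive; }
  void dec_keep_alive() {
    assert(keep_alive > 0);  // mirrors the check that fired in the crash
    --keep_alive;
  }
};

// The closure owns its list, so each invocation starts empty, and the
// destructor undoes exactly the increments this instance performed.
class ToyCollectCLDClosure {
  std::vector<ToyCLD*> _loaded_cld;

 public:
  ToyCollectCLDClosure() = default;
  // Copying would duplicate the list and decrement twice, so forbid it.
  ToyCollectCLDClosure(const ToyCollectCLDClosure&) = delete;
  ToyCollectCLDClosure& operator=(const ToyCollectCLDClosure&) = delete;

  ~ToyCollectCLDClosure() {
    for (ToyCLD* cld : _loaded_cld) {
      cld->dec_keep_alive();
    }
  }

  void do_cld(ToyCLD* cld) {
    if (!cld->is_unloading()) {
      cld->inc_keep_alive();
      _loaded_cld.push_back(cld);
    }
  }

  int nof_cld() const { return static_cast<int>(_loaded_cld.size()); }
};

// Two back-to-back invocations, each with its own closure instance.
int keep_alive_after_two_scoped_passes() {
  ToyCLD cld;
  for (int pass = 0; pass < 2; ++pass) {
    ToyCollectCLDClosure collect_cld;
    collect_cld.do_cld(&cld);
    // ... class linking would happen here ...
  }  // destructor runs here and releases what this pass collected
  return cld.keep_alive;
}
```

Because nothing outlives the scope, a second invocation cannot see entries left behind by the first, and the counter returns to zero after each pass.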

CollectCLDClosure collect_cld;
ResourceMark rm;
GrowableArray<ClassLoaderData*> loaded_cld;
CollectCLDClosure collect_cld(&loaded_cld);

@iklam

iklam Apr 3, 2021
Member

I think we should add a comment explaining why it's necessary to collect the ClassLoaderDatas first:

// ClassLoaderDataGraph::loaded_cld_do requires ClassLoaderDataGraph_lock.
// We cannot link the classes while holding this lock (or else we may run into deadlock).
// Therefore, we need to first collect all the CLDs, and then link their classes after
// releasing the lock.
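
The collect-then-link ordering in that comment follows a general pattern: snapshot under the lock, release it, then do the work that must not run with the lock held. A minimal sketch with `std::mutex` follows. `g_graph_lock`, `g_graph`, and `link_one` are hypothetical, and having `link_one` take the same lock is an assumption made only to show why holding it across the loop would be a problem.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Stand-ins for ClassLoaderDataGraph_lock and the CLD graph (assumptions).
static std::mutex g_graph_lock;
static std::vector<int> g_graph = {1, 2, 3};

// Hypothetical linking step that needs the graph lock itself. Calling it
// while already holding g_graph_lock would self-deadlock on a std::mutex.
bool link_one(int id) {
  std::lock_guard<std::mutex> guard(g_graph_lock);
  return id > 0;
}

// Phase 1: snapshot the graph under the lock.
// Phase 2: link from the snapshot after the lock has been released.
int collect_then_link() {
  std::vector<int> collected;
  {
    std::lock_guard<std::mutex> guard(g_graph_lock);
    collected = g_graph;
    assert(collected.size() == g_graph.size());  // snapshot is complete
  }  // lock released here
  int linked = 0;
  for (int id : collected) {
    if (link_one(id)) {
      ++linked;
    }
  }
  return linked;
}
```

In the real code, the keep-alive increment taken during collection is what stops a collected CLD from going away between the two phases; the toy graph of integers does not need that.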
Member Author

@kelthuzadx kelthuzadx commented Apr 5, 2021

Hi Yi,
_loaded_cld is a global list, and in this case it appears to contain duplicate CLDs.
The duplication could come from the thread that runs the shutdown hook.
Could you try:
  if (!cld->is_unloading()) {
    cld->inc_keep_alive();
+   if (!_loaded_cld->contains(cld)) {
      _loaded_cld->append(cld);
+   }
  }
Please let us know whether this avoids the crash.

Hi Yumin, this fix still crashes. The CLDs collected during the first invocation of MetaspaceShared::link_and_cleanup_shared_classes are never removed from the list, so their _keep_alive counters are decremented again, as before, during the second invocation.

Member Author

@kelthuzadx kelthuzadx commented Apr 5, 2021

Hi Ioi,

Also, to improve modularity, I think we should move the dec_keep_alive loop into the destructor of CollectCLDClosure.
Also, loaded_cld can be moved as a field into CollectCLDClosure.

The suggestions make sense; I've made the changes. All tests under runtime/cds/ pass in a slowdebug build.

@iklam
iklam approved these changes Apr 5, 2021

@openjdk openjdk bot commented Apr 5, 2021

@kelthuzadx This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8264634: CollectCLDClosure collects duplicated CLDs when dumping dynamic archive

Reviewed-by: minqi, iklam

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 47 new commits pushed to the master branch:

  • dc608fd: 8264411: serviceability/jvmti/HeapMonitor tests intermittently fail due to large TLAB size
  • b1a225e: 8263565: NPE was thrown when sun.jvm.hotspot.rmi.serverNamePrefix was set
  • c41cd15: 8264686: ClhsdbTestConnectArgument.java should use SATestUtils::validateSADebugDPrivileges
  • b7baca7: 8264288: Performance issue with MethodHandle.asCollector
  • 9201899: 8264729: Random check-in failing header checks.
  • d920f85: 8264540: WhiteBox.metaspaceReserveAlignment should return shared region alignment
  • 104e925: 8264512: jdk/test/jdk/java/util/prefs/ExportNode.java relies on default platform encoding
  • a0ec2cb: 8248862: Implement Enhanced Pseudo-Random Number Generators
  • 39719da: 8253266: JList and JTable constructors should clear OPAQUE_SET before calling updateUI
  • a8005ef: 8166727: javac crashed: [jimage.dll+0x1942] ImageStrings::find+0x28
  • ... and 37 more: https://git.openjdk.java.net/jdk/compare/d2df9a7df89f095a9f706d849177eb201ac8d1cf...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@yminqi, @iklam) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready label Apr 5, 2021
@yminqi
yminqi approved these changes Apr 5, 2021
Contributor

@yminqi yminqi left a comment

Making the CLD list local is a reasonable solution. LGTM.

Member Author

@kelthuzadx kelthuzadx commented Apr 6, 2021

Thanks @yminqi @iklam for the reviews!

/integreate

Member Author

@kelthuzadx kelthuzadx commented Apr 6, 2021

/integrate


@openjdk openjdk bot commented Apr 6, 2021

@kelthuzadx Unknown command integreate - for a list of valid commands use /help.

@openjdk openjdk bot added the sponsor label Apr 6, 2021

@openjdk openjdk bot commented Apr 6, 2021

@kelthuzadx
Your change (at version fea3c4b) is now ready to be sponsored by a Committer.

Contributor

@yminqi yminqi commented Apr 6, 2021

/sponsor

@openjdk openjdk bot closed this Apr 6, 2021

@openjdk openjdk bot commented Apr 6, 2021

@yminqi @kelthuzadx Since your change was applied there have been 48 commits pushed to the master branch:

  • 43d4a6f: 8264564: AArch64: use MOVI instead of FMOV to zero FP register
  • dc608fd: 8264411: serviceability/jvmti/HeapMonitor tests intermittently fail due to large TLAB size
  • b1a225e: 8263565: NPE was thrown when sun.jvm.hotspot.rmi.serverNamePrefix was set
  • c41cd15: 8264686: ClhsdbTestConnectArgument.java should use SATestUtils::validateSADebugDPrivileges
  • b7baca7: 8264288: Performance issue with MethodHandle.asCollector
  • 9201899: 8264729: Random check-in failing header checks.
  • d920f85: 8264540: WhiteBox.metaspaceReserveAlignment should return shared region alignment
  • 104e925: 8264512: jdk/test/jdk/java/util/prefs/ExportNode.java relies on default platform encoding
  • a0ec2cb: 8248862: Implement Enhanced Pseudo-Random Number Generators
  • 39719da: 8253266: JList and JTable constructors should clear OPAQUE_SET before calling updateUI
  • ... and 38 more: https://git.openjdk.java.net/jdk/compare/d2df9a7df89f095a9f706d849177eb201ac8d1cf...master

Your commit was automatically rebased without conflicts.

Pushed as commit 54b4070.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@kelthuzadx kelthuzadx deleted the kelthuzadx:fix_crash branch Apr 6, 2021