Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8283849: AsyncGetCallTrace may crash JVM on guarantee #8061

Closed
wants to merge 11 commits into from

Conversation

jbachorik
Copy link

@jbachorik jbachorik commented Mar 31, 2022

A gist of the fix is to allow relaxed instantiation of a frame when done from a signal handler - eg. for profiling purposes.

Currently, a frame instantiation will fail on guarantee when we happen to hit a zombie method which is still on stack. While this would indicate a serious error for the normal execution flow, in case of profiling where the executing thread can be expected at any possible method this is something which may happen and we really should not take the profiled JVM down due to it.

The behaviour defaults to checking the code blob status in the guarantee so nothing will change for the rest of the callers - just ASGCT will be affected.


Unfortunately, I am not able to create a simple reproducer for the crash other that testing in our production where the crash is happening sporadically. However, thanks to @parttimenerd and his [ASGCT stress test](https://github.com/parttimenerd/asgct2-tester.git) the problem can be reproduced quite reliably.

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 reviews required, with at least 1 reviewer)

Issue

  • JDK-8283849: AsyncGetCallTrace may crash JVM on guarantee

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8061/head:pull/8061
$ git checkout pull/8061

Update a local copy of the PR:
$ git checkout pull/8061
$ git pull https://git.openjdk.java.net/jdk pull/8061/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 8061

View PR using the GUI difftool:
$ git pr show -t 8061

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8061.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Mar 31, 2022

👋 Welcome back jbachorik! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Mar 31, 2022

@jbachorik The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Mar 31, 2022
@jbachorik jbachorik force-pushed the jb/agct_crash_fix branch 7 times, most recently from 25f56b2 to 70df3b3 Compare April 6, 2022 13:51
@jbachorik jbachorik marked this pull request as ready for review April 14, 2022 11:45
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 14, 2022
@mlbridge
Copy link

mlbridge bot commented Apr 14, 2022

Webrevs

Copy link
Contributor

@RealLucy RealLucy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a complete review yet, just some general remarks.

While I absolutely appreciate the tedious work and time you invested, I dislike the widespread nature of your changes. This is not just your fault. Your changes only make deficiencies in the code layout really obvious.

Did you spend a thought on consolidating e.g. the changes in os_cpu into a common file, potentially in os/linux or, even better, os/posix? The cpu tree is begging for consolidation as well. Maybe most of the init stuff can be done in runtime/frame.*?

While browsing through the changes, I found that in pd_get_top_frame(), the code is inconsistent. Some instances provide special handling for #if COMPILER2_OR_JVMCI, others only for #if COMPILER2, and sometimes there is no special handling at all. We should either have it everywhere or nowhere. Furthermore, I would prefer to see only one syntactical form, either #if, or #ifdef or #if defined(). Please note the subtle differences!

I added a few inline comments where the code needs some TLC.

@@ -46,7 +46,7 @@ inline frame::frame() {

static int spin;

inline void frame::init(intptr_t* sp, intptr_t* fp, address pc) {
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc, bool forSignalHandler) {
assert(pauth_ptr_is_raw(pc), "cannot be signed");
intptr_t a = intptr_t(sp);
intptr_t b = intptr_t(fp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these declarations good for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean bool forSignalHandler?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I am referring to lines 51 and 52. These variables are unused as far as I can see.

Copy link
Author

@jbachorik jbachorik May 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, no idea. They seem to have been added by @theRealAph some 7 years ago ...

src/hotspot/cpu/aarch64/frame_aarch64.inline.hpp Outdated Show resolved Hide resolved
src/hotspot/cpu/aarch64/frame_aarch64.inline.hpp Outdated Show resolved Hide resolved
@jbachorik
Copy link
Author

@RealLucy
I deliberately left out any attempts on consolidation as it might make the change affecting even more files across the whole repo. From my experience it would lower the chance of getting the fix reviewed and approved significantly so I opted for the minimum number of required changes that are preventing this particular crash.

There is a JEP (https://bugs.openjdk.java.net/browse/JDK-8284289) under way which is supposed to clean up and unify the implementations for obtaining profiling stacktraces. I would prefer having the sweeping changes being rather a part of the JEP effort then attaching them to this PR. I can create a follow up JIRA ticket for the tasks you have mentioned and link it with the JEP so your input does not get lost.

@RealLucy
Copy link
Contributor

RealLucy commented May 2, 2022

@RealLucy I deliberately left out any attempts on consolidation...

There is a JEP (https://bugs.openjdk.java.net/browse/JDK-8284289) under way which is supposed to clean up...

I understand your approach - don't mess with things that are not directly affected. On the other hand: if you have to touch the code anyway, why not improve the maintainability? Maybe the JEP author has an opinion whether the cleanup could be part of the JEP. And you might get support by other reviewers as well.

@@ -46,7 +46,7 @@ inline frame::frame() {

static int spin;

inline void frame::init(intptr_t* sp, intptr_t* fp, address pc) {
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc, bool forSignalHandler) {
assert(pauth_ptr_is_raw(pc), "cannot be signed");
intptr_t a = intptr_t(sp);
intptr_t b = intptr_t(fp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these declarations good for?

@@ -46,7 +46,7 @@ inline frame::frame() {

static int spin;

inline void frame::init(intptr_t* sp, intptr_t* fp, address pc) {
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc, bool forSignalHandler) {
assert(pauth_ptr_is_raw(pc), "cannot be signed");
intptr_t a = intptr_t(sp);
intptr_t b = intptr_t(fp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I am referring to lines 51 and 52. These variables are unused as far as I can see.

@jbachorik
Copy link
Author

Maybe the JEP author has an opinion whether the cleanup could be part of the JEP. And you might get support by other reviewers as well.

Let me pull @parttimenerd in so he can voice his opinion about where the cleanup should happen.

@parttimenerd
Copy link
Contributor

The consolidation is part of my JEP and one of the main purposes. It might be easier to incorporate large clean ups into this JEP to minimize duplicated work (and code).

@tstuefe
Copy link
Member

tstuefe commented May 4, 2022

Hi,

sorry for the late chiming in.

Could we find a better way to do this that to funnel "is_signal_handler" thru every layer? How about a thread-local state instead? E.g. "bool Thread::_unsafe_code_cache_lookup"? Or, for that matter, "bool Thread::in_agct"? Could be set and restored at the beginning resp. end of AGCT.

I dislike the name "in_signal_handler" since the fact that we are inside signal handling is incidental. IIUC this is probably much more often a thread-safety issue than a reentrance issue, right? You would have the same problem if you analyzed the code cache without getting the code cache lock, while it is concurrently modified?

So if you want to keep the argument, how about renaming it to "unsafe", or "unsafe_access", or "unsafe_codecache_lookup"? We use "unsafe" in other places as well.

Cheers, Thomas

@jbachorik
Copy link
Author

Hi @tstuefe ,

Thanks for the review.

Could we find a better way to do this that to funnel "is_signal_handler" thru every layer? How about a thread-local state instead? E.g. "bool Thread::_unsafe_code_cache_lookup"? Or, for that matter, "bool Thread::in_agct"? Could be set and restored at the beginning resp. end of AGCT.

I wanted to avoid as much 'magic' as possible because usually I find such PRs as having higher chance of getting reviewed. But if using a TLS to avoid funnelling extra arg is something what would usually be done I will revisit the patch and use that instead.

I dislike the name "in_signal_handler" since the fact that we are inside signal handling is incidental. IIUC this is probably much more often a thread-safety issue than a reentrance issue, right? You would have the same problem if you analyzed the code cache without getting the code cache lock, while it is concurrently modified?

So if you want to keep the argument, how about renaming it to "unsafe", or "unsafe_access", or "unsafe_codecache_lookup"? We use "unsafe" in other places as well.

Interestingly enough, my first sketch of the patch was using the unsafe_access named argument :) Then I renamed it to look less scary. If the TLS approach won't work out and I need to pass in the extra argument I can definitely rename it unsafe_access if that is not going to freak out the reviewers :)

@dholmes-ora
Copy link
Member

But if using a TLS to avoid funnelling extra arg is something what would usually be done I will revisit the patch and use that instead.

You can't set TLS from a signal handler either. Reading library-based TLS is considered safe enough.

@tstuefe
Copy link
Member

tstuefe commented May 4, 2022

But if using a TLS to avoid funnelling extra arg is something what would usually be done I will revisit the patch and use that instead.

You can't set TLS from a signal handler either. Reading library-based TLS is considered safe enough.

No need, since you only need to adjust a member inside Thread. e.g.
Thread::current()->set_unsafe_access_mode(true) inside AGCT.

@parttimenerd
Copy link
Contributor

Using the approach proposed by Thomas has also the advantage, that we can use to handle other special casing for ASGCT too.

@@ -715,7 +715,7 @@ intptr_t* frame::real_fp() const {
#ifndef PRODUCT
// This is a generic constructor which is only used by pns() in debug.cpp.
frame::frame(void* sp, void* fp, void* pc) {
init((intptr_t*)sp, (intptr_t*)fp, (address)pc);
init((intptr_t*)sp, (intptr_t*)fp, (address)pc, CodeCache::find_blob((address)pc));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no overload for init that accepts a CodeBlob* as its fourth argument. If it compiles then it converted implicitly into CodeCache::find_blob((address)pc) != NULL. Is this intentional?

@jbachorik
Copy link
Author

Thank you all for the reviews!

I am going to close this PR and open a new one based on the thread-local flag approach as recommended by @tstuefe

The new PR is #8549

@jbachorik jbachorik closed this May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

5 participants