-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8283849: AsyncGetCallTrace may crash JVM on guarantee #8061
Conversation
👋 Welcome back jbachorik! A progress list of the required criteria for merging this PR into |
@jbachorik The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
25f56b2
to
70df3b3
Compare
70df3b3
to
443066e
Compare
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a complete review yet, just some general remarks.
While I absolutely appreciate the tedious work and time you invested, I dislike the widespread nature of your changes. This is not just your fault. Your changes only make deficiencies in the code layout really obvious.
Did you spend a thought on consolidating e.g. the changes in os_cpu
into a common file, potentially in os/linux
or, even better, os/posix
? The cpu
tree is begging for consolidation as well. Maybe most of the init stuff can be done in runtime/frame.*
?
While browsing through the changes, I found that in pd_get_top_frame()
, the code is inconsistent. Some instances provide special handling for #if COMPILER2_OR_JVMCI
, others only for #if COMPILER2
, and sometimes there is no special handling at all. We should either have it everywhere or nowhere. Furthermore, I would prefer to see only one syntactical form, either #if
, or #ifdef
or #if defined()
. Please note the subtle differences!
I added a few inline comments where the code needs some TLC.
@@ -46,7 +46,7 @@ inline frame::frame() { | |||
|
|||
static int spin; | |||
|
|||
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc) { | |||
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc, bool forSignalHandler) { | |||
assert(pauth_ptr_is_raw(pc), "cannot be signed"); | |||
intptr_t a = intptr_t(sp); | |||
intptr_t b = intptr_t(fp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these declarations good for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean bool forSignalHandler
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I am referring to lines 51 and 52. These variables are unused as far as I can see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, no idea. They seem to have been added by @theRealAph some 7 years ago ...
@RealLucy There is a JEP (https://bugs.openjdk.java.net/browse/JDK-8284289) under way which is supposed to clean up and unify the implementations for obtaining profiling stacktraces. I would prefer having the sweeping changes being rather a part of the JEP effort then attaching them to this PR. I can create a follow up JIRA ticket for the tasks you have mentioned and link it with the JEP so your input does not get lost. |
I understand your approach - don't mess with things that are not directly affected. On the other hand: if you have to touch the code anyway, why not improve the maintainability? Maybe the JEP author has an opinion whether the cleanup could be part of the JEP. And you might get support by other reviewers as well. |
@@ -46,7 +46,7 @@ inline frame::frame() { | |||
|
|||
static int spin; | |||
|
|||
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc) { | |||
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc, bool forSignalHandler) { | |||
assert(pauth_ptr_is_raw(pc), "cannot be signed"); | |||
intptr_t a = intptr_t(sp); | |||
intptr_t b = intptr_t(fp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these declarations good for?
@@ -46,7 +46,7 @@ inline frame::frame() { | |||
|
|||
static int spin; | |||
|
|||
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc) { | |||
inline void frame::init(intptr_t* sp, intptr_t* fp, address pc, bool forSignalHandler) { | |||
assert(pauth_ptr_is_raw(pc), "cannot be signed"); | |||
intptr_t a = intptr_t(sp); | |||
intptr_t b = intptr_t(fp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I am referring to lines 51 and 52. These variables are unused as far as I can see.
Let me pull @parttimenerd in so he can voice his opinion about where the cleanup should happen. |
The consolidation is part of my JEP and one of the main purposes. It might be easier to incorporate large clean ups into this JEP to minimize duplicated work (and code). |
Hi, sorry for the late chiming in. Could we find a better way to do this that to funnel "is_signal_handler" thru every layer? How about a thread-local state instead? E.g. "bool Thread::_unsafe_code_cache_lookup"? Or, for that matter, "bool Thread::in_agct"? Could be set and restored at the beginning resp. end of AGCT. I dislike the name "in_signal_handler" since the fact that we are inside signal handling is incidental. IIUC this is probably much more often a thread-safety issue than a reentrance issue, right? You would have the same problem if you analyzed the code cache without getting the code cache lock, while it is concurrently modified? So if you want to keep the argument, how about renaming it to "unsafe", or "unsafe_access", or "unsafe_codecache_lookup"? We use "unsafe" in other places as well. Cheers, Thomas |
Hi @tstuefe , Thanks for the review.
I wanted to avoid as much 'magic' as possible because usually I find such PRs as having higher chance of getting reviewed. But if using a TLS to avoid funnelling extra arg is something what would usually be done I will revisit the patch and use that instead.
Interestingly enough, my first sketch of the patch was using the |
You can't set TLS from a signal handler either. Reading library-based TLS is considered safe enough. |
No need, since you only need to adjust a member inside Thread. e.g. |
Using the approach proposed by Thomas has also the advantage, that we can use to handle other special casing for ASGCT too. |
@@ -715,7 +715,7 @@ intptr_t* frame::real_fp() const { | |||
#ifndef PRODUCT | |||
// This is a generic constructor which is only used by pns() in debug.cpp. | |||
frame::frame(void* sp, void* fp, void* pc) { | |||
init((intptr_t*)sp, (intptr_t*)fp, (address)pc); | |||
init((intptr_t*)sp, (intptr_t*)fp, (address)pc, CodeCache::find_blob((address)pc)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no overload for init
that accepts a CodeBlob*
as its fourth argument. If it compiles then it converted implicitly into CodeCache::find_blob((address)pc) != NULL
. Is this intentional?
A gist of the fix is to allow relaxed instantiation of a frame when done from a signal handler - eg. for profiling purposes.
Currently, a frame instantiation will fail on guarantee when we happen to hit a zombie method which is still on stack. While this would indicate a serious error for the normal execution flow, in case of profiling where the executing thread can be expected at any possible method this is something which may happen and we really should not take the profiled JVM down due to it.
The behaviour defaults to checking the code blob status in the guarantee so nothing will change for the rest of the callers - just ASGCT will be affected.
Unfortunately, I am not able to create a simple reproducer for the crash other that testing in our production where the crash is happening sporadically. However, thanks to @parttimenerd and his [ASGCT stress test](https://github.com/parttimenerd/asgct2-tester.git) the problem can be reproduced quite reliably.
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8061/head:pull/8061
$ git checkout pull/8061
Update a local copy of the PR:
$ git checkout pull/8061
$ git pull https://git.openjdk.java.net/jdk pull/8061/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 8061
View PR using the GUI difftool:
$ git pr show -t 8061
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8061.diff