8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP #23641
|
👋 Welcome back rrich! A progress list of the required criteria for merging this PR into |
|
@reinrich This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 126 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
e4c61ae to
c4b81e2
Compare
Webrevs
|
|
Can this also happen on other platforms during signal handling (e.g. segfault-based null checks)? |
TheRealMDoerr
left a comment
LGTM.
I guess such problems can happen on all platforms which use some kind of link register (aarch64, s390, ?). I also don't like that we lose so many samples with this current solution. I only approved it because I think it is better than crashing. |
Probably not, as the problems will still exist with external profilers like async-profiler. |
The actual issue here is that an attempt to walk native stack frames fails and we don't recognize that the stack is not walkable for our stackwalking code. The concrete problem is (likely) that the caller pc was not yet stored to the stack. This specific problem cannot occur on x86 (caller pc is passed on the stack), but pushing a new frame isn't atomic there either, and I'm sure there are states where our stackwalking code can crash.
This would avoid this specific type of crash.
With that enhancement we would capture the top java frame (sp, pc) in the signal handler too and then do the stack walk at the safepoint. Finding the top java frame is the purpose of find_initial_Java_frame but it crashes and would also crash with the walk of java frames delayed to the next safepoint. |
That worries me too (see pr descr.). |
|
@reinrich @TheRealMDoerr Thank you for the explanations.
I think the SIGTRAP handler should block SIGPROF or SIGVTALRM (whatever 26 is on linux ppc). This should be possible since SIGPROF is asynchronous. And if we enter the SIGTRAP jvm handler via the normal path (JVM gets SIGTRAP), this is already done. All signals that are not synchronous error signals are blocked, which should include SIGPROF. However, if we enter the signal handling via chaining (in this case, via async_profiler trap_handler), nothing is blocked. At least I don't see any setup for it in the async_profiler sources. The simple solution could be to just block SIGPROF for the current thread when entering the JVM signal handler. A better fix would be for async profiler to block SIGPROF in its trap handler (when setting up the sigaction). |
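The suggestion of blocking the profiling signals when setting up the sigaction can be sketched roughly as follows. This is a minimal POSIX sketch, not code from async-profiler or HotSpot: `trap_handler_sketch`, `install_trap_handler_blocking_profiling_signals` and `trap_handler_blocks` are hypothetical names, and the sketch assumes the profiler samples with SIGPROF or SIGVTALRM.

```cpp
#include <cassert>
#include <signal.h>

// Hypothetical stand-in for a profiler's SIGTRAP handler (assumption, empty on purpose).
static void trap_handler_sketch(int, siginfo_t*, void*) {}

// Installs the handler for SIGTRAP with SIGPROF and SIGVTALRM in sa_mask, so
// the kernel blocks both signals on this thread while the handler runs.
bool install_trap_handler_blocking_profiling_signals() {
  struct sigaction sa{};
  sa.sa_sigaction = trap_handler_sketch;
  sa.sa_flags = SA_SIGINFO | SA_RESTART;
  sigemptyset(&sa.sa_mask);
  sigaddset(&sa.sa_mask, SIGPROF);
  sigaddset(&sa.sa_mask, SIGVTALRM);
  return sigaction(SIGTRAP, &sa, nullptr) == 0;
}

// Reads back the current SIGTRAP disposition and reports whether `signo`
// is part of the mask that is applied while the handler executes.
bool trap_handler_blocks(int signo) {
  assert(signo > 0);
  struct sigaction current{};
  if (sigaction(SIGTRAP, nullptr, &current) != 0) {
    return false;
  }
  return sigismember(&current.sa_mask, signo) == 1;
}
```

The mask only covers the signal numbers named at installation time, so it helps only if the sampling signal is known in advance.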
This reverts commit c4b81e2.
|
I've pushed an alternative fix to consider a frame not |
TheRealMDoerr
left a comment
Thanks! This solution looks much better!
src/hotspot/cpu/ppc/frame_ppc.cpp
Outdated
    if (sender_pc() == nullptr) {
      // Likely the return pc was not yet stored to stack. We rather discard this
      // sample also because we would hit an assertion in frame::setup(). We can
Double-whitespaces seem to be uncommon.
tstuefe
left a comment
+1
|
A couple of comments for the record. Detecting another signal handler on the stack or blocking SIGPROF inside a handler is not a solution: the signal number that a profiler uses is configurable, and there may be multiple profilers working at the same time or one profiler working in dual mode (cpu + wall clock). In any case, the problem is not specific to signal handlers: it may happen with any frame that does not store the frame pointer at a known location. A typical example is
I'm OK with the proposed fix as long as it reduces the possibility of crashes, but it's likely not a bullet-proof solution. Any native frame that does not belong to |
Thanks for your comments, Andrei. I agree. Even frames from |
|
Thanks for the reviews! |
|
Going to push as commit e4d3c97.
Your commit was automatically rebased without conflicts. |
|
@apangin: Thanks for looking at this PR!
I think frame pointers are problematic on some platforms, but not on PPC64. The PPC64 ABI requires a valid back chain at all times. *SP always points to the previous frame and frames are pushed atomically. |
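To illustrate the distinction made here (the back chain is always valid, but a return pc may not have been stored yet), here is a simplified model of a defensive walk. It is a sketch under assumptions: `FrameModel` and `collect_return_pcs` are hypothetical, and the struct does not reproduce the real PPC64 frame layout or HotSpot's `frame` class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified frame header: a back chain pointer to the caller's
// frame plus a slot for the saved return pc.
struct FrameModel {
  const FrameModel* back_chain;  // caller's frame, nullptr at the stack base
  std::uintptr_t    saved_pc;    // 0 models "return pc not yet stored"
};

// Follows the back chain from `top` and collects the saved return pcs.
// Returns false and clears `out` (the sample is discarded) as soon as a
// frame without a stored return pc is encountered.
bool collect_return_pcs(const FrameModel* top, std::size_t max_depth,
                        std::vector<std::uintptr_t>& out) {
  assert(max_depth > 0);
  out.clear();
  for (const FrameModel* f = top;
       f != nullptr && out.size() < max_depth;
       f = f->back_chain) {
    if (f->saved_pc == 0) {
      out.clear();
      return false;
    }
    out.push_back(f->saved_pc);
  }
  return true;
}
```

In this model the walk itself never follows a bad pointer, because only the back chain is used for navigation; the return pc is checked before it is trusted.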
With this change JavaThread::pd_get_top_frame_for_profiling() fails if the current thread is found to be _thread_in_Java but the CodeCache does not contain its pc.
This will prevent crashes as described by the JBS item.
The fix might be too conservative for situations where a thread doesn't change its thread state when calling native code, e.g. using the Foreign Function & Memory API. The difficulty in finding a less defensive fix is that one must detect whether a valid pc can be found in the caller's ABI before constructing that frame.
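The rule described above can be modeled in a few lines. This is only an illustration under assumptions, not the HotSpot implementation: `ThreadStateModel`, `CodeCacheModel`, `TopFrameModel` and `top_frame_for_profiling` are hypothetical stand-ins, and the code cache is reduced to a single address range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the thread state and the code cache.
enum class ThreadStateModel { in_Java, in_native, in_vm };

struct CodeCacheModel {
  std::uintptr_t low;   // inclusive lower bound of generated code
  std::uintptr_t high;  // exclusive upper bound of generated code
  bool contains(std::uintptr_t pc) const { return pc >= low && pc < high; }
};

struct TopFrameModel {
  std::uintptr_t sp;
  std::uintptr_t pc;
};

// Models the rule from the description: if the thread claims to be executing
// Java code but its pc is outside the code cache, give up on this sample
// instead of trying to walk a frame that may not be walkable.
bool top_frame_for_profiling(ThreadStateModel state, const CodeCacheModel& cache,
                             std::uintptr_t sp, std::uintptr_t pc,
                             TopFrameModel& out) {
  assert(cache.low <= cache.high);
  if (state == ThreadStateModel::in_Java && !cache.contains(pc)) {
    return false;  // sample is dropped
  }
  out = TopFrameModel{sp, pc};
  return true;
}
```

The model also shows why the check can be too conservative: a thread that calls native code without leaving the in_Java state has a pc outside the range, so its samples are dropped as well.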
Testing:
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23641/head:pull/23641
$ git checkout pull/23641
Update a local copy of the PR:
$ git checkout pull/23641
$ git pull https://git.openjdk.org/jdk.git pull/23641/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23641
View PR using the GUI difftool:
$ git pr show -t 23641
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23641.diff