8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP #23641

reinrich · 2025-02-14T16:11:55Z

With this change JavaThread::pd_get_top_frame_for_profiling() fails if the current thread is found to be _thread_in_Java but the CodeCache does not contain its pc.

This will prevent crashes as described by the JBS item.

The fix might be too conservative for situations where a thread doen't change its thread state when calling native code, e.g. using the Foreign Function & Memory API. The difficulty finding a less defensive fix is that one must detect if a valid pc can be found in the caller's ABI before constructing that frame.

Testing:

DaCapo Tomcat with async-profiler on a fastdebug build.
Tier 1-4 of hotspot and jdk on the main platforms and also on Linux/PPC64le and AIX.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP (Bug - P4)

Reviewers

Martin Doerr (@TheRealMDoerr - Reviewer)
Thomas Stuefe (@tstuefe - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23641/head:pull/23641
$ git checkout pull/23641

Update a local copy of the PR:
$ git checkout pull/23641
$ git pull https://git.openjdk.org/jdk.git pull/23641/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23641

View PR using the GUI difftool:
$ git pr show -t 23641

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23641.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-02-14T16:12:49Z

👋 Welcome back rrich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-02-14T16:13:34Z

@reinrich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP

Reviewed-by: mdoerr, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 126 new commits pushed to the master branch:

bd112c4: 8350443: GHA: Split static-libs-bundles into a separate job
2731712: 8287749: Re-enable javadoc -serialwarn option
0f82268: 8345598: Upgrade NSS binaries for interop tests
ea2c923: 8323807: Async UL: Add a stalling mode to async UL
e7d4b36: 8350667: Remove startThread_lock() and _startThread_lock on AIX
1e18fff: 8328473: StringTable and SymbolTable statistics delay time to safepoint
a0dd565: 8350643: G1: Make loop iteration variable type correspond to limit in G1SurvRateGroup
aac9cb4: 8349906: G1: Improve initial survivor rate for newly used young regions
a431046: 8350518: org.openjdk.bench.java.util.TreeMapUpdate.compute fails with "java.lang.IllegalArgumentException: key out of range"
a70eba8: 8330174: Protection zone for easier detection of accidental zero-nKlass use
... and 116 more: https://git.openjdk.org/jdk/compare/0414dcec118fce24037ca1a6b00561c0ce4c6953...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2025-02-14T16:14:09Z

@reinrich The following label will be automatically applied to this pull request:

hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-02-25T09:36:27Z

Webrevs

tstuefe · 2025-02-25T09:38:58Z

Can this also happen on other platforms when in signal handling (e.g. segfault based nullchecks?)

TheRealMDoerr

LGTM.

TheRealMDoerr · 2025-02-25T15:20:50Z

Can this also happen on other platforms when in signal handling (e.g. segfault based nullchecks?)

I guess such problems can happen on all platforms which use some kind of link register (aarch64, s390, ?). I also don't like that we lose so many samples with this current solution. I only approved it because I think it is better than crashing.
Recognizing that a signal handler is on stack may be a better solution. Do we already have functionality for that?
There are efforts to read the stack at a safepoint. @parttimenerd: Would it make sense to wait for that?

parttimenerd · 2025-02-25T21:20:25Z

There are efforts to read the stack at a safepoint. @parttimenerd: Would it make sense to wait for that?

Probably not, as the problems will still exists with external profilers like async-profiler.

reinrich · 2025-02-26T08:02:11Z

Can this also happen on other platforms when in signal handling (e.g. segfault based nullchecks?)

I guess such problems can happen on all platforms which use some kind of link register (aarch64, s390, ?).

The actual issue here is that an attempt to walk native stack frames fails and we don't recognize that the stack is not walkable for our stackwalking code. The concrete problem is (likely) that caller pc was not yet stored to the stack. This specific problem cannot occur on x86 (caller pc passed on stack) but also there pushing a new frame isn't atomic and there are states where our stackwalking code can crash I'm sure.

I also don't like that we lose so many samples with this current solution. I only approved it because I think it is better than crashing.
Recognizing that a signal handler is on stack may be a better solution.

This would avoid this specific type of crash.
Attempts to walk native frames until the top java frame is found can fail, though, in similar ways.
That's what I meant referring to ffi calls in the pr description.

Do we already have functionality for that? There are efforts to read the stack at a safepoint. @parttimenerd: Would it make sense to wait for that?

With that enhancement we would capture the top java frame (sp, pc) in the signal handler too and then do the stack walk at the safepoint. Finding the top java frame is the purpose of find_initial_Java_frame but it crashes and would also crash with the walk of java frames delayed to the next safepoint.
It would only help if we would use the java frame (sp, pc) we find on top at the safepoint but doing so you loose precision, e.g. if you where in an critical ffi call when the thread was interrupted then you would loose this information.

reinrich · 2025-02-26T09:20:49Z

I also don't like that we lose so many samples with this current solution.

That worries me too (see pr descr.).
It might be possible to handle this situation better in frame::safe_for_sender if we only make sure that the sender pc is not null.
I was worried about the case where sender pc is random but within the code cache. This seems to be handled though in find_initial_Java_frame.

tstuefe · 2025-02-26T09:29:39Z

@reinrich @TheRealMDoerr Thank you for the explanations.

Recognizing that a signal handler is on stack may be a better solution.

I think the SIGTRAP handler should block SIGPROF or SIGVTALARM (whatever 26 is on linux ppc). This should be possible since SIGPROF is asynchronous.

And if we enter the SIGTRAP jvm handler via the normal path (JVM gets SIGTRAP), this is already done. All signals that are not synchronous error signals are blocked, which should include SIGPROF. However, if we enter the signal handling via chaining (in this case, via async_profiler trap_handler), nothing is blocked. At least I don't see any setup for it in the async_profiler sources.

The simple solution could be to just block SIGPROF for the current thread when entering the JVM signal handler. A better fix would be for async profiler to block SIGPROF in its trap handler (when setting up the sigaction).

This reverts commit c4b81e2.

reinrich · 2025-02-26T11:02:37Z

I've pushed an alternative fix to consider a frame not safe_for_sender if its sender pc is null.
This of course avoids the assertion given in the bug.
As explained in the comment we might find other random values when looking for the sender pc if it hasn't been stored to stack yet. This has to be handled by subsequent consistency checks.
Let's see how the nightly testing goes...

TheRealMDoerr

Thanks! This solution looks much better!

TheRealMDoerr · 2025-02-26T13:19:45Z

src/hotspot/cpu/ppc/frame_ppc.cpp


+  if (sender_pc() == nullptr) {
+    // Likely the return pc was not yet stored to stack.  We rather discard this
+    // sample also because we would hit an assertion in frame::setup().  We can


Double-whitespaces seem to be uncommon.

tstuefe

+1

apangin · 2025-02-27T01:51:58Z

A couple of comments for the record.

Detecting another signal handler on the stack or blocking SIGPROF inside a handler is not a solution: a signal number that profiler uses is configurable; there may be multiple profilers working at the same time or one profiler working in dual mode (cpu + wall clock).

In any case, the problem is not specific to signal handlers: it may happen with any frame that does not store frame pointer at a known location. A typical example is clock_gettime function called from System.currentTimeMillis and System.nanoTime. If libc is compiled without frame pointers, JVM fails to unwind clock_gettime. Note that currentTimeMillis and nanoTime are JVM intrinsics: they do not do regular state transition; a thread remains in_Java while executing clock_gettime. A signal trampoline is just another example of code with uncommon frame layout (not only on PPC).

I'm OK with the proposed fix as long as it reduces possibility of crashes, but it's likely not a bullet-proof solution. Any native frame that does not belong to libjvm.so is potentially dangerous to walk.

reinrich · 2025-02-27T09:04:31Z

I'm OK with the proposed fix as long as it reduces possibility of crashes, but it's likely not a bullet-proof solution. Any native frame that does not belong to libjvm.so is potentially dangerous to walk.

Thanks for your comments, Andrei. I agree. Even frames from libjvm.so are dangerous since we might read an incorrect random sender_pc(). The outcome that is just undefined.
Tests are good though :)

reinrich · 2025-02-27T09:22:58Z

Thanks for the reviews!
/integrate

openjdk · 2025-02-27T09:24:18Z

Going to push as commit e4d3c97.
Since your change was applied there have been 135 commits pushed to the master branch:

885338b: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
bb48b73: 8350723: RISC-V: debug.cpp help() is missing riscv line for pns
b29f8b0: 8350665: SIZE_FORMAT_HEX macro undefined in gtest
78c18cf: 8349399: GHA: Add static-jdk build on linux-x64
e43960a: 8350616: Skip ValidateHazardPtrsClosure in non-debug builds
9ec4696: 8350313: Include timings for leaving safepoint in safepoint logging
ec6624b: 8350649: Class unloading accesses/resurrects dead Java mirror after JDK-8346567
9477c70: 8024695: new File("").exists() returns false whereas it is the current working directory
3e46480: 8350770: [BACKOUT] Protection zone for easier detection of accidental zero-nKlass use
bd112c4: 8350443: GHA: Split static-libs-bundles into a separate job
... and 125 more: https://git.openjdk.org/jdk/compare/0414dcec118fce24037ca1a6b00561c0ce4c6953...master

Your commit was automatically rebased without conflicts.

openjdk · 2025-02-27T09:24:28Z

@reinrich Pushed as commit e4d3c97.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

TheRealMDoerr · 2025-02-27T10:35:59Z

@apangin: Thanks for looking at this PR!

In any case, the problem is not specific to signal handlers: it may happen with any frame that does not store frame pointer at a known location. A typical example is clock_gettime function called from System.currentTimeMillis and System.nanoTime. If libc is compiled without frame pointers, JVM fails to unwind clock_gettime. Note that currentTimeMillis and nanoTime are JVM intrinsics: they do not do regular state transition; a thread remains in_Java while executing clock_gettime. A signal trampoline is just another example of code with uncommon frame layout (not only on PPC).

I think frame pointers are problematic on some platforms, but not on PPC64. The PPC64 ABI requires a valid back chain at all time. *SP always points to the previous frame and frames are pushed atomically.
On PPC64, retrieving the return PC from the top function (PC from ucontext) is unreliable because we don't know if it lives in the LR register or it was already written on stack. x86 doesn't have this problem because the call instruction writes the return PC on stack. I guess most other platforms have the problem like PPC64 that it's hard to figure out if we are before or after the frame complete offset.

openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Feb 14, 2025

Handle attempt to sample stack while handling SIGTRAP

c4b81e2

reinrich force-pushed the 8350111_fail_if_handling_SIGTRAP branch from e4c61ae to c4b81e2 Compare February 24, 2025 16:35

reinrich marked this pull request as ready for review February 25, 2025 09:31

openjdk bot added the rfr Pull request is ready for review label Feb 25, 2025

TheRealMDoerr approved these changes Feb 25, 2025

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Feb 25, 2025

reinrich added 2 commits February 26, 2025 11:51

Revert first fix

17d8032

This reverts commit c4b81e2.

A frame isn't safe_for_sender if sender_pc() returns null

9a2eedc

openjdk bot removed the ready Pull request is ready to be integrated label Feb 26, 2025

TheRealMDoerr approved these changes Feb 26, 2025

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Feb 26, 2025

Improve whitespace

92b0ef9

openjdk bot removed the ready Pull request is ready to be integrated label Feb 26, 2025

TheRealMDoerr approved these changes Feb 26, 2025

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Feb 26, 2025

tstuefe approved these changes Feb 26, 2025

View reviewed changes

openjdk bot added the integrated Pull request has been integrated label Feb 27, 2025

openjdk bot closed this Feb 27, 2025

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Feb 27, 2025

reinrich deleted the 8350111_fail_if_handling_SIGTRAP branch October 24, 2025 07:13

8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP #23641

8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP #23641

Uh oh!

Conversation

reinrich commented Feb 14, 2025 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Reviewers

Reviewing

Uh oh!

bridgekeeper bot commented Feb 14, 2025

Uh oh!

openjdk bot commented Feb 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openjdk bot commented Feb 14, 2025

Uh oh!

mlbridge bot commented Feb 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

tstuefe commented Feb 25, 2025

Uh oh!

TheRealMDoerr left a comment

Choose a reason for hiding this comment

Uh oh!

TheRealMDoerr commented Feb 25, 2025

Uh oh!

parttimenerd commented Feb 25, 2025

Uh oh!

reinrich commented Feb 26, 2025

Uh oh!

reinrich commented Feb 26, 2025

Uh oh!

tstuefe commented Feb 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

reinrich commented Feb 26, 2025

Uh oh!

TheRealMDoerr left a comment

Choose a reason for hiding this comment

Uh oh!

TheRealMDoerr Feb 26, 2025

Choose a reason for hiding this comment

Uh oh!

tstuefe left a comment

Choose a reason for hiding this comment

Uh oh!

apangin commented Feb 27, 2025

Uh oh!

reinrich commented Feb 27, 2025

Uh oh!

reinrich commented Feb 27, 2025

Uh oh!

openjdk bot commented Feb 27, 2025

Uh oh!

openjdk bot commented Feb 27, 2025

Uh oh!

TheRealMDoerr commented Feb 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

reinrich commented Feb 14, 2025 •

edited by openjdk bot

Loading

openjdk bot commented Feb 14, 2025 •

edited

Loading

mlbridge bot commented Feb 25, 2025 •

edited

Loading

tstuefe commented Feb 26, 2025 •

edited

Loading