Skip to content

Conversation

@reinrich
Copy link
Member

@reinrich reinrich commented Feb 14, 2025

With this change JavaThread::pd_get_top_frame_for_profiling() fails if the current thread is found to be _thread_in_Java but the CodeCache does not contain its pc.

This will prevent crashes as described by the JBS item.

The fix might be too conservative for situations where a thread doen't change its thread state when calling native code, e.g. using the Foreign Function & Memory API. The difficulty finding a less defensive fix is that one must detect if a valid pc can be found in the caller's ABI before constructing that frame.

Testing:

  • DaCapo Tomcat with async-profiler on a fastdebug build.
  • Tier 1-4 of hotspot and jdk on the main platforms and also on Linux/PPC64le and AIX.

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23641/head:pull/23641
$ git checkout pull/23641

Update a local copy of the PR:
$ git checkout pull/23641
$ git pull https://git.openjdk.org/jdk.git pull/23641/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23641

View PR using the GUI difftool:
$ git pr show -t 23641

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23641.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Feb 14, 2025

👋 Welcome back rrich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Feb 14, 2025

@reinrich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8350111: [PPC] AsyncGetCallTrace crashes when called while handling SIGTRAP

Reviewed-by: mdoerr, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 126 new commits pushed to the master branch:

  • bd112c4: 8350443: GHA: Split static-libs-bundles into a separate job
  • 2731712: 8287749: Re-enable javadoc -serialwarn option
  • 0f82268: 8345598: Upgrade NSS binaries for interop tests
  • ea2c923: 8323807: Async UL: Add a stalling mode to async UL
  • e7d4b36: 8350667: Remove startThread_lock() and _startThread_lock on AIX
  • 1e18fff: 8328473: StringTable and SymbolTable statistics delay time to safepoint
  • a0dd565: 8350643: G1: Make loop iteration variable type correspond to limit in G1SurvRateGroup
  • aac9cb4: 8349906: G1: Improve initial survivor rate for newly used young regions
  • a431046: 8350518: org.openjdk.bench.java.util.TreeMapUpdate.compute fails with "java.lang.IllegalArgumentException: key out of range"
  • a70eba8: 8330174: Protection zone for easier detection of accidental zero-nKlass use
  • ... and 116 more: https://git.openjdk.org/jdk/compare/0414dcec118fce24037ca1a6b00561c0ce4c6953...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk
Copy link

openjdk bot commented Feb 14, 2025

@reinrich The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Feb 14, 2025
@reinrich reinrich force-pushed the 8350111_fail_if_handling_SIGTRAP branch from e4c61ae to c4b81e2 Compare February 24, 2025 16:35
@reinrich reinrich marked this pull request as ready for review February 25, 2025 09:31
@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 25, 2025
@mlbridge
Copy link

mlbridge bot commented Feb 25, 2025

Webrevs

@tstuefe
Copy link
Member

tstuefe commented Feb 25, 2025

Can this also happen on other platforms when in signal handling (e.g. segfault based nullchecks?)

Copy link
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 25, 2025
@TheRealMDoerr
Copy link
Contributor

Can this also happen on other platforms when in signal handling (e.g. segfault based nullchecks?)

I guess such problems can happen on all platforms which use some kind of link register (aarch64, s390, ?). I also don't like that we lose so many samples with this current solution. I only approved it because I think it is better than crashing.
Recognizing that a signal handler is on stack may be a better solution. Do we already have functionality for that?
There are efforts to read the stack at a safepoint. @parttimenerd: Would it make sense to wait for that?

@parttimenerd
Copy link
Contributor

There are efforts to read the stack at a safepoint. @parttimenerd: Would it make sense to wait for that?

Probably not, as the problems will still exists with external profilers like async-profiler.

@reinrich
Copy link
Member Author

Can this also happen on other platforms when in signal handling (e.g. segfault based nullchecks?)

I guess such problems can happen on all platforms which use some kind of link register (aarch64, s390, ?).

The actual issue here is that an attempt to walk native stack frames fails and we don't recognize that the stack is not walkable for our stackwalking code. The concrete problem is (likely) that caller pc was not yet stored to the stack. This specific problem cannot occur on x86 (caller pc passed on stack) but also there pushing a new frame isn't atomic and there are states where our stackwalking code can crash I'm sure.

I also don't like that we lose so many samples with this current solution. I only approved it because I think it is better than crashing.
Recognizing that a signal handler is on stack may be a better solution.

This would avoid this specific type of crash.
Attempts to walk native frames until the top java frame is found can fail, though, in similar ways.
That's what I meant referring to ffi calls in the pr description.

Do we already have functionality for that? There are efforts to read the stack at a safepoint. @parttimenerd: Would it make sense to wait for that?

With that enhancement we would capture the top java frame (sp, pc) in the signal handler too and then do the stack walk at the safepoint. Finding the top java frame is the purpose of find_initial_Java_frame but it crashes and would also crash with the walk of java frames delayed to the next safepoint.
It would only help if we would use the java frame (sp, pc) we find on top at the safepoint but doing so you loose precision, e.g. if you where in an critical ffi call when the thread was interrupted then you would loose this information.

@reinrich
Copy link
Member Author

I also don't like that we lose so many samples with this current solution.

That worries me too (see pr descr.).
It might be possible to handle this situation better in frame::safe_for_sender if we only make sure that the sender pc is not null.
I was worried about the case where sender pc is random but within the code cache. This seems to be handled though in find_initial_Java_frame.

@tstuefe
Copy link
Member

tstuefe commented Feb 26, 2025

@reinrich @TheRealMDoerr Thank you for the explanations.

Recognizing that a signal handler is on stack may be a better solution.

I think the SIGTRAP handler should block SIGPROF or SIGVTALARM (whatever 26 is on linux ppc). This should be possible since SIGPROF is asynchronous.

And if we enter the SIGTRAP jvm handler via the normal path (JVM gets SIGTRAP), this is already done. All signals that are not synchronous error signals are blocked, which should include SIGPROF. However, if we enter the signal handling via chaining (in this case, via async_profiler trap_handler), nothing is blocked. At least I don't see any setup for it in the async_profiler sources.

The simple solution could be to just block SIGPROF for the current thread when entering the JVM signal handler. A better fix would be for async profiler to block SIGPROF in its trap handler (when setting up the sigaction).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Feb 26, 2025
@reinrich
Copy link
Member Author

I've pushed an alternative fix to consider a frame not safe_for_sender if its sender pc is null.
This of course avoids the assertion given in the bug.
As explained in the comment we might find other random values when looking for the sender pc if it hasn't been stored to stack yet. This has to be handled by subsequent consistency checks.
Let's see how the nightly testing goes...

Copy link
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This solution looks much better!


if (sender_pc() == nullptr) {
// Likely the return pc was not yet stored to stack. We rather discard this
// sample also because we would hit an assertion in frame::setup(). We can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double-whitespaces seem to be uncommon.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 26, 2025
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Feb 26, 2025
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 26, 2025
Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@apangin
Copy link

apangin commented Feb 27, 2025

A couple of comments for the record.

Detecting another signal handler on the stack or blocking SIGPROF inside a handler is not a solution: a signal number that profiler uses is configurable; there may be multiple profilers working at the same time or one profiler working in dual mode (cpu + wall clock).

In any case, the problem is not specific to signal handlers: it may happen with any frame that does not store frame pointer at a known location. A typical example is clock_gettime function called from System.currentTimeMillis and System.nanoTime. If libc is compiled without frame pointers, JVM fails to unwind clock_gettime. Note that currentTimeMillis and nanoTime are JVM intrinsics: they do not do regular state transition; a thread remains in_Java while executing clock_gettime. A signal trampoline is just another example of code with uncommon frame layout (not only on PPC).

I'm OK with the proposed fix as long as it reduces possibility of crashes, but it's likely not a bullet-proof solution. Any native frame that does not belong to libjvm.so is potentially dangerous to walk.

@reinrich
Copy link
Member Author

I'm OK with the proposed fix as long as it reduces possibility of crashes, but it's likely not a bullet-proof solution. Any native frame that does not belong to libjvm.so is potentially dangerous to walk.

Thanks for your comments, Andrei. I agree. Even frames from libjvm.so are dangerous since we might read an incorrect random sender_pc(). The outcome that is just undefined.
Tests are good though :)

@reinrich
Copy link
Member Author

Thanks for the reviews!
/integrate

@openjdk
Copy link

openjdk bot commented Feb 27, 2025

Going to push as commit e4d3c97.
Since your change was applied there have been 135 commits pushed to the master branch:

  • 885338b: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
  • bb48b73: 8350723: RISC-V: debug.cpp help() is missing riscv line for pns
  • b29f8b0: 8350665: SIZE_FORMAT_HEX macro undefined in gtest
  • 78c18cf: 8349399: GHA: Add static-jdk build on linux-x64
  • e43960a: 8350616: Skip ValidateHazardPtrsClosure in non-debug builds
  • 9ec4696: 8350313: Include timings for leaving safepoint in safepoint logging
  • ec6624b: 8350649: Class unloading accesses/resurrects dead Java mirror after JDK-8346567
  • 9477c70: 8024695: new File("").exists() returns false whereas it is the current working directory
  • 3e46480: 8350770: [BACKOUT] Protection zone for easier detection of accidental zero-nKlass use
  • bd112c4: 8350443: GHA: Split static-libs-bundles into a separate job
  • ... and 125 more: https://git.openjdk.org/jdk/compare/0414dcec118fce24037ca1a6b00561c0ce4c6953...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Feb 27, 2025
@openjdk openjdk bot closed this Feb 27, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Feb 27, 2025
@openjdk
Copy link

openjdk bot commented Feb 27, 2025

@reinrich Pushed as commit e4d3c97.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@TheRealMDoerr
Copy link
Contributor

@apangin: Thanks for looking at this PR!

In any case, the problem is not specific to signal handlers: it may happen with any frame that does not store frame pointer at a known location. A typical example is clock_gettime function called from System.currentTimeMillis and System.nanoTime. If libc is compiled without frame pointers, JVM fails to unwind clock_gettime. Note that currentTimeMillis and nanoTime are JVM intrinsics: they do not do regular state transition; a thread remains in_Java while executing clock_gettime. A signal trampoline is just another example of code with uncommon frame layout (not only on PPC).

I think frame pointers are problematic on some platforms, but not on PPC64. The PPC64 ABI requires a valid back chain at all time. *SP always points to the previous frame and frames are pushed atomically.
On PPC64, retrieving the return PC from the top function (PC from ucontext) is unreliable because we don't know if it lives in the LR register or it was already written on stack. x86 doesn't have this problem because the call instruction writes the return PC on stack. I guess most other platforms have the problem like PPC64 that it's hard to figure out if we are before or after the frame complete offset.

@reinrich reinrich deleted the 8350111_fail_if_handling_SIGTRAP branch October 24, 2025 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

5 participants