JDK-8288556: VM crashes if it gets sent SIGUSR2 from outside #9181

Closed
wants to merge 3 commits into from

tstuefe
Member

@tstuefe tstuefe commented Jun 16, 2022

The VM uses SIGUSR2 (can be overridden via _JAVA_SR_SIGNUM) to implement suspend/resume on java threads. It sends, via pthread_kill, SIGUSR2 to targeted threads to interrupt them. It knows the target thread, and the target thread is always a VM-attached thread.

However, if SIGUSR2 gets sent from outside, any thread may receive the signal, and if the target thread is not attached to the VM (e.g. primordial), it is unable to handle it. The result is an assert (debug VM) or a crash (release VM). On my box, this can be reliably reproduced by sending SIGUSR2 to any VM.
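The difference between the two delivery paths can be sketched in a few lines. This is a minimal illustration, not HotSpot code: the thread-local flag `t_attached` is a hypothetical stand-in for "this thread is attached to the VM", and all function names are invented for the example. Reading a thread-local in a handler is also not strictly async-signal-safe; it is used here only to keep the sketch short.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <pthread.h>
#include <unistd.h>

// Hypothetical stand-in for "this thread is attached to the VM".
static thread_local volatile bool t_attached = false;

static std::atomic<int> g_handled_by_attached{0};
static std::atomic<int> g_stray{0};
static std::atomic<bool> g_worker_ready{false};

// An attached thread can service the request; any other thread can only
// note that the signal was stray.
static void sr_handler(int, siginfo_t*, void*) {
  if (t_attached) {
    g_handled_by_attached.fetch_add(1);
  } else {
    g_stray.fetch_add(1);
  }
}

static void install_sr_handler(int sig) {
  struct sigaction sa = {};
  sa.sa_sigaction = sr_handler;
  sa.sa_flags = SA_SIGINFO | SA_RESTART;
  sigemptyset(&sa.sa_mask);
  int rc = sigaction(sig, &sa, nullptr);
  assert(rc == 0);
  (void)rc;
}

static void* attached_worker(void*) {
  t_attached = true;                  // "attach" this thread
  g_worker_ready.store(true);
  while (g_handled_by_attached.load() == 0) {
    usleep(1000);                     // wait for the directed signal
  }
  return nullptr;
}

// Directed delivery: with pthread_kill the sender picks the receiving thread.
static void signal_attached_worker(int sig) {
  pthread_t worker;
  int rc = pthread_create(&worker, nullptr, attached_worker, nullptr);
  assert(rc == 0);
  while (!g_worker_ready.load()) {
    usleep(1000);
  }
  rc = pthread_kill(worker, sig);
  assert(rc == 0);
  (void)rc;
  pthread_join(worker, nullptr);
}
```

Calling `signal_attached_worker(SIGUSR2)` always reaches the thread that set the flag. A process-directed signal (for example `kill -USR2 <pid>` from a shell) carries no such guarantee and may be handled by a thread where the flag is still false.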

This has been discussed here: https://mail.openjdk.org/pipermail/core-libs-dev/2022-June/091450.html

The proposed solutions range from "works as designed" (on the grounds that sending arbitrary signals to the JVM is an error in itself, and we should rather crash hard and fast) to "let's catch and ignore the signal".


In this patch I opt for:

  • Debug: keep asserting, but make the message more helpful by including signal info for the stray SR signal. Includes sender pid and signal number (in case SR signal had been overridden).
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/shared/projects/openjdk/jdk-jdk/source/src/hotspot/os/posix/signals_posix.cpp:1611), pid=139712, tid=139712
#  assert(thread != __null) failed: Non-attached thread received stray SR signal (siginfo: si_signo: 12 (SIGUSR2), si_code: 0 (SI_USER), si_pid: 6681, si_uid: 1027)..
  • Release: write a message to tty about this signal, including sender pid and signal name. Otherwise, ignore the signal and don't crash. Repeated signals will generate repeated output:
thomas@starfish:/shared/projects/openjdk/jdk-jdk/output-release$ ./images/jdk/bin/java -cp $REPROS_JAR de.stuefe.repros.Simple
<press key>
Non-attached thread received stray SR signal (siginfo: si_signo: 12 (SIGUSR2), si_code: 0 (SI_USER), si_pid: 239773, si_uid: 1027).
Non-attached thread received stray SR signal (siginfo: si_signo: 12 (SIGUSR2), si_code: 0 (SI_USER), si_pid: 239774, si_uid: 1027).
Non-attached thread received stray SR signal (siginfo: si_signo: 12 (SIGUSR2), si_code: 0 (SI_USER), si_pid: 239775, si_uid: 1027).
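The release-build behaviour described above can be approximated outside the VM as follows. This is a sketch under stated assumptions: the message layout is copied from the output shown above, while `signal_name`, `code_name`, `format_stray_sr_signal` and `report_stray_sr_signal` are hypothetical helpers invented for the example. `snprintf` and `fputs` are used for brevity and are not async-signal-safe.

```cpp
#include <cassert>
#include <csignal>
#include <cstdio>
#include <string>

// Illustrative name lookup; only the signals discussed here are covered.
static const char* signal_name(int sig) {
  switch (sig) {
    case SIGUSR1: return "SIGUSR1";
    case SIGUSR2: return "SIGUSR2";
    default:      return "unknown";
  }
}

static const char* code_name(int code) {
  return code == SI_USER ? "SI_USER" : "other";
}

// Builds the one-line diagnostic from the siginfo the handler received.
static std::string format_stray_sr_signal(const siginfo_t& si) {
  char buf[256];
  int n = snprintf(buf, sizeof(buf),
      "Non-attached thread received stray SR signal "
      "(siginfo: si_signo: %d (%s), si_code: %d (%s), si_pid: %ld, si_uid: %ld).",
      si.si_signo, signal_name(si.si_signo),
      si.si_code, code_name(si.si_code),
      (long)si.si_pid, (long)si.si_uid);
  assert(n > 0 && n < (int)sizeof(buf));
  (void)n;
  return std::string(buf);
}

// Release-style handling: report the stray signal and carry on.
// Returns true if the signal was classified as stray.
static bool report_stray_sr_signal(const siginfo_t& si, bool thread_is_attached) {
  if (thread_is_attached) {
    return false;                      // normal suspend/resume path, not shown
  }
  std::string msg = format_stray_sr_signal(si);
  fputs(msg.c_str(), stdout);          // stdout stands in for the VM's tty
  fputs("\n", stdout);
  return true;
}
```

A debug-style variant would assert on `thread_is_attached` with the same message instead of returning.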

Notes:

  • In release builds, we could also quit the VM instead of continuing. I prefer gracefully ignoring the signal because, in our experience, quitting (regardless of how good the diagnostic message is) often just leads to frustrated users complaining about VMs mysteriously vanishing. The same goes for crashes: they just feed the general "java is unstable" notion. I'm open to discussing this.
  • I use tty for the diagnostic message, which goes to stdout. I really dislike that, error output should go to stderr. But since the rest of the VM handles diagnostic output the same way, I use tty here too.

Thanks, Thomas


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8288556: VM crashes if it gets sent SIGUSR2 from outside

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9181/head:pull/9181
$ git checkout pull/9181

Update a local copy of the PR:
$ git checkout pull/9181
$ git pull https://git.openjdk.org/jdk pull/9181/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9181

View PR using the GUI difftool:
$ git pr show -t 9181

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9181.diff

@bridgekeeper

bridgekeeper bot commented Jun 16, 2022

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@tstuefe tstuefe marked this pull request as ready for review June 16, 2022 07:50
@openjdk

openjdk bot commented Jun 16, 2022

@tstuefe The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime and rfr labels Jun 16, 2022
@mlbridge

mlbridge bot commented Jun 16, 2022

Webrevs

os::print_siginfo(&ss, siginfo);
ss.print_raw(").");
assert(thread != NULL, "%s.", ss.base());
tty->print_cr("%s", ss.base());
Member

Surely this should be a regular VM warning not a raw write to tty - neither of which are signal-safe.

Member Author

Okay, I changed this to log_warning(os).
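For contrast, this is roughly what fully async-signal-safe reporting would look like. It is not what the patch does (the patch ends up using `log_warning(os)`); it is a sketch that restricts itself to `write(2)` plus `strlen` and `memcpy`, which recent POSIX editions list as async-signal-safe. All helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <unistd.h>

// Appends a NUL-terminated string, truncating if the buffer is full.
// Requires len < cap. Returns the new length.
static size_t append_str(char* buf, size_t len, size_t cap, const char* s) {
  size_t n = strlen(s);
  size_t room = cap - 1 - len;
  if (n > room) {
    n = room;
  }
  memcpy(buf + len, s, n);
  len += n;
  buf[len] = '\0';
  return len;
}

// Appends an unsigned decimal number without any printf-family call.
static size_t append_dec(char* buf, size_t len, size_t cap, unsigned long v) {
  char digits[24];
  size_t n = 0;
  do {
    digits[n++] = (char)('0' + v % 10);
    v /= 10;
  } while (v != 0 && n < sizeof(digits));
  while (n > 0 && len + 1 < cap) {
    buf[len++] = digits[--n];          // digits were produced in reverse
  }
  buf[len] = '\0';
  return len;
}

// Formats a short note into the caller's buffer and emits it with write(2).
static size_t write_stray_signal_note(int fd, int signo, long sender_pid,
                                      char* buf, size_t cap) {
  assert(buf != nullptr && cap > 0);
  size_t len = append_str(buf, 0, cap, "stray SR signal ");
  len = append_dec(buf, len, cap, (unsigned long)signo);
  len = append_str(buf, len, cap, " from pid ");
  len = append_dec(buf, len, cap, (unsigned long)sender_pid);
  len = append_str(buf, len, cap, "\n");
  ssize_t ignored = write(fd, buf, len);
  (void)ignored;
  return len;
}
```

The cost of this approach is that every piece of formatting has to be written by hand, which is presumably why VM code tends to accept the risk of regular logging in paths like this one.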

@dholmes-ora
Member

As per the discussion this is a much broader problem as it could apply to a range of signals and can be wrong even if the thread is not attached. Even if you want to restrict this to SR_SIGNUM the current proposal only handles one case.

@tstuefe
Member Author

tstuefe commented Jun 20, 2022

Hi David,

> As per the discussion this is a much broader problem as it could apply to a range of signals and can be wrong even if the thread is not attached. Even if you want to restrict this to SR_SIGNUM the current proposal only handles one case.

I disagree about this being a large issue. Quoting https://mail.openjdk.org/pipermail/core-libs-dev/2022-June/091577.html

> I'd say we limit this to
> 1) signal for which we registered handlers: the handlers should at least not crash or vanish the VM without a trace
>
> 2) signals which may conceivably be sent to the VM in the normal course of events: if the default action is to terminate the VM, we should handle them
The (1) set is rather small and contains:

  • SEGV, ILL, FPE, BUS: These should crash, so crashing out the VM if the signal is sent manually is reasonable (in fact, I use this sometimes).
  • TRAP (power only): Used by the compiler; relies not on Thread::current, just on the pc from the context. Sending it from outside should be benign.
  • PIPE, XFSZ: As I wrote in the mail thread, we already gracefully ignore these signals when receiving them, for both debug and release builds.
  • QUIT (BREAK_SIGNAL): We assert !ReduceSignalUsage in debug. In Release, we gracefully ignore this signal.
  • HUP, SIGTERM (SHUTDOWN1_SIGNAL, SHUTDOWN3_SIGNAL): End the VM immediately as expected.
  • QUIT (SHUTDOWN2_SIGNAL): Prints thread dump. Does not shutdown the VM.

All of these signals are already handled correctly. Note that with several of them (PIPE, XFSZ) we already established the pattern of ignoring signals instead of vanishing the VM.

The set (2) is atm unknown to me. Are there any more? SIGCHLD is ignored by default. SIGUSR1 exists and exits the VM; this may be another case, but atm we don't handle it and I would not add a handler to it since user apps may use this signal. Any others?


The way I see it, my patch would introduce the same handling for SIGUSR2 we already have established for SIGPIPE, SIGXFSZ, and arguably for SIGINT.

Cheers, Thomas

@dholmes-ora
Member

Hi Thomas,

Okay I see SIGUSR2 (or more generally SR_SIGNUM) is special in that regard: it is an internal-use-only non-terminating signal. But any external sending of SIGUSR2 is invalid regardless of whether an attached or not-attached thread handles it.

@tstuefe
Member Author

tstuefe commented Jun 20, 2022

> Hi Thomas,
>
> Okay I see SIGUSR2 (or more generally SR_SIGNUM) is special in that regard: it is an internal-use-only non-terminating signal. But any external sending of SIGUSR2 is invalid regardless of whether an attached or not-attached thread handles it.

So, should I fail if si_pid!=getpid? As I wrote, I was a bit worried that some OSes may not deliver the correct pid in si_pid - either deliver the kernel thread id or leave it empty. I have dim recollections of such errors on AIX or HPUX. So far, we use si_pid only for displaying purposes and don't really rely on it being correct.
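The `si_pid` check under discussion could look roughly like this. It is a sketch only: `sent_by_own_process`, `classify_handler` and `install_classifier` are hypothetical names, and the whole idea rests on the assumption questioned above, namely that the platform fills in `si_pid` correctly. Note also that `si_pid` is only meaningful for signals sent by a process (for example via `kill`), not for kernel-generated faults.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <unistd.h>

static std::atomic<int> g_internal{0};
static std::atomic<int> g_external{0};

// True if the siginfo names this process as the sender.
static bool sent_by_own_process(const siginfo_t* si, pid_t self) {
  return si != nullptr && si->si_pid == self;
}

// Counts signals by origin; getpid() is async-signal-safe.
static void classify_handler(int, siginfo_t* si, void*) {
  if (sent_by_own_process(si, getpid())) {
    g_internal.fetch_add(1);
  } else {
    g_external.fetch_add(1);
  }
}

static void install_classifier(int sig) {
  struct sigaction sa = {};
  sa.sa_sigaction = classify_handler;
  sa.sa_flags = SA_SIGINFO;            // required to receive siginfo_t
  sigemptyset(&sa.sa_mask);
  int rc = sigaction(sig, &sa, nullptr);
  assert(rc == 0);
  (void)rc;
}
```

On a platform that reports a kernel thread id or leaves the field empty, the handler above would misclassify internal signals as external, which is the failure mode that makes relying on `si_pid` for anything beyond display risky.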

Another issue, I tried the patch with redefining SR_SIGNUM and found that I could not use SIGUSR1 on Linux because its numerical value (10) is below SIGSEGV(11) on my box and we have this code:

if ((s = ::getenv("_JAVA_SR_SIGNUM")) != 0) {
  int sig = ::strtol(s, 0, 10);
  if (sig > MAX2(SIGSEGV, SIGBUS) &&  // See 4355769.
      sig < NSIG) {                   // Must be legal signal and fit into sigflags[].
    PosixSignals::SR_signum = sig;
  } else {

Do you think this is fix-worthy? SIGUSR1 seems an obvious choice for an alternate SR signal, but OTOH nobody complained in 20+ years.
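The range check quoted above can be exercised in isolation. The sketch below reproduces only the condition from the snippet; the function names and the `-1` "rejected" return value are assumptions made for the example, and what the real `else` branch does is not shown in this conversation.

```cpp
#include <algorithm>
#include <cassert>
#include <csignal>
#include <cstdlib>

// Same condition as the quoted snippet: the number must be above both
// SIGSEGV and SIGBUS and below NSIG.
static bool is_acceptable_sr_signum(int sig) {
  return sig > std::max(SIGSEGV, SIGBUS) && sig < NSIG;
}

// Parses a decimal signal number; returns it if acceptable, otherwise -1.
// The -1 convention is an assumption of this sketch.
static int parse_sr_signum(const char* s) {
  assert(s != nullptr);
  int sig = (int)strtol(s, nullptr, 10);
  return is_acceptable_sr_signum(sig) ? sig : -1;
}

// Reads _JAVA_SR_SIGNUM from the environment; -1 if unset or rejected.
static int sr_signum_from_env() {
  const char* s = getenv("_JAVA_SR_SIGNUM");
  return s != nullptr ? parse_sr_signum(s) : -1;
}
```

With the values mentioned above (SIGUSR1 = 10, SIGSEGV = 11), `_JAVA_SR_SIGNUM=10` fails the first half of the condition, which matches the observation that SIGUSR1 cannot be selected on that box.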

@tstuefe
Member Author

tstuefe commented Jun 20, 2022

> Another issue, I tried the patch with redefining SR_SIGNUM and found that I could not use SIGUSR1 on Linux because its numerical value (10) is below SIGSEGV(11) on my box and we have this code:
>
>     if ((s = ::getenv("_JAVA_SR_SIGNUM")) != 0) {
>       int sig = ::strtol(s, 0, 10);
>       if (sig > MAX2(SIGSEGV, SIGBUS) &&  // See 4355769.
>           sig < NSIG) {                   // Must be legal signal and fit into sigflags[].
>         PosixSignals::SR_signum = sig;
>       } else {
>
> Do you think this is fix-worthy? SIGUSR1 seems an obvious choice for an alternate SR signal, but OTOH nobody complained in 20+ years.

I read up on this in the very good analysis at: https://bugs.openjdk.org/browse/JDK-4355769?focusedCommentId=12425929&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-12425929 . This was a wild read :) I continue to be impressed by the quality of bug reports in JBS.

Seems more complicated than I thought. I am not sure how much of this is still valid today; this was 21 years ago. I'm surprised by the described behaviour though, that SIGUSR2 can interrupt delivery of error signals; I would have naively thought that signal delivery works on a strict first-come, first-delivered basis.
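One way to probe the ordering question empirically is to make two signals pending at the same time and record the order in which their handlers run. The sketch below does that for two asynchronous signals only; it does not reproduce the synchronous-versus-asynchronous case from 4355769, and it makes no claim about which order any particular kernel picks. All names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <pthread.h>
#include <unistd.h>

static volatile sig_atomic_t g_order[2] = {0, 0};
static std::atomic<int> g_count{0};

// Records the signal number in arrival order.
static void record_handler(int sig) {
  int idx = g_count.fetch_add(1);
  if (idx < 2) {
    g_order[idx] = sig;
  }
}

// Blocks both signals, raises them so that both are pending on the calling
// thread, then unblocks them together. g_order shows the delivery order.
static void deliver_both_pending(int first_raised, int second_raised) {
  struct sigaction sa = {};
  sa.sa_handler = record_handler;
  sigemptyset(&sa.sa_mask);
  sigaddset(&sa.sa_mask, first_raised);   // handlers must not nest, so the
  sigaddset(&sa.sa_mask, second_raised);  // recorded order is delivery order
  int rc = sigaction(first_raised, &sa, nullptr);
  assert(rc == 0);
  rc = sigaction(second_raised, &sa, nullptr);
  assert(rc == 0);
  (void)rc;

  sigset_t both, old;
  sigemptyset(&both);
  sigaddset(&both, first_raised);
  sigaddset(&both, second_raised);
  pthread_sigmask(SIG_BLOCK, &both, &old);
  raise(first_raised);                    // now pending, not delivered
  raise(second_raised);
  pthread_sigmask(SIG_SETMASK, &old, nullptr);

  // Bounded wait in case delivery is not complete when the mask call returns.
  for (int i = 0; i < 1000 && g_count.load() < 2; i++) {
    usleep(1000);
  }
}
```

Running `deliver_both_pending(SIGUSR2, SIGUSR1)` and then inspecting `g_order` shows whether the platform delivers in raise order, in signal-number order, or in some other order.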

@dholmes-ora
Member

As the saying goes "it's complicated". Whether Linux signal delivery has the same properties today I have no idea. I'm somewhat bemused that SIGUSR1 and SIGUSR2 are not adjacent signals - weird design to say the least. I also don't understand how you can possibly get two signals pending like that at the same time when one is synchronous and the other asynchronous. 4355769 is an interesting read but I'm not sure I can really agree with the analysis (and note that one comment seems to contradict an earlier one, so exactly what happened is unclear). So yeah setting an alternative SR_SIGNUM is problematic.

On the si_pid part ... I have no prior knowledge of this (didn't even know it existed), so have no idea whether it is reliable or not.

Seems to me we have far greater risk of breaking something unexpectedly with changing this code than we potentially benefit from making the change.

So I'd vote for doing nothing.

@tstuefe
Member Author

tstuefe commented Jun 20, 2022

> On the si_pid part ... I have no prior knowledge of this (didn't even know it existed), so have no idea whether it is reliable or not.
>
> Seems to me we have far greater risk of breaking something unexpectedly with changing this code than we potentially benefit from making the change.
>
> So I'd vote for doing nothing.

"Nothing" as in I should withdraw this patch? Surely not?

The behavioral difference my patch brings would be:
debug: assert with useless information -> assert with useful information
release: crash with useless report -> print useful information, continue

As I wrote, I can compromise the second part to:
release: crash with useless report -> print useful information, exit VM

@dholmes-ora
Member

Given you have done the work I can review the patch - we just need to resolve the tty vs. VM warning issue. But I don't see a need to make any actual changes here.

Member

@dholmes-ora dholmes-ora left a comment

Thanks Thomas!

@openjdk

openjdk bot commented Jun 21, 2022

@tstuefe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8288556: VM crashes if it gets sent SIGUSR2 from outside

Reviewed-by: dholmes, lucy

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 53 new commits pushed to the master branch:

  • 5e680d0: 8288724: Prevent NullPointerException in serviceability/tmtools/jstack/DaemonThreadTest.java if jstack process fails
  • ad89146: 8288601: Consolidate static/dynamic archive tables
  • 7e211d7: 8287672: jtreg test com/sun/jndi/ldap/LdapPoolTimeoutTest.java fails intermittently in nightly run
  • 7039c66: Merge
  • 453e8be: 8288527: broken link in java.base/java/util/zip/package-summary.html
  • 33d0363: 8288741: JFR: Change package name of snippet files
  • 0408f9c: 8288663: JFR: Disabling the JfrThreadSampler commits only a partially disabled state
  • 1cf83a4: 8287800: JFR: Incorrect error message when starting recording with missing .jfc file
  • 09da87c: 8288485: jni/nullCaller/NullCallerTest.java failing (ppc64)
  • ed714af: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114
  • ... and 43 more: https://git.openjdk.org/jdk/compare/39526e28bc4b82d22623a839362fd443e9fb11f0...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Jun 21, 2022
Contributor

@RealLucy RealLucy left a comment

Changes look good to me.
PRODUCT code should not abort but continue running if safely possible. In the case considered here, nothing is wrong in the VM. It just receives a signal it doesn't know what to do about. Therefore, "ignore and continue" is the right strategy.

@tstuefe
Member Author

tstuefe commented Jun 21, 2022

Thanks a lot, @dholmes-ora and @RealLucy !

/integrate

@openjdk

openjdk bot commented Jun 21, 2022

Going to push as commit 701ea3b.
Since your change was applied there have been 53 commits pushed to the master branch:

  • 5e680d0: 8288724: Prevent NullPointerException in serviceability/tmtools/jstack/DaemonThreadTest.java if jstack process fails
  • ad89146: 8288601: Consolidate static/dynamic archive tables
  • 7e211d7: 8287672: jtreg test com/sun/jndi/ldap/LdapPoolTimeoutTest.java fails intermittently in nightly run
  • 7039c66: Merge
  • 453e8be: 8288527: broken link in java.base/java/util/zip/package-summary.html
  • 33d0363: 8288741: JFR: Change package name of snippet files
  • 0408f9c: 8288663: JFR: Disabling the JfrThreadSampler commits only a partially disabled state
  • 1cf83a4: 8287800: JFR: Incorrect error message when starting recording with missing .jfc file
  • 09da87c: 8288485: jni/nullCaller/NullCallerTest.java failing (ppc64)
  • ed714af: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114
  • ... and 43 more: https://git.openjdk.org/jdk/compare/39526e28bc4b82d22623a839362fd443e9fb11f0...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated label Jun 21, 2022
@openjdk openjdk bot closed this Jun 21, 2022
@openjdk openjdk bot removed the ready and rfr labels Jun 21, 2022
@openjdk

openjdk bot commented Jun 21, 2022

@tstuefe Pushed as commit 701ea3b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@tstuefe tstuefe deleted the JDK-8288556-SIGUSR2-crash branch August 24, 2023 08:36
Labels: hotspot-runtime, integrated