Skip to content

Conversation

@navyxliu
Copy link
Member

@navyxliu navyxliu commented Jan 10, 2022

In early stage of initialization, HotSpot doesn't handle SIGQUIT. The default signal preposition on Linux is to quit the process and generate coredump.

There are 2 applications for this signal.

  1. There's a handshake protocol between sun.tools.attach and HotSpot. VirtualMachineImpl sends SIGQUIT(3) if the AttachListener has not been initialized. It expects "Signal Dispatcher" to handle SIGQUIT and create AttachListener.
  2. POSIX systems use SIGQUIT as SIGBREAK. After AttachListener is up, HotSpot will reinterpret the signal for thread dump.

It is possible that HotSpot is still initializing in Threads::create_vm() when SIGQUIT arrives. We should change JVM_HANDLE_XXX_SIGNAL to catch SIGQUIT and ignore it. It is installed os::init_2() and should cover the early stage of initialization. Later on, os::initialize_jdk_signal_support() still overwrites the signal handler of SIGQUIT if ReduceSignalUsage is false(default).

Testing

Before, this patch, once initialization takes long time, jcmd may quit the java process.

$java -Xms64g -XX:+AlwaysPreTouch -Xlog:gc+heap=debug:stderr -XX:ParallelGCThreads=1 &
[1] 9589
[0.028s][debug][gc,heap] Minimum heap 68719476736  Initial heap 68719476736  Maximum heap 68719476736
[0.034s][debug][gc,heap] Running G1 PreTouch with 1 workers for 16384 work units pre-touching 68719476736B.
$jcmd 9589 VM.flags
9589:
[1]  + 9589 quit       java -Xms64g -XX:+AlwaysPreTouch -Xlog:gc+heap=debug:stderr
java.io.IOException: No such process
        at jdk.attach/sun.tools.attach.VirtualMachineImpl.sendQuitTo(Native Method)
        at jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:100)
        at jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
        at jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
        at jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:113)
        at jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:97)

With this patch, neither jcmd nor kill -3 will disrupt java process 45850.

$java -Xms64g -XX:+AlwaysPreTouch -XX:ParallelGCThreads=1 -version &
[1] 45850
$ kill -3 45850
$jcmd 45850 VM.flags
45850:
com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file /proc/45850/root/tmp/.java_pid45850: target process 45850 doesn't respond within 10500ms or HotSpot VM not loaded
        at jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:105)
        at jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
        at jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
        at jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:113)
        at jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:97)
$openjdk version "19-internal" 2022-09-20
OpenJDK Runtime Environment (fastdebug build 19-internal+0-adhoc.xxinliu.jdk)
OpenJDK 64-Bit Server VM (fastdebug build 19-internal+0-adhoc.xxinliu.jdk, mixed mode, sharing)
[1]  + 45850 done       java -Xms64g -XX:+AlwaysPreTouch -XX:ParallelGCThreads=1  -version

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8279124: VM does not handle SIGQUIT during initialization

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7003/head:pull/7003
$ git checkout pull/7003

Update a local copy of the PR:
$ git checkout pull/7003
$ git pull https://git.openjdk.java.net/jdk pull/7003/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7003

View PR using the GUI difftool:
$ git pr show -t 7003

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7003.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Jan 10, 2022

👋 Welcome back xliu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jan 10, 2022

@navyxliu The following label will be automatically applied to this pull request:

  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label Jan 10, 2022
@navyxliu navyxliu marked this pull request as ready for review January 11, 2022 06:05
@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 11, 2022
@mlbridge
Copy link

mlbridge bot commented Jan 11, 2022

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Xin,

A couple of comments below. I'm still thinking about this one ... seems okay but I'm not certain ...

Thanks,
David

* Return 1 if the SIGQUIT is set in SigCgt; 0 if it is not.
* Return -1 when it runs into any error.
*/
static int check_sigquit_caught(jint pid) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this just be a bool function? Even if int we don't need a ternary return value as only zero is of interest.

@tstuefe
Copy link
Member

tstuefe commented Jan 17, 2022

Hi @navyxliu,

nice catch. I can see how this can be annoying.

I propose a simpler and more robust way to fix it though. We (1) set up general hotspot signal handling very early, then (2) proceed to initialize the heap - which you have shown can take some time - then (3) set up SIGQUIT handling. We core if we get a quit signal before (3).

I would add SIGQUIT handling to the general signal handler too, just to cover the time frame between (1) and (3). At (3), it would be overwritten, but that would be fine. Before (3), we could just ignore SIGQUIT, or print out some generic information (I assume thread dumps are not yet possible at this stage).

Since the documented behavior for the JVM is to threaddump on SIGQUIT unless we run with -Xrs, I think this is acceptable behavior. Not printing thread dump or only printing partial information is preferable to quitting with core.

Then, this would work for any client that sends sigquit to a JVM, not just those using the attach framework. And it would work on all Posix platforms, not just Linux. And we'd would not have to rely on parsing the proc fs.

Als note that a solution implemented in the client as you did has the disadvantage that I need a modern jcmd for it to work. However, I often just use whatever jcmd is in the path. Better to handle this in the receiving hotspot.

I sketched out a simple patch to test if what I propose can work:

diff --git a/src/hotspot/os/posix/signals_posix.cpp b/src/hotspot/os/posix/signals_posix.cpp
index 2c020a79408..3f0dd91e03b 100644
--- a/src/hotspot/os/posix/signals_posix.cpp
+++ b/src/hotspot/os/posix/signals_posix.cpp
@@ -615,6 +615,11 @@ int JVM_HANDLE_XXX_SIGNAL(int sig, siginfo_t* info,
 #endif // ZERO
   }
 
+if (sig == BREAK_SIGNAL) {
+  printf("too early for thread dumps...\n");
+  signal_was_handled = true;
+}
+
   // Ignore SIGPIPE and SIGXFSZ (4229104, 6499219).
   if (!signal_was_handled &&
       (sig == SIGPIPE || sig == SIGXFSZ)) {
@@ -1279,6 +1284,7 @@ void install_signal_handlers() {
   set_signal_handler(SIGFPE);
   PPC64_ONLY(set_signal_handler(SIGTRAP);)
   set_signal_handler(SIGXFSZ);
+  set_signal_handler(BREAK_SIGNAL); // just for initialization phase

It still misses a number of things (I did not check signal mask setup and ReduceSignalUsage must be handled too), but it shows the general direction and worked as expected.

Cheers, Thomas

@kevinjwalls
Copy link
Contributor

I propose a simpler and more robust way to fix it though

Great, this is the kind of thing I was heading towards with the conversation in the bug text. Although not sure why I could not reproduce the problem, with various different JDK versions.

@tstuefe
Copy link
Member

tstuefe commented Jan 17, 2022

I propose a simpler and more robust way to fix it though

Great, this is the kind of thing I was heading towards with the conversation in the bug text. Although not sure why I could not reproduce the problem, with various different JDK versions.

Ah, I missed your conversation.

I reproduced this by adding a delay during initialization and sending sigquit manually. The bug is not restricted to jcmd, sigquit handling is broken during initialization. Folks tend to send sigquit to unresponsive VMs to get thread dumps, so coring is unfortunate (another reason not to fix it in jcmd itself).

Cheers, Thomas

@dholmes-ora
Copy link
Member

Apologies @kevinjwalls as I also missed the discussion in the JBS issue.

Ideally we would know if the target VM is ready before we send the SIGQUIT to attach - e.g. writing a well-known file that the attach mechanism looks for before trying to attach. That seems feasible but perhaps costly and not always possible(?) ...

What has been presented here is a side-channel way of knowing. In that respect I like this. It is a pity it is Linux only.

The alternative suggestions of just making the window during which SIGQUIT terminates the VM process small enough to be un-hittable, also has merit. I don't think we have to try and accommodate extreme code that just looks up a process in the process table and throws a signal at it. Ignoring SIGQUIT during the early VM startup seems a reasonable solutions (we can't produce a thread dump at that time anyway). It seems to me that we can simply install UserHandler for SIGQUIT very early in the VM initialization process and it will be a no-op until the JDK signal handling is properly initialized (just need to fix one assert).

Cheers,
David

@navyxliu
Copy link
Member Author

hi, @tstuefe @dholmes-ora
I think your opinion is same. I agree Thomas's approach is more general. It works in practice because "making the window during which SIGQUIT terminates the VM process small enough to be unhittable". The gain is obvious. It can work on all POSIX systems and the attach clients don't need to be patched.

One behavior change in the 2f25753 is that we can't core the JVM process with -Xrs (ReduceSignalUsage) anymore SIGQUIT will be intercepted by JVM_HANDLE_XXX_SIGNAL. Thomas said this is fine.
--lx

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Xin,

thanks for taking my suggestion. Remarks inline.

Cheers, Thomas

@tstuefe
Copy link
Member

tstuefe commented Jan 19, 2022

p.s. may make sense to reformulate the JBS issue. "VM does not handle SIGQUIT during initialization", and extending the error description to make clear this affects anyone sending SIGQUIT, not only jcmd.

@navyxliu
Copy link
Member Author

navyxliu commented Jan 19, 2022

I propose a simpler and more robust way to fix it though

Great, this is the kind of thing I was heading towards with the conversation in the bug text. Although not sure why I could not reproduce the problem, with various different JDK versions.

Ah, I missed your conversation.

I reproduced this by adding a delay during initialization and sending sigquit manually. The bug is not restricted to jcmd, sigquit handling is broken during initialization. Folks tend to send sigquit to unresponsive VMs to get thread dumps, so coring is unfortunate (another reason not to fix it in jcmd itself).

Cheers, Thomas

Yes, I learn this trick recently. If the JVM looks suspiciously frozen, just input Ctrl-\ to get the thread dump. Essentially, this is same as "kill -3 $pid" on Linux. jcmd happens to do the same if it can't find the socket file /tmp/.java_pid$pid.

false(default value).

This patch also adds a log message with 'os+init=info'.
@navyxliu navyxliu changed the title 8279124: VirtualMachineImpl should check signal handler has installed before sending SIGQUIT 8279124: VM does not handle SIGQUIT during initialization Jan 19, 2022
@kevinjwalls
Copy link
Contributor

Hi -- thanks for updating the bug title and text. Yes it's much better to start with a concise problem description. I'm in favour of the signal hander change. I'm not personally concerned about printing, silently handling SIGQUIT seems fine for a VM at this stage, perhaps printing just adds risk.

Still curious that I don't reproduce the problem by making heap initialization slow with options like -Xms100g -XX:+AlwaysPreTouch as you could. Startup can be so slow I can attach gdb and see it's in:

Threads::create_vm / init_globals / universe_init / G1CollectedHeap::initialize / ...etc...

...but jcmd or kill -QUIT don't hurt my JVM. 8-)

That process' /proc/PID/status contains:

SigIgn: 0000000000000006
SigCgt: 0000000180000000

...so that I think has signals 2 and 3 ignored? (Ubuntu)

Elsewhere I used Oracle Linux under Windows Services for Linux, and SigXXX fields in /proc/PID/status are all zeroes, not sure if they are meaningful there. Possibly another reason to handle this with the signal handler change.

On a real OracleLinux install I do see :

SigIgn: 0000000000000006
SigCgt: 0000000000000000

at startup become:

SigIgn: 0000000000000006
SigCgt: 0000000181001cc8

..after some seconds. But I still can't trigger the issue, there are some signals ignored there also.

So I like the change but would like to be clearer where the problem exists, where (what platforms?) can we see no signals ignored or caught at startup, and trigger the problem of crashing the VM with SIGQUIT.

@tstuefe
Copy link
Member

tstuefe commented Jan 19, 2022

So I like the change but would like to be clearer where the problem exists, where (what platforms?) can we see no signals ignored or caught at startup, and trigger the problem of crashing the VM with SIGQUIT.

I reproduced it with my artificial delay on Ubuntu 20.04. Cannot reproduce it with AlwaysPreTouch since my machine is too fast.

@navyxliu
Copy link
Member Author

hi, @kevinjwalls,
Even though AlwaysPreTouch leverages multicore now, we can use "-XX:ParallelGCThreads=1" to restrict one parallel work to pretouch work. I think the reason you can't reproduce this because of "SigIgn: 0000000000000006".

According to signal(7), the default disposition of SIGQUIT(3) is "Core" and SIGINT(2) is "Term". "SigIgn: 0000000000000006" indicates that you "Ign" them in the first place. Besides launcher and hotspot, is it possible systemd/bash or kernel can change it?

The doc also describes as follows. Is that possible that you just fork but not exec so it just inherits the signal dispositions from your bash?

A child created via fork(2) inherits a copy of its parent's
signal dispositions. During an execve(2), the dispositions of
handled signals are reset to the default; the dispositions of
ignored signals are left unchanged.

--lx

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now. Thanks!

Cheers, Thomas

@openjdk
Copy link

openjdk bot commented Jan 20, 2022

@navyxliu This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8279124: VM does not handle SIGQUIT during initialization

Reviewed-by: dholmes, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 67 new commits pushed to the master branch:

  • 54c9de2: 8275918: Remove unused local variables in java.base security code
  • b9ae779: 8279675: CDS cannot handle non-existent JAR file in bootclassapth
  • c1e4f3d: 8279397: Update --release 18 symbol information for JDK 18 build 32
  • 2920ce5: 8278036: Saving rscratch1 is optional in MacroAssembler::verify_heapbase
  • 6287ae3: 8277531: Print actual default stacksize on Windows thread logging
  • ab2c8d3: 8280393: Promote use of HtmlTree factory methods
  • 47b1c51: 8277120: Use Optional.isEmpty instead of !Optional.isPresent in java.net.http
  • 19f8779: 8278784: C2: Refactor PhaseIdealLoop::remix_address_expressions() so it operates on longs
  • 6352c02: 8280401: [sspi] gss_accept_sec_context leaves output_token uninitialized
  • 35ee0f3: 8258814: Compilation logging crashes for thread suspension / debugging tests
  • ... and 57 more: https://git.openjdk.java.net/jdk/compare/98d96a770756ffe3e7f5e4b82120e9fb484cad9a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jan 20, 2022
Comment on lines 597 to 600
if (!signal_was_handled && sig == BREAK_SIGNAL) {
assert(!ReduceSignalUsage, "Should not happen with -Xrs/-XX:+ReduceSignalUsage");
signal_was_handled = true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just do this as the very first thing at line 564:

// If we get BREAK_SIGNAL it means we are very early in VM initialization and 
// only temporarily "handling" it to prevent the VM process getting terminated.
if (sig == BREAK_SIGNAL) {
  assert(!ReduceSignalUsage, "Should not happen with -Xrs/-XX:+ReduceSignalUsage");
  return true;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi, David and Thomas,

I am nervous about this change. It's not complex but I don't want to break the existing java applications. Bear with me.

From my reading, HotSpot can chain a user-custom signal handler because of libjsig.so. It's not like a linked-list chain. if the user installs a handler for a signal, libjsig just saves it. JVM_HANDLE_XXX_SIGNAL is invoked first and then the user-custom handler is called here if it isn't handled.

The reason we can hoist this logic to line 564 because we assume that no user application would define a handler for BREAK_SIGNAL. It doesn't work right now because os::initialize_jdk_signal_support() will overwrite the signal handler of BREAK_SIGNAL later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Xin,
Signal chaining doesn't work for BREAK_SIGNAL - from the signal chaining docs:

Note:

The SIGQUIT, SIGTERM, SIGINT, and SIGHUP signals cannot be chained. If the application must 
handle these signals, then consider using the —Xrs option. 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. If embedding application likes to handle SIGQUIT, it needs to set -Xrs. In that case, it does not matter if it installed the signal handler before or after VM was initialized. We just won't touch SIGQUIT at all in that case.

Wrt the position of handling the break signal: most of the coding between lines 564 and 588 does not hurt; PosixSignals::unblock_error_signals(); may actually be beneficial, though it is highly unlikely that we get problems with secondary crashes when handling SIGQUIT. So, line 588 would be an okay position for SIGQUIT handling too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just didn't see the point in doing that extra stuff or the signal_is_handled when we just need to bail out immediately.

Xin Liu added 2 commits January 20, 2022 20:52
It's because SIGQUIT will be overwritten later. This patch fixed the regresssion
runtime/jni/checked/TestCheckedJniExceptionCheck.java.
The test uses -Xcheck:jni and the warning message from JniPeriodicCheckerTask
may mess up the expected outputs.
@tstuefe
Copy link
Member

tstuefe commented Jan 22, 2022

Good catch about Xcheck:jni.

This looks good to me. Ship it.

}

void set_signal_handler(int sig) {
void set_signal_handler(int sig, bool do_check = true) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good catch on this part!

@navyxliu
Copy link
Member Author

/integrate

@openjdk
Copy link

openjdk bot commented Jan 24, 2022

Going to push as commit 9bf6ffa.
Since your change was applied there have been 68 commits pushed to the master branch:

  • 30cd47d: 8280499: runtime/cds/appcds/TestDumpClassListSource.java fails on platforms without AppCDS custom class loaders support
  • 54c9de2: 8275918: Remove unused local variables in java.base security code
  • b9ae779: 8279675: CDS cannot handle non-existent JAR file in bootclassapth
  • c1e4f3d: 8279397: Update --release 18 symbol information for JDK 18 build 32
  • 2920ce5: 8278036: Saving rscratch1 is optional in MacroAssembler::verify_heapbase
  • 6287ae3: 8277531: Print actual default stacksize on Windows thread logging
  • ab2c8d3: 8280393: Promote use of HtmlTree factory methods
  • 47b1c51: 8277120: Use Optional.isEmpty instead of !Optional.isPresent in java.net.http
  • 19f8779: 8278784: C2: Refactor PhaseIdealLoop::remix_address_expressions() so it operates on longs
  • 6352c02: 8280401: [sspi] gss_accept_sec_context leaves output_token uninitialized
  • ... and 58 more: https://git.openjdk.java.net/jdk/compare/98d96a770756ffe3e7f5e4b82120e9fb484cad9a...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 24, 2022
@openjdk openjdk bot closed this Jan 24, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 24, 2022
@openjdk
Copy link

openjdk bot commented Jan 24, 2022

@navyxliu Pushed as commit 9bf6ffa.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.

4 participants