8294003: Don't handle si_addr == 0 && si_code == SI_KERNEL SIGSEGVs #10340

Closed
wants to merge 2 commits into from

Conversation

stefank
Member

@stefank stefank commented Sep 19, 2022

We have this code in our signal handler:

#ifndef AMD64
    // Halt if SI_KERNEL before more crashes get misdiagnosed as Java bugs
    // This can happen in any running code (currently more frequently in
    // interpreter code but has been seen in compiled code)
    if (sig == SIGSEGV && info->si_addr == 0 && info->si_code == SI_KERNEL) {
      fatal("An irrecoverable SI_KERNEL SIGSEGV has occurred due "
            "to unstable signal handling in this distribution.");
    }
#endif // AMD64

This bug added that change:
https://bugs.openjdk.java.net/browse/JDK-8004124

In the Generational ZGC we hit the exact same condition whenever we try to (incorrectly) dereference one of our colored pointers. From the bug above:

"A segmentation violation that occurs as a result of userspace process accessing virtual memory above the TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL"

That is, if we have set high-order bits (past the TASK_SIZE limit), we get this kind of SIGSEGV.
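
To make the "high-order bits past TASK_SIZE" point concrete, here is a minimal sketch. It assumes x86_64 Linux with 4-level paging, where user space ends just below 2^47; the constant and both helper names are hypothetical and are not taken from HotSpot or ZGC.

```cpp
#include <cstdint>

// Assumption: x86_64 Linux with 4-level paging, where the user-space limit
// is one page below 2^47. Other kernels and architectures use other limits.
constexpr std::uint64_t kAssumedTaskSize = (std::uint64_t{1} << 47) - 4096;

// Hypothetical helper: true if the address lies at or above the assumed limit.
constexpr bool is_past_task_size(std::uint64_t addr) {
  return addr >= kAssumedTaskSize;
}

// Hypothetical "colored pointer": an ordinary address with one extra
// high-order metadata bit set. This is not the actual ZGC bit layout.
constexpr std::uint64_t with_high_bit(std::uint64_t addr, unsigned bit) {
  return addr | (std::uint64_t{1} << bit);
}
```

Under these assumptions, an ordinary user-space address stays below the limit, while the same address with, say, bit 62 set lands past it, which is the situation that leads to the SI_KERNEL fault described above.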

As the signal handling code is written today, we don't "stop" this signal, and instead try to handle it as an implicit null check. This causes hard-to-debug error messages and crashes in code that incorrectly tries to deoptimize the faulty code.

I propose that we short-cut the signal handling code, and let this problematic SIGSEGV get passed to VMError::report_and_die.

We've been running with this patch in the Generational ZGC repository for over a year, without any problems.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8294003: Don't handle si_addr == 0 && si_code == SI_KERNEL SIGSEGVs

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10340/head:pull/10340
$ git checkout pull/10340

Update a local copy of the PR:
$ git checkout pull/10340
$ git pull https://git.openjdk.org/jdk pull/10340/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10340

View PR using the GUI difftool:
$ git pr show -t 10340

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10340.diff

@stefank
Member Author

stefank commented Sep 19, 2022

/label add hotspot

@bridgekeeper

bridgekeeper bot commented Sep 19, 2022

👋 Welcome back stefank! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added rfr Pull request is ready for review hotspot hotspot-dev@openjdk.org labels Sep 19, 2022
@openjdk

openjdk bot commented Sep 19, 2022

@stefank
The hotspot label was successfully added.

@mlbridge

mlbridge bot commented Sep 19, 2022

Webrevs

Member

@dholmes-ora dholmes-ora left a comment


This seems quite reasonable. We've had a few SI_KERNEL crash reports since the original 32-bit Linux issue was reported. Short-circuiting the processing makes complete sense.

Thanks.

@openjdk

openjdk bot commented Sep 19, 2022

@stefank This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8294003: Don't handle si_addr == 0 && si_code == SI_KERNEL SIGSEGVs

Reviewed-by: dholmes, shade, dlong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 50 new commits pushed to the master branch:

  • 83abfa5: 8255670: Improve C2's detection of modified nodes
  • 5652030: 8292376: A few Swing methods use inheritDoc on exceptions which are not inherited
  • 03f287d: 8293995: Problem list sun/tools/jstatd/TestJstatdRmiPort.java on all platforms because of 8293577
  • d5bee4a: 8294086: RISC-V: Cleanup InstructionMark usages in the backend
  • 47f233a: 8292202: modules_do is called without Module_lock
  • 742bc04: 8294100: RISC-V: Move rt_call and xxx_move from SharedRuntime to MacroAssembler
  • 2283c32: 8294149: JMH 1.34 and later requires jopt-simple 5.0.4
  • 9f90eb0: 8294062: Improve parsing performance of j.l.c.MethodTypeDesc
  • c6be2cd: 8293156: Dcmd VM.classloaders fails to print the full hierarchy
  • 711e252: 8294039: Remove "Classpath" exception from java/awt tests
  • ... and 40 more: https://git.openjdk.org/jdk/compare/8082c24a0df3f4861ea391266bdfe6cdd1a77bab...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 19, 2022
@dean-long
Member

Is there a way to detect this on Windows too? https://bugs.openjdk.org/browse/JDK-8293832 looks like it could be because of high bits with ZGC.

@stefank
Member Author

stefank commented Sep 20, 2022

> Is there a way to detect this on Windows too? https://bugs.openjdk.org/browse/JDK-8293832 looks like it could be because of high bits with ZGC.

I don't know.

Single-generational ZGC places most bits in "dereferenceable" memory. We only use one bit outside of that, and that bit is specifically for dealing with finalizers.

I took a look at the linked bug. On Windows we get 0xffffffff reported as the failing address. Note that even with my patch, we still get the incorrect address reported on Linux. I don't know if that's fixable. My patch only makes sure that we don't try to continue running and then hit one of those very misleading secondary failures.

Member

@tstuefe tstuefe left a comment


The only question is: does distinguishing between 32-bit and 64-bit still make sense then? Both crash, only with different error texts. In the range of possible explanations (malformed address, XEN bug, ...) there is nothing that would explain a different behavior for 32-bit.

@stefank
Member Author

stefank commented Sep 20, 2022

Oracle doesn't build and test 32-bit versions anymore, so I can't effectively verify whether this is needed or not. I wouldn't mind if someone else takes ownership of investigating whether this is still needed for 32-bit.

@dholmes-ora
Member

The original issue we targeted on 32-bit was a kernel problem. Is it even possible to have an address past TASK_SIZE on 32-bit?

@tstuefe
Member

tstuefe commented Sep 20, 2022

> The original issue we targeted on 32-bit was a kernel problem. Is it even possible to have an address past TASK_SIZE on 32-bit?

I think yes. TASK_SIZE seems to ultimately be PAGE_OFFSET, which on 32-bit is a high address like 0xB0000000 or so.

@shipilev
Member

shipilev commented Sep 21, 2022

I think x86_32 can/should do the same, because faulting on a bona fide incorrect address currently produces a misleading error, see below. From my reading of JDK-8015837, JDK-8004124 and related issues, it looks like this code was added for x86_32 to better handle a kernel bug with exec-shield emulation on hardware without the NX bit. But even then, "better handle" seems to be only about crashing with a more precise message.

I think only ancient hardware runs without NX, and most kernels where this bug would otherwise appear are long dead. So, I think we should favor faulting with a proper error instead of telling (potentially misleading) things about "unstable signal handling".

$ lscpu
Model name:                      Intel(R) Atom(TM) CPU Z530   @ 1.60GHz

$ cat /etc/debian_version 
11.5

$ jdk/bin/java -version
openjdk version "20-testing" 2023-03-21
OpenJDK Runtime Environment (build 20-testing-builds.shipilev.net-openjdk-jdk-b210-20220919)
OpenJDK Server VM (build 20-testing-builds.shipilev.net-openjdk-jdk-b210-20220919, mixed mode, sharing)

$ cat Crash.java 
import java.lang.reflect.*;
import sun.misc.Unsafe;

public class Crash {
  public static void main(String... args) throws Exception {
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe u = (Unsafe) f.get(null);
    u.getInt(-1L); // 0xF....F
  }
}

$ jdk/bin/java Crash.java
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (os_linux_x86.cpp:227), pid=1033, tid=1034
#  fatal error: An irrecoverable SI_KERNEL SIGSEGV has occurred due to unstable signal handling in this distribution.
#

@shipilev
Member

shipilev commented Sep 21, 2022

> I think x86_32 can/should do the same, because faulting on a bona fide incorrect address currently produces a misleading error, see below.

So I think we can just drop the entire #ifndef block:

diff --git a/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp b/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp
index 31afbe696a2..9cd0b9a8b58 100644
--- a/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp
+++ b/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp
@@ -220,17 +220,9 @@ bool PosixSignals::pd_hotspot_signal_handler(int sig, siginfo_t* info,
     pc = (address) os::Posix::ucontext_get_pc(uc);
 
     if (sig == SIGSEGV && info->si_addr == 0 && info->si_code == SI_KERNEL) {
-#ifndef AMD64
-    // Halt if SI_KERNEL before more crashes get misdiagnosed as Java bugs
-    // This can happen in any running code (currently more frequently in
-    // interpreter code but has been seen in compiled code)
-      fatal("An irrecoverable SI_KERNEL SIGSEGV has occurred due "
-            "to unstable signal handling in this distribution.");
-#else
       // An irrecoverable SI_KERNEL SIGSEGV has occurred.
       // It's likely caused by dereferencing an address larger than TASK_SIZE.
       return false;
-#endif
     }
 
     // Handle ALL stack overflow variations here

On the test above, x86_32 failure before:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (os_linux_x86.cpp:227), pid=1007, tid=1008
#  fatal error: An irrecoverable SI_KERNEL SIGSEGV has occurred due to unstable signal handling in this distribution.
#
# JRE version: OpenJDK Runtime Environment (20.0) (build 20-testing-builds.shipilev.net-openjdk-jdk-b210-20220919)
# Java VM: OpenJDK Server VM (20-testing-builds.shipilev.net-openjdk-jdk-b210-20220919, mixed mode, sharing, tiered, serial gc, linux-x86)
# Problematic frame:
# V  [libjvm.so+0xa095be]  PosixSignals::pd_hotspot_signal_handler(int, siginfo_t*, ucontext_t*, JavaThread*)+0x40e
...
---------------  T H R E A D  ---------------

Current thread (0xb6a162d0):  JavaThread "main" [_thread_in_vm, id=1008, stack(0xb6bda000,0xb6c2b000)]

Stack: [0xb6bda000,0xb6c2b000],  sp=0xb6c29810,  free space=318k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xa095be]  PosixSignals::pd_hotspot_signal_handler(int, siginfo_t*, ucontext_t*, JavaThread*)+0x40e  (os_linux_x86.cpp:227)
V  [libjvm.so+0xb477fa]  JVM_handle_linux_signal+0x15a  (signals_posix.cpp:655)
V  [libjvm.so+0xb47a23]  javaSignalHandler(int, siginfo_t*, void*)+0x23  (signals_posix.cpp:683)
C  [linux-gate.so.1+0x570]  __kernel_rt_sigreturn+0x0
J 860  jdk.internal.misc.Unsafe.getInt(Ljava/lang/Object;J)I java.base@20-testing (0 bytes) @ 0xaf3706e3 [0xaf370620+0x000000c3]
j  jdk.internal.misc.Unsafe.getInt(J)I+3 java.base@20-testing
j  sun.misc.Unsafe.getInt(J)I+4 jdk.unsupported@20-testing
j  Crash.main([Ljava/lang/String;)V+26

x86_32 failure after:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xb78f5f53, pid=710, tid=711
#
# JRE version: OpenJDK Runtime Environment (20.0) (build 20-internal-adhoc.buildbot.openjdk-jdk)
# Java VM: OpenJDK Server VM (20-internal-adhoc.buildbot.openjdk-jdk, mixed mode, sharing, tiered, serial gc, linux-x86)
# Problematic frame:
# V  [libjvm.so+0xc35f53]  Unsafe_GetInt+0xa3

Current thread (0xb6a162c0):  JavaThread "main" [_thread_in_vm, id=711, stack(0xb6b6b000,0xb6bbc000)]

Stack: [0xb6b6b000,0xb6bbc000],  sp=0xb6bbacf0,  free space=319k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xc35f53]  Unsafe_GetInt+0xa3  (unsafe.cpp:223)
J 884  jdk.internal.misc.Unsafe.getInt(Ljava/lang/Object;J)I java.base@20-internal (0 bytes) @ 0xaf372063 [0xaf371fa0+0x000000c3]
j  jdk.internal.misc.Unsafe.getInt(J)I+3 java.base@20-internal
j  sun.misc.Unsafe.getInt(J)I+4 jdk.unsupported@20-internal
j  Crash.main([Ljava/lang/String;)V+26

The current hs_err does not have a siginfo printout, while the hs_err with the patch prints the proper one:

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x00000000

@coleenp
Contributor

coleenp commented Sep 21, 2022

Thanks @shipilev for the comment about the origin of this change. We used to see this error a LOT randomly and it was always painful to diagnose, but I agree that this hardware/config is likely long gone and we can remove this special message.

Member

@shipilev shipilev left a comment


(Putting a formal review comment)
We would be better off doing the same for x86_32, as per #10340 (comment)

@stefank
Member Author

stefank commented Sep 22, 2022

Thanks for the investigations, comments, and reviews! I'll remove the 32-bit ifdefs.

Member

@shipilev shipilev left a comment


Looks fine to me!

Member

@dholmes-ora dholmes-ora left a comment


Ok

@stefank
Member Author

stefank commented Sep 22, 2022

/integrate

@openjdk

openjdk bot commented Sep 22, 2022

Going to push as commit d781ab0.
Since your change was applied there have been 53 commits pushed to the master branch:

  • a216960: 8294087: RISC-V: RVC: Fix a potential alignment issue and add more alignment assertions for the patchable calls/nops
  • 3fa6778: 8292296: Use multiple threads to process ParallelGC deferred updates
  • 800e68d: 8292044: HttpClient doesn't handle 102 or 103 properly
  • 83abfa5: 8255670: Improve C2's detection of modified nodes
  • 5652030: 8292376: A few Swing methods use inheritDoc on exceptions which are not inherited
  • 03f287d: 8293995: Problem list sun/tools/jstatd/TestJstatdRmiPort.java on all platforms because of 8293577
  • d5bee4a: 8294086: RISC-V: Cleanup InstructionMark usages in the backend
  • 47f233a: 8292202: modules_do is called without Module_lock
  • 742bc04: 8294100: RISC-V: Move rt_call and xxx_move from SharedRuntime to MacroAssembler
  • 2283c32: 8294149: JMH 1.34 and later requires jopt-simple 5.0.4
  • ... and 43 more: https://git.openjdk.org/jdk/compare/8082c24a0df3f4861ea391266bdfe6cdd1a77bab...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 22, 2022
@openjdk openjdk bot closed this Sep 22, 2022
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Sep 22, 2022
@openjdk openjdk bot removed the rfr Pull request is ready for review label Sep 22, 2022
@openjdk

openjdk bot commented Sep 22, 2022

@stefank Pushed as commit d781ab0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@tstuefe
Member

tstuefe commented Sep 22, 2022

A small nit remains: why do we even need this section at all?

We get a SIGSEGV with si_addr=0. The VM assumes this to be an implicit null check if the signal is SIGSEGV and the PC makes sense (interpreter, code blob, or vtable stub).

Would it not be cleaner to add a check at that point for info->si_code != SI_KERNEL? E.g. before calling SharedRuntime::continuation_for_implicit_exception?
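
The alternative suggested here could look roughly like the following sketch, which folds the SI_KERNEL test into the implicit-null-check decision itself. The function name and the `pc_in_generated_code` flag are hypothetical stand-ins for the VM's real checks; this is not how `SharedRuntime::continuation_for_implicit_exception` is actually guarded.

```cpp
#include <csignal>

#ifndef SI_KERNEL
#define SI_KERNEL 0x80  // Linux value; fallback for non-Linux headers only
#endif

// Hypothetical predicate: should the fault be treated as an implicit null check?
// pc_in_generated_code stands in for "the PC makes sense (interpreter, code
// blob, or vtable stub)".
bool may_be_implicit_null_check(int sig, int si_code, bool pc_in_generated_code) {
  return sig == SIGSEGV
      && pc_in_generated_code
      && si_code != SI_KERNEL;  // the extra check proposed in this comment
}
```

With this shape, an SI_KERNEL fault is never routed to the implicit exception path, so it would fall through to normal crash reporting without needing a separate early return.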

Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
6 participants