Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8286182: C2: crash with SIGFPE when executing compiled code #8726

Closed

Conversation

TheRealMDoerr
Copy link
Contributor

@TheRealMDoerr TheRealMDoerr commented May 16, 2022

The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from JDK-8284742.
This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 reviewer)

Issue

  • JDK-8286182: C2: crash with SIGFPE when executing compiled code

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8726/head:pull/8726
$ git checkout pull/8726

Update a local copy of the PR:
$ git checkout pull/8726
$ git pull https://git.openjdk.java.net/jdk pull/8726/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 8726

View PR using the GUI difftool:
$ git pr show -t 8726

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8726.diff

@bridgekeeper
Copy link

bridgekeeper bot commented May 16, 2022

👋 Welcome back mdoerr! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 16, 2022
@openjdk
Copy link

openjdk bot commented May 16, 2022

@TheRealMDoerr The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label May 16, 2022
@mlbridge
Copy link

mlbridge bot commented May 16, 2022

Webrevs

@TobiHartmann
Copy link
Member

So the fix for JDK-8257822 misses the new nodes added by JDK-8284742.

I wonder if this code needs to be adjusted as well:

// We'd like +VerifyLoopOptimizations to not believe that Mod's/Loads
// _must_ be pinned (they have to observe their control edge of course).
// Unlike Stores (which modify an unallocable resource, the memory
// state), Mods/Loads can float around. So free them up.
switch( n->Opcode() ) {
case Op_DivI:

@chhagedorn should have a look as well.

@TheRealMDoerr
Copy link
Contributor Author

So the fix for JDK-8257822 misses the new nodes added by JDK-8284742.

I wonder if this code needs to be adjusted as well:

// We'd like +VerifyLoopOptimizations to not believe that Mod's/Loads
// _must_ be pinned (they have to observe their control edge of course).
// Unlike Stores (which modify an unallocable resource, the memory
// state), Mods/Loads can float around. So free them up.
switch( n->Opcode() ) {
case Op_DivI:

@chhagedorn should have a look as well.

I was wondering about that, too. Also why we have checks for Div/ModI nodes, but not for Div/ModL at some places. We can have loops with long trip count since JDK-8223051.

@merykitty
Copy link
Member

Thanks a lot for taking a look at this. I am considering this option, too. The problem is that NoOvfDivI does not only depend on the zero-divisor check but a possible overflow check as well. So with this fix it is still possible for a SIGFPE to occur.

IIUC this trouble comes from the fact that on x86 a Div node must be pinned to its zero-divisor check but may float with regards to other control nodes. Maybe we can remove all this special handling and simply catch SIGFPE instead? The result is guaranteed to not be used in those cases so we may not worry about the correctness of the compiled code.

@TheRealMDoerr
Copy link
Contributor Author

Right, I had forgotten that x86 also raises SIGFPE in case of overflow.

Catching SIGFPE is possible. I had experimented with

--- a/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp
+++ b/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp
@@ -278,6 +278,15 @@ bool PosixSignals::pd_hotspot_signal_handler(int sig, siginfo_t* info,
                                               pc,
                                               SharedRuntime::
                                               IMPLICIT_DIVIDE_BY_ZERO);
+        if (stub == nullptr) {
+          int op = pc[0];
+          // TODO: make sure to handle all variants used by C2!
+          if (op == 0xF7) {
+            // ignore SIGPFE by speculative div
+            uc->uc_mcontext.gregs[REG_PC] += 4;
+            return true;
+          }
+        }
 #else
       if (sig == SIGFPE /* && info->si_code == FPE_INTDIV */) {
         // HACK: si_code does not work on linux 2.2.12-20!!!
diff --git a/src/hotspot/share/code/compiledMethod.cpp b/src/hotspot/share/code/compiledMethod.cpp
index 83c33408ea3..1c5e197511d 100644
--- a/src/hotspot/share/code/compiledMethod.cpp
+++ b/src/hotspot/share/code/compiledMethod.cpp
@@ -746,7 +746,7 @@ address CompiledMethod::continuation_for_implicit_exception(address pc, bool for
   int exception_offset = pc - code_begin();
   int cont_offset = ImplicitExceptionTable(this).continuation_offset( exception_offset );
 #ifdef ASSERT
-  if (cont_offset == 0) {
+  if (cont_offset == 0 && !for_div0_check) {
     Thread* thread = Thread::current();
     ResourceMark rm(thread);
     CodeBlob* cb = CodeCache::find_blob(pc);

This works and behaves like PPC64. However, we need to make sure it doesn't show up on any hot path. Otherwise, performance will be terrible.

Feel free to create a new PR if you want to propose a solution. I only opened this one to show my findings and can close it again if we decide for something else.

@TheRealMDoerr
Copy link
Contributor Author

Btw. it is possible to make the div node dependent on only one check by using the following scheme for a/b with unsigned comparison:

if (b+1 >u 1) return a/b; // <-1 or >0
else if (b==0) arithmetic_exception();
else return -a; // div by -1

It could be transformed back later to enable implicit div by 0 checks. Sounds complicated and I don't know if this makes sense.

@chhagedorn
Copy link
Member

chhagedorn commented May 17, 2022

The fix looks reasonable to address the same problems in JDK-8257822 and JDK-8248552 for the new nodes Div and Mod nodes.

The problem is that NoOvfDivI does not only depend on the zero-divisor check but a possible overflow check as well. So with this fix it is still possible for a SIGFPE to occur.

Do you have a failing test for this case? We've been seeing a lot of SIGFPE failures lately with Java Fuzzer. I have to walk through them to see if I can find a case that is still failing with this fix. Will get back with the result of this analysis.

IIUC this trouble comes from the fact that on x86 a Div node must be pinned to its zero-divisor check but may float with regards to other control nodes. Maybe we can remove all this special handling and simply catch SIGFPE instead? The result is guaranteed to not be used in those cases so we may not worry about the correctness of the compiled code.

I'm not sure if we should rely on signal catching to fix the cases where a division is wrongly floating above its zero check. I think we should not intentionally leave a graph in a broken state with the intention to fix it later at runtime.

@merykitty
Copy link
Member

merykitty commented May 18, 2022

Do you have a failing test for this case? We've been seeing a lot of SIGFPE failures lately with Java Fuzzer. I have to walk through them to see if I can find a case that is still failing with this fix. Will get back with the result of this analysis.

Overflow only happens with dividend = MIN_VALUE and divisor = -1 so it would be hard to have a failing test for this case. I will try to come up with one.

I'm not sure if we should rely on signal catching to fix the cases where a division is wrongly floating above its zero check. I think we should not intentionally leave a graph in a broken state with the intention to fix it later at runtime.

IIUC the graph is broken only because the division raises SIGFPE for invalid inputs. If we choose to ignore the signal instead then we can treat the Div nodes similar to how we treat other nodes such as Add or Sub and let them freely float around.

Thanks.

@TobiHartmann
Copy link
Member

I completely agree with Christian, we should not let the division float above its zero check for various reasons:

  • The graph is in a broken/inconsistent state that might lead to all kinds of subsequent issues. For example, if the division floats upwards, the divisor can become statically known to be zero. The division could then be replaced by TOP which would propagate downwards while the zero check is not necessarily removed as well, leading to an unschedulable graph. The same can happen with overflows.
  • Catching the SIGFPE has an impact on performance and can potentially happen in a hot path.
  • There is a risk of ignoring a "real" SIGFPE caused by a C2 bug. For example, if C2 erroneously removes the zero check, we would just continue execution on a SIGFPE, making things worse by essentially converting a crash to a wrong execution issue that might lead to all kinds of weird failures that are hard to debug.

Given that this issue leads to massive failures in our fuzzer testing, I would suggest to back out JDK-8284742 for now and properly re-implement it.

@TheRealMDoerr
Copy link
Contributor Author

I agree, too. Catching SIGFPE was an interesting experiment, but I didn't want to propose it as fix.
If we decide for a complete backout of JDK-8284742, we need to backout JDK-8285390 as well. (Alternatively, we could only revert the x86 parts. Current implementation works on PPC64.)

@chhagedorn
Copy link
Member

I've checked all our fuzzer crashes and the current fix works for all of them. But given that there is potentially another issue with the overflow check mentioned by @merykitty, we could indeed think about backing the changes out for JDK 19 as we are getting closer to the fork. I'd suggest to wait and see for now if @merykitty was able to find a case where we still crash.

@TheRealMDoerr
Copy link
Contributor Author

While evaluation is ongoing I have thought a bit about making the div node dependent on a single check which avoids all SIGFPE cases:

diff --git a/src/hotspot/share/opto/parse3.cpp b/src/hotspot/share/opto/parse3.cpp
index 7e0f854436b..9d4226571ae 100644
--- a/src/hotspot/share/opto/parse3.cpp
+++ b/src/hotspot/share/opto/parse3.cpp
@@ -484,26 +484,30 @@ void Parse::do_divmod_fixup() {
     return;
   }
 
-  // The generated graph is equivalent to (in2 == -1) ? -in1 : (in1 / in2)
-  // we need to have a separate branch for in2 == -1 due to the special
-  // case of min_jint / -1
-  Node* cmp = _gvn.transform(CmpNode::make(in2, _gvn.integercon(-1, bt), bt));
-  Node* bol = Bool(cmp, BoolTest::eq);
-  IfNode* iff = create_and_map_if(control(), bol, PROB_UNLIKELY_MAG(3), COUNT_UNKNOWN);
+  // Check if in2 < -1 or > 1 by one CmpUNode (in2 + 1 >u 1). Reason:
+  // Div node will have to get pinned, but we can only pin to one region.
+  Node* inc = _gvn.transform(AddNode::make(in2, _gvn.integercon(1, bt), bt));
+  Node* cmp = _gvn.transform(CmpUNode::make(inc, _gvn.integercon(1, bt), bt));
+
+  Node* bol = Bool(cmp, BoolTest::gt);
+  IfNode* iff = create_and_map_if(control(), bol, PROB_MAX, COUNT_UNKNOWN);
   Node* iff_true = IfTrue(iff);
   Node* iff_false = IfFalse(iff);
+
+  Node* res_slow = generate_division(_gvn, iff_true, in1, in2, bc);
+
   Node* res_fast = (bc == Bytecodes::_idiv || bc == Bytecodes::_ldiv)
                    ? _gvn.transform(SubNode::make(_gvn.zerocon(bt), in1, bt))
                    : _gvn.zerocon(bt);
-  Node* res_slow = generate_division(_gvn, iff_false, in1, in2, bc);
+
   Node* merge = new RegionNode(3);
   merge->init_req(1, iff_true);
   merge->init_req(2, iff_false);
   record_for_igvn(merge);
   set_control(_gvn.transform(merge));
   Node* res = new PhiNode(merge, Type::get_const_basic_type(bt));
-  res->init_req(1, res_fast);
-  res->init_req(2, res_slow);
+  res->init_req(1, res_slow);
+  res->init_req(2, res_fast);
   res = _gvn.transform(res);
   push_result(*this, res, bt);
 }

This effectively disables implicit div by 0 checks, but I believe they could get repaired. Are there any opinions about this idea?

@merykitty
Copy link
Member

I have made a backout for the changes made by JDK-8285390 and JDK-8284742 in #8774 , a more appropriate fix would be submitted for jdk20 as suggested by Tobias and Christian. Thanks.

@TheRealMDoerr
Copy link
Contributor Author

Closing in favor of #8774.

@TheRealMDoerr TheRealMDoerr deleted the 8286182_C2_SIGFPE branch May 18, 2022 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

4 participants