
Transient fatal error when building spring-graalvm-native with GraalVM 20.2.0 #2748

Closed
sdeleuze opened this issue Aug 12, 2020 · 38 comments
Labels: bug, native-image, spring (spring related issue)
@sdeleuze
Collaborator

We sometimes see this kind of transient error on our CI across various samples with GraalVM 20.2.0-dev:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f992032fafb, pid=1073, tid=0x00007f98521fc700
#
# JRE version: OpenJDK Runtime Environment (8.0_262-b10) (build 1.8.0_262-b10)
# Java VM: OpenJDK 64-Bit Server VM GraalVM 20.2.0-dev (25.262-b10-jvmci-20.2-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6d0afb]  JfrEventClassTransformer::on_klass_creation(InstanceKlass*&, ClassFileParser&, Thread*)+0x92b
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/build/80754af9/spring-graalvm-native/spring-graalvm-native-samples/webflux-netty/target/native-image/hs_err_pid1073.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/oracle/graal/issues
#
Error: Image build request failed with exit status 134
com.oracle.svm.driver.NativeImage$NativeImageError: Image build request failed with exit status 134
	at com.oracle.svm.driver.NativeImage.showError(NativeImage.java:1558)
	at com.oracle.svm.driver.NativeImage.build(NativeImage.java:1308)
	at com.oracle.svm.driver.NativeImage.performBuild(NativeImage.java:1269)
	at com.oracle.svm.driver.NativeImage.main(NativeImage.java:1228)
real	0m2.883s
user	0m7.934s
sys	0m0.527s
FAILURE: an error occurred when compiling the native-image

It happens with Java 8 (https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java8-key-samples-build/builds/476 or https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java8-key-samples-build/builds/478) and also with Java 11 (https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java11-key-samples-build/builds/420).

We use up-to-date builds of release/graal-vm/20.2 with openjdk-8u262+10-jvmci-20.2-b03 and labsjdk-ce-11.0.8+10-jvmci-20.2-b03.

Despite the message, I don't see any hs_err_pid1073.log error file on our CI machine.
I have enabled core dumps on the CI in order to try to provide more information.
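
For reference, this is roughly how core dumps were enabled in the shell that launches the build, following the hint in the crash banner; the second command is only a placeholder for whatever the CI actually invokes:

ulimit -c unlimited   # remove the core dump size limit for this shell
./build.sh            # placeholder for the command that ends up running native-image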

Please tag this issue with the spring label.

@eginez added the spring (spring related issue) label Aug 12, 2020
@eginez
Contributor

eginez commented Aug 12, 2020

Thanks for the report @sdeleuze.
The builds above are using:
Graal commit: release/graal-vm/20.2
with openjdk-8u262+10-jvmci-20.2-b03
and labsjdk-ce-11.0.8+10-jvmci-20.2-b03

@eginez
Contributor

eginez commented Aug 12, 2020

@dougxc do you think this is possibly related to the problem observed in #2732?

@dougxc
Member

dougxc commented Aug 12, 2020

Unlikely - the deadlock in #2732 should never result in a crash.
This looks related to https://bugs.openjdk.java.net/browse/JDK-8245283 and/or https://bugs.openjdk.java.net/browse/JDK-8232997.
Either way, it's hard to diagnose further without the hs-err file.

@dougxc
Member

dougxc commented Aug 12, 2020

Despite the description above, I think the crashes are only happening in Java 8. The crash in https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java11-key-samples-build/builds/420 is:

# JRE version: OpenJDK Runtime Environment (8.0_262-b10) (build 1.8.0_262-b10)
# Java VM: OpenJDK 64-Bit Server VM GraalVM 20.2.0-dev (25.262-b10-jvmci-20.2-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6cda34]  position_stream_after_cp(ClassFileStream*)+0x44

This is the first OpenJDK 8 update release to include a backport of JFR. It's possible that the JFR bug fixes mentioned above were not included in this backport (@adinn @zakkak ).

@adinn
Collaborator

adinn commented Aug 12, 2020

@dougxc JDK-8245283 has not been backported to jdk11u, so it might cause a failure on jdk11u. However, it's not going to be the cause of the problem on jdk8u. The patch for that bug skips over a JVM_CONSTANT_Dynamic constant pool entry, but that tag was only added to the classfile spec in release 11.

This may perhaps be the same error as was seen in JDK-8232997. That error was resolved by changes made for a related issue, JDK-8230400 (of which JDK-8232997 was marked as a duplicate). The latter had a two-part fix.

The first part avoids a potential error when comparing old and new methods in the case where the method arrays may not have the same size. The change was backported to jdk11u, but the method comparison is still present in jdk8u. This may well be what is causing the SEGV. However, for this to be the cause of the crash I think the user has to have installed their own agent in the runtime. Is that the case here?

The second part of the fix was pushed as a fix to JDK-8233111. It does not appear to have been backported to either jdk11u or jdk8u. I'm not sure whether it could be the cause of the SEGV but I suppose it is possible. It certainly seems to relate to running a Java JIT in the JVM.

@dougxc
Member

dougxc commented Aug 12, 2020

There is an SVM agent in play although I don't know what it does (@peter-hofer ?):
https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java8-key-samples-build/builds/476#L5ec26892:688

@adinn
Copy link
Collaborator

adinn commented Aug 12, 2020

Ah yes. That agent is responsible for adding code to manage class and instance init, tracking init at build time and maybe deferring it to runtime. It can add a new <clinit> method. That might be enough to cause the problem.

@peter-hofer
Member

There is an SVM agent in play although I don't know what it does (@peter-hofer ?):
https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java8-key-samples-build/builds/476#L5ec26892:688

The agent that I think you're referring to (-javaagent:/opt/java/jre/lib/svm/builder/svm.jar) is NativeImageBytecodeInstrumentationAgent, perhaps @vjovanov can comment on this.

@vjovanov
Member

As @adinn said, the agent is there to add the <clinit> method and instrumentation. First, I would not use that in CI, as it is only meant for debugging by users. Second, we are narrowing down -H:+TraceClassInitialization to use a regular expression. When we do that, only a few classes will be instrumented and the impact should be minimal.

I would advise removing -H:+TraceClassInitialization from the CI (it will soon no longer be possible anyway) and seeing if the issue persists.
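
For illustration only, removing the flag just means dropping it from the native-image invocation; apart from -H:+TraceClassInitialization itself, the class path and main class below are placeholders:

# before (placeholder invocation)
native-image -H:+TraceClassInitialization -cp target/classes:deps/* com.example.DemoApplication
# after: the same invocation without the class initialization tracing flag
native-image -cp target/classes:deps/* com.example.DemoApplication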

@sdeleuze
Collaborator Author

@vjovanov I reproduced the issue locally (without -H:+TraceClassInitialization, by the way) and sent you the hs_err_pid1073.log file.

sdeleuze added a commit to spring-attic/spring-native that referenced this issue Aug 30, 2020
@vjovanov
Member

hs_err_pid26455.log

@sdeleuze
Collaborator Author

As pointed out by @vjovanov, it seems to be JFR related, but we don't enable anything JFR-specific on our side; we just ship JFR support via https://github.com/spring-projects/spring-framework/blob/master/spring-core/src/main/java/org/springframework/core/metrics/jfr/FlightRecorderApplicationStartup.java, which is referenced nowhere.

But in our builds I noticed that /opt/java/jre/lib/jvmci/graal-truffle-jfr-impl.jar is on the classpath without being added by us. Could it be a side effect of JDK default activation (see https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8246384)?

Is there a workaround? --disable-jfr is a JDK build flag, right?

@adinn
Collaborator

adinn commented Sep 1, 2020

That's a very bizarre SEGV.

The hs_err file shows the crash point is early in method position_stream_after_cp(). This is JFR code which modifies loaded class bytecode for class Event or one of its subclasses (I suspect that in this case it is actually the latter, given that the call to position_stream_after_cp happens 0x491 bytes into method JfrEventClassTransformer::on_klass_creation()). In any case the problem is not in the transformation code itself, because it is happening before any bytes get modified. The problem lies with the bytecode stream for the class to be transformed that is passed to position_stream_after_cp().

The offending routine is passed a ClassFileStream, i.e. a stream over a byte array in classfile format that is handed over by the class loader. The code works its way through the mapped byte array, skipping over the header and constant pool data. Its purpose is to leave the stream pointing at the next block of data, which identifies class fields (methods come after that).

static u2 position_stream_after_cp(ClassFileStream* stream) {
  assert(stream != NULL, "invariant");
  assert(stream->current_offset() == 0, "invariant");
  stream->skip_u4_fast(2);  // 8 bytes skipped
  const u2 cp_len = stream->get_u2_fast();
  assert(cp_len > 0, "invariant");
  // now spin the stream position to just after the constant pool
  for (u2 index = 1; index < cp_len; ++index) {
    const u1 tag = stream->get_u1_fast(); // cp tag
    switch (tag) {
      case JVM_CONSTANT_Class:
      case JVM_CONSTANT_String: {
        stream->skip_u2_fast(1); // skip 2 bytes
        continue;
      }
   ...

The machine code listed in the hs_err file immediately around the crash pc (0x7f0fb20eaa34) disassembles as follows:

...
0x7f0fb20eaa22:	nopw   0x0(%rax,%rax,1)
0x7f0fb20eaa28:	lea    0x1(%rdx),%r9
0x7f0fb20eaa2c:	lea    0x1(%rsi),%r11d
0x7f0fb20eaa30:	mov    %r9,0x10(%rdi)
0x7f0fb20eaa34:	movzbl (%rdx),%r8d
0x7f0fb20eaa38:	cmp    $0x12,%r8b
0x7f0fb20eaa3c:	ja     0x7f0fb20eaab0
...

This is 0x44 bytes into the method, so it appears to be at the start of the loop that reads a tag byte from the constant pool and then skips the relevant number of data bytes depending on the tag. The faulting instruction appears to be at the start of the loop, where the tag byte is being range checked to see if it is above 0x12, the highest possible tag.

Looking at related info in the hs_err file, the address in rdx from which the byte is being read is 0x7f0ec02f736a

RDX=0x00007f0ec02f736a is an unknown value

and the SEGV is an access error

siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x00007f0ec02f736a

So, although 0x00007f0ec02f736a does not look like a wildly wrong address for the classfile byte array, it clearly must be, since it is not in a segment mapped for READ access.

Now this ClassFileStream pointer has come out of core JVM code (the entry to the JFR on_klass_creation method comes under java.lang.ClassLoader.defineClass1 via SystemDictionary::resolve_from_stream). It is hard to see how that might relate to any class file transformer employed by GraalVM. As Hamlet said, something is rotten in the state of Denmark. Diagnosing this will require a reliable reproducer on a debuginfo build of OpenJDK so the JVM can be debugged.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Would it be possible to share JVMCI builds without JFR enabled, in order to check whether that fixes the issue?

@adinn
Collaborator

adinn commented Sep 1, 2020

You may not need to do that. Your suggestion of disabling JFR is worth trying first. See what happens when you run the native image build with --disable-jfr (you will probably need to pass -J--disable-jfr). That should stop the JFR class transformation code being invoked.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

I am not sure; when I specify -J--disable-jfr I get the message Unrecognized option: --disable-jfr. Maybe --disable-jfr is a flag to be used when building the JVM, not when running it ...

@adinn
Collaborator

adinn commented Sep 1, 2020

Oops, apologies. Of course the correct JVM option is -XX:-FlightRecorder.
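
For anyone following along, the native-image launcher forwards options prefixed with -J to the builder JVM, so the flag would be passed along these lines (the class path and main class are placeholders):

native-image -J-XX:-FlightRecorder -cp target/classes:deps/* com.example.DemoApplication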

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Thanks for the tip, I am starting a few full builds on our CI with this option to see if that solves the issue. I will keep you informed.

sdeleuze added a commit to spring-attic/spring-native that referenced this issue Sep 1, 2020
@sdeleuze changed the title from "Transient fatal error when building spring-graalvm-native with GraalVM 20.2.0-dev" to "Transient fatal error when building spring-graalvm-native with GraalVM 20.2.0" Sep 1, 2020
@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Same error with -J-XX:-FlightRecorder, so this flag is not sufficient. I guess the next step would be testing a JDK without JFR support, since the failures started (approximately) when this support was introduced in OpenJDK.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Do we have a way to remove graal-truffle-jfr-impl.jar from the classpath? We could at least do it when building GraalVM on our CI. That would not help our users who will use stock 20.2.0, but it would be useful to see if that fixes the issue.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

I saw a similar error in #2164 and the related JDK-8232997, cc @gilles-duboscq. Related?

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Trying a new build with -Djfr.unsupported.vm=true.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Using -J-Djfr.unsupported.vm=true introduces other issues, so I am rolling back this change.

@dougxc
Member

dougxc commented Sep 2, 2020

Diagnosing this will require a reliable reproducer on a debuginfo build of OpenJDK so the JVM can be debugged

@sdeleuze you could take the libjvm.so file from https://github.com/graalvm/graal-jvmci-8/releases/download/jvmci-20.2-b03/openjdk-8u262+10-jvmci-20.2-b03-fastdebug-linux-amd64.tar.gz and replace the same file in the GraalVM 20.2.0 binary you are using.
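
A rough sketch of that swap, assuming a Linux amd64 GraalVM based on JDK 8; GRAALVM_HOME and the exact location of libjvm.so inside the distribution are assumptions to adapt locally:

cd /tmp
wget https://github.com/graalvm/graal-jvmci-8/releases/download/jvmci-20.2-b03/openjdk-8u262+10-jvmci-20.2-b03-fastdebug-linux-amd64.tar.gz
tar xzf openjdk-8u262+10-jvmci-20.2-b03-fastdebug-linux-amd64.tar.gz
# back up the original library, then drop in the fastdebug build (paths are assumptions)
cp "$GRAALVM_HOME"/jre/lib/amd64/server/libjvm.so "$GRAALVM_HOME"/jre/lib/amd64/server/libjvm.so.bak
cp openjdk-8u262*fastdebug*/jre/lib/amd64/server/libjvm.so "$GRAALVM_HOME"/jre/lib/amd64/server/libjvm.so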

sdeleuze added a commit to spring-attic/spring-native that referenced this issue Sep 7, 2020
@sdeleuze
Collaborator Author

sdeleuze commented Sep 7, 2020

@dougxc We can choose which LabsJDK version we use, so I have switched to the debug variant. The bug is transient, so by nature we can't reproduce it on every native-image invocation, but we can provide a way to run the build repeatedly until the bug is hit. Let me first check what the result of this bug looks like with those new JDKs.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 8, 2020

@dougxc Could you please share some guidance on what we could do to provide you with more insight, now that we use the debug builds?

@dougxc
Member

dougxc commented Sep 8, 2020

Do you have a new hs-err file from the debug build? Maybe it provides more detail that @adinn might be able to decipher. If not, then the next step is to add assertions to the C++ code to try to work out what's going wrong. My suspicion is still that there's something wrong in the JFR backport, but I do not know that code very well.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 8, 2020

Let me reproduce the bug locally and provide you the hs-err file.

@adinn
Collaborator

adinn commented Sep 8, 2020

Hi @sdeleuze @dougxc, I think the culprit is probably this issue just reported against the JDK 8 JVM:

https://bugs.openjdk.java.net/browse/JDK-8252904

It certainly looks like a likely cause of this issue.

Andrey Petrushkov has proposed a patch for it on the jdk8u mailing list, which I am reviewing now:

https://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-September/012642.html

It looks like his patch will fix the problem but I'm not sure it is acceptable because it uses a ResourceMark in a very obscure way.

sdeleuze added a commit to spring-attic/spring-native that referenced this issue Sep 10, 2020
dsyer pushed commits to scratches/spring-graalvm-native that referenced this issue Sep 23, 2020
@sdeleuze
Collaborator Author

The related JDK issue is expected to be fixed in openjdk8u282; is it something we could benefit from in GraalVM 20.3?

@adinn
Collaborator

adinn commented Oct 27, 2020

The related JDK issue is expected to be fixed in openjdk8u282; is it something we could benefit from in GraalVM 20.3?

Sorry for not confirming earlier, but yes, you are right that a fix has already been pushed and will appear in openjdk8u282 (I reviewed the fix and the jdk8u maintainers approved it).

I am sorry I cannot help you much regarding pushing the patch to the GraalVM Oracle 'labs' jdk8 releases. I assume the fix will eventually go into a release automatically: Oracle jdk8u mostly keeps parity with openjdk8u when it comes to bug fixes, and 'labs' jdk8 is derived from Oracle jdk8. Which GraalVM release it goes into is not something I am qualified to say. Perhaps an Oracle GraalVM dev could answer this.

@dougxc self-assigned this Oct 27, 2020
@dougxc
Member

dougxc commented Oct 27, 2020

I will apply https://bugs.openjdk.java.net/browse/JDK-8252904 to graal-jvmci-8. If all goes well, it should make it into GraalVM CE 20.3, but if not, definitely into 21.0.

@sdeleuze
Collaborator Author

Thanks @dougxc, please keep us informed.

@dougxc
Member

dougxc commented Oct 30, 2020

The JDK-8252904 patch has been applied (graalvm/graal-jvmci-8@0284e18), so this will make it into 20.3.

@dougxc
Member

dougxc commented Oct 30, 2020

Please re-open if the issue still shows up in the 20.3 release.

@dougxc closed this as completed Oct 30, 2020
@sdeleuze
Collaborator Author

sdeleuze commented Nov 4, 2020

Not sure whether that's due to jvmci-20.3-b04 or not, but our builds are mostly broken; I have raised #2965 with the detailed error log.

@sdeleuze
Collaborator Author

sdeleuze commented Nov 5, 2020

I confirm the bug seems to be fixed, thanks a lot! In practice, Java 11 does not seem stable for us, and Java 8 will require a fix for #2965.

@sdeleuze
Collaborator Author

Could you please add this one to the 20.3 milestone?

@gilles-duboscq added this to the 20.3 milestone Nov 18, 2020