
Transient fatal error when building spring-graalvm-native with GraalVM 20.2.0 #2748

Closed
sdeleuze opened this issue Aug 12, 2020 · 38 comments
Labels: bug, native-image, spring (spring related issue)
@sdeleuze
Collaborator

We sometimes see this kind of transient error on our CI across various samples with GraalVM 20.2.0-dev:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f992032fafb, pid=1073, tid=0x00007f98521fc700
#
# JRE version: OpenJDK Runtime Environment (8.0_262-b10) (build 1.8.0_262-b10)
# Java VM: OpenJDK 64-Bit Server VM GraalVM 20.2.0-dev (25.262-b10-jvmci-20.2-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6d0afb]  JfrEventClassTransformer::on_klass_creation(InstanceKlass*&, ClassFileParser&, Thread*)+0x92b
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/build/80754af9/spring-graalvm-native/spring-graalvm-native-samples/webflux-netty/target/native-image/hs_err_pid1073.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/oracle/graal/issues
#
Error: Image build request failed with exit status 134
com.oracle.svm.driver.NativeImage$NativeImageError: Image build request failed with exit status 134
	at com.oracle.svm.driver.NativeImage.showError(NativeImage.java:1558)
	at com.oracle.svm.driver.NativeImage.build(NativeImage.java:1308)
	at com.oracle.svm.driver.NativeImage.performBuild(NativeImage.java:1269)
	at com.oracle.svm.driver.NativeImage.main(NativeImage.java:1228)
real	0m2.883s
user	0m7.934s
sys	0m0.527s
FAILURE: an error occurred when compiling the native-image

It happens with Java 8 (https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java8-key-samples-build/builds/476 or https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java8-key-samples-build/builds/478) and also with Java 11 (https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java11-key-samples-build/builds/420).

We use up-to-date builds of release/graal-vm/20.2 with openjdk-8u262+10-jvmci-20.2-b03 and labsjdk-ce-11.0.8+10-jvmci-20.2-b03.

Despite the message, I don't see any hs_err_pid1073.log error file on our CI machine.
I have enabled core dumps on the CI in order to try to provide more information.
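
For reference, this is roughly how core dumps were enabled in the shell that launches the build, following the hint in the crash banner; the second command is only a placeholder for whatever the CI actually invokes:

ulimit -c unlimited   # remove the core dump size limit for this shell
./build.sh            # placeholder for the command that ends up running native-image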

Please tag this issue with the spring label.

@eginez added the spring (spring related issue) label Aug 12, 2020
@eginez
Contributor

eginez commented Aug 12, 2020

Thanks for the report @sdeleuze.
The builds above are using:
Graal commit: release/graal-vm/20.2
with openjdk-8u262+10-jvmci-20.2-b03
and labsjdk-ce-11.0.8+10-jvmci-20.2-b03

@eginez
Contributor

eginez commented Aug 12, 2020

@dougxc do you think this is possibly related to the problem observed in #2732?

@dougxc
Member

dougxc commented Aug 12, 2020

Unlikely - the deadlock in #2732 should never result in a crash.
This looks related to https://bugs.openjdk.java.net/browse/JDK-8245283 and/or https://bugs.openjdk.java.net/browse/JDK-8232997.
Either way, it's hard to diagnose further without the hs-err file.

@dougxc
Member

dougxc commented Aug 12, 2020

Despite the description above, I think the crashes are only happening in Java 8. The crash in https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java11-key-samples-build/builds/420 is:

# JRE version: OpenJDK Runtime Environment (8.0_262-b10) (build 1.8.0_262-b10)
# Java VM: OpenJDK 64-Bit Server VM GraalVM 20.2.0-dev (25.262-b10-jvmci-20.2-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6cda34]  position_stream_after_cp(ClassFileStream*)+0x44

This is the first OpenJDK 8 update release to include a backport of JFR. It's possible that the JFR bug fixes mentioned above were not included in this backport (@adinn @zakkak ).

@adinn
Collaborator

adinn commented Aug 12, 2020

@dougxc JDK-8245283 has not been backported to jdk11u, so it might cause a failure on jdk11u. However, it's not going to be the cause of the problem on jdk8u. The patch for that bug skips over a JVM_CONSTANT_Dynamic constant pool entry, but that tag was only added to the classfile spec in release 11.

This may perhaps be the same error as was seen in JDK-8232997. That error was resolved by changes made for a related issue, JDK-8230400 (of which JDK-8232997 was marked as a duplicate). The latter had a two-part fix.

The first part avoids a potential error when comparing old and new methods in the case where the method arrays may not have the same size. The change was backported to jdk11u, but the method comparison is still present in jdk8u. This may well be what is causing the SEGV. However, for this to be the cause of the crash I think the user has to have installed their own agent in the runtime. Is that the case here?

The second part of the fix was pushed as a fix to JDK-8233111. It does not appear to have been backported to either jdk11u or jdk8u. I'm not sure whether it could be the cause of the SEGV but I suppose it is possible. It certainly seems to relate to running a Java JIT in the JVM.

@dougxc
Member

dougxc commented Aug 12, 2020

There is an SVM agent in play although I don't know what it does (@peter-hofer ?):
https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java8-key-samples-build/builds/476#L5ec26892:688

@adinn
Copy link
Collaborator

adinn commented Aug 12, 2020

Ah yes. That agent is responsible for adding code to manage class and instance init, tracking init at build time and maybe deferring it to runtime. It can add a new <clinit> method. That might be enough to cause the problem.

@peter-hofer
Member

There is an SVM agent in play although I don't know what it does (@peter-hofer ?):
https://ci.spring.io/teams/spring-graalvm-native/pipelines/spring-graalvm-native/jobs/java8-key-samples-build/builds/476#L5ec26892:688

The agent that I think you're referring to (-javaagent:/opt/java/jre/lib/svm/builder/svm.jar) is NativeImageBytecodeInstrumentationAgent, perhaps @vjovanov can comment on this.

@vjovanov
Member

As @adinn said, the agent is there to add the <clinit> method and instrumentation. First, I would not use that in CI, as it is only meant for debugging by users. Second, we are narrowing down -H:+TraceClassInitialization to use a regular expression. When we do that, only a few classes will be instrumented and the impact should be minimal.

I would advise removing -H:+TraceClassInitialization from the CI (it will soon no longer be possible anyway) and seeing if the issue persists.
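
For illustration only, removing the flag just means dropping it from the native-image invocation; apart from -H:+TraceClassInitialization itself, the class path and main class below are placeholders:

# before (placeholder invocation)
native-image -H:+TraceClassInitialization -cp target/classes:deps/* com.example.DemoApplication
# after: the same invocation without the class initialization tracing flag
native-image -cp target/classes:deps/* com.example.DemoApplication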

@sdeleuze
Collaborator Author

@vjovanov I reproduced the issue locally (without -H:+TraceClassInitialization, by the way) and sent you the hs_err_pid1073.log file.

sdeleuze added a commit to spring-attic/spring-native that referenced this issue Aug 30, 2020
@vjovanov
Member

hs_err_pid26455.log

@sdeleuze
Collaborator Author

As pointed out by @vjovanov, it seems to be JFR related, but we don't enable anything JFR-specific on our side; we just ship JFR support via https://github.com/spring-projects/spring-framework/blob/master/spring-core/src/main/java/org/springframework/core/metrics/jfr/FlightRecorderApplicationStartup.java, which is referenced nowhere.

But in our builds I noticed that /opt/java/jre/lib/jvmci/graal-truffle-jfr-impl.jar is on the classpath without being added by us. Could it be a side effect of JDK default activation (see https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8246384)?

Is there a workaround? --disable-jfr is a JDK build flag, right?

@adinn
Collaborator

adinn commented Sep 1, 2020

That's a very bizarre SEGV.

The hs_err file shows the crash point is early in method position_stream_after_cp(). This is JFR code which modifies loaded class bytecode for class Event or one of its subclasses (I suspect that in this case it is actually the latter, given that the call to position_stream_after_cp happens 0x491 bytes into method JfrEventClassTransformer::on_klass_creation()). In any case the problem is not in the transformation code itself, because it is happening before any bytes get modified. The problem lies with the bytecode stream for the class to be transformed that is passed to position_stream_after_cp().

The offending routine is passed a ClassFileStream, i.e. a stream over a byte array in classfile format that is handed over by the class loader. The code works its way through the mapped byte array, skipping over the header and constant pool data. Its purpose is to leave the stream pointing at the next block of data, which identifies class fields (methods come after that).

static u2 position_stream_after_cp(ClassFileStream* stream) {
  assert(stream != NULL, "invariant");
  assert(stream->current_offset() == 0, "invariant");
  stream->skip_u4_fast(2);  // 8 bytes skipped
  const u2 cp_len = stream->get_u2_fast();
  assert(cp_len > 0, "invariant");
  // now spin the stream position to just after the constant pool
  for (u2 index = 1; index < cp_len; ++index) {
    const u1 tag = stream->get_u1_fast(); // cp tag
    switch (tag) {
      case JVM_CONSTANT_Class:
      case JVM_CONSTANT_String: {
        stream->skip_u2_fast(1); // skip 2 bytes
        continue;
      }
   ...

The machine code listed in the hs_err file immediately around the crash pc (0x7f0fb20eaa34) disassembles as follows:

...
0x7f0fb20eaa22:	nopw   0x0(%rax,%rax,1)
0x7f0fb20eaa28:	lea    0x1(%rdx),%r9
0x7f0fb20eaa2c:	lea    0x1(%rsi),%r11d
0x7f0fb20eaa30:	mov    %r9,0x10(%rdi)
0x7f0fb20eaa34:	movzbl (%rdx),%r8d
0x7f0fb20eaa38:	cmp    $0x12,%r8b
0x7f0fb20eaa3c:	ja     0x7f0fb20eaab0
...

This is 0x44 bytes into the method, so it appears to be at the start of the loop that reads a tag byte from the constant pool and then skips the relevant number of data bytes depending on the tag. The faulting instruction appears to be at the start of the loop, where the tag byte is being range checked to see if it is above 0x12, the highest possible tag.

Looking at related info in the hs_err file, the address in rdx from which the byte is being read is 0x7f0ec02f736a

RDX=0x00007f0ec02f736a is an unknown value

and the SEGV is an access error

siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x00007f0ec02f736a

So, although 0x00007f0ec02f736a does not look like a wildly wrong address for the classfile byte array, it clearly must be, since it is not in a segment mapped for READ access.

Now this ClassFileStream pointer has come out of core JVM code (the entry to the JFR on_klass_creation method comes under java.lang.ClassLoader.defineClass1 via SystemDictionary::resolve_from_stream). It is hard to see how that might relate to any class file transformer employed by GraalVM. As Hamlet said, something is rotten in the state of Denmark. Diagnosing this will require a reliable reproducer on a debuginfo build of OpenJDK so the JVM can be debugged.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Would it be possible to share JVMCI builds without JFR enabled, in order to check whether that fixes the issue?

@adinn
Collaborator

adinn commented Sep 1, 2020

You may not need to do that. Your suggestion of disabling JFR is worth trying first. See what happens when you run the native image build with --disable-jfr (you will probably need to pass -J--disable-jfr). That should stop the JFR class transformation code being invoked.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

I am not sure; when I specify -J--disable-jfr I get the message Unrecognized option: --disable-jfr. Maybe --disable-jfr is a flag to be used when building the JVM, not when running it ...

@adinn
Collaborator

adinn commented Sep 1, 2020

Oops, apologies. Of course the correct JVM option is -XX:-FlightRecorder.
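
For anyone following along, the native-image launcher forwards options prefixed with -J to the builder JVM, so the flag would be passed along these lines (the class path and main class are placeholders):

native-image -J-XX:-FlightRecorder -cp target/classes:deps/* com.example.DemoApplication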

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Thanks for the tip, I am starting a few full builds on our CI with this option to see if that solves the issue. I will keep you informed.

sdeleuze added a commit to spring-attic/spring-native that referenced this issue Sep 1, 2020
@sdeleuze changed the title from "Transient fatal error when building spring-graalvm-native with GraalVM 20.2.0-dev" to "Transient fatal error when building spring-graalvm-native with GraalVM 20.2.0" Sep 1, 2020
@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Same error with -J-XX:-FlightRecorder, so this flag is not sufficient. I guess the next step would be testing a JDK without JFR support, since the failures started (approximately) when this support was introduced in OpenJDK.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Do we have a way to remove graal-truffle-jfr-impl.jar from the classpath? We could at least do it when building GraalVM on our CI. That would not help our users who will use stock 20.2.0, but it would be useful to see if that fixes the issue.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

I saw a similar error in #2164 and the related JDK-8232997, cc @gilles-duboscq. Related?

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Trying a new build with -Djfr.unsupported.vm=true.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 1, 2020

Using -J-Djfr.unsupported.vm=true introduces other issues, so I am rolling back this change.

@dougxc
Member

dougxc commented Sep 2, 2020

Diagnosing this will require a reliable reproducer on a debuginfo build of OpenJDK so the JVM can be debugged

@sdeleuze you could take the libjvm.so file from https://github.com/graalvm/graal-jvmci-8/releases/download/jvmci-20.2-b03/openjdk-8u262+10-jvmci-20.2-b03-fastdebug-linux-amd64.tar.gz and replace the same file in the GraalVM 20.2.0 binary you are using.
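
A rough sketch of that swap, assuming a Linux amd64 GraalVM based on JDK 8; GRAALVM_HOME and the exact location of libjvm.so inside the distribution are assumptions to adapt locally:

cd /tmp
wget https://github.com/graalvm/graal-jvmci-8/releases/download/jvmci-20.2-b03/openjdk-8u262+10-jvmci-20.2-b03-fastdebug-linux-amd64.tar.gz
tar xzf openjdk-8u262+10-jvmci-20.2-b03-fastdebug-linux-amd64.tar.gz
# back up the original library, then drop in the fastdebug build (paths are assumptions)
cp "$GRAALVM_HOME"/jre/lib/amd64/server/libjvm.so "$GRAALVM_HOME"/jre/lib/amd64/server/libjvm.so.bak
cp openjdk-8u262*fastdebug*/jre/lib/amd64/server/libjvm.so "$GRAALVM_HOME"/jre/lib/amd64/server/libjvm.so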

sdeleuze added a commit to spring-attic/spring-native that referenced this issue Sep 7, 2020
@sdeleuze
Collaborator Author

sdeleuze commented Sep 7, 2020

@dougxc We can choose which LabsJDK version we use, so I have switched to the debug variant. The bug is transient, so by nature we can't reproduce it on every native-image invocation, but we can provide a way to run the build repeatedly until the bug is hit. Let me first check what the result of this bug looks like with those new JDKs.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 8, 2020

@dougxc Could you please share some guidance on what we could do to provide you with more insight, now that we use the debug builds?

@dougxc
Member

dougxc commented Sep 8, 2020

Do you have a new hs-err file from the debug build? Maybe it provides more detail that @adinn might be able to decipher. If not, then the next step is to add assertions to the C++ code to try to work out what's going wrong. My suspicion is still that there's something wrong in the JFR backport, but I do not know that code very well.

@sdeleuze
Collaborator Author

sdeleuze commented Sep 8, 2020

Let me reproduce the bug locally and provide you the hs-err file.

@adinn
Collaborator

adinn commented Sep 8, 2020

Hi @sdeleuze @dougxc, I think the culprit is probably this issue just reported against the JDK 8 JVM:

https://bugs.openjdk.java.net/browse/JDK-8252904

It certainly looks like a likely cause of this issue.

Andrey Petrushkov has proposed a patch for it on the jdk8u mailing list, which I am reviewing now:

https://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-September/012642.html

It looks like his patch will fix the problem but I'm not sure it is acceptable because it uses a ResourceMark in a very obscure way.

sdeleuze added a commit to spring-attic/spring-native that referenced this issue Sep 10, 2020
dsyer pushed commits to scratches/spring-graalvm-native that referenced this issue Sep 23, 2020
@sdeleuze
Collaborator Author

The related JDK issue is expected to be fixed in openjdk8u282; is it something we could benefit from in GraalVM 20.3?

@adinn
Collaborator

adinn commented Oct 27, 2020

The related JDK issue is expected to be fixed in openjdk8u282; is it something we could benefit from in GraalVM 20.3?

Sorry for not confirming earlier, but yes, you are right that a fix has already been pushed and will appear in openjdk8u282 (I reviewed the fix and the jdk8u maintainers approved it).

I am sorry I cannot help you much regarding pushing the patch to the GraalVM Oracle 'labs' jdk8 releases. I assume the fix will eventually go into a release automatically: Oracle jdk8u mostly keeps parity with openjdk8u when it comes to bug fixes, and 'labs' jdk8 is derived from Oracle jdk8. Which GraalVM release it goes into is not something I am qualified to say. Perhaps an Oracle GraalVM dev could answer this.

@dougxc self-assigned this Oct 27, 2020
@dougxc
Member

dougxc commented Oct 27, 2020

I will apply https://bugs.openjdk.java.net/browse/JDK-8252904 to graal-jvmci-8. If all goes well, it should make it into GraalVM CE 20.3, but if not, definitely into 21.0.

@sdeleuze
Collaborator Author

Thanks @dougxc, please keep us informed.

@dougxc
Member

dougxc commented Oct 30, 2020

The JDK-8252904 patch has been applied (graalvm/graal-jvmci-8@0284e18), so this will make it into 20.3.

@dougxc
Member

dougxc commented Oct 30, 2020

Please re-open if the issue still shows up in the 20.3 release.

@dougxc closed this as completed Oct 30, 2020
@sdeleuze
Collaborator Author

sdeleuze commented Nov 4, 2020

Not sure whether that's due to jvmci-20.3-b04 or not, but our builds are mostly broken; I have raised #2965 with the detailed error log.

@sdeleuze
Collaborator Author

sdeleuze commented Nov 5, 2020

I confirm the bug seems to be fixed, thanks a lot! In practice, Java 11 does not seem stable for us, and Java 8 will require a fix for #2965.

@sdeleuze
Collaborator Author

Could you please add this one to the 20.3 milestone?

@gilles-duboscq added this to the 20.3 milestone Nov 18, 2020