Crash on GraalVM at 1.9.0-RC #4146

Open
sgammon opened this issue Jun 1, 2024 · 11 comments

@sgammon

sgammon commented Jun 1, 2024

Describe the bug

When building a GraalVM native image against coroutines 1.9.0-RC, things mostly work, but the following crash occurs under some conditions. For us, it happens when using Mosaic with the new Compose compiler built into Kotlin:

java.lang.ClassCastException
	at java.base@23/java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl.throwAccessCheckException(AtomicReferenceFieldUpdater.java:418)
	at java.base@23/java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl.accessCheck(AtomicReferenceFieldUpdater.java:409)
	at java.base@23/java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl.get(AtomicReferenceFieldUpdater.java:466)
	at kotlinx.coroutines.CancellableContinuationImpl.getParentHandle(CancellableContinuationImpl.kt:103)
	at kotlinx.coroutines.CancellableContinuationImpl.detachChild$kotlinx_coroutines_core(CancellableContinuationImpl.kt:569)
	at kotlinx.coroutines.CancellableContinuationImpl.detachChildIfNonResuable(CancellableContinuationImpl.kt:562)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:503)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
	at kotlinx.coroutines.CancellableContinuationImpl.resumeUndispatched(CancellableContinuationImpl.kt:596)
	at kotlinx.coroutines.EventLoopImplBase$DelayedResumeTask.run(EventLoop.common.kt:497)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:263)
	at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:95)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:69)
	at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
	at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:47)
	at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
	at com.jakewharton.mosaic.BlockingKt.runMosaicBlocking(blocking.kt:6)
	at elide.tool.cli.AbstractToolCommand.call(AbstractToolCommand.kt:221)
	at elide.tool.cli.AbstractToolCommand.call(AbstractToolCommand.kt:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
	at picocli.CommandLine.access$1500(CommandLine.java:148)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
	at picocli.CommandLine.execute(CommandLine.java:2174)
	at elide.tool.cli.ElideTool$Companion.exec$runtime(ElideTool.kt:232)
	at elide.tool.cli.ElideTool$Companion.main(ElideTool.kt:195)
	at elide.tool.cli.ElideTool.main(ElideTool.kt)
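
For readers unfamiliar with the top frames: in the JDK, `AtomicReferenceFieldUpdater` checks on access that the object it is handed is an instance of the class the updater was created for, and in the common case throws a `ClassCastException` without a message when it is not. The sketch below reproduces only that check on a regular JVM. `Holder`, `REF`, and `probe` are hypothetical names for illustration and are not taken from kotlinx.coroutines or atomicfu, and the sketch does not reproduce the native-image crash itself.

```java
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

public class UpdaterAccessCheckDemo {
    // Hypothetical class with a volatile reference field, as field updaters require.
    static class Holder {
        volatile Object ref = "value";
    }

    // Updater bound to Holder.ref; it only accepts Holder instances as receivers.
    static final AtomicReferenceFieldUpdater<Holder, Object> REF =
        AtomicReferenceFieldUpdater.newUpdater(Holder.class, Object.class, "ref");

    // Calls the updater through a raw type so that a wrong receiver compiles.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String probe(Object target) {
        AtomicReferenceFieldUpdater raw = REF;
        try {
            return String.valueOf(raw.get(target));
        } catch (ClassCastException e) {
            // Thrown by the updater's receiver check when target is not a Holder.
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(probe(new Holder()));    // prints "value"
        System.out.println(probe("not a Holder"));  // prints "ClassCastException"
    }
}
```

On a regular JVM, well-typed code cannot normally reach this path, since the compiler rejects a receiver of the wrong type; it takes a raw-typed call like the one above to trigger it.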

Provide a Reproducer

See here for a partner issue with detailed tracing. A reproducer is available.

sgammon added the bug label Jun 1, 2024
@fzhinkin

fzhinkin commented Jun 3, 2024

@sgammon could you please provide instructions on how to run the reproducer?

@sgammon
Author

sgammon commented Jun 5, 2024

@fzhinkin Yes, I have a reproducer. Here is one that works with a very simple native image build. It includes Coroutines 1.9.0-RC and Mosaic, where I have filed a related issue.

It seems like most of the stacktraces I can produce point to coroutines and/or atomicfu. I have much more detailed tracing in the following issues:

Reproducer:
coroutines-crash-reproducer-972.zip

I may need to file a bug with Micronaut as well if it relates to their code, but so far I don't see any hint that it does. I'm using the Micronaut project generator because it gets really close to our dependencies anyway (ours is a Micronaut/Picocli command line application).

There seem to be multiple exceptions or crashes, or maybe multiple bugs interplaying, so if I come up with other reproducers I will post them here and on the tracking ticket.

As far as I can tell, though, most hints point to either coroutines or atomicfu. Specifically, the exception shown above surfaces when the optimization level is -O2; with -Ob, the exception goes away. So maybe this is related to some optimization GraalVM is doing, but the class cast exception could still be valid, I don't know.

@fzhinkin

fzhinkin commented Jun 5, 2024

@sgammon, thank you for sharing the reproducer!

I managed to reproduce the crash locally on macos-aarch64 with Graal EE 21, 22, and 23.
However, on Linux (both aarch64 and x86_64), the issue did not show up.

For me, the issue also went away after switching the optimization level to -O1.

As a side note, the commands to run the reproducer are (taken from the Graal GitHub issue):

./gradlew nativeCompile
./build/native/nativeCompile/demo

@fzhinkin

fzhinkin commented Jun 5, 2024

For the record, the stack trace I ended up with is:

Stacktrace for the failing thread 0x0000000126e05140 (A=AOT compiled, J=JIT compiled, D=deoptimized, i=inlined):
  A  SP 0x000000016b0525f0 IP 0x0000000105abe680 size=96    java.lang.Class.getTypeName(DynamicHub.java)
  A  SP 0x000000016b052650 IP 0x00000001051f4148 size=768   com.oracle.svm.core.snippets.ImplicitExceptions.throwNewClassCastExceptionWithArgs(ImplicitExceptions.java:311)
  A  SP 0x000000016b052950 IP 0x00000001064f4164 size=96    kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:28)
  A  SP 0x000000016b0529b0 IP 0x00000001069aaaf4 size=96    kotlinx.coroutines.DispatchedTaskKt.resume(DispatchedTask.kt:229)
  i  SP 0x000000016b052a10 IP 0x00000001069a1480 size=80    kotlinx.coroutines.DispatchedTaskKt.dispatch(DispatchedTask.kt:162)
  A  SP 0x000000016b052a10 IP 0x00000001069a1480 size=80    kotlinx.coroutines.CancellableContinuationImpl.dispatchResume(CancellableContinuationImpl.kt:470)
  i  SP 0x000000016b052a60 IP 0x00000001069abee4 size=80    kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core(CancellableContinuationImpl.kt:504)
  i  SP 0x000000016b052a60 IP 0x00000001069abee4 size=80    kotlinx.coroutines.CancellableContinuationImpl.resumeImpl$kotlinx_coroutines_core$default(CancellableContinuationImpl.kt:493)
  i  SP 0x000000016b052a60 IP 0x00000001069abee4 size=80    kotlinx.coroutines.CancellableContinuationImpl.resumeUndispatched(CancellableContinuationImpl.kt:596)
  A  SP 0x000000016b052a60 IP 0x00000001069abee4 size=80    kotlinx.coroutines.EventLoopImplBase$DelayedResumeTask.run(EventLoop.common.kt:497)
  A  SP 0x000000016b052ab0 IP 0x00000001069ae57c size=48    kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:263)
  A  SP 0x000000016b052ae0 IP 0x000000010699d8f0 size=80    kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:95)
  i  SP 0x000000016b052b30 IP 0x0000000105062528 size=112   kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:69)
  i  SP 0x000000016b052b30 IP 0x0000000105062528 size=112   kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
  i  SP 0x000000016b052b30 IP 0x0000000105062528 size=112   kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:47)
  i  SP 0x000000016b052b30 IP 0x0000000105062528 size=112   kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
  A  SP 0x000000016b052b30 IP 0x0000000105062528 size=112   com.jakewharton.mosaic.BlockingKt.runMosaicBlocking(blocking.kt:6)
  A  SP 0x000000016b052ba0 IP 0x0000000104f9f068 size=48    com.example.DemoCommand.run(DemoCommand.kt:24)
...

@fzhinkin

fzhinkin commented Jun 5, 2024

I tried to debug the issue a bit more, and some outcomes so far:

@fzhinkin

fzhinkin commented Jun 5, 2024

The crash is also reproducible with Kotlin 1.9.24, Mosaic 0.12.0, and an older Compose version: 972-w-1924.zip

@fzhinkin

fzhinkin commented Jun 5, 2024

It is also reproducible with GraalEE 17. In all crash occurrences, it seems like we're hitting an NPE:

(lldb) disassemble 
demo`CombinedContext_get_210292045824580ad9d3693d539e87b522b69d40:
    0x10191c210 <+0>:   sub    x8, sp, #0x40
    0x10191c214 <+4>:   ldr    x9, [x28, #0x8]
    0x10191c218 <+8>:   cmp    x8, x9
    0x10191c21c <+12>:  b.ls   0x10191c364               ; <+340>
    0x10191c220 <+16>:  stp    x29, x30, [sp, #-0x10]
    0x10191c224 <+20>:  sub    x29, sp, #0x10
    0x10191c228 <+24>:  mov    sp, x8
    0x10191c22c <+28>:  str    x1, [sp, #0x28]
    0x10191c230 <+32>:  cmp    x1, x27
    0x10191c234 <+36>:  b.eq   0x10191c2e8               ; <+216>
    0x10191c238 <+40>:  b      0x10191c2bc               ; <+172>
    0x10191c23c <+44>:  nop    
    0x10191c240 <+48>:  nop    
    0x10191c244 <+52>:  nop    
    0x10191c248 <+56>:  nop    
    0x10191c24c <+60>:  nop
    0x10191c250 <+64>:  str    x0, [sp, #0x20]
    0x10191c254 <+68>:  mov    w30, w30
    0x10191c258 <+72>:  add    x30, x27, x30, lsl #3
    0x10191c25c <+76>:  ldr    w2, [x30]
    0x10191c260 <+80>:  lsr    w2, w2, #5
    0x10191c264 <+84>:  mov    w2, w2
    0x10191c268 <+88>:  add    x2, x27, x2, lsl #3
->  0x10191c26c <+92>:  ldr    x2, [x2, #0xf0]
    0x10191c270 <+96>:  mov    x0, x30
    0x10191c274 <+100>: mov    x4, x1
    0x10191c278 <+104>: mov    x30, x2
    0x10191c27c <+108>: blr    x30
    0x10191c280 <+112>: nop    
    0x10191c284 <+116>: cmp    x0, x27
    0x10191c288 <+120>: b.ne   0x10191c2cc               ; <+188>
    0x10191c28c <+124>: ldr    x0, [sp, #0x20]
    0x10191c290 <+128>: ldr    w1, [x0, #0x4]
    0x10191c294 <+132>: mov    w30, w1
    0x10191c298 <+136>: add    x30, x27, x30, lsl #3
    0x10191c29c <+140>: cbz    w1, 0x10191c30c           ; <+252>
    0x10191c2a0 <+144>: ldr    w0, [x30]
    0x10191c2a4 <+148>: lsr    w0, w0, #5
    0x10191c2a8 <+152>: mov    w2, #0x965f
    0x10191c2ac <+156>: movk   w2, #0x18, lsl #16
    0x10191c2b0 <+160>: cmp    w0, w2
    0x10191c2b4 <+164>: b.ne   0x10191c30c               ; <+252>
    0x10191c2b8 <+168>: mov    x0, x30
    0x10191c2bc <+172>: ldr    x1, [sp, #0x28]
    0x10191c2c0 <+176>: ldr    w30, [x0, #0x8]
    0x10191c2c4 <+180>: cbnz   w30, 0x10191c250          ; <+64>
    0x10191c2c8 <+184>: b      0x10191c354               ; <+324>
    0x10191c2cc <+188>: ldp    x29, x30, [sp, #0x30]
    0x10191c2d0 <+192>: add    sp, sp, #0x40
    0x10191c2d4 <+196>: ldr    w8, [x28, #0x10]
    0x10191c2d8 <+200>: subs   w8, w8, #0x1
    0x10191c2dc <+204>: str    w8, [x28, #0x10]
    0x10191c2e0 <+208>: b.le   0x10191c368               ; <+344>
    0x10191c2e4 <+212>: ret    
    0x10191c2e8 <+216>: str    x0, [sp, #0x20]
    0x10191c2ec <+220>: mov    w2, #0x4f90
    0x10191c2f0 <+224>: movk   w2, #0x11, lsl #16
    0x10191c2f4 <+228>: add    x2, x27, x2, lsl #3
    0x10191c2f8 <+232>: mov    x0, x2
    0x10191c2fc <+236>: bl     0x101925da0               ; Intrinsics_throwParameterIsNullNPE_d91c2ab47c1f26bdc56e3112fa34c2d996ec1c47
    0x10191c300 <+240>: nop    
    0x10191c304 <+244>: ldr    x0, [sp, #0x20]
    0x10191c308 <+248>: b      0x10191c2bc               ; <+172>
    0x10191c30c <+252>: cbz    w1, 0x10191c35c           ; <+332>
    0x10191c310 <+256>: ldr    w0, [x30]
    0x10191c314 <+260>: lsr    w0, w0, #5
    0x10191c318 <+264>: mov    w0, w0
    0x10191c31c <+268>: add    x0, x27, x0, lsl #3
    0x10191c320 <+272>: ldr    x2, [x0, #0xf0]
    0x10191c324 <+276>: mov    x0, x30
    0x10191c328 <+280>: ldr    x1, [sp, #0x28]
    0x10191c32c <+284>: mov    x30, x2
    0x10191c330 <+288>: blr    x30
    0x10191c334 <+292>: nop    
    0x10191c338 <+296>: ldp    x29, x30, [sp, #0x30]
    0x10191c33c <+300>: add    sp, sp, #0x40
    0x10191c340 <+304>: ldr    w8, [x28, #0x10]
    0x10191c344 <+308>: subs   w8, w8, #0x1
    0x10191c348 <+312>: str    w8, [x28, #0x10]
    0x10191c34c <+316>: b.le   0x10191c368               ; <+344>
    0x10191c350 <+320>: ret    
    0x10191c354 <+324>: bl     0x1006ad180               ; ImplicitExceptions_throwNewNullPointerException_83e8a13d7d211b9efc4f752fd9d24059e9defb61
    0x10191c358 <+328>: nop    
    0x10191c35c <+332>: bl     0x1006ad180               ; ImplicitExceptions_throwNewNullPointerException_83e8a13d7d211b9efc4f752fd9d24059e9defb61
    0x10191c360 <+336>: nop    
    0x10191c364 <+340>: b      0x1005b3b70               ; StackOverflowCheckImpl_throwNewStackOverflowError_31341960d080a71e3dff8d322e20c16c7dc860eb
    0x10191c368 <+344>: b      0x1006c5780               ; aq_enterSlowPathSafepointCheckObject_4ef7cb08b9d10a255d4fb63cbcd79b0bd0e7a19c
    0x10191c36c <+348>: .long  0xcccccccc                ; unknown opcode


(lldb) register read    
    General Purpose Registers:
        x0 = 0x0000000282f315d8
        x1 = 0x00000002815b61e0
        x2 = 0x0000000280000000
        x3 = 0x00000000ffffffd9
        x4 = 0x0000000282f31610
        x5 = 0x0000000282f5e3e0
        x6 = 0x00000000005e6291
        x7 = 0x00000000a0000802
        x8 = 0x000000016fdfe620
        x9 = 0x000000016f60e000
       x10 = 0x0000000000000000
       x11 = 0x0000000282f5e3e8
       x12 = 0x0000000000000001
       x13 = 0x0000000000000000
       x14 = 0x0000000282f5e278
       x15 = 0x0000000280f31b10
       x16 = 0x0000000282f5e3d0
       x17 = 0x00000002814ac170
       x18 = 0x0000000000000000
       x19 = 0x00000000005e7ebb
       x20 = 0x0000000282f3f5d8
       x21 = 0x00000000005e7e46
       x22 = 0x0000000283000000
       x23 = 0x0000000282f3f248
       x24 = 0x0000000000000000
       x25 = 0x0000000000000000
       x26 = 0x0000000282f31270
       x27 = 0x0000000280000000
       x28 = 0x0000000156704b80
        fp = 0x000000016fdfe650
        lr = 0x0000000282f2f458
        sp = 0x000000016fdfe620
        pc = 0x000000010191c26c  demo`CombinedContext_get_210292045824580ad9d3693d539e87b522b69d40 + 92
      cpsr = 0x20001000

At the time of the crash, the memory referenced by x30 contains 0.
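
The address arithmetic around the faulting instruction can be replayed by hand. The sketch below assumes, based only on the listing and register dump above, that `x27` holds the heap base and that the first 32-bit word of an object carries its hub (class) index in the bits above bit 4. `decodeHubAddress` is a hypothetical helper, not a GraalVM API. Under those assumptions, a zero header word decodes to the heap base itself, which would match `x2 == x27 == 0x280000000` in the register dump.

```java
public class HubDecodeDemo {
    // Mirrors the instruction pair from the listing:
    //   lsr w2, w2, #5          (32-bit logical shift right of the header word)
    //   add x2, x27, x2, lsl #3 (heap base plus index scaled by 8)
    static long decodeHubAddress(long heapBase, int headerWord) {
        long hubIndex = Integer.toUnsignedLong(headerWord >>> 5);
        return heapBase + (hubIndex << 3);
    }

    public static void main(String[] args) {
        long heapBase = 0x0000000280000000L; // value of x27 in the register dump
        int headerWord = 0;                  // memory at [x30] reportedly contains 0

        long hub = decodeHubAddress(heapBase, headerWord);
        System.out.println(Long.toHexString(hub));        // prints 280000000
        // Address read by the faulting load: ldr x2, [x2, #0xf0]
        System.out.println(Long.toHexString(hub + 0xf0)); // prints 2800000f0
    }
}
```

If this reading is right, the load at `+92` fetches a method pointer from a hub that decodes to the null address, which is consistent with the object at `x30` having a zeroed header rather than with a null receiver.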

qwwdfsad added invalid and removed bug labels Jul 10, 2024
@qwwdfsad
Collaborator

@sgammon have you succeeded in filing an issue against GraalVM?

From where we stand, this looks more like a Graal issue than a coroutines one.

@sgammon
Author

sgammon commented Jul 10, 2024

@qwwdfsad Yes, there is an issue filed with GraalVM. I'll try to cross tag it.

I would have thought the same thing, but I just recently saw this bug surface with ProGuard too.

It seems this issue can surface through bytecode optimization as well as AOT compilation.

@fzhinkin

@sgammon, oracle/graal#9046 was closed because the assigned engineer is missing a reproducer published on GitHub. Could you please take a look at oracle/graal#9046 (comment)?

@sgammon
Author

sgammon commented Jul 10, 2024

@fzhinkin I didn't get the tag! Yes, I will respond there now, and ping the GraalVM team on their Slack. That is the best place to collaborate with them. Maybe you guys could join there as well, since Kotlin uses Slack.

You can join here. I'll ping them now. Thanks for letting me know
