Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kotlin coroutines with loops are 18 times slower under the Graal CI. #1330

Open
qwwdfsad opened this issue May 24, 2019 · 9 comments
Assignees

Comments

@qwwdfsad
Copy link

@qwwdfsad qwwdfsad commented May 24, 2019

Reproducing project:

https://github.com/qwwdfsad/coroutines-graal

Overview:

FlowBenchmark is constructed to expose a non-standard pattern which Graal fails to compile.
Flow is a very simplified version of kotlinx.coroutines.Flow, a suspension-based primitive for operating with reactive streams.

Benchmark results

How to run: ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar from the root folder.

Results:

// Java 1.8.0_162-b12
FlowBenchmark.flowBaseline  avgt  7  3.542 ± 0.026  us/op

// Graalvm-ce-19.0.0
FlowBenchmark.flowBaseline  avgt  7  54.129 ± 0.387  us/op

dtraceasm profiler shows that all the time spent in the interpreter, mostly in fast_aputfield (probably it is a coroutine state machine spilling).

Native call-stacks obtained via async-profiler are polluted with InterpreterRuntime::frequency_counter_overflow from the uppermost Java frame (flow.FlowBenchmark$numbers$$inlined$flow$1::collect), that is, by the way, compiled with C1.

Compilation log

Compilation log contains pretty suspicious statements about target method:

294  498  3  flow.FlowBenchmark$numbers$$inlined$flow$1::collect (255 bytes) COMPILE SKIPPED: live_in set of first block not empty (retry at different tier)
337  535  4  flow.FlowBenchmark$numbers$$inlined$flow$1::collect (255 bytes) COMPILE SKIPPED: Non-reducible loop (not retryable)
@qwwdfsad

This comment has been minimized.

Copy link
Author

@qwwdfsad qwwdfsad commented May 24, 2019

@shelajev pointed out that it all comes from the fact that only structurred loops are supported in IR and this will not likely to change in the nearest future.

To understand where this irreducible loop comes from (and why we can't simply fix Kotlin compiler here), one should understand what coroutines are.
Coroutine aka suspend function is a function which execution can be "suspended" (as opposed to blocking) and later resumed using Continuation handle.
suspend functions can be invoked only from other suspend functions or started with a Kotlin compiler intrinsic startCoroutine(block or method reference).

Functions like this:

suspend fun foo(): Int {
  println("a")
  val result: Int = suspendSleep(1) // Can be suspended
  println("b")
  return result
}

we roughly (I'll ignore all exception handling here) translate it to the following form:

fun foo(continuation: Continuation<Int>): Object /* INT | SUSPENDED union */ {
  val me = continuation as? FooContinuation ?: FooContinuation(caller = continuation)
  switch (me.label) { // current state of computation
    0 -> {
      println("a")
      me.label = 1
      val result = suspendSleep(1, me)
      if (result === SUSPENDED) return SUSPENDED // Unroll stack
      else goto 1
    }

    1 -> {
       println("b")
       return me.resultOrException.getOrThrow()
    }
  }
}

additionally, we generate a separate class FooContinuation with a single resume method that is pretty similar to the foo method:

inner class FooContinuation : Continuation<Unit> {
  
  var resultOrException: Result<Unit>

  // May be called only from within `suspendSleep` if it decided to suspend
  override fun resume(result: Result<Unit> /* exception or Unit */ ) {
    assert label == 1;
    resultOrException = result
    foo(this)
  }

}

Don't hesitate to ask me for further explanations of this example because I am slightly biased about coroutines internals simplicity :)

Now back to irreducible loops.
What if suspend fun contains a loop?

suspend fun bar() {
  var sum = 0
  for (i in 1..100) {
    sum += suspendSleep(i)
  }
  
  return sum
}

Applying the same transformation procedure, we end up in a situation when code path can jump directly into loop's body, i is a continuation state that is restored when we jump right in the middle of the loop by the corresponding label value. Unfortunately, changing translation strategy is both hard and implies a serious performance overhead (e.g. because now we have to jump to the loop prologue and guard all statements here with an if).

And this is exactly an irreducible loop that Graal cannot compile :(

@qwwdfsad qwwdfsad changed the title Kotlin coroutines with loops are 10 times slower under the Graal CI. Kotlin coroutines with loops are 18 times slower under the Graal CI. May 24, 2019
@gilles-duboscq

This comment has been minimized.

Copy link
Member

@gilles-duboscq gilles-duboscq commented May 24, 2019

This is the same root cause as #366

As you diagnosed, this is caused by bytecode containing irreducible loops which is not supported in the Graal compiler.
From our side there there is one strategy to support this: just duplicate the portion of the loop between the "extra" entry and the back-edges or loop exits so that we are back to a normal loop entry. We should do that in the bytecode parser.

Hopefully Kotlin only produce cases where there are "few" extra entries (or those extra entries jump towards the end of the loop, close to the back-edges) otherwise it will cause an explosion of the code size for large loops.

@qwwdfsad

This comment has been minimized.

Copy link
Author

@qwwdfsad qwwdfsad commented May 24, 2019

It is very unfortunate to hear as we had plans to build native images of coroutines-based applications as well.

Hopefully Kotlin only produce cases where there are "few" extra entries

It grows linearly with the count of suspension points (calls to another suspend function) in the loop. But where it really matters (in hot loops), there are usually one or two of them. Not sure about native image though.

Feel free to ping me if you need help with evaluation/testing of the potential (?) change on the Graal side, we have a bunch of applications that exploit coroutines.

@gilles-duboscq

This comment has been minimized.

Copy link
Member

@gilles-duboscq gilles-duboscq commented May 27, 2019

Hi @qwwdfsad, i have implemented a simple duplication strategy (not merged yet) so to evaluate it i'd be interested in other workloads that run into this issue.

FYI on 1.8.0_212

# C2
FlowBenchmark.flowBaseline  avgt    7   3.805 ± 0.081  us/op
# Graal
FlowBenchmark.flowBaseline  avgt    7  47.374 ± 4.753  us/op
# Graal with duplication strategy
FlowBenchmark.flowBaseline  avgt    7   0.052 ± 0.001  us/op
@qwwdfsad

This comment has been minimized.

Copy link
Author

@qwwdfsad qwwdfsad commented May 27, 2019

Nice! It is hard for me to extract such workloads into separate self-containing projects, but I can point a couple of our specific benchmarks in different projects (with steps how to configure and run them). Is it okay?
Alternatively (or additionally), I can also try a Graal build with fixes in our projects and see how it's going.

For example:
kotlinx.coroutines, develop branch, benchmarks from flow package. Run with ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar "benchmarks.flow.*"

@gilles-duboscq

This comment has been minimized.

Copy link
Member

@gilles-duboscq gilles-duboscq commented May 27, 2019

I can point a couple of our specific benchmarks in different projects (with steps how to configure and run them). Is it okay?

That would be great

Alternatively (or additionally), I can also try a Graal build with fixes in our projects and see how it's going.

I'll probably put that code behind a flag at first so you can do that but i wanted to do some basic testing on my side first to avoid too many round-trips.

kotlinx.coroutines, develop branch, benchmarks from flow package. Run with ./gradlew --no-daemon cleanJmhJar jmhJar && java -jar benchmarks.jar "benchmarks.flow.*"

I will start with that.

@chintana-zz

This comment has been minimized.

Copy link

@chintana-zz chintana-zz commented Jun 1, 2019

@gilles-duboscq I bumped into the same issue when trying to get GraalVM to do native image generation for Ballerina. I would be more than happy to test your fix. In the mean time you can reproduce this as follows,

$ git clone https://github.com/chintana/ballerina && cd ballerina
$ ./gradlew build -x test -x check -x :composer-library:npmBuild
$ cd distribution/zip/jballerina-tools/build/distributions
$ unzip jballerina-tools-0.992.0-m2-SNAPSHOT.zip
$ cat >file.bal
import ballerina/io;
public function main() {
	int i = 0;
	int j = 0;
	while (i < 20) {
		i+=1;
		j+=1;
	}
	io:println(j);
}
$ ./bin/jballerina build file.bal
Compiling source
    file.bal

Generating executable
    file.jar
$ # set GRAALVM_HOME
$ ./bin/jballerina native-img file.jar
[file:98676]    classlist:  14,329.22 ms
[file:98676]        (cap):   2,224.56 ms
[file:98676]        setup:   4,481.97 ms
[file:98676]     analysis:   8,488.17 ms
Warning: Abort stand-alone image build. Non-reducible loop
Detailed message:
Call path from entry point to file.main(Strand):
	at file.main(file.bal)
	at ___init.$lambda$main$(.)
	at ___init$$Lambda$364/1803931637.accept(Unknown Source)
	at org.ballerinalang.jvm.SchedulerItem.execute(Scheduler.java:401)
	at org.ballerinalang.jvm.Scheduler.run(Scheduler.java:194)
	at org.ballerinalang.jvm.Scheduler.runSafely(Scheduler.java:166)
	at org.ballerinalang.jvm.Scheduler.start(Scheduler.java:158)
	at ___init.main(.)
	at com.oracle.svm.core.JavaMainWrapper.run(JavaMainWrapper.java:153)
	at com.oracle.svm.core.code.IsolateEnterStub.JavaMainWrapper_run_5087f5482cc9a6abc971913ece43acb471d2631b(generated:0)

Warning: Use -H:+ReportExceptionStackTraces to print stacktrace of underlying exception
Build on Server(pid: 98720, port: 53135)*
[file:98720]    classlist:   2,068.29 ms
[file:98720]        (cap):   1,633.39 ms
[file:98720]        setup:   3,670.54 ms
[file:98720]   (typeflow):   2,593.28 ms
[file:98720]    (objects):   1,193.84 ms
[file:98720]   (features):     380.28 ms
[file:98720]     analysis:   4,270.38 ms
[file:98720]     universe:     256.61 ms
[file:98720]      (parse):     455.73 ms
[file:98720]     (inline):   1,759.22 ms
[file:98720]    (compile):   8,062.22 ms
[file:98720]      compile:  10,790.02 ms
[file:98720]        image:     819.42 ms
[file:98720]        write:     388.00 ms
[file:98720]      [total]:  22,451.48 ms
Warning: Image 'file' is a fallback image that requires a JDK for execution (use --no-fallback to suppress fallback image generation).

@qwwdfsad

This comment has been minimized.

Copy link
Author

@qwwdfsad qwwdfsad commented Aug 7, 2019

Hi, could you please elaborate on the status of this fix? Do you need any additional help from me, e.g. new benchmarks or test suites?

@gilles-duboscq

This comment has been minimized.

Copy link
Member

@gilles-duboscq gilles-duboscq commented Aug 7, 2019

I had a first version but i noticed some issues while adding more tests. I have ideas about how to fix it and still plan to do it but i have no ETA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants
You can’t perform that action at this time.