Skip to content

i10416/bench

Repository files navigation

About

slinc vs jni benchmark

benchmark for simple operations

Qsort benchmark

Environment

  • Scala 3.2.2
  • JVM: OpenJDK Runtime Environment Zulu19.30+11-CA (build 19.0.1+10)
  • slinc: 0.1.1-110-7863cb
  • Apple clang version 13.1.6 (clang-1316.0.21.2.5)
  • slinc 0.1.1-110-7863cb

In general, it is clear that upcall JVM method from native is quite slow. The following change makes a program extremely slower(check SimpleNativeCallBenchmarks. jniNativeQSort and SimpleNativeCallBenchmarks.jniQSort).

(JNIEnv *jenv, jobject jobj, jintArray jarr){
  jint *arr = (*jenv)->GetIntArrayElements(jenv,jarr, 0);
  jsize len = (*jenv)->GetArrayLength(jenv,jarr);
-  qsort(arr,len,sizeof(int),compare_int);
+  qsort(arr,len,sizeof(int),upcall_compare_int);
  (*jenv)->ReleaseIntArrayElements(jenv,jarr,arr, 0);
  return;
};

There's only a small difference in performance between array copy back and forth and array copy without copy back. See SimpleNativeCallBenchmarks.slincQSortWithCopyBack and SimpleNativeCallBenchmarks.slincQSortWithoutCopyBack.1

As is mentioned in the comment, I confirmed that SimpleNativeCallBenchmarks.slincQsortAllocCallbackForEachIteration is much slower than slincQSortWithCopyBack and slincQSortWithoutCopyBack. Allocating upcall seems costly operation.

Having cloned your bench and having the callback allocated once (rather than per benchmark iteration), I see a improvement in performance of Slinc's upcall code to just 2x slower than JNI, rather than 5x slower

scala-interop/slinc#81 (comment)

SlinC one is around 2 times slower than JNI one when allocating callback in advance and 5x~ times slower when allocating callback for each iteration.

SlinC one is around 5 times slower than JNI one probably because SlincQsort transfers array between JVM and native, but I suspect there are other reasons why SlinC ones are slow because there is not large difference between slincQSortWithCopyBack and slincQSortWithoutCopyBack, which implies data transfer is not the bottleneck

Benchmark NOTE Mode Cnt Score Error Units
SimpleNativeCallBenchmarks.slincQSortJVM avgt 5 1774.509 ± 4.972 ns/op
SimpleNativeCallBenchmarks.jniNativeQSort Using native comparator. No upcall avgt 5 4272.838 ± 50.298 ns/op
SimpleNativeCallBenchmarks.jniQSort Using upcall. destructively mutate original array avgt 5 299570.811 ± 4542.836 ns/op
SimpleNativeCallBenchmarks.slincQSortWithCopyBack Using global shared upcall. Copy array back and forth. avgt 5 618014.439 ± 8280.107 ns/op
SimpleNativeCallBenchmarks.slincQSortWithoutCopyBack Using upcall. Copy and transfer array but not copy back. avgt 5 625336.580 ± 10471.754 ns/op
SimpleNativeCallBenchmarks.slincQsortAllocCallbackForEachIteration Allocating upcall for each iteration. avgt 5 1700443.210 ± 650331.220 ns/op

Feedback from SlinC author(@markehammons)

Having cloned your bench and having the callback allocated once (rather than per benchmark iteration), I see a improvement in performance of Slinc's upcall code to just 2x slower than JNI, rather than 5x slower. I think there may be more performance improvements to be found, but first I should make us able to generate an upcall from a method rather than a lambda and see what the performance from that looks like.

benchmark for more complex routine

For the following routine,

  1. copy string (jvm to native)
  2. invoke native call (jvm to native)
  3. dereferencing pointer (native to jvm)
  4. copy object (jvm to native)
  5. copy object (native to jvm)

with the following environment,

  • Scala 3.2.2
  • JVM: JDK 17.0.3, OpenJDK 64-Bit Server VM, 17.0.3+7-LTS
  • slinc: 0.1.1-110-7863cb
  • Apple clang version 13.1.6 (clang-1316.0.21.2.5)
  • slinc 0.10-110-7863cb

slinc one takes around 3x~ longer than jni.

Result

Benchmark Mode Cnt Score Error Units
NativeBenchmarks.jni avgt 5 5064.292 ± 593.829 ns/op
NativeBenchmarks.slinc avgt 5 16882.792 ± 1172.054 ns/op

However, just updating JDK to Zulu19.30+11-CA, slinc one gets nearly as fast as jni one.

JVM: OpenJDK Runtime Environment Zulu19.30+11-CA (build 19.0.1+10)

Benchmark Mode Cnt Score Error Units
NativeBenchmarks.jni avgt 5 4872.056 ± 57.582 ns/op
NativeBenchmarks.slinc avgt 5 5607.126 ± 115.210 ns/op

Caveat

JNI or FFI is not always the fastest solution as modern JVM is quote performant and communication between native and jvm is not free.

For example, see benchmark for quicksort and you can find SimpleNativeCallBenchmarks.slincQSortJVM is the fastest.

You should take overhead into consideration. Find a significant bottleneck and carefully measure the performance benefits before resorting to FFI.

Footnotes

  1. This is not the case with Slinc 0.3.0. It takes around 1.3x longer for with copyback than without copyback. Note that the author says it is the expected result as 0.3.0 focuses on refactoring rather than performance optimization.

About

No description or website provided.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published