
Performance drop off of ~18% in newer JDK vs. JDK 8 #5789

Open
PragTob opened this issue Jul 14, 2019 · 6 comments

Comments

@PragTob

commented Jul 14, 2019

Mentioned this in passing to @headius thinking it was known but he urged me to share so here I am :)

I was doing my "semi-regular" benchmark of rubykon (an MCTS Go-bot implemented in Ruby, not very good mind you ;) ). The benchmark I run here does a full Monte Carlo Tree Search with 1000 playouts, results are reported in iterations per minute (the higher the better).

benchmarking script with plenty of warmup
script I use to run multiple rubies
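For readers skimming: the measurement boils down to timing full MCTS runs and converting the average run time to iterations per minute. A simplified stand-in (the method name and the placeholder workload are made up for illustration; the real logic lives in the scripts linked above):

```ruby
# Simplified sketch of an iterations-per-minute benchmark.
# The block passed in stands in for the real MCTS playouts.
def benchmark_ipm(warmup: 2, runs: 5)
  warmup.times { yield }            # let the JIT warm up before measuring

  times = runs.times.map do
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  end

  avg = times.sum / times.size
  60.0 / avg                        # iterations per minute
end

ipm = benchmark_ipm { 10_000.times { Object.new } } # placeholder workload
puts format("%.2f i/min", ipm)
```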

In my reports JRuby was slower than in the previous editions, and the only thing that really changed was the JVM (well, that and the Spectre and Meltdown mitigations).

As I'm currently looking for jobs, the blog post has been put on the back burner, but here's a preliminary share of the JVM x JRuby results:

| VM | JRuby 9.1.17.0 | JRuby 9.1.17.0 ID | JRuby 9.2.7.0 | JRuby 9.2.7.0 ID |
| --- | --- | --- | --- | --- |
| adoptopenjdk-8.212 | 6.46 | 14.49 | 6.41 | 14.56 |
| openjdk-12.0.1 | 5.11 | 10.13 | 5.23 | 12.19 |
| oracle-8.141 | 6.66 | 13.72 | 6.17 | 14.69 |
| corretto-8.212 | 6.63 | 14.25 | 6.24 | 14.74 |
| corretto-11.0.3 | 5.18 | 10.17 | 5.39 | 12.07 |

(remember iterations per minute, higher is better)

ID denotes running JRuby with `--server -Xcompile.invokedynamic=true -J-Xmx1500m`

The notable thing here, focusing on JRuby 9.2 ID, is that the newer JDKs run at ~12 i/min while the older ones ran at ~14.7, which is roughly an 18% drop in performance. It's much worse for JRuby 9.1.
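The headline number can be reproduced from the table: averaging the 9.2 ID columns gives a drop in the 17-18% range (a quick plain-Ruby check; the grouping of JDKs into "old" and "new" is mine):

```ruby
# JRuby 9.2 ID numbers copied from the table above (i/min, higher is better)
jdk8_results  = [14.56, 14.69, 14.74]  # adoptopenjdk-8.212, oracle-8.141, corretto-8.212
newer_results = [12.19, 12.07]         # openjdk-12.0.1, corretto-11.0.3

avg  = ->(xs) { xs.sum / xs.size }
drop = (avg.(jdk8_results) - avg.(newer_results)) / avg.(jdk8_results)
puts format("%.1f%% drop", drop * 100)  # → "17.3% drop"
```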

You can see the full run here (it includes standard deviations, which were usually small, the `jruby -v` output, and all individual run times to see JIT behavior)

Warnings

Running on the newer JDKs I get these warnings once at the start:

```
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.util.SecurityHelper to field java.lang.reflect.Field.modifiers
WARNING: Please consider reporting this to the maintainers of org.jruby.util.SecurityHelper
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Running your benchmark...
```

(I think the warnings are the same across the newer JDKs, but I didn't double-check)

System

Operating System: Linux (5.0.0-16, Linux Mint 19.1/Ubuntu 18.04)
CPU Information: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Number of Available Cores: 8
Available Memory: 15.61 GB

Thanks

As always, thank you all for your fantastic work 💚

If there's any additional information you'd need please ping me, although I hope I covered everything.

@headius

Member

commented Jul 15, 2019

Thanks for the report @PragTob!

Reproduced on a build of JDK 12 from a few months ago. With Java 8 I got a final result of 10.56 ips, and with JDK 12 I got 9.24 ips. A similar drop-off to your results.

Interestingly I got the same results with JDK-11, even though I've not seen any other benchmarks degrade going from 8 to 11. There may be something specific about rubykon that's optimizing poorly.

@headius

Member

commented Jul 15, 2019

FWIW, GraalVM EE 19.1 has similar results to Java 8...so similar I suspect this has less to do with the JIT than with other libraries and features of JDK 11+. GraalVM + JRuby achieved an average of 10.39ips.

@headius

Member

commented Jul 15, 2019

Good news! After analyzing a run using async-profiler, I noticed that a large chunk of the resulting flamegraph was spent inside the G1 garbage collector, which became the default starting with Java 9. G1 handles large heaps better and is somewhat more compact in memory, but it does not have the throughput of the parallel collector, so on that hunch I added a flag to switch GCs: `-J-XX:+UseParallelGC`.

Boom! The new number for Java 12 is 12.25ips, which appears to be the best JRuby result out of all configurations so far, a solid 13% faster than the Java 8 result I posted above.

I would still like to know why G1 has such a visible degradation. We are obviously very allocation-heavy.
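For anyone wanting to try this, the workaround presumably amounts to adding that one flag to the ID invocation from the original report (the benchmark script path here follows the commands shown in a later comment; adjust as needed):

```shell
# ID flags from the report above, plus the parallel collector
jruby --server -Xcompile.invokedynamic=true -J-Xmx1500m \
      -J-XX:+UseParallelGC benchmark/mcts_avg.rb
```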

@PragTob

Author

commented Jul 15, 2019

Oh wow, I was hoping for such an insight. Thanks Charlie! When things cool down over here I'll rerun the benchmark on my machine with those parameters :) Or maybe another separate run with them... although at some point I might need to run the benchmarks over night :)


@shipilev


commented Jul 16, 2019

Tested with a recent JDK 13 EA build and multiple collectors. Judging from GC logs, it is a heavily allocating but almost entirely young-generation workload. Both Parallel and G1 run very short young GCs during the run, taking about 1% of total time, which means allocation pressure itself is not the issue here.

Per-collector experiments:

```
$ jruby --server -Xcompile.invokedynamic=true -J-Xms1500m -J-Xmx1500m \
    -J-XX:+UnlockExperimentalVMOptions -J-XX:+UseParallelGC \
    benchmark/mcts_avg.rb
19x19 1_000 iterations        14.24 i/min  4.21 s (avg)  (± 0.59%)

$ jruby --server -Xcompile.invokedynamic=true -J-Xms1500m -J-Xmx1500m \
    -J-XX:+UnlockExperimentalVMOptions -J-XX:+UseG1GC \
    benchmark/mcts_avg.rb
19x19 1_000 iterations        10.38 i/min  5.78 s (avg)  (± 0.47%)

$ jruby --server -Xcompile.invokedynamic=true -J-Xms1500m -J-Xmx1500m \
    -J-XX:+UnlockExperimentalVMOptions -J-XX:+UseShenandoahGC \
    benchmark/mcts_avg.rb
19x19 1_000 iterations        11.34 i/min  5.29 s (avg)  (± 0.49%)

$ jruby --server -Xcompile.invokedynamic=true -J-Xms1500m -J-Xmx1500m \
    -J-XX:+UnlockExperimentalVMOptions -J-XX:+UnlockDiagnosticVMOptions \
    -J-XX:+UseShenandoahGC -J-XX:ShenandoahGCHeuristics=passive \
    benchmark/mcts_avg.rb
19x19 1_000 iterations        14.12 i/min  4.25 s (avg)  (± 0.55%)

$ jruby --server -Xcompile.invokedynamic=true -J-Xms1500m -J-Xmx1500m \
    -J-XX:+UnlockExperimentalVMOptions -J-XX:+UnlockDiagnosticVMOptions \
    -J-XX:+UseShenandoahGC -J-XX:ShenandoahGCHeuristics=passive \
    -J-XX:+ShenandoahSATBBarrier \
    benchmark/mcts_avg.rb
19x19 1_000 iterations        13.15 i/min  4.56 s (avg)  (± 0.50%)

$ jruby --server -Xcompile.invokedynamic=true -J-Xms1500m -J-Xmx1500m \
    -J-XX:+UnlockExperimentalVMOptions -J-XX:+UnlockDiagnosticVMOptions \
    -J-XX:+UseShenandoahGC -J-XX:ShenandoahGCHeuristics=passive \
    -J-XX:+ShenandoahSATBBarrier -J-XX:+ShenandoahLoadRefBarrier \
    benchmark/mcts_avg.rb
19x19 1_000 iterations        11.06 i/min  5.42 s (avg)  (± 0.52%)

$ jruby --server -Xcompile.invokedynamic=true -J-Xms1500m -J-Xmx1500m \
    -J-XX:+UnlockExperimentalVMOptions -J-XX:+UnlockDiagnosticVMOptions \
    -J-XX:+UseZGC \
    benchmark/mcts_avg.rb
19x19 1_000 iterations        11.15 i/min  5.38 s (avg)  (± 0.68%)
```

I do believe this is caused by GC barriers: any GC that does something non-trivial barrier-wise runs slower. Shenandoah in "passive" mode (all barriers disabled) runs the same as Parallel, and gets progressively worse as we enable the two major barriers: SATB (as in G1) and the load-reference barrier (LRB). Alas, this is the price we pay for better GC latency.
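For readers unfamiliar with barriers: conceptually, an SATB (snapshot-at-the-beginning) pre-write barrier records the old value of a reference before it is overwritten, so the concurrent marker never loses sight of an object that was reachable when marking started. A toy sketch of the idea in plain Ruby (purely illustrative, not how the JVM implements it):

```ruby
# Toy model of an SATB pre-write barrier. Real collectors emit the
# equivalent as a few machine instructions on *every* reference store --
# that per-store check is where the throughput cost comes from.
class SATBHeap
  attr_reader :satb_queue

  def initialize
    @satb_queue = []   # overwritten refs the concurrent marker must revisit
    @marking    = false
  end

  def start_marking
    @marking = true
  end

  def write_ref(object, field, new_value)
    old = object[field]
    @satb_queue << old if @marking && old   # the pre-write barrier
    object[field] = new_value               # the actual store
  end
end

heap = SATBHeap.new
node = { next: "old_target" }
heap.start_marking
heap.write_ref(node, :next, "new_target")
heap.satb_queue  # => ["old_target"] -- the marker still sees the old object
```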

@PragTob

Author

commented Jul 16, 2019

Hey, fascinating stuff! A bit OT, but GCs are super interesting. Is this a good source to learn about barriers and why/how they affect performance: https://shipilev.net/jvm/anatomy-quarks/13-intergenerational-barriers/ ?

Other reading suggestions welcome, thanks!
Tobias
