JRuby 9k: simple script ~80% as fast as jruby 1.7.19 #2761

Closed
PragTob opened this Issue Mar 25, 2015 · 9 comments

Projects

None yet

2 participants

@PragTob
PragTob commented Mar 25, 2015

This might be related to #2544 - this is based on a benchmark @jasonrclark wrote for shoes4 and I abstracted away everything so what we are left with is fairly simple, a value assignment and returning it:

require 'benchmark/ips'

class FakeDimension
  def initialize
    @margin_start = 10
  end

  def full_direct
    value = @margin_start
    value
  end
end

Benchmark.ips do |benchmark|
  dim = FakeDimension.new
  benchmark.report("full_direct")           { dim.full_direct }
  benchmark.compare!
end

And that script is down to 80% of the 1.7.19 performance on 9k:

tobi@tobi-desktop ~/Documents $ ruby -v
jruby 9.0.0.0-SNAPSHOT (2.2.1) 2015-03-25 d507e49 OpenJDK 64-Bit Server VM 24.75-b04 on 1.7.0_75-b13 +jit [linux-amd64]
tobi@tobi-desktop ~/Documents $ ruby simple_bench.rb 
Calculating -------------------------------------
         full_direct   107.974k i/100ms
-------------------------------------------------
         full_direct      9.128M (±10.9%) i/s -     44.593M
tobi@tobi-desktop ~/Documents $ rvm use jruby
Using /home/tobi/.rvm/gems/jruby-1.7.19
tobi@tobi-desktop ~/Documents $ ruby simple_bench.rb 
Calculating -------------------------------------
         full_direct   173.166k i/100ms
-------------------------------------------------
         full_direct     11.525M (±11.5%) i/s -     56.452M
tobi@tobi-desktop ~/Documents $ ruby -v
jruby 1.7.19 (1.9.3p551) 2015-01-29 20786bd on OpenJDK 64-Bit Server VM 1.7.0_75-b13 +jit [linux-amd64]
tobi@tobi-desktop ~/Documents $ java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Rather simple scripts are important to us as we do a ton of small calculations around the dimensions of elements to position them. :)

As always thanks for all your work on JRuby ❤️
Tobi

@headius
Member
headius commented Mar 29, 2015

Looks like it's limited to the JVM6 JIT. The indy JIT is actually much faster than 1.7:

[] ~/projects/jruby $ rvm jruby-1.7.19 do jruby bench.rb 
Calculating -------------------------------------
         full_direct   150.712k i/100ms
-------------------------------------------------
         full_direct     11.843M (±13.5%) i/s -     57.873M

[] ~/projects/jruby $ jruby -Xcompile.invokedynamic=true bench.rb 
Calculating -------------------------------------
         full_direct   130.706k i/100ms
-------------------------------------------------
         full_direct     19.772M (±14.1%) i/s -     95.415M

[] ~/projects/jruby $ jruby bench.rb 
Calculating -------------------------------------
         full_direct    97.529k i/100ms
-------------------------------------------------
         full_direct      9.787M (±13.0%) i/s -     47.692M
@headius
Member
headius commented Mar 30, 2015

I've spent some time playing with this and found a few improvements, but nothing that fixes the benchmark.

First improvement was from discovering that literal fixnum operations (like a + 1) were creating a fixnum object for the literal every time due to using an optimized call path with an unoptimized call site (1d8dc06). This helped a version of the benchmark that used a while loop, but did not help enough to fix the performance difference completely.

[] ~/projects/jruby $ rvm jruby-1.7.19 do jruby bench.rb 
Calculating -------------------------------------
                       183.799k i/100ms
-------------------------------------------------
                         12.722M (±11.5%) i/s -     62.492M

[] ~/projects/jruby $ jruby bench.rb 
Calculating -------------------------------------
                       114.370k i/100ms
-------------------------------------------------
                         10.339M (±13.1%) i/s -     50.323M

When I modify the benchmark to have a 1M while loop inside the ips report section, we see a similar ratio:

[] ~/projects/jruby $ rvm jruby-1.7.19 do jruby bench.rb 
Calculating -------------------------------------
                         3.000  i/100ms
-------------------------------------------------
                         37.930  (± 5.3%) i/s -    189.000 

[] ~/projects/jruby $ jruby bench.rb 
Calculating -------------------------------------
                         2.000  i/100ms
-------------------------------------------------
                         27.144  (± 3.7%) i/s -    136.000

And if I put the 1M loop inside the method being benchmarked, the performance difference goes away:

[] ~/projects/jruby $ rvm jruby-1.7.19 do jruby bench.rb 
Calculating -------------------------------------
                         6.000  i/100ms
-------------------------------------------------
                         72.699  (± 4.1%) i/s -    366.000 

[] ~/projects/jruby $ jruby bench.rb 
Calculating -------------------------------------
                         7.000  i/100ms
-------------------------------------------------
                         77.497  (± 5.2%) i/s -    392.000 

My current theory is that this is a performance problem calling through a MethodHandle for every jitted method. In order to avoid generating invoker stubs for compiled methods, CompilerIRMethod always uses MethodHandle to do dispatch. This simplifies codegen but appears to introduce enough overhead to tilt the benchmark. A final version doing 1M calls to an empty method (returning self, which is basically free) shows how much of a hit:

[] ~/projects/jruby $ rvm jruby-1.7.19 do jruby bench.rb 
Calculating -------------------------------------
             control    41.000  i/100ms
                         7.000  i/100ms
-------------------------------------------------
             control    443.751  (± 3.8%) i/s -      2.214k
                         75.826  (± 5.3%) i/s -    378.000 

[] ~/projects/jruby $ jruby bench.rb 
Calculating -------------------------------------
             control    41.000  i/100ms
                         3.000  i/100ms
-------------------------------------------------
             control    471.151  (± 4.5%) i/s -      2.378k
                         39.963  (± 5.0%) i/s -    201.000

Minus the cost of the control loop (which is obviously quite fast), dispatching to a jitted method via a MethodHandle seems to be around 2x slower than the generated invokers we used before.

So we may need to explore generating a full DynamicMethod wrapper around jitted bodies for the JVM6 mode. This is unfortunate, but having doubled dispatch cost for jitted methods compared to 1.7 is not really acceptable.

Note that none of this affects invokedynamic mode (generally speaking) because indy call sites will bind the handle directly, and the JVM should optimize it properly.

@headius
Member
headius commented Mar 30, 2015

Current ideas for eliminating or improving MethodHandle dispatch in non-indy JIT:

  • Pull the method handle all the way back to generated call site code.

This would allow the handle to optimize in-place, but it would require generating quite a bit more bytecode. CachingCallSite would still be used to cache, but instead of dispatching through it or through DynamicMethod.call, we would dispatch directly to the handle.

  • Generate method bodies into a subclass of CompiledIRMethod so they can dispatch directly.

This would be roughly equivalent to what 1.7 does but perhaps a bit less bytecode because only a single class would be generated. This does not address block bodies, which are normally generated into the same class as the method body, and that class can only be a CompiledIRMethod or a CompiledIRBlockBody.

  • Get the JVM guys to fix it.

This won't solve it in the short term, but I'm going to let the JVM folks responsible for MethodHandle performance have a crack at improving it.

@PragTob
PragTob commented Mar 30, 2015

Fascinating to follow your investigation Charlie - thanks :)

Was invokedynamic disabled for Java 7? Just wondering why my JRuby doesn't seem to use indy by default :)

Cheers,
Tobi

@headius
Member
headius commented Mar 30, 2015

No version of JRuby uses indy by default right now due to longer startup times.

@headius
Member
headius commented Mar 30, 2015

John Rose requested I add assembly to this bug report, so here it is: https://gist.github.com/headius/836ed0d73647fce6da15

I believe I have captured the code for the generated and compiled MethodHandle, the target method, and the actual call to MethodHandle.invokeExact. The invoke appears to inline up to invokeBasic, at which point it does a virtual invocation. The jitted handle also has several call instructions in it and does not inline the target method body. The calls also do not appear to go directly to the target method body, so there's at least one more layer here I'm not seeing.

@PragTob
PragTob commented Jan 8, 2017

For what it's worth, this is still slower on 9.1.6.0, also with JDK 8.

tobi@speedy ~/github/shoes4 $ rvm use jruby-1.7.26
Using /home/tobi/.rvm/gems/jruby-1.7.26
tobi@speedy ~/github/shoes4 $ gem install benchmark-ips
Fetching: benchmark-ips-2.7.2.gem (100%)
Successfully installed benchmark-ips-2.7.2
1 gem installed
tobi@speedy ~/github/shoes4 $ ruby testing/bench.rb 
Warming up --------------------------------------
         full_direct   359.070k i/100ms
Calculating -------------------------------------
         full_direct     22.523M (± 9.5%) i/s -    110.953M
tobi@speedy ~/github/shoes4 $ rvm use 9.1.6.0
Using /home/tobi/.rvm/gems/jruby-9.1.6.0
tobi@speedy ~/github/shoes4 $ gem install benchmark-ips
Successfully installed benchmark-ips-2.7.2
1 gem installed
tobi@speedy ~/github/shoes4 $ ruby testing/bench.rb 
Warming up --------------------------------------
         full_direct   317.512k i/100ms
Calculating -------------------------------------
         full_direct     19.623M (± 8.1%) i/s -     97.159M in   4.998989s
@headius
Member
headius commented Jan 9, 2017

I just tested this with JRuby master (9.1.7.0) on JDK 8 (8u111 or so) and the warmed-up results barely differ from 1.7.25:

~/projects/jruby $ rvm jruby-1.7.25 do ruby bench.rb 
Warming up --------------------------------------
         full_direct   290.127k i/100ms
Calculating -------------------------------------
         full_direct     20.821M (± 6.3%) i/s -    103.575M

~/projects/jruby $ jruby -v bench.rb
jruby 9.1.7.0-SNAPSHOT (2.3.1) 2017-01-06 db207c1 OpenJDK 64-Bit Server VM 25.111-b16 on 1.8.0_111-b16 +jit [linux-x86_64]
Warming up --------------------------------------
         full_direct   239.342k i/100ms
Calculating -------------------------------------
         full_direct     20.394M (± 8.9%) i/s -    100.763M in   4.993509s

~/projects/jruby $ rvm jruby-1.7.25 do ruby bench.rb 
Warming up --------------------------------------
         full_direct   310.178k i/100ms
Calculating -------------------------------------
         full_direct     20.771M (± 6.6%) i/s -    103.289M

~/projects/jruby $ jruby -v bench.rb
jruby 9.1.7.0-SNAPSHOT (2.3.1) 2017-01-06 db207c1 OpenJDK 64-Bit Server VM 25.111-b16 on 1.8.0_111-b16 +jit [linux-x86_64]
Warming up --------------------------------------
         full_direct   237.280k i/100ms
Calculating -------------------------------------
         full_direct     20.291M (± 8.1%) i/s -    100.607M in   5.003640s

JRuby 1.7 appears to warm up a little faster, or perhaps it's just faster in the non-jitted form (which would make sense...IR has remained a bit slower than AST interpreter), but the warmed-up results are very close...perhaps 3-5% faster in 1.7.

Interestingly, the numbers with invokedynamic enabled are significantly faster in JRuby 9.1.7.0 than in 1.7.25:

~/projects/jruby $ jruby -Xcompile.invokedynamic -v bench.rb
jruby 9.1.7.0-SNAPSHOT (2.3.1) 2017-01-06 db207c1 OpenJDK 64-Bit Server VM 25.111-b16 on 1.8.0_111-b16 +indy +jit [linux-x86_64]
Warming up --------------------------------------
         full_direct   271.889k i/100ms
Calculating -------------------------------------
         full_direct     36.233M (± 8.7%) i/s -    179.175M in   4.999300s

~/projects/jruby $ rvm jruby-1.7.25 do jruby -Xcompile.invokedynamic -v bench.rb
jruby 1.7.25 (1.9.3p551) 2016-04-13 867cb81 on OpenJDK 64-Bit Server VM 1.8.0_111-b16 +jit [linux-amd64]
Warming up --------------------------------------
         full_direct   303.354k i/100ms
Calculating -------------------------------------
         full_direct     20.253M (± 9.3%) i/s -    100.107M

I think we can close this. Seems like we've made enough small improvements in the past few months to largely eliminate the gap.

@headius headius closed this Jan 9, 2017
@headius headius added this to the JRuby 9.1.7.0 milestone Jan 9, 2017
@PragTob
PragTob commented Jan 10, 2017

Cool, thanks Charlie! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment