Oat gem slower at steady-state with invokedynamic than without it #4596

Closed
ivoanjo opened this Issue May 10, 2017 · 7 comments

3 participants
@ivoanjo (Contributor) commented May 10, 2017

Environment

Rubies:

  • jruby 9.1.8.0 (2.3.1) 2017-03-06 90fc7ab Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]

Linux: Linux maruchan 4.10.0-20-generic #22-Ubuntu SMP Thu Apr 20 09:22:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Expected Behavior

Performance on the benchmark below consistently regresses when running JRuby with -Xcompile.invokedynamic=true.

I was under the impression that after warmup, running JRuby with invokedynamic should never be slower than without it, but that does not seem to be the case on this benchmark.

require 'benchmark/ips'

require 'oat'
require 'oat/adapters/hal'

class Foo
  attr_reader :foo, :bar, :baz

  def initialize(**values)
    values.each do |(key, value)|
      respond_to?(key) ? instance_variable_set(:"@#{key}", value) : raise
    end
  end
end

class FooSerializer < Oat::Serializer
  adapter Oat::Adapters::HAL

  schema do
    property :foo, item.foo
    property :bar, item.bar
    property :baz, item.baz
  end
end

puts "Running with #{RUBY_DESCRIPTION} (JRUBY_OPTS: #{ENV['JRUBY_OPTS']})"

FOO = Foo.new(foo: 1, bar: 2, baz: 'baz')

Benchmark.ips do |benchmark|
  benchmark.time = 120
  benchmark.warmup = 120

  benchmark.report('FooSerializer') { FooSerializer.new(FOO).to_hash }

  benchmark.compare!
end

Actual Behavior

  • Without invokedynamic: FooSerializer 251.627k (±23.9%) i/s - 28.264M in 120.011879s
  • With invokedynamic: FooSerializer 33.785k (± 6.9%) i/s - 4.032M in 120.030513s

I've also tested with latest master, and while the numbers go up slightly in both cases, the relationship between the two results stays the same.

@enebo added this to the JRuby 9.1.9.0 milestone May 10, 2017

@headius (Member) commented May 10, 2017

Yeah something's definitely wrong here.

@headius (Member) commented May 10, 2017

It looks like something's getting endlessly invalidated, which is very costly. Most places where we use indy should already be failing over to a less-optimized path rather than constantly invalidating, so I'll poke around a bit.

@headius (Member) commented May 10, 2017

Ahh I see. The indy logic in 9k does not take into consideration code that calls method_missing forever, and continues to try to cache it.

@headius (Member) commented May 10, 2017

Ok, elaborating on this a bit more.

In any call site (using indy) we wrap the cached method in a SwitchPoint from the original class. This acts as an "off" switch for the cache, flipped when the original class is modified. We lazily create that SwitchPoint only when requested.

In the case of method_missing we will request that SwitchPoint and then not use it. This is normally not a big deal; a method_missing call against the same class again will get the same SwitchPoint.

However, the oat library is constantly generating new classes. This in itself is a very bad practice, since it also affects other Ruby implementations that cache methods based on the last-seen class. In JRuby, however, it has the additional wrinkle that because it's a new class every time, it's a new SwitchPoint every time. So we go through all the effort of setting up the class and the SwitchPoint only to throw it away and start again on the next iteration.

This is the main source of the performance loss. It's only marginally indy-related, since there are call sites and SwitchPoints and so on involved, but it's more a symptom of class churn in oat.

The flaw in the JRuby logic is that if we keep failing to cache a method, because we're supposed to call through method_missing, we never give up trying to cache methods at that call site.

The simple fix will be to "fail" the call site after some number of method_missing calls, as we do after some number of invalidated, previously-cached calls. This will cause it to use the non-caching path and avoid the overhead.

I'm not sure what a more comprehensive fix would be. The classes entering your benchmark's schema block are always new and always using method_missing. There's not really anything to cache, as a result.
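The failure mode described above can be illustrated with a small, purely hypothetical Ruby sketch (the `Churner` class and `fresh_proxy` method are invented for illustration; this is not JRuby or Oat code):

```ruby
# Hypothetical illustration (not JRuby internals): every iteration builds a
# brand-new anonymous class and dispatches via method_missing, so any
# call-site cache keyed on the receiver's class never sees a repeat hit;
# in JRuby's case, a fresh SwitchPoint is created and discarded each time.
class Churner
  def self.fresh_proxy
    Class.new do
      def method_missing(name, *args)
        "handled #{name}"          # every call takes the slow path
      end

      def respond_to_missing?(name, include_private = false)
        true
      end
    end.new
  end
end

# Three calls produce three distinct classes, so any per-class cache entry
# (or SwitchPoint) is created once and immediately orphaned.
results = 3.times.map { Churner.fresh_proxy.some_property }
```
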

@headius (Member) commented May 10, 2017

Ok, so a quick fix that immediately fails the method_missing-based call site (which, btw, is the "property" calls inside the "schema" block in your benchmark) returns perf to more like what we'd expect:

[] ~/projects/jruby $ jruby -Xcompile.invokedynamic blah.rb
Running with jruby 9.1.9.0-SNAPSHOT (2.3.3) 2017-05-10 941cfe8 Java HotSpot(TM) 64-Bit Server VM 25.112-b16 on 1.8.0_112-b16 +indy +jit [darwin-x86_64] (JRUBY_OPTS: )
Warming up --------------------------------------
       FooSerializer    12.525k i/100ms
Calculating -------------------------------------
       FooSerializer    200.413k (±18.0%) i/s -     19.890M in 120.242456s

[] ~/projects/jruby $ jruby blah.rb
Running with jruby 9.1.9.0-SNAPSHOT (2.3.3) 2017-05-10 941cfe8 Java HotSpot(TM) 64-Bit Server VM 25.112-b16 on 1.8.0_112-b16 +jit [darwin-x86_64] (JRUBY_OPTS: )
Warming up --------------------------------------
       FooSerializer    10.318k i/100ms
Calculating -------------------------------------
       FooSerializer    152.951k (±16.0%) i/s -     16.230M in 120.297981s

I am not familiar with this library. If the pattern you are benchmarking is typical of hot-path runtime execution, then the library author may want to reconsider creating so many singleton objects and doing method_missing dispatch against them, because that's going to be a system-wide perf hit on MRI and locally rather poor on JRuby even with this fix.

I will put a more thoughtful fix into 9.1.9.0 for this.

headius added a commit to headius/jruby that referenced this issue May 10, 2017

Also increment thresholds in indy sites for method_missing.
Fixes #4596.

JRuby currently does not cache anything for method_missing calls
(i.e. calls to methods that do not exist on the target class's
method table). Instead, after discovering the missing method, it
proceeds down a slow lookup path, dispatching to method_missing
with a symbolic method name.

However this logic still attempts to acquire a SwitchPoint from
the target class, since that SP must be acquired before the
initial method lookup to ensure proper invalidation ordering.

Normally, this isn't a problem. If the class is still relatively
static, the SwitchPoint will be created once, and then the cost
of calling method_missing is the same as for non-indy. However,
for code like oat's, which generates new classes constantly, the
target type changes very rapidly. This led to a large number of
unused SwitchPoints being created, one for each class. And since
the method_missing path never cached anything, it never updated
call site thresholds, resulting in call sites that would continue
this logic forever.

Normal calls, if they fail repeatedly, will eventually fail the
entire site and revert to non-indy method dispatch (currently a
simple monomorphic cache).

This patch lifts the threshold-bumping out of the caching logic
so it can also be used by method_missing dispatches. This allows
repeated method_missing calls against different classes to also
fail the site, reverting it to non-indy behavior. With this patch,
the benchmark in #4596 runs faster with indy than without, as
expected.
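The call-site strategy the commit message describes (bump a counter on each miss, then permanently "fail" the site past a threshold) can be sketched as a toy model in Ruby. The names `ToyCallSite` and `FAIL_THRESHOLD` are invented; this is not JRuby's actual Java implementation, just an illustration of the idea:

```ruby
# Toy model of a failing call site: cache per receiver class, count misses
# (invalidations or method_missing lookups), and after FAIL_THRESHOLD misses
# give up and revert to a simple uncached dispatch path forever.
class ToyCallSite
  FAIL_THRESHOLD = 4  # invented number; JRuby's real thresholds differ

  def initialize(name)
    @name = name
    @cached_class = nil
    @cached_method = nil
    @misses = 0
    @failed = false
  end

  def call(receiver, *args)
    # Once failed, the site stays on the simple uncached path.
    return receiver.send(@name, *args) if @failed

    klass = receiver.class
    if @cached_class != klass
      @misses += 1                    # invalidation or method_missing miss
      @failed = true if @misses > FAIL_THRESHOLD
      @cached_class = klass
      @cached_method =
        klass.method_defined?(@name) ? klass.instance_method(@name) : nil
    end

    if @cached_method
      @cached_method.bind(receiver).call(*args)   # cached fast path
    else
      receiver.send(@name, *args)                 # slow fallback
    end
  end
end
```

Dispatch still succeeds either way; the threshold only decides whether the site keeps paying the cache-setup cost on every new class.
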
@headius (Member) commented May 10, 2017

I have pushed #4602 which contains a reasonable fix for this issue.

I will again point out, however, that the pattern being benchmarked will never perform "well" due to the large number of singleton classes being created.

@ivoanjo (Contributor) commented May 11, 2017

Wow @headius, thanks for the lightning-fast answer and awesomely detailed analysis!

I started looking at this as I was benchmarking a test application with invokedynamic enabled, and in that configuration the serialization of responses (this was an HTTP JSON HAL API) was spending a noticeable amount of time in this part of the code.

With invokedynamic off, the impact became rather small compared to other parts of request processing, but as I improve those parts I think I'll get to the point where this will become a lot more noticeable.

I'll report this to the oat developers, and perhaps I'll have some time to pick it up and try to fix it myself.

> I am not familiar with this library. If this pattern you are benchmarking is typical of hot-path runtime execution, then the library author may want to reconsider creating so many singleton objects and doing method_missing dispatch against them, because that's going to be a system-wide perf hit on MRI and locally rather poor on JRuby even with this fix.

Just one question regarding your comment: are the singleton objects (and the constant generation of new classes) a result of the calls to instance_eval? I don't see any other place that may be causing this in this test case.

ivoanjo pushed a commit to Talkdesk/oat that referenced this issue May 17, 2017

Ivo Anjo
Avoid usage of instance_eval by defining schemas as methods
While investigating a performance issue in JRuby when using oat in a
specific JRuby configuration, @headius pointed out that the way oat
was using `instance_eval` to apply blocks would have an impact on
the library performance, both on MRI and on JRuby.

In a nutshell, this happens because while every object behaves as
if it has a singleton class, the object representing that singleton
class is generated lazily, only when really needed. `instance_eval`
is one such trigger: it causes the singleton class to be created,
both on MRI and on JRuby.

This can be confirmed on MRI by using ObjectSpace:

```ruby
>> require 'objspace'
>> ObjectSpace.count_objects[:T_CLASS]
=> 988
>> Object.new.instance_eval { }
=> nil
>> ObjectSpace.count_objects[:T_CLASS]
=> 989
>> Object.new.instance_eval { }
=> nil
>> ObjectSpace.count_objects[:T_CLASS]
=> 990
```

The same happens in JRuby, where a JVM heap visualization tool such
as JVisualVM can observe a new `org.jruby.MetaClass` instance being
created on every call to `instance_eval`.

To avoid this overhead and improve oat's performance, this PR replaces
the `instance_eval` we use to evaluate blocks in the context of a
serializer class with methods that are created from the block.

This means that when a `schema` is registered, we now generate a new
private method from it (e.g. `schema_block_N`) and thus when we're
serializing a given object we can just call this method (or methods,
if several schemas were registered) instead of `instance_eval`.
Because this method is defined on the serializer class, it behaves
exactly as the `instance_eval`, but performance is considerably
increased.
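The transformation described above can be sketched roughly as follows. This is a simplified, hypothetical reconstruction (`SketchSerializer`, `property`, and the exact `schema_block_N` bookkeeping are stand-ins; the real Oat code has more machinery):

```ruby
# Sketch: compile each schema block into an ordinary private instance
# method once, instead of instance_eval-ing the block per serialization
# (which would materialize a singleton class on every call).
class SketchSerializer
  def self.schema(&block)
    @schema_count = (@schema_count || 0) + 1
    # define_method captures the block as a regular method body, so
    # `self` inside it is the serializer instance, exactly as with
    # instance_eval, but with no per-call singleton class allocation.
    define_method(:"schema_block_#{@schema_count}", &block)
    private :"schema_block_#{@schema_count}"
  end

  def to_hash
    @hash = {}
    self.class.private_instance_methods(false).map(&:to_s)
        .grep(/\Aschema_block_/).sort.each { |m| send(m) }
    @hash
  end

  def property(name, value)
    @hash[name] = value
  end
end

class PersonSerializer < SketchSerializer
  schema do
    property :name, "Ada"
  end
end
```

Calling `PersonSerializer.new.to_hash` then simply invokes the generated `schema_block_1` method.
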

Simple benchmark (using `benchmark-ips` gem):

```ruby
require 'benchmark/ips'
require 'oat'
require 'oat/adapters/hal'

class Foo
  attr_reader :foo, :bar, :baz

  def initialize(**values)
    values.each do |(key, value)|
      respond_to?(key) ? instance_variable_set(:"@#{key}", value) : raise
    end
  end
end

class FooSerializer < Oat::Serializer
  adapter Oat::Adapters::HAL

  schema do
    property :foo, item.foo
    property :bar, item.bar
    property :baz, item.baz
  end
end

puts "Running with #{RUBY_DESCRIPTION}"

FOO = Foo.new(foo: 1, bar: 2, baz: 'baz')

Benchmark.ips do |benchmark|
  benchmark.time = 30
  benchmark.warmup = 30
  benchmark.report('FooSerializer') { FooSerializer.new(FOO).to_hash }
  benchmark.compare!
end
```

And results on my machine:

* Before
  * MRI 2.4.1:
    `FooSerializer    264.516k (± 7.9%) i/s -      7.890M in  30.032221s`
  * JRuby 9.1.9.0 + invokedynamic:
    `FooSerializer    300.490k (±20.6%) i/s -      8.652M in  30.080662s`

* After
  * MRI 2.4.1:
    `FooSerializer    326.571k (± 5.8%) i/s -      9.771M in  30.035173s`
  * JRuby 9.1.9.0 + invokedynamic:
    `FooSerializer    622.808k (± 3.8%) i/s -     18.680M in  30.039202s`

That is, this speeds up Oat by 1.23x on MRI and 2.07x on JRuby.

See also jruby/jruby#4596

ivoanjo pushed a commit to Talkdesk/oat that referenced this issue May 17, 2017

Ivo Anjo
Improve performance by avoiding use of method_missing in Serializer
Following up the previous commit, another thing that @headius noted
was that Oat's usage of `method_missing` to delegate from the
serializer to the adapter was also impacting performance of the
serialization process.

This `method_missing` operation is used in the Serializer to dispatch
the methods used inside a schema block (e.g. `property`, `link`, etc)
to the currently selected adapter, which implements them.

To improve the performance of this operation, we can "learn" which
methods the adapter implements on the first `method_missing` call,
and from then on define a method on the serializer class that
directly dispatches to the adapter, avoiding the `method_missing`
call.

Note that this works even if the adapter is changed, as we expect
the changed adapter to still respond to the same interface (and if
it doesn't, we just get a `NoMethodError`).
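The memoization idea can be sketched like this (`SketchDelegator` and its details are invented for illustration; the real Oat code differs):

```ruby
# Sketch: the first time a schema calls e.g. `property`, method_missing
# checks whether the adapter responds to it and, if so, defines a real
# delegating method on the class, so later calls skip method_missing.
class SketchDelegator
  def initialize(adapter)
    @adapter = adapter
  end

  def method_missing(name, *args, &block)
    if @adapter.respond_to?(name)
      # Memoize the dispatch: `name` is captured by the closure, and
      # subsequent calls hit this generated method directly.
      self.class.send(:define_method, name) do |*a, &b|
        @adapter.public_send(name, *a, &b)
      end
      public_send(name, *args, &block)
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    @adapter.respond_to?(name) || super
  end
end
```

After the first call, `SketchDelegator` has a real `property` (or `link`, etc.) method, and every instance dispatches through it without touching `method_missing` again.
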

Using the same benchmark included in the previous commit, here is
the impact of this optimization:

* Before
  * MRI 2.4.1:
    `FooSerializer    326.571k (± 5.8%) i/s -      9.771M in  30.035173s`
  * JRuby 9.1.9.0 + invokedynamic:
    `FooSerializer    622.808k (± 3.8%) i/s -     18.680M in  30.039202s`

* After
  * MRI 2.4.1:
    `FooSerializer    374.500k (± 4.1%) i/s -     11.246M in  30.084775s`
  * JRuby 9.1.9.0 + invokedynamic:
    `FooSerializer      1.058M (± 4.7%) i/s -     31.695M in  30.042851s`

This speeds up Oat by 1.15x on MRI and 1.70x on JRuby. Note that these
numbers are relative to the previous commit; relative to current master
both improvements yield 1.41x on MRI and 3.52x on JRuby.

See also jruby/jruby#4596