Optimize getDefinedMessage by deduping during initialization #4366

Merged

kares merged 1 commit into jruby:master on Dec 7, 2016


@ivoanjo
Contributor
ivoanjo commented Dec 6, 2016

While benchmarking a barebones Rails application I was seeing a lot of
lock contention. When I looked into it, all threads were spending a lot
of time waiting to grab the mutex on the `dedupMap` inside
`Ruby.freezeAndDedupString()`.

This happened because activesupport uses
`defined?(@somevar) && @somevar` a lot while filling the
output template, and thus threads were spending a lot of time in the
`Ruby.getDefinedMessage()` method fighting to dedup the same strings
over and over again.

To fix this, instead of deduping when they are accessed, let's dedup the
defined messages when they are initially populated inside the
`definedMessages` map. This way we get the same effect, but only dedup
once.
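The effect of the change can be sketched in plain Ruby (the actual fix is in JRuby's Java code; `DEFINED_MESSAGES` and `defined_message` below are hypothetical names for illustration). Each message is frozen once while the table is built, so every later lookup is a plain read returning the identical object:

```ruby
# Illustrative sketch: dedup/freeze each defined? message once, at
# initialization time, instead of on every access.
DEFINED_MESSAGES = {
  instance_variable: "instance-variable".freeze,
  method:            "method".freeze,
  expression:        "expression".freeze,
}.freeze

def defined_message(kind)
  DEFINED_MESSAGES[kind] # plain hash read: no per-call freeze/dedup work
end
```

Every call returns the same frozen object, so the stable-object_id property is preserved without touching a shared, lock-protected dedup map on each access.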

This fix has a considerable effect on single-threaded performance:

```ruby
require 'benchmark/ips'

class Test
  def initialize
    @foo = 'foo'
  end

  def defined_test
    defined?(@foo) && @foo
  end
end

t = Test.new

Benchmark.ips do |benchmark|
  benchmark.time = 20
  benchmark.warmup = 20

  benchmark.report("defined") { t.defined_test }

  benchmark.compare!
end
```

with fix:

```
Warming up --------------------------------------
             defined   256.296k i/100ms
Calculating -------------------------------------
             defined     11.992M (± 4.4%) i/s -    239.380M in  20.008996s
```

without fix:

```
Warming up --------------------------------------
             defined   224.042k i/100ms
Calculating -------------------------------------
             defined      7.373M (± 2.5%) i/s -    147.420M in  20.009003s
```

...but it has an even more noticeable effect on multi-threaded performance:

```ruby
def defined_loop(count)
  @foo = true
  count.times do
    defined?(@foo) && @foo
  end
end

defined_loop(2**27)

puts "Warmup done"

start = Time.now

t = (0..20).map do
  Thread.new do
    defined_loop(2**25)
  end
end

t.each(&:join)

puts "Done in #{Time.now - start} seconds"
```

In this case I wasn't able to find a decent Ruby multithreaded benchmark
harness, so pardon the crappiness, but I ran it five times and got
(on a Linux dual-core Skylake i7):

with fix:

```
Done in 12.442 seconds
Done in 12.284 seconds
Done in 12.803 seconds
Done in 13.724 seconds
Done in 13.844 seconds
```

and under visualvm it looks like this:

![Thread state graph with fix](http://i.imgur.com/i3f72H4.png)

without fix:

```
Done in 71.088 seconds
Done in 71.664 seconds
Done in 64.57 seconds
Done in 65.504 seconds
Done in 62.81 seconds
```

and under visualvm it looks like this:

![Thread state graph without fix](http://i.imgur.com/0ulO9zM.png)

As expected, this is a major win in multithreaded scenarios.

@ivoanjo ivoanjo Optimize getDefinedMessage by deduping during initialization
95b5d2e
@kares
Member
kares commented Dec 7, 2016

Excellent Ivo, this is certainly merge material.
Throwing in some discussion about revisiting `dedupMap = Collections.synchronizedMap(new WeakHashMap<RubyString, WeakReference<RubyString>>());`. I believe this should be further improved by not using a locking map implementation, possibly replacing the cache with a concurrent map.

... we would certainly not want to trade locking for less memory usage with frozen string literals. Thoughts?

There seems to be a note on that already around `freezeAndDedupString`:

```
 * Note that this cache does some sync against the Ruby instance. This
 * could cause contention under heavy concurrent load, so a reexamination
 * of this design might be warranted.
```
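The contention being discussed can be reproduced in plain Ruby: a cache whose every read goes through a single Mutex serializes all reader threads, which is roughly what the synchronized `WeakHashMap` does on the Java side. A minimal sketch (the names are illustrative, not JRuby's code):

```ruby
require 'benchmark'

cache = Hash.new { |h, k| h[k] = k.dup.freeze } # stand-in for dedupMap
lock  = Mutex.new

# Every lookup takes the same global lock, so with several threads the
# reads serialize -- the contention described in the note above.
elapsed = Benchmark.realtime do
  threads = 4.times.map do
    Thread.new { 20_000.times { lock.synchronize { cache["message"] } } }
  end
  threads.each(&:join)
end
puts format("locked lookups took %.3fs", elapsed)
```

A concurrent map would let uncontended reads proceed without taking a global lock, at the cost of extra work to keep the weak references collectable.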
@kares kares added this to the JRuby 9.1.7.0 milestone Dec 7, 2016
@ivoanjo
Contributor
ivoanjo commented Dec 7, 2016

Fully agreed @kares.

I had that jotted down on my notebook as something to be explored next, and it wouldn't even be the first time I'd write a concurrent hash map with weak keys (you need some extra mojo to remove the weak references after they get collected).

If you think that's the way to go, I'm willing to take a stab at it :)

@kares
Member
kares commented Dec 7, 2016

@ivoanjo ah right, a `WeakConcurrentMap` - probably the reason why it's not there in the first place.
I believe it makes sense (if it's not too much work on top of the JDK concurrent map) ... at `org.jruby.util`.

The sync was introduced in ivoanjo@e88911e as a bug workaround; you should look at #3171 for details.

@kares kares merged commit 933fae0 into jruby:master Dec 7, 2016

1 check passed: continuous-integration/travis-ci/pr (the Travis CI build passed)
@headius
Member
headius commented Dec 7, 2016

Wow, this is a great find! Could explain slower-than-expected performance other folks have reported. Thank you! Looking forward to other discoveries :-)

@headius headius added a commit that referenced this pull request Dec 7, 2016
@headius headius Further optimize defined? messages. See #4366.
In #4366, @ivoanjo fixed a locking bottleneck by pre-deduplicating
the messages produced for defined?. This patch further improves
things by caching those messages (frozen and pre-determined)
inline at the site of the defined? check, avoiding a round-trip
through context.runtime and the enum map therein. This also will
allow defined? checks to constant fold away, once we introduce
inline caching of defined? results.
56cb52b
@headius
Member
headius commented Dec 7, 2016

I pushed an additional improvement that caches the defined? messages inline. This avoids the round-trip through the runtime and the defined messages map entirely.

We still would like to de-bottleneck the dedup cache, of course, since it's used many other places.
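The inline-cache idea has a rough Ruby analogy: memoize the frozen message at the use site, so repeated checks reuse a direct reference instead of consulting a shared map. This is only an illustration of the caching pattern, not JRuby's actual Java implementation:

```ruby
class DefinedSite
  # Memoize the frozen message on first use, much like the FrozenString
  # cached inline at the defined? site; later calls reuse the reference.
  def message
    @message ||= "instance-variable".freeze
  end
end
```

Repeated calls on the same site return the identical frozen object, so the dedup property still holds while skipping the round-trip through the runtime.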

@headius
Member
headius commented Dec 7, 2016 (edited)

My numbers with the additional inline-caching tweak and invokedynamic enabled:

Before: 150.741s
After: 8.462s

We can further improve these once we improve defined? to cache its lookups (like method calls do).

@nirvdrum
Contributor
nirvdrum commented Dec 8, 2016

Wow. Apologies if that #3171 fix was too blunt. But then it seems like an easy performance win to fix.

@ivoanjo
Contributor
ivoanjo commented Dec 9, 2016

@headius if I may ask a probably-derp question about your approach 😄
It obviously is much better to just have a pointer to the Ruby string containing the message, rather than going the roundabout way of getting it from the `EnumMap`.

When I was cooking the patch, I investigated a bit, and the reason there was a `freezeAndDedup` there in the first place was, according to b3fb6e4, so that "We need to dedup defined message strings so they all have same id", which makes sense to me.

But your approach always seems to create new `FrozenString` instances, so intuitively it would seem to me that this would not hold. But I tried experimenting with it, and it does hold: I still get the same object_id every time with your patch. So... why? I'm scratching my head a bit :)

@ivoanjo
Contributor
ivoanjo commented Dec 18, 2016

@kares I've been trying to see where `freezeAndDedupString` is still being used, in order to devise a benchmark that can then be used to test an alternative approach to the current `WeakHashMap`. It seems to me that after the change in this PR there aren't any performance-critical uses of it anymore, so the added complexity may not be entirely worth it.

Am I missing some use case?

@kares
Member
kares commented Dec 18, 2016

@ivoanjo I thought the map was a cache for eventually all frozen strings (haven't looked into the details).

@ivoanjo
Contributor
ivoanjo commented Dec 19, 2016

@kares I don't think frozen strings are deduped by default, e.g.

```
jruby-9.1.5.0 :001 > x = 97.chr.freeze
 => "a"
jruby-9.1.5.0 :002 > y = 'a'.freeze
 => "a"
jruby-9.1.5.0 :003 > x.object_id
 => 2002
jruby-9.1.5.0 :004 > y.object_id
 => 2004
```

I think they are only deduped when they are read in as literals:

```
jruby-9.1.5.0 :001 > x = 'a'.freeze
 => "a"
jruby-9.1.5.0 :002 > y = 'a'.freeze
 => "a"
jruby-9.1.5.0 :003 > x.object_id
 => 2002
jruby-9.1.5.0 :004 > y.object_id
 => 2002
```

This is why I'm having a bit of a difficulty in coding a benchmark to test this feature.
