JRuby 9.0.1.0 has slower $global_var usage #3350

Closed
itspomf opened this Issue Sep 25, 2015 · 8 comments

itspomf commented Sep 25, 2015

Using a (very simple) benchmark to test how JRuby handles variables of differing scope, I found that, unlike MRI 2.2.2, JRuby 9.0.1.0 performs far slower with global variables than with the other variable types/scopes.

The test code is as follows:

require 'benchmark'

LOOPS = 30000000

$global = 0
@outer_inst = 0
outer_scope = 0

Benchmark.bmbm {|test|
    @inner_inst = 0
    inner_scope = 0

    test.report('$global'){ LOOPS.times { $global += 1 } }
    test.report('@outer_inst'){ LOOPS.times { @outer_inst += 1 } }
    test.report('outer_scope'){ LOOPS.times { outer_scope += 1 } }
    test.report('@inner_inst'){ LOOPS.times { @inner_inst += 1 } }
    test.report('inner_scope'){ LOOPS.times { inner_scope += 1 } }
    test.report('local_scope'){ local_scope = 0; LOOPS.times { local_scope += 1 } }
}

The results for JRuby:
jruby 9.0.1.0 (2.2.2) 2015-09-02 583f336 Java HotSpot(TM) 64-Bit Server VM 25.51-b03 on 1.8.0_51-b16 +jit [Windows 7-amd64]

Rehearsal -----------------------------------------------
$global       6.167000   0.000000   6.167000 (  6.167375)
@outer_inst   2.830000   0.000000   2.830000 (  2.829050)
outer_scope   2.777000   0.000000   2.777000 (  2.776937)
@inner_inst   3.118000   0.000000   3.118000 (  3.117344)
inner_scope   2.631000   0.000000   2.631000 (  2.630262)
local_scope   2.297000   0.000000   2.297000 (  2.297344)
------------------------------------- total: 19.820000sec

                  user     system      total        real
$global       6.164000   0.000000   6.164000 (  6.162756)
@outer_inst   3.078000   0.000000   3.078000 (  3.077359)
outer_scope   2.360000   0.000000   2.360000 (  2.359673)
@inner_inst   2.614000   0.000000   2.614000 (  2.614211)
inner_scope   2.384000   0.000000   2.384000 (  2.384571)
local_scope   2.462000   0.000000   2.462000 (  2.461775)

And MRI:
ruby 2.2.2p95 (2015-04-13 revision 50295) [x64-mingw32]

Rehearsal -----------------------------------------------
$global       5.086000   0.000000   5.086000 (  5.099396)
@outer_inst   5.272000   0.000000   5.272000 (  5.281465)
outer_scope   4.821000   0.016000   4.837000 (  4.832652)
@inner_inst   4.305000   0.000000   4.305000 (  4.335950)
inner_scope   4.727000   0.000000   4.727000 (  4.745273)
local_scope   5.101000   0.000000   5.101000 (  5.102266)
------------------------------------- total: 29.328000sec

                  user     system      total        real
$global       4.759000   0.000000   4.759000 (  4.764358)
@outer_inst   5.335000   0.000000   5.335000 (  5.348511)
outer_scope   5.179000   0.000000   5.179000 (  5.180154)
@inner_inst   5.429000   0.000000   5.429000 (  5.503256)
inner_scope   4.929000   0.000000   4.929000 (  4.937495)
local_scope   4.556000   0.015000   4.571000 (  4.567554)

It seems that while MRI is fairly uniform across the board, JRuby only shows broadly similar times for local, inner, and instance-variable scopes. In particular, JRuby's time for globals is ~226% of the average of all the other tested scopes.

The good news, however, is that JRuby's JIT still delivers faster overall performance.
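Since the gap is specific to globals, one hedged application-level workaround (nothing here relies on JRuby internals; `bump_global` is a name invented for illustration) is to keep the hot loop on a local variable and touch the global only at the boundaries:

```ruby
# Workaround sketch: read the global once, loop on a local, write back once.
# The loop then pays local-variable cost per iteration instead of the much
# larger per-iteration global cost measured above.
$global = 0

def bump_global(n)
  total = $global          # one global read
  n.times { total += 1 }   # hot loop stays on a local variable
  $global = total          # one global write
end

bump_global(1_000)
```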

headius Sep 25, 2015

Member

The perf impact here seems to be mostly the writes; I split your benchmark into reads and writes and global var reads are about the same perf as all the others (and probably lost in the noise of the loop):

9k:

                        user     system      total        real
$global read        1.170000   0.010000   1.180000 (  1.198987)
@outer_inst read    1.020000   0.010000   1.030000 (  0.957400)
outer_scope read    1.150000   0.010000   1.160000 (  1.186650)
@inner_inst read    0.970000   0.000000   0.970000 (  0.968351)
inner_scope read    1.080000   0.010000   1.090000 (  1.055287)
local_scope read    0.980000   0.000000   0.980000 (  0.977480)
$global write       2.450000   0.030000   2.480000 (  2.404311)
@outer_inst write   1.170000   0.010000   1.180000 (  1.178349)
outer_scope write   1.180000   0.000000   1.180000 (  1.188390)
@inner_inst write   1.160000   0.000000   1.160000 (  1.163344)
inner_scope write   1.040000   0.000000   1.040000 (  1.044345)
local_scope write   1.050000   0.000000   1.050000 (  1.066801)

1.7.21:

                        user     system      total        real
$global read        1.190000   0.000000   1.190000 (  1.197000)
@outer_inst read    1.040000   0.000000   1.040000 (  1.043000)
outer_scope read    0.960000   0.000000   0.960000 (  0.967000)
@inner_inst read    1.050000   0.010000   1.060000 (  1.059000)
inner_scope read    0.990000   0.010000   1.000000 (  1.003000)
local_scope read    0.960000   0.000000   0.960000 (  0.968000)
$global write       2.380000   0.030000   2.410000 (  2.348000)
@outer_inst write   1.260000   0.010000   1.270000 (  1.259000)
outer_scope write   0.980000   0.000000   0.980000 (  0.985000)
@inner_inst write   1.300000   0.010000   1.310000 (  1.303000)
inner_scope write   1.000000   0.000000   1.000000 (  1.001000)
local_scope write   0.950000   0.000000   0.950000 (  0.954000)

Note, however, that 1.7 also has some invokedynamic optimizations to make reads faster. I suspect that's lost in the loop noise too, but here are those numbers:

1.7.21 with indy, first iteration

$global read        0.900000   0.010000   0.910000 (  0.654000)
@outer_inst read    1.150000   0.010000   1.160000 (  1.022000)
outer_scope read    0.950000   0.010000   0.960000 (  0.943000)
@inner_inst read    1.060000   0.010000   1.070000 (  1.029000)
inner_scope read    0.950000   0.000000   0.950000 (  0.942000)
local_scope read    0.860000   0.010000   0.870000 (  0.853000)

However, globals that are written repeatedly destroy this optimization (so we don't constantly try to keep rebinding it) and the second iteration goes back to the same perf:

1.7.21 with indy, second iteration

$global read        1.070000   0.010000   1.080000 (  1.017000)
@outer_inst read    1.000000   0.000000   1.000000 (  1.006000)
outer_scope read    0.970000   0.000000   0.970000 (  0.972000)
@inner_inst read    0.970000   0.000000   0.970000 (  0.965000)
inner_scope read    0.930000   0.010000   0.940000 (  0.933000)
local_scope read    0.900000   0.000000   0.900000 (  0.900000)

Ultimately 9k does not perform much differently than 1.7 (other than the missing indy optimization) but I think it's worth taking a few minutes to improve this.


itspomf Sep 25, 2015

This is refreshing to hear. I had been worried when I first ran this test; glad to see it's only the writes that are slow.

Not sure why I didn't think to check that. Thank you!


headius added a commit that referenced this issue Sep 25, 2015

Wire up invokedynamic caching for global variable reads.
This improves global read perf for #3350, but global variable writes still seem to be slower than MRI. I'm not exactly sure why, but it probably has to do with doing a full lookup for our too-abstract wrapper around these values.
headius Oct 2, 2015

Member

With f2612a2, 9k should have the same perf for globals as 1.7.x. Why our globals are slower to write than MRI's...that's still an open question.


@kares kares added the performance label Jan 18, 2017

@kares kares added this to the JRuby 9.1.8.0 milestone Jan 18, 2017

headius Mar 3, 2017

Member

Note to self: include modified benchmarks in the original report.

I suspect the cost of writing is that even after we've given up on caching the global value, we continue to invalidate a SwitchPoint for the indy logic in the global variable. This is certainly not free, because invalidating it means we also construct a new one. I think the right way to fix this would be to also mark the variable as "no longer cacheable" which in turn means "don't bother invalidating".

I'll try to fix this tomorrow.


headius Mar 3, 2017

Member

Split up read/write benchmark. Note that the reads of some of these types of variables will be optimized completely away by either our IR or by the JVM, so they're really just measuring the block dispatch overhead.

require 'benchmark'

LOOPS = 30000000

$global = 0
@outer_inst = 0
outer_scope = 0

loop { Benchmark.bmbm {|test|
    @inner_inst = 0
    inner_scope = 0

    test.report('$global w'){ LOOPS.times { $global = 1 } }
    test.report('@outer_inst w'){ LOOPS.times { @outer_inst = 1 } }
    test.report('outer_scope w'){ LOOPS.times { outer_scope = 1 } }
    test.report('@inner_inst w'){ LOOPS.times { @inner_inst = 1 } }
    test.report('inner_scope w'){ LOOPS.times { inner_scope = 1 } }
    test.report('local_scope w'){ local_scope = 0; LOOPS.times { local_scope = 1 } }
    test.report('$global r'){ LOOPS.times { $global } }
    test.report('@outer_inst r'){ LOOPS.times { @outer_inst } }
    test.report('outer_scope r'){ LOOPS.times { outer_scope } }
    test.report('@inner_inst r'){ LOOPS.times { @inner_inst } }
    test.report('inner_scope r'){ LOOPS.times { inner_scope } }
    test.report('local_scope r'){ local_scope = 0; LOOPS.times { local_scope } }
} }
headius Mar 3, 2017

Member

I was wrong; the overhead is not from constantly re-invalidating -- we have a failover point at which we just leave it permanently invalid -- but from doing a slow hash lookup of the variable every time. Since this affects all globals, including some that are expected to be updated if you're using them at all, I'll make appropriate modifications to avoid this lookup.


headius Mar 3, 2017

Member

Calling this one done. Global variable reads cache their value unless invalidated too many times (set by -Xinvokedynamic.global.maxfail=100) and cache the wrapper object after that to avoid hash lookup overhead. Global variable sets cache the wrapper object. That's about as cheap as I can make these right now.
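The strategy described above can be modeled in plain Ruby (the real implementation is Java inside JRuby's invokedynamic bindings; the class and method names below are invented for illustration): cache the value until the site has been invalidated too many times, then permanently fall back to a cached wrapper object so reads and writes still skip the hash lookup.

```ruby
# Hypothetical model of the caching scheme, not JRuby's actual code.
Wrapper = Struct.new(:value)

class GlobalSite
  MAX_FAIL = 100 # mirrors -Xinvokedynamic.global.maxfail=100

  def initialize(table, name)
    @table, @name = table, name
    @fails = 0              # how many times the value cache was invalidated
    @cached_value = nil
    @value_valid = false
    @wrapper = nil
  end

  # The slow hash lookup happens at most once per site.
  def wrapper
    @wrapper ||= @table[@name]
  end

  def read
    if @fails < MAX_FAIL && @value_valid
      @cached_value                  # fast path: value bound at the site
    else
      v = wrapper.value              # wrapper is cached, no hash lookup
      if @fails < MAX_FAIL           # past the failover point, stop re-caching
        @cached_value = v
        @value_valid = true
      end
      v
    end
  end

  def write(v)
    wrapper.value = v
    @value_valid = false             # invalidate the cached value
    @fails += 1 if @fails < MAX_FAIL
  end
end
```

After `MAX_FAIL` invalidations the site stops re-caching the value entirely, which is what keeps heavily written globals from paying invalidation costs forever while still avoiding the per-access lookup.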


@headius headius closed this Mar 3, 2017

itspomf Mar 3, 2017

Thanks for this! Just saw your progress and I'm glad to hear there's a good resolution.

