Optimize event handling with large arguments #9643

@simonkcleung
simonkcleung commented Nov 16, 2016 edited

In V8, it is faster to create an array by [] instead of new Array().
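As a sketch of the two approaches under discussion (the function names here are illustrative, not the actual events.js code):

```javascript
'use strict';

// A sketch of the two strategies for collecting emit() arguments.
// `collectSized` mirrors the current pre-allocated approach and
// `collectPush` the pattern proposed here; neither is the actual
// events.js source.
function collectSized() {
  const len = arguments.length;
  const args = new Array(len - 1);          // sized upfront
  for (let i = 1; i < len; i++)
    args[i - 1] = arguments[i];
  return args;
}

function collectPush() {
  const args = [];                          // starts empty, grows
  for (let i = 1; i < arguments.length; i++)
    args.push(arguments[i]);
  return args;
}

console.log(collectSized('dummy', 1, 2, 3)); // [ 1, 2, 3 ]
console.log(collectPush('dummy', 1, 2, 3));  // [ 1, 2, 3 ]
```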

@simonkcleung simonkcleung Optimize event handling with large arguments
In V8, it is faster to create an array by `[]` instead of `new Array()`.
ccf0b0c
@cjihrig
Member
cjihrig commented Nov 16, 2016

Is it still faster taking into consideration that the current code creates an appropriately sized array from the start?

@mscdex
Contributor
mscdex commented Nov 16, 2016 edited

I'm a bit skeptical about this. Can you provide a benchmark results comparison using benchmark/compare.R?

@simonkcleung
simonkcleung commented Nov 16, 2016 edited

@cjihrig
Yes, it is faster.

@mscdex
I'm not familiar with the benchmark tooling. I tested it with https://github.com/nodejs/node/blob/master/benchmark/events/ee-emit-multi-args.js, with line 17 changed to `ee.emit('dummy', 1, 2, 3, 4, 5, 6);`

@JacksonTian
Contributor

-1 without benchmark results.

@Trott
Member
Trott commented Nov 16, 2016 edited

I ran the events benchmarks and the statistically-significant results indicate that this new code is slightly slower.

Command I ran (after compiling the master branch and moving it to /var/tmp, then compiling the master branch with this patch applied and moving it to a different name in /var/tmp):

 node benchmark/compare.js --old /var/tmp/node-master --new /var/tmp/node-simonkcleung events | tee /var/tmp/events.csv

Command I then ran to process the data:

cat /var/tmp/events.csv | Rscript benchmark/compare.R

Results:

                                                     improvement significant    p.value
 events/ee-add-remove.js n=250000                         0.76 %             0.07922860
 events/ee-emit-multi-args.js n=2000000                  -1.30 %           * 0.04312630
 events/ee-emit.js n=2000000                             -0.63 %             0.63112687
 events/ee-listener-count-on-prototype.js n=50000000     -0.26 %             0.52637424
 events/ee-listeners-many.js n=5000000                   -1.25 %           * 0.03766378
 events/ee-listeners.js n=5000000                         0.19 %             0.75764889
@mscdex
Contributor
mscdex commented Nov 16, 2016 edited

Yeah, I don't see how calling push() on a non-pre-allocated array could be faster than using a pre-allocated one. I could maybe understand them performing the same if V8 implicitly pre-allocates some space and the benchmark happens to push() fewer items than that initial internal allocation. Otherwise V8 has to perform some kind of additional allocation on push(), whereas specifying a size upfront should never cause any additional allocations (provided you don't append more than that specified size, of course).

@Trott
Member
Trott commented Nov 16, 2016

ALTHOUGH... I changed the benchmark file as described by @simonkcleung in #9643 (comment), and if I do that, the improvement is downright astonishing:

events/ee-emit-multi-args.js n=2000000                1017.58 %         *** 3.928048e-40
@Fishrock123
Member

@Trott are you sure you changed that right?

@Trott
Member
Trott commented Nov 17, 2016

@Fishrock123 Yes, I'm pretty certain, but if you or someone else (@mscdex?) want to see if you can or can't replicate my results, that would be great.

@targos
Member
targos commented Nov 17, 2016

I also get a similar result:

events/ee-emit-multi-args.js n=2000000                 991.02 %         *** 4.679818e-72

It still feels terribly wrong to me...

@bnoordhuis
Member

Can you post the diff? I'd like to give it a try.

@targos
Member
targos commented Nov 17, 2016

You mean this diff?

diff --git a/benchmark/events/ee-emit-multi-args.js b/benchmark/events/ee-emit-multi-args.js
index b423c21..0ca3026 100644
--- a/benchmark/events/ee-emit-multi-args.js
+++ b/benchmark/events/ee-emit-multi-args.js
@@ -14,7 +14,7 @@ function main(conf) {

   bench.start();
   for (var i = 0; i < n; i += 1) {
-    ee.emit('dummy', 5, true);
+    ee.emit('dummy', 1, 2, 3, 4, 5, 6);
   }
   bench.end(n);
 }
@targos
Member
targos commented Nov 18, 2016

So I profiled a run of this modified benchmark on current master, and the result shows that a significant amount of time is spent in Runtime_CreateListFromArrayLike:

 [C++ entry points]:
   ticks    cpp   total   name
   4285   99.1%   80.7%  v8::internal::Runtime_CreateListFromArrayLike(int, v8::internal::Object**, v8::internal::Isolate*)
     20    0.5%    0.4%  v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*)
...

 [Bottom up (heavy) profile]:
...
   ticks parent  name
    542   10.2%  void v8::internal::LookupIterator::Start<true>()
    542  100.0%    v8::internal::Runtime_CreateListFromArrayLike(int, v8::internal::Object**, v8::internal::Isolate*)
    542  100.0%      LazyCompile: *emit events.js:136:44
    535   98.7%        LazyCompile: *main /home/mzasso/git/forks/node/benchmark/events/ee-emit-multi-args.js:7:14
    535  100.0%          LazyCompile: ~Benchmark.process.nextTick /home/mzasso/git/forks/node/benchmark/common.js:24:22
    535  100.0%            LazyCompile: ~_combinedTickCallback internal/process/next_tick.js:65:33

This doesn't happen when this PR is applied (Runtime_CreateListFromArrayLike doesn't appear in the profile).

master.txt
pr-9643.txt

@mscdex
Contributor
mscdex commented Nov 18, 2016 edited

Ok, so the regression happened between node v5.x and v6.0.0, which use V8 4.6 and V8 5.0 respectively.

I've also found out that the performance regression actually has nothing to do with the value types being assigned, or with assigning values to the array at all; it seems to be the array allocation itself causing the slowdown. If I remove the array assignment loop completely and just do:

  • new Array(6): it's still as slow as it is currently
  • new Array(1): ... same ...
  • new Array(0), new Array(), or []: now really fast (perhaps obviously)

So it's not related to explicit constructor use or to supplying a length argument per se; the slowdown occurs whenever a positive, non-zero length is passed to the constructor.

Without bisecting, there is one particular V8 commit that was landed during V8 4.9 where array construction logic was changed (and these changes still exist in master) that might be a good candidate...

@targos
Member
targos commented Nov 18, 2016

cc @bmeurer ?

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

W/o looking further into the benchmark, it seems that there's some Function.prototype.apply or Reflect.apply/Reflect.construct call involved; that would also explain the performance difference.

If you do

var a = [];
for (var i = 0; i < len; ++i) {
  a[i] = someValue;
}

that creates a non-holey array (i.e. V8 knows that there are no holes in the elements and thus we don't need to fall back to lookups on prototypes). However

var a = new Array(len);
for (var i = 0; i < len; ++i) {
  a[i] = someValue;
}

creates a holey array from the get-go (thanks to ECMAScript's wonderful "length" property magic).

So if you now use such an array with Function.prototype.apply (or the Reflect.apply/Reflect.construct ES6 additions), you'll hit the fast path for non-holey arrays where we just push all elements onto the stack and call through, while for holey arrays, we first need to turn them into a list (represented as FixedArray internally) and can then push them onto the stack.

There's an open issue to extend this fast case at some point to also cover a bunch of holey array cases, but so far we didn't get back to that.
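The packed-versus-holey distinction above can be sketched as follows (a minimal illustration; `sum` is a made-up function, and the elements-kind difference itself is internal to V8 and not observable from plain JavaScript here):

```javascript
'use strict';

// Packed: elements are written in ascending order starting at index 0,
// so V8 never records a hole.
const packed = [];
for (let i = 0; i < 6; i++) packed[i] = i + 1;

// Holey: new Array(6) sets length to 6 before any element exists, so
// V8 marks the backing store as potentially holey, and it stays holey
// even after every slot has been filled.
const holey = new Array(6);
for (let i = 0; i < 6; i++) holey[i] = i + 1;

// From JavaScript the two arrays are indistinguishable; only the
// internal elements kind (and thus the apply() fast path) differs.
function sum() {
  let total = 0;
  for (let i = 0; i < arguments.length; i++) total += arguments[i];
  return total;
}

console.log(sum.apply(null, packed)); // 21
console.log(sum.apply(null, holey));  // 21
```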

@mscdex
Contributor
mscdex commented Nov 18, 2016 edited

Here is a simpler repro. It uses a constant length value (which I think V8 typically optimizes for?), but it still seems to exhibit the same slowdown:

function foo() {
  return 1 + 1;
}

var n = 2e6;

console.time('fn.apply');
for (var i = 0; i < n; i += 1) {
  //var args = new Array(6);
  //args[0] = args[1] = args[2] = args[3] = args[4] = args[5] = 1;
  var args = [];
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  foo.apply(this, args);
}
console.timeEnd('fn.apply');

Replace the push()s with the commented-out section to see the difference.

  • node v5.12.0 / V8 4.6.85.32
    • new Array(6): fn.apply: 231.184ms
    • sequential push()s: fn.apply: 303.116ms
  • node v6.0.0 / V8 5.0.71.35
    • new Array(6): fn.apply: 913.341ms
    • sequential push()s: fn.apply: 156.452ms
  • node master / V8 5.4.500.41
    • new Array(6): fn.apply: 972.002ms
    • sequential push()s: fn.apply: 148.616ms

EDIT: oops, didn't see the responses above before I posted this....

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

Yep, so it's the missing fast path for holey arrays in Function.prototype.apply then.

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

This is related to https://bugs.chromium.org/p/v8/issues/detail?id=4826, although that particular one is about the double arrays.

@bnoordhuis
Member

Yes, that's indeed what happens.

$ node --allow_natives_syntax -e '%DebugPrint([1,2,3])' | grep 'elements kind'
 - elements kind: FAST_SMI_ELEMENTS

$ node --allow_natives_syntax -e '%DebugPrint([5,true])' | grep 'elements kind'
 - elements kind: FAST_ELEMENTS

$ node --allow_natives_syntax -e '%DebugPrint(new Array(2))' | grep 'elements kind'
 - elements kind: FAST_HOLEY_SMI_ELEMENTS
@bmeurer
Contributor
bmeurer commented Nov 18, 2016

I'll see if I can cook a quickfix for the FAST_HOLEY_SMI_ELEMENTS and FAST_HOLEY_ELEMENTS cases.

@mscdex
Contributor
mscdex commented Nov 18, 2016

Even

var args = [];
args[0] = args[1] = args[2] = args[3] = args[4] = args[5] = 1;
fn.apply(this, args);

is just as slow FWIW.

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

Sure, you store to args[5] first, meaning you turn args into a holey array with holes at elements 0,...,4.
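A sketch of the assignment-order effect (variable names are illustrative):

```javascript
'use strict';

// A chained assignment evaluates its right-most store first, so
// chained[5] is written before indices 0..4 exist, which makes the
// array holey from that point on.
const chained = [];
chained[0] = chained[1] = chained[2] = chained[3] = chained[4] = chained[5] = 1;

// Writing indices in ascending order keeps the array packed.
const ascending = [];
for (let i = 0; i <= 5; i++) ascending[i] = 1;

// The contents end up identical either way.
console.log(chained.join() === ascending.join()); // true
```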

@mscdex
Contributor
mscdex commented Nov 18, 2016 edited

Sure, you store to args[5] first, meaning you turn args into a holey array with holes at elements 0,...,4.

Ah, oops, you're right :-) That's what I get for trying to make a one-liner... Reversing the assignment order is fast now, as expected.

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

I have a fix here (intel only): https://codereview.chromium.org/2510043004

@jasnell

I'm -1 on this. Using the preallocated array is preferable when possible.

@simonkcleung

As @bmeurer pointed out, using a holey array in apply() causes the slowdown. If it is fixed in V8, nothing has to be changed here.

@fhinkel
Member
fhinkel commented Dec 6, 2016 edited

I think we're still waiting for the S390 and PPC port of that V8 fix /cc @jasnell

@mscdex mscdex added the V8 label Dec 6, 2016
@jbajwa
Contributor
jbajwa commented Dec 10, 2016

@fhinkel Sorry for the delay, I somehow missed this CL. Just uploaded the port for PPC/s390.

@bmeurer
Contributor
bmeurer commented Dec 11, 2016

@jbajwa Thanks. So with your port, we have Intel, ARM, MIPS and PPC/s390 ports for the regression.

@jasnell Pre-allocating an array is not generally beneficial; it depends on how the array is used. For example, load/store access to a non-holey array is generally faster. So if you can create an array by pushing elements to it, that can be a performance win as long as the potential overhead from copying during growth doesn't dominate.
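The trade-off described here can be sketched as follows (function names are illustrative; actual performance depends on the V8 version and on how the array is used afterwards):

```javascript
'use strict';

// Building by push keeps the array packed at the cost of possible
// backing-store growth during the loop; pre-sizing avoids growth but
// yields a holey array.
function buildByPush(n) {
  const a = [];
  for (let i = 0; i < n; i++) a.push(i);
  return a;
}

function buildPreSized(n) {
  const a = new Array(n);
  for (let i = 0; i < n; i++) a[i] = i;
  return a;
}

console.log(buildByPush(4));   // [ 0, 1, 2, 3 ]
console.log(buildPreSized(4)); // [ 0, 1, 2, 3 ]
```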

@mscdex
Contributor
mscdex commented Jan 17, 2017

Any word on if/when the relevant upstream V8 changes will be backported to V8 5.4 (assuming V8 5.5 won't make it into node v7)?

@addaleax
Member

@mscdex Not sure if that’s even a partial answer to your question, but according to #9730 (comment) V8 5.4 is not maintained anymore, which I guess means that we’d have to take care of any backports ourselves?

@mscdex
Contributor
mscdex commented Jan 17, 2017

@addaleax If that's the case, I thought there was still supposed to be some sort of special coordination going on to get them backported on V8's side so we wouldn't have to float patches ourselves? Or is that only for V8 versions used by node LTS releases?

@addaleax
Member

Sorry, I don’t really know what our current process for these things is. @nodejs/v8 probably knows?

@bmeurer
Contributor
bmeurer commented Jan 17, 2017

As far as I know we only backmerge fixes for Node LTS versions (and Chrome versions in the wild). @fhinkel might know more.

These CLs might be quite easy to backmerge AFAIR.

@natorion
natorion commented Jan 17, 2017 edited

EDIT: Node has its own V8 fork where patches are merged to. See https://github.com/nodejs/node/blob/master/doc/guides/maintaining-V8.md for more information on this.

Seems like the separate fork is still a proposal. Please refer to https://github.com/nodejs/node/blob/master/doc/guides/maintaining-V8.md#backporting-to-abandoned-branches as a manual for backporting.

@ofrobots is often coordinating this.

@ofrobots
Contributor

I'm +1 on backporting to 5.4; however, this would also need to be backported to V8 5.5 and 5.6 upstream. @natorion How likely is upstream to approve the merge request for 5.5 and 5.6?

@mscdex mscdex referenced this pull request Jan 19, 2017

buffer: improve toJSON() performance #10895