Optimize event handling with large arguments #9643

@simonkcleung
simonkcleung commented Nov 16, 2016 edited

In V8, it is faster to create an array by [] instead of new Array().
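As a sketch of the two approaches under discussion (the function names here are illustrative, not the actual events.js code):

```javascript
'use strict';

// A sketch of the two strategies for collecting emit() arguments.
// `collectSized` mirrors the current pre-allocated approach and
// `collectPush` the pattern proposed here; neither is the actual
// events.js source.
function collectSized() {
  const len = arguments.length;
  const args = new Array(len - 1);          // sized upfront
  for (let i = 1; i < len; i++)
    args[i - 1] = arguments[i];
  return args;
}

function collectPush() {
  const args = [];                          // starts empty, grows
  for (let i = 1; i < arguments.length; i++)
    args.push(arguments[i]);
  return args;
}

console.log(collectSized('dummy', 1, 2, 3)); // [ 1, 2, 3 ]
console.log(collectPush('dummy', 1, 2, 3));  // [ 1, 2, 3 ]
```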

@simonkcleung simonkcleung Optimize event handling with large arguments
In V8, it is faster to create an array by `[]` instead of `new Array()`.
ccf0b0c
@cjihrig
Member
cjihrig commented Nov 16, 2016

Is it still faster taking into consideration that the current code creates an appropriately sized array from the start?

@mscdex
Contributor
mscdex commented Nov 16, 2016 edited

I'm a bit skeptical about this. Can you provide a benchmark results comparison using benchmark/compare.R?

@simonkcleung
simonkcleung commented Nov 16, 2016 edited

@cjihrig
Yes, it is faster.

@mscdex
I'm not familiar with the benchmark tooling. I tested it with https://github.com/nodejs/node/blob/master/benchmark/events/ee-emit-multi-args.js, with line 17 changed to `ee.emit('dummy', 1, 2, 3, 4, 5, 6);`

@JacksonTian
Contributor

-1 without benchmark results.

@Trott
Member
Trott commented Nov 16, 2016 edited

I ran the events benchmarks and the statistically-significant results indicate that this new code is slightly slower.

Command I ran (after compiling the master branch and moving it to /var/tmp, then compiling the master branch with this patch applied and moving it to a different name in /var/tmp):

 node benchmark/compare.js --old /var/tmp/node-master --new /var/tmp/node-simonkcleung events | tee /var/tmp/events.csv

Command I then ran to process the data:

cat /var/tmp/events.csv | Rscript benchmark/compare.R

Results:

                                                     improvement significant    p.value
 events/ee-add-remove.js n=250000                         0.76 %             0.07922860
 events/ee-emit-multi-args.js n=2000000                  -1.30 %           * 0.04312630
 events/ee-emit.js n=2000000                             -0.63 %             0.63112687
 events/ee-listener-count-on-prototype.js n=50000000     -0.26 %             0.52637424
 events/ee-listeners-many.js n=5000000                   -1.25 %           * 0.03766378
 events/ee-listeners.js n=5000000                         0.19 %             0.75764889
@mscdex
Contributor
mscdex commented Nov 16, 2016 edited

Yeah, I don't see how calling push() on a non-pre-allocated array could be faster than using a pre-allocated one. I could maybe understand them performing the same if V8 implicitly pre-allocates some space and the benchmark happens to push() fewer items than that initial internal allocation. Otherwise V8 has to perform some kind of additional allocation on push(), whereas specifying a size upfront should never cause any additional allocations (provided you don't append more than that specified size, of course).

@Trott
Member
Trott commented Nov 16, 2016

ALTHOUGH... I changed the benchmark file as described by @simonkcleung in #9643 (comment), and if I do that, the improvement is downright astonishing:

events/ee-emit-multi-args.js n=2000000                1017.58 %         *** 3.928048e-40
@Fishrock123
Member

@Trott are you sure you changed that right?

@Trott
Member
Trott commented Nov 17, 2016

@Fishrock123 Yes, I'm pretty certain, but if you or someone else (@mscdex?) want to see if you can or can't replicate my results, that would be great.

@targos
Member
targos commented Nov 17, 2016

I also get a similar result:

events/ee-emit-multi-args.js n=2000000                 991.02 %         *** 4.679818e-72

It still feels terribly wrong to me...

@bnoordhuis
Member

Can you post the diff? I'd like to give it a try.

@targos
Member
targos commented Nov 17, 2016

You mean this diff?

diff --git a/benchmark/events/ee-emit-multi-args.js b/benchmark/events/ee-emit-multi-args.js
index b423c21..0ca3026 100644
--- a/benchmark/events/ee-emit-multi-args.js
+++ b/benchmark/events/ee-emit-multi-args.js
@@ -14,7 +14,7 @@ function main(conf) {

   bench.start();
   for (var i = 0; i < n; i += 1) {
-    ee.emit('dummy', 5, true);
+    ee.emit('dummy', 1, 2, 3, 4, 5, 6);
   }
   bench.end(n);
 }
@targos
Member
targos commented Nov 18, 2016

So I profiled a run of this modified benchmark on current master, and the result shows that a significant amount of time is spent in Runtime_CreateListFromArrayLike:

 [C++ entry points]:
   ticks    cpp   total   name
   4285   99.1%   80.7%  v8::internal::Runtime_CreateListFromArrayLike(int, v8::internal::Object**, v8::internal::Isolate*)
     20    0.5%    0.4%  v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*)
...

 [Bottom up (heavy) profile]:
...
   ticks parent  name
    542   10.2%  void v8::internal::LookupIterator::Start<true>()
    542  100.0%    v8::internal::Runtime_CreateListFromArrayLike(int, v8::internal::Object**, v8::internal::Isolate*)
    542  100.0%      LazyCompile: *emit events.js:136:44
    535   98.7%        LazyCompile: *main /home/mzasso/git/forks/node/benchmark/events/ee-emit-multi-args.js:7:14
    535  100.0%          LazyCompile: ~Benchmark.process.nextTick /home/mzasso/git/forks/node/benchmark/common.js:24:22
    535  100.0%            LazyCompile: ~_combinedTickCallback internal/process/next_tick.js:65:33

This doesn't happen when this PR is applied (Runtime_CreateListFromArrayLike doesn't appear in the profile).

master.txt
pr-9643.txt

@mscdex
Contributor
mscdex commented Nov 18, 2016 edited

Ok, so the regression happened between node v5.x and v6.0.0, which use V8 4.6 and V8 5.0 respectively.

I've also found out that the performance regression actually has nothing to do with the value types being assigned, or with assigning values to the array at all; it seems to be the array allocation itself causing the slowdown. If I remove the array assignment loop completely and just do:

  • new Array(6): it's still as slow as it is currently
  • new Array(1): ... same ...
  • new Array(0), new Array(), or []: now really fast (perhaps obviously)

So it's not related to explicit constructor use or to supplying a length argument per se; the slowdown occurs whenever a positive, non-zero length is passed to the constructor.

Without bisecting, there is one particular V8 commit that was landed during V8 4.9 where array construction logic was changed (and these changes still exist in master) that might be a good candidate...

@targos
Member
targos commented Nov 18, 2016

cc @bmeurer ?

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

W/o looking further into the benchmark, it seems that there's some Function.prototype.apply or Reflect.apply/Reflect.construct call involved; that would also explain the performance difference.

If you do

var a = [];
for (var i = 0; i < len; ++i) {
  a[i] = someValue;
}

that creates a non-holey array (i.e. V8 knows that there are no holes in the elements and thus we don't need to fall back to lookups on prototypes). However

var a = new Array(len);
for (var i = 0; i < len; ++i) {
  a[i] = someValue;
}

creates a holey array from the get-go (thanks to ECMAScript's wonderful "length" property magic).

So if you now use such an array with Function.prototype.apply (or the Reflect.apply/Reflect.construct ES6 additions), you'll hit the fast path for non-holey arrays where we just push all elements onto the stack and call through, while for holey arrays, we first need to turn them into a list (represented as FixedArray internally) and can then push them onto the stack.

There's an open issue to extend this fast case at some point to also cover a bunch of holey array cases, but so far we didn't get back to that.
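The packed-versus-holey distinction above can be sketched as follows (a minimal illustration; `sum` is a made-up function, and the elements-kind difference itself is internal to V8 and not observable from plain JavaScript here):

```javascript
'use strict';

// Packed: elements are written in ascending order starting at index 0,
// so V8 never records a hole.
const packed = [];
for (let i = 0; i < 6; i++) packed[i] = i + 1;

// Holey: new Array(6) sets length to 6 before any element exists, so
// V8 marks the backing store as potentially holey, and it stays holey
// even after every slot has been filled.
const holey = new Array(6);
for (let i = 0; i < 6; i++) holey[i] = i + 1;

// From JavaScript the two arrays are indistinguishable; only the
// internal elements kind (and thus the apply() fast path) differs.
function sum() {
  let total = 0;
  for (let i = 0; i < arguments.length; i++) total += arguments[i];
  return total;
}

console.log(sum.apply(null, packed)); // 21
console.log(sum.apply(null, holey));  // 21
```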

@mscdex
Contributor
mscdex commented Nov 18, 2016 edited

Here is a simpler repro. It uses a constant length value (which I think V8 typically optimizes for?), but it still seems to exhibit the same slowdown:

function foo() {
  return 1 + 1;
}

var n = 2e6;

console.time('fn.apply');
for (var i = 0; i < n; i += 1) {
  //var args = new Array(6);
  //args[0] = args[1] = args[2] = args[3] = args[4] = args[5] = 1;
  var args = [];
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  foo.apply(this, args);
}
console.timeEnd('fn.apply');

Replace the push()s with the commented-out section to see the difference.

  • node v5.12.0 / V8 4.6.85.32
    • new Array(6): fn.apply: 231.184ms
    • sequential push()s: fn.apply: 303.116ms
  • node v6.0.0 / V8 5.0.71.35
    • new Array(6): fn.apply: 913.341ms
    • sequential push()s: fn.apply: 156.452ms
  • node master / V8 5.4.500.41
    • new Array(6): fn.apply: 972.002ms
    • sequential push()s: fn.apply: 148.616ms

EDIT: oops, didn't see the responses above before I posted this....

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

Yep, so it's the missing fast path for holey arrays in Function.prototype.apply then.

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

This is related to https://bugs.chromium.org/p/v8/issues/detail?id=4826, although that particular one is about the double arrays.

@bnoordhuis
Member

Yes, that's indeed what happens.

$ node --allow_natives_syntax -e '%DebugPrint([1,2,3])' | grep 'elements kind'
 - elements kind: FAST_SMI_ELEMENTS

$ node --allow_natives_syntax -e '%DebugPrint([5,true])' | grep 'elements kind'
 - elements kind: FAST_ELEMENTS

$ node --allow_natives_syntax -e '%DebugPrint(new Array(2))' | grep 'elements kind'
 - elements kind: FAST_HOLEY_SMI_ELEMENTS
@bmeurer
Contributor
bmeurer commented Nov 18, 2016

I'll see if I can cook a quickfix for the FAST_HOLEY_SMI_ELEMENTS and FAST_HOLEY_ELEMENTS cases.

@mscdex
Contributor
mscdex commented Nov 18, 2016

Even

var args = [];
args[0] = args[1] = args[2] = args[3] = args[4] = args[5] = 1;
fn.apply(this, args);

is just as slow FWIW.

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

Sure, you store to args[5] first, meaning you turn args into a holey array with holes at elements 0,...,4.
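A sketch of the assignment-order effect (variable names are illustrative):

```javascript
'use strict';

// A chained assignment evaluates its right-most store first, so
// chained[5] is written before indices 0..4 exist, which makes the
// array holey from that point on.
const chained = [];
chained[0] = chained[1] = chained[2] = chained[3] = chained[4] = chained[5] = 1;

// Writing indices in ascending order keeps the array packed.
const ascending = [];
for (let i = 0; i <= 5; i++) ascending[i] = 1;

// The contents end up identical either way.
console.log(chained.join() === ascending.join()); // true
```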

@mscdex
Contributor
mscdex commented Nov 18, 2016 edited

Sure, you store to args[5] first, meaning you turn args into a holey array with holes at elements 0,...,4.

Ah, oops, you're right :-) That's what I get for trying to make a one-liner... Reversing the assignment order is fast now, as expected.

@bmeurer
Contributor
bmeurer commented Nov 18, 2016

I have a fix here (intel only): https://codereview.chromium.org/2510043004

@jasnell

I'm -1 on this. Using the preallocated array is preferable when possible.

@simonkcleung

As @bmeurer pointed out, using a holey array in apply() causes the slowdown. If it is fixed in V8, nothing has to be changed here.

@fhinkel
Member
fhinkel commented Dec 6, 2016 edited

I think we're still waiting for the S390 and PPC port of that V8 fix /cc @jasnell

@mscdex mscdex added the V8 label Dec 6, 2016
@jbajwa
Contributor
jbajwa commented Dec 10, 2016

@fhinkel Sorry for the delay, I somehow missed this CL. Just uploaded the port for PPC/s390.

@bmeurer
Contributor
bmeurer commented Dec 11, 2016

@jbajwa Thanks. So with your port, we have Intel, ARM, MIPS and PPC/s390 ports for the regression.

@jasnell Pre-allocating an array is not generally beneficial; it depends on how the array is used. For example, load/store access to a non-holey array is generally faster. So if you can create an array by pushing elements to it, that can be a performance win as long as the potential overhead from copying during growth doesn't dominate.
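The trade-off described here can be sketched as follows (function names are illustrative; actual performance depends on the V8 version and on how the array is used afterwards):

```javascript
'use strict';

// Building by push keeps the array packed at the cost of possible
// backing-store growth during the loop; pre-sizing avoids growth but
// yields a holey array.
function buildByPush(n) {
  const a = [];
  for (let i = 0; i < n; i++) a.push(i);
  return a;
}

function buildPreSized(n) {
  const a = new Array(n);
  for (let i = 0; i < n; i++) a[i] = i;
  return a;
}

console.log(buildByPush(4));   // [ 0, 1, 2, 3 ]
console.log(buildPreSized(4)); // [ 0, 1, 2, 3 ]
```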

@mscdex
Contributor
mscdex commented Jan 17, 2017

Any word on if/when the relevant upstream V8 changes will be backported to V8 5.4 (assuming V8 5.5 won't make it into node v7)?

@addaleax
Member

@mscdex Not sure if that’s even a partial answer to your question, but according to #9730 (comment) V8 5.4 is not maintained anymore, which I guess means that we’d have to take care of any backports ourselves?

@mscdex
Contributor
mscdex commented Jan 17, 2017

@addaleax If that's the case, I thought there was still supposed to be some sort of special coordination going on to get them backported on V8's side so we wouldn't have to float patches ourselves? Or is that only for V8 versions used by node LTS releases?

@addaleax
Member

Sorry, I don’t really know what our current process for these things is. @nodejs/v8 probably knows?

@bmeurer
Contributor
bmeurer commented Jan 17, 2017

As far as I know we only backmerge fixes for Node LTS versions (and Chrome versions in the wild). @fhinkel might know more.

These CLs might be quite easy to backmerge AFAIR.

@natorion
natorion commented Jan 17, 2017 edited

EDIT: Node has its own V8 fork where patches are merged to. See https://github.com/nodejs/node/blob/master/doc/guides/maintaining-V8.md for more information on this.

Seems like the separate fork is still a proposal. Please refer to https://github.com/nodejs/node/blob/master/doc/guides/maintaining-V8.md#backporting-to-abandoned-branches as a manual for backporting.

@ofrobots is often coordinating this.

@ofrobots
Contributor

I'm +1 on backporting to 5.4; however, this would also need to be backported to V8 5.5 and 5.6 upstream. @natorion How likely is upstream to approve the merge request for 5.5 and 5.6?

@mscdex mscdex referenced this pull request Jan 19, 2017

buffer: improve toJSON() performance #10895