
async_wrap,lib: Firing async-wrap callbacks for next-tick callbacks #6082

Closed

Conversation

mike-kaufman

Checklist
  • tests and code linting pass
  • a test and/or benchmark is included
  • documentation is changed or added
  • the commit message follows commit guidelines

Caveats:

  • seeing the following test failures on Windows that occur without my changes:
  • I'm not sure if there are docs that need to be updated here. Please let me know.
  • I'm not sure if there are relevant benchmarks that I should run. Please let me know.
Affected core subsystem(s)

async_wrap, lib

Description of change

This change updates nextTick() callback processing to call back into
the node runtime to notify the runtime of async state transitions.
Currently, the runtime will track the active async ID and fire any
registered async-wrap callbacks. This gives async-wrap consistent user
semantics for nextTick() callbacks.
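
To illustrate what "consistent user semantics" means here, a minimal sketch of the event ordering a consumer would observe around a single nextTick() callback. The names (emit, trackedNextTick) are hypothetical and for illustration only; this is not the PR's code:

var nextId = 1;
function emit(event, uid) { console.log(event, uid); }  // stands in for a registered async-wrap callback

function trackedNextTick(cb) {
  var uid = nextId++;
  emit('init', uid);          // fires when the callback is enqueued
  process.nextTick(function() {
    emit('before', uid);      // the runtime marks uid as the active async ID
    try {
      cb();
    } finally {
      emit('after', uid);     // the active async ID is restored
    }
  });
}

trackedNextTick(function() { console.log('tick'); });
// prints: init 1, before 1, tick, after 1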

I anticipate some concerns around impact to perf. Rough measurements
show an approximate 2x increase in time to process 1,000,000 nextTick()
callbacks. These went from an average of 1094 ms per million
nextTick() callbacks to 2133 ms per million with the changes in this PR.

I'm open to alternative suggestions around implementation here. :) It's
not lost on me that the implementation of next_tick.js is making an
effort to minimize transitions from JS to the runtime, and this change
increases those transitions by a factor of three. One of the goals
here was to have basic entry-points for javascript code to notify the
runtime of async transitions (namely, "enqueue", "start", and "end").
This allows a future where async call graph information can be tracked
by the runtime and eliminates a number of cases where users will
require the async-wrap callbacks. For example, continuation-local
storage can be implemented on such an API, without callbacks, assuming
each node in the graph contains a tuple of (unique ID, parent ID,
pending child callback count, state), and where state is one of
enqueued, running, or completed.
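
For illustration, a minimal sketch of that idea, assuming hypothetical enqueue/start/end entry points and an in-JS graph (none of this is the PR's code):

var graph = new Map();   // uid -> node
var nextUid = 1;
var currentUid = 0;      // 0 = top level, not in the graph

function enqueue() {
  var uid = nextUid++;
  graph.set(uid, { uid: uid, parent: currentUid, pendingChildren: 0,
                   state: 'enqueued', store: {} });
  var parent = graph.get(currentUid);
  if (parent) parent.pendingChildren++;
  return uid;
}

function start(uid) {
  graph.get(uid).state = 'running';
  currentUid = uid;
}

function end(uid) {
  var node = graph.get(uid);
  node.state = 'completed';
  var parent = graph.get(node.parent);
  if (parent) parent.pendingChildren--;   // node collectible once completed with no pending children
  currentUid = node.parent;
}

// Continuation-local storage read: walk the parent chain until a value is found.
function getContext(key) {
  for (var n = graph.get(currentUid); n; n = graph.get(n.parent)) {
    if (key in n.store) return n.store[key];
  }
}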

@mike-kaufman
Author

/CC: @trevnorris @AndreasMadsen

@AndreasMadsen
Member

I'm sorry. I'm way too busy to review anything right now, @trevnorris will have to take care of this.

@trevnorris
Contributor

I had an alternate implementation sitting in a local branch. Pushed part of it to trevnorris@4155f0e. This only handles attaching a uid to TickObject, since the general mechanism for passing the parent id needs to be fleshed out more so that async handles always have a parent id if they're not top level. The approach has almost negligible performance impact. The same can also be done for parent ids. If you'd like to attempt a refactor based on this, that's fine, though I have been working on essentially the same thing in order to support timers.
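
A rough sketch of that approach, for illustration only (names are hypothetical; the actual code is in the linked commit): stamp each TickObject with a uid at enqueue time, so no per-tick JS-to-C++ crossing is needed:

var uidCounter = 0;

function TickObject(callback) {
  this.callback = callback;
  // Monotonically increasing uid, read later when the tick is processed;
  // nothing crosses into C++ at enqueue time.
  this.uid = ++uidCounter;
}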

@mike-kaufman
Author

Thanks @trevnorris. I took a quick look at your changes. From my measurements using your array technique, it takes 1-2 milliseconds to read an int value 1,000,000 times. Using the alternative in this PR (creating a v8::Integer each time) takes about 10-12 milliseconds per 1,000,000 reads.

So, your approach is clearly faster, but my estimate is that it will save only about 1-2% of the total cost of 1,000,000 nextTick callbacks. I didn't run this under a profiler, but I think the dominant factor here will be the transitions from JS->runtime->JS and back for each of the init/begin/complete callbacks.

I can implement the different techniques and get some concrete measurements if you think that's important.

@mike-kaufman
Author

@trevnorris - Any more feedback on these changes? I'd like to move this forward, and I believe this PR gets to consistent API behavior for nextTick() callbacks.

I tried out the technique using the uint32 array, and it was slower by ~20%! It's possible I did something wrong. Feel free to take a look at the changes here. This change just made it easy to compare the two approaches by calling either nextTick() (to get the behavior with this PR) or nextTickTemp() (to enqueue a nextTick() callback and pull the async ID via the uint32 array).

Here's the code I ran to test out the perf of 1,000,000 callbacks:

var moment = require('moment');
console.log("starting");
var start = new moment();
var count = 0;
var numIterations = 1000000;

for (var i = 0; i < numIterations; i++) {
    process.nextTick(function() {
    // uncomment below to use retrieve current/next IDs via uint32 array
    // process.nextTickTemp(function() {
        count++;
        if (count === numIterations) {
            var now = new moment();
            var elapsed = now.diff(start);
            console.log("elapsed time is " + elapsed + " milliseconds");
            process.exit();
        }
    });
}

@trevnorris
Contributor

There isn't a need to cross over between C++ and JS while executing the callbacks. There's no need to store that much state. It can be a lot simpler and cheaper. This also doesn't take into account the integration with timers, which are also under the same constraints. If using shared flags is slower, then an implementation detail is incorrect. I've already tested it and the overhead is almost unnoticeable.

That nextTick() benchmark is going to kill v8 due to the sheer size of the nextTickQueue array before anything can be processed. Here's an alternative:

var ITER = 1e6;
var cntr = 0;
var t = process.hrtime();
(function runner() {
  if (++cntr >= ITER) return done();
  process.nextTick(runner);
}());
function done() {
  t = process.hrtime(t);
  console.log(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
}

Or if you want to test having more than one callback in the nextTickQueue array:

var ITER = 1e6;
var cntr = 0;
var t = process.hrtime();
function noop() {}
(function runner() {
  if (++cntr >= ITER) return done();
  for (var i = 0; i < 9; i++)
    process.nextTick(noop);
  process.nextTick(runner);
}());
function done() {
  t = process.hrtime(t);
  console.log(((t[0] * 1e9 + t[1]) / (ITER * 10)).toFixed(1) + ' ns/op');
}

As far as whether the performance difference matters, yes it does. This is the center of the event loop. We are already battling performance hindrances from domains needing to be checked (missed changing _tickDomainCallback() the same way _tickCallback() was), needing to make sure the MicrotaskQueue is processed, etc. This needs to be as close to a noop on our end as possible.

@mike-kaufman
Author

Thanks @trevnorris. Some comments/questions:

There isn't a need to cross over between C++ and JS while executing the callbacks.

One of my goals here was to consolidate the logic around async enqueue/begin-execution/end-execution (here, here and here respectively). While any logic associated with these events can be replicated in javascript, this comes at the expense of code maintainability/readability.

There's no need to store that much state. It can be a lot simpler and cheaper.

Sorry, I'm not sure I'm following - specifically what state are you referring to? This PR only adds state to the runtime to track the "current-async-id", and state to each next-tick object (lines 155-157) to allow async-wrap callbacks with correct user semantics.

That nextTick() benchmark is going to kill v8 due to the sheer size of the nextTickQueue array before anything can be processed

Interesting... it seemed to run fine for me. I'll play around with your suggestion to see if results differ.

As far as whether the performance difference matters, yes it does.

Are there specific benchmarks I can run to measure a user's impact here? Probably my ignorance, but I'm a bit uncomfortable talking in the abstract about "it has to be fast". In my experience, it is more productive to speak quantitatively about how a set of changes impacts users' scenarios (e.g., "this change results in a 10% reduction in http requests/second").

missed changing _tickDomainCallback() the same way _tickCallback() was

Thanks. I'll update...

@trevnorris
Contributor

While any logic associated with these events can be replicated in javascript, this comes at the expense of code maintainability/readability.

Take a quick look around the file you're working in and you'll see that consolidating is not as important as performance.

Sorry, I'm not sure I'm following - specifically what state are you referring to?

You're adding 4 properties to a TickObject instance.

Interesting... it seemed to run fine for me.

Running, and running in a way that gives you useful numbers, are two different things. There is no real-world use case where queueing up that many nextTicks exists.

Are there specific benchmarks I can run to measure a user's impact here?

There are too many variables to do a simple benchmark. The most reliable way is to generate the JIT'd code and analyze the difference. But in short, when running any benchmark that uses nextTick, the operation itself should essentially show up as a noop.

it is more productive to speak quantitatively about how a set of changes impacts users' scenarios (e.g., "this change results in a 10% reduction in http requests/second").

I've seen minor regressions being allowed since v0.8. Thanks to all those small hits, node is now noticeably slower at performing I/O and async operations than it used to be. I'm willing to accept compromise in less performance-intensive areas, but not here. For the most basic test, make sure that the nextTick benchmarks I posted don't regress any more than 5, maybe 10%.

@mike-kaufman
Author

You're adding 4 properties to a TickObject instance.

This was the state necessary to call into the existing AsyncWrap callbacks. Are there specific suggestions you have to simplify it? A few things I can think of here are to drop the parent ID (it should be available after the init() call) and to drop the "asyncState" object; however, not having a state object associated with the callbacks would be inconsistent with the current form of the async-wrap APIs.

Thanks to all those small hits, node is now noticeably slower at performing I/O and async operations than it used to be.

It would be really helpful to me if you could point me at some existing tests and benchmarks that quantify the impact of changes such as this.

For the most basic test, make sure that the nextTick benchmarks I posted don't regress any more than 5, maybe 10%.

I'll take a look at trying to reduce the overhead here.

@Fishrock123
Contributor

The C++ code linting here doesn't actually pass.

I'm guessing you are running on Windows, since our vcbuild file isn't able to invoke our cpplint.

Here are the errors:

src/async-wrap.cc:123:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:129:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:141:  { should almost always be at the end of the previous line  [whitespace/braces] [4]
src/async-wrap.cc:147:  Blank line at the start of a code block.  Is this needed?  [whitespace/blank_line] [2]
src/async-wrap.cc:191:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:194:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:205:  Using C-style cast.  Use reinterpret_cast<AsyncWrap*>(...) instead  [readability/casting] [4]
src/async-wrap.cc:216:  { should almost always be at the end of the previous line  [whitespace/braces] [4]
src/async-wrap.cc:219:  Blank line at the start of a code block.  Is this needed?  [whitespace/blank_line] [2]
src/async-wrap.cc:245:  { should almost always be at the end of the previous line  [whitespace/braces] [4]
src/async-wrap.cc:270:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.cc:271:  Blank line at the start of a code block.  Is this needed?  [whitespace/blank_line] [2]
src/async-wrap.cc:295:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:296:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:299:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.cc:300:  Blank line at the start of a code block.  Is this needed?  [whitespace/blank_line] [2]
src/async-wrap.cc:308:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:315:  Blank line at the end of a code block.  Is this needed?  [whitespace/blank_line] [3]
src/async-wrap.cc:435:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:436:  Line ends in whitespace. Consider deleting these extra spaces.  [whitespace/end_of_line] [4]
src/async-wrap-inl.h:28:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap-inl.h:36:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.h:51:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.h:51:  Missing space after ,  [whitespace/comma] [3]
src/async-wrap.h:52:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.h:53:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.h:54:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]

@mike-kaufman
Author

Yeah, sorry, I ran this on Windows. I'll get this cleaned up. Looks like whitespace/line-length issues.

@mike-kaufman
Author

Updated PR to reduce the number of times the nextTick logic needs to transition from
JS to runtime. Includes:

  • Added a JS file at ./lib/async_wrap.js to expose certain async-wrap
    functionality to JS code.
  • exposed the runtime's async-wrap callbacks to JS via an array. This is
    in anticipation of eventually supporting multiple async-wrap callbacks.
  • exposed the current-async-ID & next-async-ID counters from the runtime
    to JS by mapping an int64 onto int32[2]. This allows reading/writing
    these values without a transition to the runtime & re-allocation of
    v8 types on each read/write. (A sketch of this mapping follows below.)
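
A sketch of the int64-to-int32[2] mapping just described, assuming a little-endian layout with the low word first; the "fields" array below stands in for the buffer the runtime would expose:

var fields = new Uint32Array(2);  // stands in for the runtime-shared buffer

function readId() {
  // Recombine the 64-bit counter from its two 32-bit halves.
  return fields[1] * 0x100000000 + fields[0];
}

function writeId(id) {
  fields[0] = id >>> 0;                      // low 32 bits
  fields[1] = Math.floor(id / 0x100000000);  // high 32 bits
}

Reads and writes stay entirely in JS; no v8 value is allocated per access.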

I'm going to look into quantifying how this change impacts node benchmarks. Please let me know of any concerns or specific suggestions around improving perf.

@fhinkel
Member

fhinkel commented Mar 26, 2017

ping @mike-kaufman Can you rebase this? Thanks.

@AndreasMadsen
Member

I think this is superseded by #11883

@mike-kaufman
Author

I think this is superseded by #11883

Yes, closing this.
