
async_wrap,lib: Firing async-wrap callbacks for next-tick callbacks #6082

Closed

Conversation

mike-kaufman

Checklist
  • tests and code linting pass
  • a test and/or benchmark is included
  • documentation is changed or added
  • the commit message follows commit guidelines

Caveats:

  • seeing the following test failures on Windows that occur without my changes:
  • I'm not sure if there are docs that need to be updated here. Please let me know.
  • I'm not sure if there are relevant benchmarks that I should run. Please let me know.
Affected core subsystem(s)

async_wrap, lib

Description of change

This change updates nextTick() callback processing to call back into
the node runtime to notify the runtime of async state transitions.
Currently, the runtime will track the active async ID and fire any
registered async-wrap callbacks. This gives async-wrap consistent user
semantics for nextTick() callbacks.
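
To illustrate what "consistent user semantics" means here, a minimal sketch of the event ordering a consumer would observe around a single nextTick() callback. The names (emit, trackedNextTick) are hypothetical and for illustration only; this is not the PR's code:

var nextId = 1;
function emit(event, uid) { console.log(event, uid); }  // stands in for a registered async-wrap callback

function trackedNextTick(cb) {
  var uid = nextId++;
  emit('init', uid);          // fires when the callback is enqueued
  process.nextTick(function() {
    emit('before', uid);      // the runtime marks uid as the active async ID
    try {
      cb();
    } finally {
      emit('after', uid);     // the active async ID is restored
    }
  });
}

trackedNextTick(function() { console.log('tick'); });
// prints: init 1, before 1, tick, after 1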

I anticipate some concerns around impact to perf. Rough measurements
show an approximate 2x increase in time to process 1,000,000 nextTick()
callbacks. These went from an average of 1094 ms per million
nextTick() callbacks to 2133 ms per million with the changes in this PR.

I'm open to alternative suggestions around implementation here. :) It's
not lost on me that the implementation of next_tick.js is making an
effort to minimize transitions from JS to the runtime, and this change
increases those transitions by a factor of three. One of the goals
here was to have basic entry-points for javascript code to notify the
runtime of async transitions (namely, "enqueue", "start", and "end").
This allows a future where async call graph information can be tracked
by the runtime and eliminates a number of cases where users will
require the async-wrap callbacks. For example, continuation-local
storage can be implemented on such an API, without callbacks, assuming
each node in the graph contains a tuple of (unique ID, parent ID,
pending child callback count, state), and where state is one of
enqueued, running, or completed.
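
For illustration, a minimal sketch of that idea, assuming hypothetical enqueue/start/end entry points and an in-JS graph (none of this is the PR's code):

var graph = new Map();   // uid -> node
var nextUid = 1;
var currentUid = 0;      // 0 = top level, not in the graph

function enqueue() {
  var uid = nextUid++;
  graph.set(uid, { uid: uid, parent: currentUid, pendingChildren: 0,
                   state: 'enqueued', store: {} });
  var parent = graph.get(currentUid);
  if (parent) parent.pendingChildren++;
  return uid;
}

function start(uid) {
  graph.get(uid).state = 'running';
  currentUid = uid;
}

function end(uid) {
  var node = graph.get(uid);
  node.state = 'completed';
  var parent = graph.get(node.parent);
  if (parent) parent.pendingChildren--;   // node collectible once completed with no pending children
  currentUid = node.parent;
}

// Continuation-local storage read: walk the parent chain until a value is found.
function getContext(key) {
  for (var n = graph.get(currentUid); n; n = graph.get(n.parent)) {
    if (key in n.store) return n.store[key];
  }
}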

@mike-kaufman
Author

/CC: @trevnorris @AndreasMadsen

@AndreasMadsen
Member

I'm sorry. I'm way too busy to review anything right now, @trevnorris will have to take care of this.

@trevnorris
Contributor

I had an alternate implementation sitting in a local branch. Pushed part of it to trevnorris@4155f0e. This only handles attaching a uid to TickObject, since the general mechanism for passing the parent id needs to be fleshed out more so that async handles always have a parent id if they're not top level. The approach has almost negligible performance impact. The same can also be done for parent ids. If you'd like to attempt a refactor based on this, that's fine, though I have been working on essentially the same thing in order to support timers.
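
A rough sketch of that approach, for illustration only (names are hypothetical; the actual code is in the linked commit): stamp each TickObject with a uid at enqueue time, so no per-tick JS-to-C++ crossing is needed:

var uidCounter = 0;

function TickObject(callback) {
  this.callback = callback;
  // Monotonically increasing uid, read later when the tick is processed;
  // nothing crosses into C++ at enqueue time.
  this.uid = ++uidCounter;
}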

@mike-kaufman
Author

Thanks @trevnorris. I took a quick look at your changes. From my measurements using your array technique, it takes 1-2 milliseconds to read an int value 1,000,000 times. Using the alternative in this PR (creating a v8::Integer each time) takes about 10-12 milliseconds per 1,000,000 reads.

So, your approach is clearly faster, but my estimate is that it will save only about 1-2% of the total cost of 1,000,000 nextTick callbacks. I didn't run this under a profiler, but I think the dominant factor here will be the transitions from JS->runtime->JS and back for each of the init/begin/complete callbacks.

I can implement the different techniques and get some concrete measurements if you think that's important.

@mike-kaufman
Author

@trevnorris - Any more feedback on these changes? I'd like to move this forward, and I believe this PR gets to consistent API behavior for nextTick() callbacks.

I tried out the technique using the uint32 array, and it was slower by ~20%! It's possible I did something wrong. Feel free to take a look at the changes here. This change just made it easy to compare the two approaches by calling either nextTick() (to get the behavior with this PR) or nextTickTemp() (to enqueue a nextTick() callback and pull the async ID via the uint32 array).

Here's the code I ran to test out the perf of 1,000,000 callbacks:

var moment = require('moment');
console.log("starting");
var start = new moment();
var count = 0;
var numIterations = 1000000;

for (var i = 0; i < numIterations; i++) {
    process.nextTick(function() {
    // uncomment below to use retrieve current/next IDs via uint32 array
    // process.nextTickTemp(function() {
        count++;
        if (count === numIterations) {
            var now = new moment();
            var elapsed = now.diff(start);
            console.log("elapsed time is " + elapsed + " milliseconds");
            process.exit();
        }
    });
}

@trevnorris
Contributor

There isn't a need to cross over between C++ and JS while executing the callbacks. There's no need to store that much state. It can be a lot simpler and cheaper. This also doesn't take into account the integration with timers, which are also under the same constraints. If using shared flags is slower, then an implementation detail is incorrect. I've already tested it and the overhead is almost unnoticeable.

That nextTick() benchmark is going to kill v8 due to the sheer size of the nextTickQueue array before anything can be processed. Here's an alternative:

var ITER = 1e6;
var cntr = 0;
var t = process.hrtime();
(function runner() {
  if (++cntr >= ITER) return done();
  process.nextTick(runner);
}());
function done() {
  t = process.hrtime(t);
  console.log(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
}

Or if you want to test having more than one callback in the nextTickQueue array:

var ITER = 1e6;
var cntr = 0;
var t = process.hrtime();
function noop() {}
(function runner() {
  if (++cntr >= ITER) return done();
  for (var i = 0; i < 9; i++)
    process.nextTick(noop);
  process.nextTick(runner);
}());
function done() {
  t = process.hrtime(t);
  console.log(((t[0] * 1e9 + t[1]) / (ITER * 10)).toFixed(1) + ' ns/op');
}

As far as whether the performance difference matters, yes it does. This is the center of the event loop. We are already battling performance hindrances from domains needing to be checked (missed changing _tickDomainCallback() the same way _tickCallback() was), needing to make sure the MicrotaskQueue is processed, etc. This needs to be as close to a noop on our end as possible.

@mike-kaufman
Author

Thanks @trevnorris. Some comments/questions:

There isn't a need to cross over between C++ and JS while executing the callbacks.

One of my goals here was to consolidate the logic around async enqueue/begin-execution/end-execution (here, here and here respectively). While any logic associated with these events can be replicated in javascript, this comes at the expense of code maintainability/readability.

There's no need to store that much state. It can be a lot simpler and cheaper.

Sorry, I'm not sure I'm following - specifically what state are you referring to? This PR only adds state to the runtime to track the "current-async-id", and state to each next-tick object (lines 155-157) to allow async-wrap callbacks with correct user semantics.

That nextTick() benchmark is going to kill v8 due to the sheer size of the nextTickQueue array before anything can be processed

Interesting... it seemed to run fine for me. I'll play around with your suggestion to see if results differ.

As far as whether the performance difference matters, yes it does.

Are there specific benchmarks I can run to measure a user's impact here? Probably my ignorance, but I'm a bit uncomfortable talking in the abstract about "it has to be fast". In my experience, it is more productive to speak quantitatively about how a set of changes impacts users' scenarios (e.g., "this change results in a 10% reduction in http requests/second").

missed changing _tickDomainCallback() the same way _tickCallback() was

Thanks. I'll update...

@trevnorris
Contributor

While any logic associated with these events can be replicated in javascript, this comes at the expense of code maintainability/readability.

Take a quick look around the file you're working in and you'll see that consolidating is not as important as performance.

Sorry, I'm not sure I'm following - specifically what state are you referring to?

You're adding 4 properties to a TickObject instance.

Interesting... it seemed to run fine for me.

Running, and running in a way that gives you useful numbers, are two different things. There is no real-world use case where queueing up that many nextTicks exists.

Are there specific benchmarks I can run to measure a user's impact here?

There are too many variables to do a simple benchmark. The most reliable way is to generate the JIT'd code and analyze the difference. But in short, when running any benchmark that uses nextTick, the operation itself should essentially show up as a noop.

it is more productive to speak quantitatively about how a set of changes impacts users' scenarios (e.g., "this change results in a 10% reduction in http requests/second").

I've seen minor regressions being allowed since v0.8. Thanks to all those small hits, node is now noticeably slower at performing I/O and async operations than it used to be. I'm willing to accept compromise in less performance-intensive areas, but not here. For the most basic test, make sure that the nextTick benchmarks I posted don't regress any more than 5, maybe 10%.

@mike-kaufman
Author

You're adding 4 properties to a TickObject instance.

This was the state necessary to call into the existing AsyncWrap callbacks. Are there specific suggestions you have to simplify it? A few things I can think of here are to drop the parent ID (it should be available after the init() call) and to drop the "asyncState" object; however, not having a state object associated with the callbacks would be inconsistent with the current form of the async-wrap APIs.

Thanks to all those small hits, node is now noticeably slower at performing I/O and async operations than it used to be.

It would be really helpful to me if you could point me at some existing tests and benchmarks that quantify the impact of changes such as this.

For the most basic test, make sure that the nextTick benchmarks I posted don't regress any more than 5, maybe 10%.

I'll take a look at trying to reduce the overhead here.

@Fishrock123
Contributor

The C++ code linting here doesn't actually pass.

I'm guessing you are running on Windows, since our vcbuild file isn't able to invoke our cpplint.

Here are the errors:

src/async-wrap.cc:123:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:129:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:141:  { should almost always be at the end of the previous line  [whitespace/braces] [4]
src/async-wrap.cc:147:  Blank line at the start of a code block.  Is this needed?  [whitespace/blank_line] [2]
src/async-wrap.cc:191:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:194:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:205:  Using C-style cast.  Use reinterpret_cast<AsyncWrap*>(...) instead  [readability/casting] [4]
src/async-wrap.cc:216:  { should almost always be at the end of the previous line  [whitespace/braces] [4]
src/async-wrap.cc:219:  Blank line at the start of a code block.  Is this needed?  [whitespace/blank_line] [2]
src/async-wrap.cc:245:  { should almost always be at the end of the previous line  [whitespace/braces] [4]
src/async-wrap.cc:270:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.cc:271:  Blank line at the start of a code block.  Is this needed?  [whitespace/blank_line] [2]
src/async-wrap.cc:295:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:296:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:299:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.cc:300:  Blank line at the start of a code block.  Is this needed?  [whitespace/blank_line] [2]
src/async-wrap.cc:308:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:315:  Blank line at the end of a code block.  Is this needed?  [whitespace/blank_line] [3]
src/async-wrap.cc:435:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap.cc:436:  Line ends in whitespace. Consider deleting these extra spaces.  [whitespace/end_of_line] [4]
src/async-wrap-inl.h:28:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
src/async-wrap-inl.h:36:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.h:51:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.h:51:  Missing space after ,  [whitespace/comma] [3]
src/async-wrap.h:52:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.h:53:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]
src/async-wrap.h:54:  Lines should very rarely be longer than 100 characters  [whitespace/line_length] [4]

@mike-kaufman
Author

Yeah, sorry, I ran this on Windows. I'll get this cleaned up. Looks like whitespace/line-length issues.

@mike-kaufman
Author

Updated PR to reduce the number of times the nextTick logic needs to transition from
JS to runtime. Includes:

  • Added a JS file at ./lib/async_wrap.js to expose certain async-wrap
    functionality to JS code.
  • exposed the runtime's async-wrap callbacks to JS via an array. This is
    in anticipation of eventually supporting multiple async-wrap callbacks.
  • exposed the current-async-ID & next-async-ID counters from the runtime
    to JS by mapping an int64 onto int32[2]. This allows reading/writing
    these values without a transition to the runtime & re-allocation of
    v8 types on each read/write. (A sketch of this mapping follows below.)
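
A sketch of the int64-to-int32[2] mapping just described, assuming a little-endian layout with the low word first; the "fields" array below stands in for the buffer the runtime would expose:

var fields = new Uint32Array(2);  // stands in for the runtime-shared buffer

function readId() {
  // Recombine the 64-bit counter from its two 32-bit halves.
  return fields[1] * 0x100000000 + fields[0];
}

function writeId(id) {
  fields[0] = id >>> 0;                      // low 32 bits
  fields[1] = Math.floor(id / 0x100000000);  // high 32 bits
}

Reads and writes stay entirely in JS; no v8 value is allocated per access.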

I'm going to look into quantifying how this change impacts node benchmarks. Please let me know of any concerns or specific suggestions around improving perf.

@fhinkel
Member

fhinkel commented Mar 26, 2017

ping @mike-kaufman Can you rebase this? Thanks.

@AndreasMadsen
Member

I think this is superseded by #11883

@mike-kaufman
Author

I think this is superseded by #11883

Yes, closing this.
