process.nextTick & node::MakeCallback are very bloated #10101
cc @trevnorris |
Would you mind sharing node version, platform and an example implementation? |
This is Node.js 5 (which I have found is faster than 7 and 6) on Linux. I don't have any minimal test right now but you should be able to just write a test that calls into JavaScript and returns many times and compare it with the native V8 function calls. |
What is an iteration? |
The problem is that MakeCallback is slow, very slow, while V8 alone is much faster. So I cannot use MakeCallback. One iteration is one epoll_wait return and the handling of its ready fds. By avoiding MakeCallback on each call into JS I get much better performance, so the conclusion is that Node.js adds tons of overhead to V8 in this case, and that overhead needs to be reduced to allow better performance of addons and Node.js itself. |
The current behavior is not even as described in the docs: https://nodejs.org/api/process.html#process_process_nexttick_callback_args The behavior I implement is more in line with the docs. |
@alexhultman whether the docs for next tick are sufficiently detailed is not relevant.
That is not my understanding. An iteration is actually the processing of a single struct epoll_event (and yes, this is not documented; the challenge is describing it without referring to epoll, which isn't used on non-Linux). If it wasn't like that, there would be observable-in-JS differences in when next tick callbacks are called depending on the size of the epoll_wait output array and the timing of the events. Intended behaviour is that next tick callbacks queued while handling event1 run before event2 is handled, whether epoll_wait returned event1 and event2 in a single call or in two consecutive calls. Your suggestion would result in the next ticks all being bunched at the end, after the two events were both handled. Or so is my understanding, maybe @bnoordhuis or @saghul will contradict me. |
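For illustration, a minimal C++ sketch (hypothetical handler names, not code from this thread) of the two behaviours being contrasted:

```cpp
#include <node.h>
#include <v8.h>

using namespace v8;

// Hypothetical: fn1 and fn2 are the JS handlers for two ready fds
// delivered by a single epoll_wait() return.
void HandleTwoEvents(Isolate* isolate, Local<Object> recv,
                     Local<Function> fn1, Local<Function> fn2) {
  HandleScope scope(isolate);

  // Intended behaviour: node::MakeCallback drains the nextTickQueue
  // after each handler, so ticks queued by fn1 run before fn2.
  node::MakeCallback(isolate, recv, fn1, 0, nullptr);
  node::MakeCallback(isolate, recv, fn2, 0, nullptr);

  // The suggested alternative: raw V8 calls skip the drain, so every
  // queued tick would run only after both handlers finished.
  // fn1->Call(recv, 0, nullptr);
  // fn2->Call(recv, 0, nullptr);
}
```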
You may find the conversation in nodejs/nan#284 interesting. |
process.nextTick() callbacks bear almost no relation to libuv I/O. They fire whenever node makes a call into the VM but that doesn't need to be due to I/O. Example: timers. @alexhultman Try batching your calls. That's a good idea whether you use MakeCallback() or not.
Too risky. I expect the change in timing would cause considerable ecosystem fallout. |
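A rough sketch of the batching idea suggested above (the names and payload handling are hypothetical): collect everything from one loop iteration and cross the C++/JS boundary once.

```cpp
#include <node.h>
#include <v8.h>
#include <vector>

using namespace v8;

// Hypothetical: instead of one MakeCallback per ready fd, collect the
// payloads for one loop iteration and dispatch them in a single call.
void DispatchBatch(Isolate* isolate, Local<Object> recv,
                   Local<Function> jsHandler,
                   const std::vector<Local<Value>>& payloads) {
  HandleScope scope(isolate);
  Local<Array> batch = Array::New(isolate, static_cast<int>(payloads.size()));
  for (size_t i = 0; i < payloads.size(); i++)
    batch->Set(static_cast<uint32_t>(i), payloads[i]);
  Local<Value> argv[] = { batch };
  // One boundary crossing (and one tick-queue drain) per iteration.
  node::MakeCallback(isolate, recv, jsHandler, 1, argv);
}
```

The trade-off, as noted later in the thread, is that batching can disrupt zero-copy designs where each payload points into a transient buffer.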
I get the difference and I have read the linked issue (and many others). I know that whenever there is a call into JS the tick queue is emptied; this I have seen with testing. But the thing is, this is slow, and there is nothing in the documentation explicitly stating what counts as an iteration. If user code heavily depends on the exact order of process.nextTick then that code is wrong; you shouldn't make such absurd assumptions about order, and code like that kind of goes against the whole "async" programming idea where order is irrelevant. I have thought about batching up the calls but that would disrupt my zero-copy data flow (and I don't like having different behavior in the addon compared to the core library). But it is not just about me and my use case - the idea is to speed up all calls into JS land by relaxing the rules of process.nextTick. I know Node.js itself is sprinkled with tons of MakeCallback and those could be replaced with pure V8 calls to significantly speed up lots of things. There shouldn't be such a major tax on every call into JS, especially when V8 itself adds even more tax. Also, I already shipped this change and users haven't noticed anything. Not that I have billions of users, but last time there was something wrong with process.nextTick I got people reporting that issue (it was way back in April when I still didn't use MakeCallback at all, only pure V8, which of course breaks process.nextTick completely). |
That's not possible because of domains (i.e., the functionality from the domain module.) The right domain needs to have been entered when calling into JS land. Little used feature of questionable practicality, you say? I don't disagree but that doesn't mean it can change. |
I haven't looked into domains so I cannot make any kind of enlightened statement regarding them. However my view is still that this whole process.nextTick machinery seems very inefficient and bloated for what it is trying to achieve. The way you splice the process.nextTick vector from the beginning for every single function is something I have never seen anywhere else (just to name one thing). Node.js is of course built with some performance-limiting decisions (JavaScript/V8) that are also part of why it is popular. But when Node.js adds even more overhead to something already very performance limited [V8], that seems like something worth taking a look at. If one decides to have V8 as the center of the product, one needs to take extra care to really get 100% of the performance and not 40%. I mean, when I do things I often write a prototype first to see what kind of performance I can expect, then I add abstractions to that code and track my performance so that I do not deviate way off my expectations. With Node.js I feel there is really no benchmarking done for things like these; the difference between using pure V8 and using Node.js's addon features is massive. I actually recently (maybe one or two months ago) also found that V8's ArrayBuffer was WAY faster than Node.js's Buffer type. It's kind of the same thing going on again and again: V8 is a lot slower than native, which is acceptable given that it is a scripting VM, but then Node.js (in C++) adds tons of overhead on top of that (MakeCallback, Buffer, NaN). Point is, I think Node.js needs to take more caution when wrapping V8 because it seems you don't perform any benchmarks of those features you expose to addons. Swapping from Buffer to ArrayBuffer gave me a 100% throughput gain; swapping from MakeCallback to a V8 Call gave me a 24% throughput gain. These numbers are very high for features that are meant to achieve the exact same thing, or at least very similar things. |
It all makes perfect sense when you know the history. Node started in bazaar fashion: much enthusiasm, many contributors, little cohesion. Then node got big and it became increasingly difficult to revisit previous mistakes. That was already becoming more and more true five years ago, let alone now. I'm well aware that it's not an epitome of efficiency in places, probably more aware than anyone else. At the end of the day though, most users don't really care about performance, even if they say they do; what they care about is that their stuff keeps working. Node is at a point in its life where backwards compatibility is more important than performance at all cost. I'm fairly confident I could write a node knockoff that's at least 5x faster, but it would be different enough that no one would bother because their code wouldn't work out of the box. |
100% truth right there. "Scalability" among Nodesters™ seems to be more about the ability to put in more hardware rather than a measurement of software efficiency. I am still a bit unhappy with MakeCallback since that is the only thing I cannot get away from. |
Do you think MakeCallback can be improved once domains are fully removed? Then you should at least be able to replace everything with a simple if (hasPendingProcessNextTickFunction) {...} else {pure V8 call}? |
and async hooks and the v8 microtask queue |
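For concreteness, a sketch of what that fast path could look like, assuming domains, async hooks, and microtask bookkeeping were out of the picture. It leans on node's internal env.h, which addons can't include today, so this is a sketch of a core change, not something buildable outside the tree:

```cpp
#include <v8.h>
// Internal headers; not part of the public addon API.
#include "env.h"
#include "env-inl.h"

using namespace v8;

// With env.h available, "are ticks pending?" is just a length check.
static bool HasPendingTicks(node::Environment* env) {
  return env->tick_info()->length() > 0;
}

// Fast path: skip the JS trampoline entirely when nothing is queued.
Local<Value> FastCall(node::Environment* env, Local<Object> recv,
                      Local<Function> fn, int argc, Local<Value>* argv) {
  Local<Value> ret = fn->Call(recv, argc, argv);  // pure V8 call
  if (HasPendingTicks(env)) {
    // Slow path: drain the nextTickQueue the way MakeCallback does.
    env->tick_callback_function()->Call(env->process_object(), 0, nullptr);
  }
  return ret;
}
```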
So why aren't you nuking the current process.nextTick mountain and just using the microtask features: void v8::Isolate::EnqueueMicrotask(Handle<Function> microtask)? It seems V8 can even automatically run microtasks when you return from all Calls. I bet you could improve performance majorly by ditching all that JS code doing all those splices and whatnot in favor of the native V8 counterparts. From looking at the V8 docs regarding this, it sounds like you can still get away with only one single Call per event and then just run the tasks at uv_check_t. To me it just looks extremely inefficient as it is right now. |
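A sketch of what that suggestion might look like (the uv_check_t wiring and names are mine, not an existing node API; it also ignores the domain requirements raised earlier in the thread):

```cpp
#include <node.h>
#include <uv.h>
#include <v8.h>

using namespace v8;

uv_check_t check_handle;  // fires once per loop iteration, after I/O

// Instead of the nextTickQueue: push the callback onto V8's microtask queue.
void QueueTask(Isolate* isolate, Local<Function> fn) {
  isolate->EnqueueMicrotask(fn);
}

// Drain everything that was queued during this iteration in one go.
void OnCheck(uv_check_t* handle) {
  Isolate* isolate = static_cast<Isolate*>(handle->data);
  HandleScope scope(isolate);
  isolate->RunMicrotasks();
}

void SetupMicrotaskDrain(Isolate* isolate) {
  uv_check_init(uv_default_loop(), &check_handle);
  check_handle.data = isolate;
  uv_check_start(&check_handle, OnCheck);
}
```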
@alexhultman before you continue going on about how bloated things are, let's take a concrete example. Quick benchmark script. C++ code for the benchmark:

```cpp
#include <v8.h>
#include <uv.h>
#include <node.h>
namespace bmod {
using namespace v8;
uv_idle_t my_idle;
size_t iter_;
Persistent<Context> context_p;
Persistent<Function> fn_p;
Persistent<Object> obj_p;
template <class TypeName>
v8::Local<TypeName> StrongPersistentToLocal(
const v8::Persistent<TypeName>& persistent) {
return *reinterpret_cast<v8::Local<TypeName>*>(
const_cast<v8::Persistent<TypeName>*>(&persistent));
}
void MakeCallback(const FunctionCallbackInfo<Value>& args) {
Isolate* isolate = args.GetIsolate();
Local<Object> ctx = args[0].As<Object>();
Local<Function> fn = args[1].As<Function>();
const size_t iter = args[2]->IntegerValue();
for (size_t i = 0; i < iter; i++) {
HandleScope scope(isolate);
node::MakeCallback(isolate, ctx, fn, 0, nullptr);
}
}
void FnCall(const FunctionCallbackInfo<Value>& args) {
Isolate* isolate = args.GetIsolate();
Local<Context> context = isolate->GetCurrentContext();
Local<Object> ctx = args[0].As<Object>();
Local<Function> fn = args[1].As<Function>();
const size_t iter = args[2]->IntegerValue();
for (size_t i = 0; i < iter; i++) {
HandleScope scope(isolate);
fn->Call(context, ctx, 0, nullptr).ToLocalChecked();
}
}
void run_makecallback(uv_idle_t* handle) {
uv_idle_stop(handle);
Isolate* isolate = static_cast<Isolate*>(handle->data);
HandleScope scope(isolate);
Local<Context> context = StrongPersistentToLocal(context_p);
Context::Scope context_scope(context);
Local<Object> ctx = StrongPersistentToLocal(obj_p);
Local<Function> fn = StrongPersistentToLocal(fn_p);
for (size_t i = 0; i < iter_; i++) {
HandleScope scope(isolate);
node::MakeCallback(isolate, ctx, fn, 0, nullptr);
//fn->Call(context, ctx, 0, nullptr).ToLocalChecked();
}
uv_close(reinterpret_cast<uv_handle_t*>(handle), nullptr);
}
void FromIdle(const FunctionCallbackInfo<Value>& args) {
Isolate* isolate = args.GetIsolate();
context_p.Reset(isolate, isolate->GetCurrentContext());
obj_p.Reset(isolate, args[0].As<Object>());
fn_p.Reset(isolate, args[1].As<Function>());
iter_ = args[2]->IntegerValue();
my_idle.data = isolate;
uv_idle_start(&my_idle, run_makecallback);
}
void Init(Local<Object> exports) {
NODE_SET_METHOD(exports, "makeCallback", MakeCallback);
NODE_SET_METHOD(exports, "fnCall", FnCall);
NODE_SET_METHOD(exports, "fromIdle", FromIdle);
uv_idle_init(uv_default_loop(), &my_idle);
}
} // namespace bmod
NODE_MODULE(addon, bmod::Init)
```

And the JS file to run it:

```js
'use strict';
// uncomment this to see difference w/ domains
//require('domain');
const addon = require('./build/Release/addon');
const print = process._rawDebug;
const ITER = 1e6;
var cntr = 0;
var t = process.hrtime();
function noop() {
if (++cntr < ITER) return;
t = process.hrtime(t);
print(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
}
// This probably more appropriately measures what you are observing.
addon.fromIdle({}, noop, ITER);
return;
setImmediate(() => {
var t = process.hrtime();
//addon.makeCallback({}, noop, ITER);
//addon.fnCall({}, noop, ITER);
t = process.hrtime(t);
print(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
});
```

Results:
First notice that we have already done quite a bit to improve performance in regards to domains. Second, note the time difference between makeCallback and fnCall. Now, in regards to that difference, here is a patch:

```diff
diff --git a/src/node.cc b/src/node.cc
index ce39cb4..24ea3bb 100644
--- a/src/node.cc
+++ b/src/node.cc
@@ -1316,6 +1316,7 @@ Local<Value> MakeCallback(Environment* env,
   if (tick_info->length() == 0) {
     tick_info->set_index(0);
+    return ret;
   }
 
   if (env->tick_callback_function()->Call(process, 0, nullptr).IsEmpty()) {
```

Short of it is, if there is nothing in the nextTickQueue then we can return early instead of calling into JS. More of that time can be removed with the following change (which will be coming in a future PR):

```diff
diff --git a/src/node.cc b/src/node.cc
index ce39cb4..e4a987e 100644
--- a/src/node.cc
+++ b/src/node.cc
@@ -1227,13 +1227,8 @@ Local<Value> MakeCallback(Environment* env,
   Environment::AsyncCallbackScope callback_scope(env);
 
-  // TODO(trevnorris): Adding "_asyncQueue" to the "this" in the init callback
-  // is a horrible way to detect usage. Rethink how detection should happen.
   if (recv->IsObject()) {
     object = recv.As<Object>();
-    Local<Value> async_queue_v = object->Get(env->async_queue_string());
-    if (async_queue_v->IsObject())
-      ran_init_callback = true;
   }
 
   if (env->using_domains()) {
```

Which brings execution time down further. As for the remaining questions:
Because it doesn't support wrapping calls in domains.
If that were the case then we'd ditch it. |
I'm going to leave this open until at least the first issue described above is addressed. The latter, about removing the _asyncQueue check, will be handled in a future PR. |
I don't get your point; didn't your benchmark just validate my report? That is, MakeCallback is extremely slow compared to pure V8 function calls. Even with those future fixes you mentioned, you are still looking at a 2x overhead per function call even though the process.nextTick queue is empty (which is the case most of the time). I already bypass most of Node.js with my addon but I'm still reliant on MakeCallback which, like I reported, has a major negative effect on my overall performance. Also, I wouldn't agree that calling into C++ is slow; that is done for most things in Node.js already. Buffer.toString() is a C++ call and nobody is reporting issues about that being slow. Getting a small overhead when calling process.nextTick is far better than getting a constant overhead for every event trigger. Something optimized would be maybe 10% slower than V8 - not 100%. The recursive check is an example of how it is bloated - why would I need a check like that if I already know that my call is not going to be recursive? Those are the things I mean when I say it is bloated, and those are the things that add up to the 2x overhead. |
Let's clarify "extremely slow". In the land of JS 70 ns is not "extremely" slow. Let's be reasonable about the use of our adjectives here.
You're conflating calling into C++ and from C++. The performance overhead is not the same. Calling into C++ is 10-15 ns. Calling out of C++ is ~70 ns.
Please, I beg of you to stop arguing things you haven't tested or validated. It's honestly a waste of developers' time. But to appease your insistence on something you're so sure of, let's simplify this with a diff that completely trims MakeCallback(). |
To completely put this to rest, here is the fully trimmed MakeCallback():

```cpp
Local<Value> MakeCallback(Environment* env,
Local<Value> recv,
const Local<Function> callback,
int argc,
Local<Value> argv[]) {
return callback->Call(recv, argc, argv);
}
```

And here's the test script:

```js
'use strict';
const addon = require('./build/Release/addon');
const print = process._rawDebug;
const ITER = 1e6;
var cntr = 0;
var t = process.hrtime();
addon.makeCallback({}, () => {}, ITER);
t = process.hrtime(t);
print(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
```

Run time is essentially that of a bare fn->Call(). The addon side of the benchmark:

```cpp
void MakeCallback(const FunctionCallbackInfo<Value>& args) {
Isolate* isolate = args.GetIsolate();
Local<Object> ctx = args[0].As<Object>();
Local<Function> fn = args[1].As<Function>();
const size_t iter = args[2]->IntegerValue();
for (size_t i = 0; i < iter; i++) {
node::MakeCallback(isolate, ctx, fn, 0, nullptr);
}
}
```

But you can't do this directly when returning from an async call, or else the nextTickQueue would never be drained.
So even if you're only calling into JS once per event, that bookkeeping still has to happen somewhere. |
@alexhultman I have modified your above comment to remove a single word which was less than ideal to use. As someone who has just poked their head in the thread I urge you to remember we abide by a code of conduct in this repo. |
@alexhultman so far @trevnorris has responded with a lot of technical detail on why MakeCallback behaves the way it does. As for the async queue, I suspect using a circular buffer structure with zero (well, O(1) amortized) allocations could improve the speed of draining - this is what bluebird does (see the sketch below). Currently, using V8's microtask queue is very problematic for the reason Trevor mentioned. You want to reduce the overhead? Put together benchmarks and patches.
That would go a huge way, and please - enter this discussion in an honest attempt to have a dialogue with the people trying to collaborate with you. No one is shoving you into boxes or not taking your feedback seriously. |
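For reference, a minimal sketch of the ring-buffer idea (bluebird's actual implementation is JS and differs; this just shows the O(1)-amortized structure):

```cpp
#include <cstddef>
#include <vector>

// Minimal ring buffer: O(1) amortized push/shift, no per-item allocation.
template <typename T>
class RingQueue {
 public:
  RingQueue() : buf_(16), head_(0), size_(0) {}

  void push(const T& v) {
    if (size_ == buf_.size()) grow();
    buf_[(head_ + size_) % buf_.size()] = v;
    size_++;
  }

  T shift() {  // caller must check !empty() first
    T v = buf_[head_];
    head_ = (head_ + 1) % buf_.size();
    size_--;
    return v;
  }

  bool empty() const { return size_ == 0; }

 private:
  void grow() {  // double capacity, copy in logical order
    std::vector<T> bigger(buf_.size() * 2);
    for (size_t i = 0; i < size_; i++)
      bigger[i] = buf_[(head_ + i) % buf_.size()];
    buf_.swap(bigger);
    head_ = 0;
  }

  std::vector<T> buf_;
  size_t head_, size_;
};
```

Compared to splicing an array from the front (as the current JS implementation was described above), shift() here is constant time.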
@alexhultman First, it'd be a lot easier to communicate w/ you, and to care about what you're saying, if the entire issue were just posed as "I'm seeing a X ns/op discrepancy between using node::MakeCallback() and fn->Call()". Second, you haven't put any effort into actually diagnosing the problem you're reporting. Instead of using demeaning remarks it would be helpful to provide something like a runnable benchmark. Third, yes, I found one optimization for one use-case of MakeCallback(). Fourth, you're ignoring the fact that the shortcuts you're taking make your code not "node". Meaning, it doesn't obey the documented process.nextTick() semantics. I've attempted to be helpful. I have three posts with ample amount of code changes and benchmarks; with the hope they would spur productive discussion as to what else we can do. Below are the disassembled function calls that I've walked through, highlighting what additional steps are taken to execute the function in a minimal case: Disassembly of user facing `node::MakeCallback()`
Started to do this in the hope we could find some improvements, but at this point I'm just going to step back. When you've come back with some actual effort to find how things could improve I'll engage in further discussion with you.
I'd love to see that. Have a link?
Ah, mine only does 120k/sec. Probably because we have a different number of threads resolving requests. Unlike node and all, that only uses one thread. :P |
The queue structure could help. But overall the overhead would still be negligible. We have extra overhead of needing to check for things like whether more microtasks have been queued during the execution of a microtask. |
Yes, that makes sense. Promises can push a thousand things to the queue, and it needs to be zero allocations so it's almost as fast as synchronous calls and the performance loss is minimal - so a fast promise implementation has to use a better data structure. |
Anyways, you now have a report which you have validated yourself, so you know where you can improve. If you can get those 330ns -> 190ns -> 140ns fixes committed then that would definitely be a nice initial step. Clearly this is piranha water I don't want to stay in for longer than necessary. Very toxic environment indeed. |
We need numbers! Tangible, actionable pieces of information that we can do something about. The entirety of your contribution to this issue could be paraphrased as "MakeCallback extremely slow and bloated. Fix it!" I provided patches, benchmarks and numbers hoping you'd jump in at some point and do the same.
For the love, where? You state a mythical benchmark exists but haven't posted a single link or line of code. I have exhaustively written benchmarks for 5 variations of a scenario that you say exists, but refuse to directly demonstrate.
I'm only bringing this up b/c I'm afraid that someone in the future will stumble on this thread and think I actually said that. Please, unless you explicitly state you're paraphrasing, only quote things that have actually been said.
Sorry. That was just a friendly poke at the phrasing. Not a jab at your understanding of how NGINX works. Thought that'd be conveyed by how blatant the comment was and by the :P |
Queueing may be freaky fast, but don't forget about execution setup. Each microtask has ~60ns overhead as well. |
Hey, everyone! Just a reminder that we all want the same thing: for Node.js to be as awesome as it can be. In the interest of working together to achieve that common goal, perhaps we can try to avoid loaded words/phrases like "bloated". On the upside, it looks like there's agreement on at least some of the technical path forward. Maybe we can focus on that and try to put the rancor in the rear-view mirror. And if I'm coming across a bit too preachy and just making things worse, please ignore me and I'll probably go away. |
I get a major boost in performance with the added "return": what was ~330 ns/op is now ~190 ns/op. With the second "fix" of removing those three lines below the TODO comment I get ~140 ns/op.
I think the above benchmark results show a very clear picture of MakeCallback having some serious performance problems, just like my very first post said - despite my having been called out multiple times for not having anything technical to back it up, and nice posts like this one:
It's not like I have done this for 8+ months now... |
You have yet to post any code, stack trace, benchmark graph, etc. to technically back up your claim that it was slow. Were there improvements? Yes, but they were only found after I had written a number of benchmarks and test patches to find what could be improved.
This is unfortunate. If I retained animosity for every developer I've had a strong disagreement with I'd be one very lonely developer. |
I don't have anything to back up my claims. Everything I do is based on feelings and imagination. Those numbers I posted, those are all made up. I have no idea what I'm doing. I just woke up and thought it would be fun to imagine that MakeCallback was slow. What a coincidence! Who would have guessed there actually were performance problems with MakeCallback. Those are some nice odds right there; I should probably go buy a lottery ticket straight away. Since I clearly have no idea what I'm doing, it must have all been thanks to your infinite wisdom of technicality and benchmarking grandness. Thanks, o great one, for we shall no longer fumble in the dark. Clearly nothing of what I have to say means anything since I obviously only imagine numbers and guess what to do next. Maybe this whole thing was a cover-up by the government? Maybe this all was just a big hallucination, oh wow, who knows? Thanks for your infinite imperfection and your guidance in this hard, hard time. I have so much to learn from thee, grand grand one. Btw, everything I do is open source and available on my GitHub, including the benchmarks. |
@alexhultman I'm sorry, but after ample attempts to be civil your attitude has constantly been hostile towards healthy development. I'll open another issue with the technical points discussed here for further action. |
I'm doing some optimizations to an addon and I have found node::MakeCallback and the accompanying process.nextTick to be a major bottleneck for calling into JS-land. I have looked through the code all the way down to process._nextTickCallback, or whatever it was called, and it is very obvious that the whole idea of "register this function for next tick" is very inefficiently implemented in Node.js.
I have found that if I call from C++ into JS using the pure V8 function call I get almost 2x the throughput, but this brings the problem of not correctly firing the functions enqueued via process.nextTick. So I'm doing a workaround where I only call MakeCallback on the last function call of that iteration (or at least that is somewhat close to what I'm trying to do), and this properly fires the enqueued functions.
It would be nice to have better control over process.nextTick and when to fire it from native land, since the current code is like some kind of "solve-everything-with-dynamic-checks-for-every-possible-scenario" solution. The best idea I have come up with is to MakeCallback into a noop function just to get the process.nextTick functions to trigger when I'm done with my iteration.
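The workaround described above, sketched (the noop function handle is assumed to be an empty JS function set up elsewhere):

```cpp
#include <node.h>
#include <v8.h>

using namespace v8;

// Sketch: make the per-event calls with plain fn->Call(), then route one
// MakeCallback through a noop JS function so the nextTickQueue is drained
// once per iteration instead of once per event.
void HandleIteration(Isolate* isolate, Local<Object> recv,
                     Local<Function> fn, Local<Function> noop,
                     int readyCount) {
  HandleScope scope(isolate);
  for (int i = 0; i < readyCount; i++)
    fn->Call(recv, 0, nullptr);  // pure V8 calls: no tick-queue drain
  node::MakeCallback(isolate, recv, noop, 0, nullptr);  // flush ticks once
}
```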
If addons got to include env.h they could use TickInfo to check whether there are any enqueued functions and then call env->get_tick_callback()->Call(yada, yada... to take care of that more efficiently. Or you could just add some new functions to node.h so that addons can check the current number of functions to call, plus a function to actually call them.
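What such an addition to node.h might look like - entirely hypothetical; none of these functions exist:

```cpp
// Hypothetical additions to node.h -- none of these exist today.
namespace node {

// Returns the number of callbacks currently queued via process.nextTick().
size_t PendingTickCount(v8::Isolate* isolate);

// Runs the queued nextTick callbacks; the addon decides when, e.g. once
// per loop iteration from a uv_check_t callback.
void RunTickQueue(v8::Isolate* isolate);

}  // namespace node
```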
Or, you could just remove the entire thing and only call the enqueued functions from uv_check_t, so that you only do it once per iteration and not once for every single function call.