spawn() is not asynchronous, blocks event loop for 2-3 seconds #14917

Closed · jorangreef opened this issue on Aug 18, 2017 · 27 comments · 9 participants

jorangreef (Contributor) commented Aug 18, 2017

  • Version: v8.2.1
  • Platform: Linux ronomon.com 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • Subsystem: child_process

spawn() is not asynchronous while launching a child process.

I think this is a known issue but it took a few months to track down.

Our event loop was blocking frequently:

2017-08-18T12:18:03.935Z INFO Event loop blocked for 258ms.
2017-08-18T12:18:17.200Z INFO Event loop blocked for 223ms.
2017-08-18T12:18:45.417Z INFO Event loop blocked for 288ms.
2017-08-18T12:19:11.981Z INFO Event loop blocked for 296ms.
2017-08-18T12:19:39.142Z INFO Event loop blocked for 282ms.
2017-08-18T12:19:43.750Z INFO Event loop blocked for 329ms.
2017-08-18T12:19:48.655Z INFO Event loop blocked for 247ms.
2017-08-18T12:19:55.191Z INFO Event loop blocked for 266ms.
2017-08-18T12:20:52.000Z INFO Event loop blocked for 269ms.
2017-08-18T12:21:18.157Z INFO Event loop blocked for 294ms.
2017-08-18T12:21:20.388Z INFO Event loop blocked for 272ms.
2017-08-18T12:21:31.557Z INFO Event loop blocked for 276ms.
2017-08-18T12:21:32.529Z INFO Event loop blocked for 320ms.
2017-08-18T12:21:36.400Z INFO Event loop blocked for 307ms.

The system has many components and its RSS is around 14 GB. We had been having issues with GC pause times, so we moved everything off-heap and reduced the number of pointers. We may still have GC pause-time issues, but I thought it might also be something else.

In addition, over time, the majority of components were rewritten to run in the threadpool.

I thought it was finally down to deduplication chunking and hashing, but when those were recently moved to the threadpool, I took another look at a spawn routine which was calling out to clamdscan (moving to a socket protocol was the long-term plan; spawning clamdscan was the first draft):

```javascript
var stdout = '';
var stderr = '';
var child = Node.child.spawn('clamdscan', ['-']);
child.stdin.on('error', function(error) {}); // Fix ECONNRESET bug with Node.
child.stdin.write(buffer);
child.stdin.end();
child.stdout.on('data', function(data) { stdout += data; });
child.stderr.on('data', function(data) { stderr += data; });
child.on('error', function(error) {
  end(undefined, false, '', '');
});
child.on('exit', function(code) {
  end(undefined, code === 1, stdout, stderr);
});
```

I came across nodejs/node-v0.x-archive#9250 today and https://github.com/davepacheco/node-spawn-async seems to explain it in terms of the large heap or RSS size:

> While workloads that make excessive use of fork(2) are hardly high-performance to begin with, the behavior of blocking the main thread for hundreds of milliseconds each time is unnecessarily pathological for otherwise reasonable workloads.

Wrapping the spawn code above in a simple timer shows:

2017-08-18T12:18:03.935Z INFO MDA: Blocking: scan: 271ms
2017-08-18T12:18:17.196Z INFO MDA: Blocking: scan: 263ms
2017-08-18T12:18:45.417Z INFO MDA: Blocking: scan: 296ms
2017-08-18T12:19:11.980Z INFO MDA: Blocking: scan: 293ms
2017-08-18T12:19:39.142Z INFO MDA: Blocking: scan: 306ms
2017-08-18T12:19:43.750Z INFO MDA: Blocking: scan: 293ms
2017-08-18T12:19:48.655Z INFO MDA: Blocking: scan: 276ms
2017-08-18T12:19:55.191Z INFO MDA: Blocking: scan: 299ms
2017-08-18T12:20:52.000Z INFO MDA: Blocking: scan: 277ms
2017-08-18T12:21:18.156Z INFO MDA: Blocking: scan: 294ms
2017-08-18T12:21:20.388Z INFO MDA: Blocking: scan: 292ms
2017-08-18T12:21:31.557Z INFO MDA: Blocking: scan: 290ms
2017-08-18T12:21:32.529Z INFO MDA: Blocking: scan: 318ms
2017-08-18T12:21:36.400Z INFO MDA: Blocking: scan: 315ms

All correlated with the event loop blocks reported above.

I understand that spawn() is not meant to be called every few seconds, but I never expected an average latency of 300ms.

Is there any way to improve spawn along the lines of what Dave Pacheco has done?

Or at least to document that spawn() will block the event loop for around 300ms depending on heap size?

bnoordhuis (Member) commented Aug 18, 2017

Some sleuthing with perf(1) should tell you where the time is spent. I have a few hunches but why guess when you can know for sure, right?

jorangreef (Contributor) commented Aug 21, 2017

> Some sleuthing with perf(1) should tell you where the time is spent. I have a few hunches but why guess when you can know for sure, right?

Thanks Ben.

Here's a script to reproduce. Would you try your perf magic on it to see what you see?

It grows the heap by 256 MB every 2 seconds and then spawns echo hello. You should see pause times increase as RSS grows beyond a few GB.

node --max-old-space-size=32768 spawn.js

```javascript
var child = require('child_process');
var rss = [];

function alloc() {
  var buffer = Buffer.alloc(256 * 1024 * 1024);
  var index = 0;
  var length = buffer.length;
  while (index < length) {
    buffer[index] = index & 255;
    index += 4096;
  }
  rss.push(buffer);
  process.stdout.write('RSS=' + process.memoryUsage().rss + ': ');
}

function spawn() {
  var now = Date.now();
  var stdout = '';
  var stderr = '';
  var instance = child.spawn('echo', ['hello']);
  instance.stdin.on('error', function(error) {});
  instance.stdin.end();
  instance.stdout.on('data', function(data) { stdout += data; });
  instance.stderr.on('data', function(data) { stderr += data; });
  instance.on('error', function(error) { console.error(error); });
  instance.on('exit', function() {});
  process.stdout.write(
    'spawn() blocked event loop for ' + (Date.now() - now) + 'ms\n'
  );
}

setInterval(
  function() {
    alloc();
    spawn();
  },
  2000
);
```

I get the following:

RSS=284983296: spawn() blocked event loop for 25ms
RSS=553795584: spawn() blocked event loop for 14ms
RSS=822267904: spawn() blocked event loop for 19ms
RSS=1091592192: spawn() blocked event loop for 22ms
RSS=1359175680: spawn() blocked event loop for 25ms
RSS=1626419200: spawn() blocked event loop for 32ms
RSS=1895088128: spawn() blocked event loop for 35ms
RSS=2163490816: spawn() blocked event loop for 36ms
RSS=2432102400: spawn() blocked event loop for 36ms
RSS=2700414976: spawn() blocked event loop for 42ms
RSS=2968948736: spawn() blocked event loop for 47ms
RSS=3237343232: spawn() blocked event loop for 48ms
RSS=3505799168: spawn() blocked event loop for 49ms
RSS=3774234624: spawn() blocked event loop for 58ms
RSS=4042723328: spawn() blocked event loop for 60ms
RSS=4311146496: spawn() blocked event loop for 59ms
RSS=4579643392: spawn() blocked event loop for 67ms
RSS=4848050176: spawn() blocked event loop for 70ms
RSS=5116522496: spawn() blocked event loop for 71ms
RSS=5385150464: spawn() blocked event loop for 75ms
RSS=5653585920: spawn() blocked event loop for 80ms
RSS=5922144256: spawn() blocked event loop for 79ms
RSS=6190612480: spawn() blocked event loop for 82ms
RSS=6459039744: spawn() blocked event loop for 90ms
RSS=6727450624: spawn() blocked event loop for 92ms
RSS=6995865600: spawn() blocked event loop for 96ms
RSS=7264325632: spawn() blocked event loop for 97ms
RSS=7532867584: spawn() blocked event loop for 102ms
RSS=7801253888: spawn() blocked event loop for 108ms
RSS=8069656576: spawn() blocked event loop for 110ms
RSS=8338120704: spawn() blocked event loop for 108ms
RSS=8606556160: spawn() blocked event loop for 116ms
RSS=8874950656: spawn() blocked event loop for 118ms
RSS=9143349248: spawn() blocked event loop for 120ms
RSS=9411850240: spawn() blocked event loop for 123ms
RSS=9680269312: spawn() blocked event loop for 124ms
RSS=9948717056: spawn() blocked event loop for 130ms
RSS=10217205760: spawn() blocked event loop for 131ms
RSS=10485600256: spawn() blocked event loop for 139ms
RSS=10754035712: spawn() blocked event loop for 143ms
RSS=11022508032: spawn() blocked event loop for 145ms
RSS=11290894336: spawn() blocked event loop for 147ms
RSS=11559350272: spawn() blocked event loop for 150ms
RSS=11827826688: spawn() blocked event loop for 155ms
RSS=12096196608: spawn() blocked event loop for 161ms
RSS=12364754944: spawn() blocked event loop for 158ms
RSS=12633448448: spawn() blocked event loop for 165ms
RSS=12901801984: spawn() blocked event loop for 169ms
RSS=13170253824: spawn() blocked event loop for 172ms
RSS=13438726144: spawn() blocked event loop for 172ms
RSS=13707096064: spawn() blocked event loop for 175ms
RSS=13975621632: spawn() blocked event loop for 178ms
RSS=14243762176: spawn() blocked event loop for 188ms
RSS=14512508928: spawn() blocked event loop for 188ms
RSS=14780751872: spawn() blocked event loop for 193ms
RSS=15049326592: spawn() blocked event loop for 193ms
RSS=15317659648: spawn() blocked event loop for 197ms
RSS=15586279424: spawn() blocked event loop for 199ms
RSS=15854604288: spawn() blocked event loop for 202ms
RSS=16121454592: spawn() blocked event loop for 2501ms
RSS=16034512896: spawn() blocked event loop for 3156ms
RSS=16088698880: spawn() blocked event loop for 684ms
RSS=16056156160: spawn() blocked event loop for 1745ms
RSS=16123555840: spawn() blocked event loop for 494ms
RSS=16005734400: spawn() blocked event loop for 547ms
RSS=16055742464: spawn() blocked event loop for 1253ms
RSS=16117850112: spawn() blocked event loop for 632ms
RSS=16047591424: spawn() blocked event loop for 1752ms
RSS=16092176384: spawn() blocked event loop for 2044ms
RSS=16003870720: spawn() blocked event loop for 422ms
RSS=16033472512: spawn() blocked event loop for 365ms
RSS=15988920320: spawn() blocked event loop for 716ms
RSS=15989665792: spawn() blocked event loop for 463ms
RSS=16040112128: spawn() blocked event loop for 657ms
RSS=16017305600: spawn() blocked event loop for 1441ms
RSS=15969497088: spawn() blocked event loop for 519ms
RSS=15969656832: spawn() blocked event loop for 828ms
RSS=16012042240: spawn() blocked event loop for 281ms

I think this is the explanation, but perhaps there's a better one:

> forking a Node server with a large heap just to exec some other program causes a significant amount of additional swap to be used

jorangreef (Contributor) commented Aug 28, 2017

We removed the offending spawn() call last week, replacing it with a write over a Unix domain socket.

The problem is down to http://docs.libuv.org/en/v1.x/process.html#c.uv_spawn being run in the main event loop thread, and not in the thread pool. Forking a new process causes the RSS of the parent process to be copied. As RSS increases, spawn time increases.

Anyone familiar with uv_spawn() interested in fixing this?

In the meantime, I'm willing to open a PR to add a warning to the child_process docs:

> spawn() is not truly asynchronous and will block the event loop while launching a child process. The larger the RSS of the parent process, the longer spawn() will block. For resident set sizes of 200 MB, spawn() will block for +/-20ms. For resident set sizes of 20 GB, spawn() will block for between 200ms and 2 seconds. See #14917 for more information.

bnoordhuis (Member) commented Sep 18, 2017

> The problem is down to http://docs.libuv.org/en/v1.x/process.html#c.uv_spawn being run in the main event loop thread, and not in the thread pool. Forking a new process causes the RSS of the parent process to be copied. As RSS increases, spawn time increases.

That depends. Linux preforks the working set (to reduce page faults afterwards) but platforms that do simple copy-on-write profit from synchronicity because they only copy a few pages of stack.

In fact, such platforms would be penalized when thread A forks just when thread B is collecting garbage and touching all pages.

jorangreef (Contributor) commented Sep 19, 2017

Blocking the event loop for upwards of 2-3 seconds whenever spawn() is called for RSS of 16 GB is something I think needs to be addressed. Node should not be calling uv_spawn() from the main event loop thread in the first place. It's not a breaking change for Node, since it does not affect the interface in any way. It might be a pain to fix but it should be fixed properly.

There must be plenty of people affected by this issue who are yet to realize where all their throughput is disappearing to.

> Linux preforks the working set (to reduce page faults afterwards) but platforms that do simple copy-on-write profit from synchronicity because they only copy a few pages of stack.

It's good to know the problem is limited to Linux but Linux is a major platform for Node. If anyone is running Node servers with large RSS they are probably doing that on Linux. Surely calling uv_spawn from another thread will solve the problem for Linux?

> In fact, such platforms would be penalized when thread A forks just when thread B is collecting garbage and touching all pages.

I'm not sure about that. GC is free to touch all pages in any event, regardless of whether fork is called from the main event loop thread or from a background thread.

bnoordhuis (Member) commented Sep 19, 2017

> Blocking the event loop for upwards of 2-3 seconds whenever spawn() is called for RSS of 16 GB is something I think needs to be addressed.

You are the first to bring it up, to my knowledge, so it can't be that prevalent. What's more, I don't think your particular use case would benefit from threading so much as it would from marking everything but a few pages madvise(MADV_DONTFORK). That's something you can test by tweaking libuv.

That said, have you verified with perf that the fork() or clone() system call is the actual cost center?

> GC is free to touch all pages in any event, regardless of whether fork is called from the main event loop thread or from a background thread.

I think you misunderstand. The fact that uv_spawn() does not return until the fork calls execve() means there is never a concurrent compacting GC (except concurrent marking, but that's limited in scope and frequency.)

No GC is good because otherwise the parent incurs a page fault for every page it mutates until the execve() call. That's easily tens or even hundreds of thousands of page faults.
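The order of magnitude checks out with simple arithmetic (assuming 4 KB pages; page size is architecture-dependent and this is illustrative only):

```javascript
// Back-of-envelope: one potential copy-on-write fault per resident page
// the parent could mutate before the child's execve(). Assumes 4 KB pages.
var PAGE_SIZE = 4096;

function residentPages(rssBytes) {
  return Math.ceil(rssBytes / PAGE_SIZE);
}
```

A 200 MB RSS is about 51,200 resident pages, and a 16 GB RSS is about 4.2 million, consistent with "tens or even hundreds of thousands" of faults and beyond.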

It's not a hypothetical. uv_spawn() initially didn't wait for the fork to complete. When I changed that, it had the pleasant side effect of improving a number of benchmarks, sometimes dramatically so.

jorangreef (Contributor) commented Sep 19, 2017

> You are the first to bring it up, to my knowledge, so it can't be that prevalent.

It took a few months to track down. My point was actually that most people wouldn't expect Node's async spawn() to block the event loop for 2-3 seconds per call. It's not productive to argue that 16 GB RSS is "not normal": if it's not "normal" yet, it will be.

I'm not in fact the first to bring it up. @davepacheco ran into this issue at Joyent in production (https://github.com/davepacheco/node-spawn-async). Notwithstanding, whether it's prevalent is irrelevant. It should be fixed without regard to whether people happen to know it's broken or not.
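The shape of the node-spawn-async workaround is to fork a small helper process early, while the parent's RSS is still tiny, and route every later spawn through it over an IPC channel. A sketch of that shape (the names and the helper's message protocol here are illustrative, not node-spawn-async's actual API):

```javascript
// `helper` is any IPC channel with send() and 'message' events, typically
// child_process.fork('helper.js') created at startup while RSS is small.
// A hypothetical helper.js would answer each { id, cmd, args } message with
// { id, error, stdout, stderr } after running the command itself.
function SpawnServer(helper) {
  this.helper = helper;
  this.pending = {};
  this.nextId = 1;
  var self = this;
  helper.on('message', function(res) {
    var callback = self.pending[res.id];
    delete self.pending[res.id];
    if (callback) callback(res.error, res.stdout, res.stderr);
  });
}

// Runs `cmd` in the helper: the parent's event loop only pays for an IPC
// message, not for fork()ing its own large address space.
SpawnServer.prototype.exec = function(cmd, args, callback) {
  var id = this.nextId++;
  this.pending[id] = callback;
  this.helper.send({ id: id, cmd: cmd, args: args });
};
```

Usage would be `var server = new SpawnServer(child_process.fork('helper.js'))` once at startup, then `server.exec('clamdscan', ['-'], ...)` from anywhere.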

> What's more, I don't think your particular use case would benefit from threading so much as it would from marking everything but a few pages madvise(MADV_DONTFORK)

Sure, that would also work, but we just stopped using Node's spawn() altogether now.

> That said, have you verified with perf that the fork() or clone() system call is the actual cost center?

See the script to reproduce here #14917 (comment). You are welcome to verify with perf yourself. You might spot something I've missed.

> I think you misunderstand. The fact that uv_spawn() does not return until the fork calls execve() means there is never a concurrent compacting GC (except concurrent marking, but that's limited in scope and frequency.)
>
> It's not a hypothetical. uv_spawn() initially didn't wait for the fork to complete. When I changed that, it had the pleasant side effect of improving a number of benchmarks, sometimes dramatically so.

Thanks for expanding. The current design is very tightly coupled to the GC.

At present, if I understand correctly, the story is that uv_spawn() was moved into the event loop to block the event loop intentionally and thereby keep the compacting GC from running while a fork is in progress?

Why not just keep the event loop running, run the fork in another thread and use another method to keep the compacting GC from running while a fork is in progress? Just pause the compacting GC?

bnoordhuis (Member) commented Sep 19, 2017

> It's not productive to argue that 16 GB RSS is "not normal" - if it's not "normal" it will be.

My point is that the number of bug reports is proportional to the number of affected users. We haven't received many bug reports so there likely aren't many affected users.

> Notwithstanding, whether it's prevalent is irrelevant. It should be fixed without regard to whether people happen to know it's broken or not.

Issues are prioritized based on their impact. This is not a clear "always wrong" bug, it's suboptimal behavior under specific conditions.

In fact, it's not even clear if there is a way of addressing this without regressing the common case.

> See the script to reproduce here #14917 (comment). You are welcome to verify with perf yourself.

The reason I ask you to is that there may be something particular to your setup (kernel, configuration, something else.) perf should be able to capture that.

At present, if I understand correctly, the story is that uv_spawn() was moved into the event loop to block the event loop intentionally and thereby keep the compacting GC from running while a fork is in progress?

No, hence "side effect" (although one I half expected when I made the change.)

Why not just keep the event loop running, run the fork in another thread and use another method to keep the compacting GC from running while a fork is in progress? Just pause the compacting GC?

You can't. It's like saying "why don't you stop breathing while you're under water?"

Member

bnoordhuis commented Sep 21, 2017

@jorangreef I'm still interested to see the output of perf report --stdio.

@jorangreef jorangreef changed the title from spawn() is not asynchronous to spawn() is not asynchronous, blocks event loop for 2-3 seconds Nov 4, 2017

nathansobo commented Nov 10, 2017

I don't understand the details of process spawning as well as I'd like, but I thought it might be helpful to report that the synchronous nature of child_process.spawn has been a problem for us in Atom. We were making pretty heavy use of spawn to run Git commands in the background, and eventually tracked it down as the source of intermittent pauses that were harming the responsiveness of the application.

As a solution, we moved all Git spawning to a separate Electron process. Since that eats way more memory than we'd like, I'm tinkering with a native spawn server that will serve a similar role to the server in node-spawn-async, but with a lighter memory footprint.

Before proceeding though, I thought I'd check my understanding.

No GC is good because otherwise the parent incurs a page fault for every page it mutates until the execve() call. That's easily tens or even hundreds of thousands of page faults.

It's not a hypothetical. uv_spawn() initially didn't wait for the fork to complete. When I changed that, it had the pleasant side effect of improving a number of benchmarks, sometimes dramatically so.

This makes sense to me, but I'm not clear on the consequences of excessive page faults. If we built a native module that performed the spawn off the main thread would we be just treading into the same issues you fixed by moving it there to begin with? Would this still lead to poor responsiveness? Basically, is direct spawning from a Node process always a bad idea? That's my intuition based on this thread, but I wanted to check. How do other garbage-collected languages deal with spawning? Do they all suffer from this same pitfall? Are there platform-specific tools for spawning that could avoid forking?

Apologies if these are somewhat naive questions. The answer may be that I just need to go spend more time researching, but I thought other people might have similar questions.

Member

bnoordhuis commented Nov 11, 2017

@nathansobo No problem, happy to answer. The thread and later the fork still share memory with the parent until execve so yes, you'd have the same issue.

A Linux-only solution might be to call clone(2) with the right set of flags to create a thread or fork with unshared memory that spawns the new process, although it's probably not exactly trivial to combine with the other setup that uv_spawn() does.

If you have time to test something: see this comment in libuv's sources; I'd be interested to know if disabling the pipe impacts Electron positively or negatively.

tim-kos commented Jan 11, 2018

My point is that the number of bug reports is proportional to the number of affected users. We haven't received many bug reports so there likely aren't many affected users.

Disagreed here. We have had unexplainable event loop blocks for years. The reason why we didn't file a bug report was that:

a) The problem was not big enough for us to worry about (RSS sizes not high enough).
b) With a blocked event loop, one first analyzes one's own code inside and out before assuming it might be an issue in Node itself.

Point is: For most other problems I'd agree with your thesis. But this one is different.

I was really happy to find this issue. 👍
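For anyone else chasing "unexplainable event loop blocks", the kind of monitor that produced the log lines at the top of this issue can be as simple as a timer that measures its own drift. A minimal sketch; the interval and threshold values are arbitrary, and the self-inflicted busy-wait exists only to demonstrate detection:

```javascript
// Detect event-loop stalls by measuring how late a periodic timer fires.
const INTERVAL_MS = 50;   // illustrative values
const THRESHOLD_MS = 40;
const blockedEvents = []; // lag values (ms) that exceeded the threshold

let last = process.hrtime.bigint();
const monitor = setInterval(() => {
  const now = process.hrtime.bigint();
  const lag = Number(now - last) / 1e6 - INTERVAL_MS;
  last = now;
  if (lag > THRESHOLD_MS) {
    blockedEvents.push(lag);
    console.log(`Event loop blocked for ${Math.round(lag)}ms.`);
  }
}, INTERVAL_MS);

// Demo: block the loop once for ~150ms so the monitor fires.
setTimeout(() => {
  const end = Date.now() + 150;
  while (Date.now() < end) {} // deliberate busy-wait
}, 120);

setTimeout(() => clearInterval(monitor), 400);
```

Such a monitor reports *that* the loop stalled, not *why*; pinning the blame on spawn() still takes the kind of perf tracing discussed above.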

Member

bnoordhuis commented Jan 27, 2018

I had some time to look into MADV_DONTFORK today. Keep in mind it's about the most artificial benchmark possible but the results are encouraging.

With a ~600 MB heap, the test completes 10x faster when memory is marked with MADV_DONTFORK: 4.7s vs 57.8s with < 2% variance over 10 runs. The bigger the heap, the bigger the difference.

Patch:

diff --git a/deps/v8/src/base/platform/platform-posix.cc b/deps/v8/src/base/platform/platform-posix.cc
index b873197d3b..6d5fa1b6f6 100644
--- a/deps/v8/src/base/platform/platform-posix.cc
+++ b/deps/v8/src/base/platform/platform-posix.cc
@@ -136,6 +136,9 @@ void* Allocate(void* address, size_t size, OS::MemoryPermission access) {
   void* result =
       mmap(address, actual_size, prot, flags, kMmapFd, kMmapFdOffset);
   if (result == MAP_FAILED) return nullptr;
+#if defined(V8_OS_LINUX)
+  madvise(address, actual_size, MADV_DONTFORK);
+#endif
   return result;
 }

I filed https://bugs.chromium.org/p/v8/issues/detail?id=7381 to discuss.

Test:

'use strict';
const { spawnSync } = require('child_process');
for (global.a = []; a.length < 1e7; a.push({ a }));
const start = process.hrtime();
for (let i = 0; i < 2500; ++i) spawnSync('./a.out', {stdio:'inherit'});
const [sec, nsec] = process.hrtime(start);
console.log(`${sec}.${nsec}`);

Binary:

// cc -static -nostdlib -s
#include <sys/syscall.h>
void __attribute__((noreturn)) _start(void) {
  for (;;) asm volatile("syscall" :: "a" (SYS_exit_group), "D" (0));
}

And for a 'told you so' moment: I also checked whether removing the wait-for-fork logic made a difference. The answer is 'no', it's in fact 5-10% slower and highly variable for the reasons I've mentioned earlier.

diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 9842710d0e..775e2f7188 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -418,16 +418,12 @@ int uv_spawn(uv_loop_t* loop,
   /* fork is marked __WATCHOS_PROHIBITED __TVOS_PROHIBITED. */
   return -ENOSYS;
 #else
-  int signal_pipe[2] = { -1, -1 };
   int pipes_storage[8][2];
   int (*pipes)[2];
   int stdio_count;
-  ssize_t r;
   pid_t pid;
   int err;
-  int exec_errorno;
   int i;
-  int status;
 
   assert(options->file != NULL);
   assert(!(options->flags & ~(UV_PROCESS_DETACHED |
@@ -462,30 +458,6 @@ int uv_spawn(uv_loop_t* loop,
       goto error;
   }
 
-  /* This pipe is used by the parent to wait until
-   * the child has called `execve()`. We need this
-   * to avoid the following race condition:
-   *
-   *    if ((pid = fork()) > 0) {
-   *      kill(pid, SIGTERM);
-   *    }
-   *    else if (pid == 0) {
-   *      execve("/bin/cat", argp, envp);
-   *    }
-   *
-   * The parent sends a signal immediately after forking.
-   * Since the child may not have called `execve()` yet,
-   * there is no telling what process receives the signal,
-   * our fork or /bin/cat.
-   *
-   * To avoid ambiguity, we create a pipe with both ends
-   * marked close-on-exec. Then, after the call to `fork()`,
-   * the parent polls the read end until it EOFs or errors with EPIPE.
-   */
-  err = uv__make_pipe(signal_pipe, 0);
-  if (err)
-    goto error;
-
   uv_signal_start(&loop->child_watcher, uv__chld, SIGCHLD);
 
   /* Acquire write lock to prevent opening new fds in worker threads */
@@ -495,42 +467,17 @@ int uv_spawn(uv_loop_t* loop,
   if (pid == -1) {
     err = -errno;
     uv_rwlock_wrunlock(&loop->cloexec_lock);
-    uv__close(signal_pipe[0]);
-    uv__close(signal_pipe[1]);
     goto error;
   }
 
   if (pid == 0) {
-    uv__process_child_init(options, stdio_count, pipes, signal_pipe[1]);
+    uv__process_child_init(options, stdio_count, pipes, -1);
     abort();
   }
 
   /* Release lock in parent process */
   uv_rwlock_wrunlock(&loop->cloexec_lock);
-  uv__close(signal_pipe[1]);
-
   process->status = 0;
-  exec_errorno = 0;
-  do
-    r = read(signal_pipe[0], &exec_errorno, sizeof(exec_errorno));
-  while (r == -1 && errno == EINTR);
-
-  if (r == 0)
-    ; /* okay, EOF */
-  else if (r == sizeof(exec_errorno)) {
-    do
-      err = waitpid(pid, &status, 0); /* okay, read errorno */
-    while (err == -1 && errno == EINTR);
-    assert(err == pid);
-  } else if (r == -1 && errno == EPIPE) {
-    do
-      err = waitpid(pid, &status, 0); /* okay, got EPIPE */
-    while (err == -1 && errno == EINTR);
-    assert(err == pid);
-  } else
-    abort();
-
-  uv__close_nocheckstdio(signal_pipe[0]);
 
   for (i = 0; i < options->stdio_count; i++) {
     err = uv__process_open_stream(options->stdio + i, pipes[i], i == 0);
@@ -543,11 +490,8 @@ int uv_spawn(uv_loop_t* loop,
     goto error;
   }
 
-  /* Only activate this handle if exec() happened successfully */
-  if (exec_errorno == 0) {
-    QUEUE_INSERT_TAIL(&loop->process_handles, &process->queue);
-    uv__handle_start(process);
-  }
+  QUEUE_INSERT_TAIL(&loop->process_handles, &process->queue);
+  uv__handle_start(process);
 
   process->pid = pid;
   process->exit_cb = options->exit_cb;
@@ -555,7 +499,7 @@ int uv_spawn(uv_loop_t* loop,
   if (pipes != pipes_storage)
     uv__free(pipes);
 
-  return exec_errorno;
+  return 0;
 
 error:
   if (pipes != NULL) {
Contributor

jorangreef commented Jan 29, 2018

With a ~600 MB heap, the test completes 10x faster when memory is marked with MADV_DONTFORK: 4.7s vs 57.8s with < 2% variance over 10 runs. The bigger the heap, the bigger the difference.

Thanks Ben.

With your patch, how long is the event loop blocked when the RSS is 16 GB, using the test I provided? Fork may still block significantly just iterating across so many pages, even if they are marked with MADV_DONTFORK.

And for a 'told you so' moment: I also checked whether removing the wait-for-fork logic made a difference. The answer is 'no', it's in fact 5-10% slower and highly variable for the reasons I've mentioned earlier.

Thanks also for filing the V8 issue.

I think the real issue is really that fork is still being called from the main thread.

Fork should be called asynchronously from the libuv threadpool (and any GC issues which arise can be resolved differently).

No one wants to go around making sure every buffer is marked MADV_DONTFORK, or if that's going to be a thing, then the Buffer api will need to add a madvise() method. All because fork has issues on Linux.

It's punting the issue back to userland, when fork could just be moved off the main thread into the threadpool.

Contributor

jorangreef commented Jan 29, 2018

I also noticed that your test case is actually benchmarking spawnSync()?

Just to confirm that we are on the same page, that this issue is about spawn() not being asynchronous?

Member

bnoordhuis commented Jan 29, 2018

Perhaps I didn't explain it clearly but the point is that fork() is inexpensive when it doesn't have to copy huge amounts of memory. Overhead inside node.js starts to dominate once that's eliminated. Moving things into the threadpool makes no sense, that just adds extra latency and variance.

I also noticed that your test case is actually benchmarking spawnSync()?

Yes, but it doesn't matter. They're pretty similar, implementation-wise.

No one wants to go around making sure every buffer is marked MADV_DONTFORK, or if that's going to be a thing, then the Buffer api will need to add a madvise() method.

You'll need to explain to me how you reached that conclusion...

Contributor

jorangreef commented Jan 29, 2018

Perhaps I didn't explain it clearly but the point is that fork() is inexpensive when it doesn't have to copy huge amounts of memory.

Your MADV_DONTFORK benchmark only shows a 10x speedup. You need to show at least a 100x speedup if you want to cut the 2-3 second pause times for 16 GB RSS down to 20ms. Even then, 20ms pause times are not acceptable.

Moving things into the threadpool makes no sense, that just adds extra latency and variance.

Sure, we both know it makes sense to move a 20ms, 200ms or 2000ms spawn() call into the threadpool. As an example, take another look at crypto-async to see where it makes sense to start crossing the threadpool boundary, even for crypto operations which are not at all the same magnitude as spawn(). Libuv's threadpool overhead is actually not material at all compared to the magnitude of a 20ms, 200ms or 2000ms spawn() call.

Granted, Node could still do much better on the threadpool front (increasing the size of the threadpool to avoid head of line blocking, separating IO tasks from CPU tasks to optimize throughput and latency independently) but those are separate issues and no reason not to move spawn() into the threadpool.

Then again, moving spawn() into the threadpool will take some work and I think you're really wanting to see if we can avoid that. Do you think you can reduce the cost of spawn() by 1000x down to 2ms for RSS of 16 GB?

Yes, but it doesn't matter. They're pretty similar, implementation-wise.

It matters because Node advertises spawnSync() as synchronous and spawn() as asynchronous. No one expects spawn() to block the event loop for 20ms, 200ms or 2000ms per call. The comments by others in this thread make that clear. If calling spawn() is not guaranteed to give pause times of 0ms, we should put a huge red warning in the docs as suggested earlier in the thread.

No one using Node should have to be writing things like node-spawn-async and spawn-server. Node is supposed to be a proponent of non-blocking IO.

No one wants to go around making sure every buffer is marked MADV_DONTFORK, or if that's going to be a thing, then the Buffer api will need to add a madvise() method.

You'll need to explain to me how you reached that conclusion...

Sure, the script I provided specifically benchmarks RSS and not just the V8 heap as yours does since most people with RSS of 16 GB are probably already keeping everything off the heap to minimize GC.

Are you thinking that Node should mark all Buffers as MADV_DONTFORK by default? That would be great but probably a breaking change. Furthermore, users would need a way to mark buffers allocated by native add ons. All this would then need to be documented somewhere... MADV_DONTFORK is a slippery slope.

Moving spawn() off the event loop is the best way to solve this issue.

Member

bnoordhuis commented Jan 29, 2018

You need to show at least a 100x speedup if you want to cut the 2-3 second pause times for 16 GB RSS down to 20ms.

That's the "the bigger the heap, the bigger the difference" I mentioned. Whether it's 600 MB or 16 GB doesn't matter because it isn't copied. Think O(1) vs. O(n).

Are you thinking that Node should mark all Buffers as MADV_DONTFORK by default?

Yes, and no, that's not a breaking change.

Yes, add-ons would need to do that too if they allocate a lot of memory but that's a good change in and of itself.

Forking from a thread won't help because of page faults; even if it did, it'd still waste a lot of CPU cycles, only less visible.

If you still disagree, the onus is on you to prove it with code and numbers.

Contributor

jorangreef commented Jan 29, 2018

That's the "the bigger the heap, the bigger the difference" I mentioned. Whether it's 600 MB or 16 GB doesn't matter because it isn't copied. Think O(1) vs. O(n).

I am probably wrong, but I would speculate that MADV_DONTFORK would not remove the need for fork to iterate across the pages to decide whether they should be copied in the first place, unless the MADV_DONTFORK flags are combined for larger regions to amortize the branching cost. This branching cost might turn out to be significant, even as much as the copy cost, which is why a benchmark for RSS of 16 GB would be interesting. I have a suspicion it might not be O(1).

To be clear, Node and V8 doing MADV_DONTFORK by default would be great. I just don't think that's going to go far enough to address this issue.


@GrayedFox

GrayedFox Mar 21, 2018

Hi peeps, very interesting discussion indeed! I've been racking my brains and searching through dump files for a potential memory leak but I think this is most likely what's slowing my scraper down so much over time.

I'm relying on spawn() to interact with a locally hosted cli (in my case litecoin-cli) and the spawn calls are part of a scraping project. Using hrtime to painstakingly debug and measure the performance of the different parts of the program has led me to this thread after googling around for reasons spawn() could be delayed.

Full disclosure: I'm certainly not a node ninja just yet and am self taught, so apologies if the answer to this question seems obvious.

If I call spawn() from inside a cluster worker, since that worker is already its own process, how will this affect code execution and blocking in the other workers and/or the master process? The flow logic of the program is:

main.js > spawn os.cpus().length workers using cluster.fork() > workers scrape in parallel using spawn() to hit the local cli > get needed information, send back to master > master delegates next block to scrape, if any, and handles writing to the exported csv file

"Our event loop was blocking frequently"

When I read stuff like that I get a bit nervous - would be great to get a better understanding of how spawn() is blocking if it's called from a cluster worker, which is itself a child process -- I'm hoping that won't actually block anything in the master?

Keep up the good work!


@gireeshpunathil

gireeshpunathil Mar 21, 2018

Member

@GrayedFox - the problem context does not involve any cluster; it concerns the site where a child process is forked (though the child itself can be a cluster member). The issue is that there is a series of activities the parent has to perform before it can let the child run on its own and move on to its own next instructions. These actions have to be synchronous, so they block the uv loop, with latency that is a function of the size of the spawning parent.


@GrayedFox

GrayedFox Mar 21, 2018

@gireeshpunathil thanks for the response, however I'm not sure I get what you mean by

Instead, a child process forking site (the child process itself can be a cluster member though).

I get that when a parent forks a child process this takes time, and that the latency of the fork is related to the size of the parent process. What I want to know is: if, say, I have a single master process with 4 workers, and each of these workers individually spawns child processes too (many of them, in fact), will the spawned children 2 levels deep cause blocking in my master proc?

Here's a visualisation:

MASTER
||
cluster.fork(numCPUs)
||
||--Worker 1
||------||
||------spawn() -- Blocks Worker 1, or Worker 1 and Master, or all Workers and Master?
||
||--Worker 2
||------||
||------spawn() -- Blocks Worker 2, or Worker 2 and Master, or all Workers and Master?
||
||--Worker N
...


@gireeshpunathil

gireeshpunathil Mar 21, 2018

Member

no, activities in (fully born) child processes do not bother the parent (or master) process in any manner.

As you already guessed, this is a concern only when a process directly invokes the spawn* family of functions:

$ strace -ftttT ./node -e "require('child_process').exec('ls')"

[pid  4454] 1521646359.077652 execve("/bin/sh", ["/bin/sh", "-c", "ls"], [/* 31 vars */] <unfinished ...>
[pid  4448] 1521646359.078230 <... read resumed> "", 4) = 0 <0.003965>
[pid  4454] 1521646359.078249 <... execve resumed> ) = 0 <0.000543>
...

So in this simple case, we are talking about the <0.000543> seconds being blocked in the loop, as we are busy spawning. (It may be a little more, as there is some additional preparatory work for the parent-child communication channel.)


@gireeshpunathil

gireeshpunathil May 19, 2018

Member

I guess this discussion has run its course with a conclusion:

  • common use cases are unaffected by this
  • corner case can be handled with some design level changes in the libuv
  • PRs welcome!

Closing, thanks everyone for the great insights.


@davisjam

davisjam Jun 9, 2018

Contributor

I think at minimum a documentation PR like #17250 would be a helpful outcome.
See #21234 for a first draft.


davisjam added a commit to davisjam/node that referenced this issue Jun 9, 2018

doc: note synchronous part of child_process.spawn
If an application has a large RSS (GB), jorangreef
observed that child_process.spawn will block the
event loop for ms-s.

This is surprising behavior and merits documentation.

See discussion on nodejs#14917.

jasnell added a commit that referenced this issue Jun 21, 2018

doc: note synchronous part of child_process.spawn
If an application has a large RSS (GB), jorangreef
observed that child_process.spawn will block the
event loop for ms-s.

This is surprising behavior and merits documentation.

Refs: #14917

PR-URL: #21234
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>

targos added a commit that referenced this issue Jun 22, 2018

doc: note synchronous part of child_process.spawn
If an application has a large RSS (GB), jorangreef
observed that child_process.spawn will block the
event loop for ms-s.

This is surprising behavior and merits documentation.

Refs: #14917

PR-URL: #21234
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>