Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Cluster module crashes: race condition between worker.send([message]) and net.Server.listen([port]) #5330

Closed
benoitv-code opened this issue Apr 18, 2013 · 15 comments

Comments

@benoitv-code
Copy link

There is a crash in Node 0.10.4 (Linux) when using the cluster module while sending messages to the workers. It seems to be a race condition leading to an "internalMessage" event being received with an undefined handle.

Context: sending a message from the master process while workers are calling .listen() on a net server.

Code to reproduce the issue

"use strict";

var cluster = require('cluster');
var net = require('net');

if (cluster.isMaster) {
  setInterval(function () {
    for (var id in cluster.workers) {
      cluster.workers[id].send({cmd: 'test'});
    }
  }, 1);

  console.log('forking...');

  for (var i = 0; i < 42; ++i) cluster.fork();
} else {
  var server = net.createServer(function(sock) {});
  console.log('listening...');
  server.listen(4242);
}

Crash message

net.js:1054                                                                                                                                                                                                          
    if (port && handle.getsockname && port != handle.getsockname().port) {                                                                                                                                           
                      ^                                                                                                                                                                                              
TypeError: Cannot read property 'getsockname' of undefined                                                                                                                                                           
    at net.js:1054:23                                                                                                                                                                                                
    at Object.40:1 (cluster.js:587:5)                                                                                                                                                                                
    at handleResponse (cluster.js:171:41)                                                                                                                                                                            
    at respond (cluster.js:192:5)                                                                                                                                                                                    
    at handleMessage (cluster.js:202:5)                                                                                                                                                                              
    at process.EventEmitter.emit (events.js:117:20)                                                                                                                                                                  
    at handleMessage (child_process.js:318:10)                                                                                                                                                                       
    at child_process.js:387:7                                                                                                                                                                                        
    at process.handleConversion.net.Native.got (child_process.js:91:7)                                                                                                                                               
    at process.<anonymous> (child_process.js:386:13)             
@bnoordhuis
Copy link
Member

Confirmed. Interesting, it worked before 9352c19 but that must have been purely by accident because said commit fixes a real bug.

@bnoordhuis
Copy link
Member

Okay, found it. SCM_RIGHTS messages (that is, inter-process messages with file descriptors attached) are not supposed to be merged. If you sendmsg() n messages with file descriptors, the receiver should also recvmsg() n distinct messages. Normal messages however (without file descriptors) can be coalesced at will.

Other operating systems break before SCM_RIGHTS messages but Linux apparently also breaks after such messages. That is, when you send three separate messages like below:

{"cmd":"test"}
{"cmd":"test"}
{"cmd":"NODE_HANDLE","type":"net.Native","msg":{"cmd":"NODE_CLUSTER","ack":1,"seq":41}}

On other operating systems, the first two are coalesced and received as a single message while the third one ends up in a separate message. Linux however happily coalesces all three into a single message. It's legal but it's also very annoying because it means we'll have to go to great pains to figure out what message the file descriptor belongs to. For internal cluster traffic, we can scan for the NODE_HANDLE message but that won't work for user code that sends around handles. Rock, hard place. :-/

@isaacs
Copy link

isaacs commented Apr 23, 2013

@bnoordhuis nice find. This sucks.

Can we get around this by wrapping and unwrapping all messages and adding a seq ID or something?

@bnoordhuis
Copy link
Member

Yes, but that sucks from a performance perspective. You'd have to ack each message in order to avoid messages from building up (and risk having them be coalesced.)

Ideally, we should have separate channels for regular and SCM_RIGHTS messages but that's something for the medium/long term.

@isaacs
Copy link

isaacs commented Apr 23, 2013

@bnoordhuis Yeah, that's a good point. So, it sounds like, in the short term, this is WONTFIX, but we can work something out post-0.12 for 1.0. Am I understanding that right?

@bnoordhuis
Copy link
Member

Yep, I think that's what it boils down to. Do you think I should revert 9352c19 for now? Substituting one bug for another, it's a tough choice.

bnoordhuis added a commit to bnoordhuis/node that referenced this issue May 12, 2013
Commit 9352c19 ("child_process: don't emit same handle twice") trades
one bug for another.

Before said commit, a handle was sometimes delivered with messages it
didn't belong to.

The bug fix introduced another bug that needs some explaining. On UNIX
systems, handles are basically file descriptors that are passed around
with the sendmsg() and recvmsg() system calls, using auxiliary data
(SCM_RIGHTS) as the transport.

node.js and libuv depend on the fact that none of the supported systems
ever emit more than one SCM_RIGHTS message from a recvmsg() syscall.
That assumption is something we should probably address someday for the
sake of portability but that's a separate discussion.

So, SCM_RIGHTS messages are never coalesced. SCM_RIGHTS and normal
messages however _are_ coalesced. That is, recvmsg() might return this:

  recvmsg();  // { "message-with-fd", "message", "message" }

The operating system implicitly breaks pending messages along
SCM_RIGHTS boundaries. Most Unices break before such messages but Linux
also breaks _after_ them.  When the sender looks like this:

  sendmsg("message");
  sendmsg("message-with-fd");
  sendmsg("message");

Then on most Unices the receiver sees messages arriving like this:

  recvmsg();  // { "message" }
  recvmsg();  // { "message-with-fd", "message" }

The bug fix in commit 9352c19 assumes this behavior. On Linux however,
those messages can also come in like this:

  recvmsg();  // { "message", "message-with-fd" }
  recvmsg();  // { "message" }

In other words, it's incorrect to assume that the file descriptor is
always attached to the first message. This commit makes node wise up.

Fixes nodejs#5330.
bnoordhuis added a commit that referenced this issue May 13, 2013
Commit 9352c19 ("child_process: don't emit same handle twice") trades
one bug for another.

Before said commit, a handle was sometimes delivered with messages it
didn't belong to.

The bug fix introduced another bug that needs some explaining. On UNIX
systems, handles are basically file descriptors that are passed around
with the sendmsg() and recvmsg() system calls, using auxiliary data
(SCM_RIGHTS) as the transport.

node.js and libuv depend on the fact that none of the supported systems
ever emit more than one SCM_RIGHTS message from a recvmsg() syscall.
That assumption is something we should probably address someday for the
sake of portability but that's a separate discussion.

So, SCM_RIGHTS messages are never coalesced. SCM_RIGHTS and normal
messages however _are_ coalesced. That is, recvmsg() might return this:

  recvmsg();  // { "message-with-fd", "message", "message" }

The operating system implicitly breaks pending messages along
SCM_RIGHTS boundaries. Most Unices break before such messages but Linux
also breaks _after_ them.  When the sender looks like this:

  sendmsg("message");
  sendmsg("message-with-fd");
  sendmsg("message");

Then on most Unices the receiver sees messages arriving like this:

  recvmsg();  // { "message" }
  recvmsg();  // { "message-with-fd", "message" }

The bug fix in commit 9352c19 assumes this behavior. On Linux however,
those messages can also come in like this:

  recvmsg();  // { "message", "message-with-fd" }
  recvmsg();  // { "message" }

In other words, it's incorrect to assume that the file descriptor is
always attached to the first message. This commit makes node wise up.

Fixes #5330.
bnoordhuis pushed a commit that referenced this issue May 14, 2013
Increasing the number of workers from 2 to 8 makes this test
more likely to trigger race conditions. See #5330 for background.
bnoordhuis added a commit that referenced this issue Jun 8, 2013
node.js and libuv depend on the fact that none of the supported systems
ever emit more than one SCM_RIGHTS message from a recvmsg() syscall.

SCM_RIGHTS messages are never coalesced. SCM_RIGHTS and normal messages
however _are_ coalesced. That is, recvmsg() might return this:

  recvmsg();  // { "message-with-fd", "message", "message" }

The operating system implicitly breaks pending messages along
SCM_RIGHTS boundaries. Most Unices break before such messages but Linux
also breaks _after_ them.  When the sender looks like this:

  sendmsg("message");
  sendmsg("message-with-fd");
  sendmsg("message");

Then on most Unices the receiver sees messages arriving like this:

  recvmsg();  // { "message" }
  recvmsg();  // { "message-with-fd", "message" }

The bug fix in commit 9352c19 assumes this behavior. On Linux however,
those messages can also come in like this:

  recvmsg();  // { "message", "message-with-fd" }
  recvmsg();  // { "message" }

In other words, it's incorrect to assume that the file descriptor is
always attached to the first message. This commit makes node wise up.

This is a back-port of commit 21bd456 from the v0.10 branch. The test
has been dropped as it's not compatible with the v0.8 process model.

Fixes #5330.

Conflicts:
	lib/child_process.js
@PerfectedApp
Copy link

I see that this issue has been closed. Would someone be so kind as to state from which version has this been patched?

@Mithgol
Copy link

Mithgol commented Aug 12, 2013

As far as I see,

  • this issue is patched in version 0.11.5 by commit 21bd456,
  • this patch is backported to version 0.8 by commit 8a3d0c8.

I am not sure whether a fix for 0.8 was ever released and if the version 0.10 is patched.

@bnoordhuis
Copy link
Member

Would someone be so kind as to state from which version has this been patched?

v0.8.25, v0.10.6 and v0.11.2.

@vsviridov
Copy link

Just encountered it in 0.10.17

@bnoordhuis
Copy link
Member

@vsviridov Can you open a new issue with a test case and details on your environment (output of uname -a and node -pe process.versions and anything else you think is relevant)? Thanks.

@vsviridov
Copy link

Might be hard to isolate a test case, but I'll try.

@matchdav
Copy link

Definitely having this issue in 0.10.24

@vsviridov
Copy link

Mush shame, such lack of test case... Wow. Do not even have the codebase with the original issue anymore :(

@Meekohi
Copy link

Meekohi commented Apr 14, 2014

For anyone else who gets here:
Having this issue in 0.10.22, but appears fixed in 0.10.26

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants