Cluster 2.0 #2038

wants to merge 155 commits into



AndreasMadsen commented Nov 7, 2011

In 0.6.x the cluster module was just a small extension on the child_process.fork() function. The purpose of this pull request is to make the cluster module easier to set up, without removing the basic functionality. It also adds a lot of new functionality that makes it possible for userland plugins to interact with the cluster module.

The documentation can be found here.

This request fixes the following issues:

  • issue #2060: when the master gets a kill/quit signal, it will do the same with the workers.
  • issue #2073: made the cluster silent.
  • issue #2088: added a method for graceful worker and master shutdown.

This request contains the following changes:

  • Added a .eachWorker(fn) method.
  • Allow workers to be another file, using .setupMaster().
  • Allow passing arguments to workers, using .fork([env]).
  • Internal messages won't be sent to the message event.
  • Allow workers to commit suicide, using worker.destroy().
  • Moved the process object from child_process.fork() to cluster.fork().process.
  • Made the API in master and workers equal, using an internal new Worker() object.
  • Easy access to worker details inside the worker.
  • Added a .disconnect() method to allow a graceful shutdown.
  • Added events: fork, listening, disconnect.
  • Made the cluster silent.
  • You can use SIGTERM, SIGINT and SIGQUIT on a worker.
  • Added a cluster.workerOnline property to get how many workers are online.
  • Support SIGTERM, SIGINT, SIGQUIT and SIGCHLD on the master.
  • Created a cluster.disconnect() method.
  • Created a cluster.destroy() method.
  • cluster.isWorker and cluster.isMaster are now protected.
  • Kill all workers the moment the master exits.
  • Allow echo callbacks from both worker and master.
  • Allow echo callbacks to be used by userland.
  • Added a workerID, a number that will be reused when a worker spawns/respawns.
  • Added a uniqueID, a unique number that changes for each spawn/respawn.
  • Throw an error on conflict between .autoFork() and manual .fork().
  • Prevent and detect respawn infinite loops, caused by errors in workers.
  • Emit a citicalError event when a respawn infinite loop is detected.
  • Added a startup property to the worker object.
  • Added a silent option to setupMaster to prevent worker output from being shown as master output.
  • Added a zero-downtime restart method.
  • When getting a SIGHUP signal the cluster will restart gracefully.
  • Added a settings object to the cluster.
  • Added a setup event to the cluster; it is emitted when setupMaster executes.
  • You now kill a worker using destroy, not kill.
  • The worker event kill was renamed to death.
  • Updated and improved documentation.
  • Added testcases (a lot of testcases).

These changes were made in other modules:

  • To support the silent option, child_process.fork() has been updated to take silent as an option in the options argument. This is a very small change.

These changes will be included in the near future:

  • Fix issue #2194. But before that, UDP port sharing must be supported.
  • Allow the number of CPUs to change. This depends on issue #2265.

AndreasMadsen added some commits Nov 7, 2011

@AndreasMadsen AndreasMadsen send tcp self to cluster._getServer 256ec65
@AndreasMadsen AndreasMadsen renamed function from _startWorker to _setupWorker 0afde5b
@AndreasMadsen AndreasMadsen added testcase for cluster events 50820e7
@AndreasMadsen AndreasMadsen eachWorker: made public; startMaster: made public, renamed to setupMaster, added options argument; HandleWorkerMessage: emit events, echo always queryID if required, added listening command; fork: give each worker a protected workerID; queryMaster: made public, renamed to worker.send; _getServer: send a listening command to master.
@AndreasMadsen AndreasMadsen Documented cluster events c97c232

AndreasMadsen commented Nov 7, 2011

I'm in the process of writing the documentation.


AndreasMadsen commented Nov 7, 2011

I finished documenting the changes I have made.
However, in the documentation I have:

  • renamed cluster.worker.send to cluster.worker.respond
  • added a message event: cluster.worker.on('message')
  • changed the behaviour of cluster.fork().on('message')

These changes have not yet been made in the cluster.js native module.


AndreasMadsen commented Nov 8, 2011

I have added @visionmedia's commit 58b558a to this pull request.


AndreasMadsen commented Nov 8, 2011

I am considering making the Worker class usable in both master and worker. That way the API will be almost the same for the master and the worker.

In master
A Worker object can be obtained in the master via cluster.workers[id].

  • Its message event will be emitted when it receives non-internal data from the worker.
  • Its kill method would set a suicide state and kill the process.
  • Its send method would send a message, and call a callback when an echo is received from the worker.

This is already done.

In worker
The Worker object can be obtained in the worker via cluster.worker.

  • Its message event will be emitted when it receives non-internal data from the master.
  • Its kill method would send a suicide state to the master; when the callback is called, it will kill itself.
  • Its send (currently respond) method would send a message to the master, and call a callback when an echo is received from the master.

AndreasMadsen commented Nov 8, 2011

I noticed that my commits don't pass make jslint. Should I fix that now, or in another pull request after this is pulled?


AndreasMadsen commented Nov 8, 2011

I have updated the pull request so both the worker and master are using the Worker class.
This had the side effect that the code got a lot simpler.
At 5c1d481 I was worried the code would become gibberish, since I had to make individual changes for both worker and master, but I do not think this is a problem any more.

I do not have any more plans for changes, but I would like to discuss:

  • Why do new workers need a new id, when they can just reuse the one from the dead worker?
  • Preventing evil error wheels.
  • The finalized event (when all workers are listening).
  • Additional testcases; there was not a single one before my commits.

I will try to fix the issues that are reported.


AndreasMadsen commented Nov 14, 2011

As proposed in issue #2088, I have added a .disconnect() method in commit 25f5793. It is extremely difficult to test whether the worker does exit when/if all connections stop, since there is no IPC at that point.


AndreasMadsen commented Nov 14, 2011

As requested in #2073, the cluster is now silent.


AndreasMadsen commented Nov 14, 2011

I have been trying to fix issue #2060, but as a JavaScripter this seems to be out of my range.
This is my current work: but it doesn't work.
Hopefully someone will look at it or give me a hint.


AndreasMadsen commented Nov 14, 2011

Here is an update:
You can now use SIGTERM, SIGINT and SIGQUIT on a worker (not the master). The worker will send a suicide message to the master, so the master won't spawn another one. SIGQUIT does a graceful shutdown; SIGTERM and SIGINT just commit suicide.

Now I'm trying to support SIGCHLD on the master. However (and this is where my brain stops working), when I do a kill -s TERM #pid on a worker, the process.on('SIGCHLD') handler in the master emits.

You can review the code here:
To debug this, this is what I'm doing:
I have a test file:
Then I execute it using node test.js. It tells me a worker has pid 1139; in another terminal I then write kill -s TERM 1139.

The result is that the worker dies and doesn't restart, because of the suicide state. But then SIGCHLD emits and calls autoFork(); the end result is that another worker is online and a wrong event is emitted.
The purpose of SIGCHLD is the same as in LearnBoost/cluster – it spawns workers if there is a lack of them.


AndreasMadsen commented Nov 15, 2011

Okay, got it: SIGCHLD emits when one of the master's children dies. I will need to do some more research on when this happens (.disconnect()) and whether it works on Windows as well.

Perhaps it should replace the .on('death', cluster.fork) in autoFork, if it is reliable.


AndreasMadsen commented Nov 16, 2011

I don't see how SIGCHLD could be supported properly, since there is no way to know which worker died, and this is necessary in order to know if the worker committed suicide. So I only added support for SIGINT, SIGTERM and SIGQUIT.


AndreasMadsen commented Nov 19, 2011

In the last commit stack, I have:

  • renamed workerID to uniqueID, since it made more sense to use workerID for the CPU-dependent id.
  • made a workerID that will be reused when respawning a dead worker.
  • Reimplemented uncaughtException handling; this was necessary since process.on('exit') isn't called when an error causes the kill.
  • Prevented userland from using both .autoFork() and .fork() in the same master. There is now an internal variable that can have two values, manual or auto, depending on which function was used first.
  • Prevent and detect respawn infinite loops, caused by errors in workers. When using .autoFork() it would auto-respawn dead workers; however, if they died because of a syntax error, this would have created an infinite loop. It now logs the previous 5 deaths, and if they all died within 1 sec, it won't spawn any more.
  • When a respawn infinite loop is detected, the cluster module will emit a citicalError event, so the developer can get an email if wanted.
  • To support the detection, there is now a .startup property that is set by

I currently don't have anything more to add to this pull request. If anything is missing, please tell me.
I will update the documentation, and make sure each feature is tested in a testcase.


AndreasMadsen commented Nov 20, 2011

This is a list of required testcases:

  • test-cluster-basic: Tests basic events, and the ability to kill a single worker

    cluster events: fork, online, listening, death

    worker events: fork, online, listening, exit

    methods: fork, worker.kill

    check states: none, online, listening, dead
  • test-cluster-message: Test the IPC system, this includes the callback feature

    worker events: message

    methods: worker.send
  • test-cluster-auto-fork: Test the auto fork feature, and the ability to kill a worker and manually respawn it

    methods: autoFork, eachWorker

    properties: workerID, uniqueID, onlineWorkers, startup
  • test-cluster-throw: Check protected properties, and conflict preventing checks

    protected properties: isMaster, isWorker, startup, workerID, uniqueID, onlineWorkers

    conflict preventing: autoFork followed by fork
  • test-cluster-citical-error: Control the error loop preventing feature

    cluster events: citicalError
  • test-cluster-disconnect: Check the disconnect feature, and make sure that the worker did die

    cluster events: disconnect, death

    worker events: disconnect

    methods: worker.send, disconnect, worker.disconnect
  • test-cluster-destroy: Check that destroy works correctly

    methods: destroy
  • test-cluster-worker-signal: Check that workers handle signals correctly

    worker signals: SIGINT, SIGTERM, SIGQUIT

    worker events: disconnect, exit

    cluster events: disconnect, death
  • test-cluster-master-signal: Check that the master handles signals correctly; the testcase will need to spawn a master

    master signals: SIGINT, SIGTERM, SIGQUIT
  • test-cluster-master-error: Check that the master kills workers when an error is uncaught; the testcase will need to spawn a master
  • test-cluster-setup-master: Check the options in the setupMaster function

    methods: setupMaster

    properties: workers, exec, args
  • test-cluster-fork-args: Check that you can pass individual env options to workers

    methods: fork

AndreasMadsen commented Nov 26, 2011

I have now added a silent option; if set to true, stdout and stderr won't propagate directly to the master's output. Supporting this required a change to the child_process.js file.
Both the cluster and child_process documentation have been updated.


AndreasMadsen commented Nov 26, 2011

I will try to look at issue #2194, so I will take a break from making testcases, and push what I have so far.

These are the testcases I have made so far:

  • test-cluster-basic
  • test-cluster-message
  • test-cluster-auto-fork
  • test-cluster-throw
  • test-cluster-citical-error

Issue #2194 is beyond my abilities as a JavaScript guy. When dgram port sharing is supported I will look at it again. I will get back to the testcases :)


AndreasMadsen commented Nov 27, 2011

I discovered some sort of bug when creating the test-cluster-disconnect testcase. It looks like worker.on('disconnect') is emitted twice when disconnecting a worker, but only internally. The same goes for the worker.process.on('exit') event.
It drives me crazy, and it will take some time, but I thought any reader should know.

Okay, so it wasn't really a bug, but a combination of a lack of logic and the fact that the Worker class is used in both master and worker.
I improved the logic in commit 70f1c97 by isolating exit and disconnect. It seems that in node 0.6.3 process.on('exit') does fire even if the channel is closed, so there is no need to handle exit events in disconnect.
Also, the state and suicide mode are now configured outside the disconnect event (don't know what I was thinking before), and then disconnect emits.


AndreasMadsen commented Dec 2, 2011

In the commits I have added a processWatch and a progressTracker testing module (not as native modules). I don't know what the policy is about this, but I really needed some sort of method for watching progress in more than one testcase.


AndreasMadsen commented Dec 2, 2011

I finally got time to make the last testcases; everything seems to work fine.


AndreasMadsen commented Dec 4, 2011

So I asked for feedback on IRC. It seems that the only thing missing is a zero-downtime restart method (@mmalecki).

I should also try to make this more multiple-machine friendly (not just cores). I have not worked with multiple machines before, but I suspect it primarily involves the fact that the number of CPUs can change. Currently the number of workers is set once and cannot be changed.


AndreasMadsen commented Dec 7, 2011

I have added a zero-downtime restart method to the cluster object and a simple restart method to the worker object. It could be discussed whether worker.restart should exist, but right now I don't see a reason not to.

The zero-downtime restart method's behavior is a little different depending on whether the user uses autoFork or fork, so this requires about 3 testcases to be sure that it works as expected. I will look at that another day :)


piscisaureus commented Dec 7, 2011


You seem to be very dedicated to improving the builtin cluster module; there is a lot of good stuff in these pull requests. However, for node we try to be pretty minimal, and these patches add a lot of stuff. If you have the time, please drop by the #libuv freenode channel so we can work together.


AndreasMadsen commented Dec 8, 2011

I will do that, but my time is limited at the moment. When are you online?

I see that we are in the same timezone, so it shouldn't be too big a problem. I will try to be online the rest of the day (until 22:00 UTC+1).

@ry ry commented on an outdated diff Dec 8, 2011


AndreasMadsen commented Dec 13, 2011

If plugins have to interact with the cluster module, they should be allowed to know how the master is set up. So the setupMaster settings should go into a read-only options object. There should also be an event so plugins know when the read-only options object is created, and the internal forkMode should be in the read-only options object too.

I also think cluster.workers should be a plain object rather than an array; there is no reason to have push, slice, length and so on on this object.


AndreasMadsen commented Dec 19, 2011

I have now updated this pull request so it fits the new master (node v0.7-pre). This should make it easier to make the step patches.

The only thing broken is the master-error testcase; I will look into that.


AndreasMadsen commented Dec 19, 2011

That was a tough one. The problem was that process._channel.close() had become async in node 0.7.0-pre. After fitting the cluster lib to be async (3454ab9), process.exit did not kill the workers, because of the way workers commit suicide without communicating with the master. Commit 4345d5b fixes that; this also has the positive effect that the uncaughtException handler can now call process.exit without waiting until all the workers die.


AndreasMadsen commented Dec 30, 2011

Okay, this is a little noisy, so I will just describe what has changed :)

The process.on('message') handler now filters out all messages with a NODE_ prefix in their cmd property, and relays them to process.on('internalMessage'). The child_process documentation has been updated to tell users to avoid the NODE_ prefix and the internalMessage event.

The cluster.js core has been updated to depend on the message filter. This allows users to use process.send() and process.on('message') without any problems or internal-message noise. Furthermore, the cluster documentation has been updated, so it now tells users that they can use these methods/events. The old cluster.worker.send and cluster.worker.on('message') do still exist, but are now just simple relays. They need to exist because the Worker class is isomorphic (used in both master and worker) and the master can't use its process object to send messages to workers.

tomyan commented Jan 2, 2012

Really liking the look of these changes :-) Wondering if it will be possible to set the exec option per worker that's spawned. I have a server that behaves differently depending on command line options. I'd like to be able to exercise these different behaviours from a single set of tests. I could use child_process.fork to run the server multiple times, but maybe it would be more convenient to be able to run workers that do different jobs from the same master?




AndreasMadsen commented Jan 2, 2012

@tomyan sure this is possible :)
However, I'm not sure what the purpose is (could you write a simple example or API change suggestion?).

Please note: this module is not made for easy testing. And having different workers running does not make much sense, since there is no way to know how the OS will balance them, and then clients would be treated differently :/

Can you not use the env option in .fork(env), or use child_process.fork to spawn not a worker but a master/cluster?

I'm open to suggestions, but I do not see the purpose of this yet.


AndreasMadsen commented Jan 5, 2012

There were some errors related to the case where there is a critical error in the worker and autoFork is on. When the error was detected it tried to disconnect all workers (good), but did it multiple times on the same worker, resulting in some seemingly random errors. Also, the critical error handling is tricky, but I think the latest patch does a good job.

I'm hoping #2463 will make this case easier to detect.


AndreasMadsen commented Jan 24, 2012

That was a difficult merge!

isaacs commented Mar 19, 2012

Closing this. Everything has been split up into separate reqs.

isaacs closed this Mar 19, 2012


AndreasMadsen commented Mar 20, 2012

@isaacs true, but could you also close #2060?
