make it possible to locally configure cluster #3367

Closed
wants to merge 8 commits into from

8 participants

@tomyan

The goal of this change is to make it possible to configure and use the cluster module without modifying global config. The change leaves the current interface entirely intact, but additionally allows configuration of a cluster with the following:

var cluster = require('cluster');

var master = new cluster.Master();
master.setupMaster({ exec: "some/file.js" });

Using the existing interface still works:

cluster.setupMaster({ exec: "some/other/file.js" });

...except that the global master instance has no effect on the other master, and vice versa. This is kinda similar to how you can configure a local http.Agent without affecting the global one.

The reason I wanted to propose this change was because I have a project that uses cluster to do some long running work. The project exposes a library interface and I would like users to be able to use that library interface without having side-effects other than the one they have initiated by using the library (i.e. does not modify global cluster config).

@AndreasMadsen
Node.js Foundation member

May I ask what the long-running process does?

@tomyan

The project is an http server that uses cluster for two different purposes, depending on how it's invoked. The first is for hot (re)loading code, where a new cluster worker is created for each request (as well as for subsequent requests within a short period). I'm aware that alternative approaches exist for this (i.e. watching for filesystem changes), but this works pretty well and is the way I prefer to do it. The second use is more typical: using cluster to make use of multiple CPUs.

The primary mechanism for running this server is as a standalone program (i.e. it would not need a cluster module whose configuration has global side-effects), but it also provides a library interface. Part of the motivation for this is to be able to drive the functionality from unit tests (via a framework that runs each test within the same process), but I think it might be useful beyond this. Either way, I find that avoiding global side-effects to keep things testable is generally a good pattern.

@isaacs

No for 0.8. Maybe for 0.9.

Let's review this more closely in a few weeks.

@tomyan

Cool. Let me know if there's stuff I can do to improve the proposed change when the time comes.

Thanks

Tom

@hasanyasin

I hope I understood the problem correctly: You want to create Worker processes with different settings (a different executable or with different command line arguments) than the settings provided to setupMaster.

If I understood it correctly, an alternative solution would be allowing fork() calls to override cluster setup:

var cluster = require('cluster');
// Setting up cluster:
cluster.setupMaster({ exec: "some/file.js", args:"blah blah" }); // Global config.

// a worker with global settings:
worker = new cluster.Worker();

// Executable overridden, arguments kept:
custom_worker = new cluster.Worker({ exec:"some/other/file.js" });

// Both executable and arguments are overridden:
custom_worker_2 = new cluster.Worker({ exec:"some/another/file.js", args: "even more blah blah" });
@tomyan

Yeah, that would be fine if it resulted in a worker being forked. Currently I don't think you can create a forked worker like that; you need to use cluster.fork(), which only takes the env parameter.

Tom

@hasanyasin

I am developing a web server on Node; it has a master process and multiple child processes, one per core. I personally really did not like using the same module for both kinds of work and separating code blocks with isMaster/isWorker checks. So I designed a master process that runs from one js file and child processes that run from another.

To achieve this in v0.6, I had to patch it. Half of what I did was in fact already documented, but not implemented in code. For the other half, I did what I explained in my previous comment. I was going to fork and request a pull, but after seeing 1900+ forks of the project and seeing that the cluster module had already changed a lot in 0.7, I did not do that.

0.8 provides half of what I wanted. I originally designed the system so that child processes are created with custom command-line arguments. Since 0.8 now provides the main need, having different commands for master and child, I did not want to patch it again; instead, I did what I wanted to do with command-line arguments via an initial message sent to child processes after they are created. I then liked this even more than my original approach, since I can send a much more complex object instead of basic strings as command-line arguments.

I hope this can be a solution for your problem too.

@AndreasMadsen
Node.js Foundation member

I'm still not convinced this is a good idea.

@hasanyasin your API proposal raises the most concern. I often get mails from people who ask how they can have different workers, separated into different files, so they can route incoming requests depending on their IP or some signature. The problem is that this is not possible with any type of cluster that uses the OS to load balance: it distributes the TCP requests (almost) equally between the listeners attached to the main handle (fd). Having different workers is simply a bad design decision.

However, if that is what you need, you can still do it. The procedure is to create your own load balancer in node by creating workers with child_process.fork(), then listening for incoming TCP connections in the master and distributing them using child.send(socket) (yes, that is a new feature in node 0.8).

Furthermore, you are speaking about two halves; could you describe them more precisely, so we can understand the use case? A personal request, if I may: please describe both what you want and how you did it. Often I only get what someone did, and it almost always shows that there could be some improvements, or that it isn't the purpose of cluster.

@tomyan I'm not totally against your API proposal; however, it seems very complicated to support both the current and the new API. Could it not be simplified so that you always have to call cluster.setupMaster and have that simply return a new Master instance? However, @isaacs will have to be the judge of that.

Also note that cluster.Worker is not a documented API; use cluster.fork(). And yes, there is a difference, and it may become bigger.

I'm not sure why code reloading should be done with cluster; could you explain that? It seems to me that this could just as well, without any more complexity, be implemented with child_process.spawn().

In any case, it sounds to me that your different implementations will never run at the same time: one for development and one for production. If that is the case, I don't see the need for pre-creating two different cluster configurations when the different configurations can simply be invoked at runtime.

That in itself doesn't, of course, make the API bad; what really concerns me is the following:

Extensibility: the main purpose of the cluster rewrite was to make it more extensible by exposing the necessary information directly from cluster, without creating an opinionated plugin API. For instance, an "auto restart" module could be implemented with:

var cluster = require('cluster');
module.exports = function autorestart() {
  cluster.on('exit', function () {
    cluster.fork();
  });
};

Of course this could just as well be implemented as autorestart(cluster), which would fit your API. However, allowing whole new instances of cluster objects to be created could produce a much worse extensibility environment:

var cluster = require('cluster');

module.exports = function autorestart(settings) {
  var master = cluster.setupMaster(settings);
  master.on('exit', function () {
    master.fork();
  });
  return master;
};

This cannot be extended further by other modules that use the same structure, since each of those would just create a new master. The first design can be, because it takes only a cluster object.

Master principle: I also often get mails about how to restart the master using the new cluster, and the answer is that you cannot; that must be done at a lower level. However, it is my ideal principle never to create a master so complex that there is a high risk of bugs. Allowing the same process to host two different masters makes the process code a lot more complex (if two masters are actually created), and it may introduce new bugs, since any external module is cached and could therefore reuse some cached object that was only meant to work with one master. For instance, it could assume that worker.id is unique; if one creates two masters, that is not true.

Of course, this is all about protecting userland, and node.js doesn't do that. But having watched #node.js for a long time, I get the impression this is most certainly not well understood (it seems to have begun turning around two months ago). Keeping that in mind, I feel this is not the time to increase the complexity. However, I'm not opposed to the API.

@hasanyasin

Hey Andreas, thanks very much for all the details and explanation.

I was just misunderstood. :)

What I was trying to say was that I am really happy with the state of things in 0.8. What I wanted to separate was the Master and Worker processes, not workers from each other. This had already been documented in 0.6; below is from cluster.js in 0.6.19:

// options.workerFilename
// Specifies the script to execute for the child processes. Default is
// process.argv[1]

However, the function implementation did not have this, so I just added the argument and made it possible to have different sources for the Master and Worker processes.

With 0.8, we can now give the Master and Worker processes different sources. I do not mean having master.js, worker1.js, worker2.js, etc. What I mean is having just one master.js and one worker.js, to get cleaner code instead of keeping both in one file and using if/else blocks to keep things separate.

In addition to this, I had also made it possible to call Worker processes with different arguments. This was not to create completely different child processes; it was just to pass in some variables. Now I send a message to each worker instead of calling them with different arguments. This makes me even happier than using arguments, since I already use messages between the Master and Worker processes for many things, and this makes everything much more consistent.

In summary, I am very happy with the current state of the child process creation.

UPDATE: I now realize that the cause of the misunderstanding was my first comment, not the previous one. I had tried to give an alternative implementation for tomyan's request. In my own use, I do not have different worker processes, just one kind.

@AndreasMadsen
Node.js Foundation member

@hasanyasin okay, I think I got it.

You are speaking about the confusing comment at https://github.com/joyent/node/blob/v0.6/lib/cluster.js#L58. You should know that it is a leftover from the original cluster, which worked by calling node cluster worker.js. The cluster argument would then be swapped out by src/node.js, so process.argv[1] was worker.js.

In addition to this, I had also made it possible to call Worker processes with different arguments.

Are you speaking about the API proposal or an actual implementation?

@hasanyasin

I had thought that it was documented as designed, but not yet implemented.

FYI, below is what I had done. I am not posting this as a suggestion or anything; I had just tried to solve my problem quickly this way, and since you asked, I am writing it up.

60:
--- // options.workerFilename
+++ // options.path

71:
--- function startMaster() {
+++ function startMaster(options) {

77-78:
---    workerFilename = process.argv[1];
---    workerArgs = process.argv.slice(2);
+++  if (options) {
+++    workerFilename = options.path || process.argv[1];
+++    workerArgs = options.args || process.argv.slice(2);
+++  }
+++  else {
+++    workerFilename = process.argv[1];
+++    workerArgs = process.argv.slice(2);
+++  }

145:
--- cluster.fork = function() {
+++ cluster.fork = function(args) {

161:
---  var worker = fork(workerFilename, workerArgs, { env: envCopy });
+++  var worker = fork( workerFilename, (args || workerArgs), { env: envCopy });
@tomyan

Hi @AndreasMadsen

Thanks for taking the time to go through it. Comments inline...

@tomyan I'm not totally against your API proposal; however, it seems very complicated to support both the current and the new API. Could it not be simplified so that you always have to call cluster.setupMaster and have that simply return a new Master instance? However, @isaacs will have to be the judge of that.

I like the idea of having setupMaster return a new master - I guess this would mean that if you use cluster you get the defaults, and if you call setupMaster to get a differently configured master you don't have any side effects on other code that uses cluster. I'd be happy to work on a patch that achieves that.

Also note that cluster.Worker is not a documented API; use cluster.fork(). And yes, there is a difference, and it may become bigger.

I'm not sure why code reloading should be done with cluster; could you explain that? It seems to me that this could just as well, without any more complexity, be implemented with child_process.spawn().

I was having trouble doing it with a single listening socket between reloads. I wasn't able to bind to a socket without accepting connections in the parent process, which I think I would need - see #2289. I could do it by accepting the connections in the parent process and then passing the handles to the child process I guess, but it seems like cluster does almost exactly what I need. I'm also not too keen on doing per request handle passing between processes as I've tried this before in another server (and language) and it didn't perform as well as accepting in each child - this isn't too much of an issue for the code reloading use-case as that would be only used in development, but I'd like to avoid the complexity if possible.

In any case, it sounds to me that your different implementations will never run at the same time: one for development and one for production. If that is the case, I don't see the need for pre-creating two different cluster configurations when the different configurations can simply be invoked at runtime.

They run in the same process in my tests, or at least I would like them to. I'd also like it to work as a library without having to caveat its usage.

That in itself doesn't, of course, make the API bad; what really concerns me is the following:

Extensibility: the main purpose of the cluster rewrite was to make it more extensible by exposing the necessary information directly from cluster, without creating an opinionated plugin API. For instance, an "auto restart" module could be implemented with:

var cluster = require('cluster');
module.exports = function autorestart() {
  cluster.on('exit', function () {
    cluster.fork();
  });
};

Of course this could just as well be implemented as autorestart(cluster), which would fit your API. However, allowing whole new instances of cluster objects to be created could produce a much worse extensibility environment:

var cluster = require('cluster');

module.exports = function autorestart(settings) {
  var master = cluster.setupMaster(settings);
  master.on('exit', function () {
    master.fork();
  });
  return master;
};

This cannot be extended further by other modules that use the same structure, since each of those would just create a new master. The first design can be, because it takes only a cluster object.

I think the auto-restart feature is one I'd love to see in cluster, but that might be ignoring your point :-). On your direct point (about extensibility), I think that extensions are better where they don't have a global (side-)effect. This kind of pattern can lead to action at a distance, and hence non-obvious behaviour and difficult-to-track-down bugs. I'm therefore much more in favour of your second API (in most cases there will only be one master anyway).

Master principle: I also often get mails about how to restart the master using the new cluster, and the answer is that you cannot; that must be done at a lower level. However, it is my ideal principle never to create a master so complex that there is a high risk of bugs. Allowing the same process to host two different masters makes the process code a lot more complex (if two masters are actually created), and it may introduce new bugs, since any external module is cached and could therefore reuse some cached object that was only meant to work with one master. For instance, it could assume that worker.id is unique; if one creates two masters, that is not true.

That's only an issue if the modules you use make use of global data. The modules I tend to use (and write) avoid global mutable data, as this makes the code more reusable within different parts of an application without those parts treading on each other's toes (I guess this is the principle I'm trying to push here).

Of course, this is all about protecting userland, and node.js doesn't do that. But having watched #node.js for a long time, I get the impression this is most certainly not well understood (it seems to have begun turning around two months ago). Keeping that in mind, I feel this is not the time to increase the complexity. However, I'm not opposed to the API.

What did you think of the structural change in my pull request? I found it much easier to implement the changes by splitting the code that was shared between parent and child, from the code for the child, and the code for the parent. If you (and @isaacs) agree to the API proposal, it would be good to know that the approach to adding the feature is okay too.

Anyway, thanks for the feedback and for the module, it's really useful.

Tom

@AndreasMadsen
Node.js Foundation member

I like the idea of having setupMaster return a new master - I guess this would mean that if you use cluster you get the defaults, and if you call setupMaster to get a differently configured master you don't have any side effects on other code that uses cluster. I'd be happy to work on a patch that achieves that.

It means require('cluster') returns a setupMaster function, and calling that will return a new Master instance. However, the best decision is probably to let require('cluster') support the old API as well if possible, but mark it deprecated.

To make it totally clear:

  • calling require('cluster') as a function returns a new master instance
  • calling require('cluster').whatever uses the old master instance. Note that if this does not support the old require('cluster').setupMaster, then calling require('cluster').whatever should not be supported at all.

I was having trouble doing it with a single listening socket between reloads. I wasn't able to bind to a socket without accepting connections in the parent process, which I think I would need - see #2289. I could do it by accepting the connections in the parent process and then passing the handles to the child process I guess, but it seems like cluster does almost exactly what I need. I'm also not too keen on doing per request handle passing between processes as I've tried this before in another server (and language) and it didn't perform as well as accepting in each child - this isn't too much of an issue for the code reloading use-case as that would be only used in development, but I'd like to avoid the complexity if possible.

Ahh, I think I understand. The problem is that the port isn't freed fast enough, but using cluster works perfectly since the server instance is shared, so the server is never closed, just reused.

They run in the same process in my tests, or at least I would like them too. I'd also like it to work as a library without having to caveat its usage.

Well, abstracting test cases between programs is quite easy; I don't see that as a problem. Also, the only case where running both development and production tests makes sense is in your cluster abstraction module itself. If you need to do it in a module that uses your cluster abstraction module as a dependency, then your cluster abstraction module has failed (or so I believe).

As for your second comment, I don't understand it.

I think the auto-restart feature would be one I'd love to see in cluster, but I think that might be ignoring your point :-). On your direct point (about extendibility), I think that extensions are better where they don't have a global (side-)effect. I think this kind of pattern can lead to action at a distance, and hence non-obvious behaviour and difficult to track down bugs. I'm therefore, much more in favour of your second API (in most cases there will only be one master anyway).

Well, if global is the only thing that can exist, there won't be any confusion. If both global and local can exist, it will be a disaster; that is also why I'm in favour of not supporting the current API if the patch lands.

What did you think of the structural change in my pull request? I found it much easier to implement the changes by splitting the code that was shared between parent and child, from the code for the child, and the code for the parent. If you (and @isaacs) agree to the API proposal, it would be good to know that the approach to adding the feature is okay too.

I must say, when I discovered that the first line of real change would introduce a major bug, I did not read the rest. Also, I think we need to decide how/if this should be backward compatible before spending too much time on the code. But don't worry, I will read it carefully once that is decided :)

@AndreasMadsen
Node.js Foundation member

Why not make this work as it does today

I'm cool with using cluster.createMaster, as long as cluster.setupMaster doesn't have two purposes. If that is done, I would expect something like this:

exports = module.exports = new Master();
exports.setupMaster();

exports.createMaster = function (settings) {
  var o = new Master();
  o.setupMaster(settings);
  return o;
};

I don't have a separate set of development and production tests, I just have a single test suite that exercises all of the functionality of my module within a single invocation of node - or at least that's what I want. The module exposes functionality that is useful in development (hot-reloading of code) or in production, but I'd like both to be covered by the tests.

A module of mine (thintalk) is an RPC abstraction that provides the same API over two different communication protocols. Since the API is the same, the expected result should also be the same.

As you see in the test cases, I have separate files for IPC and TCP, but they both use the same testcase abstraction. In my case it would probably make sense to test both the IPC and TCP transport layers in the same node process, to ensure they work together. But in your case with cluster, I still can't see why testing both development and production in the same node process makes any sense.

As for your second comment, I don't understand it.

I think that was my comment.

@AndreasMadsen AndreasMadsen commented on an outdated diff Jul 7, 2012
lib/cluster.js
@@ -33,85 +35,17 @@ var debug;
if (process.env.NODE_DEBUG && /cluster/.test(process.env.NODE_DEBUG)) {
debug = function(x) {
var prefix = process.pid + ',' +
- (process.env.NODE_UNIQUE_ID ? 'Worker' : 'Master');
+ (process.env.NODE_CLUSTER_ID ? 'Worker' : 'Master');
@AndreasMadsen
Node.js Foundation member

This will introduce a bug, since child_process.fork() should be called with NODE_UNIQUE_ID in its env, and it is used in src/node.js.

@AndreasMadsen
Node.js Foundation member

I was suggesting that we remove (i.e. deprecate) setupMaster altogether, so that the only time you can alter settings is when creating a new master. How does that sound?

You will have to ask @isaacs, but if cluster.setupMaster is removed then I don't see any point in maintaining backward compatibility.

@Nodejs-Jenkins

Can one of the admins verify this patch?

@Rush

What is the status of this change? How can we help? I would like to use this in a module that sets up its own workers, but without polluting cluster's global state for any other users. I am considering temporarily ripping cluster.js out of the node sources into my module so that the state is local, but preferably I would fix node.js itself.

@rmg
Node.js Foundation member

Possibly relevant to the original motivation, as of #7671 cluster.setupMaster() can be called multiple times to reconfigure some aspects of the cluster's behaviour over time.

Being able to create a cluster.Master() instance looks like a useful step toward a less automagical and more explicit cluster API. Sounds like A Good Thing ™️

Given the code freeze/slush, does this still belong in the v0.12 milestone?

@tomyan

@rmg: I'd be happy to see this cleaned up and applied. I could look at it at some point, but probably not for a week or two.

Thanks

Tom

@7nights 7nights referenced this pull request in doxout/recluster May 14, 2015
Closed

Multiple calls support #28

@jasnell
Node.js Foundation member

@tomyan ... I know it's been a while, but is this still something you want to pursue? If yes, the PR would need to be updated significantly. If not, we should likely go ahead and close.

@jasnell
Node.js Foundation member

Closing due to lack of activity. We can revisit if someone is interested in updating the PR. Given that this is a feature request, however, it would need to be targeted at either http://github.com/nodejs/io.js or http://github.com/nodejs/node

@jasnell jasnell closed this Aug 3, 2015