
cluster: centralize removal from workers list. #8193

Closed
wants to merge 1 commit

Conversation


misterdjules commented Aug 16, 2014

Currently, cluster workers can be removed from the workers list in three
different places:

  • In the exit event handler for the worker process.
  • In the disconnect event handler of the worker process.
  • In the disconnect event handler of the cluster master.

However, handles for a given worker are cleaned up only in one of these
places: in the cluster master's disconnect event handler.

Because these events happen asynchronously, it is possible for the
workers list to be empty before a given worker's handle has been
cleaned up. When that happens, the assert that checks that no handle
is left once the workers list is empty fails.

We could either remove the associated handles every time we remove a
worker from the list, or remove workers from the list in only one
handler. This commit implements the latter.

Fixes #8191 and #8192.
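
For illustration, here is a schematic sketch of the pre-fix structure just described. Handler bodies are heavily simplified, and `watchWorker`/`removeHandlesForWorker` are illustrative names, not the actual lib/cluster.js code:

```js
var cluster = require('cluster');

// Hypothetical helper standing in for the real handle cleanup.
function removeHandlesForWorker(worker) { /* close pending handles */ }

function watchWorker(worker) {
  worker.process.once('exit', function() {
    delete cluster.workers[worker.id];   // removal site 1
  });
  worker.once('disconnect', function() {
    delete cluster.workers[worker.id];   // removal site 2
  });
}

cluster.on('disconnect', function(worker) {
  removeHandlesForWorker(worker);        // handles are cleaned up ONLY here
  delete cluster.workers[worker.id];     // removal site 3
});
```

Whichever of the three runs first empties the list for that worker, so the list can already be empty while the cluster-level 'disconnect' handler, the only one that releases handles, has not run yet.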

misterdjules (Author) commented Aug 18, 2014

@sam-github Since you seem to know the cluster module very well, I thought I'd put you in the loop. If you had some time to look at this change, that would be great! Thank you!

sam-github (Member) commented Aug 18, 2014

The change will need a doc change, since the current behaviour is documented.

I've been bitten by the "when exactly does cluster.workers[id] return undefined" problem before (I saved an ID, sent the worker a signal, then if it didn't exit in some period, tried to look up the worker by ID... and it wasn't there).

So, I like the idea, but I think it's really important that the ordering of event emissions and the deletion of the worker is documented and consistent. As-is, it appears to me that my docs are wrong. The worker is deleted AFTER the worker 'disconnect' event is emitted, but BEFORE the cluster 'disconnect' event is fired (well, technically, it's deleted in the first cluster 'disconnect' listener; the user-provided ones always run after cluster's own listener). Do you read it the same way?

Anyhow, for the sake of guaranteeing the state of cluster.workers at the time either disconnect event occurs, what do you think of deleting the worker and its handles at https://github.com/joyent/node/blob/master/lib/cluster.js#L323, before either the worker or the cluster disconnect event is emitted? Basically, removing this handler, https://github.com/joyent/node/blob/master/lib/cluster.js#L348, and moving its code to before line 323?
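
A sketch of that reordering, with illustrative names (the links above point at the then-current lib/cluster.js; this is not that code): settle cluster.workers and the handles before either 'disconnect' event reaches listeners.

```js
var cluster = require('cluster');

function removeHandlesForWorker(worker) { /* hypothetical cleanup helper */ }

function watchWorker(worker) {
  worker.process.once('disconnect', function() {
    removeHandlesForWorker(worker);
    delete cluster.workers[worker.id];   // list settled first...
    worker.emit('disconnect');           // ...then the worker-level event...
    cluster.emit('disconnect', worker);  // ...then the cluster-level event
  });
}
```

That way cluster.workers[id] is guaranteed to be undefined inside every user-provided 'disconnect' listener.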

misterdjules (Author) commented Aug 18, 2014

@sam-github Yes, I think we read the code the same way. However, one thing you didn't mention is that the worker can also sometimes be deleted after the worker 'exit' event is emitted, which can be before the worker 'disconnect' event is emitted.

I agree with your suggestion about removing the handler for the 'disconnect' event on the cluster's master, and moving its implementation to the worker's disconnect handler.

However, I think we should also execute the same code in the worker's exit event handler, for the reason mentioned above and to stay consistent with the documentation.

What do you think?

sam-github (Member) commented Aug 18, 2014

I'm worried about race conditions: a worker could send a number of messages, then exit; the master's exit event could occur, then a message, then another message, then disconnect... and when those messages come in, the worker ID won't be in the workers list.

I guess the question is: do we want the worker removed from the list as soon as possible, or as late as possible? ASAP means removing on the first of disconnect or exit; as late as possible means removing once both exit and disconnect (and perhaps 'close'?) have occurred.

Since you can be disconnected, but not exited, indefinitely, it's possible for a Worker process to be alive (not exited) well past when the disconnect event is observed in the master.

Actually, the whole disconnect thing is a mess: what it does and what would be reasonable and predictable are quite far apart. :-( So, I'm not sure what to say.

The obvious backwards-compatible bug fix is to do as you suggest: make sure the handle-delete code runs on exit as well as disconnect, and remove the duplicate deletes in the worker's 'disconnect' handler and the cluster's 'disconnect' handler.
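
Sketched out, that backwards-compatible variant might look like this (names are illustrative; the cleanup helper must be idempotent since both events normally fire):

```js
var cluster = require('cluster');

function removeHandlesForWorker(worker) { /* hypothetical, idempotent */ }

// Shared cleanup, run from BOTH paths instead of three partially
// overlapping handlers; deleting an already-deleted key is a no-op.
function cleanupWorker(worker) {
  removeHandlesForWorker(worker);
  delete cluster.workers[worker.id];
}

function watchWorker(worker) {
  worker.process.once('exit', function() { cleanupWorker(worker); });
  worker.once('disconnect', function() { cleanupWorker(worker); });
}
```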

misterdjules (Author) commented Aug 19, 2014

@sam-github I'm not sure I understand what you mean when you say that:

> a worker could send a number of messages, then exit; the master's exit event could occur, then a message, then another message, then disconnect... and when those messages come in, the worker ID won't be in the workers list.

Regarding the rest of your comment, is it really a problem that a worker can stay alive well past when the 'disconnect' event has been emitted? I guess in this case, even if the process itself is still running, it's just not considered part of the cluster anymore, and it's thus safe not to have it in the cluster.workers list.

As suggested in private, maybe let's discuss this on {Skype,Hangouts} and post our conclusions here when we're done?

Thank you!

misterdjules force-pushed the misterdjules:fix-issue-8191 branch 2 times, most recently from 51f42c1 to a686d75 Aug 19, 2014
misterdjules (Author) commented Aug 19, 2014

@sam-github I updated the PR according to our earlier discussion. The comments and the documentation change pretty much summarize the content of that discussion, so I won't repeat it here.

One thing I'm not sure about is adding isDead() and isDisconnected() to Worker's prototype. If that's an issue, we can make these two functions internal.

The documentation change under review reads, in part:

> _and_ exited. The order between these two events cannot be determined in
> advance. However, it is guaranteed that the removal from the
> cluster.workers list happens before the last of `'disconnect'` and
> `'exit'` events is emitted.

sam-github (Member) commented Aug 21, 2014

I'd express it as "before last 'disconnect' or 'exit' event is emitted.", and wrap to 80 columns.

misterdjules (Author) commented Aug 21, 2014

Thanks @sam-github, fixed in the updated PR!

sam-github (Member) commented Aug 21, 2014

LGTM

I don't have an opinion on whether adding more private/undocumented methods to Worker is good or bad; better to ask somebody who knows more node core history and what has or has not caused problems in the past. It's definitely safer to go the C++-y way and not add as a method anything that can be expressed as a private function, but what's the point of OO if you can't add methods to objects?

I noticed a pre-existing weirdness: the final worker.state is either 'dead' or 'disconnected', depending on which event happened last. This strikes me as odd. Even more strangely, until this change, I can find no evidence that the worker state was actually used: some trouble is gone through to assign it various values, but no comparisons are ever made against it (your code is the first to use it, at least in cluster.js). This implies that it exists to be public and used elsewhere, but it's undocumented. Hm. A mystery.

Fwiw, it would be possible to delete the (apparently unused) .state, and instead write isDead() as `this.process.exitCode != null || this.process.signalCode != null`, and isDisconnected() as `!this.process.connected`, reducing the amount of double tracking of state.
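
Written out, that suggestion would look roughly like this (a sketch against the Worker constructor in lib/cluster.js; in the master, this.process is the ChildProcess handle, whose exitCode and signalCode become non-null once the child exits):

```js
// Sketch: derive liveness from process state instead of the mirrored
// `state` property.
Worker.prototype.isDead = function isDead() {
  return this.process.exitCode != null || this.process.signalCode != null;
};

Worker.prototype.isDisconnected = function isDisconnected() {
  // `connected` turns false once the IPC channel closes.
  return !this.process.connected;
};
```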

misterdjules force-pushed the misterdjules:fix-issue-8191 branch 2 times, most recently from 7b432fb to 2b19fa3 Aug 21, 2014
@@ -595,7 +633,7 @@ function workerInit() {
 Worker.prototype.destroy = function() {
   this.suicide = true;
-  if (!process.connected) process.exit(0);
+  if (!this.isCOnnected()) process.exit(0);

sam-github (Member) commented Aug 21, 2014

"COnnected"? You will find this when you test.. :-)

misterdjules (Author) commented Aug 21, 2014

@sam-github Good catch. For some reason I haven't had time to investigate, all tests pass on my machine even with this typo. Do they pass on yours too?

misterdjules (Author) commented Aug 21, 2014

@sam-github Fixed the typo. I will check why the tests didn't catch it and update the tests if needed.

sam-github (Member) commented Aug 21, 2014

I didn't run the tests, just read the diff, but I would bet the code isn't covered. You need to call kill or destroy in a worker to reach it, which is quite unusual. Particularly since this function accidentally became partially undocumented when destroy() was re-named to kill()...

misterdjules (Author) commented Aug 21, 2014

@sam-github Created a new issue here: #8223.

misterdjules (Author) commented Aug 21, 2014

@sam-github Thank you, Sam. Indeed, my implementation of Worker.prototype.isDead and Worker.prototype.isDisconnected was broken. They wouldn't have been usable outside of this specific use case, which would completely defeat the purpose of adding them to the prototype.

In the latest update to this PR, I changed their implementation so that they don't rely on the state property, which, as you said, can't be both 'dead' and 'disconnected', but instead on the process's state. I also renamed isDisconnected to isConnected and updated the implementation accordingly.

If we decide to leave them on the Worker prototype, the documentation will need to be updated to reflect that before merging these changes. We will also need to provide some tests for the two newly added methods. I will wait for feedback from the core team before doing that.
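
For reference, a minimal sketch of the renamed accessor (note the polarity flip relative to the isDisconnected() idea above):

```js
Worker.prototype.isConnected = function isConnected() {
  return this.process.connected;
};
```

Call sites then test !worker.isConnected() where they would previously have tested worker.isDisconnected().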

Thanks for the review @sam-github!

/cc @tjfontaine

misterdjules force-pushed the misterdjules:fix-issue-8191 branch from 2b19fa3 to 90c2918 Aug 21, 2014
trevnorris added the cluster label Aug 21, 2014
indutny (Member) commented Aug 27, 2014

LGTM, please add the tests and docs.

The updated commit message reads:

Currently, cluster workers can be removed from the workers list in three
different places:

- In the exit event handler for the worker process.
- In the disconnect event handler of the worker process.
- In the disconnect event handler of the cluster master.

However, handles for a given worker are cleaned up only in one of these
places: in the cluster master's disconnect event handler.

Because these events happen asynchronously, it is possible for the
workers list to be empty before a given worker's handle has been
cleaned up. When that happens, the assert that checks that no handle
is left once the workers list is empty fails.

This commit removes the worker from the cluster.workers list only when
the worker is dead _and_ disconnected, at which point we're sure that
its associated handles are cleaned up.

Fixes #8191 and #8192.
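
A condensed sketch of that approach (helper names are illustrative, and the emission ordering relative to user listeners is glossed over; see the documentation change for the exact guarantee):

```js
var cluster = require('cluster');

function removeHandlesForWorker(worker) { /* hypothetical cleanup helper */ }

// Called from both the 'exit' and 'disconnect' paths; only the one that
// runs last sees the worker both dead and disconnected and removes it.
function maybeRemoveWorker(worker) {
  if (worker.isDead() && !worker.isConnected()) {
    removeHandlesForWorker(worker);
    delete cluster.workers[worker.id];
  }
}

function watchWorker(worker) {
  worker.process.once('exit', function() { maybeRemoveWorker(worker); });
  worker.once('disconnect', function() { maybeRemoveWorker(worker); });
}
```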
misterdjules force-pushed the misterdjules:fix-issue-8191 branch from 90c2918 to f5d8660 Aug 27, 2014
@@ -587,7 +587,7 @@
 };

 startup.processKillAndExit = function() {
-  process.exitCode = 0;

misterdjules (Author) commented Aug 27, 2014

Initializing process.exitCode to 0 here doesn't seem consistent with how it's used elsewhere in the code. It means that worker.process.exitCode is 0 in a worker when it starts executing, even though it has not exited yet.

The convention in the rest of the code seems to be that process.exitCode is undefined until the process has exited.

With this change, all tests pass, but I understand it could have a big impact, so I'd rather highlight it now so that we can consider it carefully.
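
A small illustration of why that initialization matters, assuming the isDead() accessor added by this PR (inside a worker, cluster.worker.process is the global process object): if bootstrap pre-set process.exitCode to 0, a freshly started worker would already satisfy exitCode != null and look dead.

```js
var assert = require('assert');
var cluster = require('cluster');

if (cluster.isMaster) {
  cluster.fork();
} else {
  // With the bootstrap initialization removed, exitCode stays undefined
  // until the process is actually exiting...
  assert.strictEqual(process.exitCode, undefined);
  // ...so a live worker does not report itself as dead.
  assert.strictEqual(cluster.worker.isDead(), false);
  process.exit(0);
}
```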

misterdjules (Author) commented Aug 27, 2014

@indutny Updated my changes, ready to review again. Thank you!

indutny (Member) commented Sep 2, 2014

Landed in 90d1147, thank you!

indutny closed this Sep 2, 2014