Perform some kind of heartbeating for workers #33
Comments
Note: this could also be handled by setting TCP keepalive socket options (which are accessible via ZeroMQ). By default, a dead connection is only dropped after just over 2 hours, which is too long. Customizing these options per socket is possible using
on the OSs (source)
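
For reference, the OS-level timers behind TCP keepalive can be sketched with Python's standard `socket` module. This is only an illustration on a plain TCP socket, not the ZeroMQ binding this project uses. The helper name `enable_keepalive` and the timing values are assumptions made up for the example, and the `TCP_KEEP*` constants are platform-dependent, so the sketch applies only the ones that exist.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive and, where the platform allows, shorten its timers.

    idle, interval and count are illustrative values, not recommendations.
    Returns the names of the socket options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ["SO_KEEPALIVE"]
    # These constants exist on Linux; other platforms may lack some of them.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied.append(name)
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(enable_keepalive(s))
```

With the example values, an unreachable peer would be given up on after roughly 60 + 3 * 10 = 90 seconds of silence instead of the default of just over 2 hours.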
Update: TCP keepalive does not work because disconnected peers will not raise a ZeroMQ message or exception. Also, killed (

Alternatively, application-level heartbeats could be used (ZeroMQ>=4), which would break e.g. Ubuntu support.
I should also consider worker timeouts: if they miss heartbeats, shut them down as well (irrespective of running computations) (ref: #131)
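
The timeout idea above can be sketched as master-side bookkeeping. This is a minimal illustration in Python using only the standard library, not this project's implementation: the class name `HeartbeatMonitor`, its parameters, and the rule "silent for longer than `interval * max_missed` seconds" are all assumptions chosen for the example. How heartbeats actually arrive (for instance over a ZeroMQ socket) is left out.

```python
import time


class HeartbeatMonitor:
    """Master-side record of when each worker was last heard from."""

    def __init__(self, interval=5.0, max_missed=3, clock=time.monotonic):
        # A worker counts as dead after max_missed silent intervals.
        self.timeout = interval * max_missed
        self.clock = clock  # injectable so the logic can be tested
        self.last_seen = {}

    def beat(self, worker_id):
        """Record a heartbeat (or any other message) from a worker."""
        self.last_seen[worker_id] = self.clock()

    def expired(self):
        """Return the ids of workers silent for longer than the timeout."""
        now = self.clock()
        return sorted(w for w, seen in self.last_seen.items()
                      if now - seen > self.timeout)

    def drop(self, worker_id):
        """Forget a worker once it has been shut down or written off."""
        self.last_seen.pop(worker_id, None)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(interval=0.1, max_missed=2)
    monitor.beat("worker-1")
    time.sleep(0.3)
    for worker in monitor.expired():
        # Here the master would shut the worker down and reschedule
        # its work, irrespective of any running computation.
        print("timed out:", worker)
        monitor.drop(worker)
```

Treating any received message as a heartbeat keeps busy workers from being written off while they are actively reporting results. A worker that is silent inside a long computation still needs a separate heartbeat channel or thread.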
Closing because #150 is a better approach.
If workers crash because they run out of memory (without `ulimit` protection) or call C code that is not caught by `try(...)`, they just disappear and the master never completes.

Find an appropriate way to check if workers are up. Maybe combine this with `PUSH`/`PULL` sockets for work and `REQ`/`REP` for control (#30)