Perform some kind of heartbeating for workers #33

Closed
mschubert opened this issue May 31, 2017 · 4 comments


@mschubert
Owner

If workers crash because they run out of memory (without ulimit protection) or call C code that fails in a way try(...) cannot catch, they simply disappear and the master never completes.

Find an appropriate way to check whether workers are still up. Maybe combine this with PUSH/PULL sockets for work and REQ/REP for control (#30).
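
To make the idea concrete, here is a minimal sketch of master-side bookkeeping for such a check. It is written in Python with only the standard library and is not tied to ZeroMQ or to this package; `HeartbeatMonitor`, its methods, and the timeout value are hypothetical names chosen for illustration. The master records when each worker was last heard from and treats workers that stay silent for too long as gone.

```python
import time


class HeartbeatMonitor:
    """Tracks when each worker was last heard from (hypothetical helper)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a worker counts as dead
        self.clock = clock       # injectable so the logic can be tested without sleeping
        self.last_seen = {}      # worker id -> timestamp of the last message

    def beat(self, worker_id):
        """Record a heartbeat (or any other message) from a worker."""
        self.last_seen[worker_id] = self.clock()

    def dead_workers(self):
        """Return the ids of workers that have been silent longer than the timeout."""
        now = self.clock()
        return sorted(w for w, t in self.last_seen.items() if now - t > self.timeout)

    def drop(self, worker_id):
        """Forget a worker, e.g. after its work has been reassigned."""
        self.last_seen.pop(worker_id, None)
```

In this sketch the master would call `beat()` for every message it receives and check `dead_workers()` in its polling loop, so that it can reassign outstanding work or stop with an error instead of waiting forever.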

@mschubert mschubert added this to the v0.8.0 milestone Jun 8, 2017
@mschubert mschubert added idea and removed enhancement labels May 2, 2018
@mschubert mschubert removed this from the v0.9.0 milestone May 5, 2018
@mschubert mschubert added this to the undecided milestone May 15, 2018
@mschubert mschubert modified the milestones: undecided, v0.9.0 Dec 28, 2018
@mschubert
Owner Author

mschubert commented Dec 31, 2018

Note: this could also be handled by setting TCP keepalive socket options (that are accessible via ZeroMQ).

With the default settings a dead connection is only dropped after just over 2 hours, which is too long. These options can be customized per socket using

  • TCP_KEEPCNT
  • TCP_KEEPIDLE
  • TCP_KEEPINTVL

on the following operating systems (source):

  • Linux >= 2.4 (2001)
  • OS X >= 10.7.0 ("Lion"; 2011)
  • Windows >= 10.0.1709 (original 2015; release October 2017)
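
As a rough illustration of these three options, the sketch below sets them on a plain TCP socket using Python's standard `socket` module rather than through ZeroMQ (libzmq exposes analogous options as `ZMQ_TCP_KEEPALIVE`, `ZMQ_TCP_KEEPALIVE_IDLE`, `ZMQ_TCP_KEEPALIVE_INTVL` and `ZMQ_TCP_KEEPALIVE_CNT`). The function names and the timings of 60/10/3 are assumptions for illustration. The helper `keepalive_detection_seconds` shows where "just over 2 hours" comes from, assuming the usual Linux defaults of 7200 s idle time, 75 s probe interval and 9 probes.

```python
import socket


def keepalive_detection_seconds(idle, interval, count):
    """Worst-case time until a silently dead peer is detected."""
    return idle + interval * count


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive and set per-socket timings where available.

    Returns the options that were actually applied, because not every
    OS or Python build defines all three constants.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {}
    wanted = (("TCP_KEEPIDLE", idle), ("TCP_KEEPINTVL", interval), ("TCP_KEEPCNT", count))
    for name, value in wanted:
        opt = getattr(socket, name, None)  # constant may be missing on this platform
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            continue  # constant defined but not supported by this OS version
        applied[name] = value
    return applied


# Assumed Linux defaults: 7200 s + 9 probes * 75 s = 7875 s (2 h 11 min 15 s)
DEFAULT_DETECTION = keepalive_detection_seconds(7200, 75, 9)
```

With the illustrative timings above, a dead peer would be noticed by the operating system after at most 60 + 3 * 10 = 90 seconds.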

@mschubert
Owner Author

Update: TCP keepalive does not work because disconnected peers will not raise a ZeroMQ message or exception. Also, killed (kill -9) workers still send a FIN packet to the master, which probably cleans up the low-level connection correctly, but we have no access to this via the ZeroMQ API.

Alternatively, application-level heartbeats could be used (ZeroMQ>=4), which would break e.g. Ubuntu support.

@mschubert
Owner Author

I should also consider worker timeouts: If they miss heartbeats, also shut them down (irrespective of running computations) (ref: #131)
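
One possible reading of this, sketched in Python with only the standard library: the worker runs a small watchdog thread next to its computation and triggers a shutdown callback if nothing has been heard from the master for too long. `MasterWatchdog`, `heartbeat()` and `on_timeout` are hypothetical names, and the sketch makes no claim about how the package itself would implement this.

```python
import threading
import time


class MasterWatchdog:
    """Calls on_timeout if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout          # e.g. a function that terminates the worker
        self._last = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.heartbeat()                      # count the timeout from the moment we start
        self._thread.start()

    def heartbeat(self):
        """Record that the master was just heard from."""
        with self._lock:
            self._last = time.monotonic()

    def stop(self):
        """Stop watching without triggering the callback."""
        self._stop.set()

    def _run(self):
        # Poll a few times per timeout period, independent of the main thread's work
        while not self._stop.wait(self.timeout / 4):
            with self._lock:
                silent_for = time.monotonic() - self._last
            if silent_for > self.timeout:
                self.on_timeout()
                return
```

Because the check runs in its own thread, it fires even while the main thread is busy with a computation, which matches the "irrespective of running computations" requirement in spirit. What `on_timeout` does (for example, terminating the process) is left open here.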

@mschubert
Owner Author

Closing because #150 is a better approach

@mschubert mschubert removed this from the v1.0 milestone Jun 25, 2020