controller/server load can disrupt heartbeat #1304

minrk opened this Issue Jan 20, 2012 · 3 comments



2 participants

minrk commented Jan 20, 2012

For code that uses ZMQStreams for heartbeat (notebook, hub), it is actually possible for server load to disrupt the heartbeat. I'm not sure if this actually happens, but it is a real possibility. I expect the most likely case is the Hub, which can have very long blocking calls when using the mongodb/sqlite task stores.

The relevant aspects of the code:

  • stream.send(msg) does not actually send a message; it registers a message for sending, which will happen at the top of the next loop iteration at the earliest.
  • server load can make this loop iteration take an arbitrarily long time, inserting a potentially long delay between stream.send(msg) and socket.send(msg).
  • the heartbeat timer starts simultaneously with stream.send(), not socket.send(), so the above delay is interpreted as a delay in the heartbeat response, and can be longer than the total heartbeat period, causing a heart failure.
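
The three points above can be illustrated with a toy model. This is only a sketch with a simulated clock: `FakeClock`, `ToyStream`, and `measured_delay` are hypothetical names made up for illustration, not pyzmq classes, and the 1.0s heartbeat period is an assumed value.

```python
from collections import deque


class FakeClock:
    """Simulated time, so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class ToyStream:
    """Hypothetical stand-in for a ZMQStream: send() only queues the message."""

    def __init__(self, clock):
        self.clock = clock
        self._queue = deque()
        self.wire_times = []  # when each message actually reached the "socket"

    def send(self, msg):
        # Registers the message; nothing goes out yet.
        self._queue.append(msg)

    def flush(self):
        # Stands in for the loop pushing queued messages out through the socket.
        while self._queue:
            self._queue.popleft()
            self.wire_times.append(self.clock.now)


def measured_delay(blocking_time, heartbeat_period=1.0):
    clock = FakeClock()
    stream = ToyStream(clock)

    ping_started = clock.now      # heartbeat timer starts with stream.send()
    stream.send(b"ping")
    clock.advance(blocking_time)  # e.g. a long blocking database call
    stream.flush()                # next loop iteration: the real send happens

    delay = stream.wire_times[0] - ping_started
    return delay, delay > heartbeat_period  # (delay, would the heart "fail"?)


print(measured_delay(0.01))
print(measured_delay(3.0))
```

In this model, the responder has done nothing wrong in the second call: the whole 3s is spent before the ping ever leaves the process, yet it is counted against the heartbeat.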

Possible solutions:

  1. ZMQStreams have an 'on_send' method for registering callbacks for when the actual Socket sends the message. If we schedule the next heartbeat relative to that, rather than as a PeriodicCallback with stream.send, then the heartbeat measure would be accurate.
  2. call stream.flush() immediately after stream.send(), which will cause the socket to send the message, if possible (it will never be impossible with our current patterns).
  3. directly call stream.socket.send(), bypassing the issue entirely.
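
As a rough comparison of options 1 and 2, here is a toy model with a simulated clock. `ToyStream` and the two helper functions are hypothetical names for illustration only; the `on_send` callback here takes just the message, which is a simplification and not the signature pyzmq uses. Option 3 would behave like the immediate flush in this model, since the message goes out at the moment of the call.

```python
from collections import deque


class ToyStream:
    """Hypothetical stand-in for a ZMQStream; not the pyzmq class."""

    def __init__(self):
        self.now = 0.0       # simulated clock, in seconds
        self._queue = deque()
        self._on_send = None
        self.sent_at = []    # when each message actually left the "socket"

    def on_send(self, callback):
        # Simplified: callback receives only the message.
        self._on_send = callback

    def send(self, msg):
        self._queue.append(msg)  # queued, not sent

    def flush(self):
        while self._queue:
            msg = self._queue.popleft()
            self.sent_at.append(self.now)
            if self._on_send is not None:
                self._on_send(msg)


def delay_with_flush(blocking_time, flush_immediately):
    """Option 2: timer starts at send(); optionally flush right away."""
    stream = ToyStream()
    timer_started = stream.now
    stream.send(b"ping")
    if flush_immediately:
        stream.flush()
    stream.now += blocking_time  # the loop is busy with something else
    stream.flush()               # what the next loop iteration would do
    return stream.sent_at[0] - timer_started


def delay_with_on_send(blocking_time):
    """Option 1: start the timer from the on_send callback instead."""
    stream = ToyStream()
    timer = {}
    stream.on_send(lambda msg: timer.setdefault("started", stream.now))
    stream.send(b"ping")
    stream.now += blocking_time
    stream.flush()
    return stream.sent_at[0] - timer["started"]


print(delay_with_flush(3.0, flush_immediately=False))
print(delay_with_flush(3.0, flush_immediately=True))
print(delay_with_on_send(3.0))
```

Both approaches remove the spurious delay in this model: the flush does so by sending earlier, the callback by starting the timer later.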

I don't have a good sense of which solution is best. 3 seems like the simplest option, but 2 seems almost equivalent. 1 is a bit more complex, but it targets the underlying problem more directly. I don't see any reason why any of these would not work, though.

minrk commented Jan 23, 2012

I did 2 in PR #1312. The case against 3 is that it's sort of a private API call, whereas 2 is the ~official way to send immediately. I honestly don't know which is better: 2 is more 'official', while 3 is more direct.

minrk closed this in 1487f2f Jan 23, 2012

minrk commented Jan 23, 2012

By the way, I was able to induce this heart failure with artificial load in the notebook server (sleep longer than heartbeat in PeriodicCallback), and PR #1312 successfully prevented this from causing a heart failure, even with heartbeats of 0.1s.
