controller/server load can disrupt heartbeat #1304

Closed
minrk opened this Issue · 3 comments

2 participants

@minrk
Owner

For code that uses ZMQStreams for heartbeat (notebook, hub), it is actually possible for server load to disrupt the heartbeat. I'm not sure if this actually happens, but it is a real possibility. I expect the most likely case is the Hub, which can have very long blocking calls when using the mongodb/sqlite task stores.

The relevant aspects of the code:

  • stream.send(msg) does not actually send a message, it registers a message for sending, which will happen at the top of the next loop iteration at the earliest.
  • server load can make this loop iteration take an arbitrarily long time, inserting a potentially long delay between stream.send(msg) and socket.send(msg).
  • heartbeat timer starts simultaneously with stream.send(), not socket.send(), so the above delay is interpreted as a delay in the heartbeat response, and can be longer than the total heartbeat period, causing a heart failure.
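The sequence in the three points above can be illustrated with a small toy model. This is only a sketch: `ToyLoop`, `ToyStream`, and `measured_delay` are hypothetical names invented for illustration, not pyzmq or IPython code, and time is simulated rather than measured. It assumes a loop that flushes queued messages at the top of each iteration and then runs callbacks, one of which blocks.

```python
# Toy model only: ToyLoop and ToyStream are hypothetical stand-ins,
# not pyzmq's IOLoop/ZMQStream. Time is simulated, not real.

class ToyLoop:
    def __init__(self):
        self.now = 0.0          # simulated clock, in seconds
        self.streams = []
        self.callbacks = []     # work to run on the next iteration

    def run_once(self):
        # Top of the iteration: queued messages reach the "socket".
        for stream in self.streams:
            stream.flush()
        # Then run this iteration's callbacks, which may block.
        pending, self.callbacks = self.callbacks, []
        for cb in pending:
            cb()


class ToyStream:
    def __init__(self, loop):
        self.loop = loop
        self.queue = []
        self.sent_at = []       # simulated time of each real send
        loop.streams.append(self)

    def send(self, msg):
        # Only registers the message; nothing goes out yet.
        self.queue.append(msg)

    def flush(self):
        while self.queue:
            self.queue.pop(0)
            self.sent_at.append(self.loop.now)


def measured_delay(blocking_time):
    """Gap between the heartbeat timer starting and the message leaving."""
    loop = ToyLoop()
    stream = ToyStream(loop)
    started = {}

    def beat():
        stream.send(b"ping")            # message is queued...
        started["t"] = loop.now         # ...and the heartbeat timer starts

    def blocking_call():
        loop.now += blocking_time       # e.g. a slow task-store query

    loop.callbacks = [beat, blocking_call]
    loop.run_once()                     # beat is queued, then the loop blocks
    loop.run_once()                     # message finally leaves the queue
    return stream.sent_at[0] - started["t"]
```

In this model, `measured_delay(3.0)` reports the whole blocking time as apparent heartbeat latency, even though the remote side has not yet been asked anything.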

Possible solutions:

  1. ZMQStreams have an 'on_send' method for registering callbacks for when the actual Socket sends the message. If we schedule the next heartbeat relative to that, rather than as a PeriodicCallback with stream.send, then the heartbeat measurement would be accurate.
  2. call stream.flush() immediately after stream.send(), which will cause the socket to send the message, if possible (it will never be impossible with our current patterns).
  3. directly call stream.socket.send(), bypassing the issue entirely.
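A rough way to compare the three options is to model where the timer starts relative to the real send. The sketch below is a toy under stated assumptions: `ToySocket`, `ToyStream`, and `apparent_delay` are hypothetical names, the classes only mimic the queue-then-flush behaviour described above, and they are not pyzmq's implementation. The clock is simulated.

```python
# Toy model only: these classes are hypothetical stand-ins, not pyzmq.

class ToySocket:
    def __init__(self, clock):
        self.clock = clock
        self.sent_at = None     # simulated time of the real send

    def send(self, msg):
        self.sent_at = self.clock["now"]


class ToyStream:
    def __init__(self, socket):
        self.socket = socket
        self._queue = []
        self._on_send = None

    def on_send(self, callback):
        # Callback fires when the socket actually sends.
        self._on_send = callback

    def send(self, msg):
        # Registers the message only; the socket is not touched.
        self._queue.append(msg)

    def flush(self):
        while self._queue:
            msg = self._queue.pop(0)
            self.socket.send(msg)
            if self._on_send is not None:
                self._on_send(msg)


def apparent_delay(strategy, blocking_time=3.0):
    """Gap between the heartbeat timer starting and the message leaving."""
    clock = {"now": 0.0}
    stream = ToyStream(ToySocket(clock))
    timer = {}

    def start_timer(msg=None):
        timer["start"] = clock["now"]

    if strategy == "on_send":           # option 1: time from the real send
        stream.on_send(start_timer)
        stream.send(b"ping")
    elif strategy == "flush":           # option 2: send, then flush at once
        start_timer()
        stream.send(b"ping")
        stream.flush()
    elif strategy == "direct":          # option 3: bypass the stream queue
        start_timer()
        stream.socket.send(b"ping")
    else:                               # behaviour described in this issue
        start_timer()
        stream.send(b"ping")

    clock["now"] += blocking_time       # the loop is busy with other work
    stream.flush()                      # top of the next loop iteration
    return stream.socket.sent_at - timer["start"]
```

In this model only the unmodified path charges the blocking time to the heartbeat; all three options reduce the gap between timer start and real send to zero, and they differ only in which API surface they rely on.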
@ellisonbg
Owner

I am not sure I have a good sense of which of the solutions would be better. I would say that 3 seems like the simplest option, but 2 seems almost equivalent. 1 sounds a bit more complex but perhaps it more directly targets the underlying problem. I don't see any reason why any of these would not work though.

@minrk
Owner

I did 2. in PR #1312. The case against 3. is that it's sort of a private API call, whereas 2. is the ~official way to send immediately. I honestly don't know which is better: 2. is more 'official', while 3. is more direct.

@minrk minrk closed this in 1487f2f
@minrk
Owner

By the way, I was able to induce this heart failure with artificial load in the notebook server (sleep longer than heartbeat in PeriodicCallback), and PR #1312 successfully prevented this from causing a heart failure, even with heartbeats of 0.1s.
