Cap maximum grpc wait time when heartbeating to heartbeatTimeout/2 #494
Hi, I saw this behavior as well.
When a voter stops, the leader's backoff mechanism ends up waiting more than 10 seconds (10ms * 1024 = 10.24s) between attempts to send heartbeats and replicate messages to that voter. When the voter comes back up, it hits its heartbeat timeout before the leader's next attempt and triggers a new election. This can cause a leadership change or flapping.
Does it make sense to disable backoff for heartbeats?
@mechpen We normally heartbeat every 1/10th of the heartbeat timeout. When these heartbeats fail, other warnings/errors are emitted in the logs. In these disconnected scenarios we perform exponential backoff so that we don't needlessly fill up the logs or use more network bandwidth than necessary. The bug we have encountered is that we back off for far too long, so a restarted server may hit its heartbeat timeout before the leader tries to contact it again.
The solution proposed in this PR drastically lowers the cap on how far we can back off from the usual rate, which ensures that we always send a heartbeat within the timeout value. With the cap at heartbeatTimeout/2 and the normal interval at heartbeatTimeout/10, in practice we only allow backing off by a factor of 5x the original interval, as opposed to the 100x it was previously. That 5x still means 1/5 as many warning logs, which could make figuring out what's going on during an incident a tiny bit easier, so I think the solution implemented in this PR is probably the way to go.