Cap maximum grpc wait time when heartbeating to heartbeatTimeout/2 #494

HridoyRoy · 2022-03-09T17:35:31Z

This PR aims to resolve https://hashicorp.atlassian.net/browse/VAULT-5310 , and the associated GH vault issue hashicorp/vault#14153.

Context: Currently, if a follower is shut down for some amount of time, the leader cannot heartbeat to it and backs off exponentially. When the follower restarts, the leader waits longer than election_timeout before sending it a heartbeat, so the follower starts an election and increases its term, which causes leadership to change.

hashicorp-cla · 2022-03-12T16:36:36Z

All committers have signed the CLA.


briankassouf · 2022-03-19T23:11:26Z

util.go

+		if base > cap {
+			return cap
+		}


Suggested change

		if base > cap {
			return cap
		}

So long as the limit is less than or equal to 50 (assuming a 10ms base wait time), removing this should be fine. If we use a limit higher than that (which wouldn't make any logical sense, as the wait time would be astronomically high), we could run into integer overflow issues with the next multiplication.

This is practically safe but results in code with technically undefined behavior.

raft_test.go

mechpen · 2022-04-08T17:42:24Z

Hi, I saw this behavior as well.

When a voter stops, the leader's backoff mechanism waits for more than 10 seconds (10ms*1024) to send heartbeats and replicate messages to the voter. When the voter comes back up, it times out and triggers a new election. This could cause a leadership change or flap.

Does it make sense to disable backoff for heartbeats?

mkeeler · 2022-04-11T14:59:27Z

@mechpen We normally heartbeat every 1/10th of the heartbeat timeout. When these heartbeats fail, other warnings/errors are emitted in the logs. In these disconnected scenarios we perform exponential backoff so that we don't needlessly fill up the logs and use more network bandwidth than necessary. The bug we encountered is that we back off for far too long, so a restarted server may hit its heartbeat timeout before the leader tries it again.

The solution proposed in this PR drastically lowers the cap on how far we can back off from the usual rate, ensuring that we always send a heartbeat within the timeout value. In practice, the cap now allows backoff by only a factor of 5x the original value, as opposed to the 100x it was previously. That 5x still means there are 1/5 as many warning logs, which could make figuring out what's going on during an incident a tiny bit easier, so I think the solution implemented in this PR is probably the way to go.

HridoyRoy added 3 commits March 9, 2022 09:33

cap maximum grpc wait time when heartbeating to heartbeatTimeout/2

337234e

change timeout cap to heartbeatTimeout

bab8c54

test in progress

8c1dd50

briankassouf reviewed Mar 14, 2022

HridoyRoy added 2 commits March 14, 2022 13:31

added test stub used to check logs and see if a re-election occurs

790258c

modify exponential backoff to be capped at heartbeat timeout / 2 and …

085d373

…add test

HridoyRoy marked this pull request as ready for review March 16, 2022 18:35

HridoyRoy changed the title ~~cap maximum grpc wait time when heartbeating to heartbeatTimeout/2~~ Cap maximum grpc wait time when heartbeating to heartbeatTimeout/2 Mar 16, 2022

HridoyRoy requested review from briankassouf and ncabatoff March 16, 2022 18:40

remove comment

2411b31

briankassouf reviewed Mar 16, 2022

briankassouf requested a review from mkeeler March 16, 2022 18:51

briankassouf reviewed Mar 16, 2022

HridoyRoy and others added 3 commits March 17, 2022 11:27

Update util.go

47fe146

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

change s.failures to failures for heartbeat backoff

585ec5b

Merge branch 'hridoyroy/heartbeat-backoff' of github.com:hashicorp/ra…

9503919

…ft into hridoyroy/heartbeat-backoff

HridoyRoy requested a review from briankassouf March 17, 2022 18:36

briankassouf reviewed Mar 19, 2022

mkeeler reviewed Mar 23, 2022

ncabatoff mentioned this pull request May 3, 2022

Single Vault follower restart causes election even with established quorum hashicorp/vault#14153

Closed

make timeouts 100 milliseconds each and remove the time.Sleeps

81a7d27

HridoyRoy requested review from mkeeler and briankassouf May 9, 2022 18:50

mkeeler approved these changes May 9, 2022

HridoyRoy merged commit 9174562 into main May 9, 2022

HridoyRoy deleted the hridoyroy/heartbeat-backoff branch May 9, 2022 18:51

dhiaayachi mentioned this pull request Oct 5, 2022

bump raft version to v1.3.11 hashicorp/consul#14897

Merged

This was referenced Oct 27, 2022

Backport of bump raft version to v1.3.11 into release/1.12.x hashicorp/consul#15174

Closed

Backport of bump raft version to v1.3.11 into release/1.13.x hashicorp/consul#15175

Merged

biazmoreira mentioned this pull request Feb 19, 2024

[1.15.4] Raft leader election unexpected behavior hashicorp/vault#24995

Open

biazmoreira added the bug label Feb 19, 2024
