-
Notifications
You must be signed in to change notification settings - Fork 995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pre-vote campaign #474
Pre-vote campaign #474
Conversation
Appears to resolve #31 |
Nice work! Hopefully PreVote solves the issue - when running in Kubernetes - that when StatefulSet pods are restarting they will trigger an election because their DNS entry is not immediately available and they will start sending RequestVote's which will trigger a read-only-phase/election. Any idea how this relates to issues remaining without CheckQuorum? I found the mention of this great article in #31: The situation explained there can happen easily on a Kubernetes StatefulSet too. However, maybe CheckQuorum is already implemented in this implementation due to the gRPC transport? Do you know this? |
@hermanbanken I don't believe checkQuorum is done in |
Hey there, |
Still relevant |
raft.go
Outdated
// Create a response channel | ||
respCh := make(chan *voteResult, len(r.configurations.latest.Servers)) | ||
|
||
// PreVote campaigns are sent on the next term. | ||
term := r.getCurrentTerm() + 1 | ||
if preVote { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about this part. We increment the current term by two if preVote is true; I would have thought we would leave it unchanged.
Part of my confusion I'm sure is because I've not found a very detailed description of Pre-Vote in the literature.
In Ongaro's thesis Pre-Vote is described as 'In the Pre-Vote algorithm, a candidate only increments its term if it first learns from a majority of the cluster that they would be willing to grant the candidate their votes' (9.6, pg 137 in my copy). Elsewhere (4.2.3 disruptive servers) "Only if the candidate believed it could get votes from a majority of the cluster would it increment its term and start a normal election."
Looking at the algorithm description in "4 modifications for Raft consensus" I think the intention is to increment by one, but they also talk about the need to avoid 'term inflation' and how this is avoided by Pre-Vote, which I freely admit I don't fully understand how to reconcile.
I'm still looking at other implementations for insight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of my confusion I'm sure is because I've not found a very detailed description of Pre-Vote in the literature.
This took some effort to actually originally page this in as not to disrupt the original implementation, I'll have to look over my old notes about this.
Sorry, I can't be much more helpful atm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @SimonRichardson, @markan,
my understanding based on the same literature you are reading @markan is that a node should not increment it's own term for pre-vote to avoid term inflation but still need to send a pre-vote request with a term equal to what it would send if asking for a normal vote.
Based on this, I think the correct way of dealing with this in here is to only setCurrentTerm
if preVote
is false and only increment the term once. That way the node will send the exact same term for pre-vote and for normal vote if it decide to start an election after the pre-vote.
pre-vote should not increment the node term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on this, I think the correct way of dealing with this in here is to only setCurrentTerm if preVote is false and only increment the term once. That way the node will send the exact same term for pre-vote and for normal vote if it decide to start an election after the pre-vote.
It seems like this is correct. I've spent some time reading back through this. Originally my notes say that it should be sent on the next term, but reading back through the papers we should send this on the same term as a normal vote.
@markan When I get a chance, I'll rebase against main and update the term update. |
This is a very rough draft on adding pre-voting (see 9.6 of the raft thesis) to raft. The commit is just familiarising the internal code base before ensuring the correct implementation. In the Pre-Vote algorithm, a candidate only increments its term if it first learns from a majority of the cluster that they would be willing to grant the candidate their votes. Thus saving any potential disruption when a server attempts to rejoin.
To ensure we haven't regressed any existing tests, we run the full test suite with PreVoting campaign off and on. Although not ideal, any new and old tests are run with both PreVoting on and off. It would be better to have some sort of test suite for this kind of thing, yet we can replicate that same behaviour here.
Ensuring we send prevotes on the same term, prevents term inflation. Although not bad in general, it prevents issues in large scale deployments where terms can be exhausted over long periods of time.
0be5ab7
to
00112c5
Compare
@markan this was rebased and fixed to just increment the term by 1 for all queries to prevent hyperinflation of terms. |
Thanks a lot for working on this @SimonRichardson. We're running a lot of vault clusters and are getting bitten by this very often. I hope @markan gets a chance to complete the review soon. |
Any movement on this one? Love to see it landed. |
Hey there, |
Still relevant! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This optimization is something I've wanted for a long time. Thanks @SimonRichardson for contributing.
I might be missunderstanding (end of a long day) but I had a few questions inline. It seems like this doesn't actually implement the PreVote
optimization as I understood it and I'm not clear that it actually prevents the problems it's designed to.
I'll look again soon, but if you can point out anything I've missed that would be great!
// Create a response channel | ||
respCh := make(chan *voteResult, len(r.configurations.latest.Servers)) | ||
|
||
// Increment the term | ||
// PreVote campaigns are sent on the same term. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, the comment points out that PreVote
shouldn't increment the term (true) but the code still does increment the term even if preVote == true
.
Am I reading this wrong? If we are still incrementing the term then I don't think this is a working implementation of PreVote
because this candidate will still increment it's term as long as it's disconnected and then will still disrupt the cluster with a newer term when it is reconnected right?
Or did I missunderstand?
The following attempts to implement a pre-voting campaign optimization.