Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-vote campaign #474

Closed

Conversation

SimonRichardson
Copy link
Contributor

The following attempts to implement a pre-voting campaign optimization.

One downside of Raft’s leader election algorithm is that a server that has been partitioned from the
cluster is likely to cause a disruption when it regains connectivity. When a server is partitioned, it
will not receive heartbeats. It will soon increment its term to start an election, although it won’t
be able to collect enough votes to become leader. When the server regains connectivity sometime
later, its larger term number will propagate to the rest of the cluster (either through the server’s
RequestVote requests or through its AppendEntries response). This will force the cluster leader to
step down, and a new election will have to take place to select a new leader. Fortunately, such events
are likely to be rare, and each will only cause one leader to step down.

@rbhitchcock
Copy link

Appears to resolve #31

@hermanbanken
Copy link

hermanbanken commented Oct 23, 2021

Nice work! Hopefully PreVote solves the issue - when running in Kubernetes - that when StatefulSet pods are restarting they will trigger an election because their DNS entry is not immediately available and they will start sending RequestVote's which will trigger a read-only-phase/election.

Any idea how this relates to issues remaining without CheckQuorum? I found the mention of this great article in #31:
https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/

The situation explained there can happen easily on a Kubernetes StatefulSet too. However, maybe CheckQuorum is already implemented in this implementation due to the gRPC transport? Do you know this?

@SimonRichardson
Copy link
Contributor Author

@hermanbanken I don't believe checkQuorum is done in hashicorp/raft. I do have a rough idea of how to implement it, but I need to find time to be able to do it (as is life).

@stale
Copy link

stale bot commented Jan 9, 2022

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or
its documentation please feel let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

@stale stale bot added the waiting-reply label Jan 9, 2022
@hermanbanken
Copy link

Still relevant

@stale stale bot removed the waiting-reply label Jan 10, 2022
raft.go Outdated
// Create a response channel
respCh := make(chan *voteResult, len(r.configurations.latest.Servers))

// PreVote campaigns are sent on the next term.
term := r.getCurrentTerm() + 1
if preVote {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this part. We increment the current term by two if preVote is true; I would have thought we would leave it unchanged.

Part of my confusion I'm sure is because I've not found a very detailed description of Pre-Vote in the literature.

In Ongaro's thesis Pre-Vote is described as 'In the Pre-Vote algorithm, a candidate only increments its term if it first learns from a majority of the cluster that they would be willing to grant the candidate their votes' (9.6, pg 137 in my copy). Elsewhere (4.2.3 disruptive servers) "Only if the candidate believed it could get votes from a majority of the cluster would it increment its term and start a normal election."

Looking at the algorithm description in "4 modifications for Raft consensus" I think the intention is to increment by one, but they also talk about the need to avoid 'term inflation' and how this is avoided by Pre-Vote, which I freely admit I don't fully understand how to reconcile.

I'm still looking at other implementations for insight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of my confusion I'm sure is because I've not found a very detailed description of Pre-Vote in the literature.

This took some effort to actually originally page this in as not to disrupt the original implementation, I'll have to look over my old notes about this.

Sorry, I can't be much more helpful atm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SimonRichardson, @markan,
my understanding based on the same literature you are reading @markan is that a node should not increment it's own term for pre-vote to avoid term inflation but still need to send a pre-vote request with a term equal to what it would send if asking for a normal vote.
Based on this, I think the correct way of dealing with this in here is to only setCurrentTerm if preVote is false and only increment the term once. That way the node will send the exact same term for pre-vote and for normal vote if it decide to start an election after the pre-vote.
pre-vote should not increment the node term.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on this, I think the correct way of dealing with this in here is to only setCurrentTerm if preVote is false and only increment the term once. That way the node will send the exact same term for pre-vote and for normal vote if it decide to start an election after the pre-vote.

It seems like this is correct. I've spent some time reading back through this. Originally my notes say that it should be sent on the next term, but reading back through the papers we should send this on the same term as a normal vote.

@hashicorp-cla
Copy link

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

@SimonRichardson
Copy link
Contributor Author

@markan When I get a chance, I'll rebase against main and update the term update.

This is a very rough draft on adding pre-voting (see 9.6 of the raft
thesis) to raft. The commit is just familiarising the internal code base
before ensuring the correct implementation.

In the Pre-Vote algorithm, a candidate only increments its term if it
first learns from a majority of the cluster that they would be willing
to grant the candidate their votes. Thus saving any potential disruption
when a server attempts to rejoin.
To ensure we haven't regressed any existing tests, we run the full
test suite with PreVoting campaign off and on. Although not ideal,
any new and old tests are run with both PreVoting on and off. It
would be better to have some sort of test suite for this kind of
thing, yet we can replicate that same behaviour here.
Ensuring we send prevotes on the same term, prevents term inflation.
Although not bad in general, it prevents issues in large scale
deployments where terms can be exhausted over long periods of time.
@SimonRichardson
Copy link
Contributor Author

@markan this was rebased and fixed to just increment the term by 1 for all queries to prevent hyperinflation of terms.

@danmilon
Copy link

Thanks a lot for working on this @SimonRichardson. We're running a lot of vault clusters and are getting bitten by this very often. I hope @markan gets a chance to complete the review soon.

@manadart
Copy link

Any movement on this one? Love to see it landed.

@stale
Copy link

stale bot commented Sep 21, 2022

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or
its documentation please feel let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

@stale stale bot added the waiting-reply label Sep 21, 2022
@hermanbanken
Copy link

Still relevant!

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This optimization is something I've wanted for a long time. Thanks @SimonRichardson for contributing.

I might be missunderstanding (end of a long day) but I had a few questions inline. It seems like this doesn't actually implement the PreVote optimization as I understood it and I'm not clear that it actually prevents the problems it's designed to.

I'll look again soon, but if you can point out anything I've missed that would be great!

// Create a response channel
respCh := make(chan *voteResult, len(r.configurations.latest.Servers))

// Increment the term
// PreVote campaigns are sent on the same term.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, the comment points out that PreVote shouldn't increment the term (true) but the code still does increment the term even if preVote == true.

Am I reading this wrong? If we are still incrementing the term then I don't think this is a working implementation of PreVote because this candidate will still increment it's term as long as it's disconnected and then will still disrupt the cluster with a newer term when it is reconnected right?

Or did I missunderstand?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants