Correct handling of term updates during the election for Candidate/Follower #55

Merged: ktoso merged 2 commits into ktoso:master on Aug 8, 2015

Conversation

dmitraver
Contributor

According to the Raft paper:
"Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request."

This pull request implements these rules for the election phase of the Follower and Candidate states. It also fixes issues #46 and #47.
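
The rule quoted above can be sketched in plain Scala, independent of Akka. This is only an illustration under simplifying assumptions: `Server`, `Outcome`, and `onRequest` are hypothetical names, not akka-raft's actual `Meta` classes or FSM handlers.

```scala
object TermRuleSketch {
  sealed trait Role
  case object Follower extends Role
  case object Candidate extends Role
  case object Leader extends Role

  // Hypothetical, simplified server state (not akka-raft's Meta classes).
  final case class Server(role: Role, currentTerm: Long)

  sealed trait Outcome
  case object Process extends Outcome // term is acceptable: handle the request
  case object Reject extends Outcome  // stale term: reject the request

  /** Applies the term-exchange rule to a request carrying `msgTerm`. */
  def onRequest(s: Server, msgTerm: Long): (Server, Outcome) =
    if (msgTerm > s.currentTerm)
      // Newer term seen: adopt it and revert to follower, whatever the role was.
      (Server(Follower, msgTerm), Process)
    else if (msgTerm < s.currentTerm)
      // Stale request: keep state and term unchanged, reject.
      (s, Reject)
    else
      (s, Process)
}
```

For example, a candidate in term 2 that sees term 5 becomes a follower in term 5, while a leader in term 5 rejects a request from term 3.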

@dmitraver
Contributor Author

After running the whole test suite a few times, I've discovered that the multi-JVM tests fail intermittently, and for now I have no idea why. I need to investigate these test cases more deeply and update my pull request accordingly.

case Event(msg: RequestVote, m: ElectionMeta) if m.canVoteIn(msg.term) =>
sender ! VoteCandidate(m.currentTerm)
stay() using m.withVoteFor(msg.term, candidate())
case Event(msg: RequestVote, m: ElectionMeta) if msg.term < m.currentTerm =>
Contributor

Minor nit: it looks like the rest of the source code uses two spaces for indentation, whereas this pull request has tabs

Contributor Author

Thx for pointing this out, I will fix that.

@colin-scott
Contributor

Random guess: I bet that the multi-jvm tests are inherently somewhat non-deterministic.

@dmitraver
Contributor Author

@colin-scott Hi, Colin. I guess you're right, but it's not good anyway. The problem is that the term updates handled by this PR weren't previously reflected in the code, and these new changes can cascade and affect other parts of the code that depend on them, because everything is linked to some extent.

@colin-scott
Contributor

Just realized: I think you should also have the leader step down if term > currentTerm.

It currently stay()s:

https://github.com/ktoso/akka-raft/blob/master/src/main/scala/pl/project13/scala/akka/raft/Leader.scala#L28

def forFollower: Meta = Meta(clusterSelf, currentTerm, config, Map.empty)
def forNewElection: ElectionMeta = this.forFollower.forNewElection
def forLeader: LeaderMeta = LeaderMeta(clusterSelf, currentTerm, config)
def forFollower(term: Term = currentTerm): Meta = Meta(clusterSelf, term, config, Map.empty)
Contributor

Rather than having a default argument here, maybe it would be cleaner to chain the methods?

e.g.

goto(Follower) using m.forFollower.withTerm(term)

Contributor Author

Nice catch! I also thought about that. There is a trade-off between efficiency and readability, and I just wanted to avoid unnecessary copying of objects where possible. Sure, I know about premature optimization :) I think you're right; there is also one more place where the same "side effect" is hidden.

Owner

In this particular piece I'm ok with the default param, though the explicit version would be nice as well. No need to change just yet.
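
The two styles weighed in this thread can be compared side by side. The sketch below uses hypothetical, stripped-down stand-ins for `Meta` and `ElectionMeta`; `withTerm` is the method name from the reviewer's suggestion and `asFollower` is invented here so both styles can coexist in one class. The real classes carry more fields.

```scala
object MetaStyleSketch {
  // Hypothetical stand-in for the follower metadata.
  final case class Meta(currentTerm: Long, votes: Map[Long, String]) {
    // Returns a copy with the term replaced (one extra allocation).
    def withTerm(term: Long): Meta = copy(currentTerm = term)
  }

  final case class ElectionMeta(currentTerm: Long, votesReceived: Int) {
    // Style 1: default argument, builds the follower Meta in a single step.
    def forFollower(term: Long = currentTerm): Meta = Meta(term, Map.empty)

    // Style 2: parameterless conversion, meant to be chained with withTerm.
    def asFollower: Meta = Meta(currentTerm, Map.empty)
  }
}
```

Both `m.forFollower(7)` and `m.asFollower.withTerm(7)` produce the same value; the chained form allocates an intermediate `Meta`, which is the copying cost mentioned above.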

@dmitraver
Contributor Author

This can be an issue, though. The leader should follow the same term update rules as the candidate. I will check that later.

@dmitraver
Contributor Author

After investigating the test failures for some time, I've found that the election tests fail intermittently on one of the three JVM nodes with the following output:

[JVM-1] [WARN] [08/07/2015 15:50:22.389] [RaftClusterSpec-akka.remote.default-remote-dispatcher-5] [akka.tcp://RaftClusterSpec@127.0.0.1:33394/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FRaftClusterSpec%40127.0.0.1%3A58897-4] Association with remote system [akka.tcp://RaftClusterSpec@127.0.0.1:58897] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
[JVM-1] [INFO] [08/07/2015 15:50:22.389] [RaftClusterSpec-akka.actor.default-dispatcher-19] [akka://RaftClusterSpec/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FRaftClusterSpec%40127.0.0.1%3A51937-4] Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://RaftClusterSpec/deadLetters] to Actor[akka://RaftClusterSpec/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FRaftClusterSpec%40127.0.0.1%3A51937-4#-372792519] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[JVM-1] [WARN] [08/07/2015 15:50:38.813] [RaftClusterSpec-akka.remote.default-remote-dispatcher-5] [Remoting] Association to [akka.tcp://RaftClusterSpec@127.0.0.1:58897] having UID [1443814610] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

The same error messages can be observed in the failed Travis build log. I don't actually know what causes this problem, but it's probably not related to bugs in the code itself.

@ktoso
Owner

ktoso commented Aug 7, 2015

Gating happens when a connection to another node fails (i.e. in multi-node tests when a node is killed): the node notices it cannot talk to the other node and "gates" it for a while, trying to re-establish the connection later on.

@dmitraver
Contributor Author

@ktoso Is there any way to prevent it from failing, or is it expected behavior? I can retrigger the build and it will eventually succeed, but that's not the best solution, I guess.

@ktoso
Owner

ktoso commented Aug 7, 2015

I'll need to look at where exactly in the tests it happens; some tests could be purposefully killing nodes, in which case it's totally expected. Sadly, I'm now prepping and delivering a webinar, so I will only get to it later.

Thanks for all the work btw! I hope to get to reviewing soon.

stay()

case Event(msg: RequestVote, m: ElectionMeta) if msg.term > m.currentTerm =>
log.info("Received newer {}. Current term is {}. Revert to follower state.", msg.term, m.currentTerm)
Contributor

Perhaps you should forward the RequestVote message instead of dropping it?

e.g.

m.clusterSelf forward msg
goto(Follower) using m.forFollower(msg.term)

Owner

This sounds like an optimisation (i.e. not needed for correctness), right?
If so let's skip it for now - let's aim for pure correctness in the upcoming work and PRs :)

Contributor

Yeah, it's an optimization

Owner

Made it an issue so we can re-visit it later #61
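
The difference between dropping and forwarding the message, as described in this thread, can be modeled without Akka. This is a rough sketch under assumptions: the types and function names are hypothetical, and the follower's vote rule is simplified (the log up-to-date check from the Raft paper is omitted).

```scala
object StepDownSketch {
  final case class RequestVote(term: Long, candidateId: String)
  final case class Follower(currentTerm: Long, votedFor: Option[String])
  final case class Candidate(currentTerm: Long)

  // Simplified follower rule: grant the vote if the terms match and no
  // other candidate has received this follower's vote in that term.
  def followerHandle(f: Follower, rv: RequestVote): (Follower, Boolean) =
    if (rv.term == f.currentTerm && f.votedFor.forall(_ == rv.candidateId))
      (f.copy(votedFor = Some(rv.candidateId)), true)
    else
      (f, false)

  // Step down on a newer term and drop the request: no vote is cast,
  // so the sender has to ask again (or time out).
  def stepDownAndDrop(c: Candidate, rv: RequestVote): (Follower, Boolean) =
    (Follower(rv.term, None), false)

  // Step down and re-handle the same request as a follower:
  // the vote can be granted in the same round.
  def stepDownAndForward(c: Candidate, rv: RequestVote): (Follower, Boolean) =
    followerHandle(Follower(rv.term, None), rv)
}
```

Both variants end in the follower state with the newer term; forwarding only saves a round trip, which matches the conclusion above that it is an optimization rather than a correctness requirement.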

@@ -64,7 +91,7 @@ private[raft] trait Candidate {
if (leaderIsAhead) {
log.info("Reverting to Follower, because got AppendEntries from Leader in {}, but am in {}", append.term, m.currentTerm)
m.clusterSelf forward append
- goto(Follower) using m.forFollower
+ goto(Follower) using m.forFollower()
Owner

Minor: I actually like to use forFollower-style methods (without parens), as that indicates the method has no side effects. Imagine it's a getter / field.
That style follows the uniform access principle: http://docs.scala-lang.org/glossary/#uniform-access-principle

Contributor

FWIW, with a default parameter (as @dmitraver added), it doesn't compile without parens
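
A small illustration of the point about parens, using hypothetical minimal classes (`asFollower` is an invented name so both method shapes can sit in one class):

```scala
object ParensSketch {
  final case class Meta(term: Long)

  final case class ElectionMeta(currentTerm: Long) {
    // Parameterless method: reads like a field at the call site.
    def asFollower: Meta = Meta(currentTerm)

    // Method with a default parameter: the call site needs an argument
    // list, even an empty one, for the default to be applied.
    def forFollower(term: Long = currentTerm): Meta = Meta(term)
  }

  val m: ElectionMeta = ElectionMeta(4)
  val a: Meta = m.asFollower    // no parens
  val b: Meta = m.forFollower() // empty parens select the default term
  // val c: Meta = m.forFollower // does not compile: missing argument list
}
```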

@ktoso
Owner

ktoso commented Aug 8, 2015

Changes make sense, thanks a lot @dmitraver!
I looked at the failures and would say they're timing problems caused by await timeouts that are too small for the Travis box (it's pretty slow).

I'll merge the changes and work a bit on the infra / timeouts to make the tests less flaky.

@ktoso
Owner

ktoso commented Aug 8, 2015

For the next contrib @dmitraver please use 2-spaces though... 😉

ktoso added a commit that referenced this pull request Aug 8, 2015
Correct handling of term updates during the election for Candidate/Follower
@ktoso ktoso merged commit e656196 into ktoso:master Aug 8, 2015
@dmitraver
Contributor Author

@ktoso Thx for the hint ;) Will do.


3 participants