Any exception can cause two leaders to become elected in the same Term #62

Closed
colin-scott opened this Issue Aug 9, 2015 · 9 comments

Comments

Projects
None yet
3 participants
@colin-scott
Contributor

colin-scott commented Aug 9, 2015

The Raft protocol assumes a crash-recovery failure model -- that is, it allows for the possibility that crashed nodes will come back (with non-volatile state intact).

Currently, akka-raft does not write anything to disk. That's actually fine -- it just means that akka-raft currently assumes a crash-stop failure model, where crashed nodes are never allowed to come back.

However, akka-raft should enforce crash-stop! It currently does not, and this can lead to problems.

More specifically:

By default, akka's supervision strategy restarts any actor that throws an exception. When it restarts, all its state is reinitialized.

In the Raft paper, the votedFor variable is listed as persistent (non-volatile) state on each server. If votedFor is forgotten, a node may end up voting for two different candidates in the same term.

If RaftActor throws an exception for any reason, it may end up in the situation where it eventually votes for two different candidates in the same, possibly causing two different leaders to be elected.

I see two solutions:

  • Either write to disk
  • or force akka.actor.guardian-supervisor-strategy = akka.actor.StoppingSupervisorStrategy, so that nodes do not recover

I discovered this bug because one of my recent fixes accidentally triggered an exception -- and later in my fuzz run I saw that two leaders became elected in the same term.

@ktoso ktoso added the bug label Aug 9, 2015

@ktoso

This comment has been minimized.

Show comment
Hide comment
@ktoso

ktoso Aug 9, 2015

Owner

Good catch (as always), initially I was planning to go for the persistent route.

The persistence layer could be done via akka-persistence, here's an old ticket for it #7

Owner

ktoso commented Aug 9, 2015

Good catch (as always), initially I was planning to go for the persistent route.

The persistence layer could be done via akka-persistence, here's an old ticket for it #7

@ktoso ktoso added this to the 1.0 milestone Aug 9, 2015

@dmitraver

This comment has been minimized.

Show comment
Hide comment
@dmitraver

dmitraver Oct 9, 2015

Contributor

I'm working on it. Also I think it makes sense to update the libraries to the current akka version, at least the persistence module is no longer marked as experimental in 2.4.

Contributor

dmitraver commented Oct 9, 2015

I'm working on it. Also I think it makes sense to update the libraries to the current akka version, at least the persistence module is no longer marked as experimental in 2.4.

@ktoso

This comment has been minimized.

Show comment
Hide comment
@ktoso

ktoso Oct 9, 2015

Owner

Thanks @dmitraver!
Yeah, it makes sense to upgrade to 2.4, not only because of that, but also because a lot of cluster things have improved stability now.

Owner

ktoso commented Oct 9, 2015

Thanks @dmitraver!
Yeah, it makes sense to upgrade to 2.4, not only because of that, but also because a lot of cluster things have improved stability now.

@ktoso ktoso added the in-progress label Oct 9, 2015

@dmitraver

This comment has been minimized.

Show comment
Hide comment
@dmitraver

dmitraver May 7, 2016

Contributor

@ktoso
I actually implemented the desired behaviour that ensures that non-volatile state such as votedFor and currentTerm is persisted. I used persistent FSM for that. However, it looks like TestFSMRef class that is used almost everywhere in tests doesn't support the FSM persistent actors cause I'm getting the following error:
Error:(18, 29) Cannot prove that pl.project13.scala.akka.raft.SnapshottingWordConcatRaftActor with pl.project13.scala.akka.raft.EventStreamAllMessages <:< akka.actor.FSM[S,D]. val candidate = TestFSMRef(new SnapshottingWordConcatRaftActor with EventStreamAllMessages)
I checked the apply method of the class and found that its signature requires having a FSM trait
def apply[S, D, T <: Actor: ClassTag](factory: ⇒ T)(implicit ev: T <:< FSM[S, D], system: ActorSystem) which won't work with PersistentFSMBase trait that is used in persistence fsm.

I'm not sure whether its possible to overcome this without changing the TestFSMRef internals. I can actually make a PR with this change but I'm not sure that its a good idea without having working tests.

Contributor

dmitraver commented May 7, 2016

@ktoso
I actually implemented the desired behaviour that ensures that non-volatile state such as votedFor and currentTerm is persisted. I used persistent FSM for that. However, it looks like TestFSMRef class that is used almost everywhere in tests doesn't support the FSM persistent actors cause I'm getting the following error:
Error:(18, 29) Cannot prove that pl.project13.scala.akka.raft.SnapshottingWordConcatRaftActor with pl.project13.scala.akka.raft.EventStreamAllMessages <:< akka.actor.FSM[S,D]. val candidate = TestFSMRef(new SnapshottingWordConcatRaftActor with EventStreamAllMessages)
I checked the apply method of the class and found that its signature requires having a FSM trait
def apply[S, D, T <: Actor: ClassTag](factory: ⇒ T)(implicit ev: T <:< FSM[S, D], system: ActorSystem) which won't work with PersistentFSMBase trait that is used in persistence fsm.

I'm not sure whether its possible to overcome this without changing the TestFSMRef internals. I can actually make a PR with this change but I'm not sure that its a good idea without having working tests.

@ktoso

This comment has been minimized.

Show comment
Hide comment
@ktoso

ktoso May 7, 2016

Owner

Sounds great, thanks for your work.
I think the project should move away from test actor ref in any case by the
way, so that would be another reason to do so.

Konrad Malawski

From: Dmitry Avershin notifications@github.com notifications@github.com
Reply: ktoso/akka-raft
reply@reply.github.com
reply@reply.github.com
Date: 7 May 2016 at 17:11:35
To: ktoso/akka-raft akka-raft@noreply.github.com
akka-raft@noreply.github.com
CC: Konrad Malawski konrad.malawski@project13.pl
konrad.malawski@project13.pl, Mention mention@noreply.github.com
mention@noreply.github.com
Subject: Re: [ktoso/akka-raft] Any exception can cause two leaders to
become elected in the same Term (#62)

@ktoso https://github.com/ktoso

I actually implemented the desired behaviour that ensures that
non-volatile state such as votedFor and currentTerm is persisted. I
used persistent FSM for that. However, it looks like TestFSMRef class
that is used almost everywhere in tests doesn't support the FSM persistent
actors cause I'm getting the following error:

Error:(18, 29) Cannot prove that
pl.project13.scala.akka.raft.SnapshottingWordConcatRaftActor with
pl.project13.scala.akka.raft.EventStreamAllMessages <:< akka.actor.FSM[S,D].
val candidate = TestFSMRef(new SnapshottingWordConcatRaftActor with
EventStreamAllMessages)

I checked the apply method of the class and found that its signature
requires having a FSM trait
def apply[S, D, T <: Actor: ClassTag](factory: ⇒ T)(implicit ev: T <:<
FSM[S, D], system: ActorSystem) which won't work with PersistentFSMBase
trait that is used in persistence fsm.

I'm not sure whether its possible to overcome this without changing the
TestFSMRef internals. I can actually make a PR with this change but I'm
not sure that its a good idea without having working tests.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#62 (comment)

Owner

ktoso commented May 7, 2016

Sounds great, thanks for your work.
I think the project should move away from test actor ref in any case by the
way, so that would be another reason to do so.

Konrad Malawski

From: Dmitry Avershin notifications@github.com notifications@github.com
Reply: ktoso/akka-raft
reply@reply.github.com
reply@reply.github.com
Date: 7 May 2016 at 17:11:35
To: ktoso/akka-raft akka-raft@noreply.github.com
akka-raft@noreply.github.com
CC: Konrad Malawski konrad.malawski@project13.pl
konrad.malawski@project13.pl, Mention mention@noreply.github.com
mention@noreply.github.com
Subject: Re: [ktoso/akka-raft] Any exception can cause two leaders to
become elected in the same Term (#62)

@ktoso https://github.com/ktoso

I actually implemented the desired behaviour that ensures that
non-volatile state such as votedFor and currentTerm is persisted. I
used persistent FSM for that. However, it looks like TestFSMRef class
that is used almost everywhere in tests doesn't support the FSM persistent
actors cause I'm getting the following error:

Error:(18, 29) Cannot prove that
pl.project13.scala.akka.raft.SnapshottingWordConcatRaftActor with
pl.project13.scala.akka.raft.EventStreamAllMessages <:< akka.actor.FSM[S,D].
val candidate = TestFSMRef(new SnapshottingWordConcatRaftActor with
EventStreamAllMessages)

I checked the apply method of the class and found that its signature
requires having a FSM trait
def apply[S, D, T <: Actor: ClassTag](factory: ⇒ T)(implicit ev: T <:<
FSM[S, D], system: ActorSystem) which won't work with PersistentFSMBase
trait that is used in persistence fsm.

I'm not sure whether its possible to overcome this without changing the
TestFSMRef internals. I can actually make a PR with this change but I'm
not sure that its a good idea without having working tests.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#62 (comment)

@dmitraver

This comment has been minimized.

Show comment
Hide comment
@dmitraver

dmitraver May 7, 2016

Contributor

Sounds good to me but what are other alternatives? I haven't had a lot of experience with akka testing so my knowledge is a bit limited here :) I can fix the tests and make a PR request once its ready .

Contributor

dmitraver commented May 7, 2016

Sounds good to me but what are other alternatives? I haven't had a lot of experience with akka testing so my knowledge is a bit limited here :) I can fix the tests and make a PR request once its ready .

@ktoso

This comment has been minimized.

Show comment
Hide comment
@ktoso

ktoso May 7, 2016

Owner

Just normal messaging using actors and test probes.
I'll likely lose internet soon now (airborne), but will try to carve out
some time to give you hints if more needed.

Thanks for hakking on the project :)

Konrad Malawski

From: Dmitry Avershin notifications@github.com notifications@github.com
Reply: ktoso/akka-raft
reply@reply.github.com
reply@reply.github.com
Date: 7 May 2016 at 17:18:53
To: ktoso/akka-raft akka-raft@noreply.github.com
akka-raft@noreply.github.com
CC: Konrad Malawski konrad.malawski@project13.pl
konrad.malawski@project13.pl, Mention mention@noreply.github.com
mention@noreply.github.com
Subject: Re: [ktoso/akka-raft] Any exception can cause two leaders to
become elected in the same Term (#62)

Sounds good to me but what are other alternatives? I haven't had a lot of

experience with akka testing so my knowledge is a bit limited here :) I can
fix the tests and make a PR request once its ready .


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#62 (comment)

Owner

ktoso commented May 7, 2016

Just normal messaging using actors and test probes.
I'll likely lose internet soon now (airborne), but will try to carve out
some time to give you hints if more needed.

Thanks for hakking on the project :)

Konrad Malawski

From: Dmitry Avershin notifications@github.com notifications@github.com
Reply: ktoso/akka-raft
reply@reply.github.com
reply@reply.github.com
Date: 7 May 2016 at 17:18:53
To: ktoso/akka-raft akka-raft@noreply.github.com
akka-raft@noreply.github.com
CC: Konrad Malawski konrad.malawski@project13.pl
konrad.malawski@project13.pl, Mention mention@noreply.github.com
mention@noreply.github.com
Subject: Re: [ktoso/akka-raft] Any exception can cause two leaders to
become elected in the same Term (#62)

Sounds good to me but what are other alternatives? I haven't had a lot of

experience with akka testing so my knowledge is a bit limited here :) I can
fix the tests and make a PR request once its ready .


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#62 (comment)

@dmitraver

This comment has been minimized.

Show comment
Hide comment
@dmitraver

dmitraver May 7, 2016

Contributor

Thanks, its real fun :) I will try to fix the tests then and hopefully make a PR soon.

Contributor

dmitraver commented May 7, 2016

Thanks, its real fun :) I will try to fix the tests then and hopefully make a PR soon.

@dmitraver

This comment has been minimized.

Show comment
Hide comment
@dmitraver

dmitraver May 9, 2016

Contributor

I investigated the test logic a bit more and found that actually the whole test suit relies on TestActorRef and TestFsmRef. There are three "unit" tests that use TestFsmRef - Leader/Follower/Candidate Tests. Other "integration" tests use TestActorRef/TestFsmRef indirectly in RaftSpec. The whole test suite and almost all of the tests should be rewritten to use plain actors and I think the only way to simulate the old test behavior is to subscribe to certain messages and then wait for them to be received. The hardest part in my opinion is to rewrite "unit" tests due to non deterministic behavior of integration testing.

Contributor

dmitraver commented May 9, 2016

I investigated the test logic a bit more and found that actually the whole test suit relies on TestActorRef and TestFsmRef. There are three "unit" tests that use TestFsmRef - Leader/Follower/Candidate Tests. Other "integration" tests use TestActorRef/TestFsmRef indirectly in RaftSpec. The whole test suite and almost all of the tests should be rewritten to use plain actors and I think the only way to simulate the old test behavior is to subscribe to certain messages and then wait for them to be received. The hardest part in my opinion is to rewrite "unit" tests due to non deterministic behavior of integration testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment