
Simple anti-entropy mechanism #24

Merged
lalithsuresh merged 4 commits into lalithsuresh:master on Jun 5, 2020

Conversation

@manuelbernhardt (Contributor) commented Apr 23, 2020

It can happen that a node misses part of the consensus messages while still being able to send out its own vote (unidirectional network partition, message overload, ...). In this case, the rest of the group will see this node as part of the group and the monitoring mechanism will keep working as expected, but the stale node will keep running an old configuration.

In order to enforce consistency in this case, the following new anti-entropy mechanism is introduced:

  • each node maintains a set of configurations it has been part of
  • probe messages now contain the configuration ID of the observer
  • when a node receives a probe message with a configuration ID it does not know, it will start a background task to check again after a configured timeout (1 minute by default)
  • if the configuration ID is still unknown after the timeout has elapsed, the node leaves proactively (using the LEAVE protocol); see the sketch after this description

In order for FastPaxos to also have its own Settings, MembershipService needs to be passed a common Settings class that implements both FastPaxos.ISettings and MembershipService.ISettings.
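
For illustration, a minimal sketch of the probe-side staleness check described above could look like the following. All names here (StaleConfigurationDetector, onProbe, leaveGracefully) are assumptions for the sketch, not Rapid's actual API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the anti-entropy check; not Rapid's actual implementation.
final class StaleConfigurationDetector {
    private final Set<Long> knownConfigurationIds = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long timeoutSeconds;

    StaleConfigurationDetector(final long timeoutSeconds) {
        this.timeoutSeconds = timeoutSeconds; // one minute by default per the PR description
    }

    // Called whenever this node becomes part of a new configuration.
    void onNewConfiguration(final long configurationId) {
        knownConfigurationIds.add(configurationId);
    }

    // Called when a probe message arrives carrying the observer's configuration ID.
    void onProbe(final long observerConfigurationId) {
        if (knownConfigurationIds.contains(observerConfigurationId)) {
            return; // configuration already known, nothing to do
        }
        // Unknown configuration: re-check after the timeout and leave if still unknown.
        scheduler.schedule(() -> {
            if (!knownConfigurationIds.contains(observerConfigurationId)) {
                leaveGracefully(); // e.g. via the LEAVE protocol
            }
        }, timeoutSeconds, TimeUnit.SECONDS);
    }

    private void leaveGracefully() {
        // Trigger this node's own removal from the membership (stubbed out here).
    }
}
```
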
@lalithsuresh (Owner) commented Apr 23, 2020

Thanks for the patch @manuelbernhardt.

If a node is receiving probe messages but with a different configuration ID, it means it is already part of the membership. It might be better for the node to simply learn how to get to the current configuration ID rather than leave.

What we need to maintain at nodes is a log of membership changes and the corresponding configuration IDs (there was a time when I did maintain such a log in MembershipService for this precise use case but removed it because I wasn't using it). A node X that is part of a membership but has not heard enough votes from the previous consensus round can safely learn the new value of the membership by asking any one of the nodes Y in the new round (because if Y actually moved to the new round, then it is guaranteed to be safe because a Fast Paxos quorum of nodes approved that move). The nodes X is receiving probe messages from are good candidates to ask.

In doing so, we avoid unnecessary removals. The protocol I just mentioned, as far as the state machine is concerned, will behave like simply receiving missing messages.
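
For illustration, one possible shape for that catch-up exchange is sketched below; the message and field names (SyncRequest, SyncResponse, lastKnownConfigurationId) are assumptions, not Rapid's actual protocol.

```java
import java.util.List;

// Hypothetical sketch of the exchange suggested above; names are illustrative only.
final class SyncRequest {
    final long lastKnownConfigurationId; // the stale node X's current configuration
    SyncRequest(final long lastKnownConfigurationId) {
        this.lastKnownConfigurationId = lastKnownConfigurationId;
    }
}

final class SyncResponse {
    final long currentConfigurationId;    // the responder Y's configuration
    final List<String> currentMembership; // safe for X to adopt: Y only reached this
                                          // configuration via a Fast Paxos quorum
    SyncResponse(final long currentConfigurationId, final List<String> currentMembership) {
        this.currentConfigurationId = currentConfigurationId;
        this.currentMembership = currentMembership;
    }
}

// Flow: when X sees a probe carrying a configuration ID it does not know, it sends a
// SyncRequest (with its own configuration ID) to the probe's sender Y. Y replies with
// a SyncResponse built from its local view, and X adopts it. From the state machine's
// point of view this behaves like X finally receiving the consensus messages it missed.
```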

Another concern is to make sure this works even when using custom failure detectors.

Thoughts?

@manuelbernhardt (Contributor, Author) commented Apr 24, 2020

@lalithsuresh yes, I was thinking about a recovery synchronization (like the one in Lifeguard). The particular case that led me to implement this now is that I'm facing rogue nodes that exhibit what seem to be short-lived partitions towards some of the other nodes in the network. There seem to be a few of those nodes that are "flaky": sometimes they work normally, sometimes they don't receive messages from some of the other nodes in the network (but they are still able to send). Even if they could recover, I don't necessarily want these nodes around as I don't trust them. So I'm thinking perhaps this feature could be configured to behave either one way or the other.

That said, I should've marked this branch as "work in progress" - I definitely think that the synchronization should be part of it. Do you have any thoughts about how to go about asking one of the observers about the new configuration? It looks like a new message type to me.

@manuelbernhardt (Contributor, Author) commented May 4, 2020

@lalithsuresh I implemented a synchronization mechanism in addition to the leave behavior. Actually, I'm now thinking of getting rid of the leave behavior, since having both feels like too many possibilities.

The way it works is to maintain a history of change proposals and their details (joiner metadata and UUIDs). It seems to be working quite nicely. It is implemented in such a way that a node can survive missing multiple rounds - I'm thinking about the case of, e.g., two quick rounds with a slow/lenient failure detector.
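
For illustration, a minimal sketch of such a per-round history could look like the following; every name here is an assumption for the sketch, not the actual implementation in this PR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: each entry keeps a proposal's endpoints together with their
// join-time metadata and UUIDs, so a node that missed several rounds can replay
// them in order during synchronization.
final class ProposalHistory {
    static final class ProposalRecord {
        final long configurationId;          // configuration resulting from the proposal
        final List<String> endpoints;        // joining or leaving endpoints
        final Map<String, byte[]> metadata;  // joiner metadata, keyed by endpoint
        final Map<String, UUID> identifiers; // joiner UUIDs, keyed by endpoint

        ProposalRecord(final long configurationId, final List<String> endpoints,
                       final Map<String, byte[]> metadata, final Map<String, UUID> identifiers) {
            this.configurationId = configurationId;
            this.endpoints = endpoints;
            this.metadata = metadata;
            this.identifiers = identifiers;
        }
    }

    private final Deque<ProposalRecord> history = new ArrayDeque<>();

    synchronized void append(final ProposalRecord record) {
        history.addLast(record);
    }

    // Everything decided after the caller's configuration ID, oldest first, so a
    // lagging node can apply multiple missed rounds in a single synchronization.
    synchronized List<ProposalRecord> since(final long configurationId) {
        final List<ProposalRecord> missed = new ArrayList<>();
        boolean seen = false;
        for (final ProposalRecord record : history) {
            if (seen) {
                missed.add(record);
            } else if (record.configurationId == configurationId) {
                seen = true;
            }
        }
        return missed;
    }
}
```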

lalithsuresh merged commit 0a1788f into lalithsuresh:master on Jun 5, 2020
1 check failed
@lalithsuresh (Owner) commented Jun 5, 2020

@manuelbernhardt the missConsensusMessagesAndSync test fails when I run it on my Mac from Maven. I get a long stream of the following error, suggesting a bug in the sync protocol:

SEVERE: RuntimeException while executing runnable com.google.common.util.concurrent.Futures$4@613b58e4 with executor MoreExecutors.directExecutor()
com.vrg.rapid.MembershipView$UUIDAlreadySeenException: Endpoint add attempt with identifier already seen: {host: hostname: "127.0.0.1"
port: 1274
, identifier: high: -88718771626098184
low: -8026499079153775638
}:
	at com.vrg.rapid.MembershipView.ringAdd(MembershipView.java:129)
	at com.vrg.rapid.MembershipService.decideViewChange(MembershipService.java:442)
	at com.vrg.rapid.MembershipService.access$1200(MembershipService.java:80)
	at com.vrg.rapid.MembershipService$SynchronizationCallback.onSuccess(MembershipService.java:843)
	at com.vrg.rapid.MembershipService$SynchronizationCallback.onSuccess(MembershipService.java:824)
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1132)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:900)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:811)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:653)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:47)
	at com.vrg.rapid.messaging.impl.Retries$1.onSuccess(Retries.java:68)
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1132)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
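
For illustration only: the exception indicates that the sync callback re-applies a view change containing an endpoint whose identifier the local MembershipView has already recorded. A guard of roughly the following shape (hypothetical names, not necessarily the actual fix that landed) would skip such identifiers before applying a synced view change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch of a dedup guard on the sync-apply path; illustrative only.
final class SyncApplyGuard {
    private final Set<UUID> identifiersAlreadySeen;

    SyncApplyGuard(final Set<UUID> identifiersAlreadySeen) {
        this.identifiersAlreadySeen = identifiersAlreadySeen;
    }

    // Keep only joiners whose identifiers the local view has not added yet, so a
    // replayed change cannot trigger UUIDAlreadySeenException in ringAdd().
    List<Map.Entry<String, UUID>> newJoinersOnly(final List<Map.Entry<String, UUID>> joiners) {
        final List<Map.Entry<String, UUID>> fresh = new ArrayList<>();
        for (final Map.Entry<String, UUID> joiner : joiners) {
            if (!identifiersAlreadySeen.contains(joiner.getValue())) {
                fresh.add(joiner);
            }
        }
        return fresh;
    }
}
```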

lalithsuresh added two commits that referenced this issue on Jun 10, 2020