BFT-464: Twins generator #116

Merged
merged 21 commits into main on May 28, 2024
Conversation

Contributor

@aakoshh commented May 22, 2024

What ❔

Implements utilities to support generating Twins scenarios:

  • Generate all possible partitionings of nodes (a sketch of this enumeration follows the list)
  • Prune partitionings so that each one always, or at least eventually, has a partition capable of producing a QC
  • A function to size items based on the desired number of replicas and twins
  • Pair partitionings with possible leaders
  • Permute leader-partition combinations for a desired number of rounds
  • Generate a desired number of random scenarios
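
To make the first bullet concrete, here is a minimal sketch (not the PR's actual code) of how all partitionings of n nodes into non-empty, unordered groups can be enumerated recursively; nodes are identified by index here:

// Sketch only: enumerate every way to split nodes 0..n into non-empty groups
// (the Bell-number many partitionings referred to above).
fn partitionings(n: usize) -> Vec<Vec<Vec<usize>>> {
    fn go(i: usize, n: usize, current: &mut Vec<Vec<usize>>, out: &mut Vec<Vec<Vec<usize>>>) {
        if i == n {
            out.push(current.clone());
            return;
        }
        // Put node `i` into each existing group in turn...
        for g in 0..current.len() {
            current[g].push(i);
            go(i + 1, n, current, out);
            current[g].pop();
        }
        // ...or open a new group for it.
        current.push(vec![i]);
        go(i + 1, n, current, out);
        current.pop();
    }
    let mut out = Vec::new();
    go(0, n, &mut Vec::new(), &mut out);
    out
}

For example, partitionings(3) yields the 5 partitionings of 3 nodes; the pruning step would then drop those in which no group reaches quorum size.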

Liveness

I haven't yet added the logic for creating rounds where there is no chance for a quorum in any of the partitions. It would be easy to do, but I'd like to ask your opinion on how many to mix in, whether we should always end with a potential quorum, etc. Perhaps just a probability of a no-quorum round?
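
The last option could look roughly like this sketch, using the rand crate; quorum_capable and no_quorum are hypothetical pre-computed sets of partitionings, not names from the PR:

use rand::Rng;

// Sketch only: with probability `p_no_quorum`, pick the round's partitioning from
// the set where no partition can reach a quorum; otherwise from the quorum-capable set.
fn pick_partitioning<'a, R: Rng>(
    rng: &mut R,
    quorum_capable: &'a [Vec<Vec<usize>>],
    no_quorum: &'a [Vec<Vec<usize>>],
    p_no_quorum: f64,
) -> &'a [Vec<usize>] {
    let pool = if !no_quorum.is_empty() && rng.gen_bool(p_no_quorum) {
        no_quorum
    } else {
        quorum_capable
    };
    &pool[rng.gen_range(0..pool.len())]
}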

Why ❔

So that we can generate the combinations and permutations that we're going to execute with the mock network.

@aakoshh aakoshh marked this pull request as draft May 22, 2024 15:17
@aakoshh aakoshh marked this pull request as ready for review May 23, 2024 14:26

/// The number of signatures required for a high-vote, ie. `2 * f + 1`
pub fn subqourum_size(&self) -> usize {
    self.num_faulty() * 2 + 1
}
Contributor Author
This should be self.num_replicas() * 2 / 5 + 1 so that it doesn't degenerate into 1 if num_replicas <= 5.

Collaborator

this is not a degenerate case. If you want to change it, you need to prove correctness of the algorithm for all possible values of n.

Contributor Author
@aakoshh May 27, 2024

So we start with n = 5f+1 and therefore f = (n-1)/5. @brunoffranca says the correct formula would be sq = n-3*f, so let's see:

sq = n - 3f
sq = n - 3(n - 1)/5
sq = n - 3n/5 + 3/5
sq = 2n/5 + 3/5

Now let's start from the current sq = 2f + 1:

sq = 2f + 1
sq = 2(n - 1)/5 + 1
sq = 2n/5 - 2/5 + 1
sq = 2n/5 + 3/5

I believe these are equivalent in mathematical terms; however, due to the integer arithmetic, for n < 6 we have f = 0, but that's just an artifact of integer division.

Similarly to how the spec says the commit quorum size is 4f+1: when n=5 it gives 1, which is clearly wrong, and for the same reason I think it's wrong for 2f+1, except we use it for the harder-to-understand subquorum, where nobody who sees it for the first time is really sure whether it's an optimisation or required for safety.

Here are the values for the different values of n:

>>> for t in [(n, f, 2*f+1, 2*n//5+1, n-3*f) for (n, f) in  [(n, (n-1)//5) for n in range(1, 15)]]: print("n={} f={} sq_2f1={} sq_25n1={} sq_n3f={}".format(*t))
... 
n=1 f=0 sq_2f1=1 sq_25n1=1 sq_n3f=1
n=2 f=0 sq_2f1=1 sq_25n1=1 sq_n3f=2
n=3 f=0 sq_2f1=1 sq_25n1=2 sq_n3f=3
n=4 f=0 sq_2f1=1 sq_25n1=2 sq_n3f=4
n=5 f=0 sq_2f1=1 sq_25n1=3 sq_n3f=5
n=6 f=1 sq_2f1=3 sq_25n1=3 sq_n3f=3
n=7 f=1 sq_2f1=3 sq_25n1=3 sq_n3f=4
n=8 f=1 sq_2f1=3 sq_25n1=4 sq_n3f=5
n=9 f=1 sq_2f1=3 sq_25n1=4 sq_n3f=6
n=10 f=1 sq_2f1=3 sq_25n1=5 sq_n3f=7
n=11 f=2 sq_2f1=5 sq_25n1=5 sq_n3f=5
n=12 f=2 sq_2f1=5 sq_25n1=5 sq_n3f=6
n=13 f=2 sq_2f1=5 sq_25n1=6 sq_n3f=7
n=14 f=2 sq_2f1=5 sq_25n1=6 sq_n3f=8

I think if n-3f is correct and 2f+1 is correct then the middle ground must be correct too. You can see they don't give the exact same results all across the board, but of course the starkest difference is below n=6.

If it's a safety feature, i.e. if there is exactly one highest subquorum then you must repropose it, then I'm a bit worried that if sq=1, it's more likely that we don't repropose because we see two of them, and lose this safety property. I don't have an example for it off the top of my head, but it seems more intuitive with the other values of 2f+1 that if there are two of them then there surely wasn't any quorum. For example, if n=5, a leader could get isolated and vote for their own proposal, while missing the fact that there was already a quorum for a value in the previous round.

After working through some examples I can't find any counterexample that would lead to a safety violation: if there were enough votes for a QC in a round, at least 2f+1 of them will show up in the next round (via the timeout QC) and there won't be any other SQC; if there are multiple SQCs present then there wasn't any QC. So the only difference is whether we broadcast a fresh payload or try to repropose the highest view that the SQC has seen.

Comment on lines +130 to +131
// Prune partitionings so that all of them have at least one where a quorum is possible.
// Alternatively we could keep all and make sure every scenario has eventually one with a quorum, for liveness.
Contributor Author
@aakoshh May 24, 2024

One thing to consider here would be how the network is implemented in the tests. My idea was that I'd look at the view of the message, then look at the partitioning in that round, and deliver it to all nodes that are in the same partition as the sender.

The ChonkyBFT spec repeats timeouts periodically; however, if every node were in a partition with fewer than 4f+1 members, they would never have a TimeoutQuorum and would never move on to the next view.

Whereas if at least one partition has 4f+1 nodes, then those can time out, and in the next view they might be broadcasting to their stranded brethren, who at that point can move on to the future view in which they now observe a quorum in the justification.
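
A minimal sketch of that delivery rule, with hypothetical names (Port, Partitioning, recipients), not the PR's actual network code:

type Port = usize;
type Partitioning = Vec<Vec<Port>>;

// Sketch only: deliver a message only within the sender's partition for the round
// matching the message's view; past the scripted rounds, treat the network as
// fully connected again.
fn recipients(rounds: &[Partitioning], view: usize, sender: Port, all_ports: &[Port]) -> Vec<Port> {
    match rounds.get(view) {
        Some(partitioning) => partitioning
            .iter()
            .find(|p| p.contains(&sender))
            .cloned()
            .unwrap_or_default(),
        None => all_ports.to_vec(),
    }
}

The fallback branch is one way to encode the "all messages are delivered eventually" assumption discussed below.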

Contributor Author

Alternatively, the network would have to contain some logic to eventually un-partition the nodes, to behave in line with the assumption that all messages are delivered eventually.

Collaborator

The currently implemented network semantics is that all (consensus) messages are delivered eventually, unless overridden by a newer message (see MessagePool for exact semantics).

Member

The best approach IMO is to un-partition every scenario at the end. Partitions are a temporary network condition anyway; with these tests we want to see if the consensus can recover from a partition.
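
A sketch of that suggestion, reusing the hypothetical Partitioning/Port aliases from the earlier sketch: append a few rounds in which all nodes share a single partition, so every scenario ends fully connected.

// Sketch only: pad the scripted rounds with fully connected ones at the end,
// giving consensus a chance to recover from whatever partitions came before.
fn unpartition_at_end(mut rounds: Vec<Partitioning>, all_ports: &[Port], extra_rounds: usize) -> Vec<Partitioning> {
    for _ in 0..extra_rounds {
        rounds.push(vec![all_ports.to_vec()]);
    }
    rounds
}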

Member
@brunoffranca left a comment

Looks good to me. Approved!

@aakoshh merged commit 9241d8f into main May 28, 2024
5 checks passed
@aakoshh deleted the 464-twins-gen branch May 28, 2024 08:23