BFT-464: Twins generator #116

Merged
merged 21 commits into main on May 28, 2024
Conversation

Contributor

@aakoshh commented May 22, 2024

What ❔

Implements utilities to support generating Twins scenarios:

  • Generate all possible partitionings of nodes (a sketch of this enumeration follows the list)
  • Prune partitionings so that each one always, or at least eventually, has a partition capable of producing a QC
  • A function to size items based on the desired number of replicas and twins
  • Pair partitionings with possible leaders
  • Permute leader-partition combinations for a desired number of rounds
  • Generate a desired number of random scenarios
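
To make the first bullet concrete, here is a minimal sketch (not the PR's actual code) of how all partitionings of n nodes into non-empty, unordered groups can be enumerated recursively; nodes are identified by index here:

// Sketch only: enumerate every way to split nodes 0..n into non-empty groups
// (the Bell-number many partitionings referred to above).
fn partitionings(n: usize) -> Vec<Vec<Vec<usize>>> {
    fn go(i: usize, n: usize, current: &mut Vec<Vec<usize>>, out: &mut Vec<Vec<Vec<usize>>>) {
        if i == n {
            out.push(current.clone());
            return;
        }
        // Put node `i` into each existing group in turn...
        for g in 0..current.len() {
            current[g].push(i);
            go(i + 1, n, current, out);
            current[g].pop();
        }
        // ...or open a new group for it.
        current.push(vec![i]);
        go(i + 1, n, current, out);
        current.pop();
    }
    let mut out = Vec::new();
    go(0, n, &mut Vec::new(), &mut out);
    out
}

For example, partitionings(3) yields the 5 partitionings of 3 nodes; the pruning step would then drop those in which no group reaches quorum size.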

Liveness

I haven't yet added the logic for creating rounds where there is no chance for a quorum in any of the partitions. It would be easy to do, but I'd like to ask your opinion on how many to mix in, whether we should always end with a potential quorum, etc. Perhaps just a probability of a no-quorum round?
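
The last option could look roughly like this sketch, using the rand crate; quorum_capable and no_quorum are hypothetical pre-computed sets of partitionings, not names from the PR:

use rand::Rng;

// Sketch only: with probability `p_no_quorum`, pick the round's partitioning from
// the set where no partition can reach a quorum; otherwise from the quorum-capable set.
fn pick_partitioning<'a, R: Rng>(
    rng: &mut R,
    quorum_capable: &'a [Vec<Vec<usize>>],
    no_quorum: &'a [Vec<Vec<usize>>],
    p_no_quorum: f64,
) -> &'a [Vec<usize>] {
    let pool = if !no_quorum.is_empty() && rng.gen_bool(p_no_quorum) {
        no_quorum
    } else {
        quorum_capable
    };
    &pool[rng.gen_range(0..pool.len())]
}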

Why ❔

So that we can generate the combinations and permutations that we're going to execute with the mock network.

@aakoshh aakoshh marked this pull request as draft May 22, 2024 15:17
@aakoshh aakoshh marked this pull request as ready for review May 23, 2024 14:26

/// The number of signatures required for a high-vote, ie. `2 * f + 1`
pub fn subqourum_size(&self) -> usize {
    self.num_faulty() * 2 + 1
}
Contributor Author
This should be self.num_replicas() * 2 / 5 + 1 so that it doesn't degenerate into 1 if num_replicas <= 5.

Collaborator

this is not a degenerate case. If you want to change it, you need to prove correctness of the algorithm for all possible values of n.

Contributor Author
@aakoshh May 27, 2024

So we start with n = 5f+1 and therefore f = (n-1)/5. @brunoffranca says the correct formula would be sq = n-3*f, so let's see:

sq = n - 3f
sq = n - 3(n - 1)/5
sq = n - 3n/5 + 3/5
sq = 2n/5 + 3/5

Now let's start from the current sq = 2f + 1:

sq = 2f + 1
sq = 2(n - 1)/5 + 1
sq = 2n/5 - 2/5 + 1
sq = 2n/5 + 3/5

I believe these are equivalent in mathematical terms; however, due to the integer arithmetic, for n < 6 we have f = 0, but that's just an artifact of integer division.

Similarly to how the spec says the commit quorum size is 4f+1: when n=5 it gives 1, which is clearly wrong, and for the same reason I think it's wrong for 2f+1, except we use it for the harder-to-understand subquorum, where nobody who sees it for the first time is really sure whether it's an optimisation or required for safety.

Here are the values for the different values of n:

>>> for t in [(n, f, 2*f+1, 2*n//5+1, n-3*f) for (n, f) in  [(n, (n-1)//5) for n in range(1, 15)]]: print("n={} f={} sq_2f1={} sq_25n1={} sq_n3f={}".format(*t))
... 
n=1 f=0 sq_2f1=1 sq_25n1=1 sq_n3f=1
n=2 f=0 sq_2f1=1 sq_25n1=1 sq_n3f=2
n=3 f=0 sq_2f1=1 sq_25n1=2 sq_n3f=3
n=4 f=0 sq_2f1=1 sq_25n1=2 sq_n3f=4
n=5 f=0 sq_2f1=1 sq_25n1=3 sq_n3f=5
n=6 f=1 sq_2f1=3 sq_25n1=3 sq_n3f=3
n=7 f=1 sq_2f1=3 sq_25n1=3 sq_n3f=4
n=8 f=1 sq_2f1=3 sq_25n1=4 sq_n3f=5
n=9 f=1 sq_2f1=3 sq_25n1=4 sq_n3f=6
n=10 f=1 sq_2f1=3 sq_25n1=5 sq_n3f=7
n=11 f=2 sq_2f1=5 sq_25n1=5 sq_n3f=5
n=12 f=2 sq_2f1=5 sq_25n1=5 sq_n3f=6
n=13 f=2 sq_2f1=5 sq_25n1=6 sq_n3f=7
n=14 f=2 sq_2f1=5 sq_25n1=6 sq_n3f=8

I think if n-3f is correct and 2f+1 is correct then the middle ground must be correct too. You can see they don't give the exact same results all across the board, but of course the starkest difference is below n=6.

If it's a safety feature, i.e. if there is exactly one highest subquorum then you must repropose it, then I'm a bit worried that if sq=1, it's more likely that we don't repropose because we see two of them, and lose this safety property. I don't have an example for it off the top of my head, but it seems more intuitive with the other values of 2f+1 that if there are two of them then there surely wasn't any quorum. For example, if n=5, a leader could get isolated and vote for their own proposal, while missing the fact that there was already a quorum for a value in the previous round.

After working through some examples I can't find any counterexample that would lead to a safety violation: if there were enough votes for a QC in a round, at least 2f+1 of them will show up in the next round (via the timeout QC) and there won't be any other SQC; if there are multiple SQCs present then there wasn't any QC. So the only difference is whether we broadcast a fresh payload or try to repropose the highest view that the SQC has seen.

Comment on lines +130 to +131
// Prune partitionings so that all of them have at least one where a quorum is possible.
// Alternatively we could keep all and make sure every scenario has eventually one with a quorum, for liveness.
Contributor Author
@aakoshh May 24, 2024

One thing to consider here would be how the network is implemented in the tests. My idea was that I'd look at the view of the message, then look at the partitioning in that round, and deliver it to all nodes that are in the same partition as the sender.

The ChonkyBFT spec repeats timeouts periodically; however, if every node were in a partition with fewer than 4f+1 members, they would never have a TimeoutQuorum and would never move on to the next view.

Whereas if at least one partition has 4f+1 nodes, then those can time out, and in the next view they might be broadcasting to their stranded brethren, who at that point can move on to the future view in which they now observe a quorum in the justification.
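
A minimal sketch of that delivery rule, with hypothetical names (Port, Partitioning, recipients), not the PR's actual network code:

type Port = usize;
type Partitioning = Vec<Vec<Port>>;

// Sketch only: deliver a message only within the sender's partition for the round
// matching the message's view; past the scripted rounds, treat the network as
// fully connected again.
fn recipients(rounds: &[Partitioning], view: usize, sender: Port, all_ports: &[Port]) -> Vec<Port> {
    match rounds.get(view) {
        Some(partitioning) => partitioning
            .iter()
            .find(|p| p.contains(&sender))
            .cloned()
            .unwrap_or_default(),
        None => all_ports.to_vec(),
    }
}

The fallback branch is one way to encode the "all messages are delivered eventually" assumption discussed below.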

Contributor Author

Alternatively, the network would have to contain some logic to eventually un-partition the nodes, to behave in line with the assumption that all messages are delivered eventually.

Collaborator

The currently implemented network semantics is that all (consensus) messages are delivered eventually, unless overridden by a newer message (see MessagePool for exact semantics).

Member

The best approach IMO is to un-partition every scenario at the end. Partitions are a temporary network condition anyway; with these tests we want to see if the consensus can recover from a partition.
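
A sketch of that suggestion, reusing the hypothetical Partitioning/Port aliases from the earlier sketch: append a few rounds in which all nodes share a single partition, so every scenario ends fully connected.

// Sketch only: pad the scripted rounds with fully connected ones at the end,
// giving consensus a chance to recover from whatever partitions came before.
fn unpartition_at_end(mut rounds: Vec<Partitioning>, all_ports: &[Port], extra_rounds: usize) -> Vec<Partitioning> {
    for _ in 0..extra_rounds {
        rounds.push(vec![all_ports.to_vec()]);
    }
    rounds
}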

Member
@brunoffranca left a comment

Looks good to me. Approved!

@aakoshh merged commit 9241d8f into main May 28, 2024
5 checks passed
@aakoshh deleted the 464-twins-gen branch May 28, 2024 08:23