dBFT 2.1 (solving 2.0 liveness lock) #792

Open

AnnaShaleva opened this issue Mar 3, 2023 · 10 comments

@AnnaShaleva
Member

AnnaShaleva commented Mar 3, 2023

Summary or problem description

This issue is triggered by the dBFT 2.0 liveness lock problem mentioned in neo-project/neo#2029 (comment) and other discussions in issues and PRs. We've used the TLA+ formal modeling tool to analyze dBFT 2.0 behaviour. In this issue we present the formal algorithm specification with a set of identified problems and propose ways to fix them in the so-called dBFT 2.1.

dBFT 2.0 formal models

We've created two models of the dBFT 2.0 algorithm in TLA+. Please consider reading the brief introduction to our modelling approach in the README and take a look at the base model. Below we present several error traces found by the TLC Model Checker in the four-node network scenario.

Model checking note

Please consider reading the model checking note before exploring the error traces below.

1. Liveness lock with four non-faulty nodes

The TLA+ specification configuration assumes four non-faulty replicas precisely following the dBFT 2.0 algorithm, with the maximum reachable view set to 2. Here's the model values configuration used for the TLC Model Checker in the described scenario:

| RM | RMFault | RMDead | MaxView |
|---|---|---|---|
| {0, 1, 2, 3} | {} | {} | 2 |
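To make the table concrete, this is roughly what the corresponding TLC model configuration looks like (a minimal sketch only; the actual .cfg files and the top-level formula name live in the model repository, so `SPECIFICATION Spec` here is an assumption):

```
CONSTANTS
    RM = {0, 1, 2, 3}
    RMFault = {}
    RMDead = {}
    MaxView = 2
SPECIFICATION Spec  \* assumed name of the top-level temporal formula
```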

The following liveness lock scenario was found by the TLC Model Checker:

Steps to reproduce the liveness lock
  1. The primary (at view 0) replica 0 sends the PrepareRequest message.
  2. The primary (at view 0) replica 0 decides to change its view (possible on timeout) and sends the ChangeView message.
  3. The backup (at view 0) replica 1 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
  4. The backup (at view 0) replica 1 decides to change its view (possible on timeout) and sends the ChangeView message.
  5. The backup (at view 0) replica 2 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
  6. The backup (at view 0) replica 2 collects M prepare messages (from itself and replicas 0, 1) and broadcasts the Commit message for view 0.
  7. The backup (at view 0) replica 3 decides to change its view (possible on timeout) and sends the ChangeView message.
  8. The primary (at view 0) replica 0 collects M ChangeView messages (from itself and replicas 1, 3) and changes its view to 1.
  9. The backup (at view 0) replica 1 collects M ChangeView messages (from itself and replicas 0, 3) and changes its view to 1.
  10. The primary (at view 1) replica 1 sends the PrepareRequest message.
  11. The backup (at view 1) replica 0 receives the PrepareRequest of view 1 and sends the PrepareResponse.
  12. The backup (at view 1) replica 0 decides to change its view (possible on timeout) and sends the ChangeView message.
  13. The primary (at view 1) replica 1 decides to change its view (possible on timeout) and sends the ChangeView message.
  14. The backup (at view 0) replica 3 collects M ChangeView messages (from itself and replicas 0, 1) and changes its view to 1.
  15. The backup (at view 1) replica 3 receives PrepareRequest of view 1 and broadcasts its PrepareResponse.
  16. The backup (at view 1) replica 3 collects M prepare messages and broadcasts the Commit message for view 1.
  17. The rest of the undelivered messages eventually reach their receivers, but this doesn't change the nodes' states.

Here's the TLC error trace attached: base_deadlock1_dl.txt

After the described sequence of steps we end up in the following situation:

| Replica | View | State |
|---|---|---|
| 0 | 1 | ChangeView sent, in the process of changing view from 1 to 2 |
| 1 | 1 | ChangeView sent, in the process of changing view from 1 to 2 |
| 2 | 0 | Commit sent, waiting for the rest of the nodes to commit at view 0 |
| 3 | 1 | Commit sent, waiting for the rest of the nodes to commit at view 1 |

So replica 2 is stuck at view 0 with no way to exit the commit stage and no way to collect more Commit messages from the other replicas. Replica 3 is stuck at view 1 with the same problem. Finally, replicas 0 and 1 have entered the view-changing stage and are able neither to commit (only F nodes have committed at view 1) nor to change view (replica 2 can't send its ChangeView from the commit stage).
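For reference, the quorum arithmetic behind this reasoning (F and M as defined by dBFT 2.0 for N consensus nodes):

```
N == 4                \* number of consensus nodes in this scenario
F == (N - 1) \div 3   \* = 1, the maximum number of faulty nodes tolerated
M == N - F            \* = 3, messages needed to prepare, commit or change view
```

With M = 3, the two ChangeView messages available to replicas 0 and 1 and the single view-1 Commit from replica 3 are each one message short of a quorum, so no replica can make progress.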

This liveness lock happens because the outcome of the consensus round (either commit or change view) completely depends on the message delivery order. Moreover, we've faced exactly this kind of deadlock in a real functioning network; that incident was fixed by restarting the consensus nodes.

2. Liveness lock with one "dead" node and three non-faulty nodes

The TLA+ specification configuration assumes three non-faulty nodes precisely following the dBFT 2.0 algorithm and one node that can "die" and stop sending consensus messages or changing its state at any point of the behaviour. The liveness lock can be reproduced both when the first primary node is able to "die" and when a non-primary node "dies" in the middle of the consensus process. The maximum reachable view is set to 2. Here are the two model values configurations used for the TLC Model Checker in the described scenario:

| RM | RMFault | RMDead | MaxView |
|---|---|---|---|
| {0, 1, 2, 3} | {} | {0} | 2 |
| {0, 1, 2, 3} | {} | {1} | 2 |

The following liveness lock scenario was found by the TLC Model Checker:

Steps to reproduce the liveness lock (first configuration with primary node "dying" is taken as an example)
  1. The primary (at view 0) replica 0 sends the PrepareRequest message.
  2. The primary (at view 0) replica 0 "dies" and can't send or handle any further consensus messages.
  3. The backup (at view 0) replica 1 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
  4. The backup (at view 0) replica 1 decides to change its view (possible on timeout) and sends the ChangeView message.
  5. The backup (at view 0) replica 2 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
  6. The backup (at view 0) replica 2 collects M prepare messages (from itself and replicas 0, 1) and broadcasts the Commit message for view 0.
  7. The backup (at view 0) replica 3 decides to change its view (possible on timeout) and sends the ChangeView message.

Here are the TLC error traces attached:

After the described sequence of steps we end up in the following situation:

| Replica | View | State |
|---|---|---|
| 0 | 0 | Dead (but PrepareRequest sent for view 0) |
| 1 | 0 | ChangeView sent, in the process of changing view from 0 to 1 |
| 2 | 0 | Commit sent, waiting for the rest of the nodes to commit at view 0 |
| 3 | 0 | ChangeView sent, in the process of changing view from 0 to 1 |

So replica 0 is permanently dead at view 0 without any possibility to affect the consensus process. Replica 2 has its Commit sent and is unsuccessfully waiting for the other replicas to enter the commit stage as well. Finally, replicas 1 and 3 have entered the view-changing stage and are able neither to commit (only F nodes have committed) nor to change view (replica 2 can't send its ChangeView from the commit stage).

It should be noted that the dBFT 2.0 algorithm is expected to guarantee block acceptance with at least M non-faulty ("alive" in this case) nodes, which doesn't hold in the described situation.

3. Running the TLC model checker with "faulty" nodes

Both models allow specifying the set of malicious node indexes via the RMFault model constant. "Faulty" nodes are allowed to send any valid message at any step of the behaviour. At the same time, weak fairness is required from the next-state action predicate of both models: if it's possible for the model to take any non-stuttering step, this step must eventually be taken. Thus, running the basic dBFT 2.0 model with a single faulty and three non-faulty nodes doesn't reveal any model deadlock: the malicious node keeps sending messages that let the system escape from the liveness lock. It should also be noted that the presence of faulty nodes slightly increases the state graph size, so it takes more time to evaluate the whole set of possible model behaviours.
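In TLA+ terms, this weak fairness requirement is the standard conjunct added to the next-state relation (a generic sketch; vars stands for the tuple of all specification variables, the actual names in the specifications may differ):

```
Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
```

WF_vars(Next) states that if a non-stuttering Next step stays continuously enabled, some such step is eventually taken, which matches the informal description above.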

Nevertheless, it's a thought-provoking experiment to check the model specification behaviour with a non-empty faulty nodes set. We've checked the basic model with the following configurations and didn't find any liveness lock:

| RM | RMFault | RMDead | MaxView |
|---|---|---|---|
| {0, 1, 2, 3} | {0} | {} | 2 |
| {0, 1, 2, 3} | {1} | {} | 2 |

dBFT 2.1 proposed models

Based on the liveness issues found by the TLC Model Checker, we've developed a couple of ways to improve the dBFT 2.0 algorithm and completely avoid the mentioned liveness and safety problems. The improved models will further be referred to as the dBFT 2.1 models. The model descriptions and specifications are too large to attach to this issue, so they are kept in a separate repo. Please consider reading the dBFT 2.1 models description in the README and check the models' TLA+ specifications.

Running the TLC Model Checker on dBFT 2.1 models

We've run the TLC Model Checker to check both proposed dBFT 2.1 models with the same set of configurations as described above for the basic model. Here's the table of configurations and model checking results (they are almost the same for both dBFT 2.1 models):

| RM | RMFault | RMDead | MaxView | Model checking result |
|---|---|---|---|---|
| {0, 1, 2, 3} | {} | {} | 1 | PASSED, no liveness lock found |
| {0, 1, 2, 3} | {} | {} | 2 | NOT FINISHED, TLC failed with OOM (too many states, see the model checking note), no liveness lock found |
| {0, 1, 2, 3} | {} | {0} | 1 | PASSED, no liveness lock found |
| {0, 1, 2, 3} | {} | {0} | 2 | NOT FINISHED, TLC failed with OOM (too many states, see the model checking note), no liveness lock found |
| {0, 1, 2, 3} | {} | {1} | 1 | PASSED, no liveness lock found |
| {0, 1, 2, 3} | {} | {1} | 2 | NOT FINISHED, TLC failed with OOM (too many states, see the model checking note), no liveness lock found |
| {0, 1, 2, 3} | {0} | {} | 1 | FAILED for dbftCV3.tla, see the note below the table |
| {0, 1, 2, 3} | {0} | {} | 1 | NOT FINISHED for dbftCentralizedCV.tla in reasonable time; the "faulty" node description probably needs some more care |

Note for the FAILED case: given that malicious nodes are allowed to send literally any message at any step of the behaviour, a non-empty RMFault set is able to ruin the whole consensus process. However, the current dBFT 2.0 algorithm faces the liveness lock even with "dead" nodes (a case the proposed dBFT 2.1 models successfully handle). Moreover, we believe that in a real functioning network it's more likely to face the first type of attack ("dead" nodes) than malicious clients intentionally sending any type of message at any time. Thus, we believe that the dBFT 2.1 models perform much better than dBFT 2.0 and solve the dBFT 2.0 liveness lock problems.

Other than that, the TLC Model Checker didn't find any liveness property violations for the dBFT 2.1 models with the MaxView constraint set to either 1 or 2 (see the model checking note for clarification).

Conclusion

We believe that the proposed dBFT 2.1 models solve the dBFT 2.0 liveness lock problems. Anyone who has thoughts, ideas, questions, suggestions or doubts is welcome to join the discussion. The proposed specifications may have bugs or inaccuracies, so we accept all kinds of reasoned comments and related feedback. If you have trouble understanding, editing or checking the models, please don't hesitate to write a comment to this issue.

Further work

The whole idea of TLA+ dBFT modelling was to reveal and avoid the liveness lock problems so that we get a normally operating consensus algorithm in our network. We have several directions for further related work in mind:

  1. Collect and process the community feedback on the proposed TLA+ dBFT 2.0 and 2.1 specifications, and fix the models in case any bugs are found. The discussion period is two months starting from the issue submission date. Please consider writing your feedback in time! After the discussion we'll start the dBFT 2.1 code-level implementation based on the discussion results.
  2. Implement the improved dBFT 2.1 algorithm at the code level.
  3. Create a TLA+ specification and investigate any potential problems of dBFT 3.0 (the double-speaker model, dBFT 3.0 Specification neo#2029).
@vncoelho
Member

vncoelho commented Mar 3, 2023

It is a nice study.

We will study the cases.

From a first look at the first steps, from 1 to 17:
it looks like one replica is committed at view 0 (and lost because it missed the change view), and at step 16 replica 3 commits at view 1.

On step 17 you mention that the state will not change. Why? It is not fully clear to me.
There are still two replicas that can commit at view 1.
And they will not change view because we have some locks for that.

@shargon
Member

shargon commented Mar 3, 2023

Amazing work guys 👏 👏

> The backup (at view 0) replica 1 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.

@AnnaShaleva just to clarify, when you say backup, is it a non-primary or a primary backup?

@Jim8y
Contributor

Jim8y commented Mar 4, 2023

> @AnnaShaleva just to clarify, when you say backup, is it a non-primary or a primary backup?

Since this is not 3.0, I assumed it is non-primary.

@AnnaShaleva
Member Author

@vncoelho, thank you for the quick response.

Speaking of the first case (Liveness lock with four non-faulty nodes):

> It looks like one replica is committed at view 0 (and lost because it missed the change view), and at step 16 replica 3 commits at view 1.

That's true, we've ended up in a state where replica 2 has committed at view 0, replica 3 has committed at view 1, and replicas 0 and 1 have sent their ChangeView messages and are trying to go to view 2.

> On step 17 you mention that the state will not change. Why?

Replica 2 is stuck forever at view 0 in the commitSent state, waiting for Commits of view 0 (there won't be any more). Replica 3 is stuck at view 1 in the commitSent state, waiting for Commits of view 1. There won't be any more Commits of view 1 because replicas 0 and 1 have entered the view-changing process and sent their ChangeView messages trying to go to view 2. Replicas 0 and 1 will never send their Commits for view 1, even after receiving a PrepareResponse of view 1:

```
if (context.PreparationPayloads[message.ValidatorIndex] != null || context.NotAcceptingPayloadsDueToViewChanging) return;
```

or even after receiving a single Commit of view 1 from replica 3 because it's the only Commit for this view.

Replicas 0 and 1 will never change their view either, because there aren't enough ChangeView messages in the network (only two of them).

@AnnaShaleva
Member Author

@shargon,

> @AnnaShaleva just to clarify, when you say backup, is it a non-primary or a primary backup?

The issue is strictly related to dBFT 2.0; we do not consider the double-speaker model here. Thus, backup is a non-primary node.

@Liaojinghui,

> Since this is not 3.0, I assumed it is non-primary.

Exactly.

@vncoelho
Member

vncoelho commented Mar 6, 2023

Hi @AnnaShaleva,

Thanks for the reply and attention.

Take a look at:

```
public bool NotAcceptingPayloadsDueToViewChanging => ViewChanging && !MoreThanFNodesCommittedOrLost;
// A possible attack can happen if the last node to commit is malicious and either sends change view after his
// commit to stall nodes in a higher view, or if he refuses to send recovery messages. In addition, if a node
// asking change views loses network or crashes and comes back when nodes are committed in more than one higher
// numbered view, it is possible for the node accepting recovery to commit in any of the higher views, thus
// potentially splitting nodes among views and stalling the network.
public bool MoreThanFNodesCommittedOrLost => (CountCommitted + CountFailed) > F;
#endregion
```

In that case, Replicas may still commit at view 1.

AnnaShaleva added a commit to AnnaShaleva/dbft that referenced this issue Mar 6, 2023
Based on the neo-project/neo-modules#792 (comment),
the definition of MoreThanFNodesCommitted should be adjusted to match
the core algorithm as it's the key factor of the four-good-nodes liveness lock
scenario. There are two changes:
1. We consider the node to be "Committed" if it has the Commit message
   sent at _any_ view.
2. We should count lost nodes as well. We consider the node to be "Lost"
   if it hasn't sent any messages in the current round.

Based on this adjustment, the first liveness lock scenario mentioned in
neo-project/neo-modules#792 (comment)
("Liveness lock with four non-faulty nodes") is unreachable. However,
there's still a liveness lock when one of the nodes is in the RMDead list,
i.e. can "die" at any moment. Consider running the base model specification
with the following configuration:
```
RM      	RMFault 	RMDead 	MaxView
{0, 1, 2, 3} 	{}      	{0} 	2
```
@AnnaShaleva
Member Author

AnnaShaleva commented Mar 6, 2023

@vncoelho,

> In that case, Replicas may still commit at view 1.

Based on the code-level definition of MoreThanFNodesCommittedOrLost, that's true. We've adjusted the definition of MoreThanFNodesCommittedOrLost in our TLA+ specification (see roman-khimov/dbft#2) to be closer to the code-level definition, and this deadlock seems to be resolved: replicas 0 and 1 can eventually receive enough PrepareResponse messages and commit at view 1.
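Roughly, the adjusted counting corresponds to something like the following TLA+ fragment (an illustrative sketch only, with assumed operator and message-field names; see roman-khimov/dbft#2 for the actual change):

```
\* Assumes EXTENDS FiniteSets and that msgs is the set of all sent messages.
\* A replica counts as committed if it has sent a Commit at ANY view, not only the current one.
CommittedReplicas == {r \in RM : \E m \in msgs : m.type = "Commit" /\ m.rm = r}
MoreThanFNodesCommitted == Cardinality(CommittedReplicas) > F
```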

However, consider the following cases:

A. Replica 0 permanently dies at the end of the liveness lock 1 scenario (after step 16). Then replica 1 is still able to send its Commit, but replicas 1 and 3 will never collect enough Commits to accept the block at view 1, because replica 2 is stuck at view 0 and replica 0 is dead. We expect the network to function properly in this case, because there are M non-faulty nodes alive and able to send/receive messages.

B. Run the adjusted basic specification with the following configuration (one node is able to die):

| RM | RMFault | RMDead | MaxView |
|---|---|---|---|
| {0, 1, 2, 3} | {} | {0} | 2 |

The TLC Model Checker finds the following liveness lock in this case:

Steps to reproduce the liveness lock
  1. The primary (at view 0) replica 0 sends the PrepareRequest message.
  2. The primary (at view 0) replica permanently "dies".
  3. The backup (at view 0) replica 1 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
  4. The backup (at view 0) replica 1 decides to change its view (possible on timeout) and sends the ChangeView message.
  5. The backup (at view 0) replica 2 receives the PrepareRequest of view 0 and broadcasts its PrepareResponse.
  6. The backup (at view 0) replica 2 collects M prepare messages (from itself and replicas 0, 1) and broadcasts the Commit message for view 0.
  7. The backup (at view 0) replica 3 doesn't receive any messages in time, decides to change its view (possible on timeout) and sends the ChangeView message.

errtrace.txt

After the described sequence of steps we end up in the following situation:

| Replica | View | State |
|---|---|---|
| 0 | 0 | PrepareRequest sent, dead |
| 1 | 0 | ChangeView sent, in the process of changing view from 0 to 1, not accepting preparations anymore |
| 2 | 0 | Commit sent, waiting for the rest of the nodes to commit at view 0 |
| 3 | 0 | ChangeView sent, in the process of changing view from 0 to 1 |

We end up in a situation where replica 2 is stuck in the commitSent state trying to collect enough Commit messages for view 0. Replica 0 is dead, but it was able to send its PrepareRequest for view 0. Replica 1 is in the process of view changing: it has the PrepareRequest received from replica 0, the Commit received from replica 2 (with the PrepareResponse missed) and the ChangeView received from replica 3; thus, replica 1 will never accept PrepareResponse messages anymore (MoreThanFNodesCommittedOrLost is false for replica 1 and NotAcceptingPayloadsDueToViewChanging is true) and blocks the further consensus process. Replica 3 is in the process of view changing, and there can be a situation where it won't be able to send its Commit either.
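To make replica 1's blocked state explicit, here is how the quoted C# properties evaluate in that state (the values follow directly from the trace above, with F = 1 for four nodes):

```
ViewChanging                          = true            (replica 1 has sent its ChangeView)
CountCommitted                        = 1               (only replica 2 has sent a Commit)
CountFailed                           = 0               (every replica has sent a message this round)
MoreThanFNodesCommittedOrLost         = (1 + 0) > 1     = false
NotAcceptingPayloadsDueToViewChanging = true && !false  = true
```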

@vncoelho
Member

vncoelho commented Mar 6, 2023

We will soon read and review the other cases, @AnnaShaleva.
I will coordinate with @igormcoelho.

AnnaShaleva added a commit to AnnaShaleva/dbft that referenced this issue Mar 6, 2023
Based on the neo-project/neo-modules#792 (comment),
the definition of MoreThanFNodesCommitted should be adjusted to match
the core algorithm as it's the key factor of the four-good-nodes liveness lock
scenario. The following change is made:

* We consider the node to be "Committed" if it has the Commit message
  sent at _any_ view. If the good node has committed, then we know for sure
  that it won't go further to the next consensus round and can rely on
  this information.

The thing that remains the same is that we do not count the "lost" nodes, because
this information can't be reliably trusted. See the comment inside the commit.

Based on this adjustment, the first liveness lock scenario mentioned in
neo-project/neo-modules#792 (comment)
("Liveness lock with four non-faulty nodes") needs to include one more step
to enter a deadlock: one of the replicas that are in the "cv" state must die at the
end of the scenario.

Moreover, there's another liveness lock scenario when one of the nodes is in
the RMDead list, i.e. can "die" at any moment. Consider running the base model
specification with the following configuration:
```
RM      	RMFault 	RMDead 	MaxView
{0, 1, 2, 3} 	{}      	{0} 	2
```
@AnnaShaleva
Member Author

AnnaShaleva commented Mar 6, 2023

@vncoelho,

After a discussion with Roman, I need to adjust the previous comment:

MoreThanFNodesCommittedOrLost in our TLA+ specification is implemented as MoreThanFNodesCommitted. We've adjusted the way we count committed nodes to consider not only Commits from the current view, but also Commits from any view (see roman-khimov/dbft#2). This adjustment allows escaping from the "Liveness lock with four non-faulty nodes" scenario; however, cases A and B described in #792 (comment) are still valid liveness locks.

It should also be mentioned that we do not count "lost" nodes in the basic TLA+ specification; the reason is explained in roman-khimov/dbft#2 (comment) and the commit message roman-khimov/dbft@0ff31fd. However, this doesn't affect the liveness locks found.

@vang1ong7ang

vang1ong7ang commented Sep 18, 2023

That's amazing work. 👍

Keep improving the formal model; it's even possible to provide a formal proof based on TLA+.

To make the TLA+ model clear and verifiable, I suggest following the best practice below and avoiding NOT FINISHED:

```
Spec == Init /\ [][Next]_<<STATE>>
LivenessSpec == Spec /\ WF_<<STATE>>(Next)
----------------------------------------------------------------------------
ConditionSafety == [] ...
ConditionLiveness == <> ...
----------------------------------------------------------------------------
THEOREM Spec => ConditionSafety
THEOREM LivenessSpec => ConditionLiveness
```

It would also be better if msgs could be replaced with inbox and outbox, since dBFT works in an asynchronous network environment and messages may be dropped, duplicated and reordered...

```
(***************************************************************************)
(* I: inboxes of all replicas                                              *)
(* O: outboxes of all replicas                                             *)
(***************************************************************************)

VARIABLE I, O

Send(from, to, msgbody) == ...
```
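One possible, purely hypothetical shape of such an operator, just to illustrate the inbox/outbox idea (none of these names come from the actual specification):

```
\* Sending only appends to the sender's outbox; a separate deliver action would move
\* (or drop, duplicate, reorder) messages from O into the receivers' inboxes I.
Send(from, to, msgbody) ==
    /\ O' = [O EXCEPT ![from] = @ \cup {[rcv |-> to, body |-> msgbody]}]
    /\ UNCHANGED I
```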

MaxView had better be removed. SafetySpec should always be guaranteed. We shouldn't expect TLC to verify liveness very well, since TLC is not good at it, but it is not worth sacrificing the accuracy of the formal model for the performance of TLC.
