
dBFT 2.0 #547

Merged
merged 31 commits into master from consensus/improved_dbft on Mar 3, 2019

Conversation

5 participants
@vncoelho (Member) commented Jan 14, 2019

This PR contains the key changes for achieving dBFT 2.0 (#426, #422, #534, among others), an improved and optimized version of the pioneering dBFT 1.0.

Key features will be merged here before the final merge into master.

Main contributors: @erikzhang, @shargon, @igormcoelho, @vncoelho, @belane, @wanglongfei88, @edwardz246003, @jsolman

erikzhang and others added some commits Jan 14, 2019

Add commit phase to consensus algorithm (#534)
* Add commit phase to consensus algorithm

* fix tests

* Prevent repeated sending of `Commit` messages

* RPC call gettransactionheight (#541)

* getrawtransactionheight

Currently, two calls are needed to get a transaction height: `getrawtransaction` with `verbose`, then using the returned `blockhash`.
Another option is to use `confirmations`, but it can be misleading. (See the sketch after this commit list.)

* Minor fix

* Shargon's tip

* modified

* Allow to use the wallet inside a RPC plugin (#536)

* Clean code

* Clean code
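
As an illustration of the `gettransactionheight` change listed above (#541), here is a small sketch comparing the old two-step lookup with the new call. The endpoint URL and transaction hash are placeholders and error handling is omitted; this is not part of the PR itself.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class TxHeightSketch
{
    static readonly HttpClient Client = new HttpClient();

    static async Task<string> RpcAsync(string method, string paramsJson)
    {
        // Plain JSON-RPC 2.0 request body built by hand to avoid extra dependencies.
        string body = $"{{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"{method}\",\"params\":{paramsJson}}}";
        var content = new StringContent(body, Encoding.UTF8, "application/json");
        var response = await Client.PostAsync("http://localhost:10332", content); // placeholder RPC endpoint
        return await response.Content.ReadAsStringAsync();
    }

    static async Task Main()
    {
        string txid = "0x..."; // placeholder transaction hash

        // Old pattern: getrawtransaction with verbose=1, then resolve the returned
        // blockhash (or rely on `confirmations`, which can be misleading).
        Console.WriteLine(await RpcAsync("getrawtransaction", $"[\"{txid}\", 1]"));

        // New single call introduced in #541.
        Console.WriteLine(await RpcAsync("gettransactionheight", $"[\"{txid}\"]"));
    }
}
```
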
@vncoelho (Member, Author) commented Jan 14, 2019

From #534, the only open discussion was related to the Regeneration phase.

vncoelho and others added some commits Jan 14, 2019

@vncoelho (Member, Author) commented Jan 22, 2019

@erikzhang, do you have an idea for the template for the Regeneration?
If you want, we can try to adjust the ideas of #426 to the current context. I think the main difference will be about how to track and share the Signatures of the Payloads, which we will need to get from the Witness in order to correctly regenerate.

@erikzhang (Member) commented Jan 23, 2019

I think regeneration should have two levels. The first is to achieve state recovery by reading the log recorded by the node, and the second is to achieve state synchronization through the replay of network messages.
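
A minimal sketch of those two levels, with stub names standing in for the real `ConsensusService` and store types; it is illustrative only, not the implementation that followed in this PR.

```csharp
using System;

// Sketch of the two recovery levels described above; the names are stand-ins,
// not the actual ConsensusService/store API.
class RegenerationSketch
{
    uint CurrentBlockIndex = 100;                               // height the node is building
    (bool found, uint blockIndex) LoadLocalLog() => (true, 99); // stub: persisted consensus log

    public void OnStart()
    {
        var log = LoadLocalLog();
        if (log.found && log.blockIndex == CurrentBlockIndex)
        {
            // Level 1: state recovery by reading the log recorded by the node itself.
            Console.WriteLine("restore consensus state from the local log");
        }
        else
        {
            // Level 2: state synchronization by asking peers to replay consensus messages.
            Console.WriteLine("request a replay/recovery of network messages from peers");
        }
    }

    static void Main() => new RegenerationSketch().OnStart();
}
```
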

@vncoelho (Member, Author) commented Jan 23, 2019

Sounds good, Erik.
Replaying network messages is a little different from the previous approach.
I just wonder whether we should also follow that line with a single Payload that can regenerate the node, because it contains the signatures of the previous agreements. Do you think that messages carrying many signature hashes (imagining 1000 nodes) would be a bad design?
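
As a rough back-of-envelope for that design question, assuming 64-byte signatures and 32-byte hashes; the 1000-validator count is the hypothetical number mentioned above.

```csharp
using System;

// Rough size estimate for a single regeneration payload carrying one item per
// validator. The validator count and item sizes are assumptions for illustration.
class PayloadSizeEstimate
{
    static void Main()
    {
        int validators = 1000;
        const int signatureSize = 64; // bytes per ECDSA signature
        const int hashSize = 32;      // bytes per SHA-256 hash

        Console.WriteLine($"signatures: {validators * signatureSize / 1024.0:F1} KiB");
        Console.WriteLine($"hashes:     {validators * hashSize / 1024.0:F1} KiB");
        // About 62.5 KiB of signatures or 31.3 KiB of hashes per payload: it grows
        // linearly with the validator set, but remains a modest message size.
    }
}
```
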

@erikzhang erikzhang added the critical label Jan 24, 2019

erikzhang added some commits Jan 24, 2019

Prevent `ConsensusService` from receiving messages before starting (#573)

* Prevent `ConsensusService` from receiving messages before starting

* fixed tests - calling OnStart now
Consensus recovery log (#572)
* Pass store to `ConsensusService`

* Implement `ISerializable` in `ConsensusContext`

* Start from recovery log

* Fix unit tests due to constructor taking the store.

* Add unit tests for serializing and deserializing the consensus context.
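
A simplified sketch of what the recovery-log commits above amount to, with a cut-down context instead of the real `ConsensusContext` (illustrative only): the fields needed to resume are serialized to a store at key points of the consensus flow and read back on start.

```csharp
using System;
using System.IO;

// Simplified sketch of persisting and restoring consensus state (cut-down
// fields, not the real ConsensusContext): the data needed to resume after a
// restart is written to a byte stream and read back on start.
class ConsensusContextSketch
{
    public uint BlockIndex;
    public byte ViewNumber;
    public bool CommitSent;

    public byte[] Serialize()
    {
        using var ms = new MemoryStream();
        using var writer = new BinaryWriter(ms);
        writer.Write(BlockIndex);
        writer.Write(ViewNumber);
        writer.Write(CommitSent);
        writer.Flush();
        return ms.ToArray();
    }

    public static ConsensusContextSketch Deserialize(byte[] data)
    {
        using var reader = new BinaryReader(new MemoryStream(data));
        return new ConsensusContextSketch
        {
            BlockIndex = reader.ReadUInt32(),
            ViewNumber = reader.ReadByte(),
            CommitSent = reader.ReadBoolean()
        };
    }

    static void Main()
    {
        byte[] log = new ConsensusContextSketch { BlockIndex = 100, ViewNumber = 1, CommitSent = true }.Serialize();
        ConsensusContextSketch restored = Deserialize(log); // "Start from recovery log"
        Console.WriteLine($"height={restored.BlockIndex} view={restored.ViewNumber} commitSent={restored.CommitSent}");
    }
}
```
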

@erikzhang erikzhang added this to the NEO 3.0 milestone Jan 25, 2019

@jsolman (Contributor) commented Jan 25, 2019

@erikzhang I also added saving the consensus context upon changing view; check #575.

@vncoelho (Member, Author) commented Jan 25, 2019

Nice to see this in the NEO 3.0 milestone, @erikzhang; it is well deserved. Thinking about it deeply, the set of changes in this PR summarizes the knowledge, backgrounds, scientific studies, insights, and discussions of the many people involved.

@vncoelho (Member, Author) commented Jan 28, 2019

Perfect, @erikzhang, it seems to be working as expected.

I have not yet checked the Regeneration intensively.
I think we also need a Regeneration strategy for when a node asks for a Change View while the other nodes are already in the commit phase waiting for the final signatures, as done in the other PR.
With this last feature we will probably reach code that is ready for an initial merge into master.

@vncoelho (Member, Author) commented Feb 9, 2019

After #579 is merged, this PR might be ready to be integrated into master.

Considering the recent discussions carried on in that thread, we feel that the Regeneration/Recover mechanism is reaching another level of quality.
In this sense, after this PR is merged we will have an open field for improving the Reason attached to each ChangeView request. With such a description, we will be able to create additional mechanisms that filter those Reasons and recover each node accordingly (allowing view changes only in cases with real inconsistencies).

In connection with a redundant mechanism for replacing Speakers (in case of delays), we may reach another level of consensus in the next phase of improvements, to be carried out after this PR.

Thanks to everyone who has been contributing to this PR.

@jsolman (Contributor) commented Feb 10, 2019

@vncoelho I think we will also want to merge #575 into this along with #579 before merging it to master.

@erikzhang (Member) commented Feb 24, 2019

> One more optimization we could make: if a node sends a ChangeView with a NewViewNumber less than the expected view for that node, a recovery message could be used to send that node its previously sent ChangeView message, so it could immediately jump its expected view forward. Thus, if nodes crash and come back before the nodes agree on changing to a higher view, we can recover faster in that case.

I am afraid that even if the node has previously sent a ChangeView message, it cannot change directly to that view. It must have 2f+1 messages.
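
To make the 2f+1 rule concrete, here is a minimal sketch of the quorum check a node performs before moving to a higher view. The validator count, message array, and names are illustrative; this is not the actual `ConsensusContext` code.

```csharp
using System;
using System.Linq;

// Sketch of the 2f+1 quorum rule for view changes (simplified). With n = 3f + 1
// validators, a node only moves to the new view once ChangeView messages from at
// least M = n - f = 2f + 1 distinct validators point at that view (or higher),
// regardless of what the node itself sent earlier.
class ViewChangeQuorumSketch
{
    static void Main()
    {
        int n = 7;            // validators
        int f = (n - 1) / 3;  // tolerated faulty nodes -> 2
        int m = n - f;        // quorum, 2f + 1 -> 5

        byte newView = 1;
        // Latest NewViewNumber seen from each validator's ChangeView message.
        byte[] expectedViews = { 1, 1, 0, 1, 1, 0, 1 };

        int agreeing = expectedViews.Count(v => v >= newView);
        Console.WriteLine(agreeing >= m
            ? $"change to view {newView}: {agreeing}/{n} >= {m}"
            : $"stay in current view: {agreeing}/{n} < {m}");
    }
}
```
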

@vncoelho (Member, Author) commented Feb 24, 2019

Hi @erikzhang, I agree with you.
I think the case that @jsolman mentioned is not about changing the view itself; it is just about speeding up the change-view request.
From my perspective, it can be done without problems, since the node would already have signed a payload requesting a higher view.
Even if such a case is rare, I believe it happens from time to time (for example, once per day), thus everything we can cover now will benefit us in the future.

But maybe, as Jeff said, we could do these additional improvements in other PRs in the near future and try to merge this one when it is ready.

@jsolman (Contributor) commented Feb 24, 2019

> I am afraid that even if the node has previously sent a ChangeView message, it cannot change directly to that view. It must have 2f+1 messages.

I am not suggesting it change the view directly if it had sent one in the past; I was just suggesting it start its expectedView where it left off after it was restarted. It is an optimization that can wait until later if we decide to implement it.

@erikzhang (Member) commented Feb 25, 2019

The tests NGD is doing are expected to be completed on March 1.

@jsolman (Contributor) commented Feb 25, 2019

> The tests NGD is doing are expected to be completed on March 1.

Great! I will do some additional testing between now and then as well.

@jsolman (Contributor) commented Feb 25, 2019

Testing with #608 looks good so far.

@jsolman (Contributor) left a comment

These changes introduce a networking incompatibility with the existing old clients. See my included comment.

erikzhang added some commits Feb 26, 2019

#609 resolves my last review comments

@vncoelho (Member, Author) commented Feb 26, 2019

[19:12:54.896] chain sync: expected=379 current=354 nodes=7
[19:12:55.175] timeout: height=355 view=0 state=Backup, RequestReceived, ResponseSent, CommitSent
[19:12:55.176] send recovery to resend commit

When a CN is lagging behind and times out, is this message expected?
Maybe it should stop the consensus layer until it is synced, or something like that.

@vncoelho (Member, Author) commented Feb 26, 2019

The current version is online at: https://neocompiler.io/#/ecolab


Click on Consensus Info to follow the consensus logs.
The consensus behavior is quite impressive given the limited computational resources we have there, as well as the number of plugins and tools running there.

@erikzhang, do you think we can update the variable SecondsPerBlock to milliseconds in this PR? This is important for our tests with low-latency IoT devices, which are a promising field of application for private chains.

@jsolman (Contributor) commented Feb 27, 2019

> When a CN is lagging behind and times out, is this message expected? Maybe it should stop the consensus layer until it is synced, or something like that.

Yes, it is expected. From its perspective it hasn't received newer blocks yet and it hasn't received enough commits for the block it is on, so it will periodically send recovery until it either receives enough commits to persist the block it is on, or until it receives the next block that others have already persisted.
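
A stripped-down sketch of that timeout behavior, matching the "send recovery to resend commit" log above; the names and fields are hypothetical, not the real `ConsensusService` timer handler: once a commit has been sent, the node keeps broadcasting recovery instead of changing view.

```csharp
using System;

// Sketch of the timeout path for a lagging CN that has already sent its Commit.
// Once CommitSent is set the node never changes view; it keeps broadcasting a
// recovery message until it collects enough commits or receives the next block.
class TimeoutSketch
{
    uint BlockIndex = 355;
    byte ViewNumber = 0;
    bool CommitSent = true;

    void OnTimeout(uint height, byte view)
    {
        if (height != BlockIndex || view != ViewNumber)
            return; // stale timer, ignore

        if (CommitSent)
        {
            // Matches the "send recovery to resend commit" log line above.
            Console.WriteLine($"timeout: height={height} view={view} -> send recovery to resend commit");
        }
        else
        {
            Console.WriteLine($"timeout: height={height} view={view} -> request ChangeView to view {view + 1}");
        }
    }

    static void Main() => new TimeoutSketch().OnTimeout(355, 0);
}
```
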

@erikzhang (Member) commented Feb 27, 2019

> do you think we can update the variable SecondsPerBlock to milliseconds in this PR?

Can we reach consensus within 1 second?

@vncoelho (Member, Author) commented Feb 27, 2019

I think so, Erik, haha... even less nowadays.
On the other hand, before the refactoring you conducted with Akka, this was something we saw as far away. That move to Akka was incredible and a huge step, achieved with "quite simple" modifications to the consensus communication (we also believe Akka usage can be improved further, as we are discussing in that other thread with @lightszero; even so, it was a precise, quick, and great move).

We are talking about local blockchain networks, for example, a bank, supply-chain enterprises, or a partially centralized database running CNs. There are useful applications for low-latency systems, for example:

  • a drone reporting data during a journey to verify an energy transmission grid;
  • sensors applied to city surveillance;
  • tracking the production of goods in a supply chain.

Such systems might, for example, need only a few transactions per second on a private blockchain, while requiring them to be published as soon as possible.
In this sense, for the sake of publishing certificates, we need something that ensures the information was confirmed inside the network as quickly as possible.
In addition, this quick block persistence has great potential when combined with timestamps provided by hardware-based trusted execution environments (TEEs).

It would be great if we could change it now.

We need to be sure that the exponential view_timeout formula is not affected by the change.
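
Assuming the per-view timeout grows roughly as block_time << (view + 1), here is a sketch of how a millisecond-based setting would feed the same formula; the `MillisecondsPerBlock` name is hypothetical, and the point is only that the exponential back-off survives the unit change if the whole formula is expressed in milliseconds.

```csharp
using System;

// Sketch only: assumes the per-view timeout grows as block_time << (view + 1)
// and that a hypothetical MillisecondsPerBlock setting replaces SecondsPerBlock.
class ViewTimeoutSketch
{
    static TimeSpan TimeoutForView(int millisecondsPerBlock, byte view)
        => TimeSpan.FromMilliseconds(millisecondsPerBlock << (view + 1));

    static void Main()
    {
        foreach (byte view in new byte[] { 0, 1, 2, 3 })
            Console.WriteLine($"block=500 ms, view={view}: timeout={TimeoutForView(500, view).TotalSeconds} s");
        // Prints 1, 2, 4, 8 seconds: the doubling per view is preserved.
    }
}
```
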

@erikzhang (Member) commented Feb 28, 2019

public uint Timestamp;

One difficulty is that Block.Timestamp is accurate to the second. If the consensus time is less than one second, it may cause some problems.
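
A small illustration of that concern: with a `uint` timestamp in whole seconds and blocks persisted faster than once per second, timestamps can only stay strictly increasing by running ahead of the wall clock. The `Max(now, prev + 1)` rule below is a simplification for illustration, not necessarily the exact rule the primary uses.

```csharp
using System;

// Illustration of the second-granularity issue with sub-second block times.
class TimestampSketch
{
    static void Main()
    {
        uint now = 1551312000;          // wall clock, whole seconds
        uint prevBlockTimestamp = now;  // previous block landed in the same second

        uint nextTimestamp = Math.Max(now, prevBlockTimestamp + 1u);
        Console.WriteLine($"next block timestamp: {nextTimestamp}, {nextTimestamp - now} s ahead of real time");
        // Each sub-second block pushes the chain's timestamps one more second into
        // the future, which is why sub-second consensus "may cause some problems".
    }
}
```
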

@vncoelho (Member, Author) commented Feb 28, 2019

Don't worry, Erik. Let's do it later, in a couple more weeks.
Thanks for the support, as always.

@jsolman (Contributor) approved these changes Mar 2, 2019 and left a comment:

Changes look very nice, and it is working well in my testing. I'm looking forward to seeing the results from the NGD testing that should be complete soon.

@erikzhang erikzhang merged commit f88c427 into master Mar 3, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@erikzhang erikzhang deleted the consensus/improved_dbft branch Mar 3, 2019

@igormcoelho (Contributor) commented Mar 4, 2019

A day to celebrate! Congratulations to everyone who worked hard to get this done, especially @vncoelho, @jsolman, @shargon and others... and, not to forget, @erikzhang, as always :)
