Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Consensus Liveness: Count failed nodes to inhibit changing view and request recovery #665

Merged
merged 32 commits into from Apr 1, 2019

Conversation

5 participants
@igormcoelho
Copy link
Contributor

igormcoelho commented Mar 27, 2019

We believe this is an important change to allow change views even some nodes are knocked out due to chain sync and other issues. This allows this situation to evolve:

  • 1 node committed
  • 1 node fully failed
  • 2 nodes changing view

Usually, the 2 nodes would be locked in ViewChanging state, depending on the return of the failed node. However, now they will know that a node is failed (because no message has been received from it recently), and they will ignore it from the view changing count.

An extremely unlikely case would be: node has just failed on the same height (so it won't be detected as failed), an the unlikely situation of one committed and others viewchanging happen on that exact height. But we can fix this too, changing from height count to last received message time (not necessary now in our opinion).

@igormcoelho igormcoelho requested a review from erikzhang Mar 27, 2019

@igormcoelho igormcoelho assigned jsolman and unassigned jsolman Mar 27, 2019

@igormcoelho igormcoelho requested a review from jsolman Mar 27, 2019

@erikzhang

This comment has been minimized.

Copy link
Member

erikzhang commented Mar 27, 2019

I have no time to review this pr. Waiting for @jsolman 😆

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 27, 2019

Looks finished @erikzhang, this looks like to be the best solution! It was a good insight of the meeting that happened some hour ago with @igormcoelho, @PeterLinX, @shargon and @belane.
@jsolman told that he will try to revise later today.

@erikzhang

This comment has been minimized.

Copy link
Member

erikzhang commented Mar 27, 2019

I'm working on NeoVM and pricing model.

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 27, 2019

Keep prices low, @erikzhang. aeahueaea :D

@jsolman
Copy link
Contributor

jsolman left a comment

These changes look good and safe to me. I will expect to approve them once some testing has occurred.

Show resolved Hide resolved neo/Consensus/Helper.cs
Show resolved Hide resolved neo/Consensus/ConsensusService.cs Outdated
Show resolved Hide resolved neo/Consensus/ConsensusService.cs
@erikzhang

This comment has been minimized.

Copy link
Member

erikzhang commented Mar 27, 2019

Once you have finished it, I will ask NGD team to test.

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 27, 2019

Ensuring continuous ping between CN. This was critical for genesis block initialization and maybe other future scenarios.

[15:22:42.622] OnStart
[15:22:42.711] initialize: height=1 view=0 index=2 role=Backup
[15:22:44.720] timeout: height=1 view=0
[15:22:44.739] Node will not request change view to nv=1 because nc=0 nf=3
[15:22:45.010] OnRecoveryMessageReceived: height=1 view=0 index=0
[15:22:45.025] OnRecoveryMessageReceived: finished (valid/total) ChgView: 0/0 PrepReq: 0/0 PrepResp: 0/0 Commits: 0/0
[15:22:48.258] OnChangeViewReceived: height=1 view=0 index=0 nv=1
[15:22:48.277] send recovery from view: 0 to view: 0
[15:22:48.324] OnRecoveryMessageReceived: height=1 view=0 index=2
[15:22:48.325] OnRecoveryMessageReceived: finished (valid/total) ChgView: 0/0 PrepReq: 0/0 PrepResp: 0/0 Commits: 0/0
[15:22:48.707] OnChangeViewReceived: height=1 view=0 index=3 nv=1
[15:22:48.738] timeout: height=1 view=0
[15:22:48.738] request change view: height=1 view=0 nv=1 nc=0 nf=0
[15:22:48.740] changeview: view=1 primary=02a7bc55fe8684e0119768d104ba30795bdcc86619e864add26156723ed185cd62
[15:22:48.740] initialize: height=1 view=1 index=2 role=Backup
[15:22:48.826] OnPrepareRequestReceived: height=1 view=1 index=0 tx=2
[15:22:48.860] send prepare response
[15:22:48.917] OnPrepareResponseReceived: height=1 view=1 index=3
[15:22:48.920] send commit
[15:22:48.971] OnCommitReceived: height=1 view=1 index=0 nc=1 nf=0
[15:22:48.983] OnCommitReceived: height=1 view=1 index=3 nc=2 nf=0
[15:22:49.013] relay block: height=1 hash=0x3375076f83ffce8be03aafbea6dd572b98be99e535949b00dec19caaad758cce tx=2
[15:22:49.054] persist block: height=1 hash=0x3375076f83ffce8be03aafbea6dd572b98be99e535949b00dec19caaad758cce tx=2

vncoelho added some commits Mar 27, 2019

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 27, 2019

The consensus looks to be much more stable here.

image

image

igormcoelho added some commits Mar 27, 2019

@igormcoelho igormcoelho changed the title Counting failed node to weaken change view Counting failed node to weaken change view and fix changeview increase Mar 27, 2019

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 29, 2019

Another possible case that could happen could be to receive a Preparation from last view and then accepting a changeview from a lower view.

For this reason, GetLastExpectedView function has been simplified in the last commit 343497f

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 29, 2019

In fact, NewViewNumber variable can now be abolished. Nodes just ask for the next view, that is, viewNumber++.

@igormcoelho, this is an important simplification also of this PR!
Removing problems and simplifying.
Congratulations for the great insights.

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 29, 2019

In the ChangeView.cs we can just remove it from Deserialize, Serialize and make it static to ViewNumber++

Show resolved Hide resolved neo/Consensus/ConsensusService.cs
Show resolved Hide resolved neo/Consensus/ConsensusService.cs Outdated
@erikzhang

This comment has been minimized.

Copy link
Member

erikzhang commented Mar 29, 2019

In fact, NewViewNumber variable can now be abolished. Nodes just ask for the next view, that is, viewNumber++.

In this case, you can remove RecoveryMessage.ChangeViewPayloadCompact.OriginalViewNumber too.

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 29, 2019

Sounds good, @erikzhang. But without that it maybe will become hard to reproduce the ChangeView.
I think it is possible to remove as you said. But we need to draw this logic again.
Lets make this as soon as possible. I will not be able to make it today.

@shargon shargon self-requested a review Mar 29, 2019

@vncoelho

This comment has been minimized.

Copy link
Member

vncoelho commented Mar 29, 2019

@jsolman, we gonna remove the Recovery Size calculation and add it in another PR, ok?

vncoelho added some commits Mar 29, 2019

Setting NewViewNumber to static get (#670)
* Setting NewViewNumber to static get

* Pushing RecoveryMessage file that was not saved

* Fixing UT_Consensus

* Update neo/Consensus/ChangeView.cs

Co-Authored-By: shargon <shargon@gmail.com>

* Update ChangeView.cs

shargon added a commit that referenced this pull request Mar 29, 2019

Fix calculation size for RecoveryMessage
Original from @jsolman but moved to a separated Pull Request
#665

@shargon shargon self-requested a review Mar 29, 2019

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int CountCommitted(this IConsensusContext context) => context.CommitPayloads.Count(p => p != null);
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int CountFailed(this IConsensusContext context) => context.LastSeenMessage.Count(p => p < (((int) context.BlockIndex) - 1));

This comment has been minimized.

Copy link
@jsolman

jsolman Mar 29, 2019

Contributor

How about if timestamp is older than 5 minutes here we can also consider failed? I expect we can add another timer also requesting recovery every block time When appropriate. Then even the edge cases will be fixed for a node dieing where it could stall.

This comment has been minimized.

Copy link
@jsolman

jsolman Mar 30, 2019

Contributor

We can make this change in a subsequent PR.

erikzhang added a commit that referenced this pull request Mar 30, 2019

Fix calculation size for RecoveryMessage (#671)
Original from @jsolman but moved to a separated Pull Request
#665
@jsolman
Copy link
Contributor

jsolman left a comment

It has been running without stalling for 15 hours+ with the P2P plugin dropping 25% of prepare requests, 50% of prepare responses, and 50% of commits. Looks good!

In a subsequent PR we can add additional improvements for periodically requesting recovery and for using the last timestamp to improve considering nodes as lost.

@vncoelho
Copy link
Member

vncoelho left a comment

Approved with praise!

Once more, congratulations to all of us, for the great effort and dedication, in special @erikzhang, brotherrr @igormcoelho, @PeterLinX, @shargon, @jsolman, @belane, @wanglongfei88, @EdgeDLT @toghrulmaharramov @KickSeason, @joeqian10 @jessicaAtNeo, and many others, for motivating, helping and directing positive vibrations for making this a reality.
There is more to come, may the force be with you..aehuahuea 💃 👯‍♂️ 👯

@igormcoelho

This comment has been minimized.

Copy link
Contributor Author

igormcoelho commented Mar 31, 2019

Another step towards a consensus protocol 2.0 that is solid as a rock! Congratulations to all :)

@jsolman jsolman merged commit 52894db into master Apr 1, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@shargon shargon deleted the countfailed branch Apr 1, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.