-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Continue building blocks as liveness fallback in case of epoch failure #1250
Conversation
…-go into jordan/continue-failed-epoch
Codecov Report
@@ Coverage Diff @@
## master #1250 +/- ##
==========================================
- Coverage 56.28% 56.22% -0.06%
==========================================
Files 497 497
Lines 30327 30384 +57
==========================================
+ Hits 17069 17083 +14
- Misses 10945 10976 +31
- Partials 2313 2325 +12
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. Tried to help out by adding goDoc and consolidating the code a bit: please see my PR #1255 (targeting this branch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall approach looks good to me for a temporary solution. Have some ideas, what we could do as a bit more mature approach (follow up work).
…:onflow/flow-go into alex/continue-failed-epoch_-_suggestions
…gestions EECC suggestions
return flow.ZeroID, fmt.Errorf("could not compute epoch fallback leader selection: %w", err) | ||
} | ||
c.mu.Lock() | ||
c.leaders[counter] = selection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how often would this update happen?
I think we should generate the leader selection only once. Even for the EECC epoch. How to ensure that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will happen exactly once, the first time we attempt to retrieve the leader for a view which falls outside of a committed epoch. The reasoning for why this is true is in
flow-go/consensus/hotstuff/committees/consensus_committee.go
Lines 103 to 130 in 24706fe
if !errors.Is(err, errSelectionNotComputed) { | |
return flow.ZeroID, err | |
} | |
// we only reach the following code, if we got a errSelectionNotComputed | |
// STEP 2 - we haven't yet computed leader selection for an epoch containing | |
// the requested view. We compute leader selection for the current and previous | |
// epoch (w.r.t. the finalized head) at initialization then compute leader | |
// selection for the next epoch when we encounter any view for which we don't | |
// know the leader. The series of epochs we have computed leaders for is | |
// strictly consecutive, meaning we know the leader for all views V where: | |
// | |
// oldestEpoch.firstView <= V <= newestEpoch.finalView | |
// | |
// Thus, the requested view is either before oldestEpoch.firstView or after | |
// newestEpoch.finalView. | |
// | |
// CASE 1: V < oldestEpoch.firstView | |
// If the view is before the first view we've computed the leader for, this | |
// represents an invalid query because we only guarantee the protocol state | |
// will contain epoch information for the current, previous, and next epoch - | |
// such a query must be for a view within an epoch at least TWO epochs before | |
// the current epoch when we started up. This is considered an invalid query. | |
// | |
// CASE 2: V > newestEpoch.finalView | |
// If the view is after the last view we've computed the leader for, we | |
// assume the view is within the next epoch (w.r.t. the finalized head). | |
// This assumption is equivalent to assuming that we build at least one |
selection, err := leader.ComputeLeaderSelectionFromSeed( | ||
firstView, | ||
seed, | ||
int(firstView+leader.EstimatedSixMonthOfViews), // the fallback epoch lasts until the next spork |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm my understanding:
- Once entering EECC, then it won't enter another EECC?
2.Once entering EECC, there will be no DKG in EECC - Once entering EECC, the Epoch phase will stay at Staking Phase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once entering EECC, then it won't enter another EECC?
Once entering EECC, there will be no DKG in EECC
Once entering EECC, the Epoch phase will stay at Staking Phase
These are all correct.
// If the given view is within the bounds of the next epoch, and the epoch | ||
// has not been set up or committed, we pretend that we are still in the | ||
// current epoch and return that epoch's counter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are still in the current epoch during EECC, why we do still generate leaders for next epoch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation allows it, and it's a simpler change than trying to replace the existing leader selection for the current epoch which is being extended with EECC.
Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>
This is ready for another look @zhangchiqing |
if view > currentFinalView { | ||
_, err := next.DKG() // either of the following errors indicates that we have transitioned into EECC | ||
if errors.Is(err, protocol.ErrEpochNotCommitted) || errors.Is(err, protocol.ErrNextEpochNotSetup) { | ||
return current.Counter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to return the error, and make another function like EpochForSignerView
for the SignerStore to call.
Because different callers will expect different epoch counters for views that have no epoch committed:
- the signer store expects EpochForView to return the current epoch
- the leader selection expects EpochForView to returns the next epoch
Better we let signer store to call EpochForSignerView
and let leader selection to call EpochForView
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Added in 7e85f17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good 👍
bors merge |
This PR implements a fallback mechanism to continue block production after the specified end view of an epoch, if the next epoch has not been successfully set up or committed.
Changes
EpochStatus
of its parent, so that it is treated as part of the last epoch