
Continue building blocks as liveness fallback in case of epoch failure #1250

Merged 36 commits into master from jordan/continue-failed-epoch on Sep 8, 2021

Conversation

@jordanschalm (Member) commented Sep 3, 2021

This PR implements a fallback mechanism to continue block production after the specified end view of an epoch, if the next epoch has not been successfully set up or committed.

Changes

  • When building the first (and all subsequent) blocks of an epoch, if that epoch has not been set up or committed, store the block with the EpochStatus of its parent, so that it is treated as part of the last epoch
  • Create a fallback leader selection using the config for the last epoch, if the next epoch has not been set up or committed
  • Continue using our last epoch's random beacon key to sign block messages, if the next epoch has not been set up or committed
  • Ignore all service events, if we have entered a state of emergency chain continuation
  • Avoid emitting epoch transition and epoch phase transition protocol events, if we have entered a state of emergency chain continuation

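The first bullet above is the core of the fallback. As a rough illustration, here is a minimal sketch of that rule; the types and function names are hypothetical simplifications, not the actual flow-go structures:

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the real flow-go EpochStatus.
type EpochStatus struct {
	CurrentEpoch  uint64 // counter of the epoch this block belongs to
	NextCommitted bool   // whether the next epoch has been set up and committed
}

// statusForNewBlock sketches the fallback rule: a block whose view lies past
// the current epoch's final view normally starts the next epoch, but if that
// epoch was never committed, the block inherits its parent's EpochStatus and
// is treated as part of the last epoch (emergency chain continuation).
func statusForNewBlock(parent EpochStatus, view, currentFinalView uint64) EpochStatus {
	if view <= currentFinalView {
		return parent // still within the current epoch
	}
	if !parent.NextCommitted {
		return parent // EECC: keep extending the last committed epoch
	}
	return EpochStatus{CurrentEpoch: parent.CurrentEpoch + 1}
}

func main() {
	parent := EpochStatus{CurrentEpoch: 5, NextCommitted: false}
	fmt.Println(statusForNewBlock(parent, 1001, 1000).CurrentEpoch) // stays at 5 under EECC
}
```

Because the very first block past the final view inherits its parent's status, every descendant of that block does too, which is what makes the fallback apply to "the first (and all subsequent) blocks".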
@jordanschalm jordanschalm marked this pull request as ready for review September 3, 2021 20:15
@codecov-commenter commented Sep 3, 2021

Codecov Report

Merging #1250 (4d7b4d5) into master (890803b) will decrease coverage by 0.05%.
The diff coverage is 64.70%.


@@            Coverage Diff             @@
##           master    #1250      +/-   ##
==========================================
- Coverage   56.28%   56.22%   -0.06%     
==========================================
  Files         497      497              
  Lines       30327    30384      +57     
==========================================
+ Hits        17069    17083      +14     
- Misses      10945    10976      +31     
- Partials     2313     2325      +12     
Flag Coverage Δ
unittests 56.22% <64.70%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
model/flow/epoch.go 42.70% <0.00%> (-0.95%) ⬇️
state/protocol/inmem/epoch.go 80.00% <ø> (ø)
...nsensus/hotstuff/committees/consensus_committee.go 61.76% <63.63%> (-5.38%) ⬇️
module/epochs/epoch_lookup.go 54.54% <64.70%> (+1.91%) ⬆️
state/protocol/badger/mutator.go 64.42% <73.33%> (-0.97%) ⬇️
module/signature/signer_store.go 50.00% <100.00%> (ø)
...sus/approvals/assignment_collector_statemachine.go 42.30% <0.00%> (-9.62%) ⬇️
engine/collection/synchronization/engine.go 62.90% <0.00%> (-1.08%) ⬇️
cmd/util/ledger/migrations/storage_v4.go 41.56% <0.00%> (-0.61%) ⬇️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jordanschalm jordanschalm changed the title Investigation: Continue Failed Epoch Continue building blocks as liveness fallback in case of epoch failure Sep 3, 2021
@AlexHentschel AlexHentschel mentioned this pull request Sep 4, 2021
@AlexHentschel (Member) left a comment

Looks good. I tried to help out by adding GoDoc and consolidating the code a bit; please see my PR #1255 (targeting this branch).

Resolved review threads on: state/protocol/badger/mutator.go, module/epochs/epoch_lookup.go (×3), consensus/hotstuff/committees/consensus_committee.go
@AlexHentschel (Member) left a comment

Overall, the approach looks good to me as a temporary solution. I have some ideas for a more mature approach as follow-up work.

Resolved review threads on: state/protocol/badger/mutator.go (×2)
return flow.ZeroID, fmt.Errorf("could not compute epoch fallback leader selection: %w", err)
}
c.mu.Lock()
c.leaders[counter] = selection
Member:

How often would this update happen?
I think we should generate the leader selection only once, even for the EECC epoch. How do we ensure that?

Member Author:

It will happen exactly once: the first time we attempt to retrieve the leader for a view which falls outside of a committed epoch. The reasoning for why this is true is in

if !errors.Is(err, errSelectionNotComputed) {
return flow.ZeroID, err
}
// we only reach the following code, if we got a errSelectionNotComputed
// STEP 2 - we haven't yet computed leader selection for an epoch containing
// the requested view. We compute leader selection for the current and previous
// epoch (w.r.t. the finalized head) at initialization then compute leader
// selection for the next epoch when we encounter any view for which we don't
// know the leader. The series of epochs we have computed leaders for is
// strictly consecutive, meaning we know the leader for all views V where:
//
// oldestEpoch.firstView <= V <= newestEpoch.finalView
//
// Thus, the requested view is either before oldestEpoch.firstView or after
// newestEpoch.finalView.
//
// CASE 1: V < oldestEpoch.firstView
// If the view is before the first view we've computed the leader for, this
// represents an invalid query because we only guarantee the protocol state
// will contain epoch information for the current, previous, and next epoch -
// such a query must be for a view within an epoch at least TWO epochs before
// the current epoch when we started up. This is considered an invalid query.
//
// CASE 2: V > newestEpoch.finalView
// If the view is after the last view we've computed the leader for, we
// assume the view is within the next epoch (w.r.t. the finalized head).
// This assumption is equivalent to assuming that we build at least one
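The compute-once behaviour discussed in this thread can be sketched roughly as follows. This is a hypothetical simplification (the type, field, and method names are invented for illustration, and real leader selection is weighted by stake, not uniform):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// committee is a hypothetical stand-in for the consensus committee: leader
// selection for a given epoch counter is generated at most once, on the
// first query for a view inside that epoch; later queries hit the cache.
type committee struct {
	mu      sync.Mutex
	leaders map[uint64][]int // epoch counter -> leader index per view offset
}

func (c *committee) selectionFor(counter uint64, numViews int, seed int64) []int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if sel, ok := c.leaders[counter]; ok {
		return sel // already computed: never regenerated, even for the EECC epoch
	}
	rng := rand.New(rand.NewSource(seed))
	sel := make([]int, numViews)
	for i := range sel {
		sel[i] = rng.Intn(4) // pretend there are 4 consensus nodes
	}
	c.leaders[counter] = sel
	return sel
}

func main() {
	c := &committee{leaders: make(map[uint64][]int)}
	first := c.selectionFor(6, 10, 1)
	again := c.selectionFor(6, 10, 99) // different seed, but the cache wins
	fmt.Println(&first[0] == &again[0])
}
```

The second call returns the same cached slice regardless of its arguments, which is what guarantees the selection is computed exactly once per epoch counter.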

selection, err := leader.ComputeLeaderSelectionFromSeed(
firstView,
seed,
int(firstView+leader.EstimatedSixMonthOfViews), // the fallback epoch lasts until the next spork
Member:

Just to confirm my understanding:

  1. Once entering EECC, it won't enter another EECC?
  2. Once entering EECC, there will be no DKG in EECC?
  3. Once entering EECC, the epoch phase will stay at the Staking Phase?

Member Author:

Once entering EECC, then it won't enter another EECC?
Once entering EECC, there will be no DKG in EECC
Once entering EECC, the Epoch phase will stay at Staking Phase

These are all correct.

Comment on lines +38 to +40
// If the given view is within the bounds of the next epoch, and the epoch
// has not been set up or committed, we pretend that we are still in the
// current epoch and return that epoch's counter.
Member:

If we are still in the current epoch during EECC, why do we still generate leaders for the next epoch?

Member Author:

The implementation allows it, and it's a simpler change than trying to replace the existing leader selection for the current epoch, which is being extended under EECC.

@jordanschalm (Member Author)

This is ready for another look @zhangchiqing

if view > currentFinalView {
_, err := next.DKG() // either of the following errors indicates that we have transitioned into EECC
if errors.Is(err, protocol.ErrEpochNotCommitted) || errors.Is(err, protocol.ErrNextEpochNotSetup) {
return current.Counter()
Member:

I would suggest returning the error, and adding another function like EpochForSignerView for the SignerStore to call.

Because different callers expect different epoch counters for views that have no committed epoch:

  • the signer store expects EpochForView to return the current epoch
  • the leader selection expects EpochForView to return the next epoch

Better to let the signer store call EpochForSignerView and the leader selection call EpochForView.

Member Author:

👍 Added in 7e85f17

@zhangchiqing (Member) left a comment

Looks good 👍

@jordanschalm (Member Author)

bors merge

@bors bot commented Sep 8, 2021

bors bot merged commit e838255 into master Sep 8, 2021
bors bot deleted the jordan/continue-failed-epoch branch September 8, 2021 01:57