
Continue building blocks as liveness fallback in case of epoch failure #1250

Merged 36 commits into master from jordan/continue-failed-epoch on Sep 8, 2021

Conversation

@jordanschalm (Member) commented Sep 3, 2021

This PR implements a fallback mechanism to continue block production after the specified end view of an epoch, if the next epoch has not been successfully set up or committed.

Changes

  • When building the first (and all subsequent) blocks of an epoch, if that epoch has not been set up or committed, store the block with the EpochStatus of its parent, so that it is treated as part of the last epoch
  • Create a fallback leader selection using the config for the last epoch, if the next epoch has not been set up or committed
  • Continue using our last epoch's random beacon key to sign block messages, if the next epoch has not been set up or committed
  • Ignore all service events, if we have entered a state of emergency chain continuation
  • Avoid emitting epoch transition and epoch phase transition protocol events, if we have entered a state of emergency chain continuation

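The first bullet above is the core of the fallback. As a rough illustration, here is a minimal sketch of that rule; the types and function names are hypothetical simplifications, not the actual flow-go structures:

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the real flow-go EpochStatus.
type EpochStatus struct {
	CurrentEpoch  uint64 // counter of the epoch this block belongs to
	NextCommitted bool   // whether the next epoch has been set up and committed
}

// statusForNewBlock sketches the fallback rule: a block whose view lies past
// the current epoch's final view normally starts the next epoch, but if that
// epoch was never committed, the block inherits its parent's EpochStatus and
// is treated as part of the last epoch (emergency chain continuation).
func statusForNewBlock(parent EpochStatus, view, currentFinalView uint64) EpochStatus {
	if view <= currentFinalView {
		return parent // still within the current epoch
	}
	if !parent.NextCommitted {
		return parent // EECC: keep extending the last committed epoch
	}
	return EpochStatus{CurrentEpoch: parent.CurrentEpoch + 1}
}

func main() {
	parent := EpochStatus{CurrentEpoch: 5, NextCommitted: false}
	fmt.Println(statusForNewBlock(parent, 1001, 1000).CurrentEpoch) // stays at 5 under EECC
}
```

Because the very first block past the final view inherits its parent's status, every descendant of that block does too, which is what makes the fallback apply to "the first (and all subsequent) blocks".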
@jordanschalm jordanschalm marked this pull request as ready for review September 3, 2021 20:15
@codecov-commenter commented Sep 3, 2021

Codecov Report

Merging #1250 (4d7b4d5) into master (890803b) will decrease coverage by 0.05%.
The diff coverage is 64.70%.


@@            Coverage Diff             @@
##           master    #1250      +/-   ##
==========================================
- Coverage   56.28%   56.22%   -0.06%     
==========================================
  Files         497      497              
  Lines       30327    30384      +57     
==========================================
+ Hits        17069    17083      +14     
- Misses      10945    10976      +31     
- Partials     2313     2325      +12     
Flag Coverage Δ
unittests 56.22% <64.70%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
model/flow/epoch.go 42.70% <0.00%> (-0.95%) ⬇️
state/protocol/inmem/epoch.go 80.00% <ø> (ø)
...nsensus/hotstuff/committees/consensus_committee.go 61.76% <63.63%> (-5.38%) ⬇️
module/epochs/epoch_lookup.go 54.54% <64.70%> (+1.91%) ⬆️
state/protocol/badger/mutator.go 64.42% <73.33%> (-0.97%) ⬇️
module/signature/signer_store.go 50.00% <100.00%> (ø)
...sus/approvals/assignment_collector_statemachine.go 42.30% <0.00%> (-9.62%) ⬇️
engine/collection/synchronization/engine.go 62.90% <0.00%> (-1.08%) ⬇️
cmd/util/ledger/migrations/storage_v4.go 41.56% <0.00%> (-0.61%) ⬇️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jordanschalm jordanschalm changed the title Investigation: Continue Failed Epoch Continue building blocks as liveness fallback in case of epoch failure Sep 3, 2021
@AlexHentschel AlexHentschel mentioned this pull request Sep 4, 2021
@AlexHentschel (Member) left a comment

Looks good. I tried to help out by adding GoDoc and consolidating the code a bit; please see my PR #1255 (targeting this branch).

Resolved review threads on: state/protocol/badger/mutator.go, module/epochs/epoch_lookup.go (×3), consensus/hotstuff/committees/consensus_committee.go
@AlexHentschel (Member) left a comment

Overall, the approach looks good to me as a temporary solution. I have some ideas for a more mature approach as follow-up work.

Resolved review threads on: state/protocol/badger/mutator.go (×2)
return flow.ZeroID, fmt.Errorf("could not compute epoch fallback leader selection: %w", err)
}
c.mu.Lock()
c.leaders[counter] = selection
Member:

How often would this update happen?
I think we should generate the leader selection only once, even for the EECC epoch. How do we ensure that?

Member Author:

It will happen exactly once: the first time we attempt to retrieve the leader for a view which falls outside of a committed epoch. The reasoning for why this is true is in

if !errors.Is(err, errSelectionNotComputed) {
return flow.ZeroID, err
}
// we only reach the following code, if we got a errSelectionNotComputed
// STEP 2 - we haven't yet computed leader selection for an epoch containing
// the requested view. We compute leader selection for the current and previous
// epoch (w.r.t. the finalized head) at initialization then compute leader
// selection for the next epoch when we encounter any view for which we don't
// know the leader. The series of epochs we have computed leaders for is
// strictly consecutive, meaning we know the leader for all views V where:
//
// oldestEpoch.firstView <= V <= newestEpoch.finalView
//
// Thus, the requested view is either before oldestEpoch.firstView or after
// newestEpoch.finalView.
//
// CASE 1: V < oldestEpoch.firstView
// If the view is before the first view we've computed the leader for, this
// represents an invalid query because we only guarantee the protocol state
// will contain epoch information for the current, previous, and next epoch -
// such a query must be for a view within an epoch at least TWO epochs before
// the current epoch when we started up. This is considered an invalid query.
//
// CASE 2: V > newestEpoch.finalView
// If the view is after the last view we've computed the leader for, we
// assume the view is within the next epoch (w.r.t. the finalized head).
// This assumption is equivalent to assuming that we build at least one
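The compute-once behaviour discussed in this thread can be sketched roughly as follows. This is a hypothetical simplification (the type, field, and method names are invented for illustration, and real leader selection is weighted by stake, not uniform):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// committee is a hypothetical stand-in for the consensus committee: leader
// selection for a given epoch counter is generated at most once, on the
// first query for a view inside that epoch; later queries hit the cache.
type committee struct {
	mu      sync.Mutex
	leaders map[uint64][]int // epoch counter -> leader index per view offset
}

func (c *committee) selectionFor(counter uint64, numViews int, seed int64) []int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if sel, ok := c.leaders[counter]; ok {
		return sel // already computed: never regenerated, even for the EECC epoch
	}
	rng := rand.New(rand.NewSource(seed))
	sel := make([]int, numViews)
	for i := range sel {
		sel[i] = rng.Intn(4) // pretend there are 4 consensus nodes
	}
	c.leaders[counter] = sel
	return sel
}

func main() {
	c := &committee{leaders: make(map[uint64][]int)}
	first := c.selectionFor(6, 10, 1)
	again := c.selectionFor(6, 10, 99) // different seed, but the cache wins
	fmt.Println(&first[0] == &again[0])
}
```

The second call returns the same cached slice regardless of its arguments, which is what guarantees the selection is computed exactly once per epoch counter.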

selection, err := leader.ComputeLeaderSelectionFromSeed(
firstView,
seed,
int(firstView+leader.EstimatedSixMonthOfViews), // the fallback epoch lasts until the next spork
Member:

Just to confirm my understanding:

  1. Once entering EECC, it won't enter another EECC?
  2. Once entering EECC, there will be no DKG in EECC?
  3. Once entering EECC, the epoch phase will stay at the Staking Phase?

Member Author:

Once entering EECC, then it won't enter another EECC?
Once entering EECC, there will be no DKG in EECC
Once entering EECC, the Epoch phase will stay at Staking Phase

These are all correct.

Comment on lines +38 to +40
// If the given view is within the bounds of the next epoch, and the epoch
// has not been set up or committed, we pretend that we are still in the
// current epoch and return that epoch's counter.
Member:

If we are still in the current epoch during EECC, why do we still generate leaders for the next epoch?

Member Author:

The implementation allows it, and it's a simpler change than trying to replace the existing leader selection for the current epoch, which is being extended under EECC.

@jordanschalm (Member Author)

This is ready for another look @zhangchiqing

if view > currentFinalView {
_, err := next.DKG() // either of the following errors indicates that we have transitioned into EECC
if errors.Is(err, protocol.ErrEpochNotCommitted) || errors.Is(err, protocol.ErrNextEpochNotSetup) {
return current.Counter()
Member:

I would suggest returning the error, and adding another function like EpochForSignerView for the SignerStore to call.

Because different callers expect different epoch counters for views that have no committed epoch:

  • the signer store expects EpochForView to return the current epoch
  • the leader selection expects EpochForView to return the next epoch

Better to let the signer store call EpochForSignerView and the leader selection call EpochForView.

Member Author:

👍 Added in 7e85f17

@zhangchiqing (Member) left a comment

Looks good 👍

@jordanschalm (Member Author)

bors merge

@bors bot commented Sep 8, 2021

bors bot merged commit e838255 into master Sep 8, 2021
bors bot deleted the jordan/continue-failed-epoch branch September 8, 2021 01:57