[EFM Recovery] Dynamic Protocol State injects `EpochExtension`s #5773

durkmurder · 2024-04-24T19:13:45Z

Context

This PR adds the data types and logic for extending the current epoch, using an approach based on EpochExtensions.

It is implemented by extending EpochStateContainer in the Dynamic Protocol State with an extra field, EpochExtensions.
The epochs.FallbackStateMachine implements the logic to add epoch extensions when needed.

There are several rules for when and how an epoch is extended:

When entering EFM, we check whether the next epoch has been committed. If not, we drop whatever intermediate state was previously set up for the next epoch.
If the next epoch has been committed, no epoch extensions are added to the current epoch, since we want to transition to the next epoch and only then add extensions.
If the next epoch has not been committed and we have reached a view v such that v + commitSafetyThreshold >= currentEpoch.FinalView, we add an epoch extension to the current epoch.

When following these rules, one should never reach a state where there is both an epoch extension and a next epoch: epoch extensions are only added to the current epoch when no next epoch is set up (or no longer set up).
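
The three rules above can be read as a small decision procedure. The following Go sketch is only an illustration under stated assumptions: `epochState`, `extension`, and `applyFallbackRules` are hypothetical names, not the flow-go data model, and the choice that an extension starts at the view after the current final view is an assumption made for the example.

```go
package main

import "fmt"

// extension is a hypothetical stand-in for an epoch extension (not the flow-go type).
type extension struct {
	firstView, finalView uint64
}

// epochState is a hypothetical, heavily simplified model of the epoch state.
type epochState struct {
	finalView          uint64 // final view of the current epoch, including extensions
	nextEpochSetup     bool   // intermediate state for the next epoch exists
	nextEpochCommitted bool   // next epoch has been committed
	extensions         []extension
}

// applyFallbackRules applies the three rules from the description for a
// candidate block at `view` and returns the updated copy of the state.
func applyFallbackRules(s epochState, view, safetyThreshold, extensionLen uint64) epochState {
	if s.nextEpochCommitted {
		// Rule 2: a committed next epoch is kept and no extension is added.
		return s
	}
	// Rule 1: drop any uncommitted intermediate state for the next epoch.
	s.nextEpochSetup = false
	// Rule 3: extend once view + threshold reaches the current final view.
	if view+safetyThreshold >= s.finalView {
		// Assumption: the extension begins right after the current final view.
		ext := extension{firstView: s.finalView + 1, finalView: s.finalView + extensionLen}
		// Copy before appending so the caller's slice is never modified.
		s.extensions = append(append([]extension(nil), s.extensions...), ext)
		s.finalView = ext.finalView
	}
	return s
}

func main() {
	parent := epochState{finalView: 1000, nextEpochSetup: true}
	for _, view := range []uint64{899, 900} {
		out := applyFallbackRules(parent, view, 100, 500)
		fmt.Printf("view=%d extensions=%d finalView=%d nextEpochSetup=%t\n",
			view, len(out.extensions), out.finalView, out.nextEpochSetup)
	}
}
```

In this model the invariant stated above holds by construction: an extension is only ever appended on the code path where the next epoch's state has already been cleared.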

❗ Note that there is no notion of transitioning between epochs once we have entered EFM: extensions are simply added to the state, but no actual transitions happen.

jordanschalm · 2024-04-25T15:30:32Z

state/protocol/protocol_state/epochs/fallback_statemachine.go

+		if !nextEpochCommitted {
+			state.NextEpoch = nil
+		}


🤔 I think we need to spend more time on how this case is handled for downstream components. Not necessarily in this PR.

The current model is that once you enter EpochPhaseSetup you can access all the EpochSetup data forever, and once you enter EpochPhaseCommitted you can access all the EpochCommit data forever. (The second part is still true, but the first part isn't true any more with EFM Recovery.) The phase state machine is strictly:

|-> STAKING -> SETUP -> COMMITTED -|
|----------------------------------|

As it stands, I can call state.AtHeight(N).Epochs().Next()... and get a return value, then later call state.AtHeight(N+1).Epochs().Next()... and get ErrNextEpochNotSetup, which is a potentially breaking change.

Similarly I can observe counter=N,phase=EpochPhaseSetup in block K, then observe counter=N,phase=EpochPhaseStaking in block K+1 which is currently impossible.

Some thoughts:

Minimally, we should document this change in the relevant APIs (protocol.Epoch). Downstream components need to be aware of the newly possible "EpochSetup retraction" case and handle it properly.

We could also add a new epoch phase to capture the EFM state (see this Open Question). My gut feeling is this approach would result in the most comprehensible data model, but will be a substantial additional chunk of work.
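
One way a downstream component might cope with the "EpochSetup retraction" case is sketched below in Go. The types `snapshot` and `tracker` and the error value `errNextEpochNotSetup` are hypothetical stand-ins defined locally for illustration; they are not the flow-go `protocol.Epoch` API, and a real component would need to decide what cleanup a retraction requires.

```go
package main

import (
	"errors"
	"fmt"
)

// errNextEpochNotSetup is a local stand-in for the sentinel error discussed above.
var errNextEpochNotSetup = errors.New("next epoch not set up")

// snapshot is a hypothetical view of the epoch state at one block height.
type snapshot struct {
	nextEpochCounter *uint64 // nil when no next epoch is set up at this height
}

// nextEpoch returns the next epoch's counter or errNextEpochNotSetup.
func (s snapshot) nextEpoch() (uint64, error) {
	if s.nextEpochCounter == nil {
		return 0, errNextEpochNotSetup
	}
	return *s.nextEpochCounter, nil
}

// tracker caches the next epoch it has observed, as a downstream component might.
type tracker struct {
	cached *uint64
}

// observe updates the cache from a snapshot. It reports retracted=true when a
// previously observed next epoch is no longer available, so the caller can
// discard anything it derived from the retracted EpochSetup.
func (t *tracker) observe(s snapshot) (retracted bool, err error) {
	counter, err := s.nextEpoch()
	if errors.Is(err, errNextEpochNotSetup) {
		retracted = t.cached != nil
		t.cached = nil
		return retracted, nil
	}
	if err != nil {
		return false, err
	}
	t.cached = &counter
	return false, nil
}

func main() {
	next := uint64(7)
	tr := &tracker{}
	// Height N: next epoch is visible. Height N+1: it has been retracted.
	for i, s := range []snapshot{{nextEpochCounter: &next}, {}} {
		retracted, err := tr.observe(s)
		fmt.Printf("snapshot %d: retracted=%t err=%v\n", i, retracted, err)
	}
}
```

The point of the sketch is that the "not set up" error can no longer be treated as "not set up yet": after EFM is triggered it can also mean "was set up, but has been dropped".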

My gut feeling is similar to Jordan's:

I think we need an explicit representation for "currently in EFM mode"

Our APIs were written with the mindset that data, once queryable, is essentially never changed. My feeling is that we should try to change our APIs such that information about the upcoming epoch that has not been committed to by the protocol is not queryable.

I think we have a few cases where we need this data (like starting the DKG or collector root block voting), but they are very specialized. Instead of covering these use cases with the same API as all the other components that only care about committed epochs, I feel that either a specialized API or just listening to the events would be sufficient for the protocol components that need uncommitted epoch information.

+1 on this being work for separate PRs.

One possible candidate is this issue: #5723

Added a note to Open Questions to come back to this later.

jordanschalm · 2024-04-25T15:53:41Z

state/protocol/protocol_state/epochs/fallback_statemachine.go

+		state.InvalidEpochTransitionAttempted = true
+	}
+
+	if !nextEpochCommitted && view+params.EpochCommitSafetyThreshold() >= parentState.CurrentEpochFinalView() {


To me, the fact that we've needed to push all this state transition logic into the constructor indicates that the dual state machine design may not fit well with the EFM recovery changes.

This new logic is applying the state transition associated with incorporating the candidate block (essentially EvolveState) but at an earlier, unintuitive point in the codepath (constructor of a sub-state-machine). I think this logic would be better situated within the EvolveState call.

Our interface documentation also implies that state changes are exclusively handled in EvolveState:

flow-go/state/protocol/protocol_state/kvstore.go

Lines 122 to 130 in 3ed352f

// EvolveState applies the state change(s) on sub-state P for the candidate block (under construction).
// Information that potentially changes the Epoch state (compared to the parent block's state):
// - Service Events sealed in the candidate block
// - the candidate block's view (already provided at construction time)
//
// CAUTION: EvolveState MUST be called for all candidate blocks, even if `sealedServiceEvents` is empty!
// This is because also the absence of expected service events by a certain view can also result in the
// Epoch state changing. (For example, not having received the EpochCommit event for the next epoch, but
// approaching the end of the current epoch.)

Suggestion: I think it's worthwhile to maintain the division of responsibility where:

the constructor creates an otherwise unmodified copy of the parent state machine

EvolveState applies all state changes associated with the candidate block

One way to accomplish this is to consolidate the happy-path and fallback state machines. I'm leaning toward that personally, but I'm happy to implement more functionality and see how it shakes out with the separated sub-state-machines.
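
The suggested division of responsibility could look roughly like the following Go sketch. It illustrates the reviewer's proposal only, not what this PR implements; `state`, `stateMachine`, `newStateMachine`, and `evolveState` are hypothetical names, and the extension rule is reduced to a single view check.

```go
package main

import "fmt"

// state is a hypothetical, minimal epoch state used only for this illustration.
type state struct {
	finalView  uint64 // final view of the current epoch, including extensions
	extensions int    // number of extensions added so far
}

// stateMachine holds the untouched parent state and a working copy for the
// candidate block at `view`.
type stateMachine struct {
	parent state
	state  state
	view   uint64
}

// newStateMachine only copies the parent state; it performs no state transition.
func newStateMachine(parent state, view uint64) *stateMachine {
	return &stateMachine{parent: parent, state: parent, view: view}
}

// evolveState applies every change associated with the candidate block,
// including changes triggered purely by the candidate block's view.
func (m *stateMachine) evolveState(safetyThreshold, extensionLen uint64) {
	if m.view+safetyThreshold >= m.state.finalView {
		m.state.finalView += extensionLen
		m.state.extensions++
	}
}

func main() {
	parent := state{finalView: 1000}
	m := newStateMachine(parent, 950)
	fmt.Printf("after construction: %+v\n", m.state)
	m.evolveState(100, 500)
	fmt.Printf("after evolveState:  %+v (parent %+v)\n", m.state, m.parent)
}
```

With this split, a caller that constructs the state machine but never calls `evolveState` observes a state identical to the parent's, which is the property the interface documentation quoted above implies.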

🤔 I have a slightly different view on this. While I agree with Jordan's argument for OrthogonalStoreStateMachine, the FallbackStateMachine is one level lower. It does not have an EvolveState method but ingests the events individually via dedicated methods. Therefore, I see no significant design inconsistency. It's more of an aesthetic consideration in my mind. Overall, I think it is acceptable for package-internal sub-components to vary the API according to their needs.

Furthermore, we separated the happy path from the fallback path specifically to keep the intellectual complexity manageable and explicitly express the concept of "throw all progress away from the happy path and re-do the evolution using the fallback logic". If we now merge the happy path and the fallback path, I expect this problem to come back. Honestly, I consider code that is hard to reason about way more problematic than a package-internal API convention.

Nevertheless, I don't want to entirely preclude the possibility of merging happy path and fallback again. But I think to make an educated decision, we need the entire logic fully implemented to better understand how similar / different the two paths are.

It does not have an EvolveState method but ingests the events individually via dedicated methods.

The FallbackStateMachine itself does not have an EvolveState method, but the higher-level EpochStateMachine does, and it uses the FallbackStateMachine. The problem is precisely that we do not have dedicated methods for EFM related state changes, so they go in the constructor.

My point is, the interface suggests that state changes should happen in EvolveState. But if I instantiate an instance of EpochStateMachine, it might evolve its internal state before I ever call EvolveState.

The problem is precisely that we do not have dedicated methods for EFM related state changes, so they go in the constructor.
My point is, the interface suggests that state changes should happen in EvolveState. But if I instantiate an instance of EpochStateMachine, it might evolve its internal state before I ever call EvolveState.

I don't quite agree with this mindset; I will try to explain why. The idea is that when we create a fallback state machine, it means one of two things:

we are entering EFM

we are already in EFM

In either case we can conclude that EFM is taking place. The constructor of FallbackStateMachine takes responsibility for creating a valid state machine instance, ensuring that the updated state satisfies its invariant right after construction, without the need to call extra methods.

Right now this works quite well with how EFM is activated, but I am open to changing the whole architecture and possibly merging everything together once we see how the current design looks after the missing pieces are implemented.

We still disagree, but let's pick this discussion up again later, if needed, once the state machine changes are more fleshed out. Added a note to Open Questions.

AlexHentschel

Very clean code and amazing tests. I would tweak the tests a bit such that we test:

Epoch extensions are added exactly when reaching equality. We want to test that we don't accidentally have an "off-by-one" error in the implementation, which is a relatively common bug.
While my comments don't reflect this, I would still suggest also testing that exceeding the threshold (the parent block's view is strictly smaller than the threshold and the candidate block's view is strictly greater) yields the desired result.

In my comments, I have suggested code that exactly reaches the threshold. Just adding a second State Machine call for a view one larger will hopefully not make the test hugely more complex 😅 🤞.
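
The boundary cases requested here can be written as a small table. The Go sketch below is an illustration only: `shouldExtend` is a hypothetical stand-in for the condition quoted in the PR description (v + commitSafetyThreshold >= FinalView), `boundaryCases` is a made-up helper, and the sketch assumes the final view is larger than the safety threshold so the subtraction does not underflow.

```go
package main

import "fmt"

// shouldExtend is a hypothetical stand-in for the extension condition from the
// PR description: view + safetyThreshold >= finalView.
func shouldExtend(view, safetyThreshold, finalView uint64) bool {
	return view+safetyThreshold >= finalView
}

// boundaryCase is one row of the table: a candidate view and the expected result.
type boundaryCase struct {
	name string
	view uint64
	want bool
}

// boundaryCases returns the three views around the threshold that guard
// against an off-by-one error. It assumes finalView > safetyThreshold.
func boundaryCases(finalView, safetyThreshold uint64) []boundaryCase {
	threshold := finalView - safetyThreshold
	return []boundaryCase{
		{name: "one view before threshold", view: threshold - 1, want: false},
		{name: "exactly at threshold", view: threshold, want: true},
		{name: "one view past threshold", view: threshold + 1, want: true},
	}
}

func main() {
	const finalView, safetyThreshold = 1000, 100
	for _, c := range boundaryCases(finalView, safetyThreshold) {
		got := shouldExtend(c.view, safetyThreshold, finalView)
		fmt.Printf("%-26s view=%d got=%t want=%t\n", c.name, c.view, got, c.want)
	}
}
```

A strict `>` in place of `>=` would pass the first and third rows but fail the middle one, which is why the equality case is worth its own assertion.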

AlexHentschel · 2024-04-30T02:49:40Z

state/protocol/protocol_state/epochs/fallback_statemachine_test.go

+	// finalBlockView is the cumulative number of views that will be produced in the current epoch and its extensions
+	finalBlockView := DefaultEpochExtensionLength +
+		(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) + 1
+	candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1
+	for i := uint64(0); i < finalBlockView; i++ {


two concerns:

I think it is insufficient for the test to run through all views until the end of the first extension and only then verify that another extension was added. We want to verify that the extension is added at exactly the view we expect it to be added.

I find this logic complicated ... more complicated than it needs to be in my opinion. The aspect that is confusing me is that we first calculate the number of views (finalBlockView) we need to process, relative to the starting view of the epoch. We know the starting and end value for the candidate view that we want to iterate over. Using the absolute values would be much clearer at least for me.

Suggested change

-	// finalBlockView is the cumulative number of views that will be produced in the current epoch and its extensions
-	finalBlockView := DefaultEpochExtensionLength +
-		(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) + 1
-	candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1
-	for i := uint64(0); i < finalBlockView; i++ {
+	// In the previous test `TestNewEpochFallbackStateMachine`, we verified that the first extension is added correctly. Below we
+	// test proper addition of the subsequent extension. A new extension should be added when we reach `firstExtensionViewThreshold`.
+	// When reaching (equality) this threshold, the next extension should be added
+	firstExtensionViewThreshold := parentState.CurrentEpochSetup.FinalView + DefaultEpochExtensionLength - s.params.EpochCommitSafetyThreshold()
+	// We progress through views that are strictly smaller than threshold. Up to this point, only the initial extension should exist
+	for candidateView := parentState.CurrentEpochSetup.FirstView + 1; candidateView < firstExtensionViewThreshold; candidateView++ {
+	   ⋮

I have updated the test setup to reflect our changes. It's not exactly your suggestion, but pretty close. Let me know if you want to do another iteration. 4e20394 (#5773)

AlexHentschel · 2024-04-30T02:55:07Z

state/protocol/protocol_state/epochs/fallback_statemachine_test.go

+	finalBlockView := DefaultEpochExtensionLength +
+		(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) +
+		(parentProtocolState.NextEpochSetup.FinalView - parentProtocolState.NextEpochSetup.FirstView) +
+		1
+	candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1
+	for i := uint64(0); i < finalBlockView; i++ {


Similar to my previous comments, I would suggest that we

work with absolute view numbers here to simplify the test

test that epoch extension is added exactly at the correct view

Suggested change

-	finalBlockView := DefaultEpochExtensionLength +
-		(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) +
-		(parentProtocolState.NextEpochSetup.FinalView - parentProtocolState.NextEpochSetup.FirstView) +
-		1
-	candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1
-	for i := uint64(0); i < finalBlockView; i++ {
+	// View threshold _before_ the end of the initial extension. When reaching (equality) this threshold, the next extension should be added
+	firstExtensionViewThreshold := parentProtocolState.NextEpochSetup.FinalView - s.params.EpochCommitSafetyThreshold()
+	// We progress through views that are strictly smaller than threshold. Up to this point, only the initial extension should exist
+	for candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1; candidateView < firstExtensionViewThreshold; candidateView++ {

4e20394 (#5773)

codecov-commenter · 2024-05-08T12:27:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (feature/efm-recovery@83ef8ad).

Additional details and impacted files

@@                   Coverage Diff                   @@
##             feature/efm-recovery    #5773   +/-   ##
=======================================================
  Coverage                        ?   63.06%           
=======================================================
  Files                           ?       77           
  Lines                           ?     6297           
  Branches                        ?        0           
=======================================================
  Hits                            ?     3971           
  Misses                          ?     2127           
  Partials                        ?      199

Flag	Coverage Δ
unittests	`63.06% <ø> (?)`

Flags with carried forward coverage won't be shown.

jordanschalm

🎸

durkmurder added 8 commits April 19, 2024 20:54

Added EpochExtension to ProtocolStateEntry. Added basic logic for adding extensions when reaching threshold

5eba414

Updated entering epoch fallback mode

6d86d0f

Updated logic for adding epoch extensions in EFM state machine

063107b

Added extra tests. Updated tests setup.

b940d5b

Added new test for committed epoch. Updated godoc

df60a83

Updated how epochs are extended to make sure that we are adding valid extensions

a98709e

Linted

24ebf16

Updated broken test

d2accb8

durkmurder requested review from AlexHentschel and jordanschalm April 24, 2024 21:27

durkmurder assigned jordanschalm and AlexHentschel Apr 24, 2024

durkmurder requested a review from kc1116 April 24, 2024 21:27

durkmurder assigned kc1116 Apr 24, 2024

durkmurder marked this pull request as ready for review April 24, 2024 21:28

jordanschalm reviewed Apr 25, 2024

franklywatson added this to the EFM-Q2 EFM Core updates milestone Apr 26, 2024

AlexHentschel reviewed Apr 29, 2024

added consistency checks to constructor for RichProtocolStateEntry, which enforce that setup / commit events are nil in case no respective epoch is specified

72e960b

AlexHentschel mentioned this pull request Apr 30, 2024

consistency checks for RichProtocolStateEntry constructor #5812

Merged

AlexHentschel approved these changes Apr 30, 2024

franklywatson removed this from the EFM-Q2 EFM Core updates milestone May 2, 2024

durkmurder and others added 4 commits May 8, 2024 14:30

Apply suggestions from code review

d4eb183

Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org> Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>

Apply suggestions from code review

0a995de

Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>

Linted

a051fc0

Apply suggestions from PR review

22bc8d7

durkmurder added 3 commits May 8, 2024 15:28

Apply suggestions from PR review

fc6e614

Updated godoc in last commit

ef3c72f

Updated how current epoch is extended. Moved logic into fallback state machine

51cf99d

Updated tests to check for expected values

17cac0f

jordanschalm approved these changes May 8, 2024

durkmurder added 3 commits May 9, 2024 11:53

Updated tests to check expected protocol state

814f4c1

Merge pull request #5812 from onflow/alex/5724-epoch-extension-injection_-_suggestions

004a713

consistency checks for `RichProtocolStateEntry` constructor

Updated tests to check that transitions happen at exact views

4e20394

durkmurder merged commit 2ab36aa into feature/efm-recovery May 9, 2024
55 checks passed

durkmurder deleted the yurii/5724-epoch-extension-injection branch May 9, 2024 13:02

This was referenced May 10, 2024

[EFM] Dynamic Protocol State injects EpochExtension when entering EFM #5724

Closed

[EFM] Dynamic Protocol State maintains EFM by injecting EpochExtensions #5726

Closed

[EFM] Add EpochExtension to Epoch Data Model #5717

Closed

This was linked to issues May 10, 2024

[EFM] Add EpochExtension to Epoch Data Model #5717

Closed

[EFM] Dynamic Protocol State injects EpochExtension when entering EFM #5724

Closed
