Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EFM Recovery] Dynamic Protocol State injects EpochExtensions #5773

Merged
merged 20 commits into from
May 9, 2024

Conversation

durkmurder
Copy link
Member

@durkmurder durkmurder commented Apr 24, 2024

#5717
#5724
#5726

Context

This PR adds data types and logic for extending current epoch using approach based around EpochExtensions.

It is implemented by extending EpochStateContainer in Dynamic Protocol State to support an extra field EpochExtensions.
The epochs.FallbackStateMachine implements logic to add epoch extensions when needed.

There are several rules when and how epoch is extended:

  1. If we are entering EFM we are checking if the next epoch has been committed, if not then we drop whatever intermediate state that was previously setup for the next epoch.
  2. If next epoch has been committed then no epoch extensions are added to the current epoch since we want to transition to the next epoch and only then add extensions.
  3. If next epoch has not been committed and we have reached view v such as: v + commitSafetyThreshold >= currentEpoch.FinalView we add an epoch extension to the current epoch.

When following these rules one should never get in a state where there is an epoch extension and a next epoch, epochs extensions are only added to current epoch when there is no next epoch setup(or not anymore).

❗ Note that there is no notion of transitioning between epochs when we have entered EFM, extensions are simply added to the state, but no actual transitions are happening.

Comment on lines +35 to +37
if !nextEpochCommitted {
state.NextEpoch = nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I think we need to spend more time on how this case is handled for downstream components. Not necessarily in this PR.

The current model is that once you enter EpochPhaseSetup you can access all the EpochSetup data forever, and once you enter EpochPhaseCommitted you can access all the EpochCommit data forever. (The second part is still true, but the first part isn't true any more with EFM Recovery.) The phase state machine is strictly:

|-> STAKING -> SETUP -> COMMITTED -|
|----------------------------------|

As it stands, I can call state.AtHeight(N).Epochs().Next()... and get a return value, then later call state.AtHeight(N+1).Epochs().Next()... and get ErrNextEpochNotSetup, which is a potentially breaking change.

Similarly I can observe counter=N,phase=EpochPhaseSetup in block K, then observe counter=N,phase=EpochPhaseStaking in block K+1 which is currently impossible.

Some thoughts:

  • Minimally, we should document this change in the relevant APIs (protocol.Epoch). Downstream components need to be aware of the newly possible "EpochSetup retraction" case and handle it properly.
  • We could also add a new epoch phase to capture the EFM state (see this Open Question). My gut feeling is this approach would result in the most comprehensible data model, but will be a substantial additional chunk of work.

Copy link
Member

@AlexHentschel AlexHentschel Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling is similar to Jordan's:

  • I think we need an explicit representation for "currently in EFM mode"

  • Our APIs were written with the mind-set that data once queryable is essentially never changed. My feeling is that we should try to change our APIs such that information about the upcoming epoch that has not been committed to by the protocol is not queryable.

    While I think we have a few cases, where we need this data (like starting the DKG, collector root block voting), but they are very very specialized. Instead of covering these use-cases with the same API as all the other stuff caring only about committed epochs, I feel either a specialized API or just listening to the events would be sufficient for the protocol components that need uncommitted epoch information.

+1 on this being work for separate PRs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible candidate is this issue: #5723

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note to Open Questions to come back to this later.

state.InvalidEpochTransitionAttempted = true
}

if !nextEpochCommitted && view+params.EpochCommitSafetyThreshold() >= parentState.CurrentEpochFinalView() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, the fact that we've need to push all this state transition logic into the constructor indicates that the dual state machine design may not fit well with the EFM recovery changes.

This new logic is applying the state transition associated with incorporating the candidate block (essentially EvolveState) but at an earlier, unintuitive point in the codepath (constructor of a sub-state-machine). I think this logic would be better situated within the EvolveState call.

Our interface documentation also implies that state changes are exclusively handled in EvolveState:

// EvolveState applies the state change(s) on sub-state P for the candidate block (under construction).
// Information that potentially changes the Epoch state (compared to the parent block's state):
// - Service Events sealed in the candidate block
// - the candidate block's view (already provided at construction time)
//
// CAUTION: EvolveState MUST be called for all candidate blocks, even if `sealedServiceEvents` is empty!
// This is because also the absence of expected service events by a certain view can also result in the
// Epoch state changing. (For example, not having received the EpochCommit event for the next epoch, but
// approaching the end of the current epoch.)

Suggestion: I think it's worthwhile to maintain the division of responsibility where:

  • the constructor creates an otherwise unmodified copy of the parent state machine
  • EvolveState applies all state changes associated with the candidate block

One way to accomplish this is to consolidate the happy-path and fallback state machines. I'm leaning toward that personally, but I'm happy to implement more functionality and see how it shakes out with the separated sub-state-machines.

Copy link
Member

@AlexHentschel AlexHentschel Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I have a slightly different view on this. While I agree with Jordan's argument for OrthogonalStoreStateMachine, the FallbackStateMachine is one level lower. It does not have an EvolveState method but ingests the events individually via dedicated methods. Therefore, I see no significant design inconsistency. Its more an aesthetic consideration in my mind. Overall, I think it acceptable for package internal sub-components to vary the API according to their needs. '

Furthermore, we separated the happy path from the fallback path specifically to keep the intellectual complexity manageable and explicitly express the concept of "throw all progress away from the happy path and re-do the evolution using the fallback logic". If we now merge the happy path and the fallback path, I expect this problem to come back. Honestly, I consider code that is hard to reason about way more problematic than a package-internal API convention.

Nevertheless, I don't want to entirely preclude the possibility of merging happy path and fallback again. But I think to make an educated decision, we need the entire logic fully implemented to better understand how similar / different the two paths are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not have an EvolveState method but ingests the events individually via dedicated methods.

The FallbackStateMachine itself does not have an EvolveState method, but the higher-level EpochStateMachine does, and it uses the FallbackStateMachine. The problem is precisely that we do not have dedicated methods for EFM related state changes, so they go in the constructor.

My point is, the interface suggests that state changes should happen in EvolveState. But if I instantiate an instance of EpochStateMachine, it might evolve its internal state before I ever call EvolveState.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is precisely that we do not have dedicated methods for EFM related state changes, so they go in the constructor.
My point is, the interface suggests that state changes should happen in EvolveState. But if I instantiate an instance of EpochStateMachine, it might evolve its internal state before I ever call EvolveState.

I don't quite agree with this mindset, I will try to explain why. The idea is that when we create a fallback state machine it means one of two things:

  1. we are entering EFM
  2. we are already in EFM

In either of those cases we can conclude that EFM is taking place. Constructor of FallbackStateMachine takes responsibility of creating a valid instance of state machine that ensures that updated state holds its invariant right after the construction. Without a need to call extra methods.

Right now this works quite well with how EFM is activated but I am open to making changes to the whole architecture and possibly merging everything together once we see how current design looks after finishing implementation of missing pieces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still disagree, but let's pick this discussion up again later, if needed, once the state machine changes are more fleshed out. Added a note to Open Questions.

@franklywatson franklywatson added this to the EFM-Q2 EFM Core updates milestone Apr 26, 2024
model/flow/protocol_state.go Outdated Show resolved Hide resolved
model/flow/protocol_state.go Outdated Show resolved Hide resolved
… which enforce that setup / commit events are nil in case no respective epoch is specified
Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean code and amazing tests. I would tweak the tests a bit such that we test:

  • Epoch extensions are added exactly when reaching equality. We want to test that we don't accidentally have a "off-by-one" error in the implementation, which is a relatively common bug.

  • While my comments don't reflect this, I would still suggest to also test that exceeding the threshold (parent block's view is strictly smaller than threshold and candidate block's view strictly greater) yields the desired result.

    In my comments, I have suggested code that exactly reaches the threshold. Just adding a second State Machine call for a view one larger is hopefully not making the test hugely more complex 😅 🤞.

model/flow/protocol_state.go Outdated Show resolved Hide resolved
model/flow/protocol_state.go Outdated Show resolved Hide resolved
model/flow/protocol_state.go Outdated Show resolved Hide resolved
state/protocol/badger/mutator_test.go Outdated Show resolved Hide resolved
Comment on lines 164 to 168
// finalBlockView is the cumulative number of views that will be produced in the current epoch and its extensions
finalBlockView := DefaultEpochExtensionLength +
(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) + 1
candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1
for i := uint64(0); i < finalBlockView; i++ {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two concerns:

  • I think it is insufficient for the test to run through all views until the end of the first extension and only then verify that another extension was added. We want to verify that the extension is added at exactly the view we expect it to be added.
  • I find this logic complicated ... more complicated than it needs to be in my opinion. The aspect that is confusing me is that we first calculate the number of views (finalBlockView) we need to process, relative to the starting view of the epoch. We know the starting and end value for the candidate view that we want to iterate over. Using the absolute values would be much clearer at least for me.
Suggested change
// finalBlockView is the cumulative number of views that will be produced in the current epoch and its extensions
finalBlockView := DefaultEpochExtensionLength +
(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) + 1
candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1
for i := uint64(0); i < finalBlockView; i++ {
// In the previous test `TestNewEpochFallbackStateMachine`, we verified that the first extension is added correctly. Below we
// test proper addition of the subsequent extension. A new extension should be added when we reach `firstExtensionViewThreshold`.
// When reaching (equality) this threshold, the next extension should be added
firstExtensionViewThreshold := parentState.CurrentEpochSetup.FinalView + DefaultEpochExtensionLength - s.params.EpochCommitSafetyThreshold()
// We progress through views that are strictly smaller than threshold. Up to this point, only the initial extension should exist
for candidateView := parentState.CurrentEpochSetup.FirstView + 1; candidateView < firstExtensionViewThreshold; candidateView++ {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated test setup with respect to our changes, it's not exactly as your suggestion but pretty close. Let me know if you want to make another iteration. 4e20394 (#5773)

Comment on lines 218 to 223
finalBlockView := DefaultEpochExtensionLength +
(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) +
(parentProtocolState.NextEpochSetup.FinalView - parentProtocolState.NextEpochSetup.FirstView) +
1
candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1
for i := uint64(0); i < finalBlockView; i++ {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly to my previous comments, I would suggest to

  1. work with absolute view numbers here to simplify the test
  2. test that epoch extension is added exactly at the correct view
Suggested change
finalBlockView := DefaultEpochExtensionLength +
(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) +
(parentProtocolState.NextEpochSetup.FinalView - parentProtocolState.NextEpochSetup.FirstView) +
1
candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1
for i := uint64(0); i < finalBlockView; i++ {
// View threshold _before_ the end of the initial extension. When reaching (equality) this threshold, the next extension should be added
firstExtensionViewThreshold := parentProtocolState.NextEpochSetup.FinalView - s.params.EpochCommitSafetyThreshold()
// We progress through views that are strictly smaller than threshold. Up to this point, only the initial extension should exist
for candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1; candidateView < firstExtensionViewThreshold; candidateView++ {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@franklywatson franklywatson removed this from the EFM-Q2 EFM Core updates milestone May 2, 2024
durkmurder and others added 4 commits May 8, 2024 14:30
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (feature/efm-recovery@83ef8ad). Click here to learn what that means.

Additional details and impacted files
@@                   Coverage Diff                   @@
##             feature/efm-recovery    #5773   +/-   ##
=======================================================
  Coverage                        ?   63.06%           
=======================================================
  Files                           ?       77           
  Lines                           ?     6297           
  Branches                        ?        0           
=======================================================
  Hits                            ?     3971           
  Misses                          ?     2127           
  Partials                        ?      199           
Flag Coverage Δ
unittests 63.06% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎸

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants