-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EFM Recovery] Dynamic Protocol State injects EpochExtension
s
#5773
[EFM Recovery] Dynamic Protocol State injects EpochExtension
s
#5773
Conversation
…ing extensions when reaching threshold
if !nextEpochCommitted { | ||
state.NextEpoch = nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I think we need to spend more time on how this case is handled for downstream components. Not necessarily in this PR.
The current model is that once you enter EpochPhaseSetup
you can access all the EpochSetup
data forever, and once you enter EpochPhaseCommitted
you can access all the EpochCommit
data forever. (The second part is still true, but the first part isn't true any more with EFM Recovery.) The phase state machine is strictly:
|-> STAKING -> SETUP -> COMMITTED -|
|----------------------------------|
As it stands, I can call state.AtHeight(N).Epochs().Next()...
and get a return value, then later call state.AtHeight(N+1).Epochs().Next()...
and get ErrNextEpochNotSetup
, which is a potentially breaking change.
Similarly I can observe counter=N,phase=EpochPhaseSetup
in block K, then observe counter=N,phase=EpochPhaseStaking
in block K+1 which is currently impossible.
Some thoughts:
- Minimally, we should document this change in the relevant APIs (
protocol.Epoch
). Downstream components need to be aware of the newly possible "EpochSetup
retraction" case and handle it properly. - We could also add a new epoch phase to capture the EFM state (see this Open Question). My gut feeling is this approach would result in the most comprehensible data model, but will be a substantial additional chunk of work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut feeling is similar to Jordan's:
-
I think we need an explicit representation for "currently in EFM mode"
-
Our APIs were written with the mind-set that data once queryable is essentially never changed. My feeling is that we should try to change our APIs such that information about the upcoming epoch that has not been committed to by the protocol is not queryable.
While I think we have a few cases, where we need this data (like starting the DKG, collector root block voting), but they are very very specialized. Instead of covering these use-cases with the same API as all the other stuff caring only about committed epochs, I feel either a specialized API or just listening to the events would be sufficient for the protocol components that need uncommitted epoch information.
+1 on this being work for separate PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible candidate is this issue: #5723
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note to Open Questions to come back to this later.
state.InvalidEpochTransitionAttempted = true | ||
} | ||
|
||
if !nextEpochCommitted && view+params.EpochCommitSafetyThreshold() >= parentState.CurrentEpochFinalView() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, the fact that we've need to push all this state transition logic into the constructor indicates that the dual state machine design may not fit well with the EFM recovery changes.
This new logic is applying the state transition associated with incorporating the candidate block (essentially EvolveState
) but at an earlier, unintuitive point in the codepath (constructor of a sub-state-machine). I think this logic would be better situated within the EvolveState
call.
Our interface documentation also implies that state changes are exclusively handled in EvolveState
:
flow-go/state/protocol/protocol_state/kvstore.go
Lines 122 to 130 in 3ed352f
// EvolveState applies the state change(s) on sub-state P for the candidate block (under construction). | |
// Information that potentially changes the Epoch state (compared to the parent block's state): | |
// - Service Events sealed in the candidate block | |
// - the candidate block's view (already provided at construction time) | |
// | |
// CAUTION: EvolveState MUST be called for all candidate blocks, even if `sealedServiceEvents` is empty! | |
// This is because also the absence of expected service events by a certain view can also result in the | |
// Epoch state changing. (For example, not having received the EpochCommit event for the next epoch, but | |
// approaching the end of the current epoch.) |
Suggestion: I think it's worthwhile to maintain the division of responsibility where:
- the constructor creates an otherwise unmodified copy of the parent state machine
EvolveState
applies all state changes associated with the candidate block
One way to accomplish this is to consolidate the happy-path and fallback state machines. I'm leaning toward that personally, but I'm happy to implement more functionality and see how it shakes out with the separated sub-state-machines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I have a slightly different view on this. While I agree with Jordan's argument for OrthogonalStoreStateMachine
, the FallbackStateMachine
is one level lower. It does not have an EvolveState
method but ingests the events individually via dedicated methods. Therefore, I see no significant design inconsistency. Its more an aesthetic consideration in my mind. Overall, I think it acceptable for package internal sub-components to vary the API according to their needs. '
Furthermore, we separated the happy path from the fallback path specifically to keep the intellectual complexity manageable and explicitly express the concept of "throw all progress away from the happy path and re-do the evolution using the fallback logic". If we now merge the happy path and the fallback path, I expect this problem to come back. Honestly, I consider code that is hard to reason about way more problematic than a package-internal API convention.
Nevertheless, I don't want to entirely preclude the possibility of merging happy path and fallback again. But I think to make an educated decision, we need the entire logic fully implemented to better understand how similar / different the two paths are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not have an EvolveState method but ingests the events individually via dedicated methods.
The FallbackStateMachine
itself does not have an EvolveState
method, but the higher-level EpochStateMachine
does, and it uses the FallbackStateMachine
. The problem is precisely that we do not have dedicated methods for EFM related state changes, so they go in the constructor.
My point is, the interface suggests that state changes should happen in EvolveState
. But if I instantiate an instance of EpochStateMachine
, it might evolve its internal state before I ever call EvolveState
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is precisely that we do not have dedicated methods for EFM related state changes, so they go in the constructor.
My point is, the interface suggests that state changes should happen in EvolveState. But if I instantiate an instance of EpochStateMachine, it might evolve its internal state before I ever call EvolveState.
I don't quite agree with this mindset, I will try to explain why. The idea is that when we create a fallback state machine it means one of two things:
- we are entering EFM
- we are already in EFM
In either of those cases we can conclude that EFM is taking place. Constructor of FallbackStateMachine
takes responsibility of creating a valid instance of state machine that ensures that updated state holds its invariant right after the construction. Without a need to call extra methods.
Right now this works quite well with how EFM is activated but I am open to making changes to the whole architecture and possibly merging everything together once we see how current design looks after finishing implementation of missing pieces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still disagree, but let's pick this discussion up again later, if needed, once the state machine changes are more fleshed out. Added a note to Open Questions.
… which enforce that setup / commit events are nil in case no respective epoch is specified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clean code and amazing tests. I would tweak the tests a bit such that we test:
-
Epoch extensions are added exactly when reaching equality. We want to test that we don't accidentally have a "off-by-one" error in the implementation, which is a relatively common bug.
-
While my comments don't reflect this, I would still suggest to also test that exceeding the threshold (parent block's view is strictly smaller than threshold and candidate block's view strictly greater) yields the desired result.
In my comments, I have suggested code that exactly reaches the threshold. Just adding a second State Machine call for a view one larger is hopefully not making the test hugely more complex 😅 🤞.
state/protocol/protocol_state/epochs/fallback_statemachine_test.go
Outdated
Show resolved
Hide resolved
state/protocol/protocol_state/epochs/fallback_statemachine_test.go
Outdated
Show resolved
Hide resolved
// finalBlockView is the cumulative number of views that will be produced in the current epoch and its extensions | ||
finalBlockView := DefaultEpochExtensionLength + | ||
(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) + 1 | ||
candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1 | ||
for i := uint64(0); i < finalBlockView; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two concerns:
- I think it is insufficient for the test to run through all views until the end of the first extension and only then verify that another extension was added. We want to verify that the extension is added at exactly the view we expect it to be added.
- I find this logic complicated ... more complicated than it needs to be in my opinion. The aspect that is confusing me is that we first calculate the number of views (
finalBlockView
) we need to process, relative to the starting view of the epoch. We know the starting and end value for the candidate view that we want to iterate over. Using the absolute values would be much clearer at least for me.
// finalBlockView is the cumulative number of views that will be produced in the current epoch and its extensions | |
finalBlockView := DefaultEpochExtensionLength + | |
(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) + 1 | |
candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1 | |
for i := uint64(0); i < finalBlockView; i++ { | |
// In the previous test `TestNewEpochFallbackStateMachine`, we verified that the first extension is added correctly. Below we | |
// test proper addition of the subsequent extension. A new extension should be added when we reach `firstExtensionViewThreshold`. | |
// When reaching (equality) this threshold, the next extension should be added | |
firstExtensionViewThreshold := parentState.CurrentEpochSetup.FinalView + DefaultEpochExtensionLength - s.params.EpochCommitSafetyThreshold() | |
// We progress through views that are strictly smaller than threshold. Up to this point, only the initial extension should exist | |
for candidateView := parentState.CurrentEpochSetup.FirstView + 1; candidateView < firstExtensionViewThreshold; candidateView++ { | |
⋮ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated test setup with respect to our changes, it's not exactly as your suggestion but pretty close. Let me know if you want to make another iteration. 4e20394
(#5773)
state/protocol/protocol_state/epochs/fallback_statemachine_test.go
Outdated
Show resolved
Hide resolved
finalBlockView := DefaultEpochExtensionLength + | ||
(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) + | ||
(parentProtocolState.NextEpochSetup.FinalView - parentProtocolState.NextEpochSetup.FirstView) + | ||
1 | ||
candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1 | ||
for i := uint64(0); i < finalBlockView; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly to my previous comments, I would suggest to
- work with absolute view numbers here to simplify the test
- test that epoch extension is added exactly at the correct view
finalBlockView := DefaultEpochExtensionLength + | |
(parentProtocolState.CurrentEpochSetup.FinalView - parentProtocolState.CurrentEpochSetup.FirstView) + | |
(parentProtocolState.NextEpochSetup.FinalView - parentProtocolState.NextEpochSetup.FirstView) + | |
1 | |
candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1 | |
for i := uint64(0); i < finalBlockView; i++ { | |
// View threshold _before_ the end of the initial extension. When reaching (equality) this threshold, the next extension should be added | |
firstExtensionViewThreshold := parentProtocolState.NextEpochSetup.FinalView - s.params.EpochCommitSafetyThreshold() | |
// We progress through views that are strictly smaller than threshold. Up to this point, only the initial extension should exist | |
for candidateView := parentProtocolState.CurrentEpochSetup.FirstView + 1; candidateView < firstExtensionViewThreshold; candidateView++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org> Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## feature/efm-recovery #5773 +/- ##
=======================================================
Coverage ? 63.06%
=======================================================
Files ? 77
Lines ? 6297
Branches ? 0
=======================================================
Hits ? 3971
Misses ? 2127
Partials ? 199
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎸
…ion_-_suggestions consistency checks for `RichProtocolStateEntry` constructor
#5717
#5724
#5726
Context
This PR adds data types and logic for extending current epoch using approach based around
EpochExtension
s.It is implemented by extending
EpochStateContainer
in Dynamic Protocol State to support an extra fieldEpochExtensions
.The
epochs.FallbackStateMachine
implements logic to add epoch extensions when needed.There are several rules when and how epoch is extended:
v
such as:v + commitSafetyThreshold >= currentEpoch.FinalView
we add an epoch extension to the current epoch.When following these rules one should never get in a state where there is an epoch extension and a next epoch, epochs extensions are only added to current epoch when there is no next epoch setup(or not anymore).
❗ Note that there is no notion of transitioning between epochs when we have entered EFM, extensions are simply added to the state, but no actual transitions are happening.