[EFM] Invalid Service Events shortly after Epoch Commit #5631

AlexHentschel · 2024-04-05T00:43:07Z

Problem description

Currently, the EpochStateMachine, which orchestrates the Epoch Happy Path and Fallback, has this behaviour:

As of the block that encounters an invalid Epoch ServiceEvent, we engage Epoch Fallback Mode [EFM] and no longer process any Epoch transitions. This creates subtle edge cases for future light clients and could potentially drive consensus into an irreconcilable state (I am not sure about the latter).
- Scenario:
  - Imagine that Epoch N ends at view 1000.
  - The block from view 1001 (the first block of Epoch N+1) seals a result that contains an invalid Epoch Service Event.
- How the current implementation will behave:
  - The leader (let's call her Alice) for the first view (1001) in Epoch N+1 constructs her block, so she executes ProcessUpdate on the Epoch State Machine (including the broken Service Event).
  - First, EpochStateMachine realizes that this is the first block of the epoch, so it performs an epoch transition.
  - However, while processing the service events, EpochStateMachine encounters an InvalidServiceEventError, so it transitions to EFM.
  - Transitioning to EFM means that we discard the interim Epoch state we have so far (including the epoch transition), re-initialize the state with a fresh copy of the parent block's Epoch state, and re-apply all the service events.

In my opinion, the consensus protocol has formally reached an irreconcilable state at this point. I think our current implementation would probably just stop producing blocks. Reasoning:

Note that when initializing the FallbackStateMachine, we do not re-apply the epoch transition.
Going into view 1001, Alice thought that she was the leader, based on the leader assignment for Epoch N+1. However, after running the Epoch State Machine, the Epoch state is still in Epoch N, now in fallback mode. Most likely, Alice is not the leader for view 1001 in EFM of Epoch N.
I don't think our software will handle this edge case. It is certainly a violation of HotStuff's formal safety requirement: once you commit to a leader selection for some range of views (here the views belonging to Epoch N+1), you cannot change it (slightly simplified). Conceptually, we commit to the leader selection once we commit Epoch N+1.
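
The sequence above can be reproduced with a toy model. This is only a sketch: `epochState`, `serviceEvent`, `processUpdate` and `epochLength` are hypothetical names, not the actual flow-go types, and re-applying service events in fallback mode is left out.

```go
package main

import "fmt"

// epochState is a hypothetical, heavily simplified stand-in for the Epoch sub-state.
type epochState struct {
	Counter   uint64 // counter of the epoch this state considers current
	FinalView uint64 // last view belonging to that epoch
	Fallback  bool   // true once Epoch Fallback Mode has been triggered
}

// serviceEvent is a placeholder; only its validity matters in this sketch.
type serviceEvent struct{ Valid bool }

// epochLength is an assumed constant, used only to give the next epoch a final view.
const epochLength = 1000

// processUpdate mimics the behaviour described above: transition first, then
// process service events, and on an invalid event restart from the parent state.
func processUpdate(parent epochState, view uint64, events []serviceEvent) epochState {
	state := parent // interim working copy
	if view > state.FinalView {
		// First block of the next epoch: perform the epoch transition.
		state.Counter++
		state.FinalView += epochLength
	}
	for _, ev := range events {
		if !ev.Valid {
			// Enter EFM: the interim state, including the transition, is discarded.
			// (Re-applying the service events in fallback mode is omitted here.)
			fallback := parent
			fallback.Fallback = true
			return fallback
		}
	}
	return state
}

func main() {
	parent := epochState{Counter: 7, FinalView: 1000} // Epoch N = 7 ends at view 1000
	child := processUpdate(parent, 1001, []serviceEvent{{Valid: false}})
	// Alice proposed for view 1001 as a leader of epoch 8, yet the resulting
	// state still claims epoch 7 (in fallback mode).
	fmt.Printf("epoch=%d fallback=%v\n", child.Counter, child.Fallback)
}
```

In this model the block for view 1001 ends up with an Epoch state that still reports Epoch N, which is the mismatch with Alice's leader assignment described above.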

I think a similar aspect has previously come up for the EFM recovery. Specifically, the EFM recovery cannot change the modus operandi for view ranges that the FallbackStateMachine has already committed.

Suggestion of Problem Solution

Once an Epoch is committed (happy path) to some fork, that Epoch will become active on the specified view -- if this fork is extended beyond the epoch boundary. In other words, the FallbackStateMachine will also enact Epoch transitions that have previously been committed by the happy path protocol.
The aspect where HappyPathStateMachine and FallbackStateMachine differ is the way they add new view ranges beyond the already committed.
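
A minimal sketch of this suggestion, under the assumption that the committed next epoch is stored in the state that both state machines start from. All names (`committedEpoch`, `enactCommitted`, `enterFallback`) are hypothetical and only illustrate the idea:

```go
package main

import "fmt"

// committedEpoch is a hypothetical record of an epoch that the happy path
// has already committed on this fork.
type committedEpoch struct {
	Counter   uint64
	FirstView uint64
	FinalView uint64
}

// epochState is a simplified stand-in for the Epoch sub-state.
type epochState struct {
	Counter   uint64
	FinalView uint64
	Next      *committedEpoch // nil if no next epoch is committed on this fork
	Fallback  bool
}

// enactCommitted applies a previously committed transition once the view
// crosses the epoch boundary. In this sketch, both state machines call it.
func enactCommitted(s epochState, view uint64) epochState {
	if s.Next != nil && view >= s.Next.FirstView {
		s.Counter, s.FinalView = s.Next.Counter, s.Next.FinalView
		s.Next = nil
	}
	return s
}

// enterFallback restarts from the parent state, but still enacts the
// transition that was committed before fallback mode was triggered.
func enterFallback(parent epochState, view uint64) epochState {
	s := enactCommitted(parent, view)
	s.Fallback = true
	return s
}

func main() {
	parent := epochState{
		Counter: 7, FinalView: 1000,
		Next: &committedEpoch{Counter: 8, FirstView: 1001, FinalView: 2000},
	}
	s := enterFallback(parent, 1001)
	fmt.Printf("epoch=%d fallback=%v\n", s.Counter, s.Fallback)
}
```

With this shape, entering EFM in view 1001 leaves the state in Epoch N+1, so the leader selection that was committed for that view range stays in force.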


AlexHentschel · 2024-04-11T00:58:54Z

Let us consider the following suggestion:

Once a leader selection for a view range is committed, it can never be overwritten/changed

A leader selection view range is committed upon finalizing the EpochCommit event (not EpochSetup) on the happy path

A leader selection view range is committed upon entering EFM on the fallback path

and

If an epoch extension is added, it's appended to the last committed leader selection view range.
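
These rules amount to an append-only list of view ranges. A minimal sketch, assuming a hypothetical `leaderSchedule` type (not an existing flow-go structure) that is seeded with one initial range:

```go
package main

import (
	"errors"
	"fmt"
)

// viewRange is an inclusive range of views with one fixed leader selection.
type viewRange struct{ First, Last uint64 }

// leaderSchedule is a hypothetical append-only list of committed view ranges.
type leaderSchedule struct{ committed []viewRange }

var errInvalidRange = errors.New("range must be non-empty and start right after the last committed view")

// newLeaderSchedule seeds the schedule with the range of the current epoch.
func newLeaderSchedule(initial viewRange) *leaderSchedule {
	return &leaderSchedule{committed: []viewRange{initial}}
}

// lastView returns the final view covered by a committed leader selection.
func (s *leaderSchedule) lastView() uint64 {
	return s.committed[len(s.committed)-1].Last
}

// commit appends a new range; already committed ranges are never modified.
func (s *leaderSchedule) commit(r viewRange) error {
	if r.First != s.lastView()+1 || r.Last < r.First {
		return errInvalidRange
	}
	s.committed = append(s.committed, r)
	return nil
}

// extend appends an epoch extension after the last committed range.
// The length is assumed to be at least 1.
func (s *leaderSchedule) extend(length uint64) viewRange {
	ext := viewRange{First: s.lastView() + 1, Last: s.lastView() + length}
	s.committed = append(s.committed, ext)
	return ext
}

func main() {
	s := newLeaderSchedule(viewRange{First: 1, Last: 1000})
	fmt.Println(s.commit(viewRange{First: 1001, Last: 2000})) // accepted
	fmt.Println(s.commit(viewRange{First: 1500, Last: 2500})) // rejected: overlaps a committed range
	fmt.Println(s.extend(100))                                // appended after view 2000
}
```

The only mutation is `append`, so a committed leader selection can never be overwritten, and an extension always starts right after the last committed view.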

Thoughts

There is a subtle detail in the proposed specification that we have to get right in order not to break consensus. To paraphrase, we are suggesting to use finality as the decision criterion for whether the Epoch State Machine accepts a service event that leads to a changed leader selection for a future view range.

General Rule:

Finality cannot be used as an input for evolving the Protocol State. Generally, only information in the fork that is currently being extended can be used to determine the validity of a block. There are exceptions where using finality is safe, but those are generally very edge-casey.

Reasoning:

Finality is a determination that nodes make locally. Very explicitly: nodes may all know the fork A <- B <- C and receive the candidate D such that A <- B <- C <- D, yet they might still have different finality statuses for these blocks. Specifically, this is because nodes may observe alternative forks that are subsequently orphaned and are not observed by other nodes. Nevertheless, observing subsequently orphaned forks can (rarely) progress finality on the main fork.

For example, if I know the fork A <- B <- C <- D, I may conclude that B is finalized. On the one hand, based on my world view, C is still unfinalized. On the other hand, some other node may know additional children of C that finalize C. So if we allow finality to influence which Protocol State transitions in block D are legal or illegal, that other node and I may disagree on whether D is a valid extension of the chain.
Consensus rules guarantee that finality is eventually consistent. In other words, if some node finalizes block B and if the network continues to produce valid blocks, all honest nodes will eventually conclude that B is finalized.
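
The disagreement can be made concrete with a toy check. This is a sketch with hypothetical names (`node`, `validByFinality`, `validByFork`); finality is reduced to a single locally known view per node and does not model the actual HotStuff finalization rule:

```go
package main

import "fmt"

// node is a hypothetical replica; finalizedView reflects only what this node
// has observed locally, including forks that other nodes may never see.
type node struct {
	name          string
	finalizedView uint64
}

// validByFinality makes validity depend on the node's local finality status.
func validByFinality(n node, requiredView uint64) bool {
	return n.finalizedView >= requiredView
}

// validByFork only inspects the fork that the candidate block extends, so
// every node that knows the fork reaches the same answer.
func validByFork(forkViews []uint64, requiredView uint64) bool {
	for _, v := range forkViews {
		if v == requiredView {
			return true
		}
	}
	return false
}

func main() {
	fork := []uint64{1, 2, 3} // views of A <- B <- C, the ancestors of candidate D
	me := node{name: "me", finalizedView: 2}       // I consider B finalized
	other := node{name: "other", finalizedView: 3} // saw extra children of C
	for _, n := range []node{me, other} {
		fmt.Printf("%s: byFinality=%v byFork=%v\n",
			n.name, validByFinality(n, 3), validByFork(fork, 3))
	}
}
```

Both nodes agree under the fork-based rule, while the finality-based rule gives them different verdicts on the same candidate D.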

Suggested change.

Above, I argued that this part of the suggested rule would break consensus:

❌ A leader selection view range is committed upon finalizing the EpochCommit event (not EpochSetup) in the happy path

Let's discuss how we can modify this rule so that it works:

I think as a first step, we need to have a definition of "committing a leader selection view range" for a specific fork. I would suggest:
- On the happy path, a leader selection view range is committed for one specific fork, when an EpochCommit event is included in that fork.
- On the unhappy path, a leader selection view range is committed when the EFM logic reaches the threshold view without a valid EpochCommit or EpochRecovery event.
With this definition, we can have different committed leader selection view ranges in different forks. This is no problem, as long as such view ranges are sufficiently far into the future.
- Though, by the time the consensus committee for the view range takes over, it must have already been finalized which committee is taking over. In other words, finality is not important when the Epoch State Machine writes the leader selection view range into the protocol state. Finality is important when this view range activates. Conceptually, it is the same mechanics as with protocol version upgrades: we need this safety buffer between writing data into the Protocol State and this taking effect in the network.
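
A minimal sketch of this per-fork definition. The names `forkState`, `includeEpochCommit` and the constant `safetyBuffer` are hypothetical, and expressing "sufficiently far into the future" as a fixed number of views is an assumption made only for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// viewRange is an inclusive range of views with one fixed leader selection.
type viewRange struct{ First, Last uint64 }

// forkState is the hypothetical Epoch-related protocol state at one block.
// Every block derives its state from its parent only, never from finality.
type forkState struct {
	view      uint64
	committed []viewRange
}

// safetyBuffer is an assumed number of views between writing a view range
// into the state and the range taking effect.
const safetyBuffer = 100

var errTooSoon = errors.New("view range would activate before the safety buffer has elapsed")

// child derives the state for a block at the given view from its parent,
// copying the ranges so that sibling forks stay independent.
func (p forkState) child(view uint64) forkState {
	ranges := make([]viewRange, len(p.committed))
	copy(ranges, p.committed)
	return forkState{view: view, committed: ranges}
}

// includeEpochCommit commits a leader selection view range for this fork only.
func (s *forkState) includeEpochCommit(r viewRange) error {
	if r.First < s.view+safetyBuffer {
		return errTooSoon
	}
	s.committed = append(s.committed, r)
	return nil
}

func main() {
	parent := forkState{view: 500, committed: []viewRange{{First: 1, Last: 1000}}}

	forkA := parent.child(501)
	fmt.Println(forkA.includeEpochCommit(viewRange{First: 1001, Last: 2000})) // accepted

	forkB := parent.child(502) // competing fork without the EpochCommit
	fmt.Println(len(forkA.committed), len(forkB.committed))

	late := parent.child(950)
	fmt.Println(late.includeEpochCommit(viewRange{First: 1001, Last: 2000})) // too close to activation
}
```

Here two forks can hold different committed view ranges without conflict; the buffer stands in for the requirement that the choice of committee is finalized before the range activates.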

AlexHentschel mentioned this issue Apr 5, 2024

[Dynamic Protocol State] Refactoring to support orthogonal state machine operating on sub-states #5616

Merged

jordanschalm self-assigned this Apr 5, 2024

durkmurder mentioned this issue Apr 17, 2024

[Dynamic Protocol State] Extend tests for EpochStateMachine #5681

Merged

franklywatson added this to the EFM-Q2 milestone Apr 18, 2024

This was referenced Apr 23, 2024

[EFM] Update Consensus Committee EFM processing #5730

Open

[EFM] Modify EFM logic do not enter EFM while in EpochCommitted phase #5731

Open

[EFM] Dynamic Protocol State maintains EFM by injecting EpochExtensions #5726

Closed

kc1116 modified the milestones: EFM-Q2, EFM-Q2 Downstream updates Apr 23, 2024

franklywatson modified the milestones: EFM-Q2 Downstream updates, EFM-Q2 Epoch Extensions updates Apr 23, 2024

franklywatson mentioned this issue Apr 23, 2024

[EPIC | M2] Implement epoch state machine in Dynamic Protocol State #5762

Open

franklywatson removed this from the EFM-Q2 EFM Core updates milestone Apr 23, 2024

franklywatson mentioned this issue Apr 26, 2024

[EPIC] [EFM Recovery - M3] Overview #5103

Open

franklywatson added this to the EFM-Q2 EFM Core updates milestone Apr 26, 2024

durkmurder self-assigned this May 13, 2024
