
[Epoch Sync] EpochInfoAggregator treatment is awkward at the epoch sync boundary #11931

Open
Tracked by #73
robin-near opened this issue Aug 13, 2024 · 0 comments

The EpochInfoAggregator is a cache of the aggregated info of an epoch. At the beginning of an epoch, the EpochInfoAggregator is empty; each block of the epoch then adds some information to it. That way, we accumulate very important info such as all the proposals submitted during the epoch.

The EpochManager maintains the EpochInfoAggregator as both an in-memory aggregator and an on-disk aggregator. In both cases, the aggregator refers to a block hash H, and its data represents fold_left(aggregator_append, empty aggregator state, all blocks from the first block of H's epoch to H inclusive). To compute the EpochInfoAggregator for any given block hash B, we walk back from B until we either hit H (in which case we combine with the current aggregator) or hit the beginning of the epoch (in which case we effectively recompute the whole result).
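The walk-back computation described above can be sketched as follows. These are simplified, hypothetical types; the real nearcore structures carry much more state (chunk stats, slashing info, etc.) and use block hashes rather than integers:

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for nearcore's real types.
#[derive(Clone, Default, Debug, PartialEq)]
struct EpochInfoAggregator {
    // In reality this holds proposals, chunk production stats, and more.
    proposals: Vec<String>,
}

impl EpochInfoAggregator {
    // The fold step: absorb one block's contribution.
    fn append(&mut self, block: &Block) {
        self.proposals.extend(block.proposals.iter().cloned());
    }
}

#[derive(Clone)]
struct Block {
    prev_hash: u64,
    epoch_first_block: u64, // first block of this block's epoch
    proposals: Vec<String>,
}

/// Compute the aggregator for block `b`, given a cached aggregator that is
/// valid up to `cached_hash` (a fold of all blocks from epoch start to it).
fn aggregator_for(
    blocks: &HashMap<u64, Block>,
    cached_hash: u64,
    cached: &EpochInfoAggregator,
    b: u64,
) -> EpochInfoAggregator {
    // Walk back from `b`, collecting per-block contributions, until we hit
    // either the cached hash or the first block of the epoch.
    let mut suffix = Vec::new();
    let mut cur = b;
    let mut result = loop {
        if cur == cached_hash {
            break cached.clone(); // combine with the current aggregator
        }
        let block = &blocks[&cur];
        suffix.push(block);
        if cur == block.epoch_first_block {
            break EpochInfoAggregator::default(); // recompute from epoch start
        }
        cur = block.prev_hash;
    };
    // Replay the collected suffix in chain order.
    for block in suffix.into_iter().rev() {
        result.append(block);
    }
    result
}

fn example_chain() -> HashMap<u64, Block> {
    // Chain 1 -> 2 -> 3, all in an epoch whose first block is 1.
    let mut blocks = HashMap::new();
    for (hash, prev) in [(1u64, 0u64), (2, 1), (3, 2)] {
        blocks.insert(
            hash,
            Block {
                prev_hash: prev,
                epoch_first_block: 1,
                proposals: vec![format!("p{hash}")],
            },
        );
    }
    blocks
}

fn main() {
    let blocks = example_chain();
    // Cached aggregator at block 2 already covers blocks 1 and 2.
    let mut cached = EpochInfoAggregator::default();
    cached.append(&blocks[&1]);
    cached.append(&blocks[&2]);
    // Hitting the cache: only block 3 is replayed on top of it.
    let agg = aggregator_for(&blocks, 2, &cached, 3);
    assert_eq!(agg.proposals, vec!["p1", "p2", "p3"]);
    // No usable cache: walk all the way back to the epoch start.
    let agg = aggregator_for(&blocks, 999, &EpochInfoAggregator::default(), 3);
    assert_eq!(agg.proposals, vec!["p1", "p2", "p3"]);
    println!("ok");
}
```

Note that the second call illustrates exactly the expensive case the cache exists to avoid: with no usable cached hash on the path, the whole epoch prefix is re-folded from an empty aggregator.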

This does not sound like a big issue for epoch sync, because as we sync to the first block of an epoch, our EpochInfoAggregator would begin empty. However, it is not that simple. We maintain the current EpochInfoAggregator by always updating it to the last final block. That makes sense, as the last final block is as far as we can advance without ever having to rewind the aggregator computation (rewind as in: when updating the aggregator, we would end up having to traverse all the way back to the epoch start). This presents a very awkward situation for epoch sync, because right after epoch sync, our header_head is at the first block F of the current epoch T:

L3 --> L2 --> L1 --> F
       Epoch T-1  | Epoch T

but our last final block may be either L2 or L3 (depending on whether F is an endorsement or a skip). Either way, when we do header sync afterwards, we are going to apply a block S on top of F:

L3 --> L2 --> L1 --> F --> S
       Epoch T-1  | Epoch T

Now, the last final block is either L1, L2, or L3. If the last final block does advance, say from L2 to L1, the EpochManager will have to update the EpochInfoAggregator to L1, and this is where the problem happens: we have no way to compute it. We are missing the entire epoch T-1, so even if we wanted to, we couldn't aggregate all the way back to the beginning of epoch T-1. In other words, there is no way for us to correctly compute the EpochInfoAggregator at L1.

So to solve this, we have a few options:

  • Forcefully initialize the EpochInfoAggregator at L3 to an empty state. This is incorrect, and as a result the computed aggregators at L2 and L1 will also be incorrect, but that is okay: we don't need the aggregator for epoch T-1 to compute anything, since the EpochInfo for T+1 has already been initialized by epoch sync. It also doesn't affect the aggregator computation for epoch T, since we would restart the aggregation at F anyway. The problem with this approach is that it's hacky, and we still need some additional conditionals in the code (not too bad) because we don't have the BlockInfo for L3, which the current aggregator code needs.
  • Make the EpochManager aware that L1 is our epoch sync boundary, so that when calculating the last final block of F or S, rather than returning L3 or L2, it always returns L1. This is fine for correctness because we will not be rewinding or encountering forks during epoch sync; we already know, as part of epoch sync, that L1 is final. This approach avoids having to compute the aggregator for anything in T-1. The disadvantage is that we now have a weird dependency just for this very subtle situation: no code other than epoch sync itself currently needs to know about an "epoch sync boundary".
  • Some other solution perhaps?
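As a rough illustration of the second option, the last-final-block computation could clamp its result at the epoch-sync boundary. The name, signature, and use of bare heights here are all hypothetical; the real EpochManager works with block hashes and BlockInfo:

```rust
/// Hypothetical sketch of option 2: never report a last final block older
/// than the epoch-sync boundary (L1 above), which epoch sync already
/// established as final.
fn clamped_last_final_height(computed_final: u64, epoch_sync_boundary: Option<u64>) -> u64 {
    match epoch_sync_boundary {
        // After an epoch sync, the boundary's height acts as a floor.
        Some(boundary) => computed_final.max(boundary),
        // Normal operation: no boundary, behavior unchanged.
        None => computed_final,
    }
}

fn main() {
    // Right after epoch sync (boundary at height 100), finality info that
    // points into epoch T-1 (e.g. height 98) is clamped to the boundary.
    assert_eq!(clamped_last_final_height(98, Some(100)), 100);
    // Once finality naturally advances past the boundary, it is unaffected.
    assert_eq!(clamped_last_final_height(105, Some(100)), 105);
    // Without epoch sync, nothing changes.
    assert_eq!(clamped_last_final_height(98, None), 98);
    println!("ok");
}
```

The appeal of this shape is that the aggregator code never sees a final block inside epoch T-1 at all, so the missing-history problem from above simply cannot arise; the cost is plumbing the boundary into the EpochManager.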