-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Consensus] Caching of non processable approvals #776
[Consensus] Caching of non processable approvals #776
Conversation
…b.com/onflow/flow-go into yurii/5555-caching-non-processable-approvals
…b.com/onflow/flow-go into yurii/5555-caching-non-processable-approvals
…/5555-caching-non-processable-approvals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks. I like the design.
Overall, I have two recommendations:
- We could completely hide the implementation details of
AssignmentCollector
including the implementation of the state transitions fromAssignmentCollectorTree
:- Implementing and testing the state transitions and concurrency is probably non-trivial. It would be great to decouple this logic as much as possible from
AssignmentCollectorTree
. AssignmentCollectorTree
could propagate the state update along its internal tree. But each vertex would internally know what to do when a state transition occurs.
- Implementing and testing the state transitions and concurrency is probably non-trivial. It would be great to decouple this logic as much as possible from
- When you implement the logic for transitioning from
CachingAssignmentCollector
toVerifyingAssignmentCollector
, you might be wondering if we have to preserve the order in which we received the approvals. At least I was contemplating this. Here is my thinking:- Generally, verifiers can only send approvals after the result was incorporated. Hence, with very large probability, the node already has a VerifyingAssignmentCollector` when the first approvals arrive. In this case, we process the approvals roughly in the order they arrive.
- Vice versa, a node storing approvals in a
CachingAssignmentCollector
is very rare and should only happen if the node is behind. In this case, it is fine to process the approvals in random order, because all approvers have the same probability to make it into the seal.
// NewCollector is a factory method to generate an AssignmentCollector for an execution result | ||
type NewCollectorFactoryMethod = func(result *flow.ExecutionResult) (*AssignmentCollector, error) | ||
type NewCollectorFactoryMethod = func(result *flow.ExecutionResult) (*VerifyingAssignmentCollector, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering to which extend the AssignmentCollectorTree
needs to know about the implementation or if it could work with abstract AssignmentCollector
s. To add a bit more details:
- we could inject a factory for
AssignmentCollector
s - transitioning between the states (
CachingAssignmentCollector
->VerifyingAssignmentCollector
->OrphanAssignmentCollector
) is internal logic of theAssignmentCollector
AssignmentCollectorTree
triggers theAssignmentCollector
state transition (but does not have detailed internal knowledge how these happen).- note: multiple threads could update the state of an
AssignmentCollector
. I would suggest to implement the state transition as an atomic operation similar to Compare-and-swap - We need a shared worker pool (which I suggested to remove earlier 😑). The Factory could provide this pool.
- note: multiple threads could update the state of an
if processable != v.processable { | ||
// if became unprocessable means it's orphan now | ||
if v.processable { | ||
v.nonProcessableCollector = NewOrphanAssignmentCollector(v.collector.ResultID(), v.collector.BlockID()) | ||
} | ||
|
||
v.processable = processable | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how concurrency-safe this is:
- The
AssignmentCollectorTree
is locked while we update. - But I am wondering if other threads could hold a reference to the struct before its state transition?
|
||
ProcessIncorporatedResult(incorporatedResult *flow.IncorporatedResult) error | ||
ProcessApproval(approval *flow.ResultApproval) error | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a method
ProcessingStatus() ProcessingStatus | |
// ChangeProcessingStatus changes the AssignmentCollector's internal processing | |
// status. The operation is implemented as an atomic compare-and-swap, i.e. the | |
// state transition is only executed if AssignmentCollector's internal state is | |
// equal to `expectedValue`. The boolean return indicates whether the state | |
// was updated. | |
ChangeProcessingStatus(expectedValue, newValue ProcessingStatus) (bool, error) | |
with
type ProcessingStatus int
const (
Caching ProcessingStatus = iota
Verifying
Orphaned
)
On a side-note: Leo has suggested an option I haven't considered before: https://github.com/dapperlabs/flow-go/issues/5555#issuecomment-853483770 As far as I understand, his approach also has some engineering complexity:
Nevertheless, I acknowledge that my proposal has quite an engineering complexity. Please let me know, if you have any idea to simplify this implementation. |
…/5555-caching-non-processable-approvals
@AlexHentschel I believe that this case falls very strictly under type AssignmentCollector interface {
BlockID() flow.Identifier
ResultID() flow.Identifier
ProcessIncorporatedResult(incorporatedResult *flow.IncorporatedResult) error
ProcessApproval(approval *flow.ResultApproval) error
CheckEmergencySealing(finalizedBlockHeight uint64) error
RequestMissingApprovals(sealingTracker *tracker.SealingTracker, maxHeightForRequesting uint64) (int, error)
} The question is with the state transitions. Extending your proposal I would implement type AssignmentCollectorContext struct {
ProcessingStatus() ProcessingStatus
// ChangeProcessingStatus changes the AssignmentCollector's internal processing
// status. The operation is implemented as an atomic compare-and-swap, i.e. the
// state transition is only executed if AssignmentCollector's internal state is
// equal to `expectedValue`. The boolean return indicates whether the state
// was updated.
ChangeProcessingStatus(expectedValue, newValue ProcessingStatus) (bool, error)
}
The only concern I have is for situations: collector := collectorTree.GetCollector() // returns caching collector
// at this point thread gets suspended
// collector internally changes state to orphan collector
// any operation on it will result in error
collector.ProcessApproval() // error When thinking about this case I doubt that changing internal state of object is a good idea. With current implementation whatever you get from collector tree you can be sure that it won't change. Maybe that's a good argument for keeping state changing logic as it is in current implementation. Open to discussing it. |
…-approvals Rough draft for
…/5555-caching-non-processable-approvals
…/5555-caching-non-processable-approvals
Codecov Report
@@ Coverage Diff @@
## master #776 +/- ##
==========================================
- Coverage 54.86% 54.85% -0.01%
==========================================
Files 279 284 +5
Lines 18649 18872 +223
==========================================
+ Hits 10231 10353 +122
- Misses 7042 7122 +80
- Partials 1376 1397 +21
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
…/5555-caching-non-processable-approvals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Added some comments.
engine/consensus/approvals/assignment_collector_statemachine.go
Outdated
Show resolved
Hide resolved
engine/consensus/approvals/assignment_collector_statemachine.go
Outdated
Show resolved
Hide resolved
engine/consensus/approvals/assignment_collector_statemachine.go
Outdated
Show resolved
Hide resolved
engine/consensus/approvals/assignment_collector_statemachine_test.go
Outdated
Show resolved
Hide resolved
wg.Done() | ||
}() | ||
|
||
err := s.collector.ChangeProcessingStatus(CachingApprovals, VerifyingApprovals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice tests.
I think this is called concurrently with the ProcessApproval and ProcessIncorporatedResult.
I wonder if you have tried this tests is able to capture the concurrency issue by re-processing the data
Did you try removing that logic, and see if the tests fails, and added back would make the tests pass?
This could prove we did reproduce the concurrency issue, and did address it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I've tried removing logic to prove that it works.
There was some issues in the test itself so I have fixed those.
Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>
…/5555-caching-non-processable-approvals
…//github.com/onflow/flow-go into yurii/5555-caching-non-processable-approvals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the AssignmentCollectorStateMachine
it would be great to have tests for invalid state transitions. I have the suspicion that the current code would panic 😨 . Please test for the specific error type that is expected in this case (i.e. ErrDifferentCollectorState
).
I have provided some additional documentation in my PR #995 (targeting this branch). The PR also implements some of my comments .
func (asm *AssignmentCollectorStateMachine) caching2Verifying() (*CachingAssignmentCollector, error) { | ||
asm.Lock() | ||
defer asm.Unlock() | ||
cachingCollector, ok := asm.atomicLoadCollector().(*CachingAssignmentCollector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😨
if this cast fails, won't cachingCollector
be nil
? If that is the case, we will have a panic here:
verifyingCollector.ProcessingStatus().String(), ErrDifferentCollectorState) |
Do we have a test that covers this failure case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, it may happen. I will add a test for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it can't in our scenario, it may happen theoretically but caller of ChangeProcessingStatus
mutually excludes at same lock so only one goroutine will change processing status at time. In other words in current implementation race cannot happen.
@@ -118,13 +120,20 @@ func (t *AssignmentCollectorTree) FinalizeForkAtLevel(finalized *flow.Header, se | |||
} | |||
|
|||
if len(finalizedFork) > 0 { | |||
if !finalizedFork[0].processable { | |||
if finalizedFork[0].collector.ProcessingStatus() != VerifyingApprovals { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caution, I think we made a simplifying assumption here, which doesn't hold:
- We assume a single fork. While we only have a single fork of finalized blocks, this logic is for the
AssignmentCollectorTree
, which mirrors the Execution Tree. There can be multiple execution forks that all descend from the latest sealed block ❗
I think if we replace
flow-go/engine/consensus/approvals/assignment_collector_tree.go
Lines 122 to 138 in 6702dae
if len(finalizedFork) > 0 { | |
if finalizedFork[0].collector.ProcessingStatus() != VerifyingApprovals { | |
log.Error().Msgf("AssignmentCollectorTree has found not processable finalized fork %v,"+ | |
" this is unexpected and shouldn't happen, recovering", finalizedFork[0].collector.BlockID()) | |
for _, vertex := range finalizedFork { | |
expectedStatus := vertex.collector.ProcessingStatus() | |
err = vertex.collector.ChangeProcessingStatus(expectedStatus, VerifyingApprovals) | |
if err != nil { | |
return err | |
} | |
} | |
err = t.updateForkState(finalizedFork[len(finalizedFork)-1], VerifyingApprovals) | |
if err != nil { | |
return err | |
} | |
} | |
} |
by the following, that should fix the problem:
for _, collectorVertex := range finalizedFork {
clr := collectorVertex.collector
if clr.ProcessingStatus() != VerifyingApprovals {
log.Error().Msgf("AssignmentCollectorTree has found not processable finalized fork %v,"+
" this is unexpected and shouldn't happen, recovering", clr.BlockID())
}
currentStatus := clr.ProcessingStatus()
if clr.Block().Height < finalized.Height {
err = clr.ChangeProcessingStatus(currentStatus, VerifyingApprovals)
} else {
err = t.updateForkState(collectorVertex, VerifyingApprovals)
}
if err != nil {
return err
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused, can we have multiple finalized forks? This code is basically for a case where finalized fork has to be verifiable otherwise we have a bug somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to distinguish between the tree of blocks and the tree of results (aka execution tree)
- tree of blocks: we can only have a single finalized fork
- execution tree: for the finalized blocks, there can still be conflicting branches of results in the execution tree.
- Sealing orphans forks in the execution tree.
- But since the finalized blocks are still unsealed, there can be multiple competing results
* added high-level readme and goDoc * removed `base` type as this was only used together with `AssignmentCollectorBase` * revised method `AssignmentCollectorTree.updateForkState(..)` to return early if the state of the collector is already equal to the target state Co-authored-by: Yura <yuraolex@gmail.com>
…/5555-caching-non-processable-approvals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎯 Great work. Thanks for the thorough work and the plentiful tests 👏
Made a couple optional comments. Feel free to merge.
P.S. pushed commit 2311e39 with a few lines of extra goDoc. No code changes.
// TODO: currently VerifyingAssignmentCollector doesn't cleanup collectorTree when blocks that incorporate results get orphaned | ||
// For BFT milestone we need to ensure that this cleanup is properly implemented and all orphan collectorTree are pruned by height | ||
// when fork gets orphaned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this ToDo still up to date? Here my thoughts:
- we prune
AssignmentCollectorTree
by height - orphaned forks cannot grow indefinitely, because consensus nodes don't extend them anymore
- I think it is beneficial to keep orphaned forks in the tree, so we can orphan an new collector that extends an already orphaned fork
Hence, pruning by height should be sufficient (?)
Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>
…/5555-caching-non-processable-approvals
This PR implements second approach proposed here: dapperlabs/flow-go/issues/5555.
It's not ready yet, but I am submitting it to discuss this approach before making a decision on implementation.
I have introduced extra assignment collectors, in proposed solution we have:
CachingAssignmentCollector
which caches all approvals into separate cache without verificationVerifyingAssignmentCollector
which does full verification and approval collectingOrphanAssignmentCollector
which errors if being used since we clearly know that it's outdated.There is logic for "changing" active collector but no logic for actually applying cached approvals(to be implemented).
Hopefully this is enough to gave a general overview of proposed solution.