
[Consensus] Caching of non processable approvals #776

Merged
merged 39 commits into from
Jul 21, 2021

Conversation

durkmurder
Member

This PR implements the second approach proposed here: dapperlabs/flow-go/issues/5555.

It's not ready yet, but I am submitting it to discuss this approach before making a decision on implementation.

I have introduced extra assignment collectors; in the proposed solution we have:

  • CachingAssignmentCollector, which caches all approvals in a separate cache without verification
  • VerifyingAssignmentCollector, which does full verification and approval collecting
  • OrphanAssignmentCollector, which errors when used, since we clearly know it is outdated.

There is logic for "changing" the active collector, but no logic yet for actually applying the cached approvals (to be implemented).

Hopefully this is enough to give a general overview of the proposed solution.

Base automatically changed from yurii/5417-approval-processing-engine to master June 4, 2021 12:01
Member

@AlexHentschel AlexHentschel left a comment

Looks great. Thanks. I like the design.

Overall, I have two recommendations:

  • We could completely hide the implementation details of AssignmentCollector including the implementation of the state transitions from AssignmentCollectorTree:
    • Implementing and testing the state transitions and concurrency is probably non-trivial. It would be great to decouple this logic as much as possible from AssignmentCollectorTree.
    • AssignmentCollectorTree could propagate the state update along its internal tree. But each vertex would internally know what to do when a state transition occurs.
  • When you implement the logic for transitioning from CachingAssignmentCollector to VerifyingAssignmentCollector, you might wonder whether we have to preserve the order in which we received the approvals. At least I was contemplating this. Here is my thinking:
    • Generally, verifiers can only send approvals after the result was incorporated. Hence, with very high probability, the node already has a VerifyingAssignmentCollector when the first approvals arrive. In this case, we process the approvals roughly in the order they arrive.
    • Vice versa, a node storing approvals in a CachingAssignmentCollector is very rare and should only happen if the node is behind. In this case, it is fine to process the approvals in random order, because all approvers have the same probability of making it into the seal.

// NewCollector is a factory method to generate an AssignmentCollector for an execution result
type NewCollectorFactoryMethod = func(result *flow.ExecutionResult) (*AssignmentCollector, error)
type NewCollectorFactoryMethod = func(result *flow.ExecutionResult) (*VerifyingAssignmentCollector, error)
Member

I am wondering to what extent the AssignmentCollectorTree needs to know about the implementation, or whether it could work with abstract AssignmentCollectors. To add a bit more detail:

  • we could inject a factory for AssignmentCollectors
  • transitioning between the states (CachingAssignmentCollector -> VerifyingAssignmentCollector -> OrphanAssignmentCollector) is internal logic of the AssignmentCollector
  • AssignmentCollectorTree triggers the AssignmentCollector state transition (but does not have detailed internal knowledge how these happen).
    • note: multiple threads could update the state of an AssignmentCollector. I would suggest implementing the state transition as an atomic operation, similar to compare-and-swap.
    • We need a shared worker pool (which I suggested removing earlier 😑). The factory could provide this pool.

Comment on lines 37 to 44
if processable != v.processable {
// if it became unprocessable, it is an orphan now
if v.processable {
v.nonProcessableCollector = NewOrphanAssignmentCollector(v.collector.ResultID(), v.collector.BlockID())
}

v.processable = processable
}
Member

Not sure how concurrency-safe this is:

  • The AssignmentCollectorTree is locked while we update.
  • But I am wondering whether other threads could hold a reference to the struct from before its state transition.


ProcessIncorporatedResult(incorporatedResult *flow.IncorporatedResult) error
ProcessApproval(approval *flow.ResultApproval) error

Member

We could add a method

Suggested change
ProcessingStatus() ProcessingStatus
// ChangeProcessingStatus changes the AssignmentCollector's internal processing
// status. The operation is implemented as an atomic compare-and-swap, i.e. the
// state transition is only executed if AssignmentCollector's internal state is
// equal to `expectedValue`. The boolean return indicates whether the state
// was updated.
ChangeProcessingStatus(expectedValue, newValue ProcessingStatus) (bool, error)

with

type ProcessingStatus int
const (
	Caching ProcessingStatus = iota
	Verifying
	Orphaned
)

@AlexHentschel
Member

AlexHentschel commented Jun 22, 2021

On a side-note: Leo has suggested an option I haven't considered before: https://github.com/dapperlabs/flow-go/issues/5555#issuecomment-853483770 As far as I understand, his approach also has some engineering complexity:

  • we already have some still-inactive vertices in the AssignmentCollectorTree. But we don't want to store approvals in these vertices, but rather in Core.approvalsCache. Then Core needs to understand the different states.
  • My personal preference is generally a strong separation of concerns:
    • AssignmentCollector implements the state transition for one result
    • AssignmentCollectorTree maintains the different AssignmentCollectors, triggers their state transitions, and propagates the state transitions

Nevertheless, I acknowledge that my proposal involves quite some engineering complexity. Please let me know if you have any ideas to simplify this implementation.

@durkmurder
Member Author

@AlexHentschel I believe that this case falls very strictly under the State OOP pattern. I would implement it so that we have 3 separate states (Caching, Verifying, Orphan), all implementing the same AssignmentCollector interface.

type AssignmentCollector interface {
  BlockID() flow.Identifier
  ResultID() flow.Identifier
  
  ProcessIncorporatedResult(incorporatedResult *flow.IncorporatedResult) error
  ProcessApproval(approval *flow.ResultApproval) error
  
  CheckEmergencySealing(finalizedBlockHeight uint64) error
  RequestMissingApprovals(sealingTracker *tracker.SealingTracker, maxHeightForRequesting uint64) (int, error)
}

The question is with the state transitions. Extending your proposal, I would implement AssignmentCollectorContext, a specific class that implements the AssignmentCollector interface and performs the state transitions; it knows everything about all 3 states, can perform the transitions, and holds a reference to the current state.

type AssignmentCollectorContext struct {
	// holds a reference to the current state (one of the 3 collectors)
}

func (c *AssignmentCollectorContext) ProcessingStatus() ProcessingStatus

// ChangeProcessingStatus changes the AssignmentCollector's internal processing
// status. The operation is implemented as an atomic compare-and-swap, i.e. the
// state transition is only executed if AssignmentCollector's internal state is
// equal to `expectedValue`. The boolean return indicates whether the state
// was updated.
func (c *AssignmentCollectorContext) ChangeProcessingStatus(expectedValue, newValue ProcessingStatus) (bool, error)

AssignmentCollectorTree will be responsible for triggering the state transition, since it's the only class that has the knowledge to do so.

The only concern I have is for situations like this:

collector := collectorTree.GetCollector() // returns caching collector
// at this point thread gets suspended
// collector internally changes its state to the orphan collector
// any operation on it will now result in an error
collector.ProcessApproval() // error

When thinking about this case, I doubt that changing the internal state of an object is a good idea. With the current implementation, whatever you get from the collector tree, you can be sure it won't change. Maybe that's a good argument for keeping the state-changing logic as it is in the current implementation.

Open to discussing it.

@AlexHentschel AlexHentschel mentioned this pull request Jul 6, 2021
@codecov-commenter

codecov-commenter commented Jul 9, 2021

Codecov Report

Merging #776 (0b65640) into master (89ea460) will decrease coverage by 0.00%.
The diff coverage is 54.52%.


@@            Coverage Diff             @@
##           master     #776      +/-   ##
==========================================
- Coverage   54.86%   54.85%   -0.01%     
==========================================
  Files         279      284       +5     
  Lines       18649    18872     +223     
==========================================
+ Hits        10231    10353     +122     
- Misses       7042     7122      +80     
- Partials     1376     1397      +21     
Flag Coverage Δ
unittests 54.85% <54.52%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...e/consensus/approvals/assignment_collector_tree.go 0.00% <0.00%> (ø)
engine/consensus/approvals/request_tracker.go 84.21% <ø> (ø)
engine/consensus/sealing/engine.go 50.77% <0.00%> (-0.81%) ⬇️
...consensus/approvals/orphan_assignment_collector.go 42.85% <42.85%> (ø)
engine/consensus/sealing/core.go 60.71% <42.85%> (-0.48%) ⬇️
engine/consensus/approvals/assignment_collector.go 50.00% <50.00%> (-12.12%) ⬇️
...sus/approvals/assignment_collector_statemachine.go 50.00% <50.00%> (ø)
...nsensus/approvals/veryfing_assignment_collector.go 59.06% <59.06%> (ø)
...e/consensus/approvals/assignment_collector_base.go 65.51% <65.51%> (ø)
...onsensus/approvals/caching_assignment_collector.go 84.61% <84.61%> (ø)
... and 9 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 89ea460...0b65640.

@durkmurder durkmurder marked this pull request as ready for review July 13, 2021 14:13
Member

@zhangchiqing zhangchiqing left a comment

Nice work! Added some comments.

engine/consensus/approvals/caches.go: 3 resolved review threads (outdated)
wg.Done()
}()

err := s.collector.ChangeProcessingStatus(CachingApprovals, VerifyingApprovals)
Member

Very nice tests.

I think this is called concurrently with ProcessApproval and ProcessIncorporatedResult.

I wonder whether these tests are able to capture the concurrency issue around re-processing the data.

Did you try removing that logic to see if the tests fail, and adding it back to see if they pass again?

That would prove we did reproduce the concurrency issue, and did address it.

Member Author

Thank you, I've tried removing the logic to prove that it works.
There were some issues in the test itself, so I have fixed those.

Member

@AlexHentschel AlexHentschel left a comment

For the AssignmentCollectorStateMachine it would be great to have tests for invalid state transitions. I have the suspicion that the current code would panic 😨. Please test for the specific error type that is expected in this case (i.e. ErrDifferentCollectorState).

I have provided some additional documentation in my PR #995 (targeting this branch). The PR also implements some of my comments.

func (asm *AssignmentCollectorStateMachine) caching2Verifying() (*CachingAssignmentCollector, error) {
asm.Lock()
defer asm.Unlock()
cachingCollector, ok := asm.atomicLoadCollector().(*CachingAssignmentCollector)
Member

😨
if this cast fails, won't cachingCollector be nil? If that is the case, we will have a panic here:

verifyingCollector.ProcessingStatus().String(), ErrDifferentCollectorState)

Do we have a test that covers this failure case?

Member Author

You are right, it may happen. I will add a test for this case.

Member Author

Actually, it can't happen in our scenario. It may happen theoretically, but callers of ChangeProcessingStatus mutually exclude on the same lock, so only one goroutine can change the processing status at a time. In other words, with the current implementation the race cannot happen.

@@ -118,13 +120,20 @@ func (t *AssignmentCollectorTree) FinalizeForkAtLevel(finalized *flow.Header, se
}

if len(finalizedFork) > 0 {
if !finalizedFork[0].processable {
if finalizedFork[0].collector.ProcessingStatus() != VerifyingApprovals {
Member

⚠️
Caution, I think we made a simplifying assumption here, which doesn't hold:

  • We assume a single fork. While we only have a single fork of finalized blocks, this logic is for the AssignmentCollectorTree, which mirrors the Execution Tree. There can be multiple execution forks that all descend from the latest sealed block ❗

I think if we replace

if len(finalizedFork) > 0 {
if finalizedFork[0].collector.ProcessingStatus() != VerifyingApprovals {
log.Error().Msgf("AssignmentCollectorTree has found not processable finalized fork %v,"+
" this is unexpected and shouldn't happen, recovering", finalizedFork[0].collector.BlockID())
for _, vertex := range finalizedFork {
expectedStatus := vertex.collector.ProcessingStatus()
err = vertex.collector.ChangeProcessingStatus(expectedStatus, VerifyingApprovals)
if err != nil {
return err
}
}
err = t.updateForkState(finalizedFork[len(finalizedFork)-1], VerifyingApprovals)
if err != nil {
return err
}
}
}

by the following, that should fix the problem:

for _, collectorVertex := range finalizedFork {
	clr := collectorVertex.collector
	if clr.ProcessingStatus() != VerifyingApprovals {
		log.Error().Msgf("AssignmentCollectorTree has found not processable finalized fork %v,"+
			" this is unexpected and shouldn't happen, recovering", clr.BlockID())
	}
	currentStatus := clr.ProcessingStatus()
	if clr.Block().Height < finalized.Height {
		err = clr.ChangeProcessingStatus(currentStatus, VerifyingApprovals)
	} else {
		err = t.updateForkState(collectorVertex, VerifyingApprovals)
	}
	if err != nil {
		return err
	}
}

Member Author

I am confused: can we have multiple finalized forks? This code is basically for the case where the finalized fork has to be verifiable; otherwise we have a bug somewhere.

Member

We need to distinguish between the tree of blocks and the tree of results (aka execution tree)

  • tree of blocks: we can only have a single finalized fork
  • execution tree: for the finalized blocks, there can still be conflicting branches of results in the execution tree.
    • Sealing orphans forks in the execution tree.
    • But since the finalized blocks are still unsealed, there can be multiple competing results

engine/consensus/approvals/assignment_collector_tree.go: resolved review thread (outdated)
durkmurder and others added 4 commits July 20, 2021 18:28
* added high-level readme and goDoc

* removed `base` type as this was only used together with `AssignmentCollectorBase`

* revised method `AssignmentCollectorTree.updateForkState(..)` to return early if the state of the collector is already equal to the target state

Co-authored-by: Yura <yuraolex@gmail.com>
Member

@AlexHentschel AlexHentschel left a comment

🎯 Great work. Thanks for the thorough work and the plentiful tests 👏
Made a couple optional comments. Feel free to merge.

P.S. pushed commit 2311e39 with a few lines of extra goDoc. No code changes.

Comment on lines 33 to 35
// TODO: currently VerifyingAssignmentCollector doesn't cleanup collectorTree when blocks that incorporate results get orphaned
// For BFT milestone we need to ensure that this cleanup is properly implemented and all orphan collectorTree are pruned by height
// when fork gets orphaned
Member

Is this ToDo still up to date? Here my thoughts:

  • we prune AssignmentCollectorTree by height
  • orphaned forks cannot grow indefinitely, because consensus nodes don't extend them anymore
  • I think it is beneficial to keep orphaned forks in the tree, so we can orphan a new collector that extends an already-orphaned fork

Hence, pruning by height should be sufficient (?)

@durkmurder durkmurder merged commit 20969da into master Jul 21, 2021
@durkmurder durkmurder deleted the yurii/5555-caching-non-processable-approvals branch July 21, 2021 08:36