
[Synchronization] Message queue for collection sync engine #1248

Merged
merged 8 commits into master on Sep 7, 2021

Conversation

durkmurder (Member)

https://github.com/dapperlabs/flow-go/issues/5807

Context

This PR implements message queues for the collection sync engine. The design and implementation are inspired by the consensus sync engine: essentially the same engine is used, with the modifications needed for collection clusters.
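
For readers unfamiliar with the pattern: incoming sync messages are enqueued instead of being handled inline on the networking goroutine, and a dedicated worker loop drains the queue. A minimal sketch in Go follows (hypothetical types and names, not the actual flow-go implementation):

package sync

import "fmt"

type message struct {
	originID string      // sender identity (a flow.Identifier in the real code)
	payload  interface{} // the sync message itself
}

type engine struct {
	queue chan message  // bounded FIFO message queue
	done  chan struct{} // closed on shutdown
}

// Process enqueues an incoming message without blocking the caller.
func (e *engine) Process(originID string, event interface{}) error {
	select {
	case e.queue <- message{originID, event}:
		return nil
	default:
		return fmt.Errorf("message queue full, dropping message from %s", originID)
	}
}

// loop consumes queued messages one at a time on a single worker goroutine.
func (e *engine) loop() {
	for {
		select {
		case msg := <-e.queue:
			e.handle(msg.originID, msg.payload)
		case <-e.done:
			return
		}
	}
}

// handle dispatches on the concrete message type (sync request, range
// request, batch request, ...); elided in this sketch.
func (e *engine) handle(originID string, payload interface{}) {}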

codecov-commenter commented Sep 3, 2021

Codecov Report

Merging #1248 (fd0fc3d) into master (7b0ae36) will increase coverage by 0.09%.
The diff coverage is 60.81%.


@@            Coverage Diff             @@
##           master    #1248      +/-   ##
==========================================
+ Coverage   56.18%   56.27%   +0.09%     
==========================================
  Files         496      497       +1     
  Lines       30179    30320     +141     
==========================================
+ Hits        16956    17064     +108     
- Misses      10921    10944      +23     
- Partials     2302     2312      +10     
Flag       Coverage Δ
unittests  56.27% <60.81%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                         Coverage Δ
engine/common/synchronization/config.go                42.85% <50.00%> (ø)
...gine/collection/synchronization/request_handler.go  59.49% <59.49%> (ø)
engine/collection/synchronization/engine.go            62.90% <62.12%> (+9.70%) ⬆️
engine/common/synchronization/engine.go                68.78% <100.00%> (ø)
...sus/approvals/assignment_collector_statemachine.go  47.11% <0.00%> (+4.80%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b0ae36...fd0fc3d.

@jordanschalm (Member) left a comment

Thanks for this, Yurii.

If you haven't already, I think we should also smoke test these changes on localnet by forcing a collection node to catch up with the sync engine.

engine/collection/synchronization/engine.go (outdated; resolved)
Comment on lines 342 to 343
e.log.Error().Err(err).Msg("could not get last finalized header")
return
Member:
Should this be fatal if we're going to exit here? Stopping the polling process can cause the node to get permanently out of sync.

Member Author:
I have replaced it with continue, but my feeling is that we should crash the node at this point. Not being able to retrieve state seems like a fatal error to me.

Member:
> my feeling is that we should crash the node at this point

I actually agree; I think it's better to change the log level to Fatal.
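
A sketch of the agreed change, assuming the zerolog-style logger used in flow-go (Fatal logs the message and then terminates the process):

final, err := e.state.Final().Head()
if err != nil {
	// Fatal exits the process after logging, crashing the node as discussed
	e.log.Fatal().Err(err).Msg("could not get last finalized header")
}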

Comment on lines +318 to 323
if e.pollInterval > 0 {
poll := time.NewTicker(e.pollInterval)
pollChan = poll.C
defer poll.Stop()
}
scan := time.NewTicker(e.scanInterval)
Member:
Any reason not to treat pollInterval and scanInterval the same here? I.e., handle negative values and defer <ticker>.Stop() for both?

Member Author:
For execution nodes, the consensus sync engine passes 0, meaning it doesn't want to poll height.
I don't know if it's needed for collection nodes, but there's no harm in keeping it, I guess.

Contributor:
Should we add the defer scan.Stop() though?
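
Combining the two suggestions in this thread, a sketch of the loop setup (assuming the surrounding checkLoop structure from the diff above): a nil channel never fires in a select, so polling stays disabled when pollInterval is not positive, and both tickers are released via defer:

var pollChan <-chan time.Time // nil channel: never fires in the select below
if e.pollInterval > 0 {
	poll := time.NewTicker(e.pollInterval)
	pollChan = poll.C
	defer poll.Stop()
}
scan := time.NewTicker(e.scanInterval)
defer scan.Stop() // the suggested addition: release the scan ticker on exit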

if err != nil {
errs = multierror.Append(errs, fmt.Errorf("could not submit range request (from=%d, to=%d): %w", ran.From, ran.To, err))
Member:
Maybe we could add the more informative error message back (and add it to the consensus engine as well).

if err != nil {
errs = multierror.Append(errs, fmt.Errorf("could not submit batch request (size=%d): %w", len(batch.BlockIDs), err))
Member:
Maybe we could add the more informative error message back (and add it to the consensus engine as well).

// get the last finalized header
final, err := e.state.Final().Head()
func (e *Engine) pollHeight() {
head, err := e.state.Final().Head()
Contributor:
Could we instead use FinalizedHeaderCache, like what is done in the common sync engine?

Member Author:
Not really; collection clusters don't support the same notification infrastructure as consensus nodes. I don't think it's worth implementing it only for this.

}

if comp == nil {
panic("must initialize synchronization engine with comp engine")
Contributor:
Returning an error would be a more proper way to handle this failure.
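
A sketch of the suggested alternative (hypothetical constructor; the dependency type is a placeholder, not the actual flow-go signature):

type Compliance interface{} // placeholder for the compliance engine dependency

type Engine struct {
	comp Compliance
	// ... other fields ...
}

// New returns an error instead of panicking when a required dependency is missing.
func New(comp Compliance) (*Engine, error) {
	if comp == nil {
		return nil, fmt.Errorf("must initialize synchronization engine with comp engine")
	}
	return &Engine{comp: comp}, nil
}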

func (e *Engine) Ready() <-chan struct{} {
e.unit.Launch(e.checkLoop)
return e.unit.Ready()
e.lm.OnStart(func() {
Contributor:
Since we are using unit, it is safer to invoke unit.Ready first; this follows the lifecycle assumptions of the unit.
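
One reading of this suggestion, as a sketch (assuming lm is the lifecycle manager introduced in this diff and that it exposes a Started() channel; treat the exact API as an assumption):

func (e *Engine) Ready() <-chan struct{} {
	e.lm.OnStart(func() {
		<-e.unit.Ready()           // honor the unit's lifecycle first
		e.unit.Launch(e.checkLoop) // then launch the worker loop
	})
	return e.lm.Started() // assumption: lm exposes a Started() channel
}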

@jordanschalm (Member):

Cluster syncing is looking good on localnet on this PR.

@durkmurder (Member Author):

bors merge

bors bot (Contributor) commented Sep 7, 2021

bors bot merged commit a63b2ba into master on Sep 7, 2021.
bors bot deleted the yurii/5807-collection-sync-engine-refactoring branch on September 7, 2021 at 19:48.