
[Network] Synchronize ready and done #1026

Merged 10 commits into master on Jul 26, 2021

Conversation

@synzhu (Contributor) commented Jul 24, 2021:

This PR fixes a bug with the existing network implementation.

The middleware.Start method is not idempotent, and calling it twice results in attempting to bind to the same address twice, which results in the error:

{"error":"could not create libp2p node: could not bootstrap libp2p host: could not create libp2p host: failed to listen on any addresses: [listen tcp4 172.25.0.8:2137: bind: address already in use]","time":"2021-07-23T23:39:27Z","message":"failed to start middleware"}

This is a problem because the splitter network relies on an underlying network implementation, and it will wait for the underlying network to be ready inside its own Ready method. However, the underlying network already has its Ready method called as part of the node startup.

Regardless of how many times Ready and Done are called, we should only call middleware.Start and middleware.Stop once.
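One standard way to get this "start and stop at most once" guarantee in Go is `sync.Once`. The sketch below is illustrative, not the actual flow-go implementation; `Middleware` and `bind` are hypothetical stand-ins for the real middleware and its address binding.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Middleware is a hypothetical stand-in for the real network middleware.
// Binding the listen address twice fails, so Start must only run its body
// once, no matter how many times it is called.
type Middleware struct {
	startOnce sync.Once
	bound     bool
}

// bind simulates binding the listen address; a second bind would fail
// with "address already in use".
func (m *Middleware) bind() error {
	if m.bound {
		return errors.New("bind: address already in use")
	}
	m.bound = true
	return nil
}

// Start is made idempotent with sync.Once: only the first call binds.
func (m *Middleware) Start() error {
	var err error
	m.startOnce.Do(func() {
		err = m.bind()
	})
	return err
}

func main() {
	m := &Middleware{}
	fmt.Println(m.Start()) // <nil>
	fmt.Println(m.Start()) // <nil>: second call is a no-op, no double bind
}
```

Without the `sync.Once`, the second `Start` would attempt a second bind and fail exactly as in the log above.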

@codecov-commenter commented:

Codecov Report

Merging #1026 (766b38a) into master (ca72e30) will decrease coverage by 0.11%.
The diff coverage is 0.00%.

❗ Current head 766b38a differs from pull request most recent head 2505433. Consider uploading reports for the commit 2505433 to get more accurate results

@@            Coverage Diff             @@
##           master    #1026      +/-   ##
==========================================
- Coverage   54.85%   54.75%   -0.11%     
==========================================
  Files         286      286              
  Lines       18913    18916       +3     
==========================================
- Hits        10375    10357      -18     
- Misses       7140     7160      +20     
- Partials     1398     1399       +1     
Flag       Coverage Δ
unittests  54.75% <0.00%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
network/p2p/network.go 0.00% <0.00%> (ø)
...sus/approvals/assignment_collector_statemachine.go 40.38% <0.00%> (-9.62%) ⬇️
engine/execution/ingestion/engine.go 57.81% <0.00%> (-1.45%) ⬇️
cmd/util/ledger/migrations/storage_v4.go 41.56% <0.00%> (-0.61%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@huitseeker (Contributor) left a comment:

This looks like a great catch, but just for my understanding:

  • Under what conditions is the middleware's Start called twice?
  • Insofar as it is, why do we use two independent uint flags (rather than one) to check whether it's running? Would we fail a necessary restart, or a shutdown-after-restart, in a situation where started == stopped == 1?

@synzhu (Contributor, Author) commented Jul 26, 2021:

> This looks like a great catch, but just for my understanding:
>
>   • in what conditions is the middleware start called twice?
>   • in so far as it is, why do we use two independent uint flags (rather than one) to check whether it's running? Would we fail a necessary restart, or a shutdown-after-restart in a situation where started == stopped == 1?

@huitseeker The middleware's Start would have been called twice whenever network.Ready was called twice. network.Ready should be idempotent: it always returns a channel that closes (or is already closed) once the network has started up.

The convention here is that a ReadyDoneAware module only needs to support starting and stopping once. In other words, the state transition looks like this:

idle -> starting up -> running -> shutting down -> stopped

There is no way to "restart" (i.e., to get from the "stopped" state back to "starting up"). Instead, one should just create a new instance of the module.

However, thanks for the comment because I've now changed the code to account for some edge cases that I didn't think were necessary before (08c887a, c6ef6bf)

In particular, the logic will now behave as follows:

  • If either Ready or Done is called twice, the second call is a no-op.
  • If Done is called before Ready is ever called, the network will be stopped, and the channel returned by any subsequent calls to Ready will never be closed. In other words, calling Done while in the "idle" state takes us directly to the "stopped" state.
  • If Done is called after Ready but before the ready channel has actually closed (i.e., while in the "starting up" state), then we wait until the ready channel has closed before closing the done channel.
  • As mentioned above, if Done has already been called before Ready is called for the first time, then the startup will not actually be performed, and the returned channel will never close.

@synzhu requested a review from huitseeker, July 26, 2021 18:17
Comment on lines 47 to 50
ready chan struct{}
done chan struct{}
startupCommenced bool
shutdownCommenced bool
A reviewer (Contributor) commented:
Please add godoc for these fields explaining why we need them and what they are used for.

@yhassanzadeh13 (Contributor) left a comment:

Thanks for the fix. Considered as a hotfix to avoid race conditions, the PR looks good. However, I have concerns about the maintainability of the code, as several member fields have been added to the Network solely to manage its start-stop lifecycle. These components (i.e., ready, done, startupCommenced, and shutdownCommenced) could be cleanly encapsulated into a separate (sub)module that manages the lifecycle of the Network.

For example, the current implementation uses the stateTransition lock solely to guard the ready and done channels and the startupCommenced and shutdownCommenced flags, and we must never use that lock for any other purpose. Over time, however, the concept of state transition can be misunderstood or misused by other developers, posing correctness issues for the code; e.g., someone might use stateTransition as the lock for updating the state of the topology, which could cause liveness issues. Hence, I strongly recommend encapsulating this synchronization logic for lifecycle management into a separate (sub)module. You may do that either as part of this PR or as a follow-up issue (please feel free to open one and assign it to me).
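One possible shape for the (sub)module suggested here, sketched with hypothetical names (LifecycleManager, OnStart, OnStop are illustrative, not necessarily what flow-go adopted): the manager owns the channels, flags, and lock, and the Network supplies its start/stop logic as callbacks, so the Network itself carries no lifecycle fields at all.

```go
package main

import (
	"fmt"
	"sync"
)

// LifecycleManager is a hypothetical encapsulation of the lifecycle state
// discussed in the review. It owns the ready/done channels and the flags,
// so the embedding component needs no lifecycle bookkeeping of its own.
type LifecycleManager struct {
	mu                sync.Mutex
	ready             chan struct{}
	done              chan struct{}
	startupCommenced  bool
	shutdownCommenced bool
}

func NewLifecycleManager() *LifecycleManager {
	return &LifecycleManager{
		ready: make(chan struct{}),
		done:  make(chan struct{}),
	}
}

// OnStart runs start exactly once (and never after OnStop has been
// called), then closes the ready channel.
func (l *LifecycleManager) OnStart(start func()) <-chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.startupCommenced && !l.shutdownCommenced {
		l.startupCommenced = true
		go func() {
			start()
			close(l.ready)
		}()
	}
	return l.ready
}

// OnStop runs stop exactly once, after any commenced startup completes,
// then closes the done channel.
func (l *LifecycleManager) OnStop(stop func()) <-chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.shutdownCommenced {
		l.shutdownCommenced = true
		started := l.startupCommenced // captured under the lock
		go func() {
			if started {
				<-l.ready
			}
			stop()
			close(l.done)
		}()
	}
	return l.done
}

// Network now delegates lifecycle bookkeeping entirely to the manager.
type Network struct {
	lm *LifecycleManager
}

func (n *Network) Ready() <-chan struct{} {
	return n.lm.OnStart(func() { fmt.Println("middleware started") })
}

func (n *Network) Done() <-chan struct{} {
	return n.lm.OnStop(func() { fmt.Println("middleware stopped") })
}

func main() {
	n := &Network{lm: NewLifecycleManager()}
	<-n.Ready()
	<-n.Ready() // no-op
	<-n.Done()
}
```

The private mutex inside the manager directly addresses the concern above: no other code can reach it, so it cannot be repurposed to guard unrelated state such as the topology.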

@huitseeker (Contributor) left a comment.

@synzhu merged commit 208bb5e into master, Jul 26, 2021
@synzhu deleted the smnzhu/synchronize-network branch July 26, 2021 21:46
@synzhu (Contributor, Author) commented Jul 26, 2021:

> […] I strongly recommend considering encapsulation of this synchronization logic for lifecycle management into a separate sub(module). You may do that as either part of this PR or another issue (please feel free to issue it up and assign it to me).

@yhassanzadeh13 #1031
