
[Network] Synchronize ready and done #1026

Merged 10 commits into master on Jul 26, 2021

Conversation

@synzhu (Contributor) commented Jul 24, 2021:

This PR fixes a bug with the existing network implementation.

The middleware.Start method is not idempotent, and calling it twice results in attempting to bind to the same address twice, which results in the error:

{"error":"could not create libp2p node: could not bootstrap libp2p host: could not create libp2p host: failed to listen on any addresses: [listen tcp4 172.25.0.8:2137: bind: address already in use]","time":"2021-07-23T23:39:27Z","message":"failed to start middleware"}

This is a problem because the splitter network relies on an underlying network implementation, and it will wait for the underlying network to be ready inside its own Ready method. However, the underlying network already has its Ready method called as part of the node startup.

Regardless of how many times Ready and Done are called, we should only call middleware.Start and middleware.Stop once.
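One standard way to get this "start and stop at most once" guarantee in Go is `sync.Once`. The sketch below is illustrative, not the actual flow-go implementation; `Middleware` and `bind` are hypothetical stand-ins for the real middleware and its address binding.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Middleware is a hypothetical stand-in for the real network middleware.
// Binding the listen address twice fails, so Start must only run its body
// once, no matter how many times it is called.
type Middleware struct {
	startOnce sync.Once
	bound     bool
}

// bind simulates binding the listen address; a second bind would fail
// with "address already in use".
func (m *Middleware) bind() error {
	if m.bound {
		return errors.New("bind: address already in use")
	}
	m.bound = true
	return nil
}

// Start is made idempotent with sync.Once: only the first call binds.
func (m *Middleware) Start() error {
	var err error
	m.startOnce.Do(func() {
		err = m.bind()
	})
	return err
}

func main() {
	m := &Middleware{}
	fmt.Println(m.Start()) // <nil>
	fmt.Println(m.Start()) // <nil>: second call is a no-op, no double bind
}
```

Without the `sync.Once`, the second `Start` would attempt a second bind and fail exactly as in the log above.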

@codecov-commenter commented:

Codecov Report

Merging #1026 (766b38a) into master (ca72e30) will decrease coverage by 0.11%.
The diff coverage is 0.00%.

❗ Current head 766b38a differs from pull request most recent head 2505433. Consider uploading reports for the commit 2505433 to get more accurate results

@@            Coverage Diff             @@
##           master    #1026      +/-   ##
==========================================
- Coverage   54.85%   54.75%   -0.11%     
==========================================
  Files         286      286              
  Lines       18913    18916       +3     
==========================================
- Hits        10375    10357      -18     
- Misses       7140     7160      +20     
- Partials     1398     1399       +1     
Flag       Coverage Δ
unittests  54.75% <0.00%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
network/p2p/network.go 0.00% <0.00%> (ø)
...sus/approvals/assignment_collector_statemachine.go 40.38% <0.00%> (-9.62%) ⬇️
engine/execution/ingestion/engine.go 57.81% <0.00%> (-1.45%) ⬇️
cmd/util/ledger/migrations/storage_v4.go 41.56% <0.00%> (-0.61%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@huitseeker (Contributor) left a comment:

This looks like a great catch, but just for my understanding:

  • Under what conditions is the middleware's Start called twice?
  • Insofar as it is, why do we use two independent uint flags (rather than one) to check whether it's running? Would we fail a necessary restart, or a shutdown-after-restart, in a situation where started == stopped == 1?

@synzhu (Contributor, Author) commented Jul 26, 2021:

> This looks like a great catch, but just for my understanding:
>
>   • in what conditions is the middleware start called twice?
>   • in so far as it is, why do we use two independent uint flags (rather than one) to check whether it's running? Would we fail a necessary restart, or a shutdown-after-restart in a situation where started == stopped == 1?

@huitseeker The middleware's Start would have been called twice whenever network.Ready was called twice. network.Ready should be idempotent: it always returns a channel that closes (or is already closed) once the network has started up.

The convention here is that a ReadyDoneAware module only needs to support starting and stopping once. In other words, the state transition looks like this:

idle -> starting up -> running -> shutting down -> stopped

There is no way to "restart" (i.e., to get from the "stopped" state back to "starting up"). Instead, one should just create a new instance of the module.

However, thanks for the comment because I've now changed the code to account for some edge cases that I didn't think were necessary before (08c887a, c6ef6bf)

In particular, the logic will now behave as follows:

  • If either Ready or Done is called twice, the second call is a no-op.
  • If Done is called before Ready is ever called, the network will be stopped, and the channel returned by any subsequent calls to Ready will never be closed. In other words, calling Done while in the "idle" state takes us directly to the "stopped" state.
  • If Done is called after Ready but before the ready channel has actually closed (i.e., while in the "starting up" state), then we wait until the ready channel has closed before closing the done channel.
  • As mentioned above, if Done has already been called before Ready is called for the first time, then the startup will not actually be performed, and the returned channel will never close.

@synzhu requested a review from huitseeker, July 26, 2021 18:17
Comment on lines 47 to 50
ready chan struct{}
done chan struct{}
startupCommenced bool
shutdownCommenced bool
A reviewer (Contributor) commented:
Please add godoc for these fields explaining why we need them and what they are used for.

@yhassanzadeh13 (Contributor) left a comment:

Thanks for the fix. Considered as a hotfix to avoid race conditions, the PR looks good. However, I have concerns about the maintainability of the code, as several member fields have been added to the Network solely to manage its start-stop lifecycle. These components (i.e., ready, done, startupCommenced, and shutdownCommenced) could be cleanly encapsulated into a separate (sub)module that manages the lifecycle of the Network.

For example, the current implementation uses the stateTransition lock solely to guard the ready and done channels and the startupCommenced and shutdownCommenced flags, and we must never use that lock for any other purpose. Over time, however, the concept of state transition can be misunderstood or misused by other developers, posing correctness issues for the code; e.g., someone might use stateTransition as the lock for updating the state of the topology, which could cause liveness issues. Hence, I strongly recommend encapsulating this synchronization logic for lifecycle management into a separate (sub)module. You may do that either as part of this PR or as a follow-up issue (please feel free to open one and assign it to me).
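One possible shape for the (sub)module suggested here, sketched with hypothetical names (LifecycleManager, OnStart, OnStop are illustrative, not necessarily what flow-go adopted): the manager owns the channels, flags, and lock, and the Network supplies its start/stop logic as callbacks, so the Network itself carries no lifecycle fields at all.

```go
package main

import (
	"fmt"
	"sync"
)

// LifecycleManager is a hypothetical encapsulation of the lifecycle state
// discussed in the review. It owns the ready/done channels and the flags,
// so the embedding component needs no lifecycle bookkeeping of its own.
type LifecycleManager struct {
	mu                sync.Mutex
	ready             chan struct{}
	done              chan struct{}
	startupCommenced  bool
	shutdownCommenced bool
}

func NewLifecycleManager() *LifecycleManager {
	return &LifecycleManager{
		ready: make(chan struct{}),
		done:  make(chan struct{}),
	}
}

// OnStart runs start exactly once (and never after OnStop has been
// called), then closes the ready channel.
func (l *LifecycleManager) OnStart(start func()) <-chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.startupCommenced && !l.shutdownCommenced {
		l.startupCommenced = true
		go func() {
			start()
			close(l.ready)
		}()
	}
	return l.ready
}

// OnStop runs stop exactly once, after any commenced startup completes,
// then closes the done channel.
func (l *LifecycleManager) OnStop(stop func()) <-chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.shutdownCommenced {
		l.shutdownCommenced = true
		started := l.startupCommenced // captured under the lock
		go func() {
			if started {
				<-l.ready
			}
			stop()
			close(l.done)
		}()
	}
	return l.done
}

// Network now delegates lifecycle bookkeeping entirely to the manager.
type Network struct {
	lm *LifecycleManager
}

func (n *Network) Ready() <-chan struct{} {
	return n.lm.OnStart(func() { fmt.Println("middleware started") })
}

func (n *Network) Done() <-chan struct{} {
	return n.lm.OnStop(func() { fmt.Println("middleware stopped") })
}

func main() {
	n := &Network{lm: NewLifecycleManager()}
	<-n.Ready()
	<-n.Ready() // no-op
	<-n.Done()
}
```

The private mutex inside the manager directly addresses the concern above: no other code can reach it, so it cannot be repurposed to guard unrelated state such as the topology.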

@huitseeker (Contributor) left a comment.

@synzhu merged commit 208bb5e into master, Jul 26, 2021
@synzhu deleted the smnzhu/synchronize-network branch July 26, 2021 21:46
@synzhu (Contributor, Author) commented Jul 26, 2021:

> […] I strongly recommend considering encapsulation of this synchronization logic for lifecycle management into a separate sub(module). You may do that as either part of this PR or another issue (please feel free to issue it up and assign it to me).

@yhassanzadeh13 #1031
