Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[network] Makes mesh-dependent network tests asynchronous #1095

Merged
merged 6 commits into from
Aug 10, 2021

Conversation

huitseeker
Copy link
Contributor

@huitseeker huitseeker commented Aug 6, 2021

The network tests in TestMeshNetTestSuite and TestMiddlewareTestSuit/TestUnsubscribe depend on waiting for mesh formation between libp2p nodes, and use time.Sleep for that, and are therefore flaky (see issue).

The present PR:

  • defines a customized ConnManager in testUtils that captures the tagged Protect events generated by gossipsub at the establishment of a PubSub mesh, using the observable.Observable interface,
  • injects it in test suite setup,
  • registers an observable.Observer instance that waits for the correct number of such events to notice the establishment of the mesh,
  • thereby replacing the time.Sleep instances,
  • does some cleanup (removes one minor race in a mock, quiet down the logs in a test).

Obsoletes #1020
Fixes https://github.com/dapperlabs/flow-go/issues/5710

@huitseeker huitseeker force-pushed the network-races-chunk2 branch 3 times, most recently from d47d81f to 4ffa8f0 Compare August 6, 2021 14:55
@codecov-commenter
Copy link

codecov-commenter commented Aug 6, 2021

Codecov Report

Merging #1095 (cb93248) into master (dd75d07) will increase coverage by 0.03%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1095      +/-   ##
==========================================
+ Coverage   53.14%   53.18%   +0.03%     
==========================================
  Files         321      321              
  Lines       21730    21755      +25     
==========================================
+ Hits        11549    11570      +21     
- Misses       8614     8618       +4     
  Partials     1567     1567              
Flag Coverage Δ
unittests 53.18% <91.66%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
network/p2p/connManager.go 100.00% <ø> (ø)
network/p2p/libp2pNode.go 66.66% <ø> (ø)
network/test/testUtil.go 94.28% <91.17%> (-1.45%) ⬇️
network/test/meshengine.go 69.56% <100.00%> (+2.89%) ⬆️
module/mempool/epochs/transactions.go 94.73% <0.00%> (-5.27%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dd75d07...cb93248. Read the comment docs.

network/test/testUtil.go Outdated Show resolved Hide resolved
type TagLessConnManager interface {
connmgr.ConnManager
// ProtectPeer increments the stream setup count for the peer.ID
ProtectPeer(id peer.ID)
Copy link
Contributor

@vishalchangrani vishalchangrani Aug 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ProtectPeer and UnprotectPeer are only called when a one-to-one stream is established between two nodes (only caller are libP2PNode.CreateStream -> tryCreateNewStream()) It is not called when a node connects to the other node as part of the PeerManager.
Hence, when testing nodes in a purely pub-sub connections, the TagLessConnManager observer will not be notified.

Libp2p nodes heartbeat every one second hence 2 seconds guarantee that peers have discovered each other and formed a mesh (their own tests do something similar)

If we want an event driven test and avoid these sleeps we could explore the libp2p EventBus - https://github.com/libp2p/go-libp2p-core/blob/master/event/network.go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our tagless ProtectPeer and UnprotectPeer are indeed not called in pure pub-sub connections, but even if they were, they're not the calls that the TagWatchingConnManager intercepts and builds an Observable on.

What we are capturing here is the calls made by libp2p's tagtracer in the pubsub package. They call the tag-full Protect (and Unprotect), not the tag-less ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok ! got it..thanks

network/test/meshengine_test.go Outdated Show resolved Hide resolved
network/test/meshengine_test.go Outdated Show resolved Hide resolved
network/test/meshengine_test.go Outdated Show resolved Hide resolved
network/test/meshengine_test.go Outdated Show resolved Hide resolved
network/test/meshengine_test.go Outdated Show resolved Hide resolved
network/test/middleware_test.go Outdated Show resolved Hide resolved
network/test/middleware_test.go Outdated Show resolved Hide resolved
network/test/middleware_test.go Outdated Show resolved Hide resolved
@huitseeker huitseeker force-pushed the network-races-chunk2 branch 3 times, most recently from 7c3987e to 563b65d Compare August 10, 2021 14:39
- updates our tag-less variant of ConnManager to make the interface we're using clear,
- make the functions that use it (bootstraplibP2PNode and associated) take the interface
- add a specialized ConnManager in testUtils which Protect events are observable,
- this allows observing the pubsub network, which emits tagged protect events on mesh connection
- return those on libp2p Node createion (testutils' `GenerateIDs` and dependents)
This makes TestMiddlewareTestSuit/TestUnSubscribe asynchronous, eliminating the race condition
@huitseeker huitseeker merged commit 60b5960 into onflow:master Aug 10, 2021
huitseeker added a commit to huitseeker/flow-go that referenced this pull request Aug 18, 2021
This reverts commit 845a00b.

Reasons for reversal:
- provided impetus for clearing out races (onflow#1073, onflow#1095)
- network tests now pass with race detector on!
- but those  timings are preposterous
huitseeker added a commit to huitseeker/flow-go that referenced this pull request Aug 18, 2021
This reverts commit 845a00b.

Reasons for reversal:
- provided impetus for clearing out races (onflow#1073, onflow#1095)
- network tests now pass with race detector on!
- but those  timings are preposterous
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants