Yahya/5005-fix-flakey-libp2p-test #99
Conversation
golog.SetAllLoggers(golog.LevelDebug)
func (suite *LibP2PNodeTestSuite) SetupTest() {
	suite.logger = zerolog.New(os.Stderr).Level(zerolog.DebugLevel)
	suite.ctx, suite.cancel = context.WithCancel(context.Background())
This is a refactoring of several test suites in the network package: we set their log level to Error to avoid unnecessary debug logs from the internals.
agree 100%
func() {
	_, _ = goodPeers[0].CreateStream(suite.ctx, silentNodeAddress) // this call will block
},
1*time.Second,
These changes basically simplify some test logic using encapsulation; they do not change the logic itself. The encapsulation allowed me to find the root cause easily: the anonymous function and RequireReturnBefore run on two separate goroutines, which are not necessarily synchronized by the Go runtime. So, depending on goroutine scheduling, there were cases where RequireReturnBefore would execute and terminate before the anonymous function of its interest was executed, and it would fail the test. The solution was simply to increase the timeout. However, I wanted to make sure of the root cause before just adding some timeout ad hoc, as that would not solve the flakiness and would also lengthen builds.
does that mean there were times when the CI didn't really kick off the CreateStream goroutine within 5ms? Good catch, and unittest.RequireNeverReturnBefore is definitely a better way of doing the wait than what I had.
Yes, actually this test would even get flaky on my laptop, and as I logged the calls, there were traces where the causality would be violated depending on the number of pending goroutines.
assert.Fail(l.T(), "CreateStream attempt to the unresponsive peer did not block")
default:
}
unittest.RequireNeverClosedWithin(suite.T(), streamCreated, 1*time.Millisecond,
We set the timeout to 1ms since we evaluate that the channel should still be open at this time. Here the causality is important, i.e., the channel should remain open even after creating a stream to a good node.
yup makes sense! - I would just rename streamCreated to blockedCallCh or something that reflects that it is a channel.
require.Equal(suite.T(), msg, rcv)
},
10*time.Second,
fmt.Sprintf("message %s not received", msg))
Replaced the select statement with our test helpers for the sake of readability and to stay DRY.
go pm.updateLoop(wg)
go pm.periodicUpdate(wg)

wg.Wait()
return nil
}
A flakiness root cause: previously, Start was not a blocking call. However, tests running it were assuming otherwise. So there was flakiness due to assuming updateLoop and periodicUpdate are running right after Start returns, which was not always the case.
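The blocking-Start pattern described here can be sketched as follows. This is a minimal illustration under assumptions (the real PeerManager has more fields and logic; only the WaitGroup handshake is the point): each goroutine calls wg.Done as soon as it begins executing, so Start does not return until both have been scheduled.

```go
package main

import (
	"fmt"
	"sync"
)

// PeerManager is a stripped-down stand-in for the type under discussion.
type PeerManager struct {
	quit chan struct{}
}

func (pm *PeerManager) updateLoop(wg *sync.WaitGroup) {
	wg.Done() // signal that this goroutine is now running
	<-pm.quit // then block, as the real loop does in its select
}

func (pm *PeerManager) periodicUpdate(wg *sync.WaitGroup) {
	wg.Done() // signal that this goroutine is now running
	<-pm.quit
}

// Start blocks until both worker goroutines have begun executing.
func (pm *PeerManager) Start() error {
	wg := &sync.WaitGroup{}
	wg.Add(2)
	go pm.updateLoop(wg)
	go pm.periodicUpdate(wg)
	wg.Wait() // Start returns only after both goroutines were scheduled
	return nil
}

func main() {
	pm := &PeerManager{quit: make(chan struct{})}
	if err := pm.Start(); err != nil {
		panic(err)
	}
	fmt.Println("both goroutines running")
	close(pm.quit)
}
```

Note the reviewer comment below argues this handshake is roughly equivalent to not waiting at all, and that the real fix is what the goroutines do once forced to run.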
Also, per conversation with @vishalchangrani, we plan to make PeerManager a ReadyDoneAware module: https://github.com/dapperlabs/flow-go/issues/5011
I'm a bit uneasy with this as the resolution of the flakiness. Both these goroutines (with an exception in the next paragraph) block immediately after they begin running by entering a select statement. As far as I understand the runtime's scheduling behaviour, this goroutine state (blocked on a select statement after having been previously scheduled to run on a thread -- let's call it waiting_after_initial_execution) is more or less equivalent to the state where the goroutine has been created but not yet scheduled to run on a thread (let's call it waiting_before_initial_execution). Using the waitgroup here essentially forces these two goroutines into waiting_after_initial_execution rather than waiting_before_initial_execution.

The exception I mentioned earlier is essentially that by forcing these goroutines to be executed, for the periodicUpdate method in particular, we also force it to invoke pm.RequestPeerUpdate(). I suspect that this might be what is really addressing the flakiness. If this is true I would suggest we simply directly invoke pm.RequestPeerUpdate() in the main Start method and remove the waitgroup altogether.
Fixed in c294613
func (l *LibP2PNodeTestSuite) SetupTest() {
	l.logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr}).With().Caller().Logger()
	l.ctx, l.cancel = context.WithCancel(context.Background())
	golog.SetAllLoggers(golog.LevelDebug)
Yahya - I left this in to enable libp2p logging; otherwise I keep forgetting how to turn it on. We can keep this line but just set its level to Error as well.
fixed in 4ba7992.
_, _ = goodPeers[0].CreateStream(l.ctx, silentNodeAddress) // this call will block
close(ch)
listener, silentNodeAddress := newSilentNode(suite.T())
defer func() {
wouldn't this work instead? defer require.NoError(suite.T(), listener.Close())
bad suggestion :( sry abt that (a deferred call's arguments are evaluated when the defer statement executes, so listener.Close() would run immediately rather than at function exit)
thanks, fixed in cc27eb3.
assert.NoError(ts.T(), err)
assert.Eventually(ts.T(), func() bool {
	return connector.AssertNumberOfCalls(ts.T(), "ConnectPeers", 2)
}, 2*PeerUpdateInterval+4*time.Millisecond, 2*PeerUpdateInterval)
A root cause of flakiness: in this test we want to check whether we receive at least two calls on ConnectPeers. However, the syntax checks for exactly two calls. Due to the asynchrony between assert.Eventually and the peer update goroutine, it was sometimes the case that the method got called 3 times, and hence the test would fail.
connector.On("ConnectPeers", suite.ctx, testifymock.Anything).Run(func(args testifymock.Arguments) {
	if count < times {
		count++
		wg.Done()
To avoid a negative value on the WaitGroup.
From your comment below:

Due to the asynchrony among assert.Eventually and peer update goroutine...

Will there be multiple goroutines concurrently calling ConnectPeers? If so I think we need to avoid concurrent writes to count in the Run function.
Fixed in cc4fad9.
// wait for the first periodic update initiated after start to complete
assert.Eventually(ts.T(), func() bool {
	return connector.AssertNumberOfCalls(ts.T(), "ConnectPeers", 1)
}, 10*time.Millisecond, 1*time.Millisecond)
Not exactly the same cause but similar to the above.
@@ -422,11 +407,11 @@ func (l *LibP2PNodeTestSuite) TestCreateStreamIsConcurrencySafe() {
close(gate)

// no call should block
unittest.AssertReturnsBefore(l.T(), wg.Wait, 10*time.Millisecond)
unittest.AssertReturnsBefore(suite.T(), wg.Wait, 10*time.Second)
👍
network/gossip/libp2p/peerManager.go
Outdated
func (pm *PeerManager) Start() error {
	go pm.updateLoop()
	go pm.periodicUpdate()
	wg := &sync.WaitGroup{}
❤️
})
}

// TestPeriodicPeerUpdate tests that the peermanager runs periodically
func (ts *PeerManagerTestSuite) TestPeriodicPeerUpdate() {
// TestPeriodicPeerUpdate tests that the peer manager runs periodically
👍
wg := &sync.WaitGroup{} // keeps track of number of calls on `ConnectPeers`
// we expect it to be called twice, i.e.,
// one for periodic update and one for the on-demand request
count, times := 0, 2
I think in this test we should not follow this pattern, since here we want to make sure there were never more than 2 calls made. In this test I was trying to verify that even if there are multiple concurrent RequestPeerUpdate calls made, only one is executed and the others are no-ops. If the count goes beyond 2, then RequestPeerUpdate does not meet that expectation.

Here we can be assured that regardless of how the CI runs these routines, the blocked ConnectPeers return will make sure that only one call returns.
thanks, fixed in af9e5d6.
Co-authored-by: Vishal <1117327+vishalchangrani@users.noreply.github.com>
lgtm
Co-authored-by: Vishal <1117327+vishalchangrani@users.noreply.github.com>
Thanks for tackling this, and nice work 👍
…t' into yahya/5005-fix-flakey-libp2p-test
A couple final suggestions. Looks good!
network/gossip/libp2p/peerManager.go
Outdated
case <-ticker.C:
	pm.RequestPeerUpdate()
case <-pm.ctx.Done():
case <-pm.unit.Ctx().Done():
Better to use unit.Quit() here. The fact that it uses a context under the covers is more of an implementation detail IMO.
Fixed in 14e5a0b
network/gossip/libp2p/middleware.go
Outdated
if err != nil {
	return fmt.Errorf("failed to start peer manager: %w", err)
}
<-m.peerManager.Ready()
We should create a timeout for these if there isn't already a timeout further up the stack while initializing this component, i.e.:

select {
case <-m.peerManager.Ready():
case <-time.After(time.Second * 30):
	// error: timed out waiting for the peer manager to become ready
}
Fixed in 14e5a0b
This PR addresses fixes to tests on the network layer that would turn flaky over the long run. Three flaky tests were identified and fixed; root causes are indicated in comments on this PR. Some small code hammering and encapsulation has also been done to support DRY coding and better readability of the tests.