
Yahya/5005-fix-flakey-libp2p-test #99

Merged: 41 commits into master from yahya/5005-fix-flakey-libp2p-test, Oct 30, 2020

Conversation

@yhassanzadeh13 (Contributor) commented Oct 28, 2020:

This PR fixes tests in the network layer that turned flaky over the long run. Three flaky tests were identified and fixed; the root causes are indicated as comments on this PR. Some small refactoring and encapsulation has also been done to keep the code DRY and improve test readability.

golog.SetAllLoggers(golog.LevelDebug)
func (suite *LibP2PNodeTestSuite) SetupTest() {
suite.logger = zerolog.New(os.Stderr).Level(zerolog.DebugLevel)
suite.ctx, suite.cancel = context.WithCancel(context.Background())
Contributor Author:

This is a refactoring of several test suites in the network package: we set their log level to error to avoid unnecessary debug logs from the internals.
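
A minimal sketch of the refactored setup described in this comment, assuming zerolog's ErrorLevel and testify's suite package; the real test suite in the repo has more fields than shown here:

package p2p_test

import (
	"context"
	"os"
	"testing"

	"github.com/rs/zerolog"
	"github.com/stretchr/testify/suite"
)

// Only the fields relevant to this comment are shown.
type LibP2PNodeTestSuite struct {
	suite.Suite
	logger zerolog.Logger
	ctx    context.Context
	cancel context.CancelFunc
}

// SetupTest builds an error-level logger so the networking internals do not
// flood the test output with debug logs.
func (suite *LibP2PNodeTestSuite) SetupTest() {
	suite.logger = zerolog.New(os.Stderr).Level(zerolog.ErrorLevel)
	suite.ctx, suite.cancel = context.WithCancel(context.Background())
}

func TestLibP2PNode(t *testing.T) {
	suite.Run(t, new(LibP2PNodeTestSuite))
}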

Contributor:

agree 100%

func() {
_, _ = goodPeers[0].CreateStream(suite.ctx, silentNodeAddress) // this call will block
},
1*time.Second,
Contributor Author:

These changes simply encapsulate some of the test logic; they do not change it. The encapsulation allowed me to find the root cause easily: the anonymous function and RequireReturnBefore run on two separate goroutines, which are not necessarily synchronized by the Go runtime. Depending on how the goroutines were scheduled, there were cases where RequireReturnBefore would execute and terminate before the anonymous function of interest ran, which failed the test. The solution was simply to increase the timeout. However, I wanted to be sure of the root cause before ad hoc bumping the timeout, as that alone would not resolve the flakiness and would also make the build longer.
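
To make the discussion concrete, here is a minimal sketch of the pattern in question; the helper name and signature are illustrative stand-ins for the repo's unittest helpers, not their actual implementation:

package example

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// requireReturnsBefore runs f in its own goroutine and fails the test if f
// has not returned before the timeout. Because the goroutine running f and
// the caller are scheduled independently by the Go runtime, the timeout must
// leave enough slack for that goroutine to actually get scheduled and run.
func requireReturnsBefore(t *testing.T, f func(), timeout time.Duration, msg string) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		f()
	}()
	select {
	case <-done:
	case <-time.After(timeout):
		require.Fail(t, msg)
	}
}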

Contributor:

Does that mean there were times when the CI didn't really kick off the CreateStream goroutine within 5ms?
Good catch, and unittest.RequireNeverReturnBefore is definitely a better way of doing the wait than what I had.

Contributor Author:

Yes, this test would even get flaky on my laptop. When I logged the calls, there were traces where the causality was violated, depending on the number of pending goroutines.

assert.Fail(l.T(), "CreateStream attempt to the unresponsive peer did not block")
default:
}
unittest.RequireNeverClosedWithin(suite.T(), streamCreated, 1*time.Millisecond,
Contributor Author:

We set the timeout to 1ms since we evaluate that the channel should still be open at this time. Here the causality is important, i.e., the channel should remain open even after creating a stream to a good node.
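
For reference, a minimal sketch of the "never closed within" check being used here; the name and signature are illustrative, not the actual unittest helper:

package example

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// requireNeverClosedWithin asserts that ch is still open after the given
// window has elapsed, i.e. the blocked CreateStream call to the silent node
// has not (wrongly) returned and closed the channel.
func requireNeverClosedWithin(t *testing.T, ch <-chan struct{}, window time.Duration, msg string) {
	select {
	case <-ch:
		require.Fail(t, msg)
	case <-time.After(window):
		// the channel stayed open for the whole window, as expected
	}
}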

Contributor:

yup makes sense! - I would just rename streamCreated to blockedCallCh or something that reflects that it is a channel.

require.Equal(suite.T(), msg, rcv)
},
10*time.Second,
fmt.Sprintf("message %s not received", msg))
Contributor Author:

Replaced the select statement with our test helpers for the sake of readability, and to keep the code DRY.

go pm.updateLoop(wg)
go pm.periodicUpdate(wg)

wg.Wait()
return nil
}
Contributor Author:

A flakiness root cause: previously, Start was not a blocking call, but the tests running it assumed otherwise. The flakiness came from assuming updateLoop and periodicUpdate are running right after Start returns, which was not always the case.
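
A sketch of the blocking Start shown in this diff, using stripped-down stand-in types rather than the real PeerManager:

package example

import "sync"

type peerManager struct {
	done chan struct{}
}

// Start launches the worker goroutines and only returns once both have
// signalled, via the WaitGroup, that they are actually running, so callers
// can safely assume updateLoop and periodicUpdate have started.
func (pm *peerManager) Start() error {
	wg := &sync.WaitGroup{}
	wg.Add(2)
	go pm.updateLoop(wg)
	go pm.periodicUpdate(wg)
	wg.Wait()
	return nil
}

func (pm *peerManager) updateLoop(wg *sync.WaitGroup) {
	wg.Done() // signal that the loop has started
	<-pm.done // placeholder for the real select loop
}

func (pm *peerManager) periodicUpdate(wg *sync.WaitGroup) {
	wg.Done() // signal that the loop has started
	<-pm.done // placeholder for the real ticker loop
}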

Contributor Author:

Also, per conversation with @vishalchangrani we plan to make PeerManager a ReadyDoneAware module: https://github.com/dapperlabs/flow-go/issues/5011

Member:

I'm a bit uneasy with this as the resolution of the flakiness. Both these goroutines (with an exception in the next paragraph) block immediately after they begin running by entering a select statement. As far as I understand the runtime's scheduling behaviour, this goroutine state (blocked on a select statement after having been previously scheduled to run on a thread -- let's call it waiting_after_initial_execution) is more or less equivalent to the state where the goroutine has been created but not yet scheduled to run on a thread (let's call it waiting_before_initial_execution). Using the waitgroup here essentially forces these two goroutines into waiting_after_initial_execution rather than waiting_before_initial_execution.

The exception I mentioned earlier is essentially that by forcing these goroutines to be executed, for the periodicUpdate method in particular, we also force it to invoke pm.RequestPeerUpdate(). I suspect that this might be what is really addressing the flakiness. If this is true I would suggest we simply directly invoke pm.RequestPeerUpdate() in the main Start method and remove the waitgroup altogether.
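
A sketch of what this suggestion would amount to, again with stand-in types; this is not the code that ultimately landed in the PR:

package example

type peerManager struct{}

func (pm *peerManager) updateLoop()        {}
func (pm *peerManager) periodicUpdate()    {}
func (pm *peerManager) RequestPeerUpdate() {}

// Start spawns the loops without waiting for them and instead triggers the
// first peer update synchronously, so the behaviour the tests rely on is
// guaranteed before Start returns.
func (pm *peerManager) Start() error {
	go pm.updateLoop()
	go pm.periodicUpdate()
	pm.RequestPeerUpdate()
	return nil
}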

Contributor Author:

Fixed in c294613

func (l *LibP2PNodeTestSuite) SetupTest() {
l.logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr}).With().Caller().Logger()
l.ctx, l.cancel = context.WithCancel(context.Background())
golog.SetAllLoggers(golog.LevelDebug)
Contributor:

Yahya - I left this in to enable libp2p logging; otherwise I keep forgetting how to turn it on. We can keep this line but just set its level to Error as well.
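
A sketch of the suggested change, assuming the golog alias already used in this file and that go-log exposes LevelError alongside the LevelDebug constant above:

package p2p_test

import golog "github.com/ipfs/go-log"

// Keep the call so libp2p logging stays easy to find, but raise the level so
// the internals are quiet unless something goes wrong.
func init() {
	golog.SetAllLoggers(golog.LevelError)
}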

Contributor Author:

fixed in 4ba7992.

_, _ = goodPeers[0].CreateStream(l.ctx, silentNodeAddress) // this call will block
close(ch)
listener, silentNodeAddress := newSilentNode(suite.T())
defer func() {
Contributor:

Wouldn't this work instead? defer require.NoError(suite.T(), listener.Close())
Bad suggestion :( sorry about that (a deferred call evaluates its arguments immediately, so listener.Close() would run right away).

Contributor Author:

thanks, fixed in cc27eb3.

assert.NoError(ts.T(), err)
assert.Eventually(ts.T(), func() bool {
return connector.AssertNumberOfCalls(ts.T(), "ConnectPeers", 2)
}, 2*PeerUpdateInterval+4*time.Millisecond, 2*PeerUpdateInterval)
Contributor Author:

A root cause of flakiness: in this test we want to check that we receive at least two calls to ConnectPeers. However, the assertion checks for exactly two calls. Due to the asynchrony between assert.Eventually and the peer update goroutine, the method was sometimes called 3 times, and hence the test would fail.
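
To make the "at least two calls" intent concrete, a self-contained sketch with a hypothetical mock and test name (not the PR's code): the mock counts calls itself and assert.Eventually polls for a lower bound rather than an exact count.

package example

import (
	"sync/atomic"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	testifymock "github.com/stretchr/testify/mock"
)

type connectorMock struct{ testifymock.Mock }

func (c *connectorMock) ConnectPeers() { c.Called() }

func TestAtLeastTwoConnectPeersCalls(t *testing.T) {
	var calls int64
	connector := &connectorMock{}
	connector.On("ConnectPeers").Run(func(args testifymock.Arguments) {
		atomic.AddInt64(&calls, 1) // count every call, even a third one
	})

	// stand-ins for the periodic and the on-demand peer updates
	go connector.ConnectPeers()
	go connector.ConnectPeers()

	// a third call would not fail this check, only a missing one would
	assert.Eventually(t, func() bool {
		return atomic.LoadInt64(&calls) >= 2
	}, time.Second, 10*time.Millisecond)
}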

connector.On("ConnectPeers", suite.ctx, testifymock.Anything).Run(func(args testifymock.Arguments) {
if count < times {
count++
wg.Done()
Contributor Author:

To avoid a negative counter on the WaitGroup.

Member:

From your comment below:

Due to the asynchrony among assert.Eventually and peer update goroutine...

Will there be multiple goroutines concurrently calling ConnectPeers? If so I think we need to avoid concurrent writes to count in the Run function.
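
One way to make the counter race-free, sketched with an illustrative helper name and argument matchers; this is not the fix that was actually committed:

package example

import (
	"sync"

	testifymock "github.com/stretchr/testify/mock"
)

// registerConnectPeers guards the shared counter and the WaitGroup
// bookkeeping with a mutex, so concurrent ConnectPeers calls cannot race on
// count or drive the WaitGroup negative.
func registerConnectPeers(connector *testifymock.Mock, wg *sync.WaitGroup, times int) {
	var mu sync.Mutex
	count := 0
	connector.On("ConnectPeers", testifymock.Anything, testifymock.Anything).
		Run(func(args testifymock.Arguments) {
			mu.Lock()
			defer mu.Unlock()
			if count < times {
				count++
				wg.Done() // only the first `times` calls count towards the WaitGroup
			}
		})
}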

Contributor Author:

Fixed in cc4fad9.

// wait for the first periodic update initiated after start to complete
assert.Eventually(ts.T(), func() bool {
return connector.AssertNumberOfCalls(ts.T(), "ConnectPeers", 1)
}, 10*time.Millisecond, 1*time.Millisecond)
Contributor Author:

Not exactly the same cause but similar to the above.

@@ -422,11 +407,11 @@ func (l *LibP2PNodeTestSuite) TestCreateStreamIsConcurrencySafe() {
close(gate)

// no call should block
unittest.AssertReturnsBefore(l.T(), wg.Wait, 10*time.Millisecond)
unittest.AssertReturnsBefore(suite.T(), wg.Wait, 10*time.Second)
Contributor:

👍

func (pm *PeerManager) Start() error {
go pm.updateLoop()
go pm.periodicUpdate()
wg := &sync.WaitGroup{}
Contributor:

❤️

})
}

// TestPeriodicPeerUpdate tests that the peermanager runs periodically
func (ts *PeerManagerTestSuite) TestPeriodicPeerUpdate() {
// TestPeriodicPeerUpdate tests that the peer manager runs periodically
Contributor:

👍

wg := &sync.WaitGroup{} // keeps track of number of calls on `ConnectPeers`
// we expect it to be called twice, i.e.,
// one for periodic update and one for the on-demand request
count, times := 0, 2
Contributor:

I think in this test we should not follow this pattern, since here we want to make sure that there are never more than 2 calls. In this test I was trying to verify that even if multiple concurrent RequestPeerUpdate calls are made, only one is executed and the others are no-ops. If the count goes beyond 2, it means RequestPeerUpdate does not meet that expectation.
Here we can be assured that, regardless of how the CI schedules these routines, the blocked ConnectPeers call ensures that only one call returns.
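
A self-contained sketch of that intent, using stand-ins that mimic (but are not) the real PeerManager: requests are coalesced through a buffered channel, the ConnectPeers stand-in blocks on a gate, and the call count must therefore never exceed one while the gate is held, no matter how many concurrent requests are fired.

package example

import (
	"sync/atomic"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

func TestConcurrentRequestsAreDeduplicated(t *testing.T) {
	var calls int64
	gate := make(chan struct{})
	requests := make(chan struct{}, 1) // capacity 1: pending requests coalesce

	requestPeerUpdate := func() {
		select {
		case requests <- struct{}{}:
		default: // an update is already pending; this request is a no-op
		}
	}

	// worker loop: stand-in for the peer manager's update loop
	go func() {
		for range requests {
			atomic.AddInt64(&calls, 1)
			<-gate // ConnectPeers stand-in: blocks until the test releases it
		}
	}()

	for i := 0; i < 10; i++ {
		go requestPeerUpdate()
	}

	// while the gate is held, the de-duplication must cap the count at one
	assert.Never(t, func() bool {
		return atomic.LoadInt64(&calls) > 1
	}, 100*time.Millisecond, 10*time.Millisecond)

	close(gate)
}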

Contributor Author:

thanks, fixed in af9e5d6.

@yhassanzadeh13 changed the title from "[WIP] Yahya/5005-fix-flakey-libp2p-test" to "Yahya/5005-fix-flakey-libp2p-test" on Oct 28, 2020
@yhassanzadeh13 marked this pull request as ready for review October 28, 2020 23:50
Co-authored-by: Vishal <1117327+vishalchangrani@users.noreply.github.com>
@vishalchangrani (Contributor) left a comment:

lgtm

Co-authored-by: Vishal <1117327+vishalchangrani@users.noreply.github.com>
@jordanschalm (Member) left a comment:

Thanks for tackling this, and nice work 👍

Resolved (now outdated) review threads on network/gossip/libp2p/peerManager_test.go and network/gossip/libp2p/test/testUtil.go.
@jordanschalm (Member) left a comment:

A couple final suggestions. Looks good!

case <-ticker.C:
pm.RequestPeerUpdate()
case <-pm.ctx.Done():
case <-pm.unit.Ctx().Done():
Member:

Better to use unit.Quit() here. The fact that it uses a context under the covers is more of an implementation detail IMO.
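
A sketch of the loop after the suggested change, with minimal stand-ins for the real PeerManager and engine.Unit:

package example

import "time"

type unit struct{ quit chan struct{} }

// Quit returns a channel that is closed when the unit shuts down.
func (u *unit) Quit() <-chan struct{} { return u.quit }

type peerManager struct {
	unit     *unit
	interval time.Duration
}

func (pm *peerManager) RequestPeerUpdate() {}

// periodicUpdate requests a peer update on every tick and exits when the
// unit signals shutdown via Quit, without reaching into the unit's context.
func (pm *peerManager) periodicUpdate() {
	ticker := time.NewTicker(pm.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			pm.RequestPeerUpdate()
		case <-pm.unit.Quit():
			return
		}
	}
}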

Contributor Author:

Fixed in 14e5a0b

if err != nil {
return fmt.Errorf("failed to start peer manager: %w", err)
}
<-m.peerManager.Ready()
Member:

We should create a timeout for these if there isn't already a timeout further up the stack while initializing this component, e.g.:

select {
case <-m.peerManager.Ready():
case <-time.After(30 * time.Second):
	// handle the timeout as an error
}

Contributor Author:

Fixed in 14e5a0b

@yhassanzadeh13 merged commit e1d0ec8 into master Oct 30, 2020
@yhassanzadeh13 deleted the yahya/5005-fix-flakey-libp2p-test branch October 30, 2020 21:01