Yahya/5005-fix-flakey-libp2p-test #99
Conversation
golog.SetAllLoggers(golog.LevelDebug)
func (suite *LibP2PNodeTestSuite) SetupTest() {
	suite.logger = zerolog.New(os.Stderr).Level(zerolog.DebugLevel)
	suite.ctx, suite.cancel = context.WithCancel(context.Background())
This is a refactoring of several test suites in the network package: we set their log level to Error to avoid unnecessary debug logs from the internals.
agree 100%
func() {
	_, _ = goodPeers[0].CreateStream(suite.ctx, silentNodeAddress) // this call will block
},
1*time.Second,
These changes basically simplify some test logic using encapsulation; they do not change the logic itself. The encapsulation allowed me to find the root cause easily: the anonymous function and RequireReturnBefore run on two separate goroutines, which are not necessarily synchronized by the Go runtime. So, depending on goroutine scheduling, there were cases where RequireReturnBefore would execute and terminate before the anonymous function of its interest was executed, and it would fail the test. The solution was simply to increase the timeout. However, I wanted to make sure of the root cause before just adding some timeout ad hoc, as that would not solve the flakiness and would also lengthen builds.
does that mean there were times when the CI didn't really kick off the CreateStream goroutine within 5ms? Good catch, and unittest.RequireNeverReturnBefore is definitely a better way of doing the wait than what I had.
Yes, actually this test would even get flaky on my laptop, and as I logged the calls, there were traces where the causality would be violated depending on the number of pending goroutines.
assert.Fail(l.T(), "CreateStream attempt to the unresponsive peer did not block")
default:
}
unittest.RequireNeverClosedWithin(suite.T(), streamCreated, 1*time.Millisecond,
We set the timeout to 1ms since we evaluate that the channel should still be open at this time. Here the causality is important, i.e., the channel should remain open even after creating a stream to a good node.
yup makes sense! - I would just rename streamCreated to blockedCallCh or something that reflects that it is a channel.
require.Equal(suite.T(), msg, rcv)
},
10*time.Second,
fmt.Sprintf("message %s not received", msg))
Replaced the select statement with our test helpers for the sake of readability and to stay DRY.
go pm.updateLoop(wg)
go pm.periodicUpdate(wg)

wg.Wait()
return nil
}
A flakiness root cause: previously, Start was not a blocking call. However, tests running it were assuming otherwise. So there was flakiness due to assuming updateLoop and periodicUpdate are running right after Start returns, which was not always the case.
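The blocking-Start pattern described here can be sketched as follows. This is a minimal illustration under assumptions (the real PeerManager has more fields and logic; only the WaitGroup handshake is the point): each goroutine calls wg.Done as soon as it begins executing, so Start does not return until both have been scheduled.

```go
package main

import (
	"fmt"
	"sync"
)

// PeerManager is a stripped-down stand-in for the type under discussion.
type PeerManager struct {
	quit chan struct{}
}

func (pm *PeerManager) updateLoop(wg *sync.WaitGroup) {
	wg.Done() // signal that this goroutine is now running
	<-pm.quit // then block, as the real loop does in its select
}

func (pm *PeerManager) periodicUpdate(wg *sync.WaitGroup) {
	wg.Done() // signal that this goroutine is now running
	<-pm.quit
}

// Start blocks until both worker goroutines have begun executing.
func (pm *PeerManager) Start() error {
	wg := &sync.WaitGroup{}
	wg.Add(2)
	go pm.updateLoop(wg)
	go pm.periodicUpdate(wg)
	wg.Wait() // Start returns only after both goroutines were scheduled
	return nil
}

func main() {
	pm := &PeerManager{quit: make(chan struct{})}
	if err := pm.Start(); err != nil {
		panic(err)
	}
	fmt.Println("both goroutines running")
	close(pm.quit)
}
```

Note the reviewer comment below argues this handshake is roughly equivalent to not waiting at all, and that the real fix is what the goroutines do once forced to run.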
Also, per conversation with @vishalchangrani, we plan to make PeerManager a ReadyDoneAware module: https://github.com/dapperlabs/flow-go/issues/5011
I'm a bit uneasy with this as the resolution of the flakiness. Both these goroutines (with an exception in the next paragraph) block immediately after they begin running by entering a select statement. As far as I understand the runtime's scheduling behaviour, this goroutine state (blocked on a select statement after having been previously scheduled to run on a thread -- let's call it waiting_after_initial_execution) is more or less equivalent to the state where the goroutine has been created but not yet scheduled to run on a thread (let's call it waiting_before_initial_execution). Using the waitgroup here essentially forces these two goroutines into waiting_after_initial_execution rather than waiting_before_initial_execution.

The exception I mentioned earlier is essentially that by forcing these goroutines to be executed, for the periodicUpdate method in particular, we also force it to invoke pm.RequestPeerUpdate(). I suspect that this might be what is really addressing the flakiness. If this is true I would suggest we simply directly invoke pm.RequestPeerUpdate() in the main Start method and remove the waitgroup altogether.
Fixed in c294613
func (l *LibP2PNodeTestSuite) SetupTest() {
	l.logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr}).With().Caller().Logger()
	l.ctx, l.cancel = context.WithCancel(context.Background())
	golog.SetAllLoggers(golog.LevelDebug)
Yahya - I left this in to enable libp2p logging; otherwise I keep forgetting how to turn it on. We can keep this line but just set its level to Error as well.
fixed in 4ba7992.
_, _ = goodPeers[0].CreateStream(l.ctx, silentNodeAddress) // this call will block
close(ch)
listener, silentNodeAddress := newSilentNode(suite.T())
defer func() {
wouldn't this work instead? defer require.NoError(suite.T(), listener.Close())
bad suggestion :( sry abt that (a deferred call's arguments are evaluated when the defer statement executes, so listener.Close() would run immediately rather than at function exit)
thanks, fixed in cc27eb3.
assert.NoError(ts.T(), err)
assert.Eventually(ts.T(), func() bool {
	return connector.AssertNumberOfCalls(ts.T(), "ConnectPeers", 2)
}, 2*PeerUpdateInterval+4*time.Millisecond, 2*PeerUpdateInterval)
A root cause of flakiness: in this test we want to check whether we receive at least two calls on ConnectPeers. However, the syntax checks for exactly two calls. Due to the asynchrony between assert.Eventually and the peer update goroutine, it was sometimes the case that the method got called 3 times, and hence the test would fail.
connector.On("ConnectPeers", suite.ctx, testifymock.Anything).Run(func(args testifymock.Arguments) {
	if count < times {
		count++
		wg.Done()
To avoid a negative value on the WaitGroup.
From your comment below:

Due to the asynchrony among assert.Eventually and peer update goroutine...

Will there be multiple goroutines concurrently calling ConnectPeers? If so I think we need to avoid concurrent writes to count in the Run function.
Fixed in cc4fad9.
// wait for the first periodic update initiated after start to complete
assert.Eventually(ts.T(), func() bool {
	return connector.AssertNumberOfCalls(ts.T(), "ConnectPeers", 1)
}, 10*time.Millisecond, 1*time.Millisecond)
Not exactly the same cause but similar to the above.
@@ -422,11 +407,11 @@ func (l *LibP2PNodeTestSuite) TestCreateStreamIsConcurrencySafe() {
close(gate)

// no call should block
unittest.AssertReturnsBefore(l.T(), wg.Wait, 10*time.Millisecond)
unittest.AssertReturnsBefore(suite.T(), wg.Wait, 10*time.Second)
👍
network/gossip/libp2p/peerManager.go
Outdated
func (pm *PeerManager) Start() error {
	go pm.updateLoop()
	go pm.periodicUpdate()
	wg := &sync.WaitGroup{}
❤️
})
}

// TestPeriodicPeerUpdate tests that the peermanager runs periodically
func (ts *PeerManagerTestSuite) TestPeriodicPeerUpdate() {
// TestPeriodicPeerUpdate tests that the peer manager runs periodically
👍
wg := &sync.WaitGroup{} // keeps track of number of calls on `ConnectPeers`
// we expect it to be called twice, i.e.,
// one for periodic update and one for the on-demand request
count, times := 0, 2
I think in this test we should not follow this pattern, since here we want to make sure there were never more than 2 calls made. In this test I was trying to verify that even if there are multiple concurrent RequestPeerUpdate calls made, only one is executed and the others are no-ops. If the count goes beyond 2, then RequestPeerUpdate does not meet that expectation.

Here we can be assured that regardless of how the CI runs these routines, the blocked ConnectPeers return will make sure that only one call returns.
thanks, fixed in af9e5d6.
Co-authored-by: Vishal <1117327+vishalchangrani@users.noreply.github.com>
lgtm
Co-authored-by: Vishal <1117327+vishalchangrani@users.noreply.github.com>
Thanks for tackling this, and nice work 👍
…t' into yahya/5005-fix-flakey-libp2p-test
A couple final suggestions. Looks good!
network/gossip/libp2p/peerManager.go
Outdated
case <-ticker.C:
	pm.RequestPeerUpdate()
case <-pm.ctx.Done():
case <-pm.unit.Ctx().Done():
Better to use unit.Quit() here. The fact that it uses a context under the covers is more of an implementation detail IMO.
Fixed in 14e5a0b
network/gossip/libp2p/middleware.go
Outdated
if err != nil {
	return fmt.Errorf("failed to start peer manager: %w", err)
}
<-m.peerManager.Ready()
We should create a timeout for these if there isn't already a timeout further up the stack while initializing this component, i.e.:

select {
case <-m.peerManager.Ready():
case <-time.After(time.Second * 30):
	// error: timed out waiting for the peer manager to become ready
}
Fixed in 14e5a0b
This PR addresses fixes to tests on the network layer that would turn flaky over the long run. Three flaky tests were identified and fixed; root causes are indicated in comments on this PR. Some small code hammering and encapsulation has also been done to support DRY coding and better readability of the tests.