Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Yahya/5005-fix-flakey-libp2p-test #99
Yahya/5005-fix-flakey-libp2p-test #99
Changes from 9 commits
c7f603a
ddc375f
d1f92e9
236baf2
e807c5b
d8800f7
c54dc5e
3e74856
cf8f6db
e0d6e9b
b3b9441
3b4cb6d
c4a6de8
af9e5d6
b4fe909
fe48531
553b5be
57aa665
dbc419f
7a87076
cc27eb3
0572633
ae48d42
4ba7992
7f73b52
2beb7bf
976fd27
50dc9ab
431ede2
93b0f1b
24a8840
9e2fc91
cc4fad9
0305d6a
c294613
0eac3ad
e32feea
dc96082
2dea2f8
14e5a0b
43390c6
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Large diffs are not rendered by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A flakiness root cause: Previously,
Start
was not a blocking call. However, tests running it were assuming otherwise. So, there was flakiness due to assumingupdateLoop
andperiodicUpdate
are running right afterStart
returns, which was not always the case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, per conversation with @vishalchangrani we plan to make
PeerManager
aReadyDoneAware
module: https://github.com/dapperlabs/flow-go/issues/5011There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit uneasy with this as the resolution of the flakiness. Both these goroutines (with an exception in the next paragraph) block immediately after they begin running by entering a
select
statement. As far as I understand the runtime's scheduling behaviour, this goroutine state (blocked on aselect
statement after having been previously scheduled to run on a thread -- let's call itwaiting_after_initial_execution
) is more or less equivalent to the state where the goroutine has been created but not yet scheduled to run on a thread (let's call itwaiting_before_initial_execution
). Using the waitgroup here essentially forces these two goroutines intowaiting_after_initial_execution
rather thanwaiting_before_initial_execution
.The exception I mentioned earlier is essentially that by forcing these goroutines to be executed, for the
periodicUpdate
method in particular, we also force it to invokepm.RequestPeerUpdate()
. I suspect that this might be what is really addressing the flakiness. If this is true I would suggest we simply directly invokepm.RequestPeerUpdate()
in the mainStart
method and remove the waitgroup altogether.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in c294613
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid negative value on
WaitGroup
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From your comment below:
Will there be multiple goroutines concurrently calling
ConnectPeers
? If so I think we need to avoid concurrent writes tocount
in theRun
function.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in cc4fad9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A root cause of flakiness: in this test we are checking whether we receive at least two calls on
ConnectPeers
or not? However, the syntax checks for the exact two calls on it. Due to the asynchrony amongassert.Eventually
and peer update goroutine, it was sometimes the case that the method was getting called 3 times, and hence would fail the test.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly the same cause but similar to the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this test we should not follow this pattern since in this we want to make sure that there were never more than 2 calls made. In this test I was trying to test that even if there are multiple concurrent
RequestPeerUpdate
calls made, only one is executed and the others are noop. If the count goes beyond 2 then it means thatRequestPeerUpdate
does not meet the expectation.Here we can be assured that regardless of how the CI runs these routines, the blocked
ConnectPeers
return will make sure that only one call returns.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed in af9e5d6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An extra return could also mean that we have either:
periodic run
,first on demand run
ORperiodic run
,first on demand
,second on demand
.the latter should never happen. Hence, I would just let everyone block and then just open the gate once and count the calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, fixed in ae48d42.