-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat #339: faster tests #350
Conversation
dac6563
to
7c4e348
Compare
4fd234b
to
8ce1c62
Compare
Discovered that random test failures in our tests mostly come from libp2p "dial backoff" when bootstrapping test peers in parallel. Once a dial backoff happens, libp2p mechanisms prevent any retry for at least 5 seconds (hardcoded in libp2p-swarm). Before, we got a second chance because we slept for longer, but now it was making at least one or two tests fail. Starting peers sequentially with a small sleep between them seems to help. |
8ce1c62
to
f9680f2
Compare
Seems travis consistently passes now, and Jenkins mostly fails on windows only. I'd like to merge #349 before I keep trying. Creating the hosts and connecting them before creating the peers might do the trick resolve the problems. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great to see improvements in test times and more thought being put into precisely why we need timeouts. As it stands right now the delays in the testing code are not significantly more transparent after this PR.
It would be great to see some comments that help new contributors understand why delays are chosen to be in a certain range and maybe even a principled guide at the top of a test file or in a README in the test subpackage explaining why delays are necessary and providing a little motivation for the values we are currently using. Even if some values we are using are rough guesses or estimates it would be helpful to communicate the state of our understanding of the values we use. In this document we could talk about libp2p backoffs and how they lead to issues with bootstrapping, and also why waiting for leader is important before executing various cluster operations.
ci/Jenkinsfile
Outdated
@@ -1,2 +1,2 @@ | |||
golang([test: "go test -v -timeout 20m ./..."]) | |||
golang([test: "go test -v -loglevel ERROR ./..."]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why have a different loglevel between jenkins and travis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temporary
cluster_config.go
Outdated
@@ -105,10 +106,16 @@ type Config struct { | |||
// possible. | |||
ReplicationFactorMin int | |||
|
|||
// MonitorPingInterval is frequency by which a cluster peer pings the | |||
// monitoring component. The ping metric has a TTL set to the double | |||
// MonitorPingInterval is the frequency by which a cluster peer pings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very minor typo but "frequency with which a cluster..." is more idiomatic
cluster_config.go
Outdated
// of this value. | ||
MonitorPingInterval time.Duration | ||
|
||
// PeerWatchInterval is the frequency that we watch for changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again minor typo but "frequency we use to watch for changes" or "frequency with which we watch for changes" is more correct.
@@ -7,9 +7,9 @@ for dir in $dirs; | |||
do | |||
if ls "$dir"/*.go &> /dev/null; | |||
then | |||
cmdflags="-timeout 20m -v -coverprofile=profile.out -covermode=count $dir" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using this script anymore? From the makefile it looks like we are just calling go test
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, travis.yml
ipfscluster_test.go
Outdated
|
||
// Start the rest | ||
var wg sync.WaitGroup | ||
// Doing this in parallel causes libp2p dial backoffs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like this changed from parallel to sequential which also makes sense in terms of avoiding collisions and backoffs. Is the comment off or am I misinterpreting the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, the comment just explains why it's not in parallel. ill change it
ipfscluster_test.go
Outdated
// Sleep a monitoring interval | ||
time.Sleep(6 * time.Second) | ||
delay() | ||
waitForLeader(t, clusters) // incase we killed the leader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in case
Jenkins failures are only on windows My comment from above still stands though. It would be great to
Let me know if you disagree with this. Perhaps this implicit knowledge is already encoded well enough in the existing comments to enable new developers and I'm just not looking in the right places. |
f28fe3f
to
9fa4aa0
Compare
e605bec
to
028384b
Compare
@ZenGround0 if it goes all green then I'm done here. I have added a note about delays. Mostly they are a result of testing and jenkins/travis not passing. It would be good to have extensive developer documentation about tests in general, but I cannot write a full guide right now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
@@ -276,6 +275,21 @@ func runF(t *testing.T, clusters []*Cluster, f func(*testing.T, *Cluster)) { | |||
wg.Wait() | |||
} | |||
|
|||
////////////////////////////////////// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome this is just what I was looking for.
It should provide a way to speed up peer list updates when peers join/part. It was hardcoded. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
4ed8a97
to
47afd2c
Compare
License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
47afd2c
to
0069c00
Compare
travis passes very consistently and jenkins more or less. So same as before, except everything takes 1/3 of the time at worst. Merging! |
This reduces main package test times from +15 minutes to ~4 in my machine. Let's see how it fares in travis and jenkins.
#339