Feat #339: faster tests #350

hsanjuan · 2018-03-16T16:58:50Z

This reduces main package test times from over 15 minutes to about 4 on my machine. Let's see how it fares on Travis and Jenkins.

#339

coveralls · 2018-03-16T17:54:21Z

Coverage decreased (-0.005%) to 67.4% when pulling 0069c00 on feat/339-faster-tests into 95ae174 on master.

hsanjuan · 2018-03-20T13:17:49Z

I discovered that the random failures in our tests mostly come from libp2p "dial backoff" when bootstrapping test peers in parallel. Once a dial backoff happens, libp2p mechanisms prevent any retry for at least 5 seconds (hardcoded in libp2p-swarm). Previously we got a second chance because we slept for longer than that; with the shorter sleeps, the backoff was making at least one or two tests fail. Starting peers sequentially with a small sleep between them seems to help.

libp2p/go-libp2p#1549

hsanjuan · 2018-03-20T14:01:20Z

Seems travis consistently passes now, and Jenkins mostly fails on windows only.

I'd like to merge #349 before I keep trying. Creating the hosts and connecting them before creating the peers might do the trick and resolve the problems.
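The ordering proposed here (create all hosts, connect them, and only then build peers on top) can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `fakeHost` and `connectAll` are hypothetical stand-ins, and no real libp2p API is used.

```go
package main

import "fmt"

// fakeHost is a hypothetical stand-in for a libp2p host. It only records
// which other hosts it has been connected to.
type fakeHost struct {
	id    int
	conns map[int]bool
}

// connectAll links every pair of hosts exactly once, one pair at a time,
// so that no two connection attempts run concurrently.
func connectAll(hosts []*fakeHost) {
	for i, a := range hosts {
		for _, b := range hosts[i+1:] {
			a.conns[b.id] = true
			b.conns[a.id] = true
		}
	}
}

func main() {
	// Step 1: create every host up front.
	hosts := make([]*fakeHost, 3)
	for i := range hosts {
		hosts[i] = &fakeHost{id: i, conns: map[int]bool{}}
	}

	// Step 2: connect the hosts before any peer exists.
	connectAll(hosts)

	// Step 3: only now would the cluster peers be created on top of the
	// already-connected hosts (omitted here).
	for _, h := range hosts {
		fmt.Printf("host %d has %d connections\n", h.id, len(h.conns))
	}
}
```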

ZenGround0

It's great to see improvements in test times and more thought being put into precisely why we need timeouts. As it stands right now the delays in the testing code are not significantly more transparent after this PR.

It would be great to see some comments that help new contributors understand why delays are chosen to be in a certain range, and maybe even a principled guide at the top of a test file or in a README in the test subpackage explaining why delays are necessary and providing a little motivation for the values we are currently using. Even if some of those values are rough guesses or estimates, it would be helpful to communicate the state of our understanding of them. In this document we could talk about libp2p backoffs and how they lead to issues with bootstrapping, and also why waiting for a leader is important before executing various cluster operations.

ZenGround0 · 2018-03-21T14:34:44Z

ci/Jenkinsfile

@@ -1,2 +1,2 @@
-golang([test: "go test -v -timeout 20m ./..."])
+golang([test: "go test -v -loglevel ERROR ./..."])


why have a different loglevel between jenkins and travis?

ZenGround0 · 2018-03-21T14:36:15Z

cluster_config.go

@@ -105,10 +106,16 @@ type Config struct {
 	// possible.
 	ReplicationFactorMin int

-	// MonitorPingInterval is frequency by which a cluster peer pings the
-	// monitoring component. The ping metric has a TTL set to the double
+	// MonitorPingInterval is the frequency by which a cluster peer pings


very minor typo but "frequency with which a cluster..." is more idiomatic

ZenGround0 · 2018-03-21T14:37:35Z

cluster_config.go

 	// of this value.
 	MonitorPingInterval time.Duration
+
+	// PeerWatchInterval is the frequency that we watch for changes


again minor typo but "frequency we use to watch for changes" or "frequency with which we watch for changes" is more correct.

ZenGround0 · 2018-03-21T14:42:42Z

coverage.sh

@@ -7,9 +7,9 @@ for dir in $dirs;
 do
        if ls "$dir"/*.go &> /dev/null;
        then
-            cmdflags="-timeout 20m -v -coverprofile=profile.out -covermode=count $dir"


Are we using this script anymore? From the makefile it looks like we are just calling go test directly?

Yeah, it's used in travis.yml.

ZenGround0 · 2018-03-21T14:45:51Z

ipfscluster_test.go


 	// Start the rest
-	var wg sync.WaitGroup
+	// Doing this in parallel causes libp2p dial backoffs


it looks like this changed from parallel to sequential which also makes sense in terms of avoiding collisions and backoffs. Is the comment off or am I misinterpreting the code?

Oh, the comment just explains why it's not done in parallel. I'll change it.

ZenGround0 · 2018-03-21T14:52:34Z

ipfscluster_test.go

-	// Sleep a monitoring interval
-	time.Sleep(6 * time.Second)
+	delay()
+	waitForLeader(t, clusters) // incase we killed the leader


ZenGround0 · 2018-03-22T15:16:44Z

Jenkins failures are only on windows

My comment from above still stands though. It would be great to:

- write up the implicit knowledge encoded in testing wait intervals and the reason for doing particular waits before/after certain operations. I'm thinking of this as the beginning of "A guide to testing in ipfs-cluster".
- find a good place for this documentation. comments in testfile? readme in test folder? document in docs folder?

I'm ok with merging this as is without documenting these things now, but I think they will be very helpful for contributors starting out in the repo, particularly because contributors so often start with testing. Maybe we could make an issue?

Let me know if you disagree with this. Perhaps this implicit knowledge is already encoded well enough in the existing comments to enable new developers and I'm just not looking in the right places.

hsanjuan · 2018-03-29T20:33:37Z

@ZenGround0 if it goes all green then I'm done here.

I have added a note about delays. Mostly the values are the result of experimentation, adjusted whenever Jenkins/Travis were not passing. It would be good to have extensive developer documentation about tests in general, but I cannot write a full guide right now.

ZenGround0

LGTM. Thanks!

ZenGround0 · 2018-03-29T20:49:08Z

ipfscluster_test.go

@@ -276,6 +275,21 @@ func runF(t *testing.T, clusters []*Cluster, f func(*testing.T, *Cluster)) {
 	wg.Wait()
 }

+//////////////////////////////////////


Awesome this is just what I was looking for.

hsanjuan · 2018-04-05T16:02:34Z

travis passes very consistently and jenkins more or less. So same as before, except everything takes 1/3 of the time at worst. Merging!

hsanjuan self-assigned this Mar 16, 2018

ghost added the status/in-progress In progress label Mar 16, 2018

hsanjuan force-pushed the feat/339-faster-tests branch from dac6563 to 7c4e348 Compare March 16, 2018 18:41

hsanjuan changed the title ~~Feat/339 faster tests~~ Feat #339: faster tests Mar 16, 2018

hsanjuan force-pushed the feat/339-faster-tests branch 7 times, most recently from 4fd234b to 8ce1c62 Compare March 20, 2018 10:57

hsanjuan force-pushed the feat/339-faster-tests branch from 8ce1c62 to f9680f2 Compare March 20, 2018 13:34

hsanjuan added status/blocked Unable to be worked further until needs are met and removed status/in-progress In progress labels Mar 20, 2018

ZenGround0 requested changes Mar 21, 2018

ghost added the status/in-progress In progress label Mar 22, 2018

ZenGround0 approved these changes Mar 22, 2018

hsanjuan force-pushed the feat/339-faster-tests branch from f28fe3f to 9fa4aa0 Compare March 29, 2018 18:18

hsanjuan removed the status/blocked Unable to be worked further until needs are met label Mar 29, 2018

hsanjuan force-pushed the feat/339-faster-tests branch from e605bec to 028384b Compare March 29, 2018 20:18

ZenGround0 approved these changes Mar 29, 2018

hsanjuan added 4 commits April 5, 2018 16:49

cluster: introduce PeerWatchInterval config option.

58acf16

It should provide a way to speed up peer list updates when peers join/part. It was hardcoded. License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Fix #339: Reduce Sleeps in tests

dd4128a

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Pre-create and pre-connect hosts in tests

c73e540

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

Add some clarifications about delays

f5f56f2

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

hsanjuan force-pushed the feat/339-faster-tests branch 2 times, most recently from 4ed8a97 to 47afd2c Compare April 5, 2018 15:22

Fix metric expire type. Do not discard metrics in Allocate().

0069c00

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>

hsanjuan force-pushed the feat/339-faster-tests branch from 47afd2c to 0069c00 Compare April 5, 2018 15:57

hsanjuan mentioned this pull request Apr 5, 2018

Release 0.4.0 #372

Closed

hsanjuan merged commit da0915a into master Apr 5, 2018

ghost removed the status/in-progress In progress label Apr 5, 2018

hsanjuan deleted the feat/339-faster-tests branch April 5, 2018 16:16
