Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat #339: faster tests #350

Merged
merged 5 commits into from
Apr 5, 2018
Merged

Feat #339: faster tests #350

merged 5 commits into from
Apr 5, 2018

Conversation

hsanjuan
Copy link
Collaborator

@hsanjuan hsanjuan commented Mar 16, 2018

This reduces main package test times from +15 minutes to ~4 in my machine. Let's see how it fares in travis and jenkins.

#339

@hsanjuan hsanjuan self-assigned this Mar 16, 2018
@ghost ghost added the status/in-progress In progress label Mar 16, 2018
@coveralls
Copy link

coveralls commented Mar 16, 2018

Coverage Status

Coverage decreased (-0.005%) to 67.4% when pulling 0069c00 on feat/339-faster-tests into 95ae174 on master.

@hsanjuan hsanjuan changed the title Feat/339 faster tests Feat #339: faster tests Mar 16, 2018
@hsanjuan hsanjuan force-pushed the feat/339-faster-tests branch 7 times, most recently from 4fd234b to 8ce1c62 Compare March 20, 2018 10:57
@hsanjuan
Copy link
Collaborator Author

Discovered that random test failures in our tests mostly come from libp2p "dial backoff" when bootstrapping test peers in parallel. Once a dial backoff happens, libp2p mechanisms prevent any retry for at least 5 seconds (hardcoded in libp2p-swarm). Before, we got a second chance because we slept for longer, but now it was making at least one or two tests fail. Starting peers sequentially with a small sleep between them seems to help.

libp2p/go-libp2p#1549

@hsanjuan
Copy link
Collaborator Author

Seems travis consistently passes now, and Jenkins mostly fails on windows only.

I'd like to merge #349 before I keep trying. Creating the hosts and connecting them before creating the peers might do the trick resolve the problems.

@hsanjuan hsanjuan added status/blocked Unable to be worked further until needs are met and removed status/in-progress In progress labels Mar 20, 2018
Copy link
Collaborator

@ZenGround0 ZenGround0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great to see improvements in test times and more thought being put into precisely why we need timeouts. As it stands right now the delays in the testing code are not significantly more transparent after this PR.

It would be great to see some comments that help new contributors understand why delays are chosen to be in a certain range and maybe even a principled guide at the top of a test file or in a README in the test subpackage explaining why delays are necessary and providing a little motivation for the values we are currently using. Even if some values we are using are rough guesses or estimates it would be helpful to communicate the state of our understanding of the values we use. In this document we could talk about libp2p backoffs and how they lead to issues with bootstrapping, and also why waiting for leader is important before executing various cluster operations.

ci/Jenkinsfile Outdated
@@ -1,2 +1,2 @@
golang([test: "go test -v -timeout 20m ./..."])
golang([test: "go test -v -loglevel ERROR ./..."])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have a different loglevel between jenkins and travis?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

temporary

@@ -105,10 +106,16 @@ type Config struct {
// possible.
ReplicationFactorMin int

// MonitorPingInterval is frequency by which a cluster peer pings the
// monitoring component. The ping metric has a TTL set to the double
// MonitorPingInterval is the frequency by which a cluster peer pings
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor typo but "frequency with which a cluster..." is more idiomatic

// of this value.
MonitorPingInterval time.Duration

// PeerWatchInterval is the frequency that we watch for changes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again minor typo but "frequency we use to watch for changes" or "frequency with which we watch for changes" is more correct.

@@ -7,9 +7,9 @@ for dir in $dirs;
do
if ls "$dir"/*.go &> /dev/null;
then
cmdflags="-timeout 20m -v -coverprofile=profile.out -covermode=count $dir"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using this script anymore? From the makefile it looks like we are just calling go test directly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, travis.yml


// Start the rest
var wg sync.WaitGroup
// Doing this in parallel causes libp2p dial backoffs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like this changed from parallel to sequential which also makes sense in terms of avoiding collisions and backoffs. Is the comment off or am I misinterpreting the code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, the comment just explains why it's not in parallel. ill change it

// Sleep a monitoring interval
time.Sleep(6 * time.Second)
delay()
waitForLeader(t, clusters) // incase we killed the leader
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case

@ghost ghost added the status/in-progress In progress label Mar 22, 2018
@ZenGround0
Copy link
Collaborator

Jenkins failures are only on windows

My comment from above still stands though. It would be great to

  1. write up the implicit knowledge encoded in testing wait intervals and the reason for doing particular waits before/after certain operations. I'm thinking of this as the beginning of "A guide to testing in ipfs-cluster".
  2. find a good place for this documentation. comments in testfile? readme in test folder? document in docs folder?
    I'm ok with merging this as is without documenting these things now, but I think they will be very helpful for contributors starting out in the repo, particularly because contributors so often start with testing. Maybe we could make an issue?

Let me know if you disagree with this. Perhaps this implicit knowledge is already encoded well enough in the existing comments to enable new developers and I'm just not looking in the right places.

@hsanjuan hsanjuan removed the status/blocked Unable to be worked further until needs are met label Mar 29, 2018
@hsanjuan
Copy link
Collaborator Author

@ZenGround0 if it goes all green then I'm done here.

I have added a note about delays. Mostly they are a result of testing and jenkins/travis not passing. It would be good to have extensive developer documentation about tests in general, but I cannot write a full guide right now.

Copy link
Collaborator

@ZenGround0 ZenGround0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@@ -276,6 +275,21 @@ func runF(t *testing.T, clusters []*Cluster, f func(*testing.T, *Cluster)) {
wg.Wait()
}

//////////////////////////////////////
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome this is just what I was looking for.

It should provide a way to speed up peer list updates when
peers join/part. It was hardcoded.

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
@hsanjuan hsanjuan force-pushed the feat/339-faster-tests branch 2 times, most recently from 4ed8a97 to 47afd2c Compare April 5, 2018 15:22
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
@hsanjuan
Copy link
Collaborator Author

hsanjuan commented Apr 5, 2018

travis passes very consistently and jenkins more or less. So same as before, except everything takes 1/3 of the time at worst. Merging!

@hsanjuan hsanjuan mentioned this pull request Apr 5, 2018
@hsanjuan hsanjuan merged commit da0915a into master Apr 5, 2018
@ghost ghost removed the status/in-progress In progress label Apr 5, 2018
@hsanjuan hsanjuan deleted the feat/339-faster-tests branch April 5, 2018 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants