Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky tests #685

Closed
ruseinov opened this issue May 28, 2019 · 7 comments
Closed

Fix flaky tests #685

ruseinov opened this issue May 28, 2019 · 7 comments
Assignees

Comments

@ruseinov
Copy link
Contributor

ruseinov commented May 28, 2019

ok      github.com/iov-one/weave/cmd/bnsd/commands    10.101s
2019/05/28 10:41:29 proto: duplicate proto type registered: common.KVPair
2019/05/28 10:41:29 proto: duplicate proto type registered: common.KI64Pair
E[2019-05-28|10:41:29.977] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:30.896] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:30.902] Stopped accept routine, as transport is closed module=p2p numPeers=0
E[2019-05-28|10:41:30.942] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:31.105] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:31.122] Stopped accept routine, as transport is closed module=p2p numPeers=0
E[2019-05-28|10:41:31.156] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:31.417] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:31.429] Stopped accept routine, as transport is closed module=p2p numPeers=0
E[2019-05-28|10:41:31.473] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:37.091] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:37.102] Stopped accept routine, as transport is closed module=p2p numPeers=0
E[2019-05-28|10:41:37.146] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:37.946] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:37.992] Stopped accept routine, as transport is closed module=p2p numPeers=0
--- FAIL: TestIssueNfts (0.07s)
    <autogenerated>:1: cannot start tendermint node: listen tcp 0.0.0.0:33358: bind: address already in use
E[2019-05-28|10:41:38.143] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:38.314] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:38.340] Stopped accept routine, as transport is closed module=p2p numPeers=0
E[2019-05-28|10:41:38.420] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:39.308] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:39.337] Stopped accept routine, as transport is closed module=p2p numPeers=0
E[2019-05-28|10:41:39.383] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:40.142] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:40.143] Stopped accept routine, as transport is closed module=p2p numPeers=0
E[2019-05-28|10:41:40.203] Couldn't connect to any seeds                module=p2p 
E[2019-05-28|10:41:40.376] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-05-28|10:41:40.408] Stopped accept routine, as transport is closed module=p2p numPeers=0
FAIL
FAIL    github.com/iov-one/weave/cmd/bnsd/scenarios    11.179s

https://travis-ci.com/iov-one/weave/jobs/203494342

@ruseinov ruseinov self-assigned this May 28, 2019
@ruseinov
Copy link
Contributor Author

ruseinov commented May 28, 2019

@ethanfrey So I have investigated why this happens and it's this:
https://github.com/tendermint/tendermint/blob/master/libs/common/net.go#L28

// GetFreePort gets a free port from the operating system.
// Ripped from https://github.com/phayes/freeport.
// BSD-licensed.
func GetFreePort() (int, error) {
	addr, err := net.ResolveTCPAddr("tcp", "localhost:0")
	if err != nil {
		return 0, err
	}

	l, err := net.ListenTCP("tcp", addr)
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

A good explanation here: https://eklitzke.org/binding-on-port-zero.
In our case this should not be common, I'm actually surprised it happens at all.

@ethanfrey
Copy link
Contributor

I think we explicitly set the port 0.0.0.0:33358 or do we not???

Either we make sure to explicitly set the port differently for each run of tendermint to avoid timing issues, or figure out why port 0 returning odd locations.

Let's see if this happens again in the CI though, as no need to waste time on a fluke occurance

@ruseinov
Copy link
Contributor Author

@ethanfrey nope, this is being done by

// GetConfig returns a config for the test cases as a singleton
func GetConfig() *cfg.Config {
	if globalConfig == nil {
		pathname := makePathname()
		globalConfig = cfg.ResetTestRoot(pathname)

		// and we use random ports to run in parallel
		tm, rpc, grpc := makeAddrs()
		globalConfig.P2P.ListenAddress = tm
		globalConfig.RPC.ListenAddress = rpc
		globalConfig.RPC.CORSAllowedOrigins = []string{"https://tendermint.com/"}
		globalConfig.RPC.GRPCListenAddress = grpc
		globalConfig.TxIndex.IndexTags = "app.creator,tx.height" // see kvstore application
	}
	return globalConfig
}

in rpc/test/helpers.go in tendermint

@ruseinov ruseinov added the blocked Work on the issue cannot be continue before another problem is solved. label May 28, 2019
@ethanfrey
Copy link
Contributor

  // and we use random ports to run in parallel

okay, gotcha here.

I will close this issue unless it happens a few more times (then please reopen)

@ruseinov
Copy link
Contributor Author

@ethanfrey I mark this now as blocked.
There is a possibility for collision using this approach, but it is very low.

However, introducing one retry (or configurable retries) to bind to port 0 in tendermint would result in this being consistently .

@ruseinov
Copy link
Contributor Author

I have been looking at the wrong test. So it turns out we are calling GetFreePort in our tests, which is something we should avoid doing.

func randomAddr(t testing.TB) string {

This is likely the cause of collisions, because GetFreePort just checks if the port is free and then releases it, so it's likely we get the same port. We should let tendermint do it instead or get these ports ourselves by stealing that code and enhancing it to return several ports.

I will get this fixed.

@ruseinov ruseinov removed the blocked Work on the issue cannot be continue before another problem is solved. label May 29, 2019
@ruseinov
Copy link
Contributor Author

I have taken a look at how tendermint does it - and they have the same approach.
Would would help is a method that binds to ports while generating n ports and then returns and releases them. That would reduce collision chance.

I have tried to port our code to use such approach, but that means duplicating a lot of tendermint's test setup and does not look like a good way forward. I'm going to comment on my original tendermint issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants