Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestDialTimeout flake on linux-amd64-sid #22896

Closed
bradfitz opened this issue Nov 28, 2017 · 15 comments

Comments

Projects
None yet
6 participants
@bradfitz
Copy link
Member

commented Nov 28, 2017

Is it supposed to be dialing 127.0.0.1:0 (the 0 port is weird):

https://build.golang.org/log/450b8cb212512d306aed03190d275f546c332715

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net	5.178s

/cc @mikioh

@bradfitz bradfitz added this to the Go1.10 milestone Nov 28, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Nov 29, 2017

I've looked through all the code and I've convinced myself that this can not happen. I also have not been able to reproduce it.

That said, here is another instance:

https://build.golang.org/log/95a745a6f8fed78a9b854b3cbad5986275ffc0cd

@nilsmagnus

This comment has been minimized.

Copy link
Contributor

commented Jan 2, 2018

if nerr, ok := err.(Error); !ok || !nerr.Timeout() {
				t.Fatalf("#%d: %v", i, err)
}

It looks to mee that there is an error in the log-message, it should have logged nerr, not err .

@nilsmagnus

This comment has been minimized.

Copy link
Contributor

commented Jan 2, 2018

The :0 port is for testing purposes and comes from line 71, at least according to (https://golang.org/src/net/timeout_test.go)

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 3, 2018

@nilsmagnus Logging err rather than nerr is correct; if !ok then nerr will be nil so we want to log err, and if ok then logging err gives us the same value as logging nerr anyhow.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018

@agnivade

This comment has been minimized.

Copy link
Member

commented Jan 9, 2018

Binding on port 0 is a request to the kernel to look for a free port and bind to it. It actually does not bind to 0. Therefore, it should not be possible to connect to port 0. But that should not be an issue here since the socket filter will prevent the connection from happening in the first place.

However, in both the buildlet failures, the failure happens on item #5 which has the following values -

{
timeout - 50 * time.Millisecond, 
delta- 0, 
guard- 100 * time.Millisecond, 
max- time.Second
}

@bradfitz - Anything suspicious about these values that can cause a flake ? I tried very hard to reproduce but could not.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 9, 2018

ping @mikioh for above question.

@mikioh

This comment has been minimized.

Copy link
Contributor

commented Jan 12, 2018

I guess that one possibility is that the buildbot, cmd/go or testing package did run the net.test process incorrectly, without the harness of net/internal/socktest package.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 12, 2018

@mikioh I don't see any way that that could happen.

@mikioh

This comment has been minimized.

Copy link
Contributor

commented Jan 12, 2018

@ianlancetaylor,

I'm also not sure the mechanism, but the evidence, the test failed within 0.00s and attempted to connect to the literal port number zero (as a boobytrap set by me), leads me to the possibility.

@mikioh

This comment has been minimized.

Copy link
Contributor

commented Jan 12, 2018

not sure the mechanism

Hm, the harness is naively designed to be unstrapped before the execution of benchmarking, so it depends on the execution order of testing and benchmarking. It's likely that unstrapping happens if the testing goroutine is overtaken by one of benchmarking goroutines, like #21090.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2018

Here are the cases I can find today. It's perhaps interesting that it always seems to be case 5.

2017-04-17T18:33:08-73f283f/linux-ppc64-buildlet:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-04-21T23:43:59-49f54e8/linux-ppc64-buildlet:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-04-22T00:53:51-c202d4d/linux-amd64-clang:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-04-28T15:32:18-6511931/linux-ppc64-buildlet:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-08-28T05:26:59-eb07028/linux-amd64-alpine:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-09-23T09:24:02-6171d05/linux-amd64-alpine:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-10-05T08:28:41-be272ec/linux-ppc64le-buildlet:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-10-08T20:14:33-e285f39/linux-amd64-alpine:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-02T03:33:53-b66a294/linux-amd64:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-03T17:45:11-14f2bfd/linux-amd64-clang:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-10T04:40:13-17f35c6/linux-amd64-nocgo:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-22T22:37:15-ecc8650/linux-amd64-sid:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net

2017-11-28T01:18:54-b5c7183/linux-amd64-sid:

--- FAIL: TestDialTimeout (0.00s)
	timeout_test.go:87: #5: dial tcp 127.0.0.1:0: getsockopt: connection refused
FAIL
FAIL	net
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2018

These failures are occurring on the builders, which as far as I know don't run the benchmarks, so I don't think the problem could be the benchmarks uninstalling the test hooks.

@mikioh

This comment has been minimized.

Copy link
Contributor

commented Jan 26, 2018

Then, there might be a race on the callpath switch between test cases; enableSocketConnect? sounds plausible. Adding per-test case filter feature into socktest.Switch might be a solution.

@gopherbot

This comment has been minimized.

Copy link

commented Apr 6, 2018

Change https://golang.org/cl/105095 mentions this issue: net, net/internal/socktest: make concurrent socket system calls in test safe

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2018

This has not happened since 2017-11-28. I don't know what fixed it, but I think it must be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.