
Flaky shutdown test #4587

Open · gouthamve opened this issue Sep 7, 2018 · 8 comments

gouthamve commented Sep 7, 2018

This failed on Travis:

No unused packages
>> running all tests
go test -race ./...
--- FAIL: TestStartupInterrupt (10.51s)
	main_test.go:105: prometheus didn't shutdown gracefully after sending the Interrupt signal
FAIL
FAIL	github.com/prometheus/prometheus/cmd/prometheus	85.136s
ok  	github.com/prometheus/prometheus/cmd/promtool	1.075s
ok  	github.com/prometheus/prometheus/config	1.279s
ok  	github.com/prometheus/prometheus/discovery	7.474s

sipian commented Sep 7, 2018

I also got this error while building the OpenCensus zPages.

zhsj commented Oct 20, 2018

I also hit this failure when rebuilding the Prometheus Debian package.
https://bugs.debian.org/911434

A simple workaround is to insert a time.Sleep:

--- cmd/prometheus/main_test.go.orig	2018-10-20 15:47:14.599840120 +0000
+++ cmd/prometheus/main_test.go	2018-10-20 15:41:37.667943206 +0000
@@ -93,6 +93,7 @@
 		t.Errorf("prometheus didn't start in the specified timeout")
 		return
 	}
+	time.Sleep(10 * time.Second)
 	if err := prom.Process.Kill(); err == nil {
 		t.Errorf("prometheus didn't shutdown gracefully after sending the Interrupt signal")
 	} else if stoppedErr != nil && stoppedErr.Error() != "signal: interrupt" { // TODO - find a better way to detect when the process didn't exit as expected!

But it looks unreliable and ugly...
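
For contrast, a sleep-free sketch (illustrative only, not the actual test code; the helper name is made up): wait for the child process to exit with a deadline instead of sleeping a fixed duration.

package cmdtest // illustrative package name

import (
	"fmt"
	"os/exec"
	"time"
)

// waitForExit blocks until cmd terminates or the timeout expires.
func waitForExit(cmd *exec.Cmd, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err // exited; err reports the exit status or terminating signal
	case <-time.After(timeout):
		return fmt.Errorf("process still running after %v", timeout)
	}
}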

TheTincho commented Oct 20, 2018

There is already a 10-second wait in the for loop, so this seems excessive. Also, this was not happening a few days ago when I uploaded 2.4.3 to Debian (https://buildd.debian.org/status/package.php?p=prometheus&suite=sid), so something is fishy.

TheTincho commented Oct 20, 2018

By stracing this test, I see Prometheus taking exactly 20 seconds to shut down, with no messages explaining why:

[pid 15133] write(2, "level=info ts=2018-10-20T19:45:35.643055957Z caller=main.go:601 msg=\"Notifier manager stopped\"\n", 95) = 95
[pid 15139] write(2, "level=info ts=2018-10-20T19:45:55.161094263Z caller=main.go:613 msg=\"See you next time!\"\n", 89) = 89

TheTincho commented Oct 20, 2018

This is definitely something in gRPC. I have just rebuilt Prometheus with gRPC 1.6.0, and the test finished in less than a second:

=== RUN   TestStartupInterrupt
--- PASS: TestStartupInterrupt (0.51s)

TheTincho commented Oct 20, 2018

After some debugging, I've found the cause of this hang: http.DefaultTransport caches open connections, including the one used for the health check, and Prometheus does not exit until that connection is closed. Adding this line makes the test finish immediately:

--- a/cmd/prometheus/main_test.go
+++ b/cmd/prometheus/main_test.go
@@ -77,6 +77,7 @@
 	for x := 0; x < 10; x++ {
 		// error=nil means prometheus has started so can send the interrupt signal and wait for the grace shutdown.
 		if _, err := http.Get("http://localhost:9090/graph"); err == nil {
+			http.DefaultTransport.(*http.Transport).CloseIdleConnections()
 			startedOk = true
 			prom.Process.Signal(os.Interrupt)
 			select {
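
As a standalone sketch of the same idea (assuming the cached keep-alive connection is what delays shutdown; names here are illustrative, not from main_test.go):

package cmdtest // illustrative package name

import "net/http"

// probe issues the readiness request and then drops the idle keep-alive
// connection that http.DefaultTransport would otherwise keep cached.
func probe(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	resp.Body.Close()
	http.DefaultTransport.(*http.Transport).CloseIdleConnections()
	return nil
}

// probeNoKeepAlive does the same with a throwaway client that never caches
// connections in the first place.
func probeNoKeepAlive(url string) error {
	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}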

TheTincho commented Oct 20, 2018

Scratch that. I had misread the straces (and the test was passing who knows why). The hang is due to an HTTP/2 connection between Prometheus threads that is kept open and idle for 20s, but I haven't yet found where or why this happens, although gRPC would be the prime suspect. Any ideas/pointers welcome.

23:41:01.500462 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
23:41:01.500590 setsockopt(6, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
23:41:01.500656 connect(6, {sa_family=AF_INET, sin_port=htons(9090), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EINPROGRESS (Operation now in progress)
23:41:01.500904 getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
23:41:01.500952 getpeername(6, {sa_family=AF_INET, sin_port=htons(9090), sin_addr=inet_addr("127.0.0.1")}, [112->16]) = 0
23:41:01.500998 getsockname(6, {sa_family=AF_INET, sin_port=htons(39658), sin_addr=inet_addr("127.0.0.1")}, [112->16]) = 0
23:41:01.501043 setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
23:41:01.501125 write(6, "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n", 24) = 24
23:41:01.501821 write(6, "\0\0\0\4\0\0\0\0\0", 9) = 9
23:41:01.502263 read(6, 0xc420652000, 32768) = -1 EAGAIN (Resource temporarily unavailable)

23:41:21.501651 close(6)                = 0

TheTincho commented Oct 21, 2018

I have found that the hang happens during the call to grpcSrv.GracefulStop() in the web handler; replacing it with grpcSrv.Stop() does not change the behaviour, contrary to what the documentation implies. Running with GRPC_GO_LOG_SEVERITY_LEVEL=info, I see these messages:

level=info ts=2018-10-21T09:56:23.33676144Z caller=web.go:392 component=web msg="Start listening for connections" address=0.0.0.0:9090
INFO: 2018/10/21 09:56:23 parsed scheme: ""
INFO: 2018/10/21 09:56:23 scheme "" not registered, fallback to default scheme
INFO: 2018/10/21 09:56:23 ccResolverWrapper: sending new addresses to cc: [{0.0.0.0:9090 0  <nil>}]
INFO: 2018/10/21 09:56:23 ClientConn switching balancer to "pick_first"
INFO: 2018/10/21 09:56:23 pickfirstBalancer: HandleSubConnStateChange: 0xc420630060, CONNECTING
INFO: 2018/10/21 09:56:23 pickfirstBalancer: HandleSubConnStateChange: 0xc420630060, READY

[..]
level=info ts=2018-10-21T09:56:23.343295261Z caller=main.go:605 msg="Notifier manager stopped"
WARNING: 2018/10/21 09:56:43 grpc: addrConn.transportMonitor didn't get server preface after waiting. Closing the new transport now.
INFO: 2018/10/21 09:56:43 pickfirstBalancer: HandleSubConnStateChange: 0xc420630060, TRANSIENT_FAILURE
INFO: 2018/10/21 09:56:43 pickfirstBalancer: HandleSubConnStateChange: 0xc420630060, CONNECTING
INFO: 2018/10/21 09:56:43 pickfirstBalancer: HandleSubConnStateChange: 0xc420630060, TRANSIENT_FAILURE
INFO: 2018/10/21 09:56:43 pickfirstBalancer: HandleSubConnStateChange: 0xc420630060, CONNECTING
WARNING: 2018/10/21 09:56:43 grpc: addrConn.createTransport failed to connect to {0.0.0.0:9090 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:9090: connect: connection refused". Reconnecting...
INFO: 2018/10/21 09:56:43 pickfirstBalancer: HandleSubConnStateChange: 0xc420630060, TRANSIENT_FAILURE
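
A common way to bound this kind of hang is to race GracefulStop against a timer and fall back to a hard Stop (a sketch only, not what Prometheus does; whether it is the right fix depends on why GracefulStop is waiting):

package cmdtest // illustrative package name

import (
	"time"

	"google.golang.org/grpc"
)

// stopWithTimeout attempts a graceful stop and force-closes whatever is left
// once the grace period expires, so an idle connection cannot hold shutdown open.
func stopWithTimeout(srv *grpc.Server, grace time.Duration) {
	done := make(chan struct{})
	go func() {
		srv.GracefulStop() // blocks until pending RPCs finish and connections drain
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(grace):
		srv.Stop() // hard stop: closes all listeners and connections immediately
	}
}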