Flaky shutdown test #4587
Comments
I also got this error when I was building the OpenCensus zpages.
simonpasquier added the component/tests label Sep 10, 2018
zhsj commented Oct 20, 2018
I also hit this failure when rebuilding the prometheus Debian package. A simple workaround is to insert a sleep before the kill:

--- cmd/prometheus/main_test.go.orig 2018-10-20 15:47:14.599840120 +0000
+++ cmd/prometheus/main_test.go 2018-10-20 15:41:37.667943206 +0000
@@ -93,6 +93,7 @@
 		t.Errorf("prometheus didn't start in the specified timeout")
 		return
 	}
+	time.Sleep(10 * time.Second)
 	if err := prom.Process.Kill(); err == nil {
 		t.Errorf("prometheus didn't shutdown gracefully after sending the Interrupt signal")
 	} else if stoppedErr != nil && stoppedErr.Error() != "signal: interrupt" { // TODO - find a better way to detect when the process didn't exit as expected!

But it looks unreliable and ugly...
There is already a 10-second wait in the for loop, so this seems excessive. Also, this was not happening a few days ago when I uploaded 2.4.3 to Debian (https://buildd.debian.org/status/package.php?p=prometheus&suite=sid), so there is something fishy.
By stracing this test, I see prometheus taking exactly 20 seconds to shut down, with no messages explaining why:
Definitely, this is something in gRPC. I have just rebuilt prometheus with gRPC 1.6.0, and that test finished in less than a second:
After some debugging, I've found the cause for this hang: the http.DefaultTransport is caching open connections, including the one used to check health, and prometheus does not exit until that connection is closed. Adding this line makes the test finish immediately:

--- a/cmd/prometheus/main_test.go
+++ b/cmd/prometheus/main_test.go
@@ -77,6 +77,7 @@
 	for x := 0; x < 10; x++ {
 		// error=nil means prometheus has started so can send the interrupt signal and wait for the grace shutdown.
 		if _, err := http.Get("http://localhost:9090/graph"); err == nil {
+			http.DefaultTransport.(*http.Transport).CloseIdleConnections()
 			startedOk = true
 			prom.Process.Signal(os.Interrupt)
 			select {
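To illustrate the mechanism described in this comment (a Go HTTP client keeping a finished connection open for reuse until CloseIdleConnections is called), here is a minimal, self-contained sketch. It is an assumption-laden demo rather than anything from the Prometheus code base: it uses a throwaway httptest server and a private http.Transport instead of http.DefaultTransport, and connTracker and idleConnDemo are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// connTracker (hypothetical) records which connections the server has open.
type connTracker struct {
	mu   sync.Mutex
	open map[net.Conn]bool
}

// hook is installed as http.Server.ConnState.
func (t *connTracker) hook(c net.Conn, s http.ConnState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	switch s {
	case http.StateNew:
		t.open[c] = true
	case http.StateClosed, http.StateHijacked:
		delete(t.open, c)
	}
}

func (t *connTracker) count() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.open)
}

// idleConnDemo performs one GET and reports how many server-side
// connections are open before and after closing the client's idle ones.
func idleConnDemo() (before, after int, err error) {
	tracker := &connTracker{open: map[net.Conn]bool{}}
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "ok") }))
	srv.Config.ConnState = tracker.hook
	srv.Start()
	defer srv.Close()

	tr := &http.Transport{} // private transport, so the demo has no global side effects
	client := &http.Client{Transport: tr}

	resp, err := client.Get(srv.URL)
	if err != nil {
		return 0, 0, err
	}
	io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
	resp.Body.Close()

	before = tracker.count() // the request is done, but the connection is kept alive

	// The connection is returned to the idle pool asynchronously, so retry
	// for a short while until the server has seen it close.
	deadline := time.Now().Add(2 * time.Second)
	for tracker.count() > 0 && time.Now().Before(deadline) {
		tr.CloseIdleConnections()
		time.Sleep(10 * time.Millisecond)
	}
	after = tracker.count()
	return before, after, nil
}

func main() {
	before, after, err := idleConnDemo()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("open server connections: before=%d after=%d\n", before, after)
}
```

With http.Get the same cached connection lives in http.DefaultTransport, which is why the diff above has to type-assert it to *http.Transport before calling CloseIdleConnections.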
Scratch that. I had misread the straces (and the test was passing, who knows why). The hang is due to an HTTP/2 connection between prometheus threads being kept open and idle for 20s, but I haven't yet found where or why this happens, although gRPC would be the prime suspect. Any ideas/pointers welcome.
I have found the hang happens during the call to
gouthamve commented Sep 7, 2018
This failed on travis: