-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve flaky connect/proxy Listener tests #4498
Conversation
- Add sleep to allow for cw.written to increment before calling Stats - Improve debug output when dumping metrics - Account for flakiness around interval for Gauge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super minor changes but I think we can easily make the gauges thing even more robust.
Longer term, I'd still like to use the Mock interface ratheer than jump through these ugly hoops but we can leave that for a future PR when we want to do more metrics testing.
connect/proxy/listener_test.go
Outdated
// If the latest interval is empty, the prior one should contain the stored metrics | ||
if len(currentInterval.Gauges) == 0 { | ||
currentInterval = data[len(data)-2] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is fine given 10 second intervals, but it's still somewhat timing dependent right?
Also this relies on the current system under test only having one gauge. If we used the same code in future or added more gauges it would be possible that this interval does have a gauge value but not the right one.
Finally, if we just failed to actually record the gauge because of an actual bug, len(data) - 2
could be out of bounds and turn a useful failure into a panic.
I suggest for this PR, lets just copy the loop from below to iterate (backwards) over all the intervals and pick the most recent value (if there is one).
connect/proxy/listener_test.go
Outdated
// Close listener to ensure all conns are closed and have reported their | ||
// metrics | ||
// Short sleep to allow for cw.written to increment for tx and rx before calling reportStats | ||
time.Sleep(time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is cool, but I'm still puzzled how atomic increments appear to update asynchronously... Those atomic increments occur at same time Write
was called on the packet. I assume Write
must have returned by the time the test passes since we assert the packet was received.
But we call reportStats
in a defer
in the handler loop which will only be executed when one of these cases triggers:
consul/connect/proxy/listener.go
Lines 206 to 216 in 3e6313b
// Wait for conn to close | |
for { | |
select { | |
case <-connStop: | |
return | |
case <-l.stopChan: | |
return | |
case <-statsT.C: | |
reportStats() | |
} | |
} |
<-connStop
implies that the Conn
is closed and both actual connections with it so Write
must have completed
<-stopChan
is more subtle, it will return when Listener.Stop
closes the chan. Which means the Conn
is still running.
We then execute defers in reverse order so reportStats
gets called before conn.Close()
.
So the only thing that I can see that could possibly explain flakiness is that somehow, the test code gets the bytes delivered over the network and returns from the testEchoConn
function, before the Write
call on the countWriter
finishes executing.
That is possible but seems incredible that it would happen as much as it does but my mental model for the internals of go network stack may well be off.
Can we just super quickly validate the theory though. All you'd need to do would be to remove the sleep here and then modify:
Lines 91 to 96 in 3e6313b
// Write implements io.Writer | |
func (cw *countWriter) Write(p []byte) (n int, err error) { | |
n, err = cw.w.Write(p) | |
atomic.AddUint64(&cw.written, uint64(n)) | |
return | |
} |
To increment the counter, before the Write
. That assumes the processor can't re-order those instructions though too which I'm not sure if it's valid.
If that change appears to fix it then I'll be a bit happier knowing where the race is. It's not a real "fix" though since we can't count sent bytes until they are really sent since the Write might fail etc.
I think either way, solving it with this sleep is the right approach for now, it's not a race that is ever of importance except in this test. But understanding exactly where the ordering assumptions are violated would be nice. We could update the comment to be more specific too.
If that doesn't fix it then I'm baffled but don't want to spend more time on this now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the sleep and flipped the order inside of Write and now the test passes consistently.
So yes it does seem to be that testEchoConn returns before Write in these conditions with limited CPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
Thanks for the sanity check on that flake issue.
The source of flakiness for TestUpstreamListener and TestPublicListener was due to a data race. The counter assertion was occasionally happening before the atomic counter in memory was incremented. This was remedied with a short 1ms Sleep before the listener is closed and the data gets dumped to go-metrics. (The new test passed hundreds of times locally in a constrained environment that would typically cause it to fail after < 20 iterations)
Lastly, while going through these tests I also found that on occasion the latest go-metrics interval would not contain the logged gauge data due to how the intervals are generated. That assertion was improved by checking the 2nd to last index in the metrics array.