
Improve flaky connect/proxy Listener tests #4498

Merged
merged 3 commits into hashicorp:master on Aug 8, 2018

Conversation

@freddygv (Contributor) commented Aug 7, 2018

The source of flakiness for TestUpstreamListener and TestPublicListener was a data race: the counter assertion occasionally ran before the atomic counter in memory was incremented. This was remedied with a short 1ms sleep before the listener is closed and the data gets dumped to go-metrics. (The new test passed hundreds of times locally in a constrained environment that would previously cause a failure within 20 iterations.)

Lastly, while going through these tests I also found that, on occasion, the latest go-metrics interval did not contain the logged gauge data, due to how the intervals are generated. That assertion was improved by falling back to the second-to-last index in the metrics array.

- Add sleep to allow for cw.written to increment before calling Stats

- Improve debug output when dumping metrics

- Account for flakiness around interval for Gauge
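For illustration, the test-side pattern these commits describe might be sketched as follows, assuming go-metrics' in-memory sink; the helper and its arguments are illustrative, not the actual test code:

import (
	"io"
	"time"

	"github.com/armon/go-metrics"
)

// drainAndDump is an illustrative helper, not the real test code: it gives
// the proxied connection's final Write a moment to bump its atomic byte
// counters, then closes the listener so the handler's deferred reportStats
// flushes tx/rx counts into the in-memory go-metrics sink.
func drainAndDump(l io.Closer, sink *metrics.InmemSink) []*metrics.IntervalMetrics {
	time.Sleep(time.Millisecond) // let cw.written increment for tx and rx
	l.Close()                    // handler defers run and report stats
	return sink.Data()           // intervals the test asserts gauges against
}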
@freddygv requested a review from banks August 7, 2018 17:43
@banks (Member) left a comment

Super minor changes but I think we can easily make the gauges thing even more robust.

Longer term, I'd still like to use the Mock interface rather than jump through these ugly hoops, but we can leave that for a future PR when we want to do more metrics testing.

// If the latest interval is empty, the prior one should contain the stored metrics
if len(currentInterval.Gauges) == 0 {
	currentInterval = data[len(data)-2]
}
@banks (Member) commented:

I guess this is fine given 10-second intervals, but it's still somewhat timing-dependent, right?

Also, this relies on the current system under test only having one gauge. If we reused the same code in future or added more gauges, it would be possible for this interval to have a gauge value but not the right one.

Finally, if we failed to record the gauge at all because of an actual bug, len(data) - 2 could be out of bounds and turn a useful failure into a panic.

I suggest that for this PR we just copy the loop from below to iterate (backwards) over all the intervals and pick the most recent value (if there is one).
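For illustration, a backwards scan along those lines might look like this; it is a standalone sketch with a stripped-down interval type, not go-metrics' real IntervalMetrics:

// interval mirrors only the piece of go-metrics' IntervalMetrics the test
// cares about; it is a stand-in for illustration, not the real type.
type interval struct {
	Gauges map[string]float32
}

// latestGauge walks the intervals newest-first and returns the most recent
// recorded value for key. This avoids both the "latest interval is empty"
// flake and the out-of-bounds risk of hard-coding data[len(data)-2].
func latestGauge(data []interval, key string) (float32, bool) {
	for i := len(data) - 1; i >= 0; i-- {
		if v, ok := data[i].Gauges[key]; ok {
			return v, true
		}
	}
	return 0, false
}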

// Close listener to ensure all conns are closed and have reported their
// metrics
// Short sleep to allow for cw.written to increment for tx and rx before calling reportStats
time.Sleep(time.Millisecond)
@banks (Member) commented:

This is cool, but I'm still puzzled as to how the atomic increments appear to update asynchronously... Those atomic increments occur at the same time Write is called on the packet. I assume Write must have returned by the time the test passes, since we assert the packet was received.

But we call reportStats in a defer in the handler loop which will only be executed when one of these cases triggers:

// Wait for conn to close
for {
	select {
	case <-connStop:
		return
	case <-l.stopChan:
		return
	case <-statsT.C:
		reportStats()
	}
}

<-connStop implies that the Conn is closed, and both actual connections with it, so Write must have completed.

<-stopChan is more subtle: it will return when Listener.Stop closes the chan, which means the Conn is still running.

We then execute defers in reverse order so reportStats gets called before conn.Close().
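As a standalone reminder of the defer ordering being relied on here (not the actual handler code; defers run last-in, first-out):

package main

import "fmt"

// handle defers Close first and reportStats second, mirroring the ordering
// described above, so the reportStats stand-in prints before the Close one.
func handle() {
	defer fmt.Println("conn.Close()  (deferred first, runs last)")
	defer fmt.Println("reportStats() (deferred last, runs first)")
}

func main() {
	handle()
}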

So the only thing I can see that could possibly explain the flakiness is that, somehow, the test code gets the bytes delivered over the network and returns from the testEchoConn function before the Write call on the countWriter finishes executing.

That is possible, but it seems incredible that it would happen as often as it does; then again, my mental model of the internals of the Go network stack may well be off.

Can we just super quickly validate the theory, though? All you'd need to do is remove the sleep here and then modify:

// Write implements io.Writer
func (cw *countWriter) Write(p []byte) (n int, err error) {
	n, err = cw.w.Write(p)
	atomic.AddUint64(&cw.written, uint64(n))
	return
}

to increment the counter before the Write. That assumes the processor can't reorder those instructions, though, which I'm not sure is valid.
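Concretely, that diagnostic version might look like the sketch below; it counts len(p) up front since n isn't known yet, which over-counts on short or failed writes, so it is only useful for confirming where the race is:

// Diagnostic-only reordering: bump the counter before handing the bytes to
// the underlying writer, so the count is updated before the peer can
// possibly observe the data.
func (cw *countWriter) Write(p []byte) (n int, err error) {
	atomic.AddUint64(&cw.written, uint64(len(p)))
	return cw.w.Write(p)
}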

If that change appears to fix it, then I'll be a bit happier knowing where the race is. It's not a real "fix" though, since we can't count sent bytes until they are actually sent (the Write might fail, etc.).

Either way, I think solving it with this sleep is the right approach for now; it's not a race that ever matters except in this test. But understanding exactly where the ordering assumptions are violated would be nice, and we could update the comment to be more specific too.

If that doesn't fix it then I'm baffled, but I don't want to spend more time on this now!

@freddygv (Contributor, Author) commented:

I removed the sleep and flipped the order inside of Write and now the test passes consistently.

So yes, it does seem that testEchoConn returns before Write completes under these constrained-CPU conditions.

@banks (Member) left a comment

Awesome!

Thanks for the sanity check on that flake issue.

@freddygv merged commit e21f554 into hashicorp:master Aug 8, 2018
@freddygv deleted the debug-listener-tests branch August 8, 2018 18:56