
Improve flaky connect/proxy Listener tests #4498

Merged
merged 3 commits into hashicorp:master on Aug 8, 2018

Conversation

@freddygv (Contributor) commented Aug 7, 2018

The source of flakiness for TestUpstreamListener and TestPublicListener was a data race: the counter assertion occasionally ran before the atomic counter in memory was incremented. This was remedied with a short 1ms sleep before the listener is closed and the data gets dumped to go-metrics. (The new test passed hundreds of times locally in a constrained environment that would previously cause a failure within 20 iterations.)

Lastly, while going through these tests I also found that, on occasion, the latest go-metrics interval did not contain the logged gauge data, due to how the intervals are generated. That assertion was improved by falling back to the second-to-last index in the metrics array.

- Add sleep to allow for cw.written to increment before calling Stats

- Improve debug output when dumping metrics

- Account for flakiness around interval for Gauge
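For illustration, the test-side pattern these commits describe might be sketched as follows, assuming go-metrics' in-memory sink; the helper and its arguments are illustrative, not the actual test code:

import (
	"io"
	"time"

	"github.com/armon/go-metrics"
)

// drainAndDump is an illustrative helper, not the real test code: it gives
// the proxied connection's final Write a moment to bump its atomic byte
// counters, then closes the listener so the handler's deferred reportStats
// flushes tx/rx counts into the in-memory go-metrics sink.
func drainAndDump(l io.Closer, sink *metrics.InmemSink) []*metrics.IntervalMetrics {
	time.Sleep(time.Millisecond) // let cw.written increment for tx and rx
	l.Close()                    // handler defers run and report stats
	return sink.Data()           // intervals the test asserts gauges against
}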
@freddygv requested a review from banks August 7, 2018 17:43
@banks (Member) left a comment

Super minor changes but I think we can easily make the gauges thing even more robust.

Longer term, I'd still like to use the Mock interface rather than jump through these ugly hoops, but we can leave that for a future PR when we want to do more metrics testing.

// If the latest interval is empty, the prior one should contain the stored metrics
if len(currentInterval.Gauges) == 0 {
	currentInterval = data[len(data)-2]
}
@banks (Member) commented:

I guess this is fine given 10-second intervals, but it's still somewhat timing-dependent, right?

Also, this relies on the current system under test only having one gauge. If we reused the same code in future or added more gauges, it would be possible for this interval to have a gauge value but not the right one.

Finally, if we failed to record the gauge at all because of an actual bug, len(data) - 2 could be out of bounds and turn a useful failure into a panic.

I suggest that for this PR we just copy the loop from below to iterate (backwards) over all the intervals and pick the most recent value (if there is one).
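For illustration, a backwards scan along those lines might look like this; it is a standalone sketch with a stripped-down interval type, not go-metrics' real IntervalMetrics:

// interval mirrors only the piece of go-metrics' IntervalMetrics the test
// cares about; it is a stand-in for illustration, not the real type.
type interval struct {
	Gauges map[string]float32
}

// latestGauge walks the intervals newest-first and returns the most recent
// recorded value for key. This avoids both the "latest interval is empty"
// flake and the out-of-bounds risk of hard-coding data[len(data)-2].
func latestGauge(data []interval, key string) (float32, bool) {
	for i := len(data) - 1; i >= 0; i-- {
		if v, ok := data[i].Gauges[key]; ok {
			return v, true
		}
	}
	return 0, false
}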

// Close listener to ensure all conns are closed and have reported their
// metrics
// Short sleep to allow for cw.written to increment for tx and rx before calling reportStats
time.Sleep(time.Millisecond)
@banks (Member) commented:

This is cool, but I'm still puzzled as to how the atomic increments appear to update asynchronously... Those atomic increments occur at the same time Write is called on the packet. I assume Write must have returned by the time the test passes, since we assert the packet was received.

But we call reportStats in a defer in the handler loop which will only be executed when one of these cases triggers:

// Wait for conn to close
for {
	select {
	case <-connStop:
		return
	case <-l.stopChan:
		return
	case <-statsT.C:
		reportStats()
	}
}

<-connStop implies that the Conn is closed, and both actual connections with it, so Write must have completed.

<-stopChan is more subtle: it will return when Listener.Stop closes the chan, which means the Conn is still running.

We then execute defers in reverse order so reportStats gets called before conn.Close().
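As a standalone reminder of the defer ordering being relied on here (not the actual handler code; defers run last-in, first-out):

package main

import "fmt"

// handle defers Close first and reportStats second, mirroring the ordering
// described above, so the reportStats stand-in prints before the Close one.
func handle() {
	defer fmt.Println("conn.Close()  (deferred first, runs last)")
	defer fmt.Println("reportStats() (deferred last, runs first)")
}

func main() {
	handle()
}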

So the only thing I can see that could possibly explain the flakiness is that, somehow, the test code gets the bytes delivered over the network and returns from the testEchoConn function before the Write call on the countWriter finishes executing.

That is possible, but it seems incredible that it would happen as often as it does; then again, my mental model of the internals of the Go network stack may well be off.

Can we just super quickly validate the theory, though? All you'd need to do is remove the sleep here and then modify:

// Write implements io.Writer
func (cw *countWriter) Write(p []byte) (n int, err error) {
	n, err = cw.w.Write(p)
	atomic.AddUint64(&cw.written, uint64(n))
	return
}

to increment the counter before the Write. That assumes the processor can't reorder those instructions, though, which I'm not sure is valid.
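Concretely, that diagnostic version might look like the sketch below; it counts len(p) up front since n isn't known yet, which over-counts on short or failed writes, so it is only useful for confirming where the race is:

// Diagnostic-only reordering: bump the counter before handing the bytes to
// the underlying writer, so the count is updated before the peer can
// possibly observe the data.
func (cw *countWriter) Write(p []byte) (n int, err error) {
	atomic.AddUint64(&cw.written, uint64(len(p)))
	return cw.w.Write(p)
}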

If that change appears to fix it, then I'll be a bit happier knowing where the race is. It's not a real "fix" though, since we can't count sent bytes until they are actually sent (the Write might fail, etc.).

Either way, I think solving it with this sleep is the right approach for now; it's not a race that ever matters except in this test. But understanding exactly where the ordering assumptions are violated would be nice, and we could update the comment to be more specific too.

If that doesn't fix it then I'm baffled, but I don't want to spend more time on this now!

@freddygv (Contributor, Author) commented:

I removed the sleep and flipped the order inside of Write and now the test passes consistently.

So yes, it does seem that testEchoConn returns before Write completes under these constrained-CPU conditions.

@banks (Member) left a comment

Awesome!

Thanks for the sanity check on that flake issue.

@freddygv merged commit e21f554 into hashicorp:master Aug 8, 2018
@freddygv deleted the debug-listener-tests branch August 8, 2018 18:56