Make DebugHook test more reliable on busy machines #7038

Merged
jujubot merged 1 commit into juju:2.1 from babbageclunk:debug-hook-test-fix on Feb 27, 2017

Member

babbageclunk commented Feb 27, 2017

Description of change

TestRunHook was regularly failing on test hosts with an error indicating
that the flock process was being killed by a timer before we managed to
get the debug directory. It passes reliably on a (relatively unloaded)
dev machine, but I could simulate a heavier load by adding a delay in
the goroutine the test starts and get similarly flaky behaviour.

Increase the timeout of the flock command to match the timeout of the
select loop below it - this should make the failures much less likely.

Bug reference

Hopefully fixes https://bugs.launchpad.net/juju/+bug/1612747


In terms of getting the test to pass more frequently, increasing the timeout seems fine. As such, it's an improvement for getting more frequent Blesses :D Hence, LGTM \o/
However, I wonder if we are just kicking the can down the road and actually obscuring an underlying issue?

Member

anastasiamac commented Feb 27, 2017

$$merge$$

Contributor

jujubot commented Feb 27, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit d277b90 into juju:2.1 Feb 27, 2017

Member

babbageclunk commented Feb 27, 2017

@anastasiamac yeah, I thought about that too - bumping up a timeout isn't a very satisfying "fix". I couldn't see a way of making it more inherently reliable in this case - since we're starting an external flock process running sleep there's not much we can do to connect the timeout to something happening in this process.

I guess the flock command could run something that listened to a socket and died when it received a message, and we could write to the socket when we know the goroutine has run? It feels like that's complicated enough to end up with a less reliable test though.

@babbageclunk babbageclunk deleted the babbageclunk:debug-hook-test-fix branch Feb 27, 2017
