This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

TestMonitoredCmd is flaky #763

Closed
carolynvs opened this issue Jun 17, 2017 · 6 comments

Comments

@carolynvs
Collaborator

This test fails intermittently with the following error. This time I saw it fail on Go tip, but I haven't noticed whether it fails on other versions as well.

--- FAIL: TestMonitoredCmd (2.82s)
	cmd_test.go:59: Unexpected output:
			(GOT): foo
		foo
		foo
		foo
		foo
		
			(WNT): foo
		foo
		foo
		foo
@sdboyer
Member

sdboyer commented Jun 19, 2017

This flaky test has been a thorn in my side for some months now. I managed to make it a bit less flaky by playing with the timeouts, but the failures still occur.

I'd love to find a way to achieve the same or similar guarantees from the test, while avoiding the issues related to timing.

@ibrasho
Collaborator

ibrasho commented Jun 20, 2017

Found a race condition that could happen on activityBuffer.buf (occurred after ~300 runs of the test). I've opened #779 with a possible fix.

This is a different cause of failure from the one @carolynvs reported. I'd love it if we could document the failing builds here to collect the other possible reasons this test fails.

And I'm still thinking of ways to figure out an alternative to the current test. 🤔
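
To illustrate the kind of race described for activityBuffer.buf, here is a minimal sketch of a buffer whose writes and reads are guarded by a mutex. This is not the fix from #779; the type safeActivityBuffer, its fields, and the helper writeConcurrently are hypothetical names used only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"time"
)

// safeActivityBuffer is a hypothetical stand-in for a buffer that records
// output and the time of the last write. It is not the type used in dep.
type safeActivityBuffer struct {
	mu           sync.Mutex
	buf          bytes.Buffer
	lastActivity time.Time
}

// Write appends p to the buffer and records the time, holding the lock so
// that a concurrent reader never observes a half-updated buffer.
func (b *safeActivityBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.lastActivity = time.Now()
	return b.buf.Write(p)
}

// String returns a copy of the buffered output under the same lock.
func (b *safeActivityBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

// writeConcurrently writes "foo\n" from n goroutines at once and returns
// the combined output. Without the mutex, `go test -race` would flag this.
func writeConcurrently(n int) string {
	var b safeActivityBuffer
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.Write([]byte("foo\n"))
		}()
	}
	wg.Wait()
	return b.String()
}

func main() {
	fmt.Print(writeConcurrently(4))
}
```

Running a test like this under the race detector (`go test -race`) is one way to surface unsynchronized access without waiting for hundreds of runs.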

@sdboyer
Member

sdboyer commented Jun 21, 2017

The current approach relies on unsynchronized clocks agreeing, so it's pretty much always going to be flaky. We need to figure out how to get rid of the clock in the invoked process if there's going to be any hope of making it not flaky.

@erizocosmico
Contributor

erizocosmico commented Jul 15, 2017

Maybe we should not check the output for specific values, which is what leads to this. I'm not sure if just checking whether the command timed out or not is enough, though, because then we lose the check of when the command timed out. Other than that, I can't think of a way to test this that does not rely on the clock 😞

@sdboyer
Member

sdboyer commented Jul 18, 2017

Well, it sounds like we need coordination between processes. Coordination means communication. That's probably best done with additional fds. If we replace echosleep with an implementation that expects two additional fds, then we can use those to coordinate IPC and exercise fine-grained control over the child process from the parent in order to simulate the desired behavior.
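
A rough sketch of what that fd-based coordination could look like, assuming a Unix-like system with `sh` available. The child here is a shell loop standing in for a replacement echosleep; the function name runCoordinated and the "tick"/"ack" protocol are made up for illustration. The parent passes two pipes via `cmd.ExtraFiles` (which become fd 3 and fd 4 in the child), so the child prints only when told to and no clock is involved.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// runCoordinated starts a child that prints one "foo" line for every tick
// it reads on fd 3, and acknowledges each tick on fd 4. The parent decides
// exactly how many lines are produced, independent of timing.
func runCoordinated(ticks int) (string, error) {
	ctlR, ctlW, err := os.Pipe() // parent -> child commands
	if err != nil {
		return "", err
	}
	ackR, ackW, err := os.Pipe() // child -> parent acknowledgements
	if err != nil {
		ctlR.Close()
		ctlW.Close()
		return "", err
	}
	defer ackR.Close()

	// Hypothetical stand-in for an echosleep replacement.
	script := `while read -r line <&3; do echo foo; echo ack >&4; done`
	cmd := exec.Command("sh", "-c", script)
	var out bytes.Buffer
	cmd.Stdout = &out
	cmd.ExtraFiles = []*os.File{ctlR, ackW} // fd 3 and fd 4 in the child

	err = cmd.Start()
	// The parent no longer needs the child's ends of the pipes.
	ctlR.Close()
	ackW.Close()
	if err != nil {
		ctlW.Close()
		return "", err
	}

	acks := bufio.NewReader(ackR)
	var loopErr error
	for i := 0; i < ticks; i++ {
		if _, loopErr = ctlW.WriteString("tick\n"); loopErr != nil {
			break
		}
		// Block until the child confirms it handled this tick.
		if _, loopErr = acks.ReadString('\n'); loopErr != nil {
			break
		}
	}
	ctlW.Close() // EOF on fd 3 ends the child's loop
	waitErr := cmd.Wait()
	if loopErr != nil {
		return out.String(), loopErr
	}
	return out.String(), waitErr
}

func main() {
	out, err := runCoordinated(4)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

In a test, the parent could withhold a tick to simulate an inactive child and check that the monitor times out, rather than relying on sleeps in the child lining up with timers in the parent.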

@sdboyer
Member

sdboyer commented Aug 9, 2017

It's a lot less flaky now - I'm gonna close this as good enough.

@sdboyer sdboyer closed this as completed Aug 9, 2017