Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(canary): Fix Flaky Tests #889

Merged
merged 3 commits into from
Aug 13, 2019
Merged

fix(canary): Fix Flaky Tests #889

merged 3 commits into from
Aug 13, 2019

Conversation

slim-bean
Copy link
Collaborator

the canary comparator tests are flaky because they assert some exact ordering on messages written to the output. I believe because certain validations are done in a goroutine this leads to inconsistent timing on their responses and flakes out the tests. I added a parameter to the comparator to make these lookups synchronous and hopefully fix this issue.

…void races around the entry order and test verifications
@slim-bean slim-bean requested a review from rfratto August 13, 2019 02:19
Copy link
Contributor

@sandeepsukhani sandeepsukhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rfratto
Copy link
Member

rfratto commented Aug 13, 2019

I can still reproduce some flakes if I simulate a slow machine, unfortunately.

  1. On a Mac, with XCode installed, open /Applications/Xcode.app/Contents/Applications/Instruments.app
  2. Open Instruments Preferences
  3. Go to the CPUs tab, disable Hardware Multi-Threading and set "Active Processor cores" to 1.
  4. Run the canary tests a bunch of times in a terminal: for i in `seq 100`; do go test -count=1 ./pkg/canary/...; done
  5. Restore the settings on the CPUs tab in Instrument Preferences to their original settings unless you love a very slow machine 🙂

I got this:

--- FAIL: TestEntryNeverReceived (0.10s)
    comparator_test.go:186: 
        	Error Trace:	comparator_test.go:186
        	Error:      	Not equal: 
        	            	expected: "out of order entry 2019-08-13 05:39:27.303961 -0400 EDT m=+0.027548861 was received before entries: [2019-08-13 05:39:27.302961 -0400 EDT m=+0.026548861]\nout of order entry 2019-08-13 05:39:27.305961 -0400 EDT m=+0.029548861 was received before entries: [2019-08-13 05:39:27.302961 -0400 EDT m=+0.026548861 2019-08-13 05:39:27.304961 -0400 EDT m=+0.028548861]\nwebsocket failed to receive entry 1565689167302961000 within 0.050000 seconds\nfailed to receive entry 1565689167302961000 within 0.050000 seconds\nwebsocket failed to receive entry 1565689167304961000 within 0.050000 seconds\n"
        	            	actual  : "out of order entry 2019-08-13 05:39:27.303961 -0400 EDT m=+0.027548861 was received before entries: [2019-08-13 05:39:27.302961 -0400 EDT m=+0.026548861]\nout of order entry 2019-08-13 05:39:27.305961 -0400 EDT m=+0.029548861 was received before entries: [2019-08-13 05:39:27.302961 -0400 EDT m=+0.026548861 2019-08-13 05:39:27.304961 -0400 EDT m=+0.028548861]\nwebsocket failed to receive entry 1565689167302961000 within 0.050000 seconds\nwebsocket failed to receive entry 1565689167304961000 within 0.050000 seconds\nfailed to receive entry 1565689167302961000 within 0.050000 seconds\n"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -3,4 +3,4 @@
        	            	 websocket failed to receive entry 1565689167302961000 within 0.050000 seconds
        	            	+websocket failed to receive entry 1565689167304961000 within 0.050000 seconds
        	            	 failed to receive entry 1565689167302961000 within 0.050000 seconds
        	            	-websocket failed to receive entry 1565689167304961000 within 0.050000 seconds
        	            	 
        	Test:       	TestEntryNeverReceived
FAIL
FAIL	github.com/grafana/loki/pkg/canary/comparator	0.221s

@sandeepsukhani
Copy link
Contributor

@rfratto did you try it with these changes? I haven't merged it yet.

@rfratto
Copy link
Member

rfratto commented Aug 13, 2019

@sandlis yeah, I checked out the PR locally

@slim-bean
Copy link
Collaborator Author

Depending on how much you slowed things down @rfratto you may have just exceeded some of the timings hard coded into the test.

Was it always the same test that failed?

//Wait a few maxWait intervals just to make sure all entries are expired and the async confirmMissing has completed
<-time.After(2 * maxWait)

Can you increase that wait and see if it fixes the failures?

@rfratto
Copy link
Member

rfratto commented Aug 13, 2019

It's always the same test that fails; it always passed when I increased maxWait to 500 * time.Millisecond

I tried lower settings (200ms, 300ms) but 500 seems to be the acceptable minimum for my specific machine.

Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, forgot to hit approve 🙂

@slim-bean slim-bean changed the title hopefully fix flaky tests fix(canary): Fix Flaky Tests Aug 13, 2019
@slim-bean slim-bean merged commit 173e8ea into master Aug 13, 2019
@slim-bean slim-bean deleted the fix-flaky-tests branch August 13, 2019 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants