Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some tests fail due to test timing issues when using -count=100 / 1000 #25

Closed
mochi-co opened this issue Jan 30, 2022 · 9 comments
Closed
Labels
bug Something isn't working

Comments

@mochi-co
Copy link
Collaborator

As per @rkennedy's comments in #24, various tests fail intermittently and typically exhibit when running go test ./... -count=100. Backtesting indicates these failures occur at least as early as v1.0.5. The failures can occur as few as 1 in every 500 runs. This is likely to be a problem with the tests rather than a problem with the broker code.

Determine the precise cause for these intermittent failures and correct it.

After running go test ./... -count=100 several times, the known failures are as follows:

writing err io: read/write on closed pipe
--- FAIL: TestServerResendClientInflightBackoff (0.01s)
    server_test.go:2096: 
                Error Trace:    server_test.go:2096
                Error:          Not equal: 
                                expected: []byte{0x3a, 0xe, 0x0, 0x5, 0x61, 0x2f, 0x62, 0x2f, 0x63, 0x0, 0xb, 0x68, 0x65, 0x6c, 0x6c, 0x6f}
                                actual  : []byte{}
                            
                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,3 +1,2 @@
                                -([]uint8) (len=16) {
                                - 00000000  3a 0e 00 05 61 2f 62 2f  63 00 0b 68 65 6c 6c 6f  |:...a/b/c..hello|
                                +([]uint8) {
                                 }
                Test:           TestServerResendClientInflightBackoff
writing err io: read/write on closed pipe
--- FAIL: TestServerResendClientInflightBackoff (0.00s)
    server_test.go:2105: 
                Error Trace:    server_test.go:2105
                Error:          Not equal: 
                                expected: 1
                                actual  : 2
                Test:           TestServerResendClientInflightBackoff
2022/01/30 10:16:33 error writing to buffer io.Writer; io: read/write on closed pipe
--- FAIL: TestServerWriteClient (0.01s)
    server_test.go:671: 
                Error Trace:    server_test.go:671
                Error:          Not equal: 
                                expected: []byte{0x70, 0x2, 0x0, 0xe}
                                actual  : []byte{}
                            
                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,3 +1,2 @@
                                -([]uint8) (len=4) {
                                - 00000000  70 02 00 0e                                       |p...|
                                +([]uint8) {
                                 }
                Test:           TestServerWriteClient
2022/01/30 10:19:33 error writing to buffer io.Writer; io: read/write on closed pipe
--- FAIL: TestServerCloseClientLWT (0.01s)
    server_test.go:1756: 
                Error Trace:    server_test.go:1756
                Error:          Not equal: 
                                expected: []byte{0x30, 0xc, 0x0, 0x5, 0x61, 0x2f, 0x62, 0x2f, 0x63, 0x68, 0x65, 0x6c, 0x6c, 0x6f}
                                actual  : []byte{}
                            
                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,3 +1,2 @@
                                -([]uint8) (len=14) {
                                - 00000000  30 0c 00 05 61 2f 62 2f  63 68 65 6c 6c 6f        |0...a/b/chello|
                                +([]uint8) {
                                 }
                Test:           TestServerCloseClientLWT

Please add any others to this issue if you see them.

@mochi-co mochi-co added the bug Something isn't working label Jan 30, 2022
@rkennedy
Copy link
Contributor

I've seen this a few times:

--- FAIL: TestServerServe (0.00s)
    server_test.go:148:
                Error Trace:    server_test.go:148
                Error:          Not equal:
                                expected: true
                                actual  : false
                Test:           TestServerServe

On my RHEL and macOS systems (both x86_64), I added the -race option to go test (not available on ARM32), and I see a few data races, mainly related to unsyncronized access to various fields of the system.Info struct. Perhaps that's a separate class of errors from the one tracked in this issue, though.

@jmacd
Copy link
Contributor

jmacd commented Mar 9, 2022

I believe all but one of these are addressed in #36.

@jmacd
Copy link
Contributor

jmacd commented Mar 14, 2022

On further investigation, there are two categories of test failure that are still happening here after #36. First, there are tests that use time.Sleep() that may break under heavy load, some of these will be difficult to fix. For example, in #38 a fix for
TestServerWriteClient() replaced a ReadAll() that might race with another Close() with a Read() specifically tailored to the expected 4 bytes (which makes Close uneccessary).

Second, there are tests that read atomic variables without using an atomic method., as noted above. Easy to fix.

@mochi-co
Copy link
Collaborator Author

I believe all but one of these are addressed in #36.

Which one do you feel is not addressed? If we can isolate any remaining issues it would be good to split them into a new ticket and close this one.

@jmacd
Copy link
Contributor

jmacd commented Mar 17, 2022

Sure! I ran overnight testing and was able to reproduce a few test failures. Every failure I looked into, however, just appeared to be a minor issue in the test, not a problem in the code.

I will be glad to file new, specific issues and close this.

@jmacd
Copy link
Contributor

jmacd commented Mar 17, 2022

Here's the most frequent one:

--- FAIL: TestServerResendClientInflightBackoff (0.00s)
    server_test.go:2208: 
        	Error Trace:	server_test.go:2208
        	Error:      	Not equal: 
        	            	expected: []byte{0x3a, 0xe, 0x0, 0x5, 0x61, 0x2f, 0x62, 0x2f, 0x63, 0x0, 0xb, 0x68, 0x65, 0x6c, 0x6c, 0x6f}
        	            	actual  : []byte{0x3a, 0xe, 0x0, 0x5, 0x61, 0x2f, 0x62, 0x2f, 0x63, 0x0, 0xb, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x3a, 0xe, 0x0, 0x5, 0x61, 0x2f, 0x62, 0x2f, 0x63, 0x0, 0xb, 0x68, 0x65, 0x6c, 0x6c, 0x6f}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,3 +1,4 @@
        	            	-([]uint8) (len=16) {
        	            	+([]uint8) (len=32) {
        	            	  00000000  3a 0e 00 05 61 2f 62 2f  63 00 0b 68 65 6c 6c 6f  |:...a/b/c..hello|
        	            	+ 00000010  3a 0e 00 05 61 2f 62 2f  63 00 0b 68 65 6c 6c 6f  |:...a/b/c..hello|
        	            	 }
        	Test:       	TestServerResendClientInflightBackoff
FAIL
FAIL	github.com/mochi-co/mqtt/server	334.052s
FAIL

If you repeat this test enough, the Resend backoff test will fail, resulting in two packets being sent. The test uses second-precision and a sleep of 1ms, so there's a small probability of sleeping so that time.Unix() shows a difference of 1, so that the first resend happens.

We can rename this issue to capture correcting all "flaky" tests related to rare timing events and keep working on them. In any case, I run go test -race -count=1000 -timeout=100h . and they're popping up pretty frequently.

@mochi-co mochi-co changed the title Various tests fail when using test -count=100 Some tests fail due to test timing issues when using -count=100 / 1000 Mar 22, 2022
@mochi-co
Copy link
Collaborator Author

This appears fixed in v1.2.0 but I'll leave it open just incase.

@mochi-co
Copy link
Collaborator Author

I am still seeing the TestServerResendClientInflightBackoff issue

@mochi-co
Copy link
Collaborator Author

Fixed in v1.3.0 due to a timing issue between writing and reading the err message from the close err atomic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants