Fix flaky TestJoinMembersWithRetryBackoff #412

dimitarvdimitrov
Contributor

```
    memberlist_client_test.go:898:
        	Error Trace:	/Users/dimitar/grafana/dskit/kv/memberlist/memberlist_client_test.go:898
        	            				/Users/dimitar/grafana/dskit/kv/memberlist/asm_arm64.s:1197
        	Error:      	Received unexpected error:
        	            	Member-2 failed to join the cluster: &{[%!f(*errors.errorString=&{Failed to join 127.0.0.1:57493: dial tcp 127.0.0.1:57493: connect: connection refused})] %!f(multierror.ErrorFormatFunc=<nil>)}
        	Test:       	TestJoinMembersWithRetryBackoff/Test_late_start_of_DNS_service
```

The failure was caused by the test clients starting and trying to connect to a KV that had not started yet.
This is unrelated to the backoff that the test asserts on (the backoff effectively happens in the call `err = services.StartAndAwaitRunning(context.Background(), mkv)`).

Contributor

@56quarters left a comment

Nice find!

@@ -894,10 +895,12 @@ func TestJoinMembersWithRetryBackoff(t *testing.T) {
    if err != nil {
    	t.Errorf("failed to start KV: %v", err)
    }
    kvsStarted.Done()
Member

We should probably use `defer kvsStarted.Done()` in this function, to unblock the main test goroutine even if starting the KVs fails.

Contributor Author

> we can run `kvsStarted.Done()` before running `runClient`. And `t.Errorf` continues the execution of the test

Don't we want to abort the test if a KV cannot start? I changed it so that we abort if the KV store cannot start, and we call `kvsStarted.Done()` regardless of whether the KV store started successfully or failed. See da2d513

Member

Thanks, I think that should do it.

@dimitarvdimitrov dimitarvdimitrov merged commit 151f33b into main Oct 26, 2023
3 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/memberlist/fix-flaky-TestJoinMembersWithRetryBackoff branch October 26, 2023 17:09
ying-jeanne pushed a commit that referenced this pull request Nov 2, 2023
* Fix flaky `TestJoinMembersWithRetryBackoff`

* Clean up logger

* abort running client if kv starting fails