TestPods_AddEphemeralContainer became flaky #280

Closed
roobre opened this issue Aug 5, 2023 · 4 comments · Fixed by #326

@roobre
Collaborator

roobre commented Aug 5, 2023

This test seems to be somewhat flaky, e.g.:

https://github.com/grafana/xk6-disruptor/actions/runs/5772040184/job/15646605631?pr=271

and

https://github.com/grafana/xk6-disruptor/actions/runs/5772040184/job/15646650382?pr=271

I have reproduced this locally on main as well, although it may take four or five runs to make it fail:

Long log
roobre@Archiroo  ±main ● ☸ kind-e2e-pod-disruptor
20:15:30 ~/Devel/xk6-disruptor $> go clean -testcache && make test
go test -race  ./...
?   	github.com/grafana/xk6-disruptor/cmd/agent	[no test files]
?   	github.com/grafana/xk6-disruptor/cmd/agent/commands	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/agent/protocol	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/internal/version	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/kubernetes	[no test files]
ok  	github.com/grafana/xk6-disruptor	0.076s
?   	github.com/grafana/xk6-disruptor/pkg/runtime/profiler	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/assertions	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/checks	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/cluster	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/deploy	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/fetch	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/fixtures	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/kubectl	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/kubernetes	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/grpc	[no test files]
ok  	github.com/grafana/xk6-disruptor/pkg/agent	5.031s
ok  	github.com/grafana/xk6-disruptor/pkg/agent/protocol/grpc	1.027s
ok  	github.com/grafana/xk6-disruptor/pkg/agent/protocol/http	0.031s
ok  	github.com/grafana/xk6-disruptor/pkg/api	0.087s
ok  	github.com/grafana/xk6-disruptor/pkg/disruptors	0.075s
ok  	github.com/grafana/xk6-disruptor/pkg/iptables	0.031s
ok  	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	5.084s
ok  	github.com/grafana/xk6-disruptor/pkg/runtime	0.027s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/cluster	0.023s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/command	0.024s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/kubernetes/namespace	0.039s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/grpc/dynamic	0.029s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/grpc/ping	0.029s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/kubernetes/builders	3.047s
ok  	github.com/grafana/xk6-disruptor/pkg/utils	5.051s

roobre@Archiroo  ±main ● ☸ kind-e2e-pod-disruptor
20:15:40 ~/Devel/xk6-disruptor $> go clean -testcache && make test
go test -race  ./...
?   	github.com/grafana/xk6-disruptor/cmd/agent	[no test files]
?   	github.com/grafana/xk6-disruptor/cmd/agent/commands	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/agent/protocol	[no test files]
ok  	github.com/grafana/xk6-disruptor	0.073s
?   	github.com/grafana/xk6-disruptor/pkg/internal/version	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/kubernetes	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/runtime/profiler	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/assertions	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/checks	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/cluster	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/deploy	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/fetch	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/fixtures	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/kubectl	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/kubernetes	[no test files]
?   	github.com/grafana/xk6-disruptor/pkg/testutils/grpc	[no test files]
ok  	github.com/grafana/xk6-disruptor/pkg/agent	5.029s
ok  	github.com/grafana/xk6-disruptor/pkg/agent/protocol/grpc	1.028s
ok  	github.com/grafana/xk6-disruptor/pkg/agent/protocol/http	0.029s
ok  	github.com/grafana/xk6-disruptor/pkg/api	0.092s
ok  	github.com/grafana/xk6-disruptor/pkg/disruptors	0.079s
ok  	github.com/grafana/xk6-disruptor/pkg/iptables	0.038s
--- FAIL: TestPods_AddEphemeralContainer (0.00s)
    --- FAIL: TestPods_AddEphemeralContainer/Create_ephemeral_container_waiting (5.02s)
        pods_test.go:223: unexpected error: ephemeral container for pod "test-pod" has not started after 5.000000s
FAIL
FAIL	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	5.065s
ok  	github.com/grafana/xk6-disruptor/pkg/runtime	0.033s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/cluster	0.026s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/command	0.026s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/e2e/kubernetes/namespace	0.039s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/grpc/dynamic	0.029s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/grpc/ping	0.029s
ok  	github.com/grafana/xk6-disruptor/pkg/testutils/kubernetes/builders	3.050s
ok  	github.com/grafana/xk6-disruptor/pkg/utils	5.052s
FAIL
make: *** [Makefile:50: test] Error 1

I initially thought increasing the timeout would help, but it still fails now and then even with 10 seconds:

--- FAIL: TestPods_AddEphemeralContainer (0.00s)
    --- FAIL: TestPods_AddEphemeralContainer/Create_ephemeral_container_waiting (10.02s)
        pods_test.go:223: unexpected error: ephemeral container for pod "test-pod" has not started after 10.000000s
FAIL
FAIL	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	10.076s

I suspect a minor race condition either in the test or in the observer logic.

@roobre roobre added the bug Something isn't working label Aug 5, 2023
@pablochacin
Collaborator

pablochacin commented Aug 7, 2023

I suspect of some minor race condition either on the test or on the observer logic.

Based on the discussion in this issue (not quite the same situation, because we are not using informers), it is more likely a problem with the implementation of Watch in the fake client, which sometimes misses a change.

Considering the results shown below, this is unlikely to be the reason.

@pablochacin
Collaborator

pablochacin commented Aug 7, 2023

Running only the failing test does not reproduce the failure, as seen below, so there seems to be interference between tests. As the tests do not share any resources, the most likely explanation is contention between goroutines, since the client starts one goroutine per observer.

for i in {1..100}; do go clean -testcache && go test ./pkg/kubernetes/helpers/ -run TestPods_AddEphemeralContainer/Create_ephemeral_container_waiting; done
ok  	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	0.022s
ok  	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	0.022s
...
ok  	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	0.019s

@roobre
Collaborator Author

roobre commented Aug 7, 2023

🤔 Interesting, I'll try to reproduce that. I would have expected that doubling the timeout to 10s would fix the issue if it was a matter of goroutines starving for resources.

@pablochacin pablochacin changed the title TestPods_AddEphemeralContainer became flaky TestPods_AddEphemeralContainer became flaky Aug 7, 2023
@pablochacin
Collaborator

Running all the subtests of TestPods_AddEphemeralContainer without parallel execution (by commenting out t.Parallel()) also seems to work reliably:

for i in {1..100}; do go clean -testcache && go test ./pkg/kubernetes/helpers/ -run TestPods_AddEphemeralContainer; done 
ok  	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	3.034s
ok  	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	3.024s
ok  	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	3.034s
...
ok  	github.com/grafana/xk6-disruptor/pkg/kubernetes/helpers	3.026s
