
[Failing Test] periodic-kubernetes-unit-test-ppc64le, k8s.io/kubernetes/test/e2e/framework/internal/unittests/cleanup #112412

Closed
Rajalakshmi-Girish opened this issue Sep 13, 2022 · 16 comments · Fixed by #112416
Labels
kind/failing-test · needs-triage · sig/testing

Comments

@Rajalakshmi-Girish
Contributor

Which jobs are failing?

periodic-kubernetes-unit-test-ppc64le

Which tests are failing?

k8s.io/kubernetes/test/e2e/framework/internal/unittests/cleanup: TestCleanup

Since when has it been failing?

Probably since it was merged in #111998

Testgrid link

https://k8s-testgrid.appspot.com/ibm-unit-tests-ppc64le#Periodic%20unit%20test%20suite%20on%20ppc64le

Reason for failure (if possible)

No response

Anything else we need to know?

The error from the build log trace is:

 klog.go:874: ERROR Unable to remove endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /05ffbd08-d55c-4234-a474-890da0df1095/registry/masterleases/127.0.0.1, ResourceVersion: 0, AdditionalErrorMsg: 

Please see the job above for the complete build log.

Relevant SIG(s)

/sig testing

@Rajalakshmi-Girish added the kind/failing-test label on Sep 13, 2022
@k8s-ci-robot added the sig/testing and needs-triage labels on Sep 13, 2022
@k8s-ci-robot
Contributor

@Rajalakshmi-Girish: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Rajalakshmi-Girish
Contributor Author

@pohly This test has been consistently failing in our environment, not just flaking.
Also, we see it still failing even after your fix for #112388.
https://prow.k8s.io/view/gs/ppc64le-kubernetes/logs/periodic-kubernetes-unit-test-ppc64le/1569494093191450624 is the job that ran after the merge of #112389

@Rajalakshmi-Girish
Contributor Author

@aojea @mkumatag ^^

@mkumatag
Member

cc @pohly

@pohly
Contributor

pohly commented Sep 13, 2022

This is a data race in the control plane setup.

cc @aojea

@pohly
Contributor

pohly commented Sep 13, 2022

So yes, this is different from the failure that I ran into. It's still a flake (races don't always go wrong), but apparently the conditions in your jobs are such that it always triggers.

@aojea
Member

aojea commented Sep 13, 2022

I'll take care of this
/assign

@pohly
Contributor

pohly commented Sep 13, 2022

At first glance it looks like access to c.runner in pkg/controlplane/controller.go lacks protection by a mutex. But why are Start and Stop racing with each other in different goroutines, and why is Stop called first?
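For illustration only, here is a minimal sketch of the kind of mutex guard being discussed; the Controller and runner types below are hypothetical stand-ins, not the actual code in pkg/controlplane/controller.go:

// A sketch only (assumption: this is not the real controlplane controller);
// it illustrates guarding the runner field so Start and Stop cannot race.
package sketch

import "sync"

type runner struct {
    stopCh chan struct{}
}

type Controller struct {
    mu     sync.Mutex // guards runner against concurrent Start/Stop
    runner *runner
}

// Start creates the runner unless one is already active.
func (c *Controller) Start() {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.runner != nil {
        return // already started
    }
    c.runner = &runner{stopCh: make(chan struct{})}
}

// Stop is safe to call before Start, or more than once; it simply does nothing.
func (c *Controller) Stop() {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.runner == nil {
        return
    }
    close(c.runner.stopCh)
    c.runner = nil
}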

@aojea
Member

aojea commented Sep 13, 2022

in some executions it seems that it just times out, with no races

https://prow.k8s.io/view/gs/ppc64le-kubernetes/logs/periodic-kubernetes-unit-test-ppc64le/1569539392458985472

✖ test/e2e/framework/internal/unittests/cleanup (1m47.534s)

I think the problem is that this environment is under-resourced and the test is not able to boot the apiserver; that may explain the race and why Stop is racing with Start

@pohly
Contributor

pohly commented Sep 13, 2022

Should we perhaps make this unit test run only on amd64 architectures, with the rationale a) that it isn't architecture-dependent and thus testing it in one unit test job is sufficient and b) that it is an expensive test that stresses slower architectures too much?
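As a rough sketch of what such a restriction could look like, assuming a runtime.GOARCH guard in the test itself (the actual change in #112416 may well do this differently, e.g. via the CI job configuration or a build constraint):

package cleanup

import (
    "runtime"
    "testing"
)

func TestCleanup(t *testing.T) {
    if runtime.GOARCH != "amd64" {
        // The test is not architecture-dependent and spins up etcd plus a
        // test apiserver, which is expensive on slower architectures.
        t.Skipf("skipping on %s: only run on amd64", runtime.GOARCH)
    }
    // ... actual test body ...
}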

@aojea
Member

aojea commented Sep 13, 2022

klog.go:874: ERROR timed out waiting for caches to sync

the apiserver times out starting up; I think we can make it lighter with:

diff --git a/test/utils/apiserver/testapiserver.go b/test/utils/apiserver/testapiserver.go
index a40baf7bd7e..fdb6e751a63 100644
--- a/test/utils/apiserver/testapiserver.go
+++ b/test/utils/apiserver/testapiserver.go
@@ -51,7 +51,10 @@ func StartAPITestServer(t *testing.T) TestAPIServer {
        storageConfig := storagebackend.NewDefaultConfig(path.Join(uuid.New().String(), "registry"), nil)
        storageConfig.Transport.ServerList = etcdClient.Endpoints()
 
-       server := kubeapiservertesting.StartTestServerOrDie(t, nil, []string{}, storageConfig)
+       options := kubeapiservertesting.TestServerInstanceOptions{
+               EnableCertAuth: false,
+       }
+       server := kubeapiservertesting.StartTestServerOrDie(t, &options, []string{}, storageConfig)
        t.Cleanup(server.TearDownFn)
 
        clientSet := clientset.NewForConfigOrDie(server.ClientConfig)

Should we perhaps make this unit test run only on amd64 architectures

it sounds fair to me, since we use etcd and it is only guaranteed to build on the platforms in its support tiers:

https://etcd.io/docs/v3.5/op-guide/supported-platform/#support-tiers

@pohly do you want to do the honors? or should I?

@pohly
Contributor

pohly commented Sep 13, 2022

Can you do it? We probably want both, and you know best how to explain the patch above in the commit message.

@aojea
Member

aojea commented Sep 16, 2022

the apiserver times out starting up, I think we can make it lighter with

I thought about this twice, and I don't think we should tune the tests to run on constrained environments; that is a slippery slope. This environment is clearly constrained, and there are other issues reporting timeouts in it as well.
We should not tune tests just to make them pass.

The skip based on etcd not supported may be reasonable though

@kerthcet
Member

@aojea
Member

aojea commented Sep 19, 2022

this issue is for the ppc64le-specific problem; we should open a new one for this test

https://storage.googleapis.com/k8s-triage/index.html?ci=0&pr=1&test=TestCleanup

it seems there were 2 more occurrences

@kerthcet
Member

For tracking, I opened a new issue: #112569.
