
Flaky Multikueue E2E tests #1658

Closed
alculquicondor opened this issue Jan 26, 2024 · 16 comments · Fixed by #1676
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@alculquicondor
Contributor

What happened:

It looks like the Kueue managers start properly, but somehow they crash later.
As a result, we observe:

Internal error occurred: failed calling webhook "mclusterqueue.kb.io": failed to call webhook: Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-kueue-x-k8s-io-v1beta1-clusterqueue?timeout=10s": dial tcp 10.96.160.164:443: connect: connection refused

In End To End MultiKueue Suite: kindest/node:v1.28.0: [It] MultiKueue when Creating a multikueue admission check Should run a job on worker if admitted

What you expected to happen:

Kueue managers to continue to run properly

How to reproduce it (as minimally and precisely as possible):

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1607/pull-kueue-test-e2e-main-1-28/1750942764473782272

Note that this PR only changes documentation, so the flakiness is definitely in the MultiKueue code.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@alculquicondor alculquicondor added the kind/bug Categorizes issue or PR as related to a bug. label Jan 26, 2024
@alculquicondor
Contributor Author

/assign @trasc
cc @mimowo

@tenzen-y
Member

I think this is a duplicate of #1649.
We can close one of these.

@tenzen-y
Member

/kind flaky

@k8s-ci-robot
Contributor

@tenzen-y: The label(s) kind/flaky cannot be applied, because the repository doesn't have them.

In response to this:

/kind flaky

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y
Member

/kind flake

@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Jan 26, 2024
@trasc
Contributor

trasc commented Jan 29, 2024

For this behavior, I would expect the kueue controller manager to have crashed, since before starting the suite we check that all the clusters are able to create a resource flavor.

So, very likely this is happening due to the heavy load in the MultiKueue case.

@mimowo
Contributor

mimowo commented Jan 30, 2024

As discussed under #1659, I ran experiments looping the e2e MultiKueue tests.

First, I was able to reproduce the issue locally with a failure rate of around 1 in 4, which is close to the rate on GitHub CI.

Second, when running against already-existing clusters I don't get any failures (37 passes in a row, interrupted manually), which suggests the issue occurs only during startup.

Third, with the following code change I eliminated the failures locally (30 passes in a row, still running):

// KueueReadyForTesting waits until the Kueue webhooks can serve requests by
// retrying the creation of a ResourceFlavor and a ClusterQueue.
func KueueReadyForTesting(ctx context.Context, client client.Client) {
	// Retry until the ResourceFlavor webhook accepts the request.
	resourceKueue := utiltesting.MakeResourceFlavor("default").Obj()
	gomega.Eventually(func() error {
		return client.Create(ctx, resourceKueue)
	}, StartUpTimeout, Interval).Should(gomega.Succeed())

	// Retry until the ClusterQueue webhook accepts the request.
	cqKueueTest := utiltesting.MakeClusterQueue("q1").
		ResourceGroup(
			*utiltesting.MakeFlavorQuotas("default").
				Resource(corev1.ResourceCPU, "1").
				Obj(),
		).
		Obj()

	gomega.Eventually(func() error {
		return client.Create(ctx, cqKueueTest)
	}, StartUpTimeout, Interval).Should(gomega.Succeed())

	// Clean up the probe objects so the tests start from a clean state.
	ExpectClusterQueueToBeDeleted(ctx, client, cqKueueTest, true)
	ExpectResourceFlavorToBeDeleted(ctx, client, resourceKueue, true)
}

This also suggests the issue occurs only on startup. Further, it suggests that for MultiKueue, where the system is under load, there might be a significant difference between when the ResourceFlavor webhooks and the ClusterQueue webhooks become functional. This also appears to explain why PR #1659 is stable.

IIUC there is another ongoing effort by @trasc to see if we can have a more generic solution: #1674.

@mimowo
Contributor

mimowo commented Jan 30, 2024

Ok, I got a failure on the 32nd loop, but on creating a LocalQueue, because the localqueue_webhook wasn't ready. This reinforces the statement that the webhooks become ready at different points in time. However, it also means that we would need to add creating LocalQueues to KueueReadyForTesting (and virtually every other object type).
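
For illustration, a minimal sketch of how KueueReadyForTesting could be extended to also exercise the LocalQueue webhook. It assumes a MakeLocalQueue wrapper analogous to the existing ones and an already-existing namespace (here "default") to create the queue in:

	// Sketch only: retry until the LocalQueue webhook accepts the request,
	// then clean up. MakeLocalQueue and the target namespace are assumptions.
	lqKueueTest := utiltesting.MakeLocalQueue("q1-local", "default").
		ClusterQueue("q1").
		Obj()

	gomega.Eventually(func() error {
		return client.Create(ctx, lqKueueTest)
	}, StartUpTimeout, Interval).Should(gomega.Succeed())

	gomega.Expect(client.Delete(ctx, lqKueueTest)).To(gomega.Succeed())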

@trasc
Contributor

trasc commented Jan 30, 2024

Ok, I got a failure on the 32nd loop, but on creating a LocalQueue, because the localqueue_webhook wasn't ready. This reinforces the statement that the webhooks become ready at different points in time. However, it also means that we would need to add creating LocalQueues to KueueReadyForTesting (and virtually every other object type).

What was the error?

@mimowo
Contributor

mimowo commented Jan 30, 2024

What was the error?

2024-01-30 15:06:10.751971   [FAILED] Expected success, but got an error:
2024-01-30 15:06:10.751998       <*errors.StatusError | 0xc000593360>: 
2024-01-30 15:06:10.752025       Internal error occurred: failed calling webhook "vlocalqueue.kb.io": failed to call webhook: Post "https://kueue-webhook-service.kueue-system.svc:443/validate-kueue-x-k8s-io-v1beta1-localqueue?timeout=10s": dial tcp 10.244.1.4:9443: connect: connection refused
2024-01-30 15:06:10.752054       {
2024-01-30 15:06:10.752083           ErrStatus: {
2024-01-30 15:06:10.752110               TypeMeta: {Kind: "", APIVersion: ""},
2024-01-30 15:06:10.752138               ListMeta: {
2024-01-30 15:06:10.752166                   SelfLink: "",
2024-01-30 15:06:10.752193                   ResourceVersion: "",
2024-01-30 15:06:10.752220                   Continue: "",
2024-01-30 15:06:10.752247                   RemainingItemCount: nil,
2024-01-30 15:06:10.752276               },
2024-01-30 15:06:10.752304               Status: "Failure",
2024-01-30 15:06:10.752331               Message: "Internal error occurred: failed calling webhook \"vlocalqueue.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.kueue-system.svc:443/validate-kueue-x-k8s-io-v1beta1-localqueue?timeout=10s\": dial tcp 10.244.1.4:9443: connect: connection refused",
2024-01-30 15:06:10.752360               Reason: "InternalError",
2024-01-30 15:06:10.752387               Details: {
2024-01-30 15:06:10.752414                   Name: "",
2024-01-30 15:06:10.752440                   Group: "",
2024-01-30 15:06:10.752467                   Kind: "",
2024-01-30 15:06:10.752494                   UID: "",
2024-01-30 15:06:10.752521                   Causes: [
2024-01-30 15:06:10.752548                       {
2024-01-30 15:06:10.752575                           Type: "",
2024-01-30 15:06:10.752602                           Message: "failed calling webhook \"vlocalqueue.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.kueue-system.svc:443/validate-kueue-x-k8s-io-v1beta1-localqueue?timeout=10s\": dial tcp 10.244.1.4:9443: connect: connection refused",
2024-01-30 15:06:10.752630                           Field: "",
2024-01-30 15:06:10.752657                       },
2024-01-30 15:06:10.752687                   ],
2024-01-30 15:06:10.752714                   RetryAfterSeconds: 0,
2024-01-30 15:06:10.752765               },
2024-01-30 15:06:10.752793               Code: 500,
2024-01-30 15:06:10.752820           },
2024-01-30 15:06:10.752846       }
2024-01-30 15:06:10.752873   In [BeforeEach] at: /.../src/sigs.k8s.io/kueue/test/e2e/multikueue/e2e_test.go:162 @ 01/30/24 15:06:08.444

@trasc
Contributor

trasc commented Jan 30, 2024

I expect connect: connection refused to be an L4 error and not be impacted by which handlers the webhook server has registered at some point.

@mimowo
Contributor

mimowo commented Jan 30, 2024

I expect connect: connection refused to be an L4 error and not be impacted by which handlers the webhook server has registered at some point.

I'm not sure I understand enough to follow up. Are you suggesting that this is not caused by the webhooks, or that there is a bug in the API server such that a 500 is returned in this case?

@alculquicondor
Contributor Author

I wonder if the connection refused happens because the caBundle is still not set in the ValidatingWebhookConfiguration or MutatingWebhookConfiguration objects.

That might be consistent with @mimowo's observations.

However, when we added the MK tests, we didn't increase resource requests in the E2E jobs, did we? Perhaps we can start there?
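
One way to check that hypothesis would be to assert, before the suite starts, that the caBundle has been injected into the webhook configurations. A minimal sketch using the controller-runtime client; the helper and the configuration name passed to it are assumptions, not necessarily what the Kueue manifests use:

import (
	"context"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// caBundleInjected reports whether every webhook in the named
// ValidatingWebhookConfiguration has a non-empty caBundle (hypothetical helper).
func caBundleInjected(ctx context.Context, c client.Client, name string) (bool, error) {
	var vwc admissionregistrationv1.ValidatingWebhookConfiguration
	if err := c.Get(ctx, client.ObjectKey{Name: name}, &vwc); err != nil {
		return false, err
	}
	for _, wh := range vwc.Webhooks {
		if len(wh.ClientConfig.CABundle) == 0 {
			return false, nil
		}
	}
	return true, nil
}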

@alculquicondor
Contributor Author

I think we should do both: #1674 and increasing the requests.

@mimowo
Contributor

mimowo commented Jan 31, 2024

I think I understand now what was happening; described here: #1659 (comment). Essentially, with 2 replicas running, the registered webhooks are distributed randomly between the two replicas. With KueueReadyForTesting we would only make sure a subset of the webhooks is working, but, if unlucky, some webhooks which are served by the other replica would still fail.
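
For illustration, a check that accounts for both replicas would require every endpoint behind the webhook Service to be ready before the suite starts. A minimal sketch; the Service namespace/name are taken from the error message above, while the helper and the expected endpoint count are assumptions:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// allWebhookEndpointsReady reports whether the webhook Service has at least
// `want` ready endpoints and no not-ready ones (hypothetical helper).
func allWebhookEndpointsReady(ctx context.Context, c client.Client, want int) (bool, error) {
	var eps corev1.Endpoints
	key := client.ObjectKey{Namespace: "kueue-system", Name: "kueue-webhook-service"}
	if err := c.Get(ctx, key, &eps); err != nil {
		return false, err
	}
	ready := 0
	for _, subset := range eps.Subsets {
		if len(subset.NotReadyAddresses) > 0 {
			return false, nil
		}
		ready += len(subset.Addresses)
	}
	return ready >= want, nil
}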

@mimowo
Contributor

mimowo commented Jan 31, 2024

I have also opened an alternative proposal using probes to wait for the ready replicas: #1676. It seems to pass consistently, but I'm going to test it more.
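
For illustration, a minimal sketch of what a probe-based wait could look like, polling the manager Deployment's ready replicas; the Deployment name/namespace and the helper name are assumptions, and this is not necessarily how #1676 implements it:

import (
	"context"
	"fmt"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForKueueReplicas polls the manager Deployment until all desired
// replicas report ready, i.e. their readiness probes pass (hypothetical helper).
func waitForKueueReplicas(ctx context.Context, c client.Client) {
	key := client.ObjectKey{Namespace: "kueue-system", Name: "kueue-controller-manager"}
	gomega.Eventually(func() error {
		var dep appsv1.Deployment
		if err := c.Get(ctx, key, &dep); err != nil {
			return err
		}
		want := int32(1)
		if dep.Spec.Replicas != nil {
			want = *dep.Spec.Replicas
		}
		if dep.Status.ReadyReplicas != want {
			return fmt.Errorf("%d/%d replicas ready", dep.Status.ReadyReplicas, want)
		}
		return nil
	}, StartUpTimeout, Interval).Should(gomega.Succeed())
}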
