
E2E tests startup is flaky #1372

Closed
PBundyra opened this issue Nov 28, 2023 · 20 comments · Fixed by #1491
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@PBundyra
Contributor

During e2e tests, the webhooks are not ready, which results in failing e2e tests.

What happened:
The E2E test failed while testing Kueue readiness.

What you expected to happen:
E2E test to succeed

How to reproduce it (as minimally and precisely as possible):
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1365/pull-kueue-test-e2e-main-1-28/1729486888244350976

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@PBundyra PBundyra added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2023
@PBundyra
Contributor Author

/kind flake

@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Nov 28, 2023
@mimowo
Contributor

mimowo commented Nov 30, 2023

Looks like it flaked again: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1378/pull-kueue-test-e2e-main-1-26/1730154566130864128

[BeforeSuite]
/home/prow/go/src/sigs.k8s.io/kueue/test/e2e/suite_test.go:94
  [FAILED] in [BeforeSuite] - /home/prow/go/src/sigs.k8s.io/kueue/test/e2e/suite_test.go:90 @ 11/28/23 13:12:39.891
[BeforeSuite] [FAILED] [60.003 seconds]
[BeforeSuite]
/home/prow/go/src/sigs.k8s.io/kueue/test/e2e/suite_test.go:94

  [FAILED] Timed out after 60.000s.
  Expected success, but got an error:
      <*errors.StatusError | 0xc00022e0a0>: 
      Internal error occurred: failed calling webhook "mresourceflavor.kb.io": failed to call webhook: Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-kueue-x-k8s-io-v1beta1-resourceflavor?timeout=10s": dial tcp 10.244.2.2:9443: connect: connection refused
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Internal error occurred: failed calling webhook \"mresourceflavor.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.kueue-system.svc:443/mutate-kueue-x-k8s-io-v1beta1-resourceflavor?timeout=10s\": dial tcp 10.244.2.2:9443: connect: connection refused",
              Reason: "InternalError",
              Details: {
                  Name: "",
                  Group: "",
                  Kind: "",
                  UID: "",
                  Causes: [
                      {
                          Type: "",
                          Message: "failed calling webhook \"mresourceflavor.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.kueue-system.svc:443/mutate-kueue-x-k8s-io-v1beta1-resourceflavor?timeout=10s\": dial tcp 10.244.2.2:9443: connect: connection refused",
                          Field: "",
                      },
                  ],
                  RetryAfterSeconds: 0,
              },
              Code: 500,
          },
      }

@mimowo
Contributor

mimowo commented Dec 1, 2023

@alculquicondor
Contributor

Uhm... I wonder if we need to increase the timeout or the CPU/memory of the job.

@tenzen-y
Member

tenzen-y commented Dec 4, 2023

According to the Grafana dashboard, the resource requests seem to be sufficient. The max CPU usage is 8.27 cores, and the max memory usage is 1.69 GiB.

https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&from=now-7d&to=now&var-org=kubernetes-sigs&var-repo=kueue&var-job=pull-kueue-test-e2e-main-1-26&refresh=30s&var-build=All

@tenzen-y
Member

tenzen-y commented Dec 4, 2023

I'm wondering if there are noisy neighbors 🤔

@alculquicondor
Contributor

Should we just increase the timeout for waiting for the webhook?

@tenzen-y
Member

tenzen-y commented Dec 8, 2023

Should we just increase the timeout for waiting for the webhook?

I think so, since the jobs seem to have sufficient resource requests.
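
For context, here is a minimal sketch of what that webhook wait amounts to, assuming controller-runtime's client and Kueue's v1beta1 API (this is not the actual suite_test.go code, and the object name and the 2-minute window are illustrative): the suite retries a webhook-gated create until it succeeds, so "increasing the timeout" means widening the Eventually window.

package e2e

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// waitForKueueAvailability is a hypothetical helper, not code from the repo:
// it keeps creating (and deleting) a ResourceFlavor until the mutating
// webhook answers, within an illustrative 2-minute window.
func waitForKueueAvailability(ctx context.Context, k8sClient client.Client) {
	rf := &kueue.ResourceFlavor{ObjectMeta: metav1.ObjectMeta{Name: "e2e-readiness-probe"}}
	gomega.Eventually(func() error {
		// The create only succeeds once the webhook endpoint is reachable.
		if err := k8sClient.Create(ctx, rf); err != nil {
			return err
		}
		return k8sClient.Delete(ctx, rf)
	}, 2*time.Minute, time.Second).Should(gomega.Succeed())
}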

@alculquicondor
Contributor

It looks like jobset is still having issues.

We should start by increasing the visibility into what is pending. Perhaps obtaining the status of the kueue-controller Pod is enough.
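
As a rough illustration of that kind of visibility, a hypothetical helper along these lines (the kueue-system namespace is real, but the label selector, client name, and function are assumptions rather than existing code) could dump the controller Pod status whenever the readiness wait fails:

package e2e

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// dumpKueueControllerStatus prints the phase and conditions of the Kueue
// controller Pod(s); "control-plane: controller-manager" is an assumed label
// selector for the kueue-controller-manager Deployment.
func dumpKueueControllerStatus(ctx context.Context, k8sClient client.Client) {
	pods := &corev1.PodList{}
	if err := k8sClient.List(ctx, pods,
		client.InNamespace("kueue-system"),
		client.MatchingLabels{"control-plane": "controller-manager"}); err != nil {
		fmt.Printf("failed to list kueue pods: %v\n", err)
		return
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s phase=%s\n", p.Name, p.Status.Phase)
		for _, c := range p.Status.Conditions {
			fmt.Printf("  %s=%s reason=%q message=%q\n", c.Type, c.Status, c.Reason, c.Message)
		}
	}
}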

@alculquicondor
Contributor

I'll take this for now
/assign

@alculquicondor
Contributor

Comparing the kueue logs from a successful run with an unsuccessful run, something that catches my eye is the absence of the line "CA certs are injected to webhooks" in the failing one.

@alculquicondor
Contributor

The first relevant log line to be absent is
https://github.com/open-policy-agent/cert-controller/blob/01a9f146b6f1c2f5ed6d496c4ca78a84bde4325b/pkg/rotator/rotator.go#L857C32-L857C32

@astefanutti does this give any clues? What should be generating this file? What does it depend on?

@alculquicondor
Contributor

When the certs do become ready, it takes no more than 5 seconds from the first kueue log line. So a greater timeout is unlikely to fix the issue.

I'm currently testing cert-controller 0.9 in an attempt to narrow down a culprit #1489

@alculquicondor
Contributor

The E2E test is also failing with 0.9. I wanted to try 0.7, but some dependency issues prevented me from doing so.

I'm curious about whether the issue is related to open-policy-agent/cert-controller#81

@alculquicondor
Contributor

Uhm... I don't think so. I do see the message:

server certs refreshed

This means an initial write was sent to the apiserver. What should happen next is that the kubelet observes the Secret update and refreshes the mounted file.

@astefanutti
Member

astefanutti commented Dec 19, 2023

The first relevant log line to be absent is https://github.com/open-policy-agent/cert-controller/blob/01a9f146b6f1c2f5ed6d496c4ca78a84bde4325b/pkg/rotator/rotator.go#L857C32-L857C32

@astefanutti does this give any clues? What should be generating this file? What does it depend on?

@alculquicondor That file has the following lifecycle (simplified to the relevant bits):

  1. The webhook-server-cert Secret is mounted into the Kueue Deployment at /tmp/k8s-webhook-server/serving-certs
  2. Concurrently:
    i. cert-controller generates the certificates and updates the webhook-server-cert Secret
      a. the kubelet propagates the update into the Secret volume mount from step 1
      b. cert-controller checks that the server certificate is mounted and closes the certsMounted channel
    ii. cert-controller injects the CA certificate into the admission webhooks and closes the IsReady channel (provided certsMounted is also closed)
  3. The Kueue webhooks are started along with the other controllers (gated by the IsReady channel)

In the failing runs, it's step 2.i.b that fails to proceed. I initially thought it could be the kubelet taking too long to propagate the Secret update to the volume mount (depending on the configMapAndSecretChangeDetectionStrategy kubelet configuration), but in that case the ensureCertsMounted method should print "max retries for checking certs existence".

I still fail to understand what could break what seem to be the two invariants of the ensureCertsMounted method, namely that it eventually logs either:

  • certs are ready in /tmp/k8s-webhook-server/serving-certs
  • or max retries for checking certs existence

I'll keep digging, but it may be more productive to create a "DO NOT MERGE" PR with some added data points that could help nail down why the ensureCertsMounted invariants are broken. How does that sound to you?
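
For readers less familiar with that flow, here is a rough Go sketch of the gating described in steps 2–3 above, using open-policy-agent/cert-controller's rotator package; the webhook configuration names, the setupWebhooks callback, and the other identifiers are assumptions, not Kueue's actual main.go.

package setup

import (
	"github.com/open-policy-agent/cert-controller/pkg/rotator"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
)

// setupWithCertRotation wires cert rotation ahead of webhook registration:
// the rotator updates the webhook-server-cert Secret and injects the CA into
// the webhook configurations, and only closes certsReady once it has also
// observed the rotated certs on disk (ensureCertsMounted).
func setupWithCertRotation(mgr ctrl.Manager, setupWebhooks func(ctrl.Manager) error) error {
	certsReady := make(chan struct{})
	if err := rotator.AddRotator(mgr, &rotator.CertRotator{
		SecretKey: types.NamespacedName{Namespace: "kueue-system", Name: "webhook-server-cert"},
		CertDir:   "/tmp/k8s-webhook-server/serving-certs",
		CAName:    "kueue-ca",
		DNSName:   "kueue-webhook-service.kueue-system.svc",
		IsReady:   certsReady, // closed after CA injection + certs observed on disk
		Webhooks: []rotator.WebhookInfo{
			// Assumed webhook configuration names.
			{Type: rotator.Mutating, Name: "kueue-mutating-webhook-configuration"},
			{Type: rotator.Validating, Name: "kueue-validating-webhook-configuration"},
		},
	}); err != nil {
		return err
	}
	// Step 3: register the webhooks only once the certificates are usable.
	go func() {
		<-certsReady
		if err := setupWebhooks(mgr); err != nil {
			panic(err) // sketch only; real code would log and exit cleanly
		}
	}()
	return mgr.Start(ctrl.SetupSignalHandler())
}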

@astefanutti
Member

I realise that, with the exponential backoff configuration used in the ensureCertsMounted method:

wait.Backoff{Duration: 1 * time.Second, Factor: 2, Jitter: 1, Steps: 10}

the tests time out before the backoff in ensureCertsMounted does.

So it's still a possibility the update made to the Secret by cert-controller isn't propagated as expected into the volume mount by the kubelet.
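
For reference, a quick arithmetic check of that window (a standalone sketch, not code from cert-controller): with Duration: 1s, Factor: 2, Steps: 10 the base sleeps are 1, 2, 4, ..., 512 seconds, roughly 17 minutes in total, and jitter only adds on top, so the 60-second test timeout fires long before the backoff gives up.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Sum the base delays of the quoted wait.Backoff: the duration doubles
	// each step (Factor: 2) for Steps: 10 iterations; Jitter only lengthens them.
	d, total := 1*time.Second, time.Duration(0)
	for step := 0; step < 10; step++ {
		total += d
		d *= 2
	}
	fmt.Println(total) // 17m3s, far beyond the 60s BeforeSuite timeout
}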

@astefanutti
Member

I've just found https://ahmet.im/blog/kubernetes-secret-volumes-delay/, which seems to corroborate the hypothesis that the Secret update takes a long time to propagate into the volume mount.

@alculquicondor Based on that article, to validate the test timeout increase approach as in #1491, a value as long as two minutes might be needed.

@alculquicondor
Contributor

That article is very useful, thanks!

Yes, it sounds like 2 minutes should be the target (how annoying).

Otherwise, I wonder if I can trigger a Pod status update (and thus a kubelet sync) by using the readiness probes more smartly, but I'll leave that as a follow-up. It sounds like something like this might already be happening most of the time, as the Secret is updated within 5s when the test passes.
