
Fix Kueue startup by waiting for webhooks server using probes #1676

Merged

Conversation

mimowo
Contributor

@mimowo mimowo commented Jan 31, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

To fix the flaky test.
To also let users wait for the Kueue webhooks to be ready without the need to create / delete objects as in KueueReadyForTesting.

Which issue(s) this PR fixes:

Fixes #1658

Special notes for your reviewer:

Tested locally, trying to see what will happen on GH CI.

Does this PR introduce a user-facing change?

Kueue replicas are advertised as Ready only once the webhooks are functional.

This allows users to delay their first requests until the Kueue deployment is available, so that the early requests don't fail.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Jan 31, 2024

netlify bot commented Jan 31, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: d2e420b
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65bbadbc7ccacf0009d468e2

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 31, 2024
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 31, 2024
@mimowo mimowo force-pushed the wait-for-webhooks-using-probes branch from ccd05b6 to 2c7918f Compare January 31, 2024 08:49
@mimowo
Contributor Author

mimowo commented Jan 31, 2024

/cc @trasc @alculquicondor

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 31, 2024
@mimowo
Contributor Author

mimowo commented Jan 31, 2024

All green, retrying to get more signal:
/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo mimowo changed the title from "WIP: Wait for webhooks server using probes [TESTING]" to "WIP: Fix Kueue startup by waiting for webhooks server using probes [TESTING]" Jan 31, 2024
@trasc
Contributor

trasc commented Jan 31, 2024

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

cmd/kueue/main.go Outdated
cmd/kueue/main.go Outdated
return mgr.GetWebhookServer().StartedChecker()(req)
}

if err := mgr.AddHealthzCheck("healthz", readyz); err != nil {
Contributor

why change the healthz?

Contributor Author

I'm not sure the change to healthz is required. I will test if this can be simplified. Do you know what healthz is used for?

Contributor Author

Good call, reverted setting it, going to test.

@@ -55,5 +55,6 @@ function cluster_kueue_deploy {
else
kubectl apply --server-side -k test/e2e/config
fi
kubectl wait --for=condition=available --timeout=3m deployment/kueue-controller-manager -n kueue-system
Contributor

Can we do this somewhere else?

It could help if, in the case of MultiKueue, we deploy in the manager and, instead of waiting for it to be ready now, we also deploy in the workers and then check availability for all of them later.

Contributor Author

I see, but this is a simple way of waiting for Kueue to be ready. I imagine users would like to use it in their setup scripts. Typically it takes a couple of seconds.

One alternative is to revert the removal of KueueReadyForTesting, but I don't like it, because it creates and deletes objects; this takes time and could mask some issues (as it actually does currently in the single-cluster case). In fact, in this PR I wanted to remove it to verify that it is not necessary.

Other alternatives I see are to modify KueueReadyForTesting to check the deployment status, as this line does, or to run the entire setup of the different clusters for MultiKueue in parallel at the script level.

Contributor

+1 on replacing KueueReadyForTesting with an availability check for the deployment.

Contributor

Let's leave this for a follow up.

Contributor Author

Sure, but I'm not sure which option is preferred as a follow-up. Is it returning to KueueReadyForTesting, but waiting for the deployment to be available, as with this command line?
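One possible shape for that follow-up is sketched below: a hypothetical test helper (the name WaitForKueueAvailable, its package, and the polling parameters are illustrative assumptions, not Kueue's actual test utilities) that polls the kueue-controller-manager Deployment for the Available condition, mirroring the kubectl wait call in the diff above.

```go
package util

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WaitForKueueAvailable polls the kueue-controller-manager Deployment until its
// Available condition is True, roughly equivalent to:
//   kubectl wait --for=condition=available --timeout=3m deployment/kueue-controller-manager -n kueue-system
func WaitForKueueAvailable(ctx context.Context, c client.Client) error {
	key := client.ObjectKey{Namespace: "kueue-system", Name: "kueue-controller-manager"}
	return wait.PollUntilContextTimeout(ctx, time.Second, 3*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			var dep appsv1.Deployment
			if err := c.Get(ctx, key, &dep); err != nil {
				// Keep polling on transient errors (e.g. the Deployment is not created yet).
				return false, nil
			}
			for _, cond := range dep.Status.Conditions {
				if cond.Type == appsv1.DeploymentAvailable && cond.Status == corev1.ConditionTrue {
					return true, nil
				}
			}
			return false, nil
		})
}
```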

@trasc
Contributor

trasc commented Jan 31, 2024

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

Comment on lines +282 to +284
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	return mgr.GetWebhookServer().StartedChecker()(req)
}); err != nil {
Contributor

Suggested change
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	return mgr.GetWebhookServer().StartedChecker()(req)
}); err != nil {
if err := mgr.AddReadyzCheck("readyz", mgr.GetWebhookServer().StartedChecker()); err != nil {

Contributor

or could the GetWebhookServer() return different servers on different calls?

Contributor Author

I cannot do this; then the replicas never become ready. The function needs to get the webhook server once it is ready. I think the webhook server is set up asynchronously in controller-runtime.

Contributor

... very strange, maybe the cert manager is doing it...
Maybe add a comment about this if confirmed.

All other references in sigs are using it directly:
https://github.com/search?q=org%3Akubernetes-sigs%20StartedChecker&type=code

Contributor Author

I think it is indeed about the certs not being ready. One of the projects actually solves this problem too: https://github.com/kubernetes-sigs/hierarchical-namespaces/blob/d367fc08a261135b63d22aeb01c688317d9b7e02/cmd/manager/main.go#L296-L307.

There they wait explicitly for the certs to be ready; I guess we could do the same so that we don't get the transient errors logged. WDYT?

Contributor Author

@mimowo mimowo Feb 1, 2024

> GetWebhookServer().StartedChecker is what we need to determine the readiness.
>
> I'm only not sure why we need a closure for calling it, since from what I can see the field holding the webhook is private and only settable with an option when creating the manager; however it's not a huge problem.

TBH, I'm not sure why either. What I verified experimentally is that if I just call GetWebhookServer before WaitForCertsReady then the webhook is not initialized, and it stays that way.

I then get the following logs, even if I keep the readyz probe as it is currently, but just invoke GetWebhookServer() before registering the probe:

2024-01-31T19:49:50.466048439Z	INFO	setup	kueue/main.go:291	readyz	{"ws": {"Options":{"Host":"","Port":9443,"CertDir":"/tmp/k8s-webhook-server/serving-certs","CertName":"tls.crt","KeyName":"tls.key","ClientCAName":"","TLSOpts":null,"WebhookMux":{}}}, "err": "webhook server has not been started yet"}

I suspect this has something to do with the one-time registration of the not-yet-initialized webhook server as a runnable here: https://github.com/kubernetes-sigs/controller-runtime/blob/73519a939e27c8465c6c49694356cbd66daf1677/pkg/manager/internal.go#L258.

However, I'm not sure of the details. I would probably need to investigate further with a debugger. Let me know if you have some ideas on how to get to the bottom of it.

Let me put in the comment what I know so far.

and

> > , I guess we could do the same so that we don't get the transient errors logged.
>
> What kind of errors?

I meant transient errors before the webhook server is listening, but I didn't know how they look in practice before writing this. I was thinking maybe we get many errors about the certs not being ready. From what I can see we get these errors:

2024-01-31T13:09:29.832263236Z stderr F 2024-01-31T13:09:29.832081013Z	DEBUG	controller-runtime.healthz	healthz/healthz.go:60	healthz check failed	{"checker": "readyz", "error": "webhook server has not been started yet"}

because the checker checks the started field before attempting the connection. So there is only a small window where we can get transient errors, between setting started: true and actually listening: https://github.com/kubernetes-sigs/controller-runtime/blob/8475c55f3e00e5611de0880eccd785efa85e8e38/pkg/webhook/server.go#L261-L263. However, I have never seen these errors so far.
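For context, the checker being registered here works in two phases, roughly as sketched below. This is a simplified, self-contained illustration of the logic described in this thread (a started flag checked first, then a TLS dial to confirm the port is listening); it is not controller-runtime's actual StartedChecker implementation, and the names are illustrative.

```go
package probe

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"net/http"
	"sync"
	"time"
)

// startedChecker is an illustrative stand-in for the readiness logic discussed
// above: report an error until the started flag is set, and only then dial the
// webhook port to confirm it is actually accepting TLS connections.
func startedChecker(mu *sync.Mutex, started *bool, addr string) func(*http.Request) error {
	return func(_ *http.Request) error {
		mu.Lock()
		defer mu.Unlock()
		if !*started {
			// This is the message seen in the logs quoted above.
			return errors.New("webhook server has not been started yet")
		}
		d := &net.Dialer{Timeout: 10 * time.Second}
		conn, err := tls.DialWithDialer(d, "tcp", addr, &tls.Config{InsecureSkipVerify: true})
		if err != nil {
			return fmt.Errorf("webhook server is not reachable: %w", err)
		}
		return conn.Close()
	}
}
```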

Contributor

> I have never seen these errors so far.

It's normal behavior when the pod is not yet ready (previously we were "ready" from the start).

Maybe something like:

Suggested change
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	return mgr.GetWebhookServer().StartedChecker()(req)
}); err != nil {
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	select {
	case <-certsReady:
		return mgr.GetWebhookServer().StartedChecker()(req)
	case <-req.Context().Done():
		return errors.New("request canceled")
	}
}); err != nil {

Could decrease the number of messages.

But I find it a bit odd to block the HTTP handler.

Contributor Author

@mimowo mimowo Feb 1, 2024

I was thinking about waiting explicitly for certsReady (as is done in the hierarchical-namespaces project pointed out above).

However, I don't think it improves the messages, because we already effectively wait for the certs to be ready: mgr.GetWebhookServer().StartedChecker() first checks if started=true (and started=true implicitly means that the certs are ready). If the certs aren't ready we get the nice message: {"checker": "readyz", "error": "webhook server has not been started yet"}. I think this is enough, and there is no reason to complicate the code.

Contributor Author

@mimowo mimowo Feb 1, 2024

Ok, I think I have some new insights into:

> I'm only not sure why we need a closure for calling it, since from what I can see the field holding the webhook is private and only settable with an option when creating the manager; however it's not a huge problem.

If we call GetWebhookServer early, then the not-fully-initialized webhook server is registered as a runnable here: https://github.com/kubernetes-sigs/controller-runtime/blob/73519a939e27c8465c6c49694356cbd66daf1677/pkg/manager/internal.go#L257.

When the manager starts, it starts all runnables, including the webhook server, but that fails; in this case I see in the logs, early on:

2024-02-01T12:44:13.308211223Z	INFO	controller-runtime.webhook	webhook/server.go:191	Starting webhook server
2024-02-01T12:44:13.40926745Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-02-01T12:44:13.40932954Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-02-01T12:44:13.40933832Z	INFO	manager/internal.go:516	Stopping and waiting for non leader election runnables

Note the presence of "Stopping and waiting for non leader election runnables" just after the start.

OTOH (as in this PR), if we delay calling GetWebhookServer, then the webhook server runnable is added only after the certs are ready, and only then is the webhook server started, so it starts successfully. Then the "Starting webhook server" log appears much later (after the certs are ready).
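To summarize the ordering that makes this work, here is a hedged sketch (the names certsReady and setupWebhooks, and the use of ctrl.Log, are assumptions for illustration; this is not code copied from Kueue's main.go): the readyz check only touches GetWebhookServer lazily inside a closure, while the first eager call happens in a goroutine only after the certs are ready.

```go
package setup

import (
	"net/http"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

// setupProbeAndWebhooks sketches the ordering described above.
func setupProbeAndWebhooks(mgr ctrl.Manager, certsReady <-chan struct{}, setupWebhooks func(ctrl.Manager)) {
	// The readyz check wraps GetWebhookServer in a closure, so the webhook server
	// is not fetched (and its runnable is not registered) at probe-registration
	// time; the closure only runs when the kubelet probes /readyz.
	if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
		return mgr.GetWebhookServer().StartedChecker()(req)
	}); err != nil {
		ctrl.Log.Error(err, "unable to set up ready check")
		os.Exit(1)
	}

	// The first eager call to GetWebhookServer happens only after the certs are
	// ready, so the webhook server runnable is added late enough to start successfully.
	go func() {
		<-certsReady
		setupWebhooks(mgr) // registers webhooks; internally calls mgr.GetWebhookServer()
	}()
}
```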

Contributor Author

slightly polished the comment

@mimowo
Contributor Author

mimowo commented Jan 31, 2024

All green, retrying to get more signal:
/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo
Contributor Author

mimowo commented Jan 31, 2024

All green, retrying to get more signal:
/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo mimowo changed the title from "WIP: Fix Kueue startup by waiting for webhooks server using probes [TESTING]" to "Fix Kueue startup by waiting for webhooks server using probes" Jan 31, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2024
@mimowo mimowo force-pushed the wait-for-webhooks-using-probes branch from 2b70e0c to d2e420b Compare February 1, 2024 14:42
Contributor

@alculquicondor alculquicondor left a comment


/lgtm
/approve
/hold
for a couple more re-runs just in case.


@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 672838e0922a62f82b328cd35c211a88005ba5c6

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2024
@alculquicondor
Contributor

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@alculquicondor
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot merged commit a8c11f3 into kubernetes-sigs:main Feb 1, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 2, 2024
@mimowo
Contributor Author

mimowo commented Feb 2, 2024

@alculquicondor @tenzen-y
I have added a release note, because this is actually a user-facing change IMO. Previously it was really hard to tell when Kueue can accept early requests, especially in the HA mode. Should we consider cherry-picking?

@tenzen-y
Member

tenzen-y commented Feb 2, 2024

> @alculquicondor @tenzen-y I have added a release note, because this is actually a user-facing change IMO. Previously it was really hard to tell when Kueue can accept early requests, especially in the HA mode. Should we consider cherry-picking?

I'm fine with cherry-picking.
This patch would be helpful for both HA and non-HA users. Without this patch, we need to manually check whether the manager has started, since the readiness probe server starts even though the manager is not yet ready.

@alculquicondor
Contributor

alculquicondor commented Feb 2, 2024 via email

Successfully merging this pull request may close these issues.

Flaky Multikueue E2E tests
5 participants