
Fix Kueue startup by waiting for webhooks server using probes #1676

Merged

Conversation

mimowo
Contributor

@mimowo mimowo commented Jan 31, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

To fix the flaky test.
To also let users wait for the Kueue webhooks to be ready without the need to create / delete objects as in KueueReadyForTesting.

Which issue(s) this PR fixes:

Fixes #1658

Special notes for your reviewer:

Tested locally, trying to see what will happen on GH CI.

Does this PR introduce a user-facing change?

Kueue replicas are advertised as Ready only once the webhooks are functional.

This allows users to delay their first requests until the Kueue deployment is available, so that the early requests don't fail.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Jan 31, 2024

netlify bot commented Jan 31, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: d2e420b
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65bbadbc7ccacf0009d468e2

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 31, 2024
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 31, 2024
@mimowo mimowo force-pushed the wait-for-webhooks-using-probes branch from ccd05b6 to 2c7918f Compare January 31, 2024 08:49
@mimowo
Contributor Author

mimowo commented Jan 31, 2024

/cc @trasc @alculquicondor

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 31, 2024
@mimowo
Contributor Author

mimowo commented Jan 31, 2024

All green, retrying to get more signal:
/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo mimowo changed the title from "WIP: Wait for webhooks server using probes [TESTING]" to "WIP: Fix Kueue startup by waiting for webhooks server using probes [TESTING]" Jan 31, 2024
@trasc
Contributor

trasc commented Jan 31, 2024

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

cmd/kueue/main.go Outdated
cmd/kueue/main.go Outdated
return mgr.GetWebhookServer().StartedChecker()(req)
}

if err := mgr.AddHealthzCheck("healthz", readyz); err != nil {
Contributor

why change the healthz?

Contributor Author

I'm not sure the change to healthz is required. I will test if this can be simplified. Do you know what healthz is used for?

Contributor Author

Good call, reverted setting it, going to test.

@@ -55,5 +55,6 @@ function cluster_kueue_deploy {
else
kubectl apply --server-side -k test/e2e/config
fi
kubectl wait --for=condition=available --timeout=3m deployment/kueue-controller-manager -n kueue-system
Contributor

Can we do this somewhere else?

It could help if, in the case of MultiKueue, we deploy in the manager and, instead of waiting for it to be ready now, we also deploy in the workers and then check availability for all of them later.

Contributor Author

I see, but this is a simple way of waiting for Kueue to be ready. I imagine users would like to use it in their setup scripts. Typically it takes a couple of seconds.

One alternative is to revert the removal of KueueReadyForTesting, but I don't like it, because it creates and deletes objects; this takes time and could mask some issues (as it actually does currently in the single-cluster case). In fact, in this PR I wanted to remove it to verify that it is not necessary.

Other alternatives I see are to modify KueueReadyForTesting to check the deployment status, as this line does, or to run the entire setup of the different clusters for MultiKueue in parallel at the script level.

Contributor

+1 on replacing KueueReadyForTesting with an availability check for the deployment.

Contributor

Let's leave this for a follow up.

Contributor Author

Sure, but I'm not sure which option is preferred as a follow-up. Is it returning to KueueReadyForTesting, but waiting for the deployment to be available, as with this command line?
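One possible shape for that follow-up is sketched below: a hypothetical test helper (the name WaitForKueueAvailable, its package, and the polling parameters are illustrative assumptions, not Kueue's actual test utilities) that polls the kueue-controller-manager Deployment for the Available condition, mirroring the kubectl wait call in the diff above.

```go
package util

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WaitForKueueAvailable polls the kueue-controller-manager Deployment until its
// Available condition is True, roughly equivalent to:
//   kubectl wait --for=condition=available --timeout=3m deployment/kueue-controller-manager -n kueue-system
func WaitForKueueAvailable(ctx context.Context, c client.Client) error {
	key := client.ObjectKey{Namespace: "kueue-system", Name: "kueue-controller-manager"}
	return wait.PollUntilContextTimeout(ctx, time.Second, 3*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			var dep appsv1.Deployment
			if err := c.Get(ctx, key, &dep); err != nil {
				// Keep polling on transient errors (e.g. the Deployment is not created yet).
				return false, nil
			}
			for _, cond := range dep.Status.Conditions {
				if cond.Type == appsv1.DeploymentAvailable && cond.Status == corev1.ConditionTrue {
					return true, nil
				}
			}
			return false, nil
		})
}
```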

@trasc
Contributor

trasc commented Jan 31, 2024

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

Comment on lines +282 to +284
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	return mgr.GetWebhookServer().StartedChecker()(req)
}); err != nil {
Contributor

Suggested change
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	return mgr.GetWebhookServer().StartedChecker()(req)
}); err != nil {
if err := mgr.AddReadyzCheck("readyz", mgr.GetWebhookServer().StartedChecker()); err != nil {

Contributor

or could the GetWebhookServer() return different servers on different calls?

Contributor Author

I cannot do this; then the replicas never become ready. The function needs to get the webhook server once it is ready. I think the webhook server is set up asynchronously in controller-runtime.

Contributor

... very strange, maybe the cert manager is doing it...
Maybe add a comment about this if confirmed.

All other references in sigs are using it directly:
https://github.com/search?q=org%3Akubernetes-sigs%20StartedChecker&type=code

Contributor Author

I think it is indeed about the certs not being ready. One of the projects actually solves this problem too: https://github.com/kubernetes-sigs/hierarchical-namespaces/blob/d367fc08a261135b63d22aeb01c688317d9b7e02/cmd/manager/main.go#L296-L307.

There they wait explicitly for the certs to be ready; I guess we could do the same so that we don't get the transient errors logged. WDYT?

Contributor Author

@mimowo mimowo Feb 1, 2024

> GetWebhookServer().StartedChecker is what we need to determine the readiness.
>
> I'm only not sure why we need a closure for calling it, since from what I can see the field holding the webhook is private and only settable with an option when creating the manager; however it's not a huge problem.

TBH, I'm not sure why either. What I verified experimentally is that if I just call GetWebhookServer before WaitForCertsReady then the webhook is not initialized, and it stays that way.

I then get the following logs, even if I keep the readyz probe as it is currently, but just invoke GetWebhookServer() before registering the probe:

2024-01-31T19:49:50.466048439Z	INFO	setup	kueue/main.go:291	readyz	{"ws": {"Options":{"Host":"","Port":9443,"CertDir":"/tmp/k8s-webhook-server/serving-certs","CertName":"tls.crt","KeyName":"tls.key","ClientCAName":"","TLSOpts":null,"WebhookMux":{}}}, "err": "webhook server has not been started yet"}

I suspect this has something to do with the one-time registration of the not-yet-initialized webhook server as a runnable here: https://github.com/kubernetes-sigs/controller-runtime/blob/73519a939e27c8465c6c49694356cbd66daf1677/pkg/manager/internal.go#L258.

However, I'm not sure of the details. I would probably need to investigate further with a debugger. Let me know if you have some ideas on how to get to the bottom of it.

Let me put in the comment what I know so far.

and

> > , I guess we could do the same so that we don't get the transient errors logged.
>
> What kind of errors?

I meant transient errors before the webhook server is listening, but I didn't know how they look in practice before writing this. I was thinking maybe we get many errors about the certs not being ready. From what I can see we get these errors:

2024-01-31T13:09:29.832263236Z stderr F 2024-01-31T13:09:29.832081013Z	DEBUG	controller-runtime.healthz	healthz/healthz.go:60	healthz check failed	{"checker": "readyz", "error": "webhook server has not been started yet"}

because the checker checks the started field before attempting the connection. So there is only a small window where we can get transient errors, between setting started: true and actually listening: https://github.com/kubernetes-sigs/controller-runtime/blob/8475c55f3e00e5611de0880eccd785efa85e8e38/pkg/webhook/server.go#L261-L263. However, I have never seen these errors so far.
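For context, the checker being registered here works in two phases, roughly as sketched below. This is a simplified, self-contained illustration of the logic described in this thread (a started flag checked first, then a TLS dial to confirm the port is listening); it is not controller-runtime's actual StartedChecker implementation, and the names are illustrative.

```go
package probe

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"net/http"
	"sync"
	"time"
)

// startedChecker is an illustrative stand-in for the readiness logic discussed
// above: report an error until the started flag is set, and only then dial the
// webhook port to confirm it is actually accepting TLS connections.
func startedChecker(mu *sync.Mutex, started *bool, addr string) func(*http.Request) error {
	return func(_ *http.Request) error {
		mu.Lock()
		defer mu.Unlock()
		if !*started {
			// This is the message seen in the logs quoted above.
			return errors.New("webhook server has not been started yet")
		}
		d := &net.Dialer{Timeout: 10 * time.Second}
		conn, err := tls.DialWithDialer(d, "tcp", addr, &tls.Config{InsecureSkipVerify: true})
		if err != nil {
			return fmt.Errorf("webhook server is not reachable: %w", err)
		}
		return conn.Close()
	}
}
```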

Contributor

> I have never seen these errors so far.

It's normal behavior when the pod is not yet ready (previously we were "ready" from the start).

Maybe something like:

Suggested change
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	return mgr.GetWebhookServer().StartedChecker()(req)
}); err != nil {
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	select {
	case <-certsReady:
		return mgr.GetWebhookServer().StartedChecker()(req)
	case <-req.Context().Done():
		return errors.New("request canceled")
	}
}); err != nil {

Could decrease the number of messages.

But I find it a bit odd to block the HTTP handler.

Contributor Author

@mimowo mimowo Feb 1, 2024

I was thinking about waiting explicitly for certsReady (as is done in the hierarchical-namespaces project pointed out above).

However, I don't think it improves the messages, because we already effectively wait for the certs to be ready: mgr.GetWebhookServer().StartedChecker() first checks if started=true (and started=true implicitly means that the certs are ready). If the certs aren't ready we get the nice message: {"checker": "readyz", "error": "webhook server has not been started yet"}. I think this is enough, and there is no reason to complicate the code.

Contributor Author

@mimowo mimowo Feb 1, 2024

Ok, I think I have some new insights into:

> I'm only not sure why we need a closure for calling it, since from what I can see the field holding the webhook is private and only settable with an option when creating the manager; however it's not a huge problem.

If we call GetWebhookServer early, then the not-fully-initialized webhook server is registered as a runnable here: https://github.com/kubernetes-sigs/controller-runtime/blob/73519a939e27c8465c6c49694356cbd66daf1677/pkg/manager/internal.go#L257.

When the manager starts, it starts all runnables, including the webhook server, but that fails; in this case I see in the logs, early on:

2024-02-01T12:44:13.308211223Z	INFO	controller-runtime.webhook	webhook/server.go:191	Starting webhook server
2024-02-01T12:44:13.40926745Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-02-01T12:44:13.40932954Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-02-01T12:44:13.40933832Z	INFO	manager/internal.go:516	Stopping and waiting for non leader election runnables

Note the presence of "Stopping and waiting for non leader election runnables" just after the start.

OTOH (as in this PR), if we delay calling GetWebhookServer, then the webhook server runnable is added only after the certs are ready, and only then is the webhook server started, so it starts successfully. Then the "Starting webhook server" log appears much later (after the certs are ready).
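To summarize the ordering that makes this work, here is a hedged sketch (the names certsReady and setupWebhooks, and the use of ctrl.Log, are assumptions for illustration; this is not code copied from Kueue's main.go): the readyz check only touches GetWebhookServer lazily inside a closure, while the first eager call happens in a goroutine only after the certs are ready.

```go
package setup

import (
	"net/http"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

// setupProbeAndWebhooks sketches the ordering described above.
func setupProbeAndWebhooks(mgr ctrl.Manager, certsReady <-chan struct{}, setupWebhooks func(ctrl.Manager)) {
	// The readyz check wraps GetWebhookServer in a closure, so the webhook server
	// is not fetched (and its runnable is not registered) at probe-registration
	// time; the closure only runs when the kubelet probes /readyz.
	if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
		return mgr.GetWebhookServer().StartedChecker()(req)
	}); err != nil {
		ctrl.Log.Error(err, "unable to set up ready check")
		os.Exit(1)
	}

	// The first eager call to GetWebhookServer happens only after the certs are
	// ready, so the webhook server runnable is added late enough to start successfully.
	go func() {
		<-certsReady
		setupWebhooks(mgr) // registers webhooks; internally calls mgr.GetWebhookServer()
	}()
}
```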

Contributor Author

slightly polished the comment

@mimowo
Contributor Author

mimowo commented Jan 31, 2024

All green, retrying to get more signal:
/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo
Contributor Author

mimowo commented Jan 31, 2024

All green, retrying to get more signal:
/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo mimowo changed the title from "WIP: Fix Kueue startup by waiting for webhooks server using probes [TESTING]" to "Fix Kueue startup by waiting for webhooks server using probes" Jan 31, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2024
@mimowo mimowo force-pushed the wait-for-webhooks-using-probes branch from 2b70e0c to d2e420b Compare February 1, 2024 14:42
Contributor

@alculquicondor alculquicondor left a comment


/lgtm
/approve
/hold
for a couple more re-runs just in case.


@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 672838e0922a62f82b328cd35c211a88005ba5c6

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2024
@alculquicondor
Contributor

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@alculquicondor
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2024
@k8s-ci-robot k8s-ci-robot merged commit a8c11f3 into kubernetes-sigs:main Feb 1, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Feb 1, 2024
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 2, 2024
@mimowo
Contributor Author

mimowo commented Feb 2, 2024

@alculquicondor @tenzen-y
I have added a release note, because this is actually a user-facing change IMO. Previously it was really hard to tell when Kueue can accept early requests, especially in the HA mode. Should we consider cherry-picking?

@tenzen-y
Member

tenzen-y commented Feb 2, 2024

> @alculquicondor @tenzen-y I have added a release note, because this is actually a user-facing change IMO. Previously it was really hard to tell when Kueue can accept early requests, especially in the HA mode. Should we consider cherry-picking?

I'm fine with cherry-picking.
This patch would be helpful for both HA and non-HA users. Without this patch, we need to manually check whether the manager has started, since the readiness probe server starts even though the manager is not yet ready.

@alculquicondor
Contributor

alculquicondor commented Feb 2, 2024 via email

Successfully merging this pull request may close these issues.

Flaky Multikueue E2E tests
5 participants