[multikueue] Manage worker cluster unavailability #1681
Conversation
Skipping CI for Draft Pull Request.
✅ Deploy Preview for kubernetes-sigs-kueue ready!
Force-pushed from 48a8f99 to 0c3a8f8.
Initial review, API field name suggestion, and questions to understand the flow.
Force-pushed from 0c3a8f8 to 7efa2cc.
Force-pushed from 7efa2cc to 49a2f64.
```go
remainingWaitTime := a.workerLostTimeout - time.Since(acs.LastTransitionTime.Time)
if remainingWaitTime > 0 {
	log.V(3).Info("Reserving remote lost, retry", "retryAfter", remainingWaitTime)
	return reconcile.Result{RequeueAfter: remainingWaitTime}, nil
}
```
Requeueing after 15min to check sounds bad. I think, by analogy to `--retry` in curl, we could have exponential backoff or a fixed delay (probably 30s by default), then wait for min(remainingTime, delay).
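For illustration, a minimal runnable sketch of this suggestion; `requeueAfter`, `retryDelay`, and the wiring are assumptions for the example, not code from this PR:

```go
package main

import (
	"fmt"
	"time"
)

// requeueAfter returns min(remainingTime, retryDelay): instead of waiting the
// full remaining timeout in one shot, the worker gets re-checked periodically.
func requeueAfter(workerLostTimeout, retryDelay time.Duration, lastTransition time.Time) time.Duration {
	remaining := workerLostTimeout - time.Since(lastTransition)
	if remaining <= 0 {
		return 0 // timeout expired; stop waiting for the worker
	}
	if retryDelay < remaining {
		return retryDelay // re-check sooner than the full remaining timeout
	}
	return remaining
}

func main() {
	lost := time.Now().Add(-5 * time.Minute)
	// With a 15min timeout and a 30s delay, the next check happens in ~30s.
	fmt.Println(requeueAfter(15*time.Minute, 30*time.Second, lost))
}
```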
The 15min requeue applies only if no other event triggers the reconcile; then, and only then, are we interested in requeuing to check whether the reserving worker is back.
Maybe I'm still missing some piece of information. A question: if we have a workload running on worker1 and we lose connectivity to worker1 for 5min (and worker1 is marked inactive), what will trigger the check to mark the cluster as active again?
Also, if we don't get connectivity for 15min, do we mark such a cluster as inactive?
> Maybe I'm still missing some piece of information. A question: if we have a workload running on worker1 and we lose connectivity to worker1 for 5min (and worker1 is marked inactive), what will trigger the check to mark the cluster as active again?
Nothing for now; we have a follow-up for connection monitoring and reconnects.
> Also, if we don't get connectivity for 15min, do we mark such a cluster as inactive?
Here we are talking about workloads.
Ok, I see, but there is an issue with delaying the check: if we get unlucky once and then again 15min later, we requeue the workload, even though the cluster might have been available for 14min in between. I don't think this is the intention. I think we should monitor whether the cluster is connected and switch its status based on that.
If the problem is solved under "connection monitoring and reconnects", then maybe it makes sense to combine the two, so we can look at the solution for temporary loss of connectivity holistically. At least update the "special notes for your reviewer" with what will be covered in the coming follow-up.
When the watch starts, events are generated for the already existing objects, so at that point in time, when the worker has been alive for some while, we would also get a reconcile.
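For context, a small plain client-go sketch of the general mechanism being relied on here (not this PR's controller): when an informer starts, its initial list is delivered as Add events, so handlers fire for objects that already exist.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	factory := informers.NewSharedInformerFactory(kubernetes.NewForConfigOrDie(cfg), 0)
	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// Fires for pre-existing pods too, not only newly created ones.
			fmt.Println("add event:", obj.(*corev1.Pod).Name)
		},
	})
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
	<-ctx.Done()
}
```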
But we still would not observe the cluster being available for 10min in between. This seems fragile to me. I think it is better to check the status periodically; then, if a ping fails, we count the 15min from the last successful status check.
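A hedged sketch of this periodic-probe idea; `probe`, `pingWorker`, and the intervals are illustrative assumptions, not code from this PR:

```go
package main

import (
	"context"
	"time"
)

type probe struct {
	lastSuccess time.Time     // time of the last successful connectivity check
	timeout     time.Duration // e.g. 15 * time.Minute
}

// check runs ping (an assumed connectivity test against the worker) and
// reports whether the worker should be considered lost: only after the
// checks have been failing continuously for the whole timeout window.
func (p *probe) check(ctx context.Context, ping func(context.Context) error) bool {
	if err := ping(ctx); err == nil {
		p.lastSuccess = time.Now()
		return false
	}
	return time.Since(p.lastSuccess) >= p.timeout
}

// pingWorker is a placeholder for a real check, e.g. a lightweight list
// request against the worker's API server.
func pingWorker(ctx context.Context) error { return nil }

func main() {
	p := &probe{lastSuccess: time.Now(), timeout: 15 * time.Minute}
	ticker := time.NewTicker(30 * time.Second) // probe every 30s
	defer ticker.Stop()
	for range ticker.C {
		if p.check(context.Background(), pingWorker) {
			return // cluster considered lost; mark it inactive here
		}
	}
}
```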
This is the best I can do.
You mean in this PR / approach, or in general, we should not track the availability of a worker?
In this PR, we should talk about this PR.
API LGTM
I'll approve the implementation when @mimowo gives LGTM
Before lgtm for the implementation I would like to confirm the implementation approach.

IIUC, in the current approach the 15min timeout is evaluated between checks, but if nothing triggers a check within the 15min period, it is possible that the cluster is available, yet the check after 15min fails (it might even be rejected by the P&F). Then the workload is requeued.

I would like some mechanism that (1) gives more confidence that the cluster is really unavailable, by retrying the check periodically, and (2) gives users something more tangible to verify that the timeout really passed. Currently users don't have good visibility into when the checks happen, so verifying from logs why a requeue happened will be very hard.

I would like an approach that monitors the state of the cluster periodically, say every 30s. The timestamp of the first failed check is recorded in the cluster status. Then the admission check reconciler verifies whether 15min have passed since that unavailable timestamp. The timestamp in the status would make it easier to verify / track down the decision without logs. It would also mean we do more checks within the 15min period, so we don't risk declaring "cluster unavailable" when the call after 15min fails for a reason that might just be an overloaded API server.

Having said that, I'm ok if we conclude this is overkill, or if we decide this is subject to follow-up. However, I would like to have a decision first. WDYT @alculquicondor @mwielgus?
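For illustration, a rough sketch of the proposed status-based tracking; `ClusterStatus`, `UnavailableSince`, and the helpers are hypothetical names, not the Kueue API:

```go
// Package sketch illustrates recording the first-unavailable timestamp in a
// cluster's status and deciding "worker lost" from it.
package sketch

import "time"

// ClusterStatus is a hypothetical stand-in for a MultiKueue cluster's status.
type ClusterStatus struct {
	// UnavailableSince is set on the first failed connectivity check and
	// cleared on the next successful one.
	UnavailableSince *time.Time
}

// observe updates the status after each periodic (e.g. every 30s) check.
func observe(s *ClusterStatus, checkErr error, now time.Time) {
	if checkErr == nil {
		s.UnavailableSince = nil
		return
	}
	if s.UnavailableSince == nil {
		s.UnavailableSince = &now
	}
}

// workerLost is what the admission check reconciler would consult: the
// cluster counts as lost only after workerLostTimeout of continuous failures,
// and the timestamp in the status makes the decision auditable without logs.
func workerLost(s ClusterStatus, workerLostTimeout time.Duration, now time.Time) bool {
	return s.UnavailableSince != nil && now.Sub(*s.UnavailableSince) >= workerLostTimeout
}
```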
I'm ok with the current behavior for this PR. We can follow up in another one with periodic checks.
Ok, let's go as is, we may iterate on the solution if needed in the future.
/lgtm
LGTM label has been added. Git tree hash: bcfc2d4afb40670dce1616d770436f38010b7772
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, trasc
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 6872aef to c9a4984.
/lgtm
LGTM label has been added. Git tree hash: 84e29c2ae7f85cc205d9c5b83ca2e2d00be71647
/retest
/test pull-kueue-verify-main
/test pull-kueue-test-integration-main
* [multikueue] Partially active admission check.
* [multikueue] Keep ready timeout
* Review Remarks
* Refactor reconcileGroup.
* Review Remarks.
* Fix int test after rebase.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Keeps a workload's admission check `Ready` for a configured time even if the manager cannot connect to its reserving worker.
Which issue(s) this PR fixes:
Relates to #693
Special notes for your reviewer:
Does this PR introduce a user-facing change?