
[multikueue] Manage worker cluster unavailability #1681

Merged

Conversation

trasc
Contributor

@trasc trasc commented Feb 2, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

  • Keep the AdmissionCheck Active even if not all of its clusters are usable.
  • Add a mechanism that keeps a workload's MultiKueue AdmissionCheckState as Ready for a configured time even if the manager cannot connect to its reserving worker (a rough sketch of such a configuration field follows below).
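
For context, a rough Go sketch of the kind of configuration knob this change introduces; the field name, its exact location in the Kueue Configuration API, and the default value are assumptions for illustration, not the merged API:

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MultiKueue holds MultiKueue-specific manager settings (sketch only).
type MultiKueue struct {
	// WorkerLostTimeout is how long the manager keeps a workload's MultiKueue
	// AdmissionCheckState as Ready after losing the connection to the reserving
	// worker cluster, before the workload is requeued (e.g. 15m).
	WorkerLostTimeout *metav1.Duration `json:"workerLostTimeout,omitempty"`
}
```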

Which issue(s) this PR fixes:

Relates to #693

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue - Manage worker cluster unavailability

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 2, 2024
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 2, 2024

netlify bot commented Feb 2, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit c9a4984
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65e8a07790b35c00082bd89f
😎 Deploy Preview https://deploy-preview-1681--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 2, 2024
apis/config/v1beta1/defaults.go Outdated
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2024
@trasc trasc force-pushed the multikueue-keep-ready-timeout branch from 48a8f99 to 0c3a8f8 Compare February 6, 2024 07:15
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 6, 2024
Contributor

@mimowo mimowo left a comment

Initial review, API field name suggestion, and questions to understand the flow.

apis/config/v1beta1/configuration_types.go Outdated
pkg/controller/admissionchecks/multikueue/workload.go Outdated
pkg/controller/admissionchecks/multikueue/workload.go Outdated
pkg/controller/admissionchecks/multikueue/workload.go Outdated
pkg/controller/admissionchecks/multikueue/workload.go Outdated
pkg/controller/admissionchecks/multikueue/workload.go Outdated
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2024
@trasc trasc force-pushed the multikueue-keep-ready-timeout branch from 0c3a8f8 to 7efa2cc Compare February 8, 2024 07:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 8, 2024
@trasc trasc changed the title [multikueue] Keep ready timeout [multikueue] Manage worker cluster unavailability Feb 8, 2024
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 8, 2024
@trasc trasc marked this pull request as ready for review February 8, 2024 08:03
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 8, 2024
@trasc trasc force-pushed the multikueue-keep-ready-timeout branch from 7efa2cc to 49a2f64 Compare February 19, 2024 14:33
pkg/controller/admissionchecks/multikueue/workload.go Outdated
pkg/controller/admissionchecks/multikueue/workload.go Outdated
pkg/controller/admissionchecks/multikueue/workload.go Outdated
// Keep the workload's AdmissionCheckState Ready and requeue while the
// configured timeout has not yet elapsed since the state's last transition.
remainingWaitTime := a.workerLostTimeout - time.Since(acs.LastTransitionTime.Time)
if remainingWaitTime > 0 {
	log.V(3).Info("Reserving remote lost, retry", "retryAfter", remainingWaitTime)
	return reconcile.Result{RequeueAfter: remainingWaitTime}, nil
}
Contributor

Requeuing after 15 min just to check sounds bad. I think, by analogy to --retry in curl, we could have exponential backoff or a fixed delay (probably 30s by default), and then wait for min(remainingTime, delay).
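
A minimal Go sketch of this suggestion (the reviewer's proposal, not the behavior merged in this PR; the 30s default and the helper name are illustrative):

```go
import "time"

// nextRequeue caps the requeue delay at a fixed probe interval so the worker
// is rechecked periodically instead of only once after the full timeout.
func nextRequeue(remainingWaitTime time.Duration) time.Duration {
	const probeDelay = 30 * time.Second // illustrative default, as in the comment above
	if probeDelay < remainingWaitTime {
		return probeDelay
	}
	return remainingWaitTime
}
```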

Contributor Author

The 15 min requeue applies only if no other event triggers the reconcile; then, and only then, are we interested in requeuing to check whether the reserving worker is back.

Contributor

@mimowo mimowo Feb 21, 2024

Maybe I'm still missing some piece of information. A question: if we have a workload running on worker1 and we lose connectivity to worker1 for 5 min (and worker1 is marked inactive), what will trigger the check to include the cluster back as active?

Also, if we don't get connectivity for 15 min, do we mark such a cluster as inactive?

Contributor Author

@trasc trasc Feb 21, 2024

> Maybe I'm still missing some piece of information. A question: if we have a workload running on worker1 and we lose connectivity to worker1 for 5 min (and worker1 is marked inactive), what will trigger the check to include the cluster back as active?

Nothing for now; we have a follow-up for connection monitoring and reconnects.

> Also, if we don't get connectivity for 15 min, do we mark such a cluster as inactive?

Here we are talking about workloads.

Contributor

Ok, I see, but there is an issue with delaying the check: we might get unlucky once, and again 15 min later, and then requeue the workload, even though the cluster might have been available for 14 min in between. I don't think this is the intention. I think we should monitor whether the cluster is connected and switch its status based on that.

If the problem is solved under "connection monitoring and reconnects", then maybe it makes sense to combine the two, so we can look at the solution to temporary loss of connectivity holistically. At least update the "special notes for your reviewer" with what will be covered in the coming follow-up.

Contributor Author

When the watch starts, events are generated for the already existing objects; hence, at that point in time, once the worker has been alive for some time, we would also get a reconcile.

Contributor

But we still would not observe the cluster being available for 10 min in between. This seems fragile to me. I think it is better to periodically check the status; then, if the ping fails, we count 15 min since the last successful status check.

Contributor Author

This is the best I can do.

Contributor

You mean in this PR / approach, or in general, we should not track the availability of a worker?

Contributor Author

In this PR, we should talk about this PR.

pkg/controller/admissionchecks/multikueue/workload_test.go Outdated
pkg/controller/admissionchecks/multikueue/workload_test.go Outdated
pkg/controller/admissionchecks/multikueue/workload_test.go Outdated
test/integration/multikueue/multikueue_test.go Outdated
test/integration/multikueue/suite_test.go Outdated
Contributor

@alculquicondor alculquicondor left a comment

API LGTM
I'll approve the implementation when @mimowo gives LGTM

@mimowo
Contributor

mimowo commented Feb 28, 2024

> I'll approve the implementation when @mimowo gives LGTM

Before lgtm for the implementation I would still like to confirm the implementation approach. IIUC, in the current approach the 15 min timeout is evaluated between checks, but if nothing triggers a check within the 15 min period, it is possible that the cluster was available in the meantime, yet the single check after 15 min fails (it might even just be rejected by the API server's Priority and Fairness (P&F)). Then the workload is requeued.

I would like to have some mechanism that (1) gives more confidence the cluster is really unavailable, by retrying the check periodically, and (2) gives users something more tangible to verify that the timeout really passed. I think users currently don't have good visibility into when the checks happen, so it would be very hard for them to verify from logs why the requeue happened.

I would like an approach that periodically monitors the state of the cluster, say every 30s. The timestamp of the first unavailable check would be recorded in the cluster status. Then the admission check reconciler verifies whether 15 min have passed since that timestamp. The timestamp in the status would make it easier to verify / track down the decision without digging through logs. It would also mean that within the 15 min period we perform more checks, so we don't risk declaring the cluster unavailable just because the call after 15 min fails for some other reason, such as an overloaded API server.

Having said that, I'm ok if we conclude this is overkill, or if we decide this is subject to a follow-up. However, I would like to have a decision first. WDYT @alculquicondor @mwielgus?
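
For illustration, a minimal Go sketch of the periodic-probe idea described above (the reviewer's proposal, not the implementation merged in this PR; all names and the ping/status hooks are hypothetical):

```go
package multikueue

import (
	"context"
	"time"
)

// clusterProber sketches the proposal: probe the worker on a fixed interval,
// record when it first became unreachable, and only declare it lost once the
// configured timeout has elapsed since that first failure.
type clusterProber struct {
	ping              func(ctx context.Context) error // hypothetical lightweight call against the worker API server
	workerLostTimeout time.Duration                    // e.g. 15 * time.Minute
	firstFailure      time.Time                        // zero while the worker is reachable
	markUnavailable   func()                           // hypothetical hook, e.g. updates the cluster status condition
}

// probe would run on a fixed interval, say every 30s.
func (p *clusterProber) probe(ctx context.Context) {
	if err := p.ping(ctx); err != nil {
		if p.firstFailure.IsZero() {
			p.firstFailure = time.Now() // remember the first failed probe
		}
		if time.Since(p.firstFailure) >= p.workerLostTimeout {
			p.markUnavailable()
		}
		return
	}
	p.firstFailure = time.Time{} // connectivity recovered, reset the tracking
}
```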

@alculquicondor
Contributor

I'm ok with the current behavior for this PR. We can follow up in another one with periodic checks.

@mimowo
Contributor

mimowo commented Mar 6, 2024

Ok, let's go as is; we may iterate on the solution if needed in the future.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: bcfc2d4afb40670dce1616d770436f38010b7772

@alculquicondor
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2024
@trasc trasc force-pushed the multikueue-keep-ready-timeout branch from 6872aef to c9a4984 Compare March 6, 2024 16:57
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mimowo March 6, 2024 16:57
@mimowo
Contributor

mimowo commented Mar 6, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 84e29c2ae7f85cc205d9c5b83ca2e2d00be71647

@trasc
Contributor Author

trasc commented Mar 6, 2024

/retest

@mimowo
Contributor

mimowo commented Mar 6, 2024

/test pull-kueue-verify-main

@mimowo
Contributor

mimowo commented Mar 6, 2024

/test pull-kueue-test-integration-main

@k8s-ci-robot k8s-ci-robot merged commit a527664 into kubernetes-sigs:main Mar 6, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Mar 6, 2024
@trasc trasc deleted the multikueue-keep-ready-timeout branch March 6, 2024 18:52
vsoch pushed a commit to researchapps/kueue that referenced this pull request Apr 18, 2024
* [multikueue] Partially active admission check.

* [multikueue] Keep ready timeout

* Review Remarks

* Refactor reconcileGroup.

* Review Remarks.

* Fix int test after rebase.