[multikueue] Manage worker cluster unavailability #1681
Conversation
Skipping CI for Draft Pull Request.
✅ Deploy Preview for kubernetes-sigs-kueue ready!
Force-pushed from 48a8f99 to 0c3a8f8.
Initial review, API field name suggestion, and questions to understand the flow.
Force-pushed from 0c3a8f8 to 7efa2cc.
Force-pushed from 7efa2cc to 49a2f64.
```go
remainingWaitTime := a.workerLostTimeout - time.Since(acs.LastTransitionTime.Time)
if remainingWaitTime > 0 {
	log.V(3).Info("Reserving remote lost, retry", "retryAfter", remainingWaitTime)
	return reconcile.Result{RequeueAfter: remainingWaitTime}, nil
}
```
Requeueing after 15min to check sounds bad. I think, by analogy to `--retry` in curl, we could have exponential backoff or a fixed delay (probably 30s by default), then wait for min(remainingTime, delay).
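For illustration, a minimal runnable sketch of this suggestion; `requeueAfter`, `retryDelay`, and the wiring are assumptions for the example, not code from this PR:

```go
package main

import (
	"fmt"
	"time"
)

// requeueAfter returns min(remainingTime, retryDelay): instead of waiting the
// full remaining timeout in one shot, the worker gets re-checked periodically.
func requeueAfter(workerLostTimeout, retryDelay time.Duration, lastTransition time.Time) time.Duration {
	remaining := workerLostTimeout - time.Since(lastTransition)
	if remaining <= 0 {
		return 0 // timeout expired; stop waiting for the worker
	}
	if retryDelay < remaining {
		return retryDelay // re-check sooner than the full remaining timeout
	}
	return remaining
}

func main() {
	lost := time.Now().Add(-5 * time.Minute)
	// With a 15min timeout and a 30s delay, the next check happens in ~30s.
	fmt.Println(requeueAfter(15*time.Minute, 30*time.Second, lost))
}
```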
The 15min requeue applies only if no other event triggers the reconcile; then, and only then, are we interested in requeuing to check whether the reserving worker is back.
Maybe I'm still missing some piece of information. A question: if we have a workload running on worker1 and we lose connectivity to worker1 for 5min (and worker1 is marked inactive), what will trigger the check to mark the cluster as active again?
Also, if we don't get connectivity for 15min, do we mark such a cluster as inactive?
> Maybe I'm still missing some piece of information. A question: if we have a workload running on worker1 and we lose connectivity to worker1 for 5min (and worker1 is marked inactive), what will trigger the check to mark the cluster as active again?
Nothing for now; we have a follow-up for connection monitoring and reconnects.
> Also, if we don't get connectivity for 15min, do we mark such a cluster as inactive?
Here we are talking about workloads.
Ok, I see, but there is an issue with delaying the check: if we get unlucky once and then again 15min later, we requeue the workload, even though the cluster might have been available for 14min in between. I don't think this is the intention. I think we should monitor whether the cluster is connected and switch its status based on that.
If the problem is solved under "connection monitoring and reconnects", then maybe it makes sense to combine the two, so we can look at the solution for temporary loss of connectivity holistically. At least update the "special notes for your reviewer" with what will be covered in the coming follow-up.
When the watch starts, events are generated for the already existing objects, so at that point in time, when the worker has been alive for some while, we would also get a reconcile.
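For context, a small plain client-go sketch of the general mechanism being relied on here (not this PR's controller): when an informer starts, its initial list is delivered as Add events, so handlers fire for objects that already exist.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	factory := informers.NewSharedInformerFactory(kubernetes.NewForConfigOrDie(cfg), 0)
	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// Fires for pre-existing pods too, not only newly created ones.
			fmt.Println("add event:", obj.(*corev1.Pod).Name)
		},
	})
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
	<-ctx.Done()
}
```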
But we still would not observe the cluster being available for 10min in between. This seems fragile to me. I think it is better to check the status periodically; then, if a ping fails, we count the 15min from the last successful status check.
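A hedged sketch of this periodic-probe idea; `probe`, `pingWorker`, and the intervals are illustrative assumptions, not code from this PR:

```go
package main

import (
	"context"
	"time"
)

type probe struct {
	lastSuccess time.Time     // time of the last successful connectivity check
	timeout     time.Duration // e.g. 15 * time.Minute
}

// check runs ping (an assumed connectivity test against the worker) and
// reports whether the worker should be considered lost: only after the
// checks have been failing continuously for the whole timeout window.
func (p *probe) check(ctx context.Context, ping func(context.Context) error) bool {
	if err := ping(ctx); err == nil {
		p.lastSuccess = time.Now()
		return false
	}
	return time.Since(p.lastSuccess) >= p.timeout
}

// pingWorker is a placeholder for a real check, e.g. a lightweight list
// request against the worker's API server.
func pingWorker(ctx context.Context) error { return nil }

func main() {
	p := &probe{lastSuccess: time.Now(), timeout: 15 * time.Minute}
	ticker := time.NewTicker(30 * time.Second) // probe every 30s
	defer ticker.Stop()
	for range ticker.C {
		if p.check(context.Background(), pingWorker) {
			return // cluster considered lost; mark it inactive here
		}
	}
}
```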
This is the best I can do.
You mean in this PR / approach, or in general, we should not track the availability of a worker?
In this PR, we should talk about this PR.
API LGTM
I'll approve the implementation when @mimowo gives LGTM
Before lgtm for the implementation I would like to confirm the implementation approach.

IIUC, in the current approach the 15min timeout is evaluated between checks, but if nothing triggers a check within the 15min period, it is possible that the cluster is available, yet the check after 15min fails (it might even be rejected by the P&F). Then the workload is requeued.

I would like some mechanism that (1) gives more confidence that the cluster is really unavailable, by retrying the check periodically, and (2) gives users something more tangible to verify that the timeout really passed. Currently users don't have good visibility into when the checks happen, so verifying from logs why a requeue happened will be very hard.

I would like an approach that monitors the state of the cluster periodically, say every 30s. The timestamp of the first failed check is recorded in the cluster status. Then the admission check reconciler verifies whether 15min have passed since that unavailable timestamp. The timestamp in the status would make it easier to verify / track down the decision without logs. It would also mean we do more checks within the 15min period, so we don't risk declaring "cluster unavailable" when the call after 15min fails for a reason that might just be an overloaded API server.

Having said that, I'm ok if we conclude this is overkill, or if we decide this is subject to follow-up. However, I would like to have a decision first. WDYT @alculquicondor @mwielgus?
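For illustration, a rough sketch of the proposed status-based tracking; `ClusterStatus`, `UnavailableSince`, and the helpers are hypothetical names, not the Kueue API:

```go
// Package sketch illustrates recording the first-unavailable timestamp in a
// cluster's status and deciding "worker lost" from it.
package sketch

import "time"

// ClusterStatus is a hypothetical stand-in for a MultiKueue cluster's status.
type ClusterStatus struct {
	// UnavailableSince is set on the first failed connectivity check and
	// cleared on the next successful one.
	UnavailableSince *time.Time
}

// observe updates the status after each periodic (e.g. every 30s) check.
func observe(s *ClusterStatus, checkErr error, now time.Time) {
	if checkErr == nil {
		s.UnavailableSince = nil
		return
	}
	if s.UnavailableSince == nil {
		s.UnavailableSince = &now
	}
}

// workerLost is what the admission check reconciler would consult: the
// cluster counts as lost only after workerLostTimeout of continuous failures,
// and the timestamp in the status makes the decision auditable without logs.
func workerLost(s ClusterStatus, workerLostTimeout time.Duration, now time.Time) bool {
	return s.UnavailableSince != nil && now.Sub(*s.UnavailableSince) >= workerLostTimeout
}
```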
I'm ok with the current behavior for this PR. We can follow up in another one with periodic checks.
Ok, let's go as is, we may iterate on the solution if needed in the future.
/lgtm
LGTM label has been added. Git tree hash: bcfc2d4afb40670dce1616d770436f38010b7772
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, trasc
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 6872aef to c9a4984.
/lgtm
LGTM label has been added. Git tree hash: 84e29c2ae7f85cc205d9c5b83ca2e2d00be71647
/retest
/test pull-kueue-verify-main
/test pull-kueue-test-integration-main
* [multikueue] Partially active admission check.
* [multikueue] Keep ready timeout
* Review Remarks
* Refactor reconcileGroup.
* Review Remarks.
* Fix int test after rebase.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Keeps a workload's admission check `Ready` for a configured time even if the manager cannot connect to its reserving worker.
Which issue(s) this PR fixes:
Relates to #693
Special notes for your reviewer:
Does this PR introduce a user-facing change?