Design for all-or-nothing semantics for workload resource assignment #433
Conversation
Skipping CI for Draft Pull Request.
// If false, the workload starts as soon as it is admitted and it does not
// block admission of other workloads. The `null` value results in the same
// behavior as `false`.
waitForPodsScheduled *bool `json:"waitForPodsScheduled,omitempty"`
This is a bit odd. The semantics described here (specifically, "blocking other workloads and waiting until the workload pods are scheduled") are not a property that a user of the workload API cares about, the relevant workload property is "all-or-nothing". Moreover, a batch user shouldn't have knobs that explicitly control the queue scheduling behavior.
I do think that `waitForPodsScheduled` can be a property of the queue though, and something that a batch admin may want to toggle.
So I see two options:
- Keep the knob in the Workload API but use different naming, something like `AllOrNothing = Strict | BestEffort`, with `BestEffort` being the current default behavior and `Strict` the one where we wait for the pods to schedule. The `Strict` semantics may have a different implementation in the future, based on a more optimistic approach (start all workloads just like we do now, but preempt the ones that get stuck and never fully schedule) or on an integration with the planned CA provisioning API.
- Move the knob to the ClusterQueue API, or even to ComponentConfig, and keep the proposed naming and description.
Let's review the options for the best place for the API, and their pros & cons:
- workload (current proposal)
- local queue
- cluster queue
- component
Workload/Job level pros:
- throughput - the same cluster queue may run workloads which don't need any guarantees and thus get admitted faster.
- flexibility - we can easily implement queue-level (or component-level) by translating the queue-level property into workload property during workload creation. For example, add the workload annotation to every workload created for a given local queue.
- conceptually, IMO, requiring all pods to run at the same time is a property of a workload itself.
Workload/Job level cons:
- introducing a new annotation on the Job level will increase the complexity of the job reconciler, which is the reference Job-based implementation of the CRD <> Workload integration. However, translating the annotation into a field still seems to be a marginal increase in complexity.
Component level pros:
- lower implementation complexity compared to the workload level - we could just block admission until the workload is in the `PodsScheduled` state. We wouldn't need to check if there are any already admitted workloads - we know they all have already reached the `PodsScheduled` condition.
Component level cons:
- throughput - it would degrade substantially due to the sequencing of all workloads, even those which don't require the semantics
Queue level cons:
- similar implementation complexity compared to the workload level - in order to ensure an exclusive start of pods we still need to both block admission of new workloads until the workload is in the `PodsScheduled` condition, and wait for all already-admitted workloads to reach the `PodsScheduled` condition, as some workloads might be executed on the cluster bypassing the queues.
- throughput - though it could be compensated by creating dedicated queues for workloads requiring all-or-nothing and for those not requiring it.
Still, I think the current naming - `waitForPodsScheduled` - is leaking implementation details and might be confusing at the workload level. I like the idea of a declarative API at the workload level. Here are the names that come to my mind:
- `AllOrNothing` (`Strict`, `BestEffort`)
- `ResourceAllocation` (`AllOrNothing`/`Strict`/`All`, `BestEffort`)
Maybe also worth considering:
- `PodScheduling` (`AllOrNothing`, `BestEffort`)
- `PodResources` (`AllOrNothing`, `BestEffort`)
After a discussion with @alculquicondor and @ahg-g I renamed the field to the following: `StartupRequirements`: `QuotaOnly`, `QuotaAndPodsReady`.
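As a sketch of the rename, the field could be modeled as a string enum. The Go type and constant names below are assumptions drawn from this comment, not the final Kueue API.

```go
package main

import "fmt"

// StartupRequirements models the renamed field as a string enum.
// Type and constant names are assumptions based on the comment above.
type StartupRequirements string

const (
	// QuotaOnly: the workload starts as soon as quota is reserved for it.
	QuotaOnly StartupRequirements = "QuotaOnly"
	// QuotaAndPodsReady: admission additionally waits until the workload's
	// pods are ready, giving the all-or-nothing behavior.
	QuotaAndPodsReady StartupRequirements = "QuotaAndPodsReady"
)

func main() {
	req := QuotaAndPodsReady
	fmt.Println(string(req)) // QuotaAndPodsReady
}
```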
First, we introduce a new workload condition, called `PodsScheduled`, to indicate
that the first batch of workload's pods is scheduled, more precisely:
`job.status.ready + job.status.succeeded >= job.spec.parallelism`.
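The condition quoted above can be sketched as a small predicate. The function name and the plain int32 arguments are illustrative stand-ins for the v1.Job status and spec fields, not Kueue's actual implementation.

```go
package main

import "fmt"

// podsScheduled mirrors the condition quoted above: the first batch of the
// workload's pods counts as scheduled once
// job.status.ready + job.status.succeeded >= job.spec.parallelism.
// The int32 arguments stand in for the corresponding v1.Job fields.
func podsScheduled(ready, succeeded, parallelism int32) bool {
	return ready+succeeded >= parallelism
}

func main() {
	// parallelism=4: 3 ready pods plus 1 succeeded pod satisfy the condition.
	fmt.Println(podsScheduled(3, 1, 4)) // true
	// 2 ready plus 1 succeeded do not.
	fmt.Println(podsScheduled(2, 1, 4)) // false
}
```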
Explain that job controllers are responsible for populating this condition and then discuss how we are doing it for v1.Job in a separate paragraph.
The condition is populated by Kueue's Job reconciler. I rephrased and used the name "Job reconciler" rather than "job controller" to avoid confusion.
+1 to clearly separate Job & Workload interfaces in Kueue designs. The idea behind Workload was to generalize CRD <> Workload <> Kueue integration, with Job <> Workload <> Kueue being the reference implementation.
We shouldn't over-engineer it before we actually have other CRD integrations (YAGNI), but I suggest we separate:
- User stories and user guide assuming Job. They should refer to objects that users interact with, i.e. either ClusterQueue/LocalQueue or Job (no Workload, and maybe also not 'workload', to avoid ambiguity).
- Guide for CRD <> Workload integration, using the Job reconciler as the reference implementation. We should probably try to keep it lean. Whenever we add something here, like a Job annotation, it will be an extra step for every CRD <> Workload integration.
Thanks for the high-level perspective.
- Re 1: I renamed the uses of workload to job to avoid confusion.
- Re 2: Indeed, it could be an argument against setting the annotation on Job. Not sure if it outweighs the other pros. I have updated the review: Design for all-or-nothing semantics for workload resource assignment #433 (comment). Going to also put these into alternatives. Another possibility would be to keep the workload API field, but populate it based on the local queue configuration. WDYT? It would mean that the admin needs to set up a local queue with the configuration, so that users send their workloads there. On the other side, it means there is no need for the new Job annotation.
Added the discussion into the notes section in the doc.
After the note I'm leaning toward using LocalQueue API fields (for opting in and for the timeout declaration). What do you think @qelo @alculquicondor @ahg-g ?
/hold
Let's hold until we have more opinions on this, in particular: Job annotation vs. LocalQueue API field.
We have a long term plan to use LocalQueue as a place for workload defaulting. I wouldn't add it now though; let's give the feature time to soak a bit, so perhaps starting with an annotation on the Job API is a good first step.
There are a couple of advantages which LocalQueue seems to have:
- it does not leak implementation details into the JobReconciler, allowing us to keep it lean as a reference implementation of the CRD <> Workload integration (Design for all-or-nothing semantics for workload resource assignment #433 (comment)).
- it would also be a natural place for the timeout; currently the opt-in and the related timeout sit at two opposite ends of the concepts. It also allows you to have two LocalQueues, so that the timeouts could differ for large and small jobs.
In terms of implementation complexity, I think it is similar, but we already have a webhook for validating LocalQueue, and if we want to validate the job annotation we would need to introduce a new one.
I also think that in terms of the "first step" it also seems more natural to stay within Kueue concepts.
A couple of thoughts/history:
- The integration controllers should be lean, there is no disagreement, it is a principle we have been trying to maintain since day 1 :).
- Any workload level API that we expose can be populated by the integration controller, we will not intentionally block that.
- LocalQueue can be thought of as a place for defaulting workload parameters, this was one of the design goals of LocalQueue.
it does not leak implementation to JobReconciler, allowing to keep it lean as a reference implementation of the CRD <> Workload integration (#433 (comment)).
If we want to continue to have this as a workload level API (with an extension in the future to be defaulted via LocalQueue), then this is not leaking. As for complexity, as I mentioned on the other comment, we will have that no matter what option we choose because we need to translate pod ready status to a condition on the workload.
I also think that in terms of the "first step" it also seems more natural to stay within Kueue concepts.
My thinking is that we need to be very careful when adding new fields to ClusterQueue and LocalQueue, since those are user-facing APIs. The Workload API, on the other hand, is intended as a gluing layer, so we have a bit more wiggle room to experiment with it; at the same time, an annotation on the Job API is simple to deprecate.
## Proposal

We introduce a mechanism to avoid concurrent scheduling of Job pods. More
Similar to what @ahg-g mentioned above, I think we do not aim to avoid concurrent scheduling, but to avoid resource deadlock across different admitted workloads. Or do we want to introduce such a strict serial scheduling mechanism?
I'm still unclear about the goal of this KEP. 🤔
Yeah, the goal is to ensure that a workload has all its resources allocated. Sequencing of admissions, for workloads which require the all-or-nothing semantics, is just a method that is currently available to Kueue.
have a guarantee that other workloads scheduled within this time interval would
not delay execution of my workload.

My use case can be supported by adding `kueue.x-k8s.io/wait-for-pods-scheduled`
Maybe replace this with your expected behaviors, since the approach is still unclear to readers and we'll explain it in detail below.
Rephrased the stories, hope it is clearer now.
As an administrator of batch processing in my organization I want to be able
to configure kueue quotas for on-demand resources a little higher than what is
available in the quiet period. However, I don't want to risk two workloads
deadlocking in case of issues with provisioning nodes to satisfy the configured
Sorry, I can't fully understand this. Do you mean that workloads with different priorities are scheduled at the same time but are starving, which then triggers scaling up? In theory this can be avoided, since higher-priority workloads should run first.
I was thinking about the same priorities, or unspecified priorities, when the native k8s preemption does not help. Added a sentence to clarify.
If the workload fails to schedule its pods it could block admission of other
workloads indefinitely.

The risk can be mitigated by using the Job's `ActiveDeadlineSeconds` timeout
It's really a serious problem when other types of jobs do not support this feature. Even for the native batch Job, pods can fail to run for various reasons besides resource problems.
Agree with the sentiment, was just considering decoupling it from this proposal as it is already an issue to some extent. Still, moved it to Design Details as the need was expressed by reviewers.
keps/349-all-or-nothing/README.md
// waitForPodsReady, when true, indicates that each admitted workload
// blocks admission of other workloads in the cluster, until it is in the
// `PodsReady` condition. If false, all workloads start as soon as they are
// admitted and do not block admission of other workloads. The PodsReady
// condition is only added if this setting is enabled. If unspecified,
// it defaults to false.
waitForPodsReady *bool `json:"waitForPodsReady,omitempty"`

// podsReadyTimeout describes the timeout for an admitted workload to reach
// the PodsReady condition since unsuspending of the corresponding Job.
// After exceeding this timeout the corresponding job gets suspended again
// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
// enforced only if waitForPodsReady=true. If unspecified, it defaults to 5min.
podsReadyTimeout *metav1.Time `json:"podsReadyTimeout,omitempty"`
waitForPodsReady:
  enabled: true
  timeout: 5m
looks cleaner, let me update it
In the proposed API I use `enable` for consistency with the analogous field in `InternalCertManagement`. Also, I suggest using `timeoutSeconds` to express the timeout in seconds, similarly to how Job's `ActiveDeadlineSeconds` is defined.
// job.status.StartTime, it can take an admitted workload to reach
// the PodsReady condition.
// After exceeding the timeout the corresponding job gets suspended again
// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
The `inadmissibleWorkloads` list is an implementation detail.
You can simply say that it's requeued.
/approve
I'll leave the hold to @ahg-g
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor, mimowo. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
/hold cancel
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
It proposes an opt-in mechanism to avoid concurrent pod scheduling for workloads. It is useful for running workloads which require all pods to be running at the same time.
Which issue(s) this PR fixes:
Part of: #349
Special notes for your reviewer: