Design for all-or-nothing semantics for workload resource assignment #433
Conversation
Skipping CI for Draft Pull Request.
// If false, the workload starts as soon as it is admitted and it does not
// block admission of other workloads. The `null` value results in the same
// behavior as `false`.
waitForPodsScheduled *bool `json:"waitForPodsScheduled,omitempty"`
This is a bit odd. The semantics described here (specifically, "blocking other workloads and waiting until the workload pods are scheduled") are not a property that a user of the workload API cares about, the relevant workload property is "all-or-nothing". Moreover, a batch user shouldn't have knobs that explicitly control the queue scheduling behavior.
I do think that `waitForPodsScheduled` can be a property of the queue though, and something that a batch admin may want to toggle.
So I see two options:
- Keep the knob in the Workload API but use different naming, something like `AllOrNothing = Strict | BestEffort`, with `BestEffort` being the current default behavior and `Strict` the one where we wait for the pods to schedule. The `Strict` semantics may have a different implementation in the future, based on a more optimistic approach (start all workloads just like we do now, but preempt the ones that get stuck and never fully schedule) or on an integration with the planned CA provisioning API.
- Move the knob to the ClusterQueue API, or even to ComponentConfig, and keep the proposed naming and description.
Let's review the options for the best place for the API, and their pros & cons:
- workload (current proposal)
- local queue
- cluster queue
- component
Workload/Job level pros:
- throughput - the same cluster queue may run workloads which don't need any guarantees and thus get admitted faster.
- flexibility - we can easily implement queue-level (or component-level) by translating the queue-level property into workload property during workload creation. For example, add the workload annotation to every workload created for a given local queue.
- conceptually, IMO, requiring all pods to run at the same time is a property of a workload itself.
Workload/Job level cons:
- introducing a new annotation on the Job level will increase the complexity of the job reconciler, which is the reference Job-based implementation of the CRD <> Workload integration. However, translating the annotation into a field still seems to be a marginal increase in complexity.
Component level pros:
- lower implementation complexity compared to the workload level - we could just block admission until the workload is in the `PodsScheduled` state. We wouldn't need to check if there are any already admitted workloads - we know they all have already reached the `PodsScheduled` condition.
Component level cons:
- throughput - it would degrade substantially due to the sequencing of all workloads, even those which don't require the semantics
Queue level cons:
- similar implementation complexity compared to the workload level - in order to ensure an exclusive start of pods we still need to both block admission of new workloads until the workload is in the `PodsScheduled` condition, and wait for all already-admitted workloads to reach the `PodsScheduled` condition, as some workloads might be executed on the cluster bypassing the queues.
- throughput - though it could be compensated by creating dedicated queues for workloads requiring all-or-nothing and for those not requiring it.
Still, I think the current naming - `waitForPodsScheduled` - is leaking implementation details and might be confusing at the workload level. I like the idea of a declarative API at the workload level. Here are the names that come to my mind:
- `AllOrNothing` (`Strict`, `BestEffort`)
- `ResourceAllocation` (`AllOrNothing`/`Strict`/`All`, `BestEffort`)
Maybe also worth considering:
- `PodScheduling` (`AllOrNothing`, `BestEffort`)
- `PodResources` (`AllOrNothing`, `BestEffort`)
After a discussion with @alculquicondor and @ahg-g I renamed the field to the following: `StartupRequirements`: `QuotaOnly`, `QuotaAndPodsReady`.
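As a sketch of the rename, the field could be modeled as a string enum. The Go type and constant names below are assumptions drawn from this comment, not the final Kueue API.

```go
package main

import "fmt"

// StartupRequirements models the renamed field as a string enum.
// Type and constant names are assumptions based on the comment above.
type StartupRequirements string

const (
	// QuotaOnly: the workload starts as soon as quota is reserved for it.
	QuotaOnly StartupRequirements = "QuotaOnly"
	// QuotaAndPodsReady: admission additionally waits until the workload's
	// pods are ready, giving the all-or-nothing behavior.
	QuotaAndPodsReady StartupRequirements = "QuotaAndPodsReady"
)

func main() {
	req := QuotaAndPodsReady
	fmt.Println(string(req)) // QuotaAndPodsReady
}
```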
First, we introduce a new workload condition, called `PodsScheduled`, to indicate
that the first batch of workload's pods is scheduled, more precisely:
`job.status.ready + job.status.succeeded >= job.spec.parallelism`.
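The condition quoted above can be sketched as a small predicate. The function name and the plain int32 arguments are illustrative stand-ins for the v1.Job status and spec fields, not Kueue's actual implementation.

```go
package main

import "fmt"

// podsScheduled mirrors the condition quoted above: the first batch of the
// workload's pods counts as scheduled once
// job.status.ready + job.status.succeeded >= job.spec.parallelism.
// The int32 arguments stand in for the corresponding v1.Job fields.
func podsScheduled(ready, succeeded, parallelism int32) bool {
	return ready+succeeded >= parallelism
}

func main() {
	// parallelism=4: 3 ready pods plus 1 succeeded pod satisfy the condition.
	fmt.Println(podsScheduled(3, 1, 4)) // true
	// 2 ready plus 1 succeeded do not.
	fmt.Println(podsScheduled(2, 1, 4)) // false
}
```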
Explain that job controllers are responsible for populating this condition and then discuss how we are doing it for v1.Job in a separate paragraph.
The condition is populated by Kueue's Job reconciler. I rephrased and used the name "Job reconciler" rather than "job controller" to avoid confusion.
+1 to clearly separate Job & Workload interfaces in Kueue designs. The idea behind Workload was to generalize CRD <> Workload <> Kueue integration, with Job <> Workload <> Kueue being the reference implementation.
We shouldn't over-engineer it before we actually have other CRD integrations (YAGNI), but I suggest we separate:
- User stories and user guide assuming Job. They should refer to objects that users interact with, i.e. either ClusterQueue/LocalQueue or Job (no Workload, and maybe also not 'workload', to avoid ambiguity).
- Guide for CRD <> Workload integration, using the Job reconciler as the reference implementation. We should probably try to keep it lean. Whenever we add something here, like a Job annotation, it will be an extra step for every CRD <> Workload integration.
Thanks for the high-level perspective.
- Re 1: I renamed the uses of workload to job to avoid confusion.
- Re 2: Indeed, it could be an argument against setting the annotation on Job. Not sure if it outweighs the other pros. I have updated the review: Design for all-or-nothing semantics for workload resource assignment #433 (comment). Going to also put these into alternatives. Another possibility would be to keep the workload API field, but populate it based on the local queue configuration. WDYT? It would mean that the admin needs to set up a local queue with the configuration, so that users send their workloads there. On the other side, it means there is no need for the new Job annotation.
Added the discussion into the notes section in the doc.
After the note I'm leaning toward using LocalQueue API fields (for opting in and for the timeout declaration). What do you think @qelo @alculquicondor @ahg-g ?
/hold
Let's hold until we have more opinions on this, in particular: Job annotation vs. LocalQueue API field.
We have a long term plan to use LocalQueue as a place for workload defaulting. I wouldn't add it now though; let's give the feature time to soak a bit, so perhaps starting with an annotation on the Job API is a good first step.
There are a couple of advantages which LocalQueue seems to have:
- it does not leak implementation details into the JobReconciler, allowing us to keep it lean as a reference implementation of the CRD <> Workload integration (Design for all-or-nothing semantics for workload resource assignment #433 (comment)).
- it would also be a natural place for the timeout; currently the opt-in and the related timeout sit at two opposite ends of the concepts. It also allows you to have two LocalQueues, so that the timeouts could differ for large and small jobs.
In terms of implementation complexity, I think it is similar, but we already have a webhook for validating LocalQueue, and if we want to validate the job annotation we would need to introduce a new one.
I also think that in terms of the "first step" it also seems more natural to stay within Kueue concepts.
A couple of thoughts/history:
- The integration controllers should be lean, there is no disagreement, it is a principle we have been trying to maintain since day 1 :).
- Any workload level API that we expose can be populated by the integration controller, we will not intentionally block that.
- LocalQueue can be thought of as a place for defaulting workload parameters, this was one of the design goals of LocalQueue.
it does not leak implementation to JobReconciler, allowing to keep it lean as a reference implementation of the CRD <> Workload integration (#433 (comment)).
If we want to continue to have this as a workload level API (with an extension in the future to be defaulted via LocalQueue), then this is not leaking. As for complexity, as I mentioned on the other comment, we will have that no matter what option we choose because we need to translate pod ready status to a condition on the workload.
I also think that in terms of the "first step" it also seems more natural to stay within Kueue concepts.
My thinking is that we need to be very careful when adding new fields to ClusterQueue and LocalQueue, since those are user-facing APIs. The Workload API, on the other hand, is intended as a gluing layer, so we have a bit more wiggle room to experiment with it; at the same time, an annotation on the Job API is simple to deprecate.
## Proposal

We introduce a mechanism to avoid concurrent scheduling of Job pods. More
Similar to what @ahg-g mentioned above, I think we do not aim to avoid concurrent scheduling, but to avoid resource deadlock across different admitted workloads. Or do we want to introduce such a strict serial scheduling mechanism?
I'm still unclear about the goal of this KEP. 🤔
Yeah, the goal is to ensure that a workload has all its resources allocated. Sequencing of admissions, for workloads which require the all-or-nothing semantics, is just a method that is currently available to Kueue.
have a guarantee that other workloads scheduled within this time interval would
not delay execution of my workload.

My use case can be supported by adding `kueue.x-k8s.io/wait-for-pods-scheduled`
Maybe replace this with your expected behaviors, since the approach is still unclear to readers and we'll explain it in detail below.
Rephrased the stories, hope it is clearer now.
As an administrator of batch processing in my organization I want to be able
to configure kueue quotas for on-demand resources a little higher than what is
available in the quiet period. However, I don't want to risk two workloads
deadlocking in case of issues with provisioning nodes to satisfy the configured
Sorry, I can't fully understand this. Do you mean that workloads with different priorities are scheduled at the same time but are starving, which then triggers scaling up? In theory this can be avoided, since higher-priority workloads should run first.
I was thinking about the same priorities, or unspecified priorities, when the native k8s preemption does not help. Added a sentence to clarify.
If the workload fails to schedule its pods it could block admission of other
workloads indefinitely.

The risk can be mitigated by using the Job's `ActiveDeadlineSeconds` timeout
It's really a serious problem when other types of jobs do not support this feature. Even for the native batch Job, pods can fail to run for various reasons besides resource problems.
Agree with the sentiment, was just considering decoupling it from this proposal as it is already an issue to some extent. Still, moved it to Design Details as the need was expressed by reviewers.
keps/349-all-or-nothing/README.md
// waitForPodsReady, when true, indicates that each admitted workload
// blocks admission of other workloads in the cluster, until it is in the
// `PodsReady` condition. If false, all workloads start as soon as they are
// admitted and do not block admission of other workloads. The PodsReady
// condition is only added if this setting is enabled. If unspecified,
// it defaults to false.
waitForPodsReady *bool `json:"waitForPodsReady,omitempty"`

// podsReadyTimeout describes the timeout for an admitted workload to reach
// the PodsReady condition since unsuspending of the corresponding Job.
// After exceeding this timeout the corresponding job gets suspended again
// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
// enforced only if waitForPodsReady=true. If unspecified, it defaults to 5min.
podsReadyTimeout *metav1.Time `json:"podsReadyTimeout,omitempty"`
waitForPodsReady:
  enabled: true
  timeout: 5m
looks cleaner, let me update it
In the proposed API I use `enable` for consistency with the analogous field in `InternalCertManagement`. Also, I suggest using `timeoutSeconds` to express the timeout in seconds, similarly to how Job's `ActiveDeadlineSeconds` is defined.
// job.status.StartTime, it can take an admitted workload to reach
// the PodsReady condition.
// After exceeding the timeout the corresponding job gets suspended again
// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
The `inadmissibleWorkloads` list is an implementation detail.
You can simply say that it's requeued.
/approve
I'll leave the hold to @ahg-g
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor, mimowo. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
/hold cancel
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
It proposes an opt-in mechanism to avoid concurrent pod scheduling for workloads. It is useful for running workloads which require all pods to be running at the same time.
Which issue(s) this PR fixes:
Part of: #349
Special notes for your reviewer: