
Design for all-or-nothing semantics for workload resource assignment #433

Merged (3 commits), Dec 1, 2022

Conversation

@mimowo (Contributor) commented Nov 23, 2022:

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

It proposes an opt-in mechanism to avoid concurrent pod scheduling across workloads. It is useful for running workloads that require all of their pods to be running at the same time.

Which issue(s) this PR fixes:

Part of: #349

Special notes for your reviewer:

@k8s-ci-robot (Contributor) commented:
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 23, 2022
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 23, 2022
@mimowo mimowo force-pushed the avoid-coscheduling-design branch 2 times, most recently from 3c08f58 to 2b91e1d Compare November 25, 2022 09:07
@mimowo mimowo changed the title from "Design for avoiding workload coscheduling - draft" to "Design for avoiding concurrent pod scheduling for workloads" Nov 25, 2022
@mimowo mimowo force-pushed the avoid-coscheduling-design branch 4 times, most recently from ad3eade to ad77660 Compare November 25, 2022 12:17
@mimowo mimowo marked this pull request as ready for review November 25, 2022 12:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 25, 2022
@mimowo mimowo force-pushed the avoid-coscheduling-design branch 3 times, most recently from 92b940c to f9abac0 Compare November 25, 2022 13:01
// If false, the workload starts as soon as it is admitted and it does not
// block admission of other workloads. The `null` value results in the same
// behavior as `false`.
waitForPodsScheduled *bool `json:"waitForPodsScheduled,omitempty"`
Contributor:

This is a bit odd. The semantics described here (specifically, "blocking other workloads and waiting until the workload pods are scheduled") are not a property that a user of the workload API cares about; the relevant workload property is "all-or-nothing". Moreover, a batch user shouldn't have knobs that explicitly control the queue scheduling behavior.

I do think that waitForPodsScheduled can be a property of the queue though and something that a batch admin may want to toggle.

So I see two options:

  1. Keep the knob in the Workload API but use different naming, something like AllOrNothing = Strict | BestEffort, with BestEffort being the current default behavior and Strict being the one where we wait for the pods to schedule. The Strict semantics may have a different implementation in the future, based on a more optimistic approach (start all workloads just like we do now, but preempt the ones that get stuck and never fully schedule) or on an integration with the planned CA's provisioning API.

  2. Move the knob to the ClusterQueue API or even to ComponentConfig and keep the proposed naming and description.

@mimowo (Contributor Author), Nov 28, 2022:

Let's review the options for the best place for the API and their pros & cons:

  • workload (current proposal)
  • local queue
  • cluster queue
  • component

Workload/Job level pros:

  • throughput - the same cluster queue may run workloads which don't need any guarantees and thus get admitted faster.
  • flexibility - we can easily implement queue-level (or component-level) semantics by translating the queue-level property into a workload property during workload creation. For example, add the workload annotation to every workload created for a given local queue.
  • conceptually, IMO, requiring all pods to run at the same time is a property of a workload itself.

Workload/Job level cons:

  • introducing a new annotation on the Job level will increase the complexity of the job reconciler, which is the reference Job-based implementation of the CRD <> Workload integration. However, translating the annotation into a field still seems to be a marginal increase in complexity.

Component level pros:

  • lower implementation complexity compared to the workload level - we could just block admission until the workload is in the PodsScheduled state. We wouldn't need to check whether there are any already admitted workloads - we know they have all already reached the PodsScheduled condition.

Component level cons:

  • throughput - would degrade substantially due to sequencing all workloads, even those which don't require the semantics.

Queue level cons:

  • similar implementation complexity compared to the workload level - in order to ensure an exclusive start of pods we still need to both block admission of new workloads until the current workload is in PodsScheduled and wait for all already admitted workloads to reach the PodsScheduled condition, as some workloads might run on the cluster bypassing the queues.
  • throughput - though this could be compensated for by creating dedicated queues for workloads that require all-or-nothing and for those that don't.

Still, I think the current naming - waitForPodsScheduled - is leaking implementation details and might be confusing at the workload level. I like the idea of a declarative API at the workload level. Here are the names that come to my mind:

  • AllOrNothing (Strict, BestEffort)
  • ResourceAllocation (AllOrNothing / Strict / All, BestEffort)

@mimowo (Contributor Author):

Maybe also worth considering:

  • PodScheduling (AllOrNothing, BestEffort)
  • PodResources (AllOrNothing, BestEffort)

@mimowo (Contributor Author):

After a discussion with @alculquicondor and @ahg-g, I renamed the field to the following:
StartupRequirements: QuotaOnly, QuotaAndPodsReady
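
For illustration, here is a minimal sketch of how the renamed field might be expressed in Go; the type and constant names are assumptions for discussion, not the merged Kueue API:

package kueue

// StartupRequirements is a sketch only; names are illustrative, not the final API.
type StartupRequirements string

const (
	// QuotaOnly: the workload starts as soon as its quota is reserved
	// (the current default behavior).
	StartupRequirementsQuotaOnly StartupRequirements = "QuotaOnly"

	// QuotaAndPodsReady: admission of further workloads is blocked until the
	// pods of this admitted workload are ready (all-or-nothing semantics).
	StartupRequirementsQuotaAndPodsReady StartupRequirements = "QuotaAndPodsReady"
)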


First, we introduce a new workload condition, called `PodsScheduled`, to indicate
that the first batch of the workload's pods is scheduled; more precisely:
`job.status.ready + job.status.succeeded >= job.spec.parallelism`.
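
As a rough sketch (an assumed helper, not Kueue's actual code), the quoted check could be computed from the batch/v1 Job status as follows:

package jobframework

import batchv1 "k8s.io/api/batch/v1"

// podsScheduled mirrors the condition quoted above:
// job.status.ready + job.status.succeeded >= job.spec.parallelism.
func podsScheduled(job *batchv1.Job) bool {
	ready := int32(0)
	if job.Status.Ready != nil {
		ready = *job.Status.Ready
	}
	parallelism := int32(1) // the Job API defaults parallelism to 1 when unset
	if job.Spec.Parallelism != nil {
		parallelism = *job.Spec.Parallelism
	}
	return ready+job.Status.Succeeded >= parallelism
}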
Contributor:

Explain that job controllers are responsible for populating this condition and then discuss how we are doing it for v1.Job in a separate paragraph.

@mimowo (Contributor Author):

The condition is populated by Kueue's Job reconciler. Rephrased and used the "Job reconciler" name instead of "job controller" to avoid confusion.

Commenter:

+1 to clearly separate Job & Workload interfaces in Kueue designs. The idea behind Workload was to generalize CRD <> Workload <> Kueue integration, with Job <> Workload <> Kueue being the reference implementation.

We shouldn't over-engineer it before we actually have other CRD integrations (YAGNI), but I suggest we separate:

  1. User stories and user guide assuming Job. It should refer to objects that users interact with, i.e. either ClusterQueue/LocalQueue or Job (no Workload, and maybe also not 'workload' to avoid ambiguity)

  2. Guide for CRD <> Workload integration, using Job reconciler as reference implementation. We should probably try to keep it lean. Whenever we add something here, like Job annotation, it will be an extra step for CRD <> Workload integration.

@mimowo (Contributor Author), Nov 29, 2022:


Thanks for the high-level perspective.

  • Re 1: I renamed the uses of workload to job to avoid confusion.
  • Re 2: Indeed, it could be an argument against setting the annotation on the Job. I'm not sure it outweighs the other pros. I have updated the review: Design for all-or-nothing semantics for workload resource assignment #433 (comment). I'm also going to put these into the alternatives. Another possibility would be to keep the workload API field, but populate it based on the local queue configuration. WDYT? It would mean that the admin needs to set up a local queue with the configuration, so that users send their workloads there. On the other hand, it means there is no need for the new Job annotation.

@mimowo (Contributor Author), Nov 29, 2022:


Added the discussion to the notes section in the doc.
After writing the note, I'm leaning toward using LocalQueue API fields (for opting in and for the timeout declaration). What do you think @qelo @alculquicondor @ahg-g?

@mimowo (Contributor Author), Nov 29, 2022:


/hold
Let's hold until we have more opinions on this, in particular: Job annotation vs. LocalQueue API field.

Contributor:

We have a long-term plan to use LocalQueue as a place for workload defaulting. I wouldn't add it now though; let's give the feature time to soak a bit, so perhaps starting with an annotation on the Job API is a good first step.

@mimowo (Contributor Author):

There are a couple of advantages that LocalQueue seems to have:

  • it does not leak implementation details into the JobReconciler, allowing us to keep it lean as a reference implementation of the CRD <> Workload integration (Design for all-or-nothing semantics for workload resource assignment #433 (comment)).
  • it would also be a natural place for the timeout; currently the opt-in and the related timeout sit at two opposite ends of the concept. It also allows you to have two LocalQueues, so that the timeouts can differ for large and small jobs.

In terms of implementation complexity, I think it is similar, but we already have a webhook for validating LocalQueue, whereas if we want to validate the job annotation we would need to introduce a new one.

I also think that, as a "first step", it seems more natural to stay within Kueue concepts.

@ahg-g (Contributor), Nov 29, 2022:


A couple of thoughts/history:

  1. The integration controllers should be lean; there is no disagreement, it is a principle we have been trying to maintain since day 1 :).
  2. Any workload level API that we expose can be populated by the integration controller, we will not intentionally block that.
  3. LocalQueue can be thought of as a place for defaulting workload parameters, this was one of the design goals of LocalQueue.

> it does not leak implementation to JobReconciler, allowing to keep it lean as a reference implementation of the CRD <> Workload integration (#433 (comment)).

If we want to continue to have this as a workload-level API (with an extension in the future to be defaulted via LocalQueue), then this is not leaking. As for complexity, as I mentioned in the other comment, we will have that no matter what option we choose, because we need to translate the pod ready status into a condition on the workload.

> I also think that in terms of the "first step" it also seems more natural to stay within Kueue concepts.

My thinking is that we need to be very careful when adding new fields to ClusterQueue and LocalQueue, since those are user-facing APIs. The Workload API is intended as a gluing layer, so we have a bit more wiggle room to experiment with it, and at the same time an annotation on the Job API is simple to deprecate.
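
For concreteness, a hypothetical sketch of the Job-annotation option; the annotation key is the one from the KEP draft quoted later in this thread, while the helper and its value convention are only illustrative:

package jobframework

import batchv1 "k8s.io/api/batch/v1"

// WaitForPodsScheduledAnnotation is the opt-in annotation from the KEP draft.
const WaitForPodsScheduledAnnotation = "kueue.x-k8s.io/wait-for-pods-scheduled"

// waitForPodsScheduled reports whether the Job opted in to all-or-nothing
// admission via the annotation (illustrative helper, not actual Kueue code).
func waitForPodsScheduled(job *batchv1.Job) bool {
	return job.Annotations[WaitForPodsScheduledAnnotation] == "true"
}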


## Proposal

We introduce a mechanism to avoid concurrent scheduling of Job pods. More
Contributor:

Similar to what @ahg-g mentioned above, I think we're not aiming to avoid concurrent scheduling, but to avoid resource deadlocks across different admitted workloads. Or do we actually want to introduce such a strict serial scheduling mechanism?

Contributor:

I'm still unclear about the goal of this KEP. 🤔

@mimowo (Contributor Author), Nov 28, 2022:


Yeah, the goal is to ensure that a workload has all its resources allocated. Sequencing of admissions, for workloads which require the all-or-nothing semantics, is just a method that is currently available to Kueue.

have a guarantee that other workloads scheduled within this time interval would
not delay execution of my workload.

My use case can be supported by adding `kueue.x-k8s.io/wait-for-pods-scheduled`
Contributor:

Maybe replace this with your expected behavior, as the approach is still unclear to readers at this point and we'll explain it in detail below.

@mimowo (Contributor Author):

Rephrased the stories, hope it is clearer now.

As an administrator of batch processing in my organization, I want to be able
to configure kueue quotas for on-demand resources a little higher than what is
available in the quiet period. However, I don't want to risk two workloads
deadlocking in case of issues with provisioning nodes to satisfy the configured
Contributor:

Sorry, I can't fully understand this. Do you mean that workloads with different priorities are scheduled at the same time but are starving, which then triggers the scaling up? But in theory this can be avoided, since higher-priority workloads should run first?

@mimowo (Contributor Author), Nov 28, 2022:


I was thinking about equal priorities, or unspecified priorities, where the native k8s preemption does not help. Added a sentence to clarify.

If the workload fails to schedule its pods it could block admission of other
workloads indefinitely.

The risk can be mitigated by using the Job's `ActiveDeadlineSeconds` timeout
Contributor:

It's really a serious problem when other types of jobs do not support this feature. Even for the native batch Job, pods can fail to run successfully for various reasons besides resource problems.

@mimowo (Contributor Author):

I agree with the sentiment; I was just considering decoupling it from this proposal, as it is already an issue to some extent. Still, I moved it to Design Details since the need was expressed by reviewers.

@mimowo mimowo force-pushed the avoid-coscheduling-design branch 13 times, most recently from c809863 to ab3c41b Compare December 1, 2022 12:51
Comment on lines 198 to 211
// waitForPodsReady, when true, indicates that each admitted workload
// blocks admission of other workloads in the cluster, until it is in the
// `PodsReady` condition. If false, all workloads start as soon as they are
// admitted and do not block admission of other workloads. The PodsReady
// condition is only added if this setting is enabled. If unspecified,
// it defaults to false.
waitForPodsReady *bool `json:"waitForPodsReady,omitempty"`

// podsReadyTimeout describes the timeout for an admitted workload to reach
// the PodsReady condition since unsuspending of the corresponding Job.
// After exceeding this timeout the corresponding job gets suspended again
// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
// enforced only if waitForPodsReady=true. If unspecified, it defaults to 5min.
podsReadyTimeout *metav1.Time `json:"podsReadyTimeout,omitempty"`
Contributor:

waitForPodsReady:
  enabled: true
  timeout: 5m

@mimowo (Contributor Author):

Looks cleaner, let me update it.

@mimowo (Contributor Author):

In the proposed API I use `enable` for consistency with the analogous field in InternalCertManagement. Also, I suggest using `timeoutSeconds` to express the timeout in seconds, similarly to how the Job's ActiveDeadlineSeconds is defined.
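
A possible shape of that configuration section in Go, assuming the `enable` and `timeoutSeconds` naming above (a sketch for discussion, not the final ComponentConfig):

package config

// WaitForPodsReady is a sketch of the waitForPodsReady configuration block
// discussed above; field names are assumptions, not the final API.
type WaitForPodsReady struct {
	// Enable turns on blocking admission of other workloads until the admitted
	// workload reaches the PodsReady condition.
	Enable bool `json:"enable,omitempty"`

	// TimeoutSeconds is counted from the Job's start time; once exceeded, the
	// job is suspended again and the workload is requeued. Defaults to 300 (5m).
	TimeoutSeconds *int32 `json:"timeoutSeconds,omitempty"`
}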

// job.status.StartTime, it can take an admitted workload to reach
// the PodsReady condition.
// After exceeding the timeout the corresponding job gets suspended again
// and moved to the ClusterQueue's inadmissibleWorkloads list. The timeout is
Contributor:

The inadmissibleWorkloads list is an implementation detail.

You can simply say that it's requeued.

@alculquicondor (Contributor) left a comment:


/approve

I'll leave the hold to @ahg-g

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahg-g (Contributor) commented Dec 1, 2022:

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 1, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2022
@k8s-ci-robot k8s-ci-robot merged commit 2b48316 into kubernetes-sigs:main Dec 1, 2022
@mimowo mimowo deleted the avoid-coscheduling-design branch March 18, 2023 18:45