A different API to express colocation and exclusiveness #75

Open
Tracked by #350
ahg-g opened this issue Apr 26, 2023 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@ahg-g
Contributor

ahg-g commented Apr 26, 2023

The existing spec.replicatedJobs[*].exclusive API allows users to express that the jobs have 1:1 assignment to a domain. For example, one and only one job per rack.

This is currently expressed as:

exclusive:
  topologyKey: rack

This API is simple compared to the complex pod affinity and anti-affinity rules that a user would have needed to add to achieve the same semantics (see #27).
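
For comparison, the roughly equivalent rules a user would otherwise have to attach to each child Job's pod template look something like the sketch below (job-name and my-job stand in for whatever labels identify the job; this is an illustration, not the exact rules discussed in #27):

affinity:
  # keep all pods of this job on the same rack
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: In
          values:
          - my-job
  # keep pods belonging to any other job off that rack
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: Exists
        - key: job-name
          operator: NotIn
          values:
          - my-job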

However, one issue with this API is that it is limited to one use case (exclusive 1:1 placement); I would like to discuss making it a bit more generic to support the following use cases without losing simplicity:

Use Cases

UC1: Exclusive 1:1 job-to-domain assignment (what the current API offers)
UC2: Each job is colocated in one domain, but not necessarily exclusive to it (i.e., more than one job could land on the same domain)
UC3: Allow users to express either preferred or required assignment

What other use cases are there?

API options

UC3 can be supported by extending the existing API; however, the same can't be said with regard to UC2, since the type name "Exclusive" doesn't lend itself to that UC, even if we do #40.
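
For example, UC3 on its own could be covered by a small extension of the current field, along these lines (sketch only; the policy field is hypothetical):

exclusive:
  topologyKey: rack
  policy: Strict # or BestEffort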

Option 1

type ReplicatedJob struct {
  ....
  // Colocation defines how (and where) the pods of each job replica are placed.
  Colocation *Colocation `json:"colocation,omitempty"`
}

type ColocationPolicy string

const (
  ColocationPolicyStrict     ColocationPolicy = "Strict"
  ColocationPolicyBestEffort ColocationPolicy = "BestEffort"
)

// Colocation describes how the pods of a job will be colocated.
type Colocation struct {
  // Policy defines whether colocation is required (Strict) or preferred (BestEffort).
  Policy ColocationPolicy `json:"policy,omitempty"`
  // TopologyKey defines the topology on which the pods of a job replica will be colocated.
  TopologyKey string `json:"topologyKey"`
  // AntiAffinityPods selects the pods that a job replica cannot be colocated with.
  AntiAffinityPods *metav1.LabelSelector `json:"antiAffinityPods,omitempty"`
  // AntiAffinityNamespaces selects the namespaces in which the anti-affinity applies.
  AntiAffinityNamespaces *metav1.LabelSelector `json:"antiAffinityNamespaces,omitempty"`
}

UC1

colocation:
  topologyKey: rack
  antiAffinityPods:
    matchExpressions:
    - key: job-name
      operator: Exists

UC2

colocation:
  topologyKey: rack

What other options?

@alculquicondor

alculquicondor commented Apr 26, 2023

The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

key: job-name
operator: NotIn
values:
- my-job

It might be better to keep the "exclusive" semantics:

topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort
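
For illustration, the two use cases from the issue would then be spelled something like:

# UC1: exclusive 1:1 job per rack
topologyColocation:
  mode: ExclusiveDomain
  key: rack
  enforcement: Strict

# UC2: each job colocated on one rack, but racks can be shared
topologyColocation:
  mode: SameDomain
  key: rack
  enforcement: Strict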

@ahg-g
Contributor Author

ahg-g commented Apr 26, 2023

The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

Correct, it is implicit.

It might be better to keep the "exclusive" semantics:

So my thinking is that the "exclusive" semantics are supported using the AntiAffinityPods and AntiAffinityNamespaces selectors, since they offer greater flexibility to tune and define exclusiveness (e.g., exclusive against all pods, or only against other jobs, etc.).
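
For example, a sketch of two variants using the Option 1 fields (an empty selector matches everything, as usual for label selectors):

# exclusive only against pods that belong to other Jobs
colocation:
  topologyKey: rack
  antiAffinityPods:
    matchExpressions:
    - key: job-name
      operator: Exists

# exclusive against all other pods in all namespaces
colocation:
  topologyKey: rack
  antiAffinityPods: {}          # empty selector matches every pod
  antiAffinityNamespaces: {}    # empty selector matches every namespace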

@alculquicondor

The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

Correct, it is implicit.

This is my worry then. There is a hidden addition to the antiaffinity rules that is not obvious.

@ahg-g
Contributor Author

ahg-g commented Apr 26, 2023

This is my worry then. There is a hidden addition to the antiaffinity rules that is not obvious.

We have to have that implicitly added in all cases, though.

@alculquicondor

Yes, but IMO it's easier to explain if you say "exclusive: true" or something of the sort, compared to mutating the matchExpressions that the user provides.

@ahg-g
Contributor Author

ahg-g commented Apr 26, 2023

Right, but I feel that we need to provide a knob (a pod selector) anyway, to allow users to specify what to be exclusive against, and by definition it shouldn't be against itself. So having another parameter to say "exclusive" would be redundant in that sense, right?

@alculquicondor

In that case, I don't think the name AntiAffinityPods is providing enough context. Could it be misunderstood as a means to implement 1:1 pod per node?

@ahg-g
Contributor Author

ahg-g commented Apr 26, 2023

It is a selector on the pods, hence the name; any other suggestions?

@alculquicondor

I would be tempted to call it jobAntiAffinity to detach it from the pods themselves. The fact that the pods get the pod anti-affinity rule is an implementation detail. But I could picture arguments against.
I'll stop here. Maybe someone else has a better idea :)

@qelo

qelo commented Apr 27, 2023

Aldo's proposal is quite appealing in its simplicity:

topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort

When there is a need for more advanced anti-affinity rules, maybe that can be left to classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in #27)

Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated I can get, but at most X"? See max-distance in GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

@danielvegamyhre
Contributor

@alculquicondor quick clarifying question about your suggestion:

topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort

Is the below interpretation correct or am I misunderstanding?
ExclusiveDomain = assign 1 job per topology domain (e.g. 1 job per nodepool)
SameDomain = assign all jobs in one topology domain (e.g. all jobs on one node pool)

@ahg-g
Contributor Author

ahg-g commented May 5, 2023

Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated I can get, but at most X"? See max-distance in GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.

@ahg-g
Contributor Author

ahg-g commented May 5, 2023

SameDomain = assign all jobs in one topology domain (e.g. all jobs on one node pool)

No, it means the Job itself is colocated in a single domain (node pool), but it can coexist with others.

If we are to go with an explicit enum, then it would probably be: Colocated|ColocatedAndExclusive

@ahg-g
Contributor Author

ahg-g commented May 5, 2023

When there is a need for more advanced anti-affinity rules, maybe that can be left to classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in #27)

I don't think it is doable in a reliable and non-surprising way, because we need to have a selector term that prevents the job from excluding itself. The API in the main post makes that selector term explicit, attached to the specified topology and the namespace where this applies.

@ahg-g
Contributor Author

ahg-g commented May 7, 2023

Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated I can get, but at most X"? See max-distance in GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.

But, it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict and the inner ones preferred:

colocation:
- topologyKey: rack
  mode: BestEffort
- topologyKey: zone
  mode: Strict

If we want to support more levels, then the API should allow setting pod-affinity weights (with higher weights for the innermost ones, as you pointed out offline). At this point, the colocation API drifts closer to pod-affinity, so its value becomes less obvious compared to just using pod-affinity directly :)
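
For illustration, the pod affinity that an implementation could generate for the rack/zone example above might look roughly like this (job-name/my-job are placeholders, and the weight is arbitrary):

affinity:
  podAffinity:
    # zone is Strict, so it becomes a required term
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: zone
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: In
          values:
          - my-job
    # rack is BestEffort, so it becomes a weighted preferred term
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100   # innermost topology gets the highest weight
      podAffinityTerm:
        topologyKey: rack
        labelSelector:
          matchExpressions:
          - key: job-name
            operator: In
            values:
            - my-job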

@alculquicondor

so its value becomes less obvious compared to just using pod-affinity directly :)

Yes, but at that point it's better to enhance the pod spec to support what we need. In the meantime, JobSet can support the minimal requirement: probably just Exclusive and Strict.

@qelo

qelo commented May 8, 2023

it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict and the inner ones preferred:

Exactly. Btw. since BestEffort/Strict is a choice only for the outermost topology (and the inner ones are always BestEffort), mode could be outside the list:

colocation:
  topologyKey: [rack, zone]
  mode: Strict

UC1: Exclusive 1:1

I'm not sure how the list should support exclusive, though. I guess exclusive makes the most sense when there is only one topology in the list.

Btw. why would jobs want exclusive actually? Because of some noisy neighbors? I suspect what we really want is an all-or-nothing mechanism to avoid deadlocks and exclusive is just a workaround for that? Like "I would be fine to share the rack with someone if we both fit, but I want to avoid deadlocks when two of us fit partially -- so I will keep the rack exclusive as a workaround".

At this point, the colocation API drifts closer to pod-affinity, so its value becomes less obvious compared to just using pod-affinity directly

I didn't follow this. The value would be a much simpler API for the user. The implementation would set pod affinities and weights, but the user would only deal with the 'colocation' API, not with pod affinities, right?

That said, it also makes a lot of sense if you decide to focus on something simpler in the beginning.

@vsoch
Contributor

vsoch commented May 9, 2023

Btw. why would jobs want exclusive actually? Because of some noisy neighbors?

Does this mean one pod per machine? Or exclusive use of some machine? There are many use cases in our workflows for which we would absolutely not want to share a machine, as it would impact performance.

@danielvegamyhre
Contributor

This was addressed in #309 and #342 and included in release v0.3.0 🥳

@danielvegamyhre
Contributor

Actually, we didn't change the API (it's still an annotation), just its implementation, to improve scheduling throughput for large-scale training. Feel free to re-open this if you want to explore it further.

@ahg-g ahg-g reopened this Jan 4, 2024
@ahg-g
Contributor Author

ahg-g commented Jan 4, 2024

This issue is about having a proper API, not the optimizations.

@danielvegamyhre danielvegamyhre added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 16, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2024
@danielvegamyhre
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2024