A different API to express colocation and exclusiveness #75

Open
Tracked by #350
ahg-g opened this issue Apr 26, 2023 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@ahg-g
Contributor

ahg-g commented Apr 26, 2023

The existing spec.replicatedJobs[*].exclusive API allows users to express that the jobs have 1:1 assignment to a domain. For example, one and only one job per rack.

This is currently expressed as:

exclusive:
  topologyKey: rack

This API is simple compared to the complex pod affinity and anti-affinity rules that a user would have needed to add to achieve the same semantics (see #27).
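
For comparison, the roughly equivalent rules a user would otherwise have to attach to each child Job's pod template look something like the sketch below (job-name and my-job stand in for whatever labels identify the job; this is an illustration, not the exact rules discussed in #27):

affinity:
  # keep all pods of this job on the same rack
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: In
          values:
          - my-job
  # keep pods belonging to any other job off that rack
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: rack
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: Exists
        - key: job-name
          operator: NotIn
          values:
          - my-job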

However, one issue with this API is that it is limited to one use case (exclusive 1:1 placement); I would like to discuss making it a bit more generic to support the following use cases without losing simplicity:

Use Cases

UC1: Exclusive 1:1 job-to-domain assignment (what the current API offers)
UC2: Each job is colocated in one domain, but not necessarily exclusive to it (i.e., more than one job could land on the same domain)
UC3: Allow users to express either preferred or required assignment

What other use cases are there?

API options

UC3 can be supported by extending the existing API; however, the same can't be said with regard to UC2, since the type name "Exclusive" doesn't lend itself to that UC, even if we do #40.
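
For example, UC3 on its own could be covered by a small extension of the current field, along these lines (sketch only; the policy field is hypothetical):

exclusive:
  topologyKey: rack
  policy: Strict # or BestEffort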

Option 1

type ReplicatedJob struct {
  ....
  // Colocation defines how (and where) the pods of each job replica are placed.
  Colocation *Colocation `json:"colocation,omitempty"`
}

type ColocationPolicy string

const (
  ColocationPolicyStrict     ColocationPolicy = "Strict"
  ColocationPolicyBestEffort ColocationPolicy = "BestEffort"
)

// Colocation describes how the pods of a job will be colocated.
type Colocation struct {
  // Policy defines whether colocation is required (Strict) or preferred (BestEffort).
  Policy ColocationPolicy `json:"policy,omitempty"`
  // TopologyKey defines the topology on which the pods of a job replica will be colocated.
  TopologyKey string `json:"topologyKey"`
  // AntiAffinityPods selects the pods that a job replica cannot be colocated with.
  AntiAffinityPods *metav1.LabelSelector `json:"antiAffinityPods,omitempty"`
  // AntiAffinityNamespaces selects the namespaces in which the anti-affinity applies.
  AntiAffinityNamespaces *metav1.LabelSelector `json:"antiAffinityNamespaces,omitempty"`
}

UC1

colocation:
  topologyKey: rack
  antiAffinityPods:
    matchExpressions:
    - key: job-name
      operator: Exists

UC2

colocation:
  topologyKey: rack

What other options?

@alculquicondor

alculquicondor commented Apr 26, 2023

The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

key: job-name
operator: NotIn
values:
- my-job

It might be better to keep the "exclusive" semantics:

topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort
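
For illustration, the two use cases from the issue would then be spelled something like:

# UC1: exclusive 1:1 job per rack
topologyColocation:
  mode: ExclusiveDomain
  key: rack
  enforcement: Strict

# UC2: each job colocated on one rack, but racks can be shared
topologyColocation:
  mode: SameDomain
  key: rack
  enforcement: Strict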

@ahg-g
Contributor Author

ahg-g commented Apr 26, 2023

The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

Correct, it is implicit.

It might be better to keep the "exclusive" semantics:

So my thinking is that the "exclusive" semantics are supported using the AntiAffinityPods and AntiAffinityNamespaces selectors, since they offer greater flexibility to tune and define exclusiveness (e.g., exclusive against all pods, or only against other jobs, etc.).
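
For example, a sketch of two variants using the Option 1 fields (an empty selector matches everything, as usual for label selectors):

# exclusive only against pods that belong to other Jobs
colocation:
  topologyKey: rack
  antiAffinityPods:
    matchExpressions:
    - key: job-name
      operator: Exists

# exclusive against all other pods in all namespaces
colocation:
  topologyKey: rack
  antiAffinityPods: {}          # empty selector matches every pod
  antiAffinityNamespaces: {}    # empty selector matches every namespace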

@alculquicondor

The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

Correct, it is implicit.

This is my worry then. There is a hidden addition to the antiaffinity rules that is not obvious.

@ahg-g
Contributor Author

ahg-g commented Apr 26, 2023

This is my worry then. There is a hidden addition to the antiaffinity rules that is not obvious.

We have to have that implicitly added in all cases, though.

@alculquicondor

Yes, but IMO it's easier to explain if you say "exclusive: true" or something of the sort, compared to mutating the matchExpressions that the user provides.

@ahg-g
Contributor Author

ahg-g commented Apr 26, 2023

Right, but I feel that we need to provide a knob (a pod selector) anyway, to allow users to specify what to be exclusive against, and by definition it shouldn't be against itself. So having another parameter to say "exclusive" would be redundant in that sense, right?

@alculquicondor

In that case, I don't think the name AntiAffinityPods is providing enough context. Could it be misunderstood as a means to implement 1:1 pod per node?

@ahg-g
Contributor Author

ahg-g commented Apr 26, 2023

It is a selector on the pods, hence the name; any other suggestions?

@alculquicondor

I would be tempted to call it jobAntiAffinity to detach it from the pods themselves. The fact that the pods get the pod anti-affinity rule is an implementation detail. But I could picture arguments against.
I'll stop here. Maybe someone else has a better idea :)

@qelo

qelo commented Apr 27, 2023

Aldo's proposal is quite appealing in its simplicity:

topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort

When there is a need for more advanced anti-affinity rules, maybe that can be left to classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in #27)

Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated I can get, but at most X"? See max-distance in GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

@danielvegamyhre
Contributor

@alculquicondor quick clarifying question about your suggestion:

topologyColocation:
  mode: None|SameDomain|ExclusiveDomain
  key: rack
  enforcement: Strict|BestEffort

Is the below interpretation correct or am I misunderstanding?
ExclusiveDomain = assign 1 job per topology domain (e.g. 1 job per nodepool)
SameDomain = assign all jobs in one topology domain (e.g. all jobs on one node pool)

@ahg-g
Contributor Author

ahg-g commented May 5, 2023

Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated I can get, but at most X"? See max-distance in GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.

@ahg-g
Contributor Author

ahg-g commented May 5, 2023

SameDomain = assign all jobs in one topology domain (e.g. all jobs on one node pool)

No, it means the Job itself is colocated in a single domain (node pool), but it can coexist with others.

If we are to go with an explicit enum, then it would probably be: Colocated|ColocatedAndExclusive

@ahg-g
Contributor Author

ahg-g commented May 5, 2023

When there is a need for more advanced anti-affinity rules, maybe that can be left to classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in #27)

I don't think it is doable in a reliable and non-surprising way, because we need to have a selector term that prevents the job from excluding itself. The API in the main post makes that selector term explicit, attached to the specified topology and the namespace where this applies.

@ahg-g
Contributor Author

ahg-g commented May 7, 2023

Btw. consider that there can be a hierarchy of more and more colocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most colocated I can get, but at most X"? See max-distance in GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.

We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.

But, it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict and the inner ones preferred:

colocation:
- topologyKey: rack
  mode: BestEffort
- topologyKey: zone
  mode: Strict

If we want to support more levels, then the API should allow setting pod-affinity weights (with higher weights for the innermost ones, as you pointed out offline). At this point, the colocation API drifts closer to pod-affinity, so its value becomes less obvious compared to just using pod-affinity directly :)
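
For illustration, the pod affinity that an implementation could generate for the rack/zone example above might look roughly like this (job-name/my-job are placeholders, and the weight is arbitrary):

affinity:
  podAffinity:
    # zone is Strict, so it becomes a required term
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: zone
      labelSelector:
        matchExpressions:
        - key: job-name
          operator: In
          values:
          - my-job
    # rack is BestEffort, so it becomes a weighted preferred term
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100   # innermost topology gets the highest weight
      podAffinityTerm:
        topologyKey: rack
        labelSelector:
          matchExpressions:
          - key: job-name
            operator: In
            values:
            - my-job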

@alculquicondor

so its value becomes less obvious compared to just using pod-affinity directly :)

Yes, but at that point it's better to enhance the pod spec to support what we need. In the meantime, JobSet can support the minimal requirement: probably just Exclusive and Strict.

@qelo

qelo commented May 8, 2023

it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict and the inner ones preferred:

Exactly. Btw. since BestEffort/Strict is a choice only for the outermost topology (and the inner ones are always BestEffort), mode could be outside the list:

colocation:
  topologyKey: [rack, zone]
  mode: Strict

UC1: Exclusive 1:1

I'm not sure how the list should support exclusive, though. I guess exclusive makes the most sense when there is only one topology in the list.

Btw. why would jobs want exclusive actually? Because of some noisy neighbors? I suspect what we really want is an all-or-nothing mechanism to avoid deadlocks and exclusive is just a workaround for that? Like "I would be fine to share the rack with someone if we both fit, but I want to avoid deadlocks when two of us fit partially -- so I will keep the rack exclusive as a workaround".

At this point, the colocation API drifts closer to pod-affinity, so its value becomes less obvious compared to just using pod-affinity directly

I didn't follow this. The value would be a much simpler API for the user. The implementation would set pod affinities and weights, but the user would only deal with the 'colocation' API, not with pod affinities, right?

That said, it also makes a lot of sense if you decide to focus on something simpler in the beginning.

@vsoch
Contributor

vsoch commented May 9, 2023

Btw. why would jobs want exclusive actually? Because of some noisy neighbors?

Does this mean one pod per machine? Or exclusive use of some machine? There are many use cases in our workflows for which we would absolutely not want to share a machine, as it would impact performance.

@danielvegamyhre
Contributor

This was addressed in #309 and #342 and included in release v0.3.0 🥳

@danielvegamyhre
Contributor

Actually, we didn't change the API (it's still an annotation), just its implementation, to improve scheduling throughput for large-scale training. Feel free to re-open this if you want to explore it further.

@ahg-g ahg-g reopened this Jan 4, 2024
@ahg-g
Contributor Author

ahg-g commented Jan 4, 2024

This issue is about having a proper API, not the optimizations.

@danielvegamyhre danielvegamyhre added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 16, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2024
@danielvegamyhre
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2024