More general scheduling constraints #367

Closed
thockin opened this Issue Jul 7, 2014 · 62 comments

Comments

Owner

thockin commented Jul 7, 2014

There have been a few folks who have asked about machine constraints for scheduling. Let's use this issue as a place to gather ideas and requirements.

@timothysc

verdverm commented Jul 7, 2014

I have noticed the FirstFit (default?) scheduler co-locates pods even when there are open machines available. Each of these machines has a single CPU.

It would be nice to use information about available CPU and a pod's expected CPU requirements.

sed 's/cpu/other_machine_stat/'
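
To make the idea concrete, here is a minimal sketch of such a CPU-fit check, using hypothetical Node/Pod types and millicore fields rather than any actual scheduler API:

```go
package main

import "fmt"

// Hypothetical types; the real scheduler tracks resources differently.
type Node struct {
	Name        string
	CPUCapacity int // millicores
	CPUInUse    int // millicores already requested by scheduled pods
}

type Pod struct {
	Name       string
	CPURequest int // millicores
}

// fitsCPU reports whether the pod's requested CPU fits in the node's spare capacity.
func fitsCPU(pod Pod, node Node) bool {
	return node.CPUCapacity-node.CPUInUse >= pod.CPURequest
}

func main() {
	nodes := []Node{
		{Name: "minion-1", CPUCapacity: 1000, CPUInUse: 900},
		{Name: "minion-2", CPUCapacity: 1000, CPUInUse: 100},
	}
	pod := Pod{Name: "web", CPURequest: 500}
	for _, n := range nodes {
		fmt.Printf("%s fits on %s: %v\n", pod.Name, n.Name, fitsCPU(pod, n))
	}
}
```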

Contributor

monnand commented Jul 7, 2014

Currently, the kubelet can get stats from cAdvisor that would be useful for the scheduler. cAdvisor can provide different percentiles of CPU and memory usage for a container (including the root container, i.e., the machine).

Owner

thockin commented Jul 7, 2014

That's just "scheduling", as opposed to machine constraints, though very
coarsely they feel similar :)


Member

timothysc commented Jul 7, 2014

Labels "seems" like the ideal place to enable a rank & requirements to define constraints. However labels would need to be regularly published by minions.

e.g.
rank = memory
requirements = gpu & clusterXYZ

I have a couple of concerns here:

  1. This treads into the full-scale scheduling world.
  2. Config syntax = ?, DSL? ...
Owner

thockin commented Jul 7, 2014

Let's worry about semantics before syntax. We have a similar issue open
for label selectors in general - we can discuss syntax there.


Member

timothysc commented Jul 7, 2014

FWIW I often view constraints as a SQL query on an NVP store.

SELECT Resources
FROM Pool
WHERE Requirements
ORDER BY Rank
...

The hardest part is the 'fields' in an NVP store.

Owner

bgrant0607 commented Jul 7, 2014

Scheduling based on resources and scheduling based on constraints are two significantly different issues.

We have several issues open about resource (and QoS) awareness: #147 , #160 , #168 , #274 , #317.

Constraint syntax/semantics: We should start with the proposed label selector mechanism, #341 .

Member

timothysc commented Jul 9, 2014

I'm ok with doing the selection from a set of offers/resources from the scheduler.

Provided the offers have enough NVP information to enable discrimination.

Owner

thockin commented Jul 9, 2014

I don't know about NVP - where can I read more on it?

Owner

bgrant0607 commented Jul 9, 2014

Searching for "NVP SQL" or "name value pair SQL" or "key value pair SQL" comes up with lots of hits. Common arguments against are performance and loss of control over DB schema. But I'm getting the feeling that we're barking up the wrong forest.

@timothysc What are you trying to do? Right now, k8s has essentially no intelligent scheduling. However, that's not a desirable end state. If what you want is a scheduler, we should figure out how to support scheduling plugins and/or layers on top of k8s.

Owner

thockin commented Jul 9, 2014

Name Value Pairs? Now I feel dumb :)


Owner

bgrant0607 commented Jul 9, 2014

Something somewhat different from label selectors is per-attribute limits for spreading. Aurora is one system that supports this model:
https://aurora.incubator.apache.org/documentation/latest/configuration-reference/#specifying-scheduling-constraints

This is more relevant to physical rather than virtual deployments. I'd consider it a distinct mechanism from constraints. @timothysc If you'd like this, we should file a separate issue. However, I'd prefer a new failure-tolerance scheduling policy object that specifies a label selector to identify the set of instances to be spread. We could debate how to describe what kind and/or how much spreading to apply, but I'd initially just leave it entirely up to the infrastructure.
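
To illustrate the per-attribute limit idea, here is a generic sketch (not Aurora's syntax and not an existing k8s mechanism; the helper and its signature are made up for illustration):

```go
package main

import "fmt"

// withinLimit reports whether placing another replica on candidate would keep
// the number of replicas sharing candidate's value of attribute below limit.
// placed holds the label sets of nodes already hosting replicas of this task.
func withinLimit(attribute string, limit int, candidate map[string]string, placed []map[string]string) bool {
	count := 0
	for _, labels := range placed {
		if labels[attribute] == candidate[attribute] {
			count++
		}
	}
	return count < limit
}

func main() {
	placed := []map[string]string{
		{"rack": "r1"},
		{"rack": "r2"},
	}
	candidate := map[string]string{"rack": "r1"}
	// With a limit of 1 per rack, a second replica on rack r1 is rejected.
	fmt.Println("allowed:", withinLimit("rack", 1, candidate, placed)) // false
}
```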

Member

timothysc commented Jul 9, 2014

I completely agree it's more relevant to physical rather than virtual deployments.

I was somewhat testing the possibility of enabling capabilities for more general-purpose scheduling, on par with a mini-Condor approach, but it's not a requirement.

Aurora- or Marathon-esque capabilities would fill the gap.
https://github.com/mesosphere/marathon/wiki/Constraints

brendandburns added this to the 0.7 milestone Sep 24, 2014

bgrant0607 modified the milestone: v0.8, v0.7 Sep 26, 2014

davidopp was assigned by bgrant0607 Oct 4, 2014

Owner

bgrant0607 commented Oct 17, 2014

Note that in order to add constraints, we'd need a way to attach labels to minions/nodes.

Member

timothysc commented Oct 20, 2014

That is what I had alluded to earlier, but it received lukewarm attention. In fact, I believe Wilkes had chimed in on a different thread regarding this topic.

Contributor

brendandburns commented Oct 20, 2014

I think we should have labels for worker nodes, but they need to be dynamic, and that's tough without a re-scheduler.

For now, I think we should use resources on nodes, since they are already there, and they are known to be static.

You can add resource requests to pods to achieve appropriate scheduling.

Brendan

Owner

erictune commented Oct 20, 2014

Replication controllers reschedule pods when the machines they are on are no longer available. It seems like the replication controller could do the same if the machine becomes infeasible for scheduling. A fairly simple background loop in the scheduler could recheck predicates and move pods to the terminated state if they no longer fit.
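
A minimal sketch of what such a background recheck loop could look like, with hypothetical helpers (boundPods, stillFits, markTerminated) standing in for the real pod store and predicate set:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the scheduler's store and predicates.
type Pod struct{ Name, Node string }

func boundPods() []Pod     { return []Pod{{Name: "db-1", Node: "minion-3"}} }
func stillFits(p Pod) bool { return false } // re-run the scheduling predicates for p against p.Node
func markTerminated(p Pod) { fmt.Println("terminating", p.Name) }

// recheckLoop periodically re-evaluates predicates for bound pods and moves
// pods that no longer fit to the terminated state, so a replication
// controller can recreate them elsewhere.
func recheckLoop(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			for _, p := range boundPods() {
				if !stillFits(p) {
					markTerminated(p)
				}
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go recheckLoop(time.Second, stop)
	time.Sleep(3 * time.Second)
	close(stop)
}
```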

Questions:

  1. If a running pod is updated so that its requirements no longer match the machine it is bound to, what happens?
    • pod moves to terminated state
    • refuse update
    • both, and let users control the behavior (yay)
Owner

bgrant0607 commented Oct 20, 2014

The minion/node controller (#1366) should be responsible for killing pods with mismatched label selectors, and then, yes, replication controllers would recreate them.

Re. @erictune's question: Yes, we could support both, for instance, using a URL parameter to select the desired behavior.

Contributor

brendandburns commented Oct 20, 2014

Yeah, having the kubelet kill pods that don't match makes the most sense.

--brendan


bgrant0607 changed the title from Add scheduling constraints to Consider more general scheduling constraints Oct 27, 2014

bgrant0607 removed this from the v0.8 milestone Oct 27, 2014

davidopp was unassigned by bgrant0607 Oct 27, 2014

Contributor

raphael commented Nov 19, 2014

There is a very narrow use case of the generic scheduling constraint problem that would seem a lot easier to support and would provide a lot of value: make it possible to express that instances of a given pod or replica template may not live on the same minion. There is no need to identify minions to express this and it would alleviate a whole class of concerns related to building HA deployments. The concrete use case I'm looking at now is how to deploy a mongodb cluster where each node lives on a different minion.

Contributor

jbeda commented Nov 19, 2014

Hi @raphael!

This is actually trickier than it seems. The set of pods that you want to avoid co-scheduling may be different from the set of pods that are handled by a replica controller. We built quite a bit of flexibility here because we wanted you to be able to, say, have one replica controller for a set of canaries and another replica controller for your production pods. You'd have all of these sit behind a load balancer. You'd also probably want to enforce no co-location against the union of those pods.

I could see 2 ways of doing this:

  • Have a new scheduler 'object' that is a generic "scheduling constraint". You could then say "don't co-schedule any pods that match this label selector". Or perhaps: "Don't let more than 3 pods that match this label selector run across the set of nodes with the same value for the 'rack' label".
  • Have "avoid co-schedule" labels on each pod. That way the pod would say "don't schedule me on any node that has pods that match this label selector".

My personal opinion is that the first is probably more true to the intent of the user, but the second is easier to get going and to manage, as it hangs off of existing objects. I'm not sure how we do the "minimize the rack risk exposure" part with it, though.
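
A rough sketch of the second option, assuming a hypothetical AvoidSelector field on the pod and plain equality-based label matching (none of this is existing API):

```go
package main

import "fmt"

// Hypothetical pod representation for this sketch.
type Pod struct {
	Name          string
	Labels        map[string]string
	AvoidSelector map[string]string // "don't schedule me with pods matching these labels"
}

// matches reports whether labels satisfy an equality-based selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// nodeAllowed rejects a node if any pod already running there matches the
// candidate pod's avoid selector.
func nodeAllowed(candidate Pod, podsOnNode []Pod) bool {
	if len(candidate.AvoidSelector) == 0 {
		return true // no anti-affinity constraint
	}
	for _, p := range podsOnNode {
		if matches(candidate.AvoidSelector, p.Labels) {
			return false
		}
	}
	return true
}

func main() {
	running := []Pod{{Name: "mongo-1", Labels: map[string]string{"app": "mongo"}}}
	candidate := Pod{Name: "mongo-2", AvoidSelector: map[string]string{"app": "mongo"}}
	fmt.Println("allowed:", nodeAllowed(candidate, running)) // false: a mongo pod is already here
}
```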

Contributor

brendandburns commented Nov 19, 2014

Yeah, I think we should probably do this. In general, spreading should take care of this kind of thing, but explicit exclusivity restrictions probably also make sense.

Brendan

Contributor

brendandburns commented Nov 19, 2014

I vote for the second approach, as it fits neatly with existing concepts.

Brendan

Contributor

raphael commented Nov 19, 2014

I actually wasn't thinking that you would express that no pod from the replica controller should be co-scheduled, but rather that no pod from one given replica controller should be scheduled with pods from another. So it's like your 2nd example extended to replica controllers "as units". I guess your example of "don't let more than 3 pods that match this label selector be co-scheduled" could be achieved by creating identical replica controllers of 3 pods each. That may not be as bad as it first sounds if creation of replica controllers is automated?

Owner

bgrant0607 commented Nov 19, 2014

Re. spreading, see #1965.

Owner

bgrant0607 commented Nov 19, 2014

@jbeda @brendandburns For the "avoid co-schedule" approach, a label selector would be needed to specify the labels that would match.

The other approach discussed (no more than 3 pods across nodes with the same rack label value) was called "per-attribute limits" above. It is supported by Aurora, for example.

But for this case, I honestly don't understand why simple spreading of all pods matching a label selector (e.g., pods belonging to a service) wouldn't be good enough.

Owner

bgrant0607 commented Nov 19, 2014

And, yes, no scheduling constraints or other functionality should be tied to replication controllers.

Contributor

raphael commented Nov 20, 2014

@bgrant0607 simple spreading of all pods matching a label selector would certainly work for this. I did not realize that this was the current behavior, thanks for the pointer.

Contributor

brendandburns commented Nov 20, 2014

It's not the behavior by default. There is a spreading function implemented that you can activate.

It does it by pod label collisions, not service selector collisions, so it's slightly less optimal.

We haven't done the work to combine priority functions, so you would have
to turn off the least loaded priority function. (Or do the work to combine
them ;)

Brendan

Contributor

raphael commented Nov 20, 2014

Oh I see NewSpreadingScheduler isn't called from anywhere, damn :) So there really isn't a way to do what I want today? (short of deploying my own version of Kubernetes - I was looking at using Google Container Engine)

Owner

bgrant0607 commented Nov 20, 2014

@mikedanese was working on implementing spreading for services, but was rethinking the approach.

@abhgupta is interested in making the scheduling policy configurable -- see #2313.

However, if you want to take a stab at either of these, feel free. Or, you could simply write a prioritization function that combines LeastRequestedPriority and CalculateSpreadPriority, which we could make the default pending the general solution to #2313.
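
As a sketch of what a combined prioritizer could look like: the toy scores below stand in for LeastRequestedPriority and CalculateSpreadPriority, and the signature and weighting are simplified assumptions, not the scheduler's actual plugin interface:

```go
package main

import "fmt"

// PriorityFunc scores a candidate node; higher is better. The real priority
// functions also take the pod and cluster state into account.
type PriorityFunc func(node string) float64

type weighted struct {
	weight float64
	fn     PriorityFunc
}

// combine folds several priority functions into one via a weighted average.
func combine(fns []weighted) PriorityFunc {
	return func(node string) float64 {
		var total, weightSum float64
		for _, wf := range fns {
			total += wf.weight * wf.fn(node)
			weightSum += wf.weight
		}
		if weightSum == 0 {
			return 0
		}
		return total / weightSum
	}
}

func main() {
	// Toy scores standing in for LeastRequestedPriority and CalculateSpreadPriority.
	leastRequested := func(node string) float64 {
		return map[string]float64{"minion-1": 0.2, "minion-2": 0.8}[node]
	}
	spreading := func(node string) float64 {
		return map[string]float64{"minion-1": 0.9, "minion-2": 0.1}[node]
	}

	score := combine([]weighted{{1.0, leastRequested}, {2.0, spreading}})
	for _, n := range []string{"minion-1", "minion-2"} {
		fmt.Printf("%s: %.2f\n", n, score(n))
	}
}
```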

Contributor

brendandburns commented Nov 20, 2014

Yeah, the trick is we don't have code to combine priority functions, and I made the judgement call that least loaded was preferable to spreading.

For a cluster you deploy, you can modify the code as Brian suggests. Or we could come up with some sort of normalizing + weighted averaging of priority functions (or something else).

--brendan

On Wed, Nov 19, 2014 at 5:09 PM, bgrant0607 wrote:

Or, if you just want to use the spreading scheduler as is, modify the code here:
https://github.com/GoogleCloudPlatform/kubernetes/blob/1c524607d898487e2ea8de46a5a79c0d520cd1da/plugin/pkg/scheduler/factory/factory.go#L69

Owner

bgrant0607 commented Nov 20, 2014

Let's move discussion of combining priority functions to #2313.

Contributor

abhgupta commented Nov 20, 2014

I had been distracted by other work over the last couple of days but have been able to make progress on refactoring the scheduler to make it configurable. I should be able to share my changes for early feedback tomorrow.

In addition, I am now working on allowing multiple priority functions to be specified; their scores will be combined (feedback requested on simple normalizing) for each minion before presenting the prioritized list of minions to the selectHost function for picking one. I hope to have that ready to share tomorrow as well.

Member

mikedanese commented Nov 20, 2014

But since configuring the scheduler requires a custom build of the scheduler, any improvements we make to the non-default scheduling algorithms and predicates are inaccessible. Furthermore, none of the Kubernetes daemons has a model of configuration beyond command-line flags, so this is not an easy thing to tackle. Maybe the crew will address file-driven configuration sometime soon.

Owner

bgrant0607 commented Nov 20, 2014

@mikedanese Currently that's the case, but, indeed, I would like it to be driven by configuration.

Configuration of k8s components is being discussed on #1627.

Contributor

abhgupta commented Nov 20, 2014

@mikedanese In the first pass, the only thing that will need to be modified/rebuilt is the scheduler executable. In the second pass (shortly thereafter), we can look at specifying the predicates/prioritizers and their simple/string arguments, if any, via configuration files or command-line options.

Contributor

abhgupta commented Nov 20, 2014

Didn't see the response from @bgrant0607 before posting my comment above. I would definitely want to piggyback onto a common configuration mechanism.

Contributor

mindscratch commented Dec 2, 2014

following along.

Member

pires commented Jan 30, 2015

+1 for @jbeda's suggestion:

Have "avoid co-schedule" labels on each pod. That way the pod would say "don't schedule me on any node that has pods that match this label selector".

Also, the SQL-like syntax for defining this would be an awesome-to-have, but not a priority, I reckon.

Contributor

abhgupta commented Feb 23, 2015

@pires @jbeda @brendandburns @bgrant0607 Am a bit late to this discussion, but in case this issue is still relevant, here is my take. The "avoid co-schedule" functionality can be implemented in a more generic manner, while also being closer to what the user actually intended.

The user could want to avoid co-scheduling either on a per-minion basis or for a rack/zone/etc. GoogleCloudPlatform#2906 implemented priority functions for the scheduler that spread pods across, for lack of a better word, "node groups" that are defined based on node labels. We should be able to specify the "avoid co-scheduling" requirements in the pod spec to target individual minions or racks/zones/etc. (identified by node labels). The simple service spreading priority function as well as the service anti-affinity priority function could be modified to respect this attribute.

If a proper fit is not available, the pod would fail to be scheduled, and this is an area where we don't currently have visibility (errors being surfaced to the user properly). I would suggest going for these simple enhancements once we have a mechanism to bubble up scheduling constraint violations to the user/admin.

bgrant0607 changed the title from Consider more general scheduling constraints to More general scheduling constraints Feb 28, 2015

Owner

bgrant0607 commented May 14, 2015

See also anti-affinity discussion here: GoogleCloudPlatform#4301 (comment)

Member

timothysc commented May 14, 2015

I've been looking to evaluate https://github.com/hashicorp/hcl and https://www.terraform.io/docs/configuration/syntax.html as an extensible config language for constraints, but it also has other applications: #7739

Owner

bgrant0607 commented May 14, 2015

Thanks for the pointer, but this is getting off-topic. You could use #1743 or file a new issue to discuss higher-level configuration mechanisms, such as parameterization and object cross references. You could also browse issues relating to the broader topic of configuration:
https://github.com/GoogleCloudPlatform/kubernetes/labels/area%2Fapp-config-deployment

Owner

erictune commented May 18, 2015

Docker Swarm supports:

  • Filters which limit the set of nodes that a container can be scheduled on.
  • Scheduling Strategies which affect how various eligible nodes are ranked.

Filters include:

  • Constraint, where a container label has to match a node label.
  • Affinity, where a container needs to be near another container (expressed via labels).
  • Port, which is largely not relevant in K8s due to IP-per-pod.
  • Dependency, which is when two containers need to be near the same volume. Pods capture this case for Kubernetes.
  • Health, which keeps containers off unhealthy nodes.

Scheduling Strategies include one which does spreading.

Member

timothysc commented May 18, 2015

/cc @willb @erikerlandson - re background.

Owner

bgrant0607 commented May 18, 2015

We have constraints via nodeSelector.

Affinity, Port, and Dependency are covered by pods.

Health should be automatic.

Strategies are currently baked into the scheduler, which is configurable, thus preserving what/how separation. We've discussed making it possible to disable default scheduling, which would enable users to run their own schedulers.

Owner

davidopp commented May 26, 2015

We have constraints via nodeSelector.

Swarm has regular and "soft" constraints. We don't have the latter yet, but I filed #8008 for us to implement it.

davidopp referenced this issue Jun 11, 2015

Merged

Upstream Kubernetes-Mesos framework #8882

Member

timothysc commented Jun 17, 2015

After giving this a fair amount of thought... for service orchestration you could likely boil most options down into two categories (affinity and anti-affinity). We could get far more fancy (HTC/HPC config langs), but at some point I wonder: "Everything Should Be Made as Simple as Possible..."

Here are my distilled thoughts:

function (label or annotation)
SPREADBY - typical use case - SLA uptime
GROUPBY - typical use case - SLA performance

operators:
&& ||

Constraint: SPREADBY (rack) && GROUPBY (cluster)

/cc @eparis @rrati @jayunit100

Owner

eparis commented Jun 17, 2015

I'd think a ! operator would be in order

Owner

davidopp commented Jun 17, 2015

There are "soft constraints" (scheduling goals) and "hard constraints." I think you're talking about the first?

What are the use case and semantics of the || operator? (For connecting SPREADBY and GROUPBY clauses, I assume.)

Same question for !

Member

timothysc commented Jun 18, 2015

@davidopp

A constraint evals to a simple true or false expression, so it should really be && and ||; this way you can connect them, including simple predicate matching.

GROUPBY (cluster) || rack==funkytown

The more I think about it, the less I want to tread into config-language space, given the cattle idiom on services. Whether they are soft or hard could be denoted via a keyword or some other semantics.

Owner

davidopp commented Jun 18, 2015

So IIUC, the thing you're proposing would work like this?

  • GROUPBY expr1 || expr2 => put a virtual label X on all the machines that match expr1 or expr2, and then try to co-locate all the pods of the service on machines with label X
  • GROUPBY expr1 && expr2 => put a virtual label X on all the machines that match expr1 and expr2, and then try to co-locate all the pods of the service on machines with label X
  • SPREADBY expr1 || expr2 => put a virtual label X on all the machines that match expr1 or expr2, and then try to spread all the pods of the service across machines with label X
  • SPREADBY expr1 && expr2 => put a virtual label X on all the machines that match expr1 and expr2, and then try to spread all the pods of the service across machines with label X

It would be good to flesh out some use cases...

Owner

bgrant0607 commented Jun 20, 2015

I agree that flavors of affinity and anti-affinity are the basic 2 features that would satisfy most use cases.

With respect to As Simple As Possible, specifying just whether to group or spread seems like the simplest possible API. That needs to be associated with some set of pods, via a label selector (in which object, TBD). Node groups to concentrate in or spread across could be configured in the scheduler in most cases.
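
For concreteness only, a hypothetical shape for such a policy object, sketched as a Go struct; the type name, fields, and modes are illustrative assumptions, not a proposed API:

```go
package main

import "fmt"

// PlacementMode says whether the matched pods should be packed together or spread apart.
type PlacementMode string

const (
	Group  PlacementMode = "group"
	Spread PlacementMode = "spread"
)

// SchedulingPolicy is a hypothetical standalone object: a label selector
// identifying the set of pods it applies to, plus the desired placement mode.
// How far to spread (node, rack, zone) is left to the infrastructure.
type SchedulingPolicy struct {
	Selector map[string]string
	Mode     PlacementMode
}

func main() {
	policy := SchedulingPolicy{
		Selector: map[string]string{"service": "mongo"},
		Mode:     Spread,
	}
	fmt.Printf("%+v\n", policy)
}
```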

jcderr commented Sep 17, 2015

+1

I deploy some fairly hefty celery tasks in our cluster, and I definitely do not ever want more than one running on the same host at the same time. I'd rather some be left unscheduled and have a monitoring task pick them up by scaling my cluster up.

Owner

davidopp commented Dec 7, 2015

This is part of #18261

Member

timothysc commented Dec 8, 2015

@davidopp I think it's reasonable to close this issue in favor of the associated proposal.

Owner

davidopp commented Dec 14, 2015

@timothysc Let's wait until we merge the proposal.

Owner

bgrant0607 commented May 17, 2016

Affinity/anti-affinity proposals merged and implementations are underway.

bgrant0607 closed this May 17, 2016
