More general scheduling constraints #367

Closed
thockin opened this Issue Jul 7, 2014 · 62 comments

Comments

@thockin
Member

thockin commented Jul 7, 2014

There have been a few folks who have asked about machine constraints for scheduling. Let's use this issue as a place to gather ideas and requirements.

@timothysc

@verdverm

verdverm commented Jul 7, 2014

I have noticed the FirstFit (default?) scheduler co-locates pods even when there are open machines available. Each of these machines has a single cpu.

It would be nice to use information about available cpu and a pod's expected cpu requirements.

sed 's/cpu/other_machine_stat/'

@monnand

Contributor

monnand commented Jul 7, 2014

Currently, the kubelet can get stats from cAdvisor that would be useful for the scheduler. It can provide different percentiles of CPU and memory usage for a container (including the root container, i.e. the machine).

@thockin

Member

thockin commented Jul 7, 2014

That's just "scheduling", as opposed to machine constraints, though very coarsely they feel similar :)


@timothysc

Member

timothysc commented Jul 7, 2014

Labels "seems" like the ideal place to enable a rank & requirements to define constraints. However labels would need to be regularly published by minions.

e.g.
rank = memory
requirements = gpu & clusterXYZ

I have a couple of concerns here:

  1. This treads into the full scale scheduling world.
  2. Config syntax = ?, DSL? ...
@thockin

Member

thockin commented Jul 7, 2014

Let's worry about semantics before syntax. We have a similar issue open for label selectors in general - we can discuss syntax there.


@timothysc

Member

timothysc commented Jul 7, 2014

FWIW I often view constraints as a SQL query on an NVP store.

SELECT Resources
FROM Pool
WHERE Requirements
ORDER BY Rank
...

The hardest part is the 'fields' in an NVP store.

@bgrant0607

Member

bgrant0607 commented Jul 7, 2014

Scheduling based on resources and scheduling based on constraints are two significantly different issues.

We have several issues open about resource (and QoS) awareness: #147 , #160 , #168 , #274 , #317.

Constraint syntax/semantics: We should start with the proposed label selector mechanism, #341 .

@timothysc

Member

timothysc commented Jul 9, 2014

I'm ok with doing the selection from a set of offers/resources from the scheduler, provided the offers have enough NVP information to enable discrimination.

@thockin

Member

thockin commented Jul 9, 2014

I don't know about NVP - where can I read more on it?

@bgrant0607

Member

bgrant0607 commented Jul 9, 2014

Searching for "NVP SQL" or "name value pair SQL" or "key value pair SQL" comes up with lots of hits. Common arguments against are performance and loss of control over DB schema. But I'm getting the feeling that we're barking up the wrong forest.

@timothysc What are you trying to do? Right now, k8s has essentially no intelligent scheduling. However, that's not a desirable end state. If what you want is a scheduler, we should figure out how to support scheduling plugins and/or layers on top of k8s.

@thockin

Member

thockin commented Jul 9, 2014

Name Value Pairs? Now I feel dumb :)


@bgrant0607

Member

bgrant0607 commented Jul 9, 2014

Something somewhat different from label selectors is per-attribute limits for spreading. Aurora is one system that supports this model:
https://aurora.incubator.apache.org/documentation/latest/configuration-reference/#specifying-scheduling-constraints

This is more relevant to physical than to virtual deployments. I'd consider it a mechanism distinct from constraints. @timothysc If you'd like this, we should file a separate issue. However, I'd prefer a new failure-tolerance scheduling policy object that specifies a label selector to identify the set of instances to be spread. We could debate how to describe what kind and/or how much spreading to apply, but I'd initially just leave it entirely up to the infrastructure.

@timothysc

Member

timothysc commented Jul 9, 2014

I completely agree it's more relevant to physical than virtual deployments.

I was somewhat testing the possibility of enabling the capabilities for more general-purpose scheduling, on par with a mini-Condor approach, but it's not a requirement.

Aurora- or Marathon-esque capabilities will fill the gap.
https://github.com/mesosphere/marathon/wiki/Constraints

@bgrant0607

Member

bgrant0607 commented Oct 17, 2014

Note that in order to add constraints, we'd need a way to attach labels to minions/nodes.
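For illustration, node labels plus a pod-level selector are roughly what this became. A minimal sketch using the later, stable API (the nodeSelector field and the kubectl label nodes command postdate this comment, and the gpu label key is a made-up example): nodes get labels via kubectl label nodes node-1 gpu=true, and pods select them like so:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    gpu: "true"   # only schedulable onto nodes carrying this label
  containers:
  - name: app
    image: nginx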

@timothysc

Member

timothysc commented Oct 20, 2014

That is what I had alluded to earlier, but it received lukewarm attention. In fact, I believe Wilkes had chimed in on a different thread regarding this topic.

@brendandburns

Contributor

brendandburns commented Oct 20, 2014

I think we should have labels for worker nodes, but they need to be dynamic, and that's tough without a re-scheduler.

For now, I think we should use resources on nodes, since they are already there, and they are known to be static.

You can add resource requests to pods to achieve appropriate scheduling.

Brendan
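For concreteness, a minimal sketch of a pod resource request in the later v1 API (the resources.requests fields postdate this comment; the names and values here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: hefty-pod
spec:
  containers:
  - name: worker
    image: busybox
    resources:
      requests:
        cpu: "500m"    # scheduler only binds the pod where 0.5 CPU is unclaimed
        memory: 256Mi  # likewise for memory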

@erictune

Member

erictune commented Oct 20, 2014

Replication controllers reschedule pods when the machines they are on are no longer available. It seems like the replication controller could do the same if the machine becomes infeasible for scheduling. A fairly simple loop could recheck predicates as a background task in the scheduler and move pods to the terminated state if they no longer fit.

Questions:

  1. If a running pod is updated so that its requirements no longer match the machine it is bound to, what happens?
    • pod moves to terminated state
    • refuse update
    • both, and let users be able to control the behavior (yay)
@bgrant0607

Member

bgrant0607 commented Oct 20, 2014

The minion/node controller (#1366) should be responsible for killing pods with mismatched label selectors, and then, yes, replication controllers would recreate them.

Re. @erictune's question: Yes, we could support both, for instance, using a URL parameter to select the desired behavior.

@brendandburns

Contributor

brendandburns commented Oct 20, 2014

Yeah, having the kubelet kill pods that don't match makes the most sense.

--brendan


@timothysc

Member

timothysc commented Jun 17, 2015

After giving this a fair amount of thought... for service orchestration you could likely boil most options down into two categories (affinity and anti-affinity). We could get far more fancy (HTC/HPC config languages), but at some point I wonder: "Everything Should Be Made as Simple as Possible..."

Here are my distilled thoughts:

function (label or annotation)
SPREADBY - typical use case - SLA uptime
GROUPBY - typical use case - SLA performance

operators:
&& ||

Constraint: SPREADBY (rack) && GROUPBY (cluster)

/cc @eparis @rrati @jayunit100
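For comparison with where this landed: SPREADBY (rack) && GROUPBY (cluster) maps roughly onto the pod affinity/anti-affinity API Kubernetes later adopted. A hedged sketch using the eventual field names, assuming rack and cluster are node label keys and the service's pods carry an app: my-service label:

affinity:
  podAntiAffinity:        # SPREADBY (rack): prefer spreading replicas across racks
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-service
        topologyKey: rack
  podAffinity:            # GROUPBY (cluster): prefer packing replicas into one cluster
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-service
        topologyKey: cluster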

@eparis

Member

eparis commented Jun 17, 2015

I'd think a ! operator would be in order.

@davidopp

Member

davidopp commented Jun 17, 2015

There are "soft constraints" (scheduling goals) and "hard constraints." I think you're talking about the first?

What is the use case and semantics of the || operator? (For connecting SPREADBY and GROUPBY clauses, I assume)

Same question for !

@timothysc

Member

timothysc commented Jun 18, 2015

@davidopp

Constraints eval to simple true or false expressions, so it should really be && and ||; this way you can connect them, including simple predicate matching.

GROUPBY (cluster) || rack==funkytown

The more I think about it, the less I want to tread into config-language space, given the cattle idiom for services. Whether they are soft or hard could be denoted via a keyword or some other semantics.

@davidopp

Member

davidopp commented Jun 18, 2015

So IIUC, the thing you're proposing would work like this?

GROUPBY expr1 || expr2 => put a virtual label X on all the machines that match expr1 or expr2, and then try to co-locate all the pods of the service on machines with label X
GROUPBY expr1 && expr2 => put a virtual label X on all the machines that match expr1 and expr2, and then try to co-locate all the pods of the service on machines with label X
SPREADBY expr1 || expr2 => put a virtual label X on all the machines that match expr1 or expr2, and then try to spread all the pods of the service across machines with label X
SPREADBY expr1 && expr2 => put a virtual label X on all the machines that match expr1 and expr2, and then try to spread all the pods of the service across machines with label X

It would be good to flesh out some use cases...

@bgrant0607

Member

bgrant0607 commented Jun 20, 2015

I agree that flavors of affinity and anti-affinity are the two basic features that would satisfy most use cases.

With respect to As Simple As Possible, specifying just whether to group or spread seems like the simplest possible API. That needs to be associated with some set of pods via a label selector (in which object, TBD). Node groups to concentrate in or spread across could be configured in the scheduler in most cases.

@jcderr

jcderr commented Sep 17, 2015

+1

I deploy some fairly hefty celery tasks in our cluster, and definitely do not ever want more than one running on the same host at the same time. I'd rather some get left unscheduled and run a monitoring task that picks them up by scaling my cluster up.
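This is the canonical hard anti-affinity case. In the API that eventually shipped, it is a required pod anti-affinity term on the hostname topology; a sketch, assuming the pods are labeled app: celery-worker:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:  # hard: leave pods pending rather than co-locate
    - labelSelector:
        matchLabels:
          app: celery-worker
      topologyKey: kubernetes.io/hostname            # at most one such pod per node

Pods that cannot satisfy the term stay unscheduled, which matches the "leave some pending and scale up" workflow described above.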

@davidopp

Member

davidopp commented Dec 7, 2015

This is part of #18261

@timothysc

Member

timothysc commented Dec 8, 2015

@davidopp I think it's reasonable to close this issue in favor of the assorted proposals.

@davidopp

Member

davidopp commented Dec 14, 2015

@timothysc Let's wait until we merge the proposal.

@bgrant0607

Member

bgrant0607 commented May 17, 2016

Affinity/anti-affinity proposals merged and implementations are underway.

@bgrant0607 bgrant0607 closed this May 17, 2016
