Container and pod resource limits #168

Closed
bgrant0607 opened this Issue Jun 19, 2014 · 49 comments

@bgrant0607
Member

bgrant0607 commented Jun 19, 2014

Before we implement QoS tiers (#147), we need to support basic resource limits for containers and pods. All resource values should be integers.

For inspiration, see lmctfy:
https://github.com/google/lmctfy/blob/master/include/lmctfy.proto

Arguably we should start with pods first, to at least provide isolation between pods. However, that would require the ability to start Docker containers within cgroups. The support we need for individual containers already exists.

We should allow both minimum and maximum resource values to be provided, as lmctfy does. But let's not reuse lmctfy's limit and max_limit terminology. I like "requested" (amount scheduler will use for placement) and "limit" (hard limit beyond which the pod/container is throttled or killed).

Even without limit enforcement, the scheduler could use resource information for placement decisions.
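
For concreteness, one possible shape for that split, sketched as a Go struct (names, field layout, and units are illustrative only, not a proposed API):

// Illustrative sketch only: "requested" drives placement, "limit" is the hard cap.
type ResourceRequirements struct {
	RequestedCPU    int `json:"requestedCpu,omitempty"`    // e.g. millicores; used by the scheduler
	RequestedMemory int `json:"requestedMemory,omitempty"` // e.g. bytes; used by the scheduler
	LimitCPU        int `json:"limitCpu,omitempty"`        // hard cap: throttled beyond this
	LimitMemory     int `json:"limitMemory,omitempty"`     // hard cap: killed beyond this
}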

@timothysc

Member

timothysc commented Jul 8, 2014

+1.

@bgrant0607

Member

bgrant0607 commented Jul 11, 2014

We have cpu and memory in the container manifest:
// Optional: Defaults to unlimited.
Memory int `yaml:"memory,omitempty" json:"memory,omitempty"`
// Optional: Defaults to unlimited.
CPU int `yaml:"cpu,omitempty" json:"cpu,omitempty"`

However, AFAICT, we don't do anything with them. Besides, I think we want something more similar to lmctfy's API (request, limit, qos for each resource).

Another consideration: We could make it fairly easy to add new resources. Kubelet needs to understand each individual resource's characteristics, for isolation, QoS, overcommitment, etc. OTOH, the scheduler could deal with resources entirely abstractly. It could get resources and their capacities from the machines. Similarly, we'd need to make it possible to request abstract resources in the container/pod manifest.

@thockin

Member

thockin commented Jul 11, 2014

What we described internally was that "common" resources like CPU, memory, disk, etc. are described as first-class things. Other resources are handled essentially as opaque counters. E.g. a node says "I have 5 resources with ID 12345", a client says "I need 2 resources with ID 12345". The scheduler maps them.
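
A toy sketch of that matching step, with made-up names and plain int64 quantities just to illustrate the idea:

package main

import "fmt"

// fits reports whether a request can be satisfied by a node's free capacity,
// comparing quantities per resource ID without knowing what each ID means.
func fits(request, free map[string]int64) bool {
	for id, want := range request {
		if free[id] < want {
			return false
		}
	}
	return true
}

func main() {
	free := map[string]int64{"cpu": 4000, "memory": 8 << 30, "12345": 5}
	request := map[string]int64{"12345": 2}
	fmt.Println(fits(request, free)) // true: the node offers 5 of opaque resource "12345"
}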


@erictune

Member

erictune commented Jul 15, 2014

Consider that the resource types and units used for pod/container requests could also be used for describing how to subdivide cluster resources (see #442). For example, if team A is limited to using 10GB RAM at the cluster level, then team A can run 10 pods x 1GB RAM; or 2 pods x 5GB per pod; or some combination, etc.
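
A toy illustration of that reuse, with the numbers from the example above (the admission logic here is hypothetical, not an existing API):

package main

import "fmt"

func main() {
	const teamQuotaBytes int64 = 10 << 30 // team A: 10GB of RAM at the cluster level

	podRequests := []int64{5 << 30, 5 << 30, 1 << 30} // two 5GB pods, then a 1GB pod
	var used int64
	for i, r := range podRequests {
		if used+r > teamQuotaBytes {
			fmt.Printf("pod %d rejected: would exceed team A's quota\n", i)
			continue
		}
		used += r
	}
	fmt.Printf("admitted %dGB of %dGB\n", used>>30, teamQuotaBytes>>30)
}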

@adam-mesos

adam-mesos commented Jul 30, 2014

+1 to all of this. Mesos has a very similar model, with the scheduler/allocator able to work with any custom resource, but the slave/containerizer needs to know enough details to map it to an isolator. This would also be the appropriate separation for requested resource vs. resource limits.

@bgrant0607 bgrant0607 added this to the v1.0 milestone Aug 27, 2014

@brendandburns brendandburns modified the milestones: 0.7, v1.0 Sep 24, 2014

@bgrant0607 bgrant0607 modified the milestones: v0.8, v0.7 Sep 26, 2014

@bgrant0607 bgrant0607 added the area/api label Oct 2, 2014

@bgrant0607

Member

bgrant0607 commented Oct 2, 2014

/cc @johnwilkes @davidopp @rjnagal @smarterclayton @brendandburns @thockin

The resource model doc has been created. We should align our API with it. v1beta3 leaves resource requests unchanged, though the ResourceList type was added in order to represent node capacity. We could either add the new fields in a backwards-compatible way, or replace the existing Container Memory and CPU fields in v1beta3 -- if we prefer to do the latter, we should add this issue to #1519.

I propose that we add an optional ResourceSpec struct containing optional Request and Limit ResourceList fields to both PodSpec and Container.
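
Roughly (a sketch of the proposal, with quantities shown as plain integers rather than the actual ResourceList value type):

type ResourceList map[string]int64 // resource name -> quantity, e.g. "cpu", "memory"

type ResourceSpec struct {
	// Request: what the scheduler uses for placement.
	Request ResourceList `json:"request,omitempty"`
	// Limit: the hard cap enforced on the node.
	Limit ResourceList `json:"limit,omitempty"`
}

// PodSpec and Container would then each carry an optional field along the lines of:
//     Resources *ResourceSpec `json:"resources,omitempty"`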

@smarterclayton smarterclayton referenced this issue Oct 2, 2014

Closed

Implement v1beta3 api #1519

@bgrant0607

Member

bgrant0607 commented Oct 2, 2014

Clarification: The separation of desired-state fields into a ResourceSpec struct was deliberate, conforming to the careful separation of desired and current state in v1beta3. Usage-related fields would go into a ResourceStatus struct, as would effective settings, such as soft or hard container limits. @johnwilkes agreed this made sense. At some point, we should clarify this in resources.md.
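
For example, the status-side counterpart to the ResourceSpec sketched above might look something like this (field names purely illustrative):

// Current state, reported by the system rather than requested by the user.
type ResourceStatus struct {
	Usage          ResourceList `json:"usage,omitempty"`          // observed consumption
	EffectiveLimit ResourceList `json:"effectiveLimit,omitempty"` // settings actually applied, e.g. soft vs. hard container limits
}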

@thockin

Member

thockin commented Oct 2, 2014

I don't think we want pod-level resources yet, or if we do then we accept EITHER pod resources OR container resources, but never both on a single pod. Not yet.


@bgrant0607

Member

bgrant0607 commented Oct 2, 2014

Fair enough. We can't support pod limits until libcontainer and Docker do, so I'd be fine with omitting that for now.

@smarterclayton

Contributor

smarterclayton commented Jul 26, 2015

If we use limit to mean hard cap, we can still vertically autosize / apply defaults when a user specifies a request. As Derek notes, we want to offer predictable behavior for end users and admins when pods run alongside multiple heterogeneous workloads (which shares can't guarantee); this negatively impacts utilization in some cases but offers a predictable experience for admins.

Do we have the tools to distinguish QoS properly when an autosizer is present? Should the autosizer know enough about QoS to reflect the appropriate use?

@davidopp

Member

davidopp commented Jul 26, 2015

@erictune My understanding is that request set (say to value "R") and limit unset would be interpreted the way Borg handles memory limit = R with allow_overlimit_mem, and CPU limit = R.

@timothysc

Member

timothysc commented Jul 28, 2015

a) Use cpu limit to mean hard cap. Most pods/containers would likely specify only request.

Agreed; hard limits should be the only option.

b) Perform no hard capping by default. Gradually improve enforcement and isolation over time. We'd do this even in the case of (a).

Disagree; not hard-capping will affect SLOs.

I do believe what's missing in this conversation is the notion of history for proper sizing. Right now there is no job-history service to yield a best guess at actual usage.

@Iwan-Zotow

Iwan-Zotow commented Jul 30, 2015

Gentlemen,

I'm running Monte Carlo simulations on a Google Cloud cluster now, and to emulate batch scheduling I had to set the CPU limit to some strange values.

@vishh

Member

vishh commented Dec 28, 2015

FYI: Docker now supports updates to cgroups #15078

@andyxning

Member

andyxning commented Aug 21, 2016

Allowing memory limits to be over-committed may cause unpredictable process kills by triggering the kernel OOM killer.

I ran a program that allocates 50GB of memory in a pod whose memory limit is 118GB, on a node with 64GB. After the program had been running for several seconds, it was OOM-killed, and the OOM killer log shows up in /var/log/syslog.

@montanaflynn

montanaflynn commented Sep 11, 2016

Hard limits for CPU are very important for the video transcoding pods we run on Google Container Engine. We need nodes with lots of cores for speed, but we also don't want a single pod greedily using up all the cores. It would be ideal to set their limit at 3/4 of the total node's CPU.

We can currently do this for scheduling with requests, so we don't put two transcoders on a single node, but the lack of hard limits means that when the pod is running it uses all the cores even with limits set. This has led us to run two clusters: one specifically for transcoding large media, and the other for small media and the rest of our services.

@thockin

Member

thockin commented Sep 11, 2016

I thought we used shares for "request" and quota for "limit", thereby providing true hard limits. Did I mis-comprehend?
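
For reference, the rough shape of that mapping, using the usual cgroup defaults (a sketch, not the exact kubelet code):

package main

import "fmt"

const (
	sharesPerCPU = 1024   // cpu.shares granted per full core of request
	cfsPeriodUs  = 100000 // default CFS period: 100ms
	milliPerCPU  = 1000
)

// requestToShares: relative weight used only when CPU is contended.
func requestToShares(milliCPU int64) int64 {
	return milliCPU * sharesPerCPU / milliPerCPU
}

// limitToQuota: hard cap, in microseconds of CPU time allowed per CFS period.
func limitToQuota(milliCPU int64) int64 {
	return milliCPU * cfsPeriodUs / milliPerCPU
}

func main() {
	fmt.Println(requestToShares(500)) // 512 shares for a 500m request
	fmt.Println(limitToQuota(6000))   // 600000us per 100ms period for a 6-CPU limit
}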


@montanaflynn

montanaflynn commented Sep 12, 2016

It seems hard limits came with v1.2, based on the changelog. I remember when I first started with Kubernetes there was a warning saying that CPU limits were not enforced. Maybe it was my host OS that didn't support it. Looking at the compute resources documentation, it looks like Kubernetes does support hard limits by default now.

CPU hardcapping will be enabled by default for containers with CPU limit set, if supported by the kernel. You should either adjust your CPU limit, or set CPU request only, if you want to avoid hardcapping. If the kernel does not support CPU Quota, NodeStatus will contain a warning indicating that CPU Limits cannot be enforced.

@thockin

Member

thockin commented Sep 12, 2016

Note that CPU hard limits can be surprising. All they guarantee is that you can use X core-seconds per wall-second. Consider a 16-core machine and a pod that has an 8-core limit. If your app is multi-threaded or multi-process, and the number of runnable threads/processes is larger than 8, you could use up your entire 8 core-second budget in less than 1 wall second. If you used all 16 cores for 0.5 seconds, you would leave your pod ineligible to run for the remaining 0.5 seconds (that's a long time!), giving you terrible tail latency.

Now, in reality the time slice is smaller, but it is still in the tens or hundreds of milliseconds. If you're not careful, you really could find yourself with unexpected latency blips of 50 or 100 milliseconds or more.
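
Putting numbers on that with the default 100ms CFS period (back-of-the-envelope arithmetic, not measured data):

package main

import "fmt"

func main() {
	const (
		periodMs   = 100.0 // default cfs_period_us is 100ms
		limitCores = 8.0   // quota: 800ms of CPU time per 100ms period
		burstCores = 16.0  // runnable threads briefly occupy every core on the node
	)
	quotaMs := limitCores * periodMs         // 800 core-ms available per period
	exhaustedAfterMs := quotaMs / burstCores // 50ms of wall time at 16 cores
	throttledMs := periodMs - exhaustedAfterMs
	fmt.Printf("quota exhausted after %.0fms; throttled for the remaining %.0fms\n",
		exhaustedAfterMs, throttledMs)
}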


@timothysc

Member

timothysc commented Sep 12, 2016

If not tolerating those blips is a hard constraint, then you're likely looking for CPU affinity or cpusets. xref: #10570

@montanaflynn

montanaflynn commented Sep 12, 2016

I'm on Google Cloud's Container Engine and found that the warning I referenced above is still shown while running Kubernetes version 1.3.5 on both the master and the nodes:

The warning shown by kubectl describe nodes is: WARNING: CPU hardcapping unsupported.

$ kubectl describe nodes
Name:           gke-cluster-1-default-pool-777adf16-an5j
Labels:         beta.kubernetes.io/arch=amd64
            beta.kubernetes.io/instance-type=n1-highcpu-8
            beta.kubernetes.io/os=linux
            cloud.google.com/gke-nodepool=default-pool
            failure-domain.beta.kubernetes.io/region=us-central1
            failure-domain.beta.kubernetes.io/zone=us-central1-b
            kubernetes.io/hostname=gke-cluster-1-default-pool-777adf16-an5j
Taints:         <none>
CreationTimestamp:  Wed, 12 Sep 2016 08:14:45 -0700
Phase:
Conditions:
  Type          Status  LastHeartbeatTime           LastTransitionTime          Reason              Message
  ----          ------  -----------------           ------------------          ------              -------
  NetworkUnavailable    False   Mon, 01 Jan 0001 00:00:00 +0000     Wed, 07 Sep 2016 18:15:58 -0700     RouteCreated            RouteController created a route
  OutOfDisk         False   Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:14:45 -0700     KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure    False   Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:14:45 -0700     KubeletHasSufficientMemory  kubelet has sufficient memory available
  Ready         True    Mon, 12 Sep 2016 14:15:13 -0700     Wed, 07 Sep 2016 18:15:21 -0700     KubeletReady            kubelet is posting ready status. WARNING: CPU hardcapping unsupported
@vishh

Member

vishh commented Sep 12, 2016

@montanaflynn On Google Container Engine, can you switch to GCI as the image type? You can upgrade your node pool to GCI by setting --image-type=gci, or pass that flag while creating a new cluster.
GCI is the replacement for the existing Debian 7-based base image on GKE. CPU limits are supported there.

@montanaflynn

montanaflynn commented Sep 12, 2016

@vishh where / how could I set --image-type=gci for an existing cluster?

@vishh

Member

vishh commented Sep 12, 2016

@montanaflynn Assuming you have only the default node-pool, run gcloud container clusters upgrade <your_cluster_name> --image-type gci --node-pool default-pool.
This change is disruptive since it restarts existing nodes.
Another option is to create a new node pool for your cluster using GCI and then slowly turn down the default node pool: gcloud container node-pools create --cluster <your_cluster_name> --image-type gci

@montanaflynn

montanaflynn commented Sep 12, 2016

Thanks! Will container engine be using that image by default in the future?

@vishh

Member

vishh commented Sep 12, 2016

Yes. That might happen as early as v1.4 on GKE.

@timothysc

Member

timothysc commented Dec 7, 2016

I think we should move to close this issue. The root topic has been addressed, but there are multiple side threads on this issue that I believe would be better served by separate issues.

@vishh thoughts?

@timothysc timothysc closed this Dec 8, 2016

