Vertical pod auto-sizer #10782

Closed
erictune opened this Issue Jul 6, 2015 · 34 comments

@erictune
Member

erictune commented Jul 6, 2015

We should create a vertical auto-sizer. A vertical auto-sizer sets the compute resource limits and requests for pods that do not have them set, and periodically adjusts them based on demand signals. It does not directly deal with replication controllers, services, or nodes.

Related issues:

@davidopp

Member

davidopp commented Jul 7, 2015

@jszczepkowski Do you have any issues open already on this?

@jszczepkowski

Contributor

jszczepkowski commented Jul 7, 2015

I don't have any issue for it yet.

@piosz piosz referenced this issue Jul 20, 2015

Closed

Implement Resource Consumer #11570

17 of 19 tasks complete

@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015

@jszczepkowski

Contributor

jszczepkowski commented Aug 3, 2015

We are planning to divide vertical pod autoscaling into three implementation steps. We plan to deliver the first of them, Setting Initial Resources, in version 1.1.

Setting Initial Resources

Setting initial resources will be implemented as an admission plugin. It will try to estimate and set memory/CPU request values for the containers within each pod if they were not given by the user. (The plugin will not set limits, to avoid OOM killing.)

We will additionally annotate container metrics with the image name. Usage for a given image will be aggregated (it is not yet decided how or by whom), and the initial-resources plugin will set requests based on that aggregation.
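
Roughly, the flow would look like the sketch below (illustrative Go only; the Container type, the imageUsage map, and setInitialResources are made-up names for this example, not the plugin's actual code): if a container has no request, look up the aggregated usage for its image and fill the request in, leaving the limit unset.

```go
package main

import "fmt"

// Container is a simplified stand-in for a pod-spec container; zero values
// mean "request not set by the user". Units: millicores and bytes.
type Container struct {
	Image           string
	CPURequestMilli int64
	MemRequestBytes int64
}

// imageUsage stands in for the (not yet designed) per-image usage
// aggregation, e.g. a high percentile over recently observed containers.
var imageUsage = map[string]struct{ cpuMilli, memBytes int64 }{
	"nginx:1.9": {cpuMilli: 100, memBytes: 64 << 20},
}

// setInitialResources fills in missing requests only; limits are left
// unset so that a wrong estimate cannot trigger OOM kills.
func setInitialResources(c *Container) {
	est, ok := imageUsage[c.Image]
	if !ok {
		return // no history for this image, leave the container untouched
	}
	if c.CPURequestMilli == 0 {
		c.CPURequestMilli = est.cpuMilli
	}
	if c.MemRequestBytes == 0 {
		c.MemRequestBytes = est.memBytes
	}
}

func main() {
	c := Container{Image: "nginx:1.9"}
	setInitialResources(&c)
	fmt.Printf("cpu=%dm mem=%dMi\n", c.CPURequestMilli, c.MemRequestBytes>>20)
}
```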

Reactive vertical autoscaling by deployment update

We will add a new object, the vertical pod autoscaler, which will work at the deployment level. When specifying a pod template in a deployment, the user will have the option to set an enable_vertical_autoscaler flag. If the auto flag is given, the vertical pod autoscaler will monitor the resource usage of the pod's containers and change their resource requirements by updating the pod template in the deployment object. So the deployment will act as the actuator of the autoscaler. Note that the user can both specify requirements for a pod and turn on the auto flag for it. In that case, the requirements given by the user will be treated only as initial values and may be overwritten by the autoscaler.
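
The loop could be as simple as this sketch (illustrative Go; DeploymentTemplate, reconcile, and the 20% divergence threshold are all invented for the example): compare observed usage with the current requests and, when they diverge enough, write new requests into the template so the deployment rolls the pods.

```go
package main

import "fmt"

// DeploymentTemplate is an illustrative stand-in for a deployment's pod
// template, tracking per-container memory requests in bytes.
type DeploymentTemplate struct {
	AutoscaleEnabled bool // corresponds to the proposed enable_vertical_autoscaler flag
	MemRequest       map[string]int64
}

// reconcile runs one iteration of the hypothetical VPA control loop and
// reports whether the template changed (the caller would then update the
// deployment object, which recreates the pods with the new requests).
func reconcile(d *DeploymentTemplate, observed map[string]int64) bool {
	if !d.AutoscaleEnabled {
		return false
	}
	changed := false
	for name, usage := range observed {
		cur := d.MemRequest[name]
		// Only act on a significant divergence to avoid churning the deployment.
		if cur == 0 || usage > cur*12/10 || usage < cur*8/10 {
			d.MemRequest[name] = usage
			changed = true
		}
	}
	return changed
}

func main() {
	d := DeploymentTemplate{AutoscaleEnabled: true, MemRequest: map[string]int64{"app": 128 << 20}}
	if reconcile(&d, map[string]int64{"app": 200 << 20}) {
		fmt.Println("pod template updated; deployment will roll out pods with new requests")
	}
}
```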

Reactive vertical autoscaling by in-place update

We have an initial idea for a more complicated autoscaler that will not be bound to the deployment object but will work at the pod level and actuate resource requirements by in-place update of the pod. Before the update, such an autoscaler will first need to consult the scheduler to check whether the new resources for the pod will fit and the in-place update is feasible. The answer given by the scheduler will not be 100% reliable: the pod may still be killed by the kubelet after the in-place update due to lack of resources.
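
In that model the feasibility check against the scheduler might amount to something like this (illustrative Go; fitsInPlace and its inputs are invented for the example, and a positive answer is only advisory):

```go
package main

import "fmt"

// fitsInPlace is the question the autoscaler would put to the scheduler:
// if this pod's memory request grows from oldRequest to newRequest, does
// the node it runs on still have room? All values are bytes.
func fitsInPlace(nodeAllocatable, nodeRequested, oldRequest, newRequest int64) bool {
	return nodeRequested-oldRequest+newRequest <= nodeAllocatable
}

func main() {
	// Even a "yes" is not a guarantee: usage can change between the check and
	// the update, so the kubelet may still kill the pod afterwards.
	ok := fitsInPlace(8<<30, 6<<30, 1<<30, 2<<30)
	fmt.Println("in-place resize feasible:", ok)
}
```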

@jszczepkowski jszczepkowski changed the title from "Vertical auto-sizer" to "Vertical pod auto-sizer" Aug 3, 2015

@smarterclayton

Contributor

smarterclayton commented Aug 4, 2015

Metrics aggregation needs a top-level issue to track it. I'm not aware of one, but we'd like to see it be usable from several angles: the UI, and tracking other container metrics from related systems (load balancers).

@AnanyaKumar

Contributor

AnanyaKumar commented Aug 12, 2015

\CC me

@erictune

Member Author

erictune commented Aug 14, 2015

It is important that there be feedback when the predictions are wrong. In particular, I think it is important that a Pod which is over its request (due to an incorrect initial prediction) is much more likely to be killed than some other pod which is under its request. That way, a malfunction of the "Setting Initial Limits" system appears to affect specific pods rather than random pods; random failures would be very difficult to diagnose.

One way to do that is to make the kill probability in a system OOM situation proportional to the amount over request. @AnanyaKumar @vishh does the current implementation have that property?

@dchen1107 pointed out that it is bad to have a system OOM in the first place. So, two things we might do are:

  • have the "Setting Initial Limits" system set a Limit which is, say, 2x the initial request. That way, assuming there are several similar size pods on a machine, a single misbehaved pod will exceed its pod limit before causing a system OOM. This is not foolproof but I think it is a really good heuristic to start with. @jszczepkowski thoughts on this specific suggestion? If we do this in the initial version, we can always back it out later, but adding it later is harder since it breaks people's assumptions.
  • @dchen1107 and others have talked about a system to dynamically set the memory limits on pods which are over their request to such a value that a single pod spiking is unlikely to cause a system oom. This requires a control loop, but we have experience that suggests user-space control loops can work well. The drawback of this approach is that it requires support for updating limits, which Docker doesn't support yet. So, I don't think we can do this for v1.1.

TL;DR: can we please set the limit to 2x the predicted request for v1.1?
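
For concreteness, the proposed heuristic is just the following (illustrative Go; applyLimitHeuristic is a made-up name for this example):

```go
package main

import "fmt"

// applyLimitHeuristic sets the request to the predicted value and the limit
// to twice that, so a single mispredicted pod hits its own limit before it
// can push the whole node into a system OOM.
func applyLimitHeuristic(predictedRequestBytes int64) (request, limit int64) {
	return predictedRequestBytes, 2 * predictedRequestBytes
}

func main() {
	req, lim := applyLimitHeuristic(256 << 20)
	fmt.Printf("request=%dMi limit=%dMi\n", req>>20, lim>>20)
}
```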

@vishh

Member

vishh commented Aug 14, 2015

@erictune: Yes. The kernel will prefer to kill containers that exceed their requests in an OOM scenario.

+1 for starting with a conservative estimate.

@piosz

Member

piosz commented Aug 17, 2015

@erictune I see your point of view, and I agree that setting Limits would solve your case. On the other hand, I can imagine other situations where it causes problems rather than solving them, especially when the estimation is wrong and the user observes an unexpected kill of their container. So we need high confidence when setting Limits, and we can't guarantee that from the beginning.

I think everyone agrees that setting Request should improve the overall experience, which may not be true for setting Limits. Long term we definitely want to set both, but I would set only Request in the first version (which may be different from v1.1), gather some feedback from users, and then eventually add setting Limits once we have the algorithm tuned.

@vishh What about two containers that both exceed their request: which one will be killed? The one that exceeds its request 'more', or a random one?

@vishh

Member

vishh commented Aug 17, 2015

As per the current kubelet QoS policy, all processes exceeding their request are equally likely to be killed by the OOM killer.

@piosz

Member

piosz commented Aug 17, 2015

By 'more', do you mean a relative or an absolute value?

@vishh

Member

vishh commented Aug 17, 2015

@piosz: I updated my original comment. Does it make sense now?

@fgrzadkowski

Member

fgrzadkowski commented Nov 17, 2016

  • We don't need in-place update for the MVP of the vertical pod autoscaler. We can just be more conservative and recreate pods via deployments.
  • Infrastore would be useful, but for the MVP we can aggregate this data in the VPA controller (if we don't have Infrastore by then) or read it from a monitoring pipeline.

@derekwaynecarr

Member

derekwaynecarr commented Nov 18, 2016

If we pursue a vertical autosizer that requires kicking a deployment, how hard is it to take that requirement back? For example, I would think many of our users would prefer a solution that did not require a re-deploy and instead could re-size existing pods.

@erictune

Member Author

erictune commented Nov 18, 2016

How would a vertical autosizer that did restarts work with a Deployment, exactly? Can resizing happen concurrently with a new image rollout? If the user wants to roll back the image change and there was an intervening resource change, what happens? Can I end up with four ReplicaSets (the cross product of two image versions and old/new resource advice)? Are these competing for revisionHistoryLimit?

@smarterclayton

Contributor

smarterclayton commented Nov 18, 2016

An autosizer blowing out my deployment revision budget is not ok :)

@derekwaynecarr

Member

derekwaynecarr commented Nov 18, 2016

That said, I think most Java apps would need to restart to take advantage, but ideally that restart would not result in a reschedule.

@davidopp

Member

davidopp commented Nov 19, 2016

@derekwaynecarr I don't think anyone is proposing "a vertical autosizer that requires kicking a deployment" in the long run -- instead I think @fgrzadkowski is saying that the first version would work that way because it's simpler and isn't blocked on in-place resource update in the kubelet. Our plan for the next step of PodDisruptionBudget is to allow it to specify a disruption rate, not just a max number simultaneously down. So you could imagine attaching a max-disruption-rate PDB to your Deployment that the vertical autoscaler would respect (i.e., it would not exceed the specified rate when it does resource updates that require killing the container).
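
A rate-based budget like that could gate the autoscaler with something like the sketch below (illustrative Go; disruptionGate and its fields are invented here, not an existing API): before each restart it triggers, the autoscaler checks that recent disruptions stay under the budget for the window.

```go
package main

import (
	"fmt"
	"time"
)

// disruptionGate tracks recent autoscaler-initiated restarts and allows a
// new one only while the count inside the window stays under the budget.
type disruptionGate struct {
	window  time.Duration
	budget  int
	history []time.Time
}

func (g *disruptionGate) allow(now time.Time) bool {
	// Drop entries that have aged out of the window.
	recent := g.history[:0]
	for _, t := range g.history {
		if now.Sub(t) < g.window {
			recent = append(recent, t)
		}
	}
	g.history = recent
	if len(g.history) >= g.budget {
		return false // over the allowed disruption rate; skip this resize
	}
	g.history = append(g.history, now)
	return true
}

func main() {
	g := &disruptionGate{window: time.Minute, budget: 2}
	now := time.Now()
	fmt.Println(g.allow(now), g.allow(now), g.allow(now)) // true true false
}
```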

I think @erictune is asking a good question. I was surprised that @fgrzadkowski said vertical autoscaling would create a new Deployment. IIRC in Borg we use an API that is distinct from collection update (i.e. does not create a new future collection), to handle vertical autoscaling, so that it doesn't interfere with any user-initiated update that might be ongoing at the same time.

@fgrzadkowski

Member

fgrzadkowski commented Nov 21, 2016

@davidopp I didn't suggest creating a new deployment. I only suggested changing the requirements via the existing deployment.

@erictune I think those are great questions! I don't have concrete answers; they should be covered in a proposal/design doc. However, I recall a conversation with @bgrant0607 some time ago about Deployment potentially having multiple rollouts in flight along different dimensions. With regard to limiting how quickly we would roll changes out, I agree with @davidopp that it should be solved by PDB.

@derekwaynecarr I imagine that initially the validation would require the target object to be a Deployment. Later we can relax this requirement and accept ReplicaSets or Pods. Even in the final solution, the user should be aware that some reschedules/restarts may happen. Maybe we just document that "for now" we will always recreate the pod, and change that in the future. Maybe supporting in-place update should be a prerequisite for this becoming a GA feature?

@bgrant0607

Member

bgrant0607 commented Nov 22, 2016

Some quick comments:

  • We should think about how this would interact with LimitRange, especially if we add a label selector to it (#17097).
  • Updates to some applications, such as Java ones, would also require configuration changes, such as heap size and thread-pool size. We should think about how we'll handle that.
  • In-place rolling update is #9043. Multiple independent updates would require deep changes to Deployment and to any other controllers we want to support this.
  • In-place pod update is #5774.
  • Another example of an imperative operation: relabeling (#36897).
  • We need to think about whether we'd want to be able to roll back resource changes. I assume not, but this would require a new mechanism (probably something like #17333), and we need to think about how to explain it to users, since at least one person found it confusing that we don't roll back replicas changes (#25236).
@DonghuiZhuo


DonghuiZhuo commented Mar 16, 2017

@jszczepkowski You mentioned that "the admission plugin will try to estimate and set memory/CPU request values for containers within each pod if they were not given by the user."

Can you please elaborate on how the admission plugin does the estimation? Does it estimate based on historical data from similar jobs, on some profiling result, or on something else? Thanks.

@jszczepkowski

Contributor

jszczepkowski commented Mar 17, 2017

You mentioned that "the admission plugin will try to estimate and set memory/CPU request values for containers within each pod if they were not given by the user."

Can you please elaborate on how the admission plugin does the estimation? Does it estimate based on historical data from similar jobs, on some profiling result, or on something else? Thanks.

@bitbyteshort
You are referring to an old, obsolete design. For the current design proposal, please see kubernetes/community#338.

@DonghuiZhuo


DonghuiZhuo commented Mar 20, 2017

@jszczepkowski thanks for pointing me to the latest proposal.

@mhausenblas


mhausenblas commented Aug 5, 2017

FYI: I've put together a blog post aimed at raising awareness and introducing our demonstrator, resorcerer.

@fejta-bot


fejta-bot commented Jan 2, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@warmchang

Contributor

warmchang commented Jan 15, 2018

Can we try this VPA feature in the K8s 1.8 release?

@DirectXMan12

Contributor

DirectXMan12 commented Jan 16, 2018

Not really. None of the work has landed yet (except for @mhausenblas's PoC, resorcerer, but that's not quite the same as the final design).

@bgrant0607

Member

bgrant0607 commented Jan 22, 2018

/remove-lifecycle stale
/lifecycle frozen

@mwielgus

Contributor

mwielgus commented Oct 22, 2018

VPA is in alpha in https://github.com/kubernetes/autoscaler

@mwielgus mwielgus closed this Oct 22, 2018
