WIP: cluster autoscaler integration with machine API #2653

ingvagabund · 2018-09-09T16:38:16Z

TBD

k8s-ci-robot · 2018-09-09T16:38:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: jbeda

If they are not already assigned, you can assign the PR to them by writing /assign @jbeda in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

keps/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

MaciekPytel · 2018-09-10T11:49:27Z

keps/sig-autoscaling/0000-kep-template.md

+### Non-Goals
+
+* cluster autoscaler is able to autoprovision new node groups through machine API
+* cluster autoscaler is able to estimate node resource requirements through machine API


Estimating node resources is part of CA implementation, not a user visible feature. The user facing result of this point is scale-to/from-0 won't be supported.

Scaling the MachienSet/MachienDeployment to/from 0 could be a really interesting feature. AFAIK currently autoscaler relies on Node-objects to estimate the node-capacity based resources. We are introducing the concept of MachineClass in cluster-api, where the resource-capacity based details[eg capacity, allocatable] could be stored in separate CRD called MachineClass and corresponding MachineDeployment/MachineSet does not necessarily need to exist. Essentially MachineSet/Deployment would only have reference to the Machineclass.

This can potentially give us the freedom of both auto-provision of nodegroups and scale to/from 0 issues.

First-cut PR: Add apis for machine-class kubernetes-sigs/cluster-api#488

I think having MachineClass should be enough to support scale-to/from-0 easily, provided you can construct a Node object easily from it. It doesn't have to be perfect, but it needs to provide information used by scheduling predicates, such as labels, taints, local volumes, etc.

Estimating node resources is part of CA implementation, not a user visible feature. The user facing result of this point is scale-to/from-0 won't be supported.

The estimation itself is, way of providing the information to generate the template is not. I wanted to point out the machine API will provide the information, the CA will use node template populated with the information to estimate requirements.

I agree this is information no user should be aware of. Though, one can not get cloud provider specific information unless the one knows how to interpret the provider config. The same holds for the machine classes. Unless we generalize concept of machine types, security groups, ... and make reasonable mapping from the generalized concepts into cloud specific resources.

MaciekPytel · 2018-09-10T11:49:31Z

keps/sig-autoscaling/0000-kep-template.md

+Both `Create` and `Delete` operations return an error `ErrAlreadyExist` and the autoprovisioning is off.
+
+The autoscaler allows to either enumerate all node groups (via `--nodes` option)
+or automatically discover all node groups on-the-fly (via `--node-group-auto-discovery`).


--node-group-auto-discovery is provider specific, all related logic is implemented by cloudprovider. It's optional for cloudprovider to implement it (obviously the flag doesn't work for providers who don't implement it).

MaciekPytel · 2018-09-10T11:51:28Z

keps/sig-autoscaling/0000-kep-template.md

+
+Given the scaling up/down operation corresponds to increasing/decreasing the number
+of nodes in a node group, it's sufficient to "just" change the number of replicas
+of corresponding machine set.


That is only true for scale-up. Scale-down requires the ability to delete a specific node, not just reduce number of replicas.

Yes, enhancements for machine-set are being discussed at cluster-api community, which allows machine-set to delete specific machine while scaling-down.

MachineSet: support for defining specific machines when scaling down kubernetes-sigs/cluster-api#75

MaciekPytel · 2018-09-10T11:54:23Z

keps/sig-autoscaling/0000-kep-template.md

+Other option is to pull all the information from the `spec.providerConfig`.
+Which means the autoscaler needs to import actuator specific code, which forces
+to choose a single implementation of actuator for each cloud provider.
+That can be to restrictive.


Dependency management in autoscaler is already very hard. I don't really like the idea of adding a whole bunch of implementations of cluster API.

Agreed. I would also expect that machine-api should have necessary generic details ready at MachineClass or MachineSet/Deployment, which can be readily consumed by autoscaler without knowing much about the actuator-specific apis.

MaciekPytel · 2018-09-10T11:57:25Z

keps/sig-autoscaling/0000-kep-template.md

+the provider config). This may lead to data duplication as one needs to provide
+a machine type in the config (e.g. to provision an instance) and cpu, gpu and memory
+requirements of the machine outside the config (e.g. to construct node object
+resource capacity).


Cluster Autoscaler needs to create a whole Node object, not just resource capacity. In particular it needs to know both capacity and allocatable and a full list of taints and labels the node will have.

MaciekPytel · 2018-09-10T12:00:05Z

keps/sig-autoscaling/0000-kep-template.md

+system requirements, one can use generic kubelet's configuration discovery to get
+the kubelet's reserved and system requirements from the first machine in a given
+node group (given there is at least one node in the group).
+The same holds for node taints.


Cluster Autoscaler is based on assumption that every node in node group is exactly identical (for example multi-zonal ASGs are not supported). If at least a single node exists CA won't try to parse machine object, it will just make an in-memory copy of existing node and use that to represent new node.

MaciekPytel · 2018-09-10T12:02:02Z

Glad to see this moving forward! I've left some initial comments. I'm happy to discuss on slack or have a quick meeting if you have any questions (or just discuss via comments here).

hardikdr · 2018-09-10T13:51:36Z

Great to see that work has been initiated to look into the integration of autoscaler with machine-api.

@MaciekPytel At cluster-api, we are trying to define the APIs for machine which could be best aligned with current cluster-autoscaler and its future plans. It would be great to have a quick call sometime, to define the requirements from both components and have smooth integration.

aleksandra-malinowska · 2018-09-10T14:03:55Z

keps/sig-autoscaling/0000-kep-template.md

+by provisioning of underlying instances).
+With some mama's marmalade, one can represent an autoscaling group with a machine set.
+Though, the current implementation of the machineset controller needs to be improved
+to support autoscaling use cases. E.g. define delete strategy to pick appropriate machine


It's crucial to Cluster Autoscaler logic that it is able to name a specific machine to be deleted in case of scale-down. Delete strategy on the side of machine set alone won't work for this use-case.

I think we should have a write-up on why this is the case, since it doesn't seem to be immediately obvious.

More specifically, it seems at first glance that the machineset controller would want similar criteria for scale-down to the cluster autoscaler; a scale-down that doesn't take into account disruption budget, node load, node readiness, particular pod types, etc doesn't seem like it would be particularly useful.

The difference that jumps to mind initially is the scheduling simulation -- is the issue that the optimal node to delete in terms of load+disruption-budget+pod-types+etc might have pods that don't fit elsewhere?

I'm not sure if machine set should have all this logic. If the user resizes it some node must be removed, even if doing so violates PDB, kills a pod that won't be re-created, etc. There may be no right decision and any heuristic would be arbitrary.

Given the ambiguity I think it's more correct to put the responsibility to choose node on whoever makes the decision to remove it (CA, user, other controller). If no explicit decision is made machine set should use some very simple and predictable strategy (remove newest node, or oldest node, or something like that). Ideally the API should allow to reduce target size of machine deployment and remove specific machine. Some new subresource maybe?

BTW it feels like stateful set would have the same problem. Without detailed knowledge about workload it's not possible to make an informed decision on which replica to remove, so a user really should have an option to choose. It feels like Cluster API should do the same thing stateful set does (or the other way round, if stateful set haven't solved it yet).

Also that's just how CA works. It picks a node to delete, not a node group to resize. Unless we guarantee the internal logic is 100% the same between CA and machine controller they will conflict.

Given the ambiguity I think it's more correct to put the responsibility to choose node on whoever makes the decision to remove it (CA, user, other controller)

That may be an acceptable answer, but I think we should explicitly say that in the proposal, and enumerate why a bit. I've seen a lot of "it has to work this way because the cluster autoscaler currently works this way" written down, and I think we need to write down more justification beyond that.

It feels like Cluster API should do the same thing stateful set does (or the other way round, if stateful set haven't solved it yet).

Yeah, I wonder if the workloads folks have solved that yet @kubernetes/sig-apps-misc

Also that's just how CA works

If we're designing for future plans, maybe we should re-evaluate that. Maybe the CA should be closer to the way the HPA works? (i.e. "have we re-evaluated our choices in light of the cluster API being a thing")

Maybe the CA should be closer to the way the HPA works?

Various groups of VMs (ASGs on AWS, MIGs on GCP etc.) already work similarly to HPA (i.e., the number of replicas depends on a metric's value). So at least in those environment, users can already have it without running a separate component. The problem is that autoscaling Kubernetes nodes in this way doesn't work well in most cases. In fact, it causes so much trouble we've had to put explicit warnings in UI against enabling MIG autoscaler for VMs running GKE nodes.

It all boils down to "what does it mean that we need a bigger/smaller cluster". If a metric we use as input is above target, but there are no pending pods, adding a new machine will do nothing except unnecessarily increase the cost of running the cluster, because nothing will be scheduled there (although it may indeed bring some metrics down, e.g. average utilization.) This is even worse for scale-down - not all nodes run the same workloads, and removing nodes without checking the pods running there can be moved elsewhere may result in some of those pods no longer running (possibly taking down entire applications.)

It can be argued we just haven't figured out the right metrics for it yet, but even if we could use something like "number of pending pods" as input to scale-up, I've yet to see anything that would work for scale-down. And if we find one, perhaps it can be used with already existing VM autoscalers?

MaciekPytel · 2018-09-10T14:16:17Z

@hardikdr Sure, happy to have a quick meeting. I'm available this week, than I'm going for 2 week vacation. Let's sync up on slack if you want to set something up.

hardikdr · 2018-09-10T14:13:58Z

keps/sig-autoscaling/0000-kep-template.md

+### Non-Goals
+
+* cluster autoscaler is able to autoprovision new node groups through machine API
+* cluster autoscaler is able to estimate node resource requirements through machine API


Scaling the MachienSet/MachienDeployment to/from 0 could be a really interesting feature. AFAIK currently autoscaler relies on Node-objects to estimate the node-capacity based resources. We are introducing the concept of MachineClass in cluster-api, where the resource-capacity based details[eg capacity, allocatable] could be stored in separate CRD called MachineClass and corresponding MachineDeployment/MachineSet does not necessarily need to exist. Essentially MachineSet/Deployment would only have reference to the Machineclass.

This can potentially give us the freedom of both auto-provision of nodegroups and scale to/from 0 issues.

First-cut PR: Add apis for machine-class kubernetes-sigs/cluster-api#488

hardikdr · 2018-09-10T14:17:13Z

keps/sig-autoscaling/0000-kep-template.md

+
+* cluster autoscaler is able to autoprovision new node groups through machine API
+* cluster autoscaler is able to estimate node resource requirements through machine API
+* cluster autoscaler is able to use at least one pricing model on top of the machine API


If I understand correctly, pricing-model at the moment in autoscaler is config-rule based, if this information again could be embedded into the MachineClasses, we can achieve price-based autoscaling as well.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/pricing.md

/cc @venezia

roberthbailey · 2018-09-11T18:07:10Z

@MaciekPytel @hardikdr - I've scheduled a meeting for tomorrow at 9:30am pacific time. You should have gotten an invite. Feel free to add other folks as well. And if anyone else wants an invite they can let me know.

enxebre · 2018-09-12T10:37:29Z

@roberthbailey Can I get an invite for this? thanks!
@vikaschoudhary16 @ingvagabund and @froware might be interested as well

roberthbailey · 2018-09-12T20:11:55Z

@enxebre - I'm sorry; I didn't see your message until after the meeting ended. :/ The upside is that we ran out of time anyway and need to schedule a follow up. I'll try to give more advance notice for that one. Also, the autoscaler folks invited us to attend their SIG meeting (on Mondays) to discuss further as well.

enxebre · 2018-09-13T08:13:04Z

np @roberthbailey thanks for the update!

alvaroaleman · 2018-10-07T17:26:36Z

keps/sig-autoscaling/0000-kep-template.md

+
+Cluster autoscaling builds on top of concept of node groups which effectively
+translates to cloud provider specific autoscaling groups, resp. autoscaling sets.
+In case of the machine API, the node group translates into machine set (living


Shouldn't this be MachineDeployment? Otherwise MachineDeployments can not be used in conjunction with CA.

justaugustus · 2018-10-13T03:12:53Z

/kind kep

krmayankk · 2018-10-15T07:13:54Z

keps/sig-autoscaling/0000-kep-template.md

+Each of the currently supported cloud providers has its own concept of scaling groups:
+- AWS has autoscaling groups
+- Azure has autoscaling sets
+- GCE/GKE have autoscaling groups


cluster api currently doesnt support GKE as i understand it . the integration of autoscaler with cluster api should be optional or opt in so that autoscaler continues to be usable in gke

My understanding is that the implementation in Cluster Autoscaler would be done by adding a new 'Cluster API cloudprovider'. Existing cloudprovider integrations won't be impacted.

justaugustus · 2018-11-20T04:35:57Z

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

justaugustus · 2018-12-01T08:03:48Z

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

k8s-ci-robot · 2018-12-01T08:03:49Z

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot requested review from calebamiles and jbeda September 9, 2018 16:38

WIP: cluster autoscaler integration with machine API

46bd3ac

ingvagabund force-pushed the integrate-clusterapi-into-cluster-autoscaler branch from 411bc1e to 46bd3ac Compare September 9, 2018 16:48

ingvagabund mentioned this pull request Sep 9, 2018

Cluster Autoscaler / Cluster API Integration kubernetes/enhancements#609

Closed

MaciekPytel reviewed Sep 10, 2018

View reviewed changes

hardikdr mentioned this pull request Sep 10, 2018

Add apis for machine-class kubernetes-sigs/cluster-api#488

Merged

aleksandra-malinowska reviewed Sep 10, 2018

View reviewed changes

hardikdr reviewed Sep 10, 2018

View reviewed changes

Extend the node templating options

c469ee1

k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Sep 14, 2018

alvaroaleman reviewed Oct 7, 2018

View reviewed changes

alvaroaleman mentioned this pull request Oct 11, 2018

REQUEST: New membership for @alvaroaleman kubernetes/org#158

Closed

6 tasks

k8s-ci-robot added the kind/kep label Oct 13, 2018

krmayankk reviewed Oct 15, 2018

View reviewed changes

alvaroaleman mentioned this pull request Nov 1, 2018

Extend MachineStatus to add ProviderID kubernetes-sigs/cluster-api#565

Merged

k8s-ci-robot requested a review from venezia November 4, 2018 11:44

k8s-ci-robot closed this Dec 1, 2018

WIP: cluster autoscaler integration with machine API #2653

WIP: cluster autoscaler integration with machine API #2653

Conversation

ingvagabund commented Sep 9, 2018

k8s-ci-robot commented Sep 9, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MaciekPytel commented Sep 10, 2018

hardikdr commented Sep 10, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MaciekPytel commented Sep 10, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

roberthbailey commented Sep 11, 2018

enxebre commented Sep 12, 2018 • edited

roberthbailey commented Sep 12, 2018

enxebre commented Sep 13, 2018

Choose a reason for hiding this comment

justaugustus commented Oct 13, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

justaugustus commented Nov 20, 2018

justaugustus commented Dec 1, 2018

k8s-ci-robot commented Dec 1, 2018

enxebre commented Sep 12, 2018 •

edited