proposal: add auto-scaling for MachineSets #83
Comments
This makes sense to me. I'm not sure if there are realistic use cases for having inventory in a cluster that isn't being consumed by the cluster. I can't really think of good reasons for doing that off the top of my head.
I can imagine this enabling some great UX improvements. 👍 IIRC @dhellmann you once suggested the possibility of using a few Available/Ready (non-Provisioned) BMHs to create a brand new cluster using the first cluster as a sort of... bootstrap cluster? That might be easier than going through the usual install process and setting up a bootstrap node, and could be relatively common in (non-Edge) multi-cluster environments where nodes are roughly collocated. Maybe. 🤷‍♂️ This proposal doesn’t really preclude that flow I suppose. Some BMHs might just have to be deprovisioned before turning into a new cluster, which I’d expect to be a valid path regardless.
If it ends up being the case that this autoscaling behavior is desired more often than not, would it make sense for it to be on by default, with the annotation turning it off instead?
The OpenShift installer doesn't really support that today, but it could be made to work. And the v1alpha2 work being done in metal3 already supports this flow for standard Kubernetes clusters using a newer machine API.
Yeah, I think this proposal is asking us to go all-in on the idea that there is no unused inventory in a cluster.
I'm not completely convinced by this - in the OpenStack world operators generally complain about the fact that all of the hardware is always provisioned and in use. There's a real cost (in terms of electrical power consumption) to running servers that are not needed.

Currently the cluster-autoscaler does not integrate with the cluster-api, but when it does it seems to me that that's what you would want managing the MachineSet size.

One bare-metal-specific scenario that this does not account for is the simple case where you have only one cluster: there it would be advantageous to be able to keep all of the Hosts provisioned and only toggle the power as you bring them in and out of the cluster. My first impression though is that this would need to be handled at a level below the Machine API.

I could buy that in a hyperconverged storage scenario you might want to keep all of the available Hosts in the cluster all of the time. I wonder, though, if that could be better handled by rook (or whatever hyperconverged storage operator) tweaking the cluster-autoscaler parameters appropriately, rather than writing a competing autoscaler.
This is more understandable, although if there are insufficient Hosts available I don't think anything bad happens; you just get some Machines hanging around that can never turn into Nodes. I don't know whether or not the cluster-autoscaler will handle this case for you (i.e. notice that nothing bad is happening with the current number of Nodes, yet the MachineSet size is larger, therefore contract the MachineSet to match).
Powering down hardware when not needed is a different story than deprovisioning hardware when not needed. Provisioning is expensive and time-consuming. If we apply a cluster-autoscaler to a bare metal cluster, once the autoscaler decides it needs more capacity, it could easily be 30+ minutes (worse in many cases) before the new capacity is done provisioning and becomes available. Perhaps that's a constraint someone would be willing to live with, but we haven't received that request yet AFAIK.

It seems like scale-by-provisioning with that level of latency would be a better fit for workloads that are time-of-day specific; if you can anticipate when demand will increase, you can proactively begin re-provisioning dark hardware (like the thermostat in my house that turns on the heat ~30 minutes before I wake up). If we really wanted to pursue load-based cluster autoscaling with bare metal, I think we would be much better served looking at being able to suspend or hibernate systems rather than deprovision them.

In the meantime, we do have a multi-cluster use case where inventory is managed at a level above clusters. We're either going to build logic into that thing to scale MachineSets up and down as it adds and removes inventory in a specific cluster, or put that logic into the provider running on the cluster. I think doing it in the provider makes more sense and would enable more re-use. Since it's optional and opt-in (you have to annotate a MachineSet to get the behavior), there's no harm for someone who wants to scale their MachineSets another way.
It feels like we might be missing a concept like a BareMetalHostSet - where each Host in the set would be provisioned with the configuration defined in the MachineSet, but not powered on until it is associated with a Machine. I think we should try to avoid needing a baremetal-specific cluster-autoscaler cloud provider to implement these kinds of use cases.
Aren't at least some of the settings for the host time-sensitive? I'm thinking about the certs used for the host to identify itself and register with the cluster. Those have a limited lifetime, right? If we pre-provision a host, then power it off, when it boots again we might have to do more than power it on to make it usable.
Good question. If there is stuff that is specific to a particular Machine passed in the userdata then probably the best we can hope for is to be able to rebuild the host in Ironic to update the config-drive, but I assume that still involves rebooting into ironic-python-agent and back again, so it'd be roughly as slow as provisioning (IIUC it's mainly having to test that amount of RAM on startup that makes things so slow?).
I can see something like that being valuable in some cases, but we're getting into use cases that go well beyond the scope of this request. If we're interested in pursuing the ability to pre-provision hosts, adjust cluster size based on load, or power down hosts for energy conservation, let's make issues for those use cases and discuss them there.

Many users will just want to provision whatever hardware they tell the cluster about, and that's the use case I'm trying to address. Rather than make it a two-step process of 1) add or remove a BareMetalHost and 2) increment or decrement the corresponding MachineSet (an inherently imperative operation BTW), we can reduce that to one step by letting the user declare with an annotation that they want their MachineSet size to match what's available.

Are there objections to that? It's opt-in, requiring an annotation to be placed on the MachineSet, so default behavior is unchanged. The code is going to be written; if not here, then other tools that want to add and remove BareMetalHosts will need to implement it. For example, multi-cluster tooling that's coming together will need this behavior. I'd rather do it here so we can provide a consistent behavior and let one implementation be re-used. I'm also happy to implement it as long as nobody objects.
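For concreteness, a minimal sketch of what the opt-in check could look like; the annotation key below is a placeholder assumption, not a name this proposal has settled on:

```go
// Minimal sketch of the opt-in check, assuming a hypothetical annotation key.
package autoscale

// autoScaleAnnotation is a placeholder; the real key would be chosen by the
// provider when this is implemented.
const autoScaleAnnotation = "metal3.io/autoscale-machineset"

// shouldAutoScale reports whether a MachineSet has opted in to having its
// replica count track the number of matching BareMetalHosts.
func shouldAutoScale(annotations map[string]string) bool {
	_, ok := annotations[autoScaleAnnotation]
	return ok
}
```

MachineSets without the annotation would be left alone, so existing behavior stays the default.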
It seems like we're not near to figuring out the shape of the solution for those more complex use cases, so I agree we shouldn't block this.
In many clusters, it is desirable for the size of a MachineSet to always equal the number of matching BareMetalHosts. In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes, and they want to remove excess Machines in case they remove hosts from their cluster. This change adds a controller that scales MachineSets to the number of matching BareMetalHosts. The behavior is opt-in, requiring an annotation on the MachineSet. fixes #188
/kind feature
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues close after 30d of inactivity. Reopen the issue with /reopen. /close
@metal3-io-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In many clusters, it is desirable for the size of a MachineSet to always equal the number of matching BareMetalHosts. In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes, and they want to remove excess Machines in case they remove hosts from their cluster.
Rather than make some external process manage the size of MachineSets as BareMetalHosts come and go, we could create a small controller that (optionally) automatically ensures a MachineSet has a size equal to the number of matching BareMetalHosts.
The controller would be an additional Controller in this project. It would watch MachineSets as its primary resource, and if they have a particular annotation, ensure that their size equals the number of matching BareMetalHosts. It would watch BareMetalHosts as a secondary resource.

Thoughts?
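To make the idea concrete, a rough sketch of the reconcile decision described above, for discussion only; the types and annotation key are placeholder assumptions, not part of this proposal:

```go
// Sketch only: how the controller might decide the replica count for an
// annotated MachineSet. The real controller would read MachineSet and
// BareMetalHost objects through the Kubernetes client and requeue on
// BareMetalHost events.
package autoscale

// MachineSetInfo is a simplified stand-in for the fields the controller
// would read from a real MachineSet.
type MachineSetInfo struct {
	Annotations map[string]string
	Replicas    int32
}

// desiredReplicas returns the replica count the MachineSet should have and
// whether it needs updating. matchingHosts is the number of BareMetalHosts
// that match the MachineSet's host selector.
func desiredReplicas(ms MachineSetInfo, matchingHosts int) (int32, bool) {
	// Placeholder annotation key, same assumption as the earlier sketch.
	if _, optedIn := ms.Annotations["metal3.io/autoscale-machineset"]; !optedIn {
		return ms.Replicas, false // no annotation: leave the MachineSet alone
	}
	desired := int32(matchingHosts)
	return desired, desired != ms.Replicas
}
```

Watching BareMetalHosts as a secondary resource would simply map host add/remove events back to annotated MachineSets so this check re-runs whenever inventory changes.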