proposal: add auto-scaling for MachineSets #83

Closed
mhrivnak opened this issue Nov 25, 2019 · 18 comments

Labels: kind/feature, lifecycle/stale, priority/important-longterm

Comments

@mhrivnak (Member)

In many clusters, it is desirable for the size of a MachineSet to always equal the number of matching BareMetalHosts. In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes, and they want to remove excess Machines in case they remove hosts from their cluster.

Rather than make some external process manage the size of MachineSets as BareMetalHosts come and go, we could create a small controller that (optionally) automatically ensures a MachineSet has a size equal to the number of matching BareMetalHosts.

The controller would be an additional Controller in this project. It would watch MachineSets as its primary resource, and if they have a particular annotation, ensure that their size equals the number of matching BareMetalHosts. It would watch BareMetalHosts as a secondary resource.
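
As a rough sketch only (not a design decision), the reconcile logic could look something like the following with controller-runtime. The annotation key, the import paths, and the hostMatches rule are placeholders made up for illustration; the real names and the definition of a "matching" host would be settled during implementation. A real controller would also watch BareMetalHosts and map their events back to annotated MachineSets.

```go
// Package autoscale is a hypothetical package name for this sketch.
package autoscale

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Import paths are assumptions; the actual packages depend on which
	// machine API and BareMetalHost API versions the provider is built against.
	bmh "github.com/metal3-io/baremetal-operator/pkg/apis/metal3/v1alpha1"
	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
)

// autoScaleAnnotation is a placeholder key; the real name would be settled in review.
const autoScaleAnnotation = "metal3.io/autoscale-to-hosts"

// MachineSetReconciler scales annotated MachineSets to the number of matching BareMetalHosts.
type MachineSetReconciler struct {
	client.Client
}

func (r *MachineSetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var ms machinev1.MachineSet
	if err := r.Get(ctx, req.NamespacedName, &ms); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Only MachineSets that opted in via the annotation are touched.
	if _, ok := ms.Annotations[autoScaleAnnotation]; !ok {
		return ctrl.Result{}, nil
	}

	// Count the BareMetalHosts in the same namespace that this MachineSet could consume.
	var hosts bmh.BareMetalHostList
	if err := r.List(ctx, &hosts, client.InNamespace(req.Namespace)); err != nil {
		return ctrl.Result{}, err
	}
	matching := int32(0)
	for i := range hosts.Items {
		if hostMatches(&ms, &hosts.Items[i]) {
			matching++
		}
	}

	// Set the replica count to the number of matching hosts.
	if ms.Spec.Replicas == nil || *ms.Spec.Replicas != matching {
		ms.Spec.Replicas = &matching
		if err := r.Update(ctx, &ms); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}

// hostMatches stands in for whatever "matching" rule the proposal settles on.
// Here it simply checks the host's labels against the MachineSet's selector.
func hostMatches(ms *machinev1.MachineSet, host *bmh.BareMetalHost) bool {
	selector, err := metav1.LabelSelectorAsSelector(&ms.Spec.Selector)
	if err != nil {
		return false
	}
	return selector.Matches(labels.Set(host.Labels))
}
```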

Thoughts?

@dhellmann (Member)

This makes sense to me. I'm not sure if there are realistic use cases for having inventory in a cluster that isn't being consumed by the cluster. I can't really think of good reasons for doing that off the top of my head.

@andybraren

I can imagine this enabling some great UX improvements. 👍

IIRC @dhellmann you once suggested the possibility of using a few Available/Ready (non-Provisioned) BMHs to create a brand new cluster using the first cluster as a sort of... bootstrap cluster? That might be easier than going through the usual install process and setting up a bootstrap node, and could be relatively common in (non-Edge) multi-cluster environments where nodes are roughly collocated. Maybe. 🤷‍♂️

This proposal doesn’t really preclude that flow I suppose. Some BMHs might just have to be deprovisioned before turning into a new cluster, which I’d expect to be a valid path regardless.

@andybraren

If it ends up being the case that this autoscaling behavior is desired more often than not, would it make sense for it to be on by default and the annotation would turn it off instead?

@dhellmann (Member)

> I can imagine this enabling some great UX improvements. 👍
>
> IIRC @dhellmann you once suggested the possibility of using a few Available/Ready (non-Provisioned) BMHs to create a brand new cluster using the first cluster as a sort of... bootstrap cluster? That might be easier than going through the usual install process and setting up a bootstrap node, and could be relatively common in (non-Edge) multi-cluster environments where nodes are roughly collocated. Maybe. 🤷‍♂️

The OpenShift installer doesn't really support that today, but it could be made to work. And the v1alpha2 work being done in metal3 already supports this flow for standard Kubernetes clusters using a newer machine API.

> This proposal doesn’t really preclude that flow I suppose. Some BMHs might just have to be deprovisioned before turning into a new cluster, which I’d expect to be a valid path regardless.

Yeah, I think this proposal is asking us to go all-in on the idea that there is no unused inventory in a cluster.

@mhrivnak self-assigned this Nov 27, 2019
@zaneb (Member) commented Dec 4, 2019

> In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes,

I'm not completely convinced by this - in the OpenStack world operators generally complain about the fact that all of the hardware is always provisioned and in use. There's a real cost (in terms of electrical power consumption) to running servers that are not needed. Currently the cluster-autoscaler does not integrate with the cluster-api, but when it does it seems to me that that's what you would want managing the MachineSet size.

One bare-metal-specific scenario that this does not account for is the simple case where you have only one cluster: there it would be advantageous to be able to keep all of the Hosts provisioned and only toggle the power as you bring them in and out of the cluster. My first impression though is that this would need to be handled at a level below the Machine API.

I could buy that in a hyperconverged storage scenario you might want to keep all of the available Hosts in the cluster all of the time. I wonder if that could be better handled by rook (or whatever hyperconverged storage operator) tweaking the cluster-autoscaler parameters appropriately though, rather than writing a competing autoscaler.

> and they want to remove excess Machines in case they remove hosts from their cluster.

This is more understandable, although if there are insufficient Hosts available I don't think anything bad happens; you just get some Machines hanging around that can never turn into Nodes. I don't know whether or not the cluster-autoscaler will handle this case for you (i.e. notice that nothing bad is happening with the current number of Nodes, yet the MachineSet size is larger, therefore contract the MachineSet to match).

@mhrivnak (Member, Author) commented Dec 4, 2019

Powering down hardware when not needed is a different story than deprovisioning hardware when not needed. Provisioning is expensive and time-consuming. If we apply a cluster-autoscaler to a bare metal cluster, once the autoscaler decided it needs more capacity, it could easily be 30+ minutes (worse in many cases) before new capacity was done provisioning and became available. Perhaps that's a constraint someone would be willing to live with, but we haven't received that request yet AFAIK. It seems like scale-by-provisioning with that level of latency would be a better fit for workloads that are time-of-day specific; if you can anticipate when demand will increase, you can proactively begin re-provisioning dark hardware. (like the thermostat in my house that turns on the heat ~30 minutes before I wake up)

If we really wanted to pursue load-based cluster autoscaling with bare metal, I think we would be much better served looking at being able to suspend or hibernate systems rather than deprovision them.

In the meantime, we do have a multi-cluster use case where inventory is managed at a level above clusters. We're either going to build logic into that thing to scale MachineSets up and down as it adds and removes inventory in a specific cluster, or put that logic into the provider running on the cluster. I think doing it in the provider makes more sense and would enable more reuse. Since it's optional and opt-in (you have to annotate a MachineSet to get the behavior), there's no harm for someone who wants to scale their MachineSets another way.

@zaneb (Member) commented Dec 5, 2019

It feels like we might be missing a concept like a BareMetalHostSet - where each Host in the set would be provisioned with the configuration defined in the MachineSet, but not powered on until it is associated with a Machine.
In a standalone cluster, you'd typically use something like what is proposed here, to make sure that all matching Hosts are always in the HostSet; in more specialised deployments or a multi-cluster environment you'd have a scheduler + reservation system that would assign Hosts to HostSets according to actual + projected demand (just need somebody to come up with an AI angle here ;).

I think we should try to avoid needing a baremetal-specific cluster-autoscaler cloud provider to implement these kinds of use cases.

@dhellmann (Member)

Aren't at least some of the settings for the host time-sensitive? I'm thinking about the certs used for the host to identify itself and register with the cluster. Those have a limited lifetime, right? If we pre-provision a host, then power it off, when it boots again we might have to do more than power it on to make it usable.

@zaneb (Member) commented Dec 6, 2019

Good question. If there is stuff that is specific to a particular Machine passed in the userdata then probably the best we can hope for is to be able to rebuild the host in Ironic to update the config-drive, but I assume that still involves rebooting into ironic-python-agent and back again, so it'd be roughly as slow as provisioning (IIUC it's mainly having to test that amount of RAM on startup that makes things so slow?).

@mhrivnak (Member, Author) commented Dec 9, 2019

I can see something like that being valuable in some cases, but we're getting into use cases that go well beyond the scope of this request. If we're interested in pursuing the ability to pre-provision hosts, adjust cluster size based on load, or power down hosts for energy conservation, let's make issues for those use cases and discuss them there.

Many users will just want to provision whatever hardware they tell the cluster about, and that's the use case I'm trying to address. Rather than make it a two-step process of 1) add or remove a BareMetalHost and 2) increment or decrement the corresponding MachineSet (an inherently imperative operation BTW), we can reduce that to one step by letting the user declare with an annotation that they want their MachineSet size to match what's available.
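
As a sketch of the user-facing side, opting in might look something like the manifest below. The annotation key and the API group shown are placeholders for illustration, not decided names.

```yaml
apiVersion: machine.openshift.io/v1beta1   # assumed API group; use whichever MachineSet API the cluster runs
kind: MachineSet
metadata:
  name: workers
  namespace: openshift-machine-api
  annotations:
    # Placeholder annotation key; the real key would be settled during implementation.
    metal3.io/autoscale-to-hosts: "true"
spec:
  replicas: 0          # managed by the controller once the annotation is present
  # selector and template omitted for brevity
```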

Are there objections to that? It's opt-in, requiring an annotation to be placed on the MachineSet, so default behavior is unchanged. The code is going to be written; if not here, then other tools that want to add and remove BareMetalHosts will need to implement it. For example, multi-cluster tooling that's coming together will need this behavior. I'd rather do it here so we can provide a consistent behavior and let one implementation be re-used. I'm also happy to implement it as long as nobody objects.

@zaneb (Member) commented Dec 11, 2019

It seems like we're not near to figuring out the shape of the solution for those more complex use cases, so I agree we shouldn't block this.

mhrivnak referenced this issue in mhrivnak/cluster-api-provider-baremetal Dec 17, 2019
In many clusters, it is desirable for the size of a MachineSet to always equal
the number of matching BareMetalHosts. In such a scenario, the cluster owner
wants all of their hardware to be provisioned and turned into Nodes, and they
want to remove excess Machines in case they remove hosts from their cluster.
This change adds a controller that scales MachineSets to the number of matching
BareMetalHosts. The behavior is opt-in, requiring an annotation on the
MachineSet.

fixes #188
mhrivnak referenced this issue again in mhrivnak/cluster-api-provider-baremetal on Jan 13, Jan 14, and Jan 16, 2020, each time with the same commit message.

mhrivnak referenced this issue in openshift/cluster-api-provider-baremetal Feb 19, 2020 with the same commit message.
@dhellmann transferred this issue from metal3-io/cluster-api-provider-baremetal Mar 23, 2020
@stbenjam (Member) commented Apr 1, 2020

/kind feature

@metal3-io-bot added the kind/feature label Apr 1, 2020
@stbenjam added the priority/important-longterm label Apr 1, 2020
@metal3-io-bot (Contributor)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot added the lifecycle/stale label Jun 30, 2020
@dhellmann (Member)

/remove-lifecycle stale

@metal3-io-bot removed the lifecycle/stale label Jun 30, 2020
@metal3-io-bot (Contributor)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot added the lifecycle/stale label Sep 28, 2020
@metal3-io-bot (Contributor)

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

@metal3-io-bot (Contributor)

@metal3-io-bot: Closing this issue.

In response to this:

> Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
