proposal: add auto-scaling for MachineSets #83
Comments
This makes sense to me. I'm not sure if there are realistic use cases for having inventory in a cluster that isn't being consumed by the cluster. I can't really think of good reasons for doing that off the top of my head.
I can imagine this enabling some great UX improvements. 👍 IIRC @dhellmann you once suggested the possibility of using a few Available/Ready (non-Provisioned) BMHs to create a brand new cluster using the first cluster as a sort of... bootstrap cluster? That might be easier than going through the usual install process and setting up a bootstrap node, and could be relatively common in (non-Edge) multi-cluster environments where nodes are roughly collocated. Maybe. 🤷‍♂️ This proposal doesn’t really preclude that flow I suppose. Some BMHs might just have to be deprovisioned before turning into a new cluster, which I’d expect to be a valid path regardless.
If it ends up being the case that this autoscaling behavior is desired more often than not, would it make sense for it to be on by default, with the annotation turning it off instead?
The OpenShift installer doesn't really support that today, but it could be made to work. And the v1alpha2 work being done in metal3 already supports this flow for standard Kubernetes clusters using a newer machine API.
Yeah, I think this proposal is asking us to go all-in on the idea that there is no unused inventory in a cluster.
I'm not completely convinced by this - in the OpenStack world operators generally complain about the fact that all of the hardware is always provisioned and in use. There's a real cost (in terms of electrical power consumption) to running servers that are not needed.

Currently the cluster-autoscaler does not integrate with the cluster-api, but when it does it seems to me that that's what you would want managing the MachineSet size.

One bare-metal-specific scenario that this does not account for is the simple case where you have only one cluster: there it would be advantageous to be able to keep all of the Hosts provisioned and only toggle the power as you bring them in and out of the cluster. My first impression though is that this would need to be handled at a level below the Machine API.

I could buy that in a hyperconverged storage scenario you might want to keep all of the available Hosts in the cluster all of the time. I wonder, though, if that could be better handled by rook (or whatever hyperconverged storage operator) tweaking the cluster-autoscaler parameters appropriately, rather than writing a competing autoscaler.
This is more understandable, although if there are insufficient Hosts available I don't think anything bad happens; you just get some Machines hanging around that can never turn into Nodes. I don't know whether or not the cluster-autoscaler will handle this case for you (i.e. notice that nothing bad is happening with the current number of Nodes, yet the MachineSet size is larger, therefore contract the MachineSet to match).
Powering down hardware when not needed is a different story than deprovisioning hardware when not needed. Provisioning is expensive and time-consuming. If we apply a cluster-autoscaler to a bare metal cluster, once the autoscaler decides it needs more capacity, it could easily be 30+ minutes (worse in many cases) before the new capacity is done provisioning and becomes available. Perhaps that's a constraint someone would be willing to live with, but we haven't received that request yet AFAIK.

It seems like scale-by-provisioning with that level of latency would be a better fit for workloads that are time-of-day specific; if you can anticipate when demand will increase, you can proactively begin re-provisioning dark hardware (like the thermostat in my house that turns on the heat ~30 minutes before I wake up). If we really wanted to pursue load-based cluster autoscaling with bare metal, I think we would be much better served looking at being able to suspend or hibernate systems rather than deprovision them.

In the meantime, we do have a multi-cluster use case where inventory is managed at a level above clusters. We're either going to build logic into that thing to scale MachineSets up and down as it adds and removes inventory in a specific cluster, or put that logic into the provider running on the cluster. I think doing it in the provider makes more sense and would enable more re-use. Since it's optional and opt-in (you have to annotate a MachineSet to get the behavior), there's no harm for someone who wants to scale their MachineSets another way.
It feels like we might be missing a concept like a BareMetalHostSet - where each Host in the set would be provisioned with the configuration defined in the MachineSet, but not powered on until it is associated with a Machine. I think we should try to avoid needing a baremetal-specific cluster-autoscaler cloud provider to implement these kinds of use cases.
Aren't at least some of the settings for the host time-sensitive? I'm thinking about the certs used for the host to identify itself and register with the cluster. Those have a limited lifetime, right? If we pre-provision a host, then power it off, when it boots again we might have to do more than power it on to make it usable.
Good question. If there is stuff that is specific to a particular Machine passed in the userdata then probably the best we can hope for is to be able to rebuild the host in Ironic to update the config-drive, but I assume that still involves rebooting into ironic-python-agent and back again, so it'd be roughly as slow as provisioning (IIUC it's mainly having to test that amount of RAM on startup that makes things so slow?).
I can see something like that being valuable in some cases, but we're getting into use cases that go well beyond the scope of this request. If we're interested in pursuing the ability to pre-provision hosts, adjust cluster size based on load, or power down hosts for energy conservation, let's make issues for those use cases and discuss them there.

Many users will just want to provision whatever hardware they tell the cluster about, and that's the use case I'm trying to address. Rather than make it a two-step process of 1) add or remove a BareMetalHost and 2) increment or decrement the corresponding MachineSet (an inherently imperative operation BTW), we can reduce that to one step by letting the user declare with an annotation that they want their MachineSet size to match what's available.

Are there objections to that? It's opt-in, requiring an annotation to be placed on the MachineSet, so default behavior is unchanged. The code is going to be written; if not here, then other tools that want to add and remove BareMetalHosts will need to implement it. For example, multi-cluster tooling that's coming together will need this behavior. I'd rather do it here so we can provide a consistent behavior and let one implementation be re-used. I'm also happy to implement it as long as nobody objects.
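For concreteness, a minimal sketch of what the opt-in check could look like; the annotation key below is a placeholder assumption, not a name this proposal has settled on:

```go
// Minimal sketch of the opt-in check, assuming a hypothetical annotation key.
package autoscale

// autoScaleAnnotation is a placeholder; the real key would be chosen by the
// provider when this is implemented.
const autoScaleAnnotation = "metal3.io/autoscale-machineset"

// shouldAutoScale reports whether a MachineSet has opted in to having its
// replica count track the number of matching BareMetalHosts.
func shouldAutoScale(annotations map[string]string) bool {
	_, ok := annotations[autoScaleAnnotation]
	return ok
}
```

MachineSets without the annotation would be left alone, so existing behavior stays the default.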
It seems like we're not near to figuring out the shape of the solution for those more complex use cases, so I agree we shouldn't block this.
In many clusters, it is desirable for the size of a MachineSet to always equal the number of matching BareMetalHosts. In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes, and they want to remove excess Machines in case they remove hosts from their cluster. This change adds a controller that scales MachineSets to the number of matching BareMetalHosts. The behavior is opt-in, requiring an annotation on the MachineSet. fixes #188
/kind feature
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues close after 30d of inactivity. Reopen the issue with /reopen. /close
@metal3-io-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In many clusters, it is desirable for the size of a MachineSet to always equal the number of matching BareMetalHosts. In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes, and they want to remove excess Machines in case they remove hosts from their cluster.
Rather than make some external process manage the size of MachineSets as BareMetalHosts come and go, we could create a small controller that (optionally) automatically ensures a MachineSet has a size equal to the number of matching BareMetalHosts.
The controller would be an additional Controller in this project. It would watch MachineSets as its primary resource, and if they have a particular annotation, ensure that their size equals the number of matching BareMetalHosts. It would watch BareMetalHosts as a secondary resource.

Thoughts?
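To make the idea concrete, a rough sketch of the reconcile decision described above, for discussion only; the types and annotation key are placeholder assumptions, not part of this proposal:

```go
// Sketch only: how the controller might decide the replica count for an
// annotated MachineSet. The real controller would read MachineSet and
// BareMetalHost objects through the Kubernetes client and requeue on
// BareMetalHost events.
package autoscale

// MachineSetInfo is a simplified stand-in for the fields the controller
// would read from a real MachineSet.
type MachineSetInfo struct {
	Annotations map[string]string
	Replicas    int32
}

// desiredReplicas returns the replica count the MachineSet should have and
// whether it needs updating. matchingHosts is the number of BareMetalHosts
// that match the MachineSet's host selector.
func desiredReplicas(ms MachineSetInfo, matchingHosts int) (int32, bool) {
	// Placeholder annotation key, same assumption as the earlier sketch.
	if _, optedIn := ms.Annotations["metal3.io/autoscale-machineset"]; !optedIn {
		return ms.Replicas, false // no annotation: leave the MachineSet alone
	}
	desired := int32(matchingHosts)
	return desired, desired != ms.Replicas
}
```

Watching BareMetalHosts as a secondary resource would simply map host add/remove events back to annotated MachineSets so this check re-runs whenever inventory changes.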