New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to disable NodeStartupTimeout on MachineHealthCheck #4468
Comments
|
The current minimum we require is 30s, although we could allow folks to set it to 0s instead? /area api |
|
/area health |
|
I was having a play before with this and couldn't get it to work with 0, but just had another play and got it working. Though because a I'm happy with either, though, I am wondering if the |
|
Although I guess there is precedent. E.g. in go http client if you don't set a timeout, it's 0 and 0 means indefinite, which is - I guess - what we want. But I think it's still important that we shouldn't implicitly set an indefinite timeout and this should be chosen intentionally. It could be a bit strange for users to set the timeout to the Go default value of duration and then get a different result then when it's unset. So I guess I'm pro-"let's use -1". |
|
Most of the Kubernetes APIs I've seen have 0 to default to unlimited / disabled behaviors. I wouldn't want to necessarily introduce yet another value between nil or 0 |
Do they also have different behaviour regarding explicitly set to zero value vs unset? EDIT: Ah sorry missed that we have a duration pointer... |
No, right now if it's |
|
using 0s sounds reasonable to me.
@JoelSpeed what's the relation you seeing between having a custom health check and having an infinite NodeStartupTimeout? |
|
@enxebre in particular, the scenario we are looking at is spot instances. For security reasons we use MHC as a proxy to get deletion and draining. So we want an MHC that will react to our spot termination condition, but not react to startup timeout. It's acceptable for the system as a whole to react to these system initiated events, but users may not want the startup timeout to be in place in this scenario. By allowing it to be disabled, we can allow for condition based health checks that can be applied to all platforms without fear of the node startup timeout being an unfavorable value for some of the platforms (eg AWS vs Baremetal) |
|
@michaelgugino just raised a question on the PR which warrants further discussion. (#4471 (comment)) Should we consider instead of making the option default to 10m, default to off with the reasoning that a user should have to opt in to the node startup timeout behaviour. I can see merit in this approach but am interested to know what others think. (CC @n1r1 @beekhof) |
Let's please consider this in another issue. This is a breaking behavioral change, and we might not be able to make it for v0.4.0 at this point. |
+1 for default to off. It makes sense. |
|
/milestone v0.4 |
User Story
As a user I would like to be able to specify a healthcheck that does not have a node startup timeout for so that I can handle complex healthcheck logic separately.
As a cluster fleet administrator, I would like to create MachineHealthChecks that are suitable for all clusters, regardless of platform. Since I cannot know what a reasonable NodeStartupTimeout is, I would like to disable this feature and allow my cluster admins to opt in by creating their own MHCs where appropriate.
Detailed Description
If a MachineHealthCheck can not cater for your health check needs, you may want to handle this health check somewhere else. For security reasons, you don't want to give this alternate health checker permission to delete Machine resources, but can give it permission to add a condition to a Node (eg. by using the kubelet credentials for the node it is operating on).
In this scenario, it is not suitable for a NodeStartupTimeout to apply. The MachineHealthCheck should react only to the signal from the custom health check component itself. Currently there is no way to disable the NodeStartupTimeout.
I think there are valid use cases for where you may want to just use the condition checks of MachineHealthCheck without the NodeStartupTimeout.
Anything else you would like to add:
I think based on the existing field, a sensible approach to setting a disabled value is to have an explicit option of
-1to disable the feature. We could use zero, however this is the zero value of the go type and so would be indistinguishable from an unset value, and hence would be defaulted./kind feature
The text was updated successfully, but these errors were encountered: