Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to disable NodeStartupTimeout on MachineHealthCheck #4468

Closed
JoelSpeed opened this issue Apr 12, 2021 · 13 comments · Fixed by #4471
Closed

Add option to disable NodeStartupTimeout on MachineHealthCheck #4468

JoelSpeed opened this issue Apr 12, 2021 · 13 comments · Fixed by #4471
Labels
area/api Issues or PRs related to the APIs kind/feature Categorizes issue or PR as related to a new feature.
Milestone

Comments

@JoelSpeed
Copy link
Contributor

User Story

As a user I would like to be able to specify a healthcheck that does not have a node startup timeout for so that I can handle complex healthcheck logic separately.

As a cluster fleet administrator, I would like to create MachineHealthChecks that are suitable for all clusters, regardless of platform. Since I cannot know what a reasonable NodeStartupTimeout is, I would like to disable this feature and allow my cluster admins to opt in by creating their own MHCs where appropriate.

Detailed Description

If a MachineHealthCheck can not cater for your health check needs, you may want to handle this health check somewhere else. For security reasons, you don't want to give this alternate health checker permission to delete Machine resources, but can give it permission to add a condition to a Node (eg. by using the kubelet credentials for the node it is operating on).

In this scenario, it is not suitable for a NodeStartupTimeout to apply. The MachineHealthCheck should react only to the signal from the custom health check component itself. Currently there is no way to disable the NodeStartupTimeout.

I think there are valid use cases for where you may want to just use the condition checks of MachineHealthCheck without the NodeStartupTimeout.

Anything else you would like to add:

I think based on the existing field, a sensible approach to setting a disabled value is to have an explicit option of -1 to disable the feature. We could use zero, however this is the zero value of the go type and so would be indistinguishable from an unset value, and hence would be defaulted.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 12, 2021
@vincepri
Copy link
Member

NodeStartupTimeout seems to be a metav1.Duration, we could distinguish between nil and 0s, although that makes defaulting a bit complicated. Looking at the defaulting code we have today, it seems that if the value is nil we default it to 10 minutes.

The current minimum we require is 30s, although we could allow folks to set it to 0s instead?

/area api

@k8s-ci-robot k8s-ci-robot added the area/api Issues or PRs related to the APIs label Apr 12, 2021
@vincepri
Copy link
Member

/area health

@JoelSpeed
Copy link
Contributor Author

I was having a play before with this and couldn't get it to work with 0, but just had another play and got it working. Though because a metav1.Duration is a string, you have to put it in quotes, so it looks like nodeStartupTimeout: "0".

I'm happy with either, though, I am wondering if the -1 value is a more obvious disable string, wondering if 0 could be interpreted in a different way?

@sbueringer
Copy link
Member

sbueringer commented Apr 12, 2021

Semantically "0" could be a bit problematic. I think it could imply that startup timeout is actually "0"

Although I guess there is precedent. E.g. in go http client if you don't set a timeout, it's 0 and 0 means indefinite, which is - I guess - what we want.

But I think it's still important that we shouldn't implicitly set an indefinite timeout and this should be chosen intentionally. It could be a bit strange for users to set the timeout to the Go default value of duration and then get a different result then when it's unset.

So I guess I'm pro-"let's use -1".

@vincepri
Copy link
Member

Most of the Kubernetes APIs I've seen have 0 to default to unlimited / disabled behaviors. I wouldn't want to necessarily introduce yet another value between nil or 0

@sbueringer
Copy link
Member

sbueringer commented Apr 12, 2021

Most of the Kubernetes APIs I've seen have 0 to default to unlimited / disabled behaviors. I wouldn't want to necessarily introduce yet another value between nil or 0

Do they also have different behaviour regarding explicitly set to zero value vs unset?

EDIT: Ah sorry missed that we have a duration pointer...

@vincepri
Copy link
Member

Do they also have different behaviour regarding explicitly set to zero value vs unset?

No, right now if it's nil, we default it to 10m and the minimum is 30s. We can allow folks to set it to 0s and disable it that way.

@enxebre
Copy link
Member

enxebre commented Apr 13, 2021

using 0s sounds reasonable to me.

In this scenario, it is not suitable for a NodeStartupTimeout to apply. The MachineHealthCheck should react only to the signal from the custom health check component itself. Currently there is no way to disable the NodeStartupTimeout.

@JoelSpeed what's the relation you seeing between having a custom health check and having an infinite NodeStartupTimeout?
A NodeStartupTimeout would be reasonable regardless of what/how deletion was triggered right?

@JoelSpeed
Copy link
Contributor Author

@enxebre in particular, the scenario we are looking at is spot instances. For security reasons we use MHC as a proxy to get deletion and draining. So we want an MHC that will react to our spot termination condition, but not react to startup timeout. It's acceptable for the system as a whole to react to these system initiated events, but users may not want the startup timeout to be in place in this scenario.

By allowing it to be disabled, we can allow for condition based health checks that can be applied to all platforms without fear of the node startup timeout being an unfavorable value for some of the platforms (eg AWS vs Baremetal)

@JoelSpeed
Copy link
Contributor Author

@michaelgugino just raised a question on the PR which warrants further discussion. (#4471 (comment))

Should we consider instead of making the option default to 10m, default to off with the reasoning that a user should have to opt in to the node startup timeout behaviour. I can see merit in this approach but am interested to know what others think. (CC @n1r1 @beekhof)

@vincepri
Copy link
Member

Should we consider instead of making the option default to 10m, default to off with the reasoning that a user should have to opt in to the node startup timeout behaviour. I can see merit in this approach but am interested to know what others think. (CC @n1r1 @beekhof)

Let's please consider this in another issue. This is a breaking behavioral change, and we might not be able to make it for v0.4.0 at this point.

@n1r1
Copy link
Contributor

n1r1 commented Apr 13, 2021

Should we consider instead of making the option default to 10m, default to off with the reasoning that a user should have to opt in to the node startup timeout behaviour. I can see merit in this approach but am interested to know what others think.

+1 for default to off. It makes sense.

@vincepri
Copy link
Member

/milestone v0.4

@k8s-ci-robot k8s-ci-robot added this to the v0.4 milestone Apr 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api Issues or PRs related to the APIs kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants