Add maxAge field to MachineHealthChecks #843
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
elmiko left a comment:
looks mostly good to me, one minor nit
install/0000_30_machine-api-operator_07_machinehealthcheck.crd.yaml (outdated review thread, resolved)
```go
// +optional
// +kubebuilder:validation:Pattern="^([0-9]+(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$"
// +kubebuilder:validation:Type:=string
MaxAge *metav1.Duration `json:"maxAge"`
```
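As a quick, hedged illustration of what that validation pattern accepts, the sketch below compiles the same regular expression (written as a raw string, since the marker above doubles the backslash) and checks a few made-up candidate values:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Same pattern as the kubebuilder marker above.
	durationPattern := regexp.MustCompile(`^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$`)

	// Example values only: "168h", "1h30m" and "1.5h" match; "7d" does not,
	// because "d" is not an accepted unit; the empty string does not match.
	for _, v := range []string{"168h", "1h30m", "1.5h", "7d", ""} {
		fmt.Printf("%q -> %v\n", v, durationPattern.MatchString(v))
	}
}
```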
Does this need to be a pointer? Could we just use zero as the disabled state? We could use omitempty so the field doesn't show up when it is set to zero.
I added omitempty.
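For reference, a minimal sketch of the field with the pointer kept and omitempty added; this is a trimmed-down illustration, not the full MachineHealthCheckSpec:

```go
// Package name is for the sketch only.
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Sketch only: the real MachineHealthCheckSpec carries additional fields
// (selector, unhealthy conditions, maxUnhealthy, ...) that are omitted here.
type MachineHealthCheckSpec struct {
	// MaxAge is the longest a machine may exist before being considered
	// unhealthy. Keeping the pointer lets nil mean "disabled", and
	// omitempty keeps an unset value out of serialized output.
	// +optional
	// +kubebuilder:validation:Pattern="^([0-9]+(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$"
	// +kubebuilder:validation:Type:=string
	MaxAge *metav1.Duration `json:"maxAge,omitempty"`
}
```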
pkg/controller/machinehealthcheck/machinehealthcheck_controller_test.go (two outdated review threads, resolved)
michaelgugino left a comment:
We should compare instance provisioning time rather than the machine creation timestamp, as the two may differ significantly, by up to 2 hours today. In the future, I would like the 2-hour window for CSR approval to be based on instance creation time; this would allow the use case of creating a bunch of machines that spin until price/capacity becomes available for spot instances.
Also, the field name maxAge is not specific enough. We should use something like maxInstanceAge.
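To illustrate the distinction, here is a small, self-contained sketch; the provisionedAt parameter stands in for a hypothetical provider-populated provisioning timestamp, which neither this PR nor the Machine API currently exposes directly:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exceededMaxAge reports whether a machine whose instance came up at
// provisionedAt has outlived maxAge. Keying on provisioning time rather than
// the Machine object's CreationTimestamp avoids counting the gap between
// object creation and the instance actually running.
func exceededMaxAge(provisionedAt time.Time, maxAge metav1.Duration, now time.Time) bool {
	return provisionedAt.Add(maxAge.Duration).Before(now)
}

func main() {
	now := time.Now()
	maxAge := metav1.Duration{Duration: 2 * time.Hour}

	created := now.Add(-3 * time.Hour)     // Machine object created 3h ago
	provisioned := now.Add(-1 * time.Hour) // instance only came up 1h ago

	fmt.Println(exceededMaxAge(created, maxAge, now))     // true: a creation-time check would remediate
	fmt.Println(exceededMaxAge(provisioned, maxAge, now)) // false: a provisioning-time check would not
}
```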
How will we prevent all machines in a scalable resource, which may have very similar creation timestamps, from being deleted at the same time? Otherwise we could end up draining several nodes at once and relocating workloads inefficiently.
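One possible mitigation, offered purely as a sketch and not something this PR implements, is to stretch the effective max age per machine with a deterministic jitter keyed on the machine name, so machines created in the same scale-up cross the threshold at staggered times:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// jitteredMaxAge deterministically stretches maxAge by up to maxJitter per
// machine, keyed on the machine name, so machines with near-identical
// creation timestamps become "too old" at different reconciles.
func jitteredMaxAge(machineName string, maxAge, maxJitter time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(machineName))
	offset := time.Duration(h.Sum64() % uint64(maxJitter))
	return maxAge + offset
}

func main() {
	// Example machine names; each gets a different effective max age.
	for _, name := range []string{"worker-a", "worker-b", "worker-c"} {
		fmt.Println(name, jitteredMaxAge(name, 168*time.Hour, 4*time.Hour))
	}
}
```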
```go
if t.MHC.Spec.MaxAge != nil {
	machineCreationTime := t.Machine.CreationTimestamp.Time
	if machineCreationTime.Add(t.MHC.Spec.MaxAge.Duration).Before(now) {
		klog.V(3).Infof("%s: unhealthy: machine reached maximal age %v", t.string(), t.MHC.Spec.MaxAge.Duration)
```
Beyond just logging a message, does the MHC generate k8s events or expose Prometheus metrics?
We have some similar node-killer automation running in our OCP v3.11 environment, and we found it very useful to create k8s events and have Prometheus metrics for monitoring and tracking the old nodes being deleted in production.
Could this be successfully mitigated by having the user also create a MachineDisruptionBudget? In our real-world OCP v3.11 cluster we delete one "old node" every two hours. I suppose even with an MDB the MHC could delete a lot more than one machine every two hours.
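As a hedged sketch of the events-and-metrics idea above (the recorder wiring, metric name, and event reason are illustrative assumptions, not part of this PR):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/record"
)

// machinesRemediatedMaxAge counts machines marked unhealthy for exceeding
// maxAge. The metric name is illustrative, not one the operator defines.
var machinesRemediatedMaxAge = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "mapi_machine_max_age_remediations_total",
	Help: "Machines marked unhealthy because they exceeded spec.maxAge.",
})

func init() {
	prometheus.MustRegister(machinesRemediatedMaxAge)
}

// reportMaxAgeExceeded emits a Kubernetes event and bumps the counter. An
// ObjectReference stands in for the *Machine object here purely to keep the
// sketch self-contained.
func reportMaxAgeExceeded(recorder record.EventRecorder, ref *corev1.ObjectReference, maxAge metav1.Duration) {
	recorder.Eventf(ref, corev1.EventTypeNormal, "MachineMaxAgeReached",
		"machine exceeded configured maxAge of %v", maxAge.Duration)
	machinesRemediatedMaxAge.Inc()
}

func main() {
	// FakeRecorder lets the example run without a live cluster.
	recorder := record.NewFakeRecorder(10)
	ref := &corev1.ObjectReference{Kind: "Machine", Namespace: "openshift-machine-api", Name: "worker-a"}
	reportMaxAgeExceeded(recorder, ref, metav1.Duration{Duration: 168 * time.Hour})
}
```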
@alexander-demichev: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close.
/lifecycle rotten
@alexander-demichev: The following tests failed; say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Add a maxAge field to MachineHealthChecks. If a machine has existed for longer than the maximum allowed age, it is marked for remediation.
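A short sketch of how a consumer might set the new field in Go, assuming the MachineHealthCheck types live at the import path below (the path, namespace, and 168h value are illustrative assumptions; required fields such as the selector and unhealthy conditions are omitted for brevity):

```go
package main

import (
	"fmt"
	"time"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Illustrative only: consider worker machines older than one week unhealthy.
	mhc := machinev1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "workers-max-age",
			Namespace: "openshift-machine-api",
		},
		Spec: machinev1.MachineHealthCheckSpec{
			// MaxAge is the field added by this PR; it serializes as a
			// duration string such as "168h" in the custom resource.
			MaxAge: &metav1.Duration{Duration: 168 * time.Hour},
		},
	}
	fmt.Println(mhc.Name, mhc.Spec.MaxAge.Duration)
}
```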