
Conversation

@alexander-demicev
Contributor

Add a maxAge field to MachineHealthCheck. If a machine exists for longer than the maximum allowed age, it is marked for remediation.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from alexander-demichev after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@elmiko elmiko left a comment


looks mostly good to me, one minor nit

// +optional
// +kubebuilder:validation:Pattern="^([0-9]+(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$"
// +kubebuilder:validation:Type:=string
MaxAge *metav1.Duration `json:"maxAge"`
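The validation pattern above mirrors Go's time.ParseDuration syntax. As a quick self-contained check (the variable name is mine, not the PR's), the raw regex behind the kubebuilder marker accepts compound durations and rejects unknown units:

```go
package main

import (
	"fmt"
	"regexp"
)

// maxAgePattern is the kubebuilder validation pattern from the review
// comment above, written as a raw regex (the doubled backslash in the
// marker is Go string escaping).
var maxAgePattern = regexp.MustCompile(`^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$`)

func main() {
	for _, s := range []string{"48h", "1.5h", "10m30s", "2d", "48"} {
		fmt.Println(s, maxAgePattern.MatchString(s))
	}
	// 48h, 1.5h, and 10m30s match; "2d" and a bare "48" do not.
}
```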
Contributor


Does this need to be a pointer? Could we just use zero as the disabled state? We could use omitempty so it doesn't show up when it is set to zero.

Contributor Author


I added omitempty

Contributor

@michaelgugino michaelgugino left a comment


We should compare instance provisioning time rather than the machine creation timestamp, as the two may differ significantly, by up to 2 hours today. In the future, I would like the 2 hour window for CSR approval to be based on instance creation time; this will allow the use case of creating a bunch of machines that spin until price/capacity is available for spot instances.

Also, the field name maxAge is not specific enough. We should use something like maxInstanceAge.

@enxebre
Member

enxebre commented Apr 8, 2021

How will we prevent all machines in a scalable resource, which might have very similar creation timestamps, from being deleted at the same time? Otherwise we drain multiple nodes at once and relocate workloads inefficiently.

if t.MHC.Spec.MaxAge != nil {
	machineCreationTime := t.Machine.CreationTimestamp.Time
	if machineCreationTime.Add(t.MHC.Spec.MaxAge.Duration).Before(now) {
		klog.V(3).Infof("%s: unhealthy: machine reached maximal age %v", t.string(), t.MHC.Spec.MaxAge.Duration)
	}
}


Beyond just logging a message, does the MHC generate k8s events or have Prometheus metrics?

We have some similar-ish node killer automation running in our OCP v3.11 environment and we found it very useful to create k8s events and have prometheus metrics for monitoring/tracking old nodes that are being deleted in our production environment.
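To make the suggestion concrete: the real controller would presumably use client-go's record.EventRecorder and a client_golang Prometheus counter. As a dependency-free sketch, stdlib expvar can stand in for the counter; every name below is hypothetical:

```go
package main

import (
	"expvar"
	"fmt"
)

// remediationsTotal stands in for a hypothetical Prometheus counter
// (e.g. a *_max_age_remediations_total metric); expvar keeps this
// sketch free of external dependencies.
var remediationsTotal = expvar.NewInt("max_age_remediations_total")

// recordMaxAgeRemediation bumps the counter and emits a message where
// the controller would call eventRecorder.Eventf on the Machine object.
func recordMaxAgeRemediation(machine string) {
	remediationsTotal.Add(1)
	fmt.Printf("event: machine %s exceeded max age, marked for remediation\n", machine)
}

func main() {
	recordMaxAgeRemediation("worker-0")
	recordMaxAgeRemediation("worker-1")
	fmt.Println(remediationsTotal.Value()) // 2
}
```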

@seanmalloy

How will we prevent all machines in a scalable resource, which might have very similar creation timestamps, from being deleted at the same time? Otherwise we drain multiple nodes at once and relocate workloads inefficiently.

Could this be successfully mitigated by having the user also create a MachineDisruptionBudget?

In our real-world OCP v3.11 cluster we delete one "old node" every two hours. I suppose even with an MDB the MHC could delete a lot more than one machine every two hours.

@openshift-ci-robot
Contributor

@alexander-demichev: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2021
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 9, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 17, 2021

@alexander-demichev: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 574836f link /test e2e-libvirt
ci/prow/e2e-gcp 574836f link /test e2e-gcp
ci/prow/e2e-azure 574836f link /test e2e-azure
ci/prow/e2e-vsphere-upgrade 574836f link /test e2e-vsphere-upgrade
ci/prow/e2e-gcp-operator 574836f link /test e2e-gcp-operator
ci/prow/e2e-vsphere 574836f link /test e2e-vsphere
ci/prow/e2e-metal-ipi 574836f link /test e2e-metal-ipi
ci/prow/e2e-azure-operator 574836f link /test e2e-azure-operator
ci/prow/e2e-aws 574836f link /test e2e-aws
ci/prow/e2e-metal-ipi-upgrade 574836f link /test e2e-metal-ipi-upgrade
ci/prow/e2e-metal-ipi-virtualmedia 574836f link /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-aws-operator 574836f link /test e2e-aws-operator
ci/prow/e2e-aws-upgrade 574836f link /test e2e-aws-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack 574836f link /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-vsphere-serial 574836f link /test e2e-vsphere-serial

Full PR test history. Your PR dashboard.


