
Conversation

@alexander-demicev
Contributor

Add a maxAge field to MachineHealthCheck. If a machine exists for longer than the maximum allowed age, it is marked for remediation.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from alexander-demichev after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@elmiko elmiko left a comment


looks mostly good to me, one minor nit

// +optional
// +kubebuilder:validation:Pattern="^([0-9]+(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$"
// +kubebuilder:validation:Type:=string
MaxAge *metav1.Duration `json:"maxAge"`
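The validation pattern above mirrors Go's time.ParseDuration syntax. As a quick self-contained check (the variable name is mine, not the PR's), the raw regex behind the kubebuilder marker accepts compound durations and rejects unknown units:

```go
package main

import (
	"fmt"
	"regexp"
)

// maxAgePattern is the kubebuilder validation pattern from the review
// comment above, written as a raw regex (the doubled backslash in the
// marker is Go string escaping).
var maxAgePattern = regexp.MustCompile(`^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$`)

func main() {
	for _, s := range []string{"48h", "1.5h", "10m30s", "2d", "48"} {
		fmt.Println(s, maxAgePattern.MatchString(s))
	}
	// 48h, 1.5h, and 10m30s match; "2d" and a bare "48" do not.
}
```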
Contributor


Does this need to be a pointer? Could we just use zero as the disabled state? We could use omitempty so it doesn't show up when it is set to zero.

Contributor Author


I added omitempty

Contributor

@michaelgugino michaelgugino left a comment


We should compare instance provisioning time rather than the machine creation timestamp, as the two may differ significantly, by up to 2 hours today. In the future, I would like the 2 hour window for CSR approval to be based on instance creation time; this will allow the use case of creating a bunch of machines that spin until price/capacity is available for spot instances.

Also, the field name maxAge is not specific enough. We should use something like maxInstanceAge.

@enxebre
Member

enxebre commented Apr 8, 2021

How will we prevent all machines in a scalable resource, which might have very similar creation timestamps, from being deleted at the same time? Otherwise we drain multiple nodes at once and relocate workloads inefficiently.

if t.MHC.Spec.MaxAge != nil {
	machineCreationTime := t.Machine.CreationTimestamp.Time
	if machineCreationTime.Add(t.MHC.Spec.MaxAge.Duration).Before(now) {
		klog.V(3).Infof("%s: unhealthy: machine reached maximal age %v", t.string(), t.MHC.Spec.MaxAge.Duration)
	}
}


Beyond just logging a message, does the MHC generate k8s events or have Prometheus metrics?

We have some similar-ish node killer automation running in our OCP v3.11 environment and we found it very useful to create k8s events and have prometheus metrics for monitoring/tracking old nodes that are being deleted in our production environment.
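To make the suggestion concrete: the real controller would presumably use client-go's record.EventRecorder and a client_golang Prometheus counter. As a dependency-free sketch, stdlib expvar can stand in for the counter; every name below is hypothetical:

```go
package main

import (
	"expvar"
	"fmt"
)

// remediationsTotal stands in for a hypothetical Prometheus counter
// (e.g. a *_max_age_remediations_total metric); expvar keeps this
// sketch free of external dependencies.
var remediationsTotal = expvar.NewInt("max_age_remediations_total")

// recordMaxAgeRemediation bumps the counter and emits a message where
// the controller would call eventRecorder.Eventf on the Machine object.
func recordMaxAgeRemediation(machine string) {
	remediationsTotal.Add(1)
	fmt.Printf("event: machine %s exceeded max age, marked for remediation\n", machine)
}

func main() {
	recordMaxAgeRemediation("worker-0")
	recordMaxAgeRemediation("worker-1")
	fmt.Println(remediationsTotal.Value()) // 2
}
```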

@seanmalloy

How will we prevent all machines in a scalable resource, which might have very similar creation timestamps, from being deleted at the same time? Otherwise we drain multiple nodes at once and relocate workloads inefficiently.

Could this be successfully mitigated by having the user also create a MachineDisruptionBudget?

In our real-world OCP v3.11 cluster we delete one "old node" every two hours. I suppose even with an MDB the MHC could delete a lot more than one machine every two hours.

@openshift-ci-robot
Contributor

@alexander-demichev: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2021
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 9, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 17, 2021

@alexander-demichev: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 574836f link /test e2e-libvirt
ci/prow/e2e-gcp 574836f link /test e2e-gcp
ci/prow/e2e-azure 574836f link /test e2e-azure
ci/prow/e2e-vsphere-upgrade 574836f link /test e2e-vsphere-upgrade
ci/prow/e2e-gcp-operator 574836f link /test e2e-gcp-operator
ci/prow/e2e-vsphere 574836f link /test e2e-vsphere
ci/prow/e2e-metal-ipi 574836f link /test e2e-metal-ipi
ci/prow/e2e-azure-operator 574836f link /test e2e-azure-operator
ci/prow/e2e-aws 574836f link /test e2e-aws
ci/prow/e2e-metal-ipi-upgrade 574836f link /test e2e-metal-ipi-upgrade
ci/prow/e2e-metal-ipi-virtualmedia 574836f link /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-aws-operator 574836f link /test e2e-aws-operator
ci/prow/e2e-aws-upgrade 574836f link /test e2e-aws-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack 574836f link /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-vsphere-serial 574836f link /test e2e-vsphere-serial

Full PR test history. Your PR dashboard.


