WIP [MGMT-2101] Added remediation history to MachineHealthChecks #760

slintes · 2020-11-24T17:50:59Z

To better understand remediations, we want to track a limited number of remediations in the MHC status, including the target machine / node, the reason, and timestamps for when the unhealthy machine / node was detected, when remediation started, when the node was fenced (= deleted), and when remediation finished (the node is ready again).

TODO: add tests

initial feedback welcome :)

Example:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
  creationTimestamp: "2020-11-24T16:31:22Z"
spec:
  maxUnhealthy: 100%
  nodeStartupTimeout: 60m
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - status: "False"
    timeout: 20s
    type: Ready
  - status: Unknown
    timeout: 20s
    type: Ready
status:
  conditions:
  - lastTransitionTime: "2020-11-24T17:17:32Z"
    status: "True"
    type: RemediationAllowed
  currentHealthy: 2
  expectedMachines: 2
  remediationHistory:
  - conditionStatus: Unknown
    conditionType: Ready
    detected: "2020-11-24T17:33:43Z"
    fenced: "2020-11-24T17:34:35Z"
    finished: "2020-11-24T17:35:09Z"
    remediationType: external
    started: "2020-11-24T17:34:04Z"
    targetKind: Node
    targetName: worker-1
  remediationsAllowed: 2

openshift-ci-robot · 2020-11-24T17:51:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign danil-grigorev after the PR has been reviewed.
You can assign the PR to them by writing /assign @danil-grigorev in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Marc Sluiter <msluiter@redhat.com>

beekhof · 2020-11-24T23:34:37Z

targetKind: Node

Maybe Master vs. Worker would be a more useful distinction?

beekhof · 2020-11-24T23:35:14Z

also, how many entries do we keep?

slintes · 2020-11-25T08:13:14Z

targetKind: Node

Maybe Master vs. Worker would be a more useful distinction?

The idea is to track whether an unhealthy machine (e.g. [0]) or a node with unhealthy conditions ([1]) triggered remediation.
Whether it's a master or a worker should be visible from the name.

also, how many entries do we keep?

At the moment it's hardcoded to 5 ([2] and [3]).

[0] https://github.com/openshift/machine-api-operator/pull/760/files#diff-8b3f455c5e13c63eb4eb480e223a91c2051decc3269398fd2f53196196a6033fR623
[1] https://github.com/openshift/machine-api-operator/pull/760/files#diff-8b3f455c5e13c63eb4eb480e223a91c2051decc3269398fd2f53196196a6033fR666
[2] https://github.com/openshift/machine-api-operator/pull/760/files#diff-3c70c3e9ec89f59adc85598b5ea78b19a08270743277d631bb0dcfad4d622e49R12
[3] https://github.com/openshift/machine-api-operator/pull/760/files#diff-3c70c3e9ec89f59adc85598b5ea78b19a08270743277d631bb0dcfad4d622e49R74
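As a rough illustration of keeping a bounded history, the following sketch appends an entry and trims the list to a fixed size. The limit of 5 comes from the comment above; `maxHistory`, `entry`, and `appendCapped` are hypothetical names, and dropping the oldest entries first is an assumption, not necessarily what the linked code does.

```go
package main

import "fmt"

// maxHistory mirrors the hardcoded limit of 5 mentioned above.
// The constant name is hypothetical.
const maxHistory = 5

// entry is a stand-in for one remediation history item.
type entry struct {
	TargetName string
}

// appendCapped appends e and keeps only the newest `limit` entries,
// assuming the oldest entries are the ones to drop.
func appendCapped(history []entry, e entry, limit int) []entry {
	history = append(history, e)
	if len(history) > limit {
		history = history[len(history)-limit:]
	}
	return history
}

func main() {
	var history []entry
	for i := 0; i < 7; i++ {
		e := entry{TargetName: fmt.Sprintf("worker-%d", i)}
		history = appendCapped(history, e, maxHistory)
	}
	// Only the five newest entries remain: worker-2 through worker-6.
	for _, e := range history {
		fmt.Println(e.TargetName)
	}
}
```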

slintes · 2020-11-25T08:21:28Z

/retest

openshift-merge-robot · 2020-11-25T10:27:34Z

@slintes: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-libvirt | `883fb39` | link | `/test e2e-libvirt` |
| ci/prow/e2e-aws-workers-rhel7 | `883fb39` | link | `/test e2e-aws-workers-rhel7` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

elmiko · 2020-12-01T16:59:14Z

i'm just starting to take a look at this, and as i was reading the description and comments i was wondering if our recent efforts to add more metrics on the mhc would help answer some of the underlying questions.

for example, we recently added a metric for successful remediations (see #754), this metric will update as machines are remediated. currently the metric only contains labels for the name and namespace of the mhc (not the nodes remediated). i don't know that it would solve the issues this pr is addressing, but it is another piece of data.

slintes · 2020-12-01T21:00:53Z

Hi Michael, thanks for pointing out the new metrics.
The context for our work is that we'd like to show the remediation history in the UI. It will be hard for them to parse metrics (or events) and calculate the actual state of recent and ongoing remediations from those. That's why we want to add the history to the MHC status.

elmiko · 2020-12-01T21:09:56Z

The context for our work is that we'd like to show the remediation history in the UI. It will be hard for them to parse metrics (or events) and calculate the actual state of recent and ongoing remediations from those. That's why we want to add the history to the MHC status.

makes sense to me

openshift-bot · 2021-03-02T01:36:30Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

slintes · 2021-03-04T10:20:15Z

/remove-lifecycle stale
/lifecycle frozen

openshift-ci · 2021-08-17T15:37:25Z

@slintes: PR needs rebase.

openshift-ci · 2022-11-04T09:38:46Z

@slintes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | `883fb39` | link | | `/test e2e-aws` |
| ci/prow/e2e-aws-operator | `883fb39` | link | | `/test e2e-aws-operator` |
| ci/prow/e2e-aws-upgrade | `883fb39` | link | | `/test e2e-aws-upgrade` |
| ci/prow/e2e-metal-ipi | `883fb39` | link | | `/test e2e-metal-ipi` |
| ci/prow/e2e-metal-ipi-ovn-dualstack | `883fb39` | link | | `/test e2e-metal-ipi-ovn-dualstack` |
| ci/prow/e2e-metal-ipi-upgrade | `883fb39` | link | | `/test e2e-metal-ipi-upgrade` |
| ci/prow/e2e-metal-ipi-virtualmedia | `883fb39` | link | | `/test e2e-metal-ipi-virtualmedia` |
| ci/prow/e2e-vsphere-serial | `883fb39` | link | | `/test e2e-vsphere-serial` |
| ci/prow/verify-crds-sync | `883fb39` | link | true | `/test verify-crds-sync` |
| ci/prow/e2e-aws-ovn | `883fb39` | link | true | `/test e2e-aws-ovn` |
| ci/prow/e2e-vsphere-ovn-serial | `883fb39` | link | true | `/test e2e-vsphere-ovn-serial` |
| ci/prow/e2e-aws-ovn-upgrade | `883fb39` | link | true | `/test e2e-aws-ovn-upgrade` |

slintes · 2022-11-04T10:21:00Z

/close

Our team is more focused on NHC than MHC these days. I will create a new PR once we decide to revisit this topic.

openshift-ci · 2022-11-04T10:21:10Z

@slintes: Closed this PR.

In response to this:

/close

Our team is more focused on NHC than MHC these days. I will create a new PR once we decide to revisit this topic.

elmiko · 2022-11-04T19:48:31Z

thanks @slintes

openshift-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Nov 24, 2020

openshift-ci-robot requested review from beekhof and paulfantom November 24, 2020 17:51

slintes changed the title ~~[MGMT-2101] Added remediation history to MachineHealthChecks~~ WIP [MGMT-2101] Added remediation history to MachineHealthChecks Nov 24, 2020

openshift-ci-robot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Nov 24, 2020

Added remediation history to MachineHealthChecks

069dec0

Signed-off-by: Marc Sluiter <msluiter@redhat.com>

slintes force-pushed the remediation-history branch from 40e7252 to 069dec0 November 24, 2020 18:00

openshift-ci-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Nov 24, 2020

Workaround for failing unit test

883fb39

Signed-off-by: Marc Sluiter <msluiter@redhat.com>

openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Mar 2, 2021

openshift-ci-robot added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Mar 4, 2021

openshift-ci bot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Aug 17, 2021

openshift-ci bot closed this Nov 4, 2022
