Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP [MGMT-2101] Added remediation history to MachineHealthChecks #760

Closed
wants to merge 2 commits into from

Conversation

slintes
Copy link
Member

@slintes slintes commented Nov 24, 2020

In order to better understand remediations, we want to track a limited amount of remediation in the MHC status, with the target machine / node, reason, and the timestamps when the unhealthy machine / node was detected, when remediation started, when the node is fenced (=deleted), and when remediation is done (node is ready again).

TODO: add tests

initial feedback welcome :)

Example:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
  creationTimestamp: "2020-11-24T16:31:22Z"
spec:
  maxUnhealthy: 100%
  nodeStartupTimeout: 60m
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - status: "False"
    timeout: 20s
    type: Ready
  - status: Unknown
    timeout: 20s
    type: Ready
status:
  conditions:
  - lastTransitionTime: "2020-11-24T17:17:32Z"
    status: "True"
    type: RemediationAllowed
  currentHealthy: 2
  expectedMachines: 2
  remediationHistory:
  - conditionStatus: Unknown
    conditionType: Ready
    detected: "2020-11-24T17:33:43Z"
    fenced: "2020-11-24T17:34:35Z"
    finished: "2020-11-24T17:35:09Z"
    remediationType: external
    started: "2020-11-24T17:34:04Z"
    targetKind: Node
    targetName: worker-1
  remediationsAllowed: 2

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 24, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign danil-grigorev after the PR has been reviewed.
You can assign the PR to them by writing /assign @danil-grigorev in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes slintes changed the title [MGMT-2101] Added remediation history to MachineHealthChecks WIP [MGMT-2101] Added remediation history to MachineHealthChecks Nov 24, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 24, 2020
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 24, 2020
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
@beekhof
Copy link
Contributor

beekhof commented Nov 24, 2020

targetKind: Node

Maybe Master vs. Worker would be a more useful destinction?

@beekhof
Copy link
Contributor

beekhof commented Nov 24, 2020

also, how many entries do we keep?

@slintes
Copy link
Member Author

slintes commented Nov 25, 2020

@slintes
Copy link
Member Author

slintes commented Nov 25, 2020

/retest

@openshift-merge-robot
Copy link
Contributor

openshift-merge-robot commented Nov 25, 2020

@slintes: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 883fb39 link /test e2e-libvirt
ci/prow/e2e-aws-workers-rhel7 883fb39 link /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@elmiko
Copy link
Contributor

elmiko commented Dec 1, 2020

i'm just starting to take a look at this, and as i was reading the description and comments i was wondering if our recent efforts to add more metrics on the mhc would help answer some of the underlying questions.

for example, we recently added a metric for successful remediations (see #754), this metric will update as machines are remediated. currently the metric only contains labels for the name and namespace of the mhc (not the nodes remediated). i don't know that it would solve the issues this pr is addressing, but it is another piece of data.

@slintes
Copy link
Member Author

slintes commented Dec 1, 2020

Hi Michael, thanks for pointing out the new metrics.
The context for our work is that we'd like to show the remediation history in the UI. It will we be hard for them to parse metrics (or events) and calculate the actual state of recent and ongoing remediations from it. That's why we want to add them to the MHC status.

@elmiko
Copy link
Contributor

elmiko commented Dec 1, 2020

The context for our work is that we'd like to show the remediation history in the UI. It will we be hard for them to parse metrics (or events) and calculate the actual state of recent and ongoing remediations from it. That's why we want to add them to the MHC status.

makes sense to me

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2021
@slintes
Copy link
Member Author

slintes commented Mar 4, 2021

/remove-lifecycle stale
/lifecycle frozen

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 4, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 17, 2021

@slintes: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 4, 2022

@slintes: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws 883fb39 link /test e2e-aws
ci/prow/e2e-aws-operator 883fb39 link /test e2e-aws-operator
ci/prow/e2e-aws-upgrade 883fb39 link /test e2e-aws-upgrade
ci/prow/e2e-metal-ipi 883fb39 link /test e2e-metal-ipi
ci/prow/e2e-metal-ipi-ovn-dualstack 883fb39 link /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-metal-ipi-upgrade 883fb39 link /test e2e-metal-ipi-upgrade
ci/prow/e2e-metal-ipi-virtualmedia 883fb39 link /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-vsphere-serial 883fb39 link /test e2e-vsphere-serial
ci/prow/verify-crds-sync 883fb39 link true /test verify-crds-sync
ci/prow/e2e-aws-ovn 883fb39 link true /test e2e-aws-ovn
ci/prow/e2e-vsphere-ovn-serial 883fb39 link true /test e2e-vsphere-ovn-serial
ci/prow/e2e-aws-ovn-upgrade 883fb39 link true /test e2e-aws-ovn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@slintes
Copy link
Member Author

slintes commented Nov 4, 2022

/close

Our team is more focused on NHC than MHC these days. I will create a new PR once we decide to revisit this topic.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 4, 2022

@slintes: Closed this PR.

In response to this:

/close

Our team is more focused on NHC than MHC these days. I will create a new PR once we decide to revisit this topic.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Nov 4, 2022
@elmiko
Copy link
Contributor

elmiko commented Nov 4, 2022

thanks @slintes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants