WIP [MGMT-2101] Added remediation history to MachineHealthChecks #760
Conversation
Force-pushed from 40e7252 to 069dec0 (Compare)
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Maybe Master vs. Worker would be a more useful distinction?
Also, how many entries do we keep?
The idea is to track whether an unhealthy machine (e.g. [0]) or a node with unhealthy conditions ([1]) triggered remediation.
At the moment the number of entries is hardcoded to 5 ([2] and [3]).
[0] https://github.com/openshift/machine-api-operator/pull/760/files#diff-8b3f455c5e13c63eb4eb480e223a91c2051decc3269398fd2f53196196a6033fR623
/retest
@slintes: The following tests failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I'm just starting to take a look at this, and as I was reading the description and comments I was wondering if our recent efforts to add more metrics on the MHC would help answer some of the underlying questions. For example, we recently added a metric for successful remediations (see #754); this metric updates as machines are remediated. Currently the metric only contains labels for the name and namespace of the MHC (not the remediated nodes). I don't know that it would solve the issues this PR is addressing, but it is another piece of data.
Hi Michael, thanks for pointing out the new metrics.
Makes sense to me.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now, please do so with /close.
/lifecycle stale
/remove-lifecycle stale |
@slintes: PR needs rebase.
@slintes: The following tests failed:
/close
Our team is more focused on NHC than MHC these days. I will create a new PR once we decide to revisit this topic.
@slintes: Closed this PR.
Thanks @slintes
In order to better understand remediations, we want to track a limited number of remediations in the MHC status, including the target machine/node, the reason, and timestamps for when the unhealthy machine/node was detected, when remediation started, when the node was fenced (= deleted), and when remediation finished (node ready again).
TODO: add tests
initial feedback welcome :)
Example: