Bug 1892338: metrics: Rework template_router_reload_failure metric #209
pkg/router/template/router.go: Replace the current template_router_reload_fails counter metric with a new gauge metric, template_router_reload_failure, that tracks the result of the most recent HAProxy reload. If a reload fails, template_router_reload_failure is set to 1 and stays there until a successful reload sets it back to 0.
Previously, the template_router_reload_fails counter would monotonically increase whenever an HAProxy reload failed. That counter is difficult to alert on, since a counter that has stopped increasing does not indicate whether the reload failures have actually been resolved.
The new boolean-based metric is trivial to alert on properly, since it holds "1" if and only if the most recent reload did not succeed.
@sgreene570: This pull request references Bugzilla bug 1892338, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
openshift/router#209 reworks the HAProxy reload failure metric so that the HAProxyReloadFail alert can be improved. The new template_router_reload_failure metric from openshift/router#209, which replaces template_router_reload_fails, is a simple boolean gauge, so the HAProxyReloadFail alert can fire for the full duration of an HAProxy reload outage. Previously, the alert would fire for ~5 minutes regardless of whether reloads were still failing on the router. Also drops the HAProxyReloadFail alert to warning severity.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Miciah, sgreene570, wking
/retest Please review the full test history for this PR and help us cut down flakes. |
Failing with |
/retest |
CI is back! |
@sgreene570: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged: These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Bugzilla bug 1892338 has not been moved to the MODIFIED state.
/cherry-pick release-4.6 |
@sgreene570: new pull request created: #215
/cherry-pick release-4.5 |
@sgreene570: new pull request created: #216