New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1985073: use 1m resolution for control plane cpu alerts #1201
Conversation
/bugzilla refresh |
@tkashem: This pull request references Bugzilla bug 1985073, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@tkashem: An error was encountered querying GitHub for users with public email (kewang@redhat.com) for bug 1985073 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#abuse-rate-limits\",\n \"message\": \"You have triggered an abuse detection mechanism. Please wait a few minutes before you try again.\"\n}\n"
Please contact an administrator to resolve this issue, then request a bug refresh with In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@@ -42,7 +42,7 @@ spec: | |||
kube-apiservers are also under-provisioned. | |||
To fix this, increase the CPU and memory on your control plane nodes. | |||
expr: | | |||
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" ) | |||
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 90 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we're refactoring this, I think it's easier to read if we rephrase from 100 - (avg idle) > 90
to avg idle < 10
.
Also, this still doesn't account for holes in the node_cpu_seconds_total
metric. My understanding of the rate
call is that if the covered minute has any node_cpu_seconds_total
data, but node_cpu_seconds_total
is ticking up at 20% for 10s, while node_cpu_seconds_total
is missing for the other 50s, it will look like 0.2 * 0.1 = 0.02 = 2% idle, so the expr
would match despite 20% being > 10%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will look like 0.2 * 0.1 = 0.02 = 2% idle
rate
does extrapolation based on the slope of the first and the last sample under the window.
rate
also avoids extrapolating too far, extrapolation extends to half the sample interval when the first or the last sample is too far away from the window
so prometheus does not handle the missing data/gap as above. since rate
already extrapolates for us i don't see it's necessary for the alert to take into account any gap in its calculation.
(slack thread where we discussed it - https://coreos.slack.com/archives/C01CQA76KMX/p1626752724386400)
/retest |
2 similar comments
/retest |
/retest |
/assign @mfojtik |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mfojtik, tkashem The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
3 similar comments
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest |
2 similar comments
/retest |
/retest |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
/retest-required Please review the full test history for this PR and help us cut down flakes. |
@tkashem: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
6 similar comments
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
@tkashem: An error was encountered searching for external tracker bugs for bug 1985073 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
could not unmarshal response body: invalid character '<' looking for beginning of value
Please contact an administrator to resolve this issue, then request a bug refresh with In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
No description provided.