Bug 1985073: use 1m resolution for control plane cpu alerts #1201

Merged: 1 commit into openshift:master from bz-1985073 on Aug 9, 2021

@tkashem tkashem commented Aug 4, 2021

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Aug 4, 2021
@openshift-ci openshift-ci bot requested review from mfojtik and soltysh August 4, 2021 16:08
tkashem commented Aug 4, 2021

[2 images attached]

Comparing 5m and 2m against 1m, it looks like we should go with the 1m resolution.

@tkashem tkashem changed the title from "[WIP] use 1m resolution for control plane cpu alerts" to "Bug 1985073: use 1m resolution for control plane cpu alerts" Aug 4, 2021
tkashem commented Aug 4, 2021

/bugzilla refresh

@openshift-ci openshift-ci bot added the bugzilla/severity-medium label (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) Aug 4, 2021
openshift-ci bot commented Aug 4, 2021

@tkashem: This pull request references Bugzilla bug 1985073, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/valid-bug label (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) Aug 4, 2021
@openshift-ci openshift-ci bot requested a review from wangke19 August 4, 2021 16:34
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Aug 4, 2021
openshift-ci bot commented Aug 4, 2021

@tkashem: An error was encountered querying GitHub for users with public email (kewang@redhat.com) for bug 1985073 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#abuse-rate-limits\",\n \"message\": \"You have triggered an abuse detection mechanism. Please wait a few minutes before you try again.\"\n}\n"

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

Bug 1985073: use 1m resolution for control plane cpu alerts


@@ -42,7 +42,7 @@ spec:
 kube-apiservers are also under-provisioned.
 To fix this, increase the CPU and memory on your control plane nodes.
 expr: |
-100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
+100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 90 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
Member


While we're refactoring this, I think it's easier to read if we rephrase from 100 - (avg idle) > 90 to avg idle < 10.

Also, this still doesn't account for holes in the node_cpu_seconds_total metric. My understanding of the rate call is that if the covered minute has any node_cpu_seconds_total data, but the counter is ticking up at 20% for 10s while it is missing for the other 50s, it will look like 0.2 * 0.1 = 0.02 = 2% idle, so the expr would match despite 20% being > 10%.
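
A minimal sketch of the rephrasing suggested at the start of this comment, assuming the averaged idle rate is a fraction between 0 and 1. The function names are hypothetical and this is plain Python rather than PromQL; it only illustrates that the two threshold forms select the same values away from the exact boundary.

```python
def fires_busy_form(idle_ratio: float) -> bool:
    """Current form of the alert condition: 100 - (avg idle * 100) > 90."""
    return 100 - idle_ratio * 100 > 90


def fires_idle_form(idle_ratio: float) -> bool:
    """Suggested form of the alert condition: avg idle * 100 < 10."""
    return idle_ratio * 100 < 10


if __name__ == "__main__":
    # Hypothetical idle fractions; the exact 0.1 boundary is avoided on
    # purpose because floating-point rounding could differ between the forms.
    for idle in (0.0, 0.02, 0.05, 0.25, 0.5, 1.0):
        print(idle, fires_busy_form(idle), fires_idle_form(idle))
```

Both forms fire only when less than 10% of CPU time is idle; the second simply states that threshold directly.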

Contributor Author


"it will look like 0.2 * 0.1 = 0.02 = 2% idle"

rate extrapolates based on the slope between the first and the last sample in the window.
rate also avoids extrapolating too far: when the first or the last sample is too far from the window boundary, the extrapolation extends only half a sample interval.
So Prometheus does not handle the missing data/gap as described above. Since rate already extrapolates for us, I don't think the alert needs to account for gaps in its calculation.

(slack thread where we discussed it - https://coreos.slack.com/archives/C01CQA76KMX/p1626752724386400)
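
The extrapolation rule described in this reply can be sketched roughly as follows. This is a simplified Python model, not Prometheus source: the function name `extrapolated_rate`, the sample values, and the 1.1 threshold factor are assumptions for illustration, and counter resets and the clamp at zero are left out.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def extrapolated_rate(
    samples: Sequence[Sample], range_start: float, range_end: float
) -> Optional[float]:
    """Simplified model of rate() over the window [range_start, range_end].

    Assumes samples are sorted by timestamp, lie inside the window, and
    contain no counter resets.
    """
    if len(samples) < 2:
        return None  # fewer than two samples: no rate can be computed

    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]
    increase = last_v - first_v
    sampled_interval = last_t - first_t
    avg_interval = sampled_interval / (len(samples) - 1)

    # Assumed threshold: a sample counts as "close" to a window boundary
    # if it is within 1.1x the average spacing between samples.
    threshold = avg_interval * 1.1

    extrapolate_to = sampled_interval
    for gap in (first_t - range_start, range_end - last_t):
        if gap < threshold:
            extrapolate_to += gap  # extend all the way to the boundary
        else:
            extrapolate_to += avg_interval / 2  # extend only half an interval

    # Scale the observed increase to the extrapolated span, then divide by
    # the full window length to get a per-second rate.
    scaled_increase = increase * (extrapolate_to / sampled_interval)
    return scaled_increase / (range_end - range_start)


if __name__ == "__main__":
    # Idle counter growing at 0.2 cpu-seconds per second, scraped every 15s.
    covered = [(2.0, 100.0), (17.0, 103.0), (32.0, 106.0), (47.0, 109.0)]
    # Same growth, but samples exist only for the last 10s of the window.
    sparse = [(50.0, 200.0), (55.0, 201.0), (60.0, 202.0)]
    print(extrapolated_rate(covered, 0.0, 60.0))
    print(extrapolated_rate(sparse, 0.0, 60.0))
```

In this model, four evenly scraped samples that nearly span the 60s window give back the underlying 0.2 per second. Three samples confined to the last 10s of the window give roughly 0.042 per second, because the extrapolated increase is still divided by the full window length.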

tkashem commented Aug 5, 2021

/retest

2 similar comments

tkashem commented Aug 9, 2021

/assign @mfojtik

mfojtik commented Aug 9, 2021

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Aug 9, 2021
openshift-ci bot commented Aug 9, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik, tkashem


@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Aug 9, 2021
@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

tkashem commented Aug 9, 2021

/retest

2 similar comments


openshift-ci bot commented Aug 9, 2021

@tkashem: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Rerun command
ci/prow/e2e-gcp-operator-single-node | d39789f | link | /test e2e-gcp-operator-single-node



@openshift-ci openshift-ci bot merged commit a107994 into openshift:master Aug 9, 2021
openshift-ci bot commented Aug 9, 2021

@tkashem: An error was encountered searching for external tracker bugs for bug 1985073 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. could not unmarshal response body: invalid character '<' looking for beginning of value

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.


@tkashem tkashem deleted the bz-1985073 branch August 23, 2021 14:23
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/severity-medium: Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • lgtm: Indicates that a PR is ready to be merged.

4 participants