
[OCPCLOUD-923] add alerts for memory and cpu core limits #213

Merged

merged 1 commit into openshift:master on Jul 25, 2021

Conversation

@elmiko (Contributor) commented Jul 7, 2021

This change adds the alerts for when the cluster autoscaler is unable to
scale out due to reaching cpu or memory limits. It also updates the
alert documents.

ref: https://issues.redhat.com/browse/OCPCLOUD-923

@elmiko (Contributor, Author) commented Jul 7, 2021

this will need to wait until we do the 1.22 rebase of the cluster autoscaler to pick up the new metrics sources.
/hold

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jul 7, 2021
@elmiko (Contributor, Author) commented Jul 7, 2021

/cc @openshift/sre-alert-sme

@dofinn commented Jul 8, 2021

awesome job on this @elmiko +1

@elmiko (Contributor, Author) commented Jul 9, 2021

updated to drop severity from warning to info

@RiRa12621 commented
Lgtm from our end

@michaelgugino (Contributor) left a comment

/lgtm

Info is definitely the right severity level for this one.

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Jul 9, 2021
@JoelSpeed (Contributor) left a comment

This seems good to me, just curious about the default values mentioned, PTAL

The number of total cores in the cluster has exceeded the maximum number set on the
cluster autoscaler. This is calculated by summing the cpu capacity for all nodes
in the cluster and comparing that number against the maximum cores value set for the
cluster autoscaler (default 320000 cores).
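
For orientation, here is a minimal sketch of how such an alert rule could be expressed in PrometheusRule form. The alert name, the metric names (cluster_autoscaler_cluster_cpu_current_cores, cluster_autoscaler_cpu_limits_cores), and the duration below are assumptions based on the upstream cluster-autoscaler metrics, not necessarily the exact rule this PR ships:

# Hypothetical sketch only; metric names are assumed from upstream cluster-autoscaler
# and would need to be confirmed against the 1.22 rebase mentioned above.
- alert: ClusterAutoscalerUnableToScaleCPULimitReached
  expr: cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
  for: 15m
  labels:
    severity: info
  annotations:
    message: The cluster autoscaler is unable to scale out because the maximum number of cores for the cluster has been reached.

The info severity matches the level agreed on earlier in this conversation.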
Contributor:

Is this default correct? Seems absurdly high? Do we have a source for that value? If so, is it worth adding a link inline?

@elmiko (Contributor, Author) commented Jul 12, 2021

this is just the default if you set nothing on the cluster autoscaler, see this FAQ entry; also, we do allow not setting those limits in a ClusterAutoscaler

Member:

Could you add a description of how to get the current maximum settings? Like "You can run this to see the maximum cores configured in this cluster ...".
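
For reference, the configured limits live under spec.resourceLimits on the ClusterAutoscaler resource; a minimal sketch of what that looks like, assuming the conventional resource name default and purely illustrative limit values:

# Hypothetical example; values are illustrative. The current settings can be read with:
#   oc get clusterautoscaler default -o yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    cores:
      min: 8
      max: 128
    memory:   # memory min/max are specified in GiB
      min: 4
      max: 256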

Contributor:

@wanghaoran1988 Does this need to happen before we merge, or could that follow in a later release? I'm happy with this personally and would prefer to merge it today so it's in before feature freeze, but Mike is out on PTO today so he wouldn't have time to update this before FF

Member:

I'm ok with a follow-up release.

The number of total bytes of RAM in the cluster has exceeded the maximum number set on
the cluster autoscaler. This is calculated by summing the memory capacity for all nodes
in the cluster and comparing that number against the maximum memory bytes value set
for the cluster autoscaler (default 6400000 gigabytes).
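
As with the cores case above, a sketch of the memory-side expression, again assuming the upstream metric names:

# Hypothetical sketch only; metric names are assumed from upstream cluster-autoscaler.
- alert: ClusterAutoscalerUnableToScaleMemoryLimitReached
  expr: cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
  for: 15m
  labels:
    severity: info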
Contributor:

Same as above, where did this figure come from?

Contributor (Author):

same as above

@elmiko (Contributor, Author) commented Jul 22, 2021

reminder: update openshift/enhancements#538 when this merges

@wanghaoran1988 (Member) commented
/cc

@JoelSpeed (Contributor) commented
/approve
/retest-required

openshift-ci bot commented Jul 24, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 24, 2021
@elmiko (Contributor, Author) commented Jul 24, 2021

this PR technically needs the metric to be present, we will pick it up during the 1.22 rebase. i am removing the hold as we don't believe this will break anything by merging early.
/hold cancel

openshift-ci bot removed the do-not-merge/hold label on Jul 24, 2021
@elmiko (Contributor, Author) commented Jul 24, 2021

/test e2e-aws-operator

@openshift-bot (Contributor) commented
/retest-required

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@openshift-bot (Contributor) commented
/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

@elmiko (Contributor, Author) commented Jul 25, 2021

seeing this error several times in the logs

 framework.go:128] error querying api for ValidatingWebhookConfiguration: no matches for kind "ValidatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1beta1", retrying...

doesn't seem related to this PR, but makes me wonder if that is one of the types that got upgraded to v1 version?
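
For context, ValidatingWebhookConfiguration graduated to admissionregistration.k8s.io/v1 in Kubernetes 1.16, and the v1beta1 version was removed in 1.22, so a test client that still requests v1beta1 on a 1.22 cluster would see exactly this "no matches for kind" error. The fix on the test side would be to request the v1 group version, e.g.:

# Webhook configurations must be requested via the v1 API group on newer clusters.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration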

@elmiko (Contributor, Author) commented Jul 25, 2021

@JoelSpeed (Contributor) commented
The webhook test is a known perma-failure at the moment, we have a PR to our tests updating the webhook versions, but other tests are flaking, causing it not to have merged yet

Having reviewed the failures, I don't think any of them are related to this PR, this should be safe to merge

/override ci/prow/e2e-aws-operator

openshift-ci bot commented Jul 25, 2021

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-operator

In response to this:

The webhook test is a known perma-failure at the moment, we have a PR to our tests updating the webhook versions, but other tests are flaking, causing it not to have merged yet

Having reviewed the failures, I don't think any of them are related to this PR, this should be safe to merge

/override ci/prow/e2e-aws-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@elmiko (Contributor, Author) commented Jul 25, 2021

thanks Joel, i was just looking at that webhook version failure lol

@openshift-bot (Contributor) commented
/retest-required

Please review the full test history for this PR and help us cut down flakes.

@JoelSpeed (Contributor) commented
/override ci/prow/e2e-aws-operator

Re-override based on the above comment

openshift-ci bot commented Jul 25, 2021

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-operator

In response to this:

/override ci/prow/e2e-aws-operator

Re-override based on the above comment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot merged commit e5c1bd1 into openshift:master on Jul 25, 2021
@elmiko deleted the add-ca-alerts branch on July 26, 2021 13:19
Labels: approved, lgtm
8 participants