
[OCPCLOUD-923] add alerts for memory and cpu core limits #213

Merged

merged 1 commit into openshift:master on Jul 25, 2021

Conversation

@elmiko (Contributor) commented Jul 7, 2021

This change adds the alerts for when the cluster autoscaler is unable to
scale out due to reaching cpu or memory limits. It also updates the
alert documents.

ref: https://issues.redhat.com/browse/OCPCLOUD-923

@elmiko (Contributor, Author) commented Jul 7, 2021

this will need to wait until we do the 1.22 rebase of the cluster autoscaler to pick up the new metrics sources.
/hold

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jul 7, 2021
@elmiko (Contributor, Author) commented Jul 7, 2021

/cc @openshift/sre-alert-sme

@dofinn commented Jul 8, 2021

awesome job on this @elmiko +1

@elmiko (Contributor, Author) commented Jul 9, 2021

updated to drop severity from warning to info

@RiRa12621 commented
Lgtm from our end

@michaelgugino (Contributor) left a comment

/lgtm

Info is definitely the right severity level for this one.

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Jul 9, 2021
@JoelSpeed (Contributor) left a comment

This seems good to me, just curious about the default values mentioned, PTAL

The number of total cores in the cluster has exceeded the maximum number set on the
cluster autoscaler. This is calculated by summing the cpu capacity for all nodes
in the cluster and comparing that number against the maximum cores value set for the
cluster autoscaler (default 320000 cores).
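
For orientation, here is a minimal sketch of how such an alert rule could be expressed in PrometheusRule form. The alert name, the metric names (cluster_autoscaler_cluster_cpu_current_cores, cluster_autoscaler_cpu_limits_cores), and the duration below are assumptions based on the upstream cluster-autoscaler metrics, not necessarily the exact rule this PR ships:

# Hypothetical sketch only; metric names are assumed from upstream cluster-autoscaler
# and would need to be confirmed against the 1.22 rebase mentioned above.
- alert: ClusterAutoscalerUnableToScaleCPULimitReached
  expr: cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
  for: 15m
  labels:
    severity: info
  annotations:
    message: The cluster autoscaler is unable to scale out because the maximum number of cores for the cluster has been reached.

The info severity matches the level agreed on earlier in this conversation.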
Contributor:

Is this default correct? Seems absurdly high? Do we have a source for that value? If so, is it worth adding a link inline?

@elmiko (Contributor, Author) commented Jul 12, 2021

this is just the default if you set nothing on the cluster autoscaler, see this FAQ entry; also, we do allow not setting those limits in a ClusterAutoscaler

Member:

Could you add a description of how to get the current maximum settings? Like "You can run this to see the maximum cores configured in this cluster ...".
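
For reference, the configured limits live under spec.resourceLimits on the ClusterAutoscaler resource; a minimal sketch of what that looks like, assuming the conventional resource name default and purely illustrative limit values:

# Hypothetical example; values are illustrative. The current settings can be read with:
#   oc get clusterautoscaler default -o yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    cores:
      min: 8
      max: 128
    memory:   # memory min/max are specified in GiB
      min: 4
      max: 256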

Contributor:

@wanghaoran1988 Does this need to happen before we merge, or could that follow in a later release? I'm happy with this personally and would prefer to merge it today so it's in before feature freeze, but Mike is out on PTO today so he wouldn't have time to update this before FF

Member:

I'm ok with a follow-up release.

The number of total bytes of RAM in the cluster has exceeded the maximum number set on
the cluster autoscaler. This is calculated by summing the memory capacity for all nodes
in the cluster and comparing that number against the maximum memory bytes value set
for the cluster autoscaler (default 6400000 gigabytes).
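
As with the cores case above, a sketch of the memory-side expression, again assuming the upstream metric names:

# Hypothetical sketch only; metric names are assumed from upstream cluster-autoscaler.
- alert: ClusterAutoscalerUnableToScaleMemoryLimitReached
  expr: cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
  for: 15m
  labels:
    severity: info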
Contributor:

Same as above, where did this figure come from?

Contributor (Author):

same as above

@elmiko (Contributor, Author) commented Jul 22, 2021

reminder: update openshift/enhancements#538 when this merges

@wanghaoran1988 (Member) commented
/cc

@JoelSpeed (Contributor) commented
/approve
/retest-required

openshift-ci bot commented Jul 24, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 24, 2021
@elmiko (Contributor, Author) commented Jul 24, 2021

this PR technically needs the metric to be present, we will pick it up during the 1.22 rebase. i am removing the hold as we don't believe this will break anything by merging early.
/hold cancel

openshift-ci bot removed the do-not-merge/hold label on Jul 24, 2021
@elmiko (Contributor, Author) commented Jul 24, 2021

/test e2e-aws-operator

@openshift-bot (Contributor) commented
/retest-required

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@openshift-bot (Contributor) commented
/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

@elmiko (Contributor, Author) commented Jul 25, 2021

seeing this error several times in the logs

 framework.go:128] error querying api for ValidatingWebhookConfiguration: no matches for kind "ValidatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1beta1", retrying...

doesn't seem related to this PR, but makes me wonder if that is one of the types that got upgraded to v1 version?
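
For context, ValidatingWebhookConfiguration graduated to admissionregistration.k8s.io/v1 in Kubernetes 1.16, and the v1beta1 version was removed in 1.22, so a test client that still requests v1beta1 on a 1.22 cluster would see exactly this "no matches for kind" error. The fix on the test side would be to request the v1 group version, e.g.:

# Webhook configurations must be requested via the v1 API group on newer clusters.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration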

@elmiko (Contributor, Author) commented Jul 25, 2021

@JoelSpeed (Contributor) commented
The webhook test is a known perma-failure at the moment, we have a PR to our tests updating the webhook versions, but other tests are flaking, causing it not to have merged yet

Having reviewed the failures, I don't think any of them are related to this PR, this should be safe to merge

/override ci/prow/e2e-aws-operator

openshift-ci bot commented Jul 25, 2021

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-operator

In response to this:

The webhook test is a known perma-failure at the moment, we have a PR to our tests updating the webhook versions, but other tests are flaking, causing it not to have merged yet

Having reviewed the failures, I don't think any of them are related to this PR, this should be safe to merge

/override ci/prow/e2e-aws-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@elmiko (Contributor, Author) commented Jul 25, 2021

thanks Joel, i was just looking at that webhook version failure lol

@openshift-bot (Contributor) commented
/retest-required

Please review the full test history for this PR and help us cut down flakes.

@JoelSpeed (Contributor) commented
/override ci/prow/e2e-aws-operator

Re-override based on the above comment

openshift-ci bot commented Jul 25, 2021

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-operator

In response to this:

/override ci/prow/e2e-aws-operator

Re-override based on the above comment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot merged commit e5c1bd1 into openshift:master on Jul 25, 2021
@elmiko deleted the add-ca-alerts branch on July 26, 2021 13:19
Labels: approved, lgtm
8 participants