[OCPCLOUD-923] add alerts for memory and cpu core limits #213
Conversation
this will need to wait until we do the 1.22 rebase of the cluster autoscaler to pick up the new metrics sources.
/cc @openshift/sre-alert-sme
awesome job on this @elmiko +1
This change adds alerts for when the cluster autoscaler is unable to scale out due to reaching CPU or memory limits. It also updates the alert documents. ref: https://issues.redhat.com/browse/OCPCLOUD-923
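For orientation, a minimal sketch of how one might spot the gauges these alerts compare, assuming the upstream cluster-autoscaler metric names that should arrive with the 1.22 rebase mentioned above; the metric names, deployment name, and availability of curl in the container are all assumptions, not taken from this PR's diff:

```sh
# Hedged sketch: query the autoscaler's metrics endpoint (upstream default
# :8085) for the current-usage and limit gauges the new alerts compare.
# Assumes the deployment is named cluster-autoscaler-default and that the
# image ships curl; adjust for your cluster.
oc -n openshift-machine-api exec deploy/cluster-autoscaler-default -- \
  curl -s http://localhost:8085/metrics | \
  grep -E 'cluster_autoscaler_(cpu|memory)_limits|cluster_autoscaler_cluster_(cpu|memory)_current'
```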
updated to drop severity from warning to info
LGTM from our end
/lgtm
Info is definitely the right severity level for this one.
This seems good to me, just curious about the default values mentioned, PTAL
The number of total cores in the cluster has exceeded the maximum number set on the cluster autoscaler. This is calculated by summing the cpu capacity for all nodes in the cluster and comparing that number against the maximum cores value set for the cluster autoscaler (default 320000 cores).
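To make that prose concrete, here is one way to eyeball the "sum of the cpu capacity for all nodes" side of the comparison (an illustration only, not part of this PR's diff):

```sh
# List each node's reported CPU capacity; the autoscaler sums these values
# and compares the total against its configured maximum cores.
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.cpu}{"\n"}{end}'
```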
Is this default correct? It seems absurdly high. Do we have a source for that value? If so, is it worth adding a link inline?
this is just the default if you set nothing on the cluster autoscaler, see this FAQ entry. also, we do allow not setting those limits in a ClusterAutoscaler.
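For reference, those defaults come from upstream cluster-autoscaler flags; a quick way to confirm them for a given build (flag names and values per the upstream FAQ as I understand it, so verify against your version):

```sh
# Print the cluster-wide resource limit flags and their documented defaults.
# Expected on recent upstream releases: --cores-total 0:320000 and
# --memory-total 0:6400000 (min:max; memory measured in GiB).
cluster-autoscaler --help 2>&1 | grep -E 'cores-total|memory-total'
```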
Could you add a description of how to get the current maximum settings? Something like: "You can run this to see the maximum cores configured in this cluster ...". See the sketch below for one possibility.
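Something along these lines might do, assuming a ClusterAutoscaler named `default` with its limits under `spec.resourceLimits` (field paths per the OpenShift API; empty output would mean no limits are set and the upstream defaults apply):

```sh
# Hedged sketch: read the configured maximums from the ClusterAutoscaler.
# Assumes the resource is named "default"; the memory maximum is in GiB.
oc get clusterautoscaler default \
  -o jsonpath='max cores: {.spec.resourceLimits.cores.max}{"\n"}max memory (GiB): {.spec.resourceLimits.memory.max}{"\n"}'
```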
@wanghaoran1988 Does this need to happen before we merge, or could it follow in a later release? I'm happy with this personally and would prefer to merge it today so it's in before feature freeze, but Mike is out on PTO today so he wouldn't have time to update this before FF.
I'm ok with a follow-up release.
The number of total bytes of RAM in the cluster has exceeded the maximum number set on the cluster autoscaler. This is calculated by summing the memory capacity for all nodes in the cluster and comparing that number against the maximum memory bytes value set for the cluster autoscaler (default 6400000 gigabytes).
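And the memory-side analogue of that node-capacity sum (again, an illustration only):

```sh
# List each node's reported memory capacity (typically in Ki); the
# autoscaler sums these and compares the total against its maximum.
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.memory}{"\n"}{end}'
```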
Same as above, where did this figure come from?
same as above
reminder: update openshift/enhancements#538 when this merges
/cc
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
this PR technically needs the metric to be present, we will pick it up during the 1.22 rebase. i am removing the hold as we don't believe this will break anything by merging early.
/test e2e-aws-operator
/retest-required Please review the full test history for this PR and help us cut down flakes.
seeing this error several times in the logs
doesn't seem related to this PR, but makes me wonder if that is one of the types that got upgraded to
The webhook test is a known perma-failure at the moment; we have a PR to our tests updating the webhook versions, but other tests are flaking, which has kept it from merging. Having reviewed the failures, I don't think any of them are related to this PR, so this should be safe to merge.
/override ci/prow/e2e-aws-operator
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-operator
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
thanks Joel, i was just looking at that webhook version failure lol
/retest-required Please review the full test history for this PR and help us cut down flakes.
/override ci/prow/e2e-aws-operator
Re-override based on the above comment.
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-operator