[OCPCLOUD-923] add alerts for memory and cpu core limits #213
Merged
@@ -59,3 +59,55 @@ should investigate the logs associated with your cloud provider controllers and
the Machine API resources to discover the root cause. For more information on
why nodes, or machines, might not become ready please see the
[Machine API FAQ](https://github.com/openshift/machine-api-operator/blob/master/FAQ.md).

## ClusterAutoscalerUnableToScaleCPULimitReached
The total number of cores in the cluster has reached or exceeded the maximum set on the
cluster autoscaler. This is calculated by summing the CPU capacity of all nodes
in the cluster and comparing that number against the maximum cores value set for the
cluster autoscaler (default 320000 cores).
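
As a rough illustration of the comparison described above, the following Python sketch sums per-node CPU capacity and checks it against a maximum. The function name and the sample node sizes are hypothetical and are not part of the cluster autoscaler; only the default of 320000 cores comes from the text.

```python
# Default maximum quoted in the description above.
DEFAULT_MAX_CORES = 320000


def cpu_limit_reached(node_cores, max_cores=DEFAULT_MAX_CORES):
    """Return True when the summed node CPU capacity meets or exceeds the maximum."""
    return sum(node_cores) >= max_cores


# Hypothetical cluster of 16-core nodes with a maximum of 48 cores.
print(cpu_limit_reached([16, 16, 16], max_cores=48))  # limit reached
print(cpu_limit_reached([16, 16], max_cores=48))      # still below the limit
```

The `>=` mirrors the comparison used in the alert query, so the condition holds as soon as the total equals the maximum.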

### Query
```
# for: 15m
cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
```

### Possible Causes
* Too many nodes have been created in the cluster.
* Nodes of larger than expected size have joined the cluster.
* Maximum CPU limit on the ClusterAutoscaler is set too low.

### Resolution
This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
on your needs and resources, this alert may indicate that action is required. If you require more
resources in your cluster, a simple solution is to increase the maximum core count in your
ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
not harmful to the cluster and the autoscaler will continue to function as normal, with the
exception of creating new nodes. The cluster autoscaler will resume its scale out functionality
once the number of cores in the cluster is fewer than the maximum.
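
Before raising a limit it helps to know what is currently configured. The sketch below is one possible way to read the maximums from a ClusterAutoscaler resource that has already been exported as JSON (for example with `oc get clusterautoscaler default -o json`, assuming the resource is named `default`). The helper name and the sample values are hypothetical; the field paths assume the `spec.resourceLimits` layout of the OpenShift ClusterAutoscaler API, where memory limits are expressed in GiB.

```python
import json


def max_limits(cluster_autoscaler):
    """Return the configured maximums, or None where no limit is set."""
    limits = cluster_autoscaler.get("spec", {}).get("resourceLimits", {})
    return {
        "cores": limits.get("cores", {}).get("max"),
        "memory_gib": limits.get("memory", {}).get("max"),
    }


# Hypothetical resource, shaped like the JSON form of a ClusterAutoscaler.
sample = json.loads("""
{
  "apiVersion": "autoscaling.openshift.io/v1",
  "kind": "ClusterAutoscaler",
  "metadata": {"name": "default"},
  "spec": {"resourceLimits": {"cores": {"min": 8, "max": 128}}}
}
""")

print(max_limits(sample))
```

A value of `None` means the limit is not set on the resource, in which case the autoscaler's built-in default applies.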

## ClusterAutoscalerUnableToScaleMemoryLimitReached
The total number of bytes of RAM in the cluster has reached or exceeded the maximum set on
the cluster autoscaler. This is calculated by summing the memory capacity of all nodes
in the cluster and comparing that number against the maximum memory bytes value set
for the cluster autoscaler (default 6400000 gigabytes).
Same as above, where did this figure come from?
same as above

### Query
```
# for: 15m
cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
```

### Possible Causes
* Too many nodes have been created in the cluster.
* Nodes of larger than expected size have joined the cluster.
* Maximum memory limit on the ClusterAutoscaler is set too low.

### Resolution
This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
on your needs and resources, this alert may indicate that action is required. If you require more
resources in your cluster, a simple solution is to increase the maximum memory limit in your
ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
not harmful to the cluster and the autoscaler will continue to function as normal, with the
exception of creating new nodes. The cluster autoscaler will resume its scale out functionality
once the number of bytes of RAM in the cluster is less than the maximum.
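
The memory check in this section compares a byte count against a limit that is configured in gigabytes, so a unit conversion is involved. The Python sketch below illustrates that comparison. It assumes one gigabyte here means 1024^3 bytes, which should be verified against the autoscaler in use; the function name and the node sizes are hypothetical.

```python
# Assumption: the limit's "gigabytes" are binary (1024^3 bytes each).
GIB = 1024 ** 3

# Default maximum quoted in the description of this alert.
DEFAULT_MAX_MEMORY_GB = 6_400_000


def memory_limit_reached(node_memory_bytes, max_memory_gb=DEFAULT_MAX_MEMORY_GB):
    """Return True when the summed node memory meets or exceeds the maximum."""
    return sum(node_memory_bytes) >= max_memory_gb * GIB


# Hypothetical cluster: two nodes with 64 GiB of RAM each.
nodes = [64 * GIB, 64 * GIB]
print(memory_limit_reached(nodes, max_memory_gb=128))  # limit reached
print(memory_limit_reached(nodes, max_memory_gb=256))  # still below the limit
```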
Is this default correct? Seems absurdly high? Do we have a source for that value? If so, is it worth adding a link inline?
this is just the default if you set nothing on the cluster autoscaler, see this faq entry, also we do allow not setting those limits in a ClusterAutoscaler
Could you add a description of how to get the current maximum settings? Something like "You can run this to see the maximum cores configured in this cluster ...".
@wanghaoran1988 Does this need to happen before we merge, or could it follow in a later release? I'm happy with this personally and would prefer to merge it today so it's in before feature freeze, but Mike is out on PTO today so wouldn't have time to update this before FF.
I'm ok with a follow-up release.