[OCPCLOUD-923] add alerts for memory and cpu core limits #213

Merged
merged 1 commit into from Jul 25, 2021
52 changes: 52 additions & 0 deletions docs/user/alerts.md
@@ -59,3 +59,55 @@ should investigate the logs associated with your cloud provider controllers and
the Machine API resources to discover the root cause. For more information on
why nodes, or machines, might not become ready please see the
[Machine API FAQ](https://github.com/openshift/machine-api-operator/blob/master/FAQ.md).

## ClusterAutoscalerUnableToScaleCPULimitReached
The total number of CPU cores in the cluster has reached or exceeded the maximum set on the
cluster autoscaler. This is calculated by summing the CPU capacity of all nodes
in the cluster and comparing that number against the maximum cores value set for the
cluster autoscaler (default 320000 cores).
Contributor

Is this default correct? Seems absurdly high? Do we have a source for that value? If so, is it worth adding a link inline?

Contributor Author

@elmiko elmiko Jul 12, 2021

This is just the default when nothing is set on the cluster autoscaler; see this FAQ entry. We also allow those limits to be left unset in a ClusterAutoscaler.

Member

Could you add a description of how to get the current maximum settings? Something like "You can run this to see the maximum cores configured in this cluster ...".

Contributor

@wanghaoran1988 Does this need to happen before we merge, or could it follow in a later release? I'm happy with this personally and would prefer to merge it today so it's in before feature freeze, but Mike is out on PTO today, so he wouldn't have time to update this before FF.

Member

I'm OK with a follow-up release.


### Query
```
# for: 15m
cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
```
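The comparison in the query above can be sketched in Go as follows. This is only an illustration of the arithmetic, not the autoscaler's implementation: the helper names `totalCores` and `cpuLimitReached` and the per-node core counts are hypothetical.

```go
package main

import "fmt"

// totalCores sums the CPU capacity (in whole cores) reported by each node.
// The slice stands in for the per-node capacities; it is illustrative data.
func totalCores(nodeCores []int64) int64 {
	var total int64
	for _, c := range nodeCores {
		total += c
	}
	return total
}

// cpuLimitReached mirrors the alert expression: it is true once the
// cluster-wide core count is greater than or equal to the configured maximum.
func cpuLimitReached(nodeCores []int64, maxCores int64) bool {
	return totalCores(nodeCores) >= maxCores
}

func main() {
	nodes := []int64{4, 4, 8} // hypothetical per-node core counts

	fmt.Println(cpuLimitReached(nodes, 16))     // 16 cores against a maximum of 16
	fmt.Println(cpuLimitReached(nodes, 320000)) // 16 cores against the default maximum
}
```

Note that the comparison is `>=`, so the condition already holds when the cluster sits exactly at the maximum, not only after exceeding it.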

### Possible Causes
* Too many nodes have been created in the cluster.
* Nodes of larger than expected size have joined the cluster.
* Maximum CPU limit on the ClusterAutoscaler is set too low.

### Resolution
This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
on your needs and resources, this alert may indicate that action is required. If you require more
resources in your cluster, a simple solution is to increase the maximum core count in your
ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
harmless and the autoscaler will continue to function as normal, with the
exception of creating new nodes. The cluster autoscaler will resume scaling out
once the number of cores in the cluster falls below the maximum.
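One way to raise the maximum core count is a JSON merge patch against the ClusterAutoscaler resource. The sketch below only builds the patch body. It assumes the limit lives at `spec.resourceLimits.cores.max` and that a `cores` range is already defined on the resource; the helper name `coresMaxPatch` and the value 256 are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// coresMaxPatch builds a JSON merge patch that sets only the maximum core
// count. The field path spec.resourceLimits.cores.max is an assumption about
// the ClusterAutoscaler schema; verify it against your cluster's CRD.
func coresMaxPatch(maxCores int) (string, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"resourceLimits": map[string]interface{}{
				"cores": map[string]interface{}{"max": maxCores},
			},
		},
	}
	b, err := json.Marshal(patch)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	p, err := coresMaxPatch(256) // 256 is an illustrative value
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```

The resulting string could then be supplied to a merge-patch command such as `oc patch clusterautoscaler default --type=merge -p '<patch>'`, assuming the resource is named `default`.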

## ClusterAutoscalerUnableToScaleMemoryLimitReached
The total amount of memory, in bytes, in the cluster has reached or exceeded the maximum set on
the cluster autoscaler. This is calculated by summing the memory capacity of all nodes
in the cluster and comparing that number against the maximum memory bytes value set
for the cluster autoscaler (default 6400000 gigabytes).
Contributor

Same as above, where did this figure come from?

Contributor Author

same as above


### Query
```
# for: 15m
cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
```
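The query compares values in bytes, while the default limit is quoted in gigabytes, so a unit conversion sits between the configured limit and the metric. The sketch below illustrates that conversion and the comparison. It assumes one gigabyte means 2^30 bytes (GiB), which is an assumption about how the limit is converted; the helper names and node sizes are hypothetical.

```go
package main

import "fmt"

// bytesPerGiB assumes the limit is expressed in binary gigabytes (2^30 bytes).
const bytesPerGiB int64 = 1 << 30

// memoryLimitBytes converts a limit given in gigabytes to bytes.
func memoryLimitBytes(limitGiB int64) int64 {
	return limitGiB * bytesPerGiB
}

// memoryLimitReached mirrors the alert expression: the summed node memory
// capacity, in bytes, is greater than or equal to the maximum, in bytes.
func memoryLimitReached(nodeBytes []int64, limitGiB int64) bool {
	var total int64
	for _, b := range nodeBytes {
		total += b
	}
	return total >= memoryLimitBytes(limitGiB)
}

func main() {
	// Two hypothetical nodes with 16 GiB of memory each.
	nodes := []int64{16 * bytesPerGiB, 16 * bytesPerGiB}

	fmt.Println(memoryLimitBytes(6400000))      // default limit expressed in bytes
	fmt.Println(memoryLimitReached(nodes, 32))  // 32 GiB against a 32 GiB maximum
	fmt.Println(memoryLimitReached(nodes, 64))  // 32 GiB against a 64 GiB maximum
}
```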

### Possible Causes
* Too many nodes have been created in the cluster.
* Nodes of larger than expected size have joined the cluster.
* Maximum memory limit on the ClusterAutoscaler is set too low.

### Resolution
This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
on your needs and resources, this alert may indicate that action is required. If you require more
resources in your cluster, a simple solution is to increase the maximum memory bytes in your
ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
harmless and the autoscaler will continue to function as normal, with the
exception of creating new nodes. The cluster autoscaler will resume scaling out
once the amount of memory in the cluster falls below the maximum.
23 changes: 23 additions & 0 deletions pkg/controller/clusterautoscaler/monitoring.go
@@ -191,6 +191,29 @@ func (r *Reconciler) AutoscalerPrometheusRule(ca *autoscalingv1.ClusterAutoscale
					"message": "Cluster Autoscaler is reporting that the cluster is not ready for scaling",
				},
			},
			{
				Alert: "ClusterAutoscalerUnableToScaleCPULimitReached",
				Expr:  intstr.FromString("cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction=\"maximum\"}"),
				For:   "15m",
				Labels: map[string]string{
					"severity": "info",
				},
				Annotations: map[string]string{
					"message": "Cluster Autoscaler has reached its CPU core limit and is unable to scale out",
				},
			},
			{
				Alert: "ClusterAutoscalerUnableToScaleMemoryLimitReached",
				Expr:  intstr.FromString("cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction=\"maximum\"}"),
				For:   "15m",
				Labels: map[string]string{
					"severity": "info",
				},
				Annotations: map[string]string{
					"message": "Cluster Autoscaler has reached its Memory bytes limit and is unable to scale out",
				},
			},
		},
	},
},