add alerts for memory and cpu core limits
This change adds the alerts for when the cluster autoscaler is unable to
scale out due to reaching cpu or memory limits. It also updates the
alert documents.

ref: https://issues.redhat.com/browse/OCPCLOUD-923
elmiko committed Jul 9, 2021
1 parent 70beee3 commit 897b464
Showing 2 changed files with 75 additions and 0 deletions.
52 changes: 52 additions & 0 deletions docs/user/alerts.md
@@ -59,3 +59,55 @@ should investigate the logs associated with your cloud provider controllers and
the Machine API resources to discover the root cause. For more information on
why nodes, or machines, might not become ready please see the
[Machine API FAQ](https://github.com/openshift/machine-api-operator/blob/master/FAQ.md).

## ClusterAutoscalerUnableToScaleCPULimitReached
The total number of CPU cores in the cluster has reached or exceeded the maximum set on the
cluster autoscaler. The total is calculated by summing the CPU capacity of all nodes
in the cluster, and that sum is compared against the maximum cores value set for the
cluster autoscaler (default 320000 cores).
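
The calculation described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the autoscaler's actual code: the `node` type, the `totalCores` and `cpuLimitReached` helpers, and the sample node sizes are hypothetical. Only the 320000 default comes from the text above.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a cluster node; only CPU capacity is modeled.
type node struct {
	name     string
	cpuCores int64
}

// defaultMaxCores mirrors the default maximum quoted in the text above.
const defaultMaxCores int64 = 320000

// totalCores sums the CPU capacity of every node.
func totalCores(nodes []node) int64 {
	var sum int64
	for _, n := range nodes {
		sum += n.cpuCores
	}
	return sum
}

// cpuLimitReached reports whether the summed capacity has reached or passed
// the maximum, matching the >= comparison used by the alert query.
func cpuLimitReached(nodes []node, maxCores int64) bool {
	return totalCores(nodes) >= maxCores
}

func main() {
	// Hypothetical node sizes, chosen only for illustration.
	nodes := []node{{"worker-a", 16}, {"worker-b", 16}, {"worker-c", 32}}

	fmt.Println("total cores:", totalCores(nodes))
	fmt.Println("limit of 64 reached:", cpuLimitReached(nodes, 64))
	fmt.Println("default limit reached:", cpuLimitReached(nodes, defaultMaxCores))
}
```

Note that the comparison is inclusive: a cluster sitting exactly at the maximum already counts as having reached the limit.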

### Query
```
# for: 15m
cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
```

### Possible Causes
* Too many nodes have been created in the cluster.
* Nodes of larger than expected size have joined the cluster.
* Maximum CPU limit on the ClusterAutoscaler is set too low.

### Resolution
This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
on your needs and resources, this alert may indicate that action is required. If you require more
resources in your cluster, a simple solution is to increase the maximum core count in your
ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
harmless and the autoscaler will continue to function as normal, with the
exception of creating new nodes. The cluster autoscaler will resume scaling out
once the number of cores in the cluster falls below the maximum.

## ClusterAutoscalerUnableToScaleMemoryLimitReached
The total number of bytes of RAM in the cluster has reached or exceeded the maximum set on
the cluster autoscaler. The total is calculated by summing the memory capacity of all nodes
in the cluster, and that sum is compared against the maximum memory bytes value set
for the cluster autoscaler (default 6400000 gigabytes).
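
Because the default is quoted in gigabytes while the metrics are reported in bytes, a unit conversion sits between the configured limit and the comparison. The Go sketch below shows one way to express it. It assumes binary gigabytes (1024^3 bytes), which the text above does not state. The `maxMemoryBytes` and `memoryLimitReached` helpers and the node sizes are hypothetical.

```go
package main

import "fmt"

// bytesPerGiB assumes binary gigabytes (GiB). This is an assumption of the
// sketch, not something the alert documentation specifies.
const bytesPerGiB int64 = 1024 * 1024 * 1024

// defaultMaxMemoryGB mirrors the default maximum quoted in the text above.
const defaultMaxMemoryGB int64 = 6400000

// maxMemoryBytes converts a limit expressed in gigabytes to bytes.
func maxMemoryBytes(limitGB int64) int64 {
	return limitGB * bytesPerGiB
}

// memoryLimitReached sums per-node memory capacity in bytes and compares the
// total against the converted limit, using the same >= as the alert query.
func memoryLimitReached(nodeBytes []int64, limitGB int64) bool {
	var total int64
	for _, b := range nodeBytes {
		total += b
	}
	return total >= maxMemoryBytes(limitGB)
}

func main() {
	// Three hypothetical nodes with 64 GiB of RAM each.
	nodes := []int64{64 * bytesPerGiB, 64 * bytesPerGiB, 64 * bytesPerGiB}

	fmt.Println("default limit in bytes:", maxMemoryBytes(defaultMaxMemoryGB))
	fmt.Println("limit of 192 reached:", memoryLimitReached(nodes, 192))
	fmt.Println("default limit reached:", memoryLimitReached(nodes, defaultMaxMemoryGB))
}
```

Under the GiB assumption the default converts to a value that still fits comfortably in an `int64`, so plain integer arithmetic is sufficient here.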

### Query
```
# for: 15m
cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
```

### Possible Causes
* Too many nodes have been created in the cluster.
* Nodes of larger than expected size have joined the cluster.
* Maximum memory limit on the ClusterAutoscaler is set too low.

### Resolution
This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
on your needs and resources, this alert may indicate that action is required. If you require more
resources in your cluster, a simple solution is to increase the maximum memory bytes in your
ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
harmless and the autoscaler will continue to function as normal, with the
exception of creating new nodes. The cluster autoscaler will resume scaling out
once the number of bytes of RAM in the cluster falls below the maximum.
23 changes: 23 additions & 0 deletions pkg/controller/clusterautoscaler/monitoring.go
@@ -191,6 +191,29 @@ func (r *Reconciler) AutoscalerPrometheusRule(ca *autoscalingv1.ClusterAutoscale
					"message": "Cluster Autoscaler is reporting that the cluster is not ready for scaling",
				},
			},
			{
				Alert: "ClusterAutoscalerUnableToScaleCPULimitReached",
				Expr:  intstr.FromString("cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction=\"maximum\"}"),

				For: "15m",
				Labels: map[string]string{
					"severity": "info",
				},
				Annotations: map[string]string{
					"message": "Cluster Autoscaler has reached its CPU core limit and is unable to scale out",
				},
			},
			{
				Alert: "ClusterAutoscalerUnableToScaleMemoryLimitReached",
				Expr:  intstr.FromString("cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction=\"maximum\"}"),
				For:   "15m",
				Labels: map[string]string{
					"severity": "info",
				},
				Annotations: map[string]string{
					"message": "Cluster Autoscaler has reached its Memory bytes limit and is unable to scale out",
				},
			},
		},
	},
},