add alerts for memory and cpu core limits

This change adds alerts that fire when the cluster autoscaler is unable to scale out because it has reached its CPU or memory limits. It also updates the alert documentation. ref: https://issues.redhat.com/browse/OCPCLOUD-923
openshift · Jul 9, 2021 · 897b464
1 parent 70beee3
commit 897b464
Showing 2 changed files with 75 additions and 0 deletions.
diff --git a/docs/user/alerts.md b/docs/user/alerts.md
@@ -59,3 +59,55 @@ should investigate the logs associated with your cloud provider controllers and
 the Machine API resources to discover the root cause. For more information on
 why nodes, or machines, might not become ready please see the
 [Machine API FAQ](https://github.com/openshift/machine-api-operator/blob/master/FAQ.md).
+
+## ClusterAutoscalerUnableToScaleCPULimitReached
+The total number of cores in the cluster has reached or exceeded the maximum set on the
+cluster autoscaler. This is calculated by summing the CPU capacity of all nodes
+in the cluster and comparing that sum against the maximum cores value set for the
+cluster autoscaler (default 320000 cores).
+
+### Query
+```
+# for: 15m
+cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction="maximum"}
+```
+
+### Possible Causes
+* Too many nodes have been created in the cluster.
+* Nodes of larger than expected size have joined the cluster.
+* Maximum CPU limit on the ClusterAutoscaler is set too low.
+
+### Resolution
+This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
+on your needs and resources, this alert may or may not require action. If you require more
+resources in your cluster, a simple solution is to increase the maximum core count in your
+ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
+harmless to the cluster and the autoscaler will continue to function as normal, except that
+it will not create new nodes. The cluster autoscaler will resume scaling out
+once the number of cores in the cluster is below the maximum.
+
+## ClusterAutoscalerUnableToScaleMemoryLimitReached
+The total amount of RAM, in bytes, in the cluster has reached or exceeded the maximum set on
+the cluster autoscaler. This is calculated by summing the memory capacity of all nodes
+in the cluster and comparing that sum against the maximum memory value set
+for the cluster autoscaler (default 6400000 gigabytes).
+
+### Query
+```
+# for: 15m
+cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction="maximum"}
+```
+
+### Possible Causes
+* Too many nodes have been created in the cluster.
+* Nodes of larger than expected size have joined the cluster.
+* Maximum memory limit on the ClusterAutoscaler is set too low.
+
+### Resolution
+This alert indicates that the cluster autoscaler is unable to continue scaling out. Depending
+on your needs and resources, this alert may or may not require action. If you require more
+resources in your cluster, a simple solution is to increase the maximum memory in your
+ClusterAutoscaler. If you do not need more resources in your cluster, this condition is
+harmless to the cluster and the autoscaler will continue to function as normal, except that
+it will not create new nodes. The cluster autoscaler will resume scaling out
+once the amount of RAM in the cluster is below the maximum.
diff --git a/pkg/controller/clusterautoscaler/monitoring.go b/pkg/controller/clusterautoscaler/monitoring.go
@@ -191,6 +191,29 @@ func (r *Reconciler) AutoscalerPrometheusRule(ca *autoscalingv1.ClusterAutoscale
 								"message": "Cluster Autoscaler is reporting that the cluster is not ready for scaling",
 							},
 						},
+						{
+							Alert: "ClusterAutoscalerUnableToScaleCPULimitReached",
+							Expr:  intstr.FromString("cluster_autoscaler_cluster_cpu_current_cores >= cluster_autoscaler_cpu_limits_cores{direction=\"maximum\"}"),
+
+							For: "15m",
+							Labels: map[string]string{
+								"severity": "info",
+							},
+							Annotations: map[string]string{
+								"message": "Cluster Autoscaler has reached its CPU core limit and is unable to scale out",
+							},
+						},
+						{
+							Alert: "ClusterAutoscalerUnableToScaleMemoryLimitReached",
+							Expr:  intstr.FromString("cluster_autoscaler_cluster_memory_current_bytes >= cluster_autoscaler_memory_limits_bytes{direction=\"maximum\"}"),
+							For:   "15m",
+							Labels: map[string]string{
+								"severity": "info",
+							},
+							Annotations: map[string]string{
+								"message": "Cluster Autoscaler has reached its Memory bytes limit and is unable to scale out",
+							},
+						},
 					},
 				},
 			},
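The two rules added in `monitoring.go` differ only in the alert name, the pair of metrics being compared, and the wording of the message. As a minimal sketch of that shared shape, the helper below builds both rules from their differing parts. It is not code from this commit: `limitAlert` and `limitReachedAlert` are hypothetical names, and the struct is a simplified stand-in that holds the expression as a plain string rather than wrapping it with `intstr.FromString` as the commit does.

```go
package main

import "fmt"

// limitAlert is a hypothetical, simplified stand-in for the Prometheus rule
// type populated in monitoring.go. It is not the real operator type.
type limitAlert struct {
	Alert       string
	Expr        string
	For         string
	Labels      map[string]string
	Annotations map[string]string
}

// limitReachedAlert (hypothetical helper) builds one "limit reached" alert.
// The expression fires when the current-usage metric is greater than or
// equal to the limit metric with direction="maximum", as in the diff above.
func limitReachedAlert(name, currentMetric, limitMetric, resource string) limitAlert {
	return limitAlert{
		Alert:  name,
		Expr:   fmt.Sprintf("%s >= %s{direction=\"maximum\"}", currentMetric, limitMetric),
		For:    "15m",
		Labels: map[string]string{"severity": "info"},
		Annotations: map[string]string{
			"message": fmt.Sprintf("Cluster Autoscaler has reached its %s limit and is unable to scale out", resource),
		},
	}
}

func main() {
	// Metric names, alert names, and message wording are taken from the diff.
	alerts := []limitAlert{
		limitReachedAlert(
			"ClusterAutoscalerUnableToScaleCPULimitReached",
			"cluster_autoscaler_cluster_cpu_current_cores",
			"cluster_autoscaler_cpu_limits_cores",
			"CPU core",
		),
		limitReachedAlert(
			"ClusterAutoscalerUnableToScaleMemoryLimitReached",
			"cluster_autoscaler_cluster_memory_current_bytes",
			"cluster_autoscaler_memory_limits_bytes",
			"Memory bytes",
		),
	}
	for _, a := range alerts {
		fmt.Printf("%s (for %s): %s\n", a.Alert, a.For, a.Expr)
	}
}
```

Because the comparison is `>=`, each alert fires when usage equals the configured maximum, not only when it goes above it. Whether a helper like this is preferable to the explicit struct literals in the commit is a style choice.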