grafana · pracucci · Oct 15, 2022 · Oct 14, 2022 · Oct 14, 2022 · Oct 14, 2022
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,6 +12,7 @@
 
 ### Mixin
 
+* [CHANGE] Alerts: Change `MimirSchedulerQueriesStuck` `for` time to 7 minutes to account for the time it takes for HPA to scale up. #3223
 * [ENHANCEMENT] Alerts: Add `MimirRingMembersMismatch` alert, which fires when a component does not have the expected number of running jobs. #2404
 * [ENHANCEMENT] Dashboards: Add optional row about the Distributor's metric forwarding feature to the `Mimir / Writes` dashboard. #3182
 * [BUGFIX] Dashboards: Fix legend showing `persistentvolumeclaim` when using `deployment_type=baremetal` for `Disk space utilization` panels. #3173
@@ -27,6 +28,7 @@
 ### Documentation
 
 * [ENHANCEMENT] Improve `MimirQuerierAutoscalerNotActive` runbook. #3186
+* [ENHANCEMENT] Improve `MimirSchedulerQueriesStuck` runbook to reflect debug steps with querier auto-scaling enabled. #3223
 
 ### Tools
 

@@ -690,6 +690,7 @@ How to **investigate**:
   - On multi-tenant Mimir clusters with **query-sharding enabled** and **only a single tenant** being affected:
     - Verify whether the particular queries are hitting edge cases where query-sharding is not beneficial, by getting traces from the `Mimir / Slow Queries` dashboard and looking at where time is spent. If time is spent in the query-frontend running the PromQL engine, query-sharding is not beneficial for this tenant. Consider disabling query-sharding or reducing the shard count using the `query_sharding_total_shards` override.
     - Otherwise, and only if the tenant's queries are within reason and represent normal usage, consider scaling out queriers and potentially store-gateways.
+  - On a Mimir cluster with **querier auto-scaling enabled**, after checking the health of the existing querier replicas, check whether the auto-scaler has added querier replicas. If the maximum number of querier replicas has been reached and is not sufficient, increase it.
 
 ### MimirMemcachedRequestErrors
 

@@ -78,7 +78,7 @@ groups:
         There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
     expr: |
       sum by (cluster, namespace, job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0
-    for: 5m
+    for: 7m
     labels:
       severity: critical
   - alert: MimirMemcachedRequestErrors

@@ -144,7 +144,7 @@
           expr: |||
             sum by (%s, job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0
           ||| % $._config.alert_aggregation_labels,
-          'for': '5m',  // We don't want to block for longer.
+          'for': '7m',  // We don't want to block for longer.
           labels: {
             severity: 'critical',
           },