6 changes: 3 additions & 3 deletions cortex-mixin/alerts/alerts.libsonnet
@@ -479,7 +479,7 @@
},
annotations: {
message: |||
High QPS for ingesters, add more ingesters.
Ingesters in {{ $labels.namespace }} ingest too many samples per second.
|||,
},
},
@@ -498,7 +498,7 @@
},
annotations: {
message: |||
Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - add more ingesters.
Ingester {{ $labels.namespace }}/{{ $labels.pod }} is using too much memory.
|||,
},
},
@@ -517,7 +517,7 @@
},
annotations: {
message: |||
Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - add more ingesters.
Ingester {{ $labels.namespace }}/{{ $labels.pod }} is using too much memory.
|||,
},
},
20 changes: 19 additions & 1 deletion cortex-mixin/docs/playbooks.md
@@ -451,7 +451,25 @@ How to **fix**:

### CortexAllocatingTooMuchMemory

_TODO: this playbook has not been written yet._
This alert fires when an ingester's memory utilization is getting close to its limit.

How it **works**:
- Cortex ingesters are a stateful service
- Having 2+ ingesters `OOMKilled` may cause a cluster outage
- Ingester memory baseline usage is primarily influenced by memory allocated by the process (mostly Go heap) and mmap-ed files (used by TSDB)
- Ingester memory short spikes are primarily influenced by queries and TSDB head compaction into new blocks (occurring every 2h)
- A pod gets `OOMKilled` once its working set memory reaches the configured limit, so it's important to prevent the ingesters' working set memory from getting close to the limit; keep at least 30% headroom for spikes due to queries (a quick check is sketched below)
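
A quick way to check the headroom is to compare each ingester's working set memory against its configured limit. A minimal sketch, assuming the pods are labelled `name=ingester` and the ingester is the first container in the pod (the namespace, label selector and container index are assumptions; adjust them to your deployment):

```
# Current working set memory per ingester container (requires metrics-server)
kubectl -n <namespace> top pod -l name=ingester --containers

# Configured memory limit per ingester pod (assumes the ingester is the first container)
kubectl -n <namespace> get pod -l name=ingester \
  -o custom-columns='POD:.metadata.name,MEM_LIMIT:.spec.containers[0].resources.limits.memory'
```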

How to **fix**:
- Check if the issue occurs only for a few ingesters. If so:
  - Restart the affected ingesters one by one (proceed with the next one only once the previous pod has restarted and is Ready):
    ```
    kubectl -n <namespace> delete pod ingester-XXX
    ```
  - Restarting an ingester typically reduces the memory allocated by mmap-ed files. After the restart the ingester may allocate this memory again over time, but this buys more time while working on a longer-term solution
- Check the `Cortex / Writes Resources` dashboard to see if the number of series per ingester is above the target (1.5M). If so:
  - Scale up the ingesters (see the sketch below)
  - Memory is expected to be reclaimed at the next TSDB head compaction (occurring every 2h)
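
If you do scale up, the exact procedure depends on how the cluster is deployed. A rough sketch, assuming the ingesters run as a StatefulSet named `ingester` (in a jsonnet-managed deployment you would instead bump the ingester replica count in your environment's config and re-apply it):

```
# Imperative scale-up; the StatefulSet name and replica count are assumptions.
# Prefer changing the replica count in your jsonnet config where applicable.
kubectl -n <namespace> scale statefulset ingester --replicas=<new-replica-count>
```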

### CortexGossipMembersMismatch
