6 changes: 3 additions & 3 deletions cortex-mixin/alerts/alerts.libsonnet
@@ -479,7 +479,7 @@
},
annotations: {
message: |||
High QPS for ingesters, add more ingesters.
Ingesters in {{ $labels.namespace }} ingest too many samples per second.
|||,
},
},
@@ -498,7 +498,7 @@
},
annotations: {
message: |||
Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - add more ingesters.
Ingester {{ $labels.namespace }}/{{ $labels.pod }} is using too much memory.
|||,
},
},
@@ -517,7 +517,7 @@
},
annotations: {
message: |||
Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - add more ingesters.
Ingester {{ $labels.namespace }}/{{ $labels.pod }} is using too much memory.
|||,
},
},
20 changes: 19 additions & 1 deletion cortex-mixin/docs/playbooks.md
@@ -451,7 +451,25 @@ How to **fix**:

### CortexAllocatingTooMuchMemory

_TODO: this playbook has not been written yet._
This alert fires when an ingester's memory utilization is getting close to its limit.

How it **works**:
- Cortex ingesters are a stateful service
- Having 2+ ingesters `OOMKilled` may cause a cluster outage
- Ingester memory baseline usage is primarily influenced by memory allocated by the process (mostly Go heap) and mmap-ed files (used by TSDB)
- Ingester memory short spikes are primarily influenced by queries and TSDB head compaction into new blocks (occurring every 2h)
- A pod gets `OOMKilled` once its working set memory reaches the configured limit, so it's important to prevent the ingesters' working set memory from getting close to the limit; keep at least 30% headroom for spikes due to queries (a quick check is sketched below)
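
A quick way to check the headroom is to compare each ingester's working set memory against its configured limit. A minimal sketch, assuming the pods are labelled `name=ingester` and the ingester is the first container in the pod (the namespace, label selector and container index are assumptions; adjust them to your deployment):

```
# Current working set memory per ingester container (requires metrics-server)
kubectl -n <namespace> top pod -l name=ingester --containers

# Configured memory limit per ingester pod (assumes the ingester is the first container)
kubectl -n <namespace> get pod -l name=ingester \
  -o custom-columns='POD:.metadata.name,MEM_LIMIT:.spec.containers[0].resources.limits.memory'
```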

How to **fix**:
- Check if the issue occurs only for a few ingesters. If so:
  - Restart the affected ingesters one by one (proceed with the next one only once the previous pod has restarted and is Ready):
    ```
    kubectl -n <namespace> delete pod ingester-XXX
    ```
  - Restarting an ingester typically reduces the memory allocated by mmap-ed files. After the restart the ingester may allocate this memory again over time, but this buys more time while working on a longer-term solution
- Check the `Cortex / Writes Resources` dashboard to see if the number of series per ingester is above the target (1.5M). If so:
  - Scale up the ingesters (see the sketch below)
  - Memory is expected to be reclaimed at the next TSDB head compaction (occurring every 2h)
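
If you do scale up, the exact procedure depends on how the cluster is deployed. A rough sketch, assuming the ingesters run as a StatefulSet named `ingester` (in a jsonnet-managed deployment you would instead bump the ingester replica count in your environment's config and re-apply it):

```
# Imperative scale-up; the StatefulSet name and replica count are assumptions.
# Prefer changing the replica count in your jsonnet config where applicable.
kubectl -n <namespace> scale statefulset ingester --replicas=<new-replica-count>
```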

### CortexGossipMembersMismatch
