This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Conversation

@pracucci (Collaborator) commented on Jul 2, 2021:

What this PR does:
Added a playbook for CortexAllocatingTooMuchMemory. I've also slightly reworded the CortexAllocatingTooMuchMemory and CortexProvisioningTooManyWrites alert messages.

Which issue(s) this PR fixes:
N/A

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci requested a review from a team as a code owner on July 2, 2021 at 10:20.
@pstibrany (Contributor) left a comment:

LGTM, nice improvements.

```
annotations: {
  message: |||
    High QPS for ingesters, add more ingesters.
    Ingesters in {{ $labels.namespace }} have an high samples/sec rate.
```
@pstibrany (Contributor) commented:

Alternative: Ingesters in {{ $labels.namespace }} ingest too many samples per second.

@pracucci (Collaborator, Author) replied:

Better!
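
For reference, a minimal sketch of what the reworded annotation could look like, folding the suggestion above into the quoted fragment; the closing delimiters are filled in here and were not part of the quoted diff:

```
annotations: {
  message: |||
    High QPS for ingesters, add more ingesters.
    Ingesters in {{ $labels.namespace }} ingest too many samples per second.
  |||,
},
```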

- Cortex ingesters are a stateful service
- Having 2+ ingesters `OOMKilled` may cause a cluster outage
- Ingester memory baseline usage is primarily influenced by memory allocated by the process (mostly go heap) and mmap-ed files (used by TSDB)
- Ingester memory short spikes are primarily influenced by queries
@pstibrany (Contributor) commented:

Also when cutting new blocks.

@pracucci (Collaborator, Author) replied:

Right!

- A pod gets `OOMKilled` once it's working set memory reaches the configured limit, so it's important to prevent ingesters memory utilization (working set memory) from getting close to the limit (we need to keep at least 30% room for spikes due to queries)
@pstibrany (Contributor) commented:

"it's working set" -> "its working set"

```
kubectl -n <namespace> delete pod ingester-XXX
```
- Restarting an ingester typically reduces the memory allocated by mmap-ed files. Such memory could be reallocated again, but may let you gain more time while working on a longer term solution
@pstibrany (Contributor) suggested a change:

```
- Restarting an ingester typically reduces the memory allocated by mmap-ed files. Such memory could be reallocated again, but may let you gain more time while working on a longer term solution
+ Restarting an ingester typically reduces the memory allocated by mmap-ed files. After the restart, ingester may allocate this memory again over time, but it may give more time while working on a longer term solution
```
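
If you do restart an ingester to reclaim mmap-ed memory, a small sketch of how you might confirm the effect, assuming the ingesters run as a StatefulSet (so the pod is recreated with the same name) and that metrics-server and the common `watch` utility are available:

```
# Delete the pod; the StatefulSet controller recreates it with the same name
# (an assumption about how the ingesters are deployed).
kubectl -n <namespace> delete pod ingester-XXX
kubectl -n <namespace> wait --for=condition=Ready pod/ingester-XXX

# Poll memory usage to check that the working set dropped after the restart.
watch kubectl -n <namespace> top pod ingester-XXX
```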

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci (Collaborator, Author) commented on Jul 2, 2021:

Thanks @pstibrany for your valuable feedback! Applied all changes.

@pracucci merged commit 3528572 into main on Jul 2, 2021.
@pracucci deleted the playbook-for-CortexAllocatingTooMuchMemory branch on July 2, 2021 at 11:35.
simonswine pushed a commit to grafana/mimir that referenced this pull request on Oct 18, 2021: …or-CortexAllocatingTooMuchMemory (Added playbook for CortexAllocatingTooMuchMemory)