
Conversation

@pracucci
Collaborator

@pracucci pracucci commented Jul 5, 2021

What this PR does:
In this PR I've tried to improve the CortexRequestLatency playbook, both updating it (e.g. query-scheduler, store-gateway, ...) and expanding on the investigation procedure.

Which issue(s) this PR fixes:
N/A

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci requested a review from a team as a code owner July 5, 2021 08:58
Contributor

@gouthamve gouthamve left a comment


LGTM with some suggestions. Overall quite good!


#### Read Latency
Query performance is a known problem. When you get this alert, you need to work out whether: (a) it is an operational / configuration issue, (b) it is an inherent limitation of the algorithms, or (c) it is a bug.
The alert message includes both the Cortex service and the route experiencing the high latency. Based on that, establish whether the alert is about the read path or the write path.
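
For example, a per-route quantile query is a quick way to see which routes on a given service are driving the latency. This is only a sketch: it assumes the standard `cortex_request_duration_seconds` histogram is exposed, and the `job` matcher is a placeholder to adapt to your deployment's naming.

```promql
# p99 latency broken down by route for a given Cortex service.
# Assumption: the service exposes the standard cortex_request_duration_seconds
# histogram; replace the job matcher with your deployment's naming.
histogram_quantile(0.99, sum by (route, le) (
  rate(cortex_request_duration_seconds_bucket{job=~".*query-frontend"}[5m])
))
```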
Contributor


Do we want to specify the paths that are read and those that are write?

Collaborator Author


Good idea 👍

- **`distributor`**
- Typically, distributor p99 latency is in the range 50-100ms. If the distributor latency is higher than this, you may need to scale up the distributors.
- **`ingester`**
- Typically, ingester p99 latency is in the range 5-50ms. If the ingester latency is higher than this, you should investigate the root cause before scaling up ingesters.
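
As a concrete check, the same request-duration histogram can be used to compare the measured p99 against the reference ranges above. A sketch only; the `job` matcher assumes a `<namespace>/<service>` naming scheme and should be adapted to your deployment.

```promql
# p99 latency per service and route for distributors and ingesters, to compare
# against the 50-100ms (distributor) and 5-50ms (ingester) reference ranges above.
# Assumption: job labels follow a "<namespace>/<service>" naming scheme.
histogram_quantile(0.99, sum by (job, route, le) (
  rate(cortex_request_duration_seconds_bucket{job=~".*/(distributor|ingester)"}[5m])
))
```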
Contributor


Do we want to add the scaling dashboard to check if the ingesters are running more than 2M series per pod?

Collaborator Author


Yes, good idea. However, since this is already covered by another playbook (for the alert that should fire when series / ingester > 1.6M), I've mentioned that alert here too.
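
For reference, a hedged way to check the series-per-ingester figure discussed above, assuming the standard `cortex_ingester_memory_series` gauge is scraped; the grouping label depends on your scrape config.

```promql
# In-memory series per ingester; the alert referenced above is expected to fire
# when this exceeds roughly 1.6M series per ingester.
# Assumption: each ingester is identified by the "pod" label in your scrape config.
max by (pod) (cortex_ingester_memory_series)
```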

- High CPU utilization in ingesters
- Scale up ingesters
- Low cache hit ratio in the store-gateways
- If the memcached eviction rate is high, you should scale up the memcached replicas. Check the recommendations on the `Cortex / Scaling` dashboard and make reasonable adjustments as necessary.
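
A sketch of queries to check both signals, assuming memcached_exporter metrics are scraped and the store-gateway exposes the Thanos index cache metrics; treat the exact metric names as assumptions for your setup.

```promql
# Memcached eviction rate: a sustained non-zero rate suggests the cache is too small.
sum by (instance) (rate(memcached_items_evicted_total[5m]))

# Store-gateway index cache hit ratio (assumed Thanos index cache metric names).
sum(rate(thanos_store_index_cache_hits_total[5m]))
  /
sum(rate(thanos_store_index_cache_requests_total[5m]))
```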
Contributor


Suggest the memcached dashboard here?

Collaborator Author


Definitely, done

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci
Collaborator Author

pracucci commented Jul 5, 2021

Thanks a lot @gouthamve for your thoughtful review!

@pracucci
Collaborator Author

pracucci commented Jul 5, 2021

I'm going to merge it, but if you have any further comments I will promptly address them 🙏

@pracucci pracucci merged commit 27078c6 into main Jul 5, 2021
@pracucci pracucci deleted the improve-request-failure-and-latency-playbooks branch July 5, 2021 12:02
simonswine pushed a commit to grafana/mimir that referenced this pull request Oct 18, 2021
…quest-failure-and-latency-playbooks

Improved CortexRequestLatency playbook