Improved CortexRequestLatency playbook #352
Signed-off-by: Marco Pracucci <marco@pracucci.com>
gouthamve
left a comment
LGTM with some suggestions. Overall quite good!
cortex-mixin/docs/playbooks.md
| #### Read Latency | ||
| Query performance is a known problem. When you get this alert, you need to work out if: (a) this is an operational / configuration issue (b) this is because of algorithms and inherently limited (c) this is a bug | ||
| The alert message includes both the Cortex service and route experiencing the high latency. Establish if the alert is about the read or write path based on that. |
Do we want to specify the paths that are read and those that are write?
Good idea 👍
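A minimal sketch of how the read/write split discussed above could be automated from the alert's route label. The regular expressions are illustrative assumptions, not the list of paths from the playbook, and `classify_route` is a hypothetical helper; adjust the patterns to the routes your deployment actually exposes.

```python
import re

# ASSUMPTION: these patterns are illustrative only and are not taken from the
# playbook. Routes mentioning "push" are treated as write path, routes that
# look like queries or metadata lookups as read path.
WRITE_ROUTE = re.compile(r"push", re.IGNORECASE)
READ_ROUTE = re.compile(r"query|series|labels?|read", re.IGNORECASE)


def classify_route(route: str) -> str:
    """Return "write", "read" or "unknown" for a route label value."""
    if WRITE_ROUTE.search(route):
        return "write"
    if READ_ROUTE.search(route):
        return "read"
    return "unknown"


if __name__ == "__main__":
    for route in ("api_v1_push", "prometheus_api_v1_query_range", "metrics"):
        print(route, "->", classify_route(route))
```

Anything classified as `unknown` still needs a manual look, which is why an explicit list of paths in the playbook is the more reliable option.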
| - **`distributor`** | ||
| - Typically, distributor p99 latency is in the range 50-100ms. If the distributor latency is higher than this, you may need to scale up the distributors. | ||
| - **`ingester`** | ||
| - Typically, ingester p99 latency is in the range 5-50ms. If the ingester latency is higher than this, you should investigate the root cause before scaling up ingesters. |
Do we want to add the scaling dashboard to check if the ingesters are running more than 2Mil series per pod?
Yes, good idea. However, since it's already covered in another playbook (whose alert should fire if series / ingester > 1.6M), I've mentioned that alert here too.
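The triage described in this thread can be sketched as a small decision function. The p99 ranges and the 1.6M series threshold are the figures quoted above; the function name `triage_latency`, its parameters, and the returned messages are hypothetical and only illustrate the order of checks.

```python
from typing import Optional

# Typical p99 latency ranges quoted in the playbook diff, in milliseconds.
TYPICAL_P99_MS = {
    "distributor": (50.0, 100.0),
    "ingester": (5.0, 50.0),
}

# Series-per-ingester threshold mentioned in the review reply.
SERIES_PER_INGESTER_ALERT = 1_600_000


def triage_latency(
    service: str,
    p99_ms: float,
    series_per_ingester: Optional[int] = None,
) -> str:
    """Suggest a next step for a latency alert (hypothetical helper)."""
    if service not in TYPICAL_P99_MS:
        return "no typical range quoted for this service"
    _, upper = TYPICAL_P99_MS[service]
    if p99_ms <= upper:
        return "within the typical range"
    if service == "distributor":
        return "consider scaling up distributors"
    # Ingesters: look for a root cause first, starting with series count.
    if (
        series_per_ingester is not None
        and series_per_ingester > SERIES_PER_INGESTER_ALERT
    ):
        return "series per ingester above threshold: follow the series alert playbook"
    return "investigate the root cause before scaling up ingesters"


if __name__ == "__main__":
    print(triage_latency("distributor", 250.0))
    print(triage_latency("ingester", 120.0, series_per_ingester=2_000_000))
```

The asymmetry is deliberate: the diff recommends scaling distributors directly, while for ingesters it asks for a root-cause investigation before any scaling.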
| - High CPU utilization in ingesters | ||
| - Scale up ingesters | ||
| - Low cache hit ratio in the store-gateways | ||
| - If memcached eviction rate is high, then you should scale up memcached replicas. Check the recommendations by `Cortex / Scaling` dashboard and make reasonable adjustments as necessary. |
Suggest the memcached dashboard here?
Definitely, done
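As a rough sketch of the memcached check discussed above: the diff does not say what counts as a "low" hit ratio or a "high" eviction rate, so the function below takes both thresholds as explicit arguments. `memcached_action` and the numbers in the usage lines are hypothetical and should be replaced with values read from your own dashboards.

```python
def memcached_action(
    hit_ratio: float,
    evictions_per_sec: float,
    *,
    low_hit_ratio: float,
    high_evictions_per_sec: float,
) -> str:
    """Suggest a next step for store-gateway caching (hypothetical helper).

    Both thresholds are caller-supplied assumptions; the playbook diff
    gives no numeric values for them.
    """
    if hit_ratio >= low_hit_ratio:
        return "cache hit ratio is not low: look at other causes"
    if evictions_per_sec >= high_evictions_per_sec:
        return "scale up memcached replicas"
    return "low hit ratio without high evictions: investigate further"


if __name__ == "__main__":
    # Example thresholds below are placeholders, not recommendations.
    print(memcached_action(0.55, 300.0, low_hit_ratio=0.8, high_evictions_per_sec=100.0))
    print(memcached_action(0.95, 300.0, low_hit_ratio=0.8, high_evictions_per_sec=100.0))
```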
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Thanks a lot @gouthamve for your thoughtful review!
I'm going to merge it, but if you have any further comments I will promptly address them 🙏
…quest-failure-and-latency-playbooks Improved CortexRequestLatency playbook
What this PR does:
In this PR I've tried to improve the `CortexRequestLatency` playbook, both updating it (e.g. query-scheduler, store-gateway, ...) and expanding on the investigation procedure.
Which issue(s) this PR fixes:
N/A
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`