Merge pull request grafana/cortex-jsonnet#338 from grafana/playbook-for-request-errors

Add playbook for CortexRequestErrors and config option to exclude specific routes
pracucci authored Jun 23, 2021
2 parents 1be26db + a3e9b28 commit 2af795f
Showing 3 changed files with 31 additions and 10 deletions.
21 changes: 16 additions & 5 deletions jsonnet/mimir-mixin/alerts/alerts.libsonnet
@@ -21,11 +21,14 @@
// Note if alert_aggregation_labels is "job", this will repeat the label. But
// prometheus seems to tolerate that.
expr: |||
-100 * sum by (%s, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",route!~"ready"}[1m]))
+100 * sum by (%(group_by)s, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",route!~"%(excluded_routes)s"}[1m]))
/
-sum by (%s, job, route) (rate(cortex_request_duration_seconds_count{route!~"ready"}[1m]))
+sum by (%(group_by)s, job, route) (rate(cortex_request_duration_seconds_count{route!~"%(excluded_routes)s"}[1m]))
> 1
-||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+||| % {
+  group_by: $._config.alert_aggregation_labels,
+  excluded_routes: std.join('|', ['ready'] + $._config.alert_excluded_routes),
+},
'for': '15m',
labels: {
severity: 'critical',
@@ -39,10 +42,18 @@
{
alert: 'CortexRequestLatency',
expr: |||
%(group_prefix_jobs)s_route:cortex_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|ready|/schedulerpb.SchedulerForFrontend/FrontendLoop|/schedulerpb.SchedulerForQuerier/QuerierLoop"}
%(group_prefix_jobs)s_route:cortex_request_duration_seconds:99quantile{route!~"%(excluded_routes)s"}
>
%(cortex_p99_latency_threshold_seconds)s
-||| % $._config,
+||| % $._config {
+  excluded_routes: std.join('|', [
+    'metrics',
+    '/frontend.Frontend/Process',
+    'ready',
+    '/schedulerpb.SchedulerForFrontend/FrontendLoop',
+    '/schedulerpb.SchedulerForQuerier/QuerierLoop',
+  ] + $._config.alert_excluded_routes),
+},
'for': '15m',
labels: {
severity: 'warning',
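
For readers less familiar with jsonnet templating: both alerts now build their route matcher by joining the static exclusions with whatever is listed in `$._config.alert_excluded_routes`. A minimal standalone sketch of that rendering (the `debug_pprof` route is a made-up example, not something the mixin excludes by default):

```jsonnet
// Standalone sketch, runnable with the `jsonnet` CLI; not part of the mixin itself.
// 'debug_pprof' stands in for an entry of $._config.alert_excluded_routes.
local excluded_routes = std.join('|', ['ready'] + ['debug_pprof']);

{
  // Evaluates to: route!~"ready|debug_pprof"
  matcher: 'route!~"%s"' % excluded_routes,
}
```
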
3 changes: 3 additions & 0 deletions jsonnet/mimir-mixin/config.libsonnet
Original file line number Diff line number Diff line change
@@ -64,5 +64,8 @@
writes: true,
reads: true,
},

+// The routes to exclude from alerts.
+alert_excluded_routes: [],
},
}
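
To take advantage of the new option, a downstream configuration can override it in the usual jsonnet way. A hedged sketch, assuming the mixin is imported from a vendored path (both the import path and the `debug_pprof` route are illustrative, not part of this change):

```jsonnet
// Hypothetical override: routes listed here are appended to the per-alert
// exclusion lists via $._config.alert_excluded_routes.
(import 'mimir-mixin/mixin.libsonnet') + {
  _config+:: {
    alert_excluded_routes: ['debug_pprof'],
  },
}
```
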
17 changes: 12 additions & 5 deletions jsonnet/mimir-mixin/docs/playbooks.md
@@ -109,7 +109,18 @@ Right now most of the execution time will be spent in PromQL's innerEval. NB tha

### CortexRequestErrors

-_TODO: this playbook has not been written yet._
+This alert fires when the rate of 5xx errors of a specific route is > 1% for some time.
+
+This alert typically acts as a last resort to detect issues / outages. SLO alerts are expected to trigger earlier: if an **SLO alert** has triggered as well for the same read/write path, then you can ignore this alert and focus on the SLO one.
+
+How to **investigate**:
+- Check which route the alert fired for (an example ad-hoc query is sketched after this list)
+  - Write path: open the `Cortex / Writes` dashboard
+  - Read path: open the `Cortex / Reads` dashboard
+- Looking at the dashboard you should see in which Cortex service the error originates
+  - The panels in the dashboard are vertically sorted by the network path (e.g. on the write path: cortex-gw -> distributor -> ingester)
+- If the failing service is going OOM (`OOMKilled`): scale up or increase the memory
+- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there
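
As a concrete starting point for the first investigation step above, an ad-hoc query along these lines surfaces the routes currently returning 5xx responses. This is a sketch only: the aggregation labels are assumptions and should be adjusted to match your `alert_aggregation_labels`.

```promql
# Sketch: rank services/routes by their current 5xx rate (labels are assumptions).
topk(10,
  sum by (namespace, job, route) (
    rate(cortex_request_duration_seconds_count{status_code=~"5.."}[5m])
  )
)
```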

### CortexTransferFailed
This alert fires when an ingester fails to find another node to transfer its data to while shutting down. If there is both a pod stuck terminating and one stuck joining, look at the Kubernetes events. This may be due to scheduling problems caused by some combination of anti-affinity rules and resource utilization. Adding a new node can help in these circumstances. You can see recent events associated with a resource via kubectl describe, e.g. `kubectl -n <namespace> describe pod <pod>`
@@ -355,10 +366,6 @@ WAL corruptions are only detected at startups, so at this point the WAL/Checkpoi
2. Equal or more than the quorum number but less than the replication factor: there is a good chance that there is no data loss if the data was replicated to the desired number of ingesters, but it's good to double-check for data loss.
3. Equal or more than the replication factor: Then there is definitely some data loss.

-### CortexRequestErrors
-
-_TODO: this playbook has not been written yet._
-
### CortexTableSyncFailure

_This alert applies to Cortex chunks storage only._