mixin: adapt alerts/playbooks to consider ruler query path components #1949

Merged 5 commits on Jun 8, 2022
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -38,6 +38,7 @@
* [CHANGE] Split `mimir_queries` rules group into `mimir_queries` and `mimir_ingester_queries` to keep number of rules per group within the default per-tenant limit. #1885
* [CHANGE] Dashboards: Expose full image tag in "Mimir / Rollout progress" dashboard's "Pod per version panel." #1932
* [CHANGE] Dashboards: Disabled gateway panels by default, because most users don't have a gateway exposing the metrics expected by Mimir dashboards. You can re-enable them by setting `gateway_enabled: true` in the mixin config and recompiling the mixin with `make build-mixin`. #1954
+* [CHANGE] Alerts: adapt `MimirFrontendQueriesStuck` and `MimirSchedulerQueriesStuck` to consider ruler query path components. #1949
* [ENHANCEMENT] Dashboards: Add config option `datasource_regex` to customise the regular expression used to select valid datasources for Mimir dashboards. #1802
* [ENHANCEMENT] Dashboards: Added "Mimir / Remote ruler reads" and "Mimir / Remote ruler reads resources" dashboards. #1911 #1937
* [ENHANCEMENT] Dashboards: Make networking panels work for pods created by the mimir-distributed helm chart. #1927
1 change: 1 addition & 0 deletions docs/sources/operators-guide/mimir-runbooks/_index.md
@@ -318,6 +318,7 @@ There is a category of errors that is more important: errors due to failure to r
How to **fix** it:

- Investigate the ruler logs to find out the reason why ruler cannot evaluate queries. Note that ruler logs rule evaluation errors even for "user errors", but those are not causing the alert to fire. Focus on problems with ingesters or store-gateways.
+- If remote operational mode is enabled, the problem could be in any of the ruler query path components (ruler-query-frontend, ruler-query-scheduler, and ruler-querier). Check the `Mimir / Remote ruler reads` and `Mimir / Remote ruler reads resources` dashboards to find out which Mimir service the error originates from.

### MimirRulerMissedEvaluations

8 changes: 4 additions & 4 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -66,18 +66,18 @@ groups:
   - alert: MimirFrontendQueriesStuck
     annotations:
       message: |
-        There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} query-frontend.
+        There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
     expr: |
-      sum by (cluster, namespace) (cortex_query_frontend_queue_length) > 1
+      sum by (cluster, namespace, job) (cortex_query_frontend_queue_length) > 1
     for: 5m
     labels:
       severity: critical
   - alert: MimirSchedulerQueriesStuck
     annotations:
       message: |
-        There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} query-scheduler.
+        There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
     expr: |
-      sum by (cluster, namespace) (cortex_query_scheduler_queue_length) > 1
+      sum by (cluster, namespace, job) (cortex_query_scheduler_queue_length) > 1
     for: 5m
     labels:
       severity: critical
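
The effect of adding `job` to the aggregation can be sketched outside Prometheus. The Python below is a minimal illustration and not part of this PR: it emulates `sum by (...) (...) > 1` over a few made-up samples. The `job` values and queue lengths are assumptions, and the `for: 5m` clause is not modeled. With the old grouping the result has no `job` label, so the alert cannot say which component is stuck; with the new grouping the ruler query path component is reported separately.

```python
from collections import defaultdict

# Hypothetical samples of cortex_query_frontend_queue_length.
# Label values and queue lengths are invented for illustration only.
SAMPLES = [
    ({"cluster": "c1", "namespace": "mimir", "job": "mimir/query-frontend"}, 0.0),
    ({"cluster": "c1", "namespace": "mimir", "job": "mimir/ruler-query-frontend"}, 3.0),
    ({"cluster": "c1", "namespace": "mimir", "job": "mimir/ruler-query-frontend"}, 2.0),
]


def sum_by(samples, labels):
    """Emulate PromQL `sum by (<labels>) (...)`: group samples and add their values."""
    totals = defaultdict(float)
    for sample_labels, value in samples:
        key = tuple(sample_labels.get(name, "") for name in labels)
        totals[key] += value
    return dict(totals)


def firing(samples, labels, threshold=1.0):
    """Keep only the groups whose sum exceeds the threshold, like `> 1` in the rule."""
    return {key: total for key, total in sum_by(samples, labels).items() if total > threshold}


# Old expression: sum by (cluster, namespace)
old = firing(SAMPLES, ["cluster", "namespace"])
# New expression: sum by (cluster, namespace, job)
new = firing(SAMPLES, ["cluster", "namespace", "job"])

print(old)
print(new)
```

In this sketch the old grouping produces a single `c1`/`mimir` result, which the old message would have attributed to `query-frontend` even though the queued queries belong to the ruler query path.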
8 changes: 4 additions & 4 deletions operations/mimir-mixin/alerts/alerts.libsonnet
@@ -127,30 +127,30 @@
 {
   alert: $.alertName('FrontendQueriesStuck'),
   expr: |||
-    sum by (%s) (cortex_query_frontend_queue_length) > 1
+    sum by (%s, job) (cortex_query_frontend_queue_length) > 1
   ||| % $._config.alert_aggregation_labels,
   'for': '5m',  // We don't want to block for longer.
   labels: {
     severity: 'critical',
   },
   annotations: {
     message: |||
-      There are {{ $value }} queued up queries in %(alert_aggregation_variables)s query-frontend.
+      There are {{ $value }} queued up queries in %(alert_aggregation_variables)s {{ $labels.job }}.
     ||| % $._config,
   },
 },
 {
   alert: $.alertName('SchedulerQueriesStuck'),
   expr: |||
-    sum by (%s) (cortex_query_scheduler_queue_length) > 1
+    sum by (%s, job) (cortex_query_scheduler_queue_length) > 1
   ||| % $._config.alert_aggregation_labels,
   'for': '5m',  // We don't want to block for longer.
   labels: {
     severity: 'critical',
   },
   annotations: {
     message: |||
-      There are {{ $value }} queued up queries in %(alert_aggregation_variables)s query-scheduler.
+      There are {{ $value }} queued up queries in %(alert_aggregation_variables)s {{ $labels.job }}.
     ||| % $._config,
   },
 },
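
To see how the Jsonnet templates above turn into the compiled rule, the substitution can be reproduced in Python, since Jsonnet's `%` string formatting follows Python's rules. This is only a sketch: the two config values are inferred from the compiled `alerts.yaml` in this PR, and the actual mixin config may define them differently.

```python
# Config values inferred from the compiled alerts.yaml shown in this PR.
# They are assumptions here, not copied from the mixin's config file.
config = {
    "alert_aggregation_labels": "cluster, namespace",
    "alert_aggregation_variables": "{{ $labels.cluster }}/{{ $labels.namespace }}",
}

# Positional substitution, mirroring `||| % $._config.alert_aggregation_labels`.
expr = (
    "sum by (%s, job) (cortex_query_frontend_queue_length) > 1"
    % config["alert_aggregation_labels"]
)

# Named substitution, mirroring `||| % $._config`.
message = (
    "There are {{ $value }} queued up queries in "
    "%(alert_aggregation_variables)s {{ $labels.job }}." % config
)

print(expr)
print(message)
```

The `{{ ... }}` placeholders pass through the substitution untouched; Prometheus expands them later, when the alert fires.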