Commit
Split cortex_api recording rule group into three groups.
This is a workaround for large clusters where this group can become slow to evaluate.
stevesg committed Oct 4, 2021
1 parent 1d5e6a4 commit f86df91
Showing 2 changed files with 78 additions and 3 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -25,6 +25,7 @@
 * [CHANGE] Decreased `-server.grpc-max-concurrent-streams` from 100k to 10k. #369
 * [CHANGE] Decreased blocks storage ingesters graceful termination period from 80m to 20m. #369
 * [CHANGE] Changed default `job_names` for query-frontend, query-scheduler and querier to match custom deployments too. #376
+* [CHANGE] Split `cortex_api` recording rule group into three groups. This is a workaround for large clusters where this group can become slow to evaluate. #
 * [ENHANCEMENT] Add overrides config to compactor. This allows setting retention configs per user. #386
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Cortex-mixin: Include `cortex-gw-internal` naming variation in default `gateway` job names. #328
80 changes: 77 additions & 3 deletions cortex-mixin/recording_rules.libsonnet
@@ -11,10 +11,18 @@ local utils = import 'mixin-utils/utils.libsonnet';
   prometheusRules+:: {
     groups+: [
       {
-        name: 'cortex_api',
+        name: 'cortex_api_1',
         rules:
+          utils.histogramRules('cortex_request_duration_seconds', ['cluster', 'job'])
+      },
+      {
+        name: 'cortex_api_2',
+        rules:
+          utils.histogramRules('cortex_request_duration_seconds', ['cluster', 'job', 'route'])
+      },
+      {
+        name: 'cortex_api_3',
+        rules:
-          utils.histogramRules('cortex_request_duration_seconds', ['cluster', 'job']) +
-          utils.histogramRules('cortex_request_duration_seconds', ['cluster', 'job', 'route']) +
           utils.histogramRules('cortex_request_duration_seconds', ['cluster', 'namespace', 'job', 'route']),
       },
       {
@@ -366,6 +374,72 @@ local utils = import 'mixin-utils/utils.libsonnet';
           },
         ],
       },
+      {
+        name: 'cortex_alertmanager_rules',
+        rules: [
+          // Aggregations of per-user Alertmanager metrics used in dashboards.
+          {
+            record: 'cluster_job_%s:cortex_alertmanager_alerts:sum' % $._config.per_instance_label,
+            expr: |||
+              sum by (cluster, job, %s) (cortex_alertmanager_alerts)
+            ||| % $._config.per_instance_label,
+          },
+          {
+            record: 'cluster_job_%s:cortex_alertmanager_silences:sum' % $._config.per_instance_label,
+            expr: |||
+              sum by (cluster, job, %s) (cortex_alertmanager_silences)
+            ||| % $._config.per_instance_label,
+          },
+          {
+            record: 'cluster_job:cortex_alertmanager_alerts_received_total:rate5m',
+            expr: |||
+              sum by (cluster, job) (rate(cortex_alertmanager_alerts_received_total[5m]))
+            |||,
+          },
+          {
+            record: 'cluster_job:cortex_alertmanager_alerts_invalid_total:rate5m',
+            expr: |||
+              sum by (cluster, job) (rate(cortex_alertmanager_alerts_invalid_total[5m]))
+            |||,
+          },
+          {
+            record: 'cluster_job_integration:cortex_alertmanager_notifications_total:rate5m',
+            expr: |||
+              sum by (cluster, job, integration) (rate(cortex_alertmanager_notifications_total[5m]))
+            |||,
+          },
+          {
+            record: 'cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m',
+            expr: |||
+              sum by (cluster, job, integration) (rate(cortex_alertmanager_notifications_failed_total[5m]))
+            |||,
+          },
+          {
+            record: 'cluster_job:cortex_alertmanager_state_replication_total:rate5m',
+            expr: |||
+              sum by (cluster, job) (rate(cortex_alertmanager_state_replication_total[5m]))
+            |||,
+          },
+          {
+            record: 'cluster_job:cortex_alertmanager_state_replication_failed_total:rate5m',
+            expr: |||
+              sum by (cluster, job) (rate(cortex_alertmanager_state_replication_failed_total[5m]))
+            |||,
+          },
+          {
+            record: 'cluster_job:cortex_alertmanager_partial_state_merges_total:rate5m',
+            expr: |||
+              sum by (cluster, job) (rate(cortex_alertmanager_partial_state_merges_total[5m]))
+            |||,
+          },
+          {
+            record: 'cluster_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m',
+            expr: |||
+              sum by (cluster, job) (rate(cortex_alertmanager_partial_state_merges_failed_total[5m]))
+            |||,
+          },
+        ],
+      },
     ],
   },
 }
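
The `cortex_api` split repeats the same group shape three times, once per label set. As a minimal sketch (not part of this commit), the three groups could be generated from a list of label sets instead. `splitHistogramGroups` and its parameter names are hypothetical and introduced here only for illustration; `utils.histogramRules` and the import path are the ones already used in `recording_rules.libsonnet`, and the snippet assumes that library is available on the Jsonnet import path.

```jsonnet
local utils = import 'mixin-utils/utils.libsonnet';

// Hypothetical helper (not in the commit): emit one rule group per label set,
// naming the groups <baseName>_1, <baseName>_2, and so on.
local splitHistogramGroups(baseName, metric, labelSets) = [
  {
    name: '%s_%d' % [baseName, i + 1],
    rules: utils.histogramRules(metric, labelSets[i]),
  }
  for i in std.range(0, std.length(labelSets) - 1)
];

{
  prometheusRules+:: {
    groups+: splitHistogramGroups('cortex_api', 'cortex_request_duration_seconds', [
      ['cluster', 'job'],
      ['cluster', 'job', 'route'],
      ['cluster', 'namespace', 'job', 'route'],
    ]),
  },
}
```

Under these assumptions the helper would yield groups named `cortex_api_1`, `cortex_api_2` and `cortex_api_3`, each holding the rules for one label set, which is the same layout the diff writes out by hand. Keeping one label set per group lets each group be evaluated separately rather than as one large group.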
