Reduce cardinality of metrics exposed by Mimir #1750

Closed
pracucci opened this issue Apr 22, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@pracucci
Collaborator

@colega did great work analyzing the cardinality of Mimir metrics and which metrics are actually used by any of our dashboards, alerts, or manual queries at Grafana Labs. I took a quick look at the report of both used and unused metrics, and I think we have room to reduce the cardinality of the metrics exposed by Mimir.

Mimir metrics are typically low cardinality, unless you run very large Mimir clusters and/or have a very large number of tenants. Below I'm sharing some ideas on how we could reduce them.

cortex_ring_tokens_owned

  • Exposed by the ring client
  • For each ring member, it exposes how many tokens (ring virtual nodes) that member has registered in the ring (e.g. 512)

Proposal: remove the metric (I think it's useless).

cortex_ring_member_ownership_percent

  • Exposed by the ring client
  • For each ring member, it exposes the % of tokens owned in the ring

Proposal: the same information is available in the ring UI page, so I think we can just remove the metric.

cortex_alertmanager_notification_requests_total and cortex_alertmanager_notification_requests_failed_total

  • Exposed by the Alertmanager
  • For each user and receiver integration (even if it's not used!) it exposes the number of notification requests sent / failed

Proposal: do not expose them for unused receivers. We could either propose a change to Prometheus Alertmanager so that it does not initialize them, or work around it in Mimir by not exposing these metrics when their value is 0 (Mimir remaps them).

The same could be done for all other Alertmanager counters that have both the user and integration labels, like cortex_alertmanager_notifications_total.
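
To give a feel for why unused receivers matter, here is a rough back-of-the-envelope sketch in Python. The tenant and integration counts are made-up illustration values, not measurements from any real cluster, and the helper name is hypothetical.

```python
def notification_series(tenants: int, integrations_per_tenant: int, metrics: int = 2) -> int:
    """Series exposed when every (user, integration) pair gets one series per metric.

    metrics=2 stands for the requests_total and requests_failed_total counters.
    """
    return tenants * integrations_per_tenant * metrics


# Hypothetical numbers: 1,000 tenants, 10 integrations initialized per tenant,
# but only 2 integrations actually configured per tenant.
all_initialized = notification_series(tenants=1000, integrations_per_tenant=10)
only_used = notification_series(tenants=1000, integrations_per_tenant=2)

print(all_initialized, only_used)
```

Under these assumed numbers, the series count scales with the number of initialized integrations rather than the number actually in use, which is why skipping unused receivers can pay off on clusters with many tenants.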

cortex_distributor_ingester_queries_total and cortex_distributor_ingester_query_failures_total

  • Exposed by querier and ruler
  • Exposes the number of queries sent to ingesters (and failed ones)

Proposal: remove the metrics (the same information can be inferred from the generic gRPC requests-by-route metric).

cortex_alertmanager_alerts_insert_limited_total

This and other Alertmanager counters with only the user label are exposed for all tenants, regardless of whether any value has ever been tracked (so even if the counter value is 0).
With a normal CounterVec they wouldn't be, but in this case they are because they've been remapped by Mimir.

Proposal: improve the remapping logic for counters so that it can optionally skip exposing a counter whose value is 0.
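
As a minimal sketch of the idea (in Python for brevity, and not Mimir's actual remapping code), the optional behavior could look like the following. The function name, the `skip_zero` flag, and the tenant IDs are all hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]


def remap_per_tenant_counter(
    name: str,
    per_tenant_values: Dict[str, float],
    skip_zero: bool = False,
) -> List[Sample]:
    """Turn per-tenant counter values into (metric name, labels, value) samples.

    With skip_zero=True, tenants whose counter is still 0 produce no series.
    """
    samples: List[Sample] = []
    for tenant, value in sorted(per_tenant_values.items()):
        if skip_zero and value == 0:
            continue  # never incremented for this tenant: do not expose it
        samples.append((name, {"user": tenant}, value))
    return samples


metric = "cortex_alertmanager_alerts_insert_limited_total"
values = {"tenant-a": 0, "tenant-b": 3, "tenant-c": 0}

exposed_all = remap_per_tenant_counter(metric, values)
exposed_nonzero = remap_per_tenant_counter(metric, values, skip_zero=True)
print(len(exposed_all), len(exposed_nonzero))
```

Keeping the flag optional means counters that should always be visible can keep their current behavior.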

@pracucci pracucci added the enhancement New feature or request label Apr 22, 2022
@pstibrany
Member

Good suggestions, I agree with all of them.

@09jvilla
Contributor

Neat - this is great @pracucci .

If I read your proposal correctly, you're not suggesting we replace any of these metrics with aggregated versions. Seems like you don't want to get rid of the user label entirely, or the receiver label entirely, but rather you want to drop series with certain characteristics (e.g., value always 0 or receiver is unused).

Is that correct?

I'm just trying to get a feel for cases where aggregations are or aren't helpful for people.

@johannaratliff
Contributor

johannaratliff commented Apr 22, 2022

the same information is available in the ring UI page

When referring to dropping cortex_ring_member_ownership_percent - is that value on the UI page not derived from this metric? Do we recalculate it for the page? Unsure here.

@pracucci
Collaborator Author

If I read your proposal correctly, you're not suggesting we replace any of these metrics with aggregated versions. Seems like you don't want to get rid of the user label entirely, or the receiver label entirely, but rather you want to drop series with certain characteristics (e.g., value always 0 or receiver is unused).

Correct. Reducing cardinality directly in Mimir (wherever possible) will benefit every Mimir user, not just those who use Grafana Labs' aggregations.

When referring to dropping cortex_ring_member_ownership_percent - is that value on the UI page not derived from this metric? Do we recalculate it for the page? Unsure here.

We have a function that computes the %. The result is used both to populate the UI page and to expose the metric. If we drop the metric, we'll just keep showing it in the UI.
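
For readers unfamiliar with how such a percentage can be derived from tokens, here is an illustrative Python sketch. It is not the actual Mimir function: it assumes a 32-bit token space and that each token owns the range from the previous token in the ring up to itself, wrapping around. The member names and tokens are invented.

```python
from typing import Dict, List

TOKEN_SPACE = 2**32  # assumption: tokens are 32-bit unsigned integers


def ownership_percent(tokens_by_member: Dict[str, List[int]]) -> Dict[str, float]:
    """Return the percentage of the token space owned by each ring member."""
    ring = sorted(
        (token, member)
        for member, tokens in tokens_by_member.items()
        for token in tokens
    )
    owned = {member: 0 for member in tokens_by_member}
    for i, (token, member) in enumerate(ring):
        if len(ring) == 1:
            distance = TOKEN_SPACE  # a single token owns the whole ring
        else:
            prev_token = ring[i - 1][0]  # i == 0 wraps around to the last token
            distance = (token - prev_token) % TOKEN_SPACE
        owned[member] += distance
    return {member: 100.0 * size / TOKEN_SPACE for member, size in owned.items()}


percentages = ownership_percent({"a": [0, 2**31], "b": [2**30]})
print(percentages)
```

The same result can feed both a UI page and a metric, so dropping the metric does not require changing the computation.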

@pracucci
Collaborator Author

pracucci commented May 2, 2022

The suggestions in the issue's description have all been applied, so I'm going to close this issue. However, we may make other improvements over time, as we find them.

@pracucci pracucci closed this as completed May 2, 2022
@pracucci
Collaborator Author

pracucci commented May 6, 2022

We just deployed these changes to a single-tenant cluster (about 70M active series) and we're seeing a reduction of about 10% in metrics:

  • before these PRs: 502k metrics series
  • after these PRs: 443k metrics series (-11%)
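
For reference, the relative reduction can be recomputed from the two rounded counts above; this is just the arithmetic, sketched in Python:

```python
before = 502_000  # metrics series before the PRs (rounded figure from above)
after = 443_000   # metrics series after the PRs (rounded figure from above)

reduction_pct = (before - after) / before * 100
print(f"{reduction_pct:.2f}%")
```

With these rounded inputs the result comes out a little under 12%, in line with the approximate figures quoted.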

[Screenshot: 2022-05-06 at 09:53:21]

@09jvilla
Contributor

@pracucci -- just to be sure I'm following - this is 'series count', not 'metrics count' correct?

@pracucci
Collaborator Author

@pracucci -- just to be sure I'm following - this is 'series count', not 'metrics count' correct?

Correct!

@pracucci
Collaborator Author

pracucci commented May 23, 2022

The reduction on the Alertmanager was even more impressive (-68%) in the case where there are many tenants.

4 participants