
Modify query for calculating replica count for ingesters in runbook #410

Conversation

@dimitarvdimitrov commented on Oct 18, 2021:

What this PR does: Modify query for calculating replica count for ingesters in runbook

Why: We're ingesting the metrics of our GEM cluster twice, under two different tenants. This ends up throwing off the replica-count query in the runbook.

Which issue(s) this PR fixes: n/a but relates to https://github.com/grafana/deployment_tools/pull/17784

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@dimitarvdimitrov dimitarvdimitrov requested a review from a team as a code owner October 18, 2021 14:52
```diff
@@ -565,7 +565,7 @@ How to **fix**:
 - Scale up ingesters
 - To compute the desired number of ingesters to satisfy the average samples rate you can run the following query, replacing `<namespace>` with the namespace to analyse and `<target>` with the target number of samples/sec per ingester (check out the alert threshold to see the current target):
-  sum(rate(cortex_ingester_ingested_samples_total{namespace="<namespace>"}[$__rate_interval])) / (<target> * 0.9)
+  sum by (__tenant_id__) (rate(cortex_ingester_ingested_samples_total{namespace="<namespace>"}[$__rate_interval])) / (<target> * 0.9)
```
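The runbook query divides the total ingest rate by 90% of the per-ingester target. A minimal Python sketch of that arithmetic (the function name and the sample numbers are hypothetical, not from the PR):

```python
import math

def desired_ingesters(total_samples_per_sec: float, target_per_ingester: float) -> int:
    """Desired replica count: total samples/sec divided by 90% of the
    per-ingester target (the 0.9 headroom factor in the runbook query),
    rounded up to a whole number of replicas."""
    return math.ceil(total_samples_per_sec / (target_per_ingester * 0.9))

# Hypothetical numbers: 1.2M samples/sec cluster-wide, 80e3 samples/sec target.
print(desired_ingesters(1_200_000, 80_000))  # 1_200_000 / 72_000 = 16.67 -> 17
```

With metrics duplicated under a second tenant, the un-grouped query would see twice the real rate and suggest roughly twice as many ingesters, which is what grouping by `__tenant_id__` avoids.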
@dimitarvdimitrov (Author) commented:

One could argue this change is unnecessary, because having two GEM clusters in the same namespace is unlikely to occur in real life. We ran into it when we duplicated the metrics of our cluster under a different tenant so that we'd have more load. And if we start adding `by (__tenant_id__)` to all Cortex-related queries, we'd end up modifying too many queries.

I'm inclined to keep the change here because there is no better place for it, and without it the query might lead to overprovisioning the ingesters (2x).

Collaborator commented:

What is the `__tenant_id__` label? It doesn't look like something coming from Cortex itself, so I'm wondering if it's just deployment specific. If so, I'm not sure it should be part of the OSS playbook. Or at least you could add the alternative query below the existing one and explain when it should be used.

@dimitarvdimitrov (Author) commented:

Aah, fair point - this may not be the right place for this change. The GEM playbook imports this playbook, and while following the import path I forgot that this change isn't necessarily relevant here.

GEM's Cortex adds this label, set to the ID of the tenant that originally wrote the sample.

In any case, I think the use case of two clusters running in the same namespace is a very niche one, so this change is better off as an NB in the internal playbook.

I will close this.

@dimitarvdimitrov (Author) commented:

@pracucci Do you happen to know how the number in the docs was devised (80e3)? I can see our dev ingesters performing way under their k8s limits (and requests for that matter) while processing more series per second than that threshold.
