
Modify query for calculating replica count for ingesters in runbook #410

Conversation

@dimitarvdimitrov commented on Oct 18, 2021:

What this PR does: Modify query for calculating replica count for ingesters in runbook

Why: We're ingesting the metrics of our GEM cluster twice, under two different tenants. This ends up throwing off the replica-count query in the runbook.

Which issue(s) this PR fixes: n/a but relates to https://github.com/grafana/deployment_tools/pull/17784

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@dimitarvdimitrov dimitarvdimitrov requested a review from a team as a code owner October 18, 2021 14:52
```diff
@@ -565,7 +565,7 @@ How to **fix**:
 - Scale up ingesters
 - To compute the desired number of ingesters to satisfy the average samples rate you can run the following query, replacing `<namespace>` with the namespace to analyse and `<target>` with the target number of samples/sec per ingester (check out the alert threshold to see the current target):
-  sum(rate(cortex_ingester_ingested_samples_total{namespace="<namespace>"}[$__rate_interval])) / (<target> * 0.9)
+  sum by (__tenant_id__) (rate(cortex_ingester_ingested_samples_total{namespace="<namespace>"}[$__rate_interval])) / (<target> * 0.9)
```
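The runbook query divides the total ingest rate by 90% of the per-ingester target. A minimal Python sketch of that arithmetic (the function name and the sample numbers are hypothetical, not from the PR):

```python
import math

def desired_ingesters(total_samples_per_sec: float, target_per_ingester: float) -> int:
    """Desired replica count: total samples/sec divided by 90% of the
    per-ingester target (the 0.9 headroom factor in the runbook query),
    rounded up to a whole number of replicas."""
    return math.ceil(total_samples_per_sec / (target_per_ingester * 0.9))

# Hypothetical numbers: 1.2M samples/sec cluster-wide, 80e3 samples/sec target.
print(desired_ingesters(1_200_000, 80_000))  # 1_200_000 / 72_000 = 16.67 -> 17
```

With metrics duplicated under a second tenant, the un-grouped query would see twice the real rate and suggest roughly twice as many ingesters, which is what grouping by `__tenant_id__` avoids.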
@dimitarvdimitrov (Author) commented:

One could argue this change is unnecessary, because having two GEM clusters in the same namespace is unlikely to occur in real life. We ran into it when we duplicated the metrics of our cluster under a different tenant so that we'd have more load. And if we start adding `by (__tenant_id__)` to all Cortex-related queries, we'd end up modifying too many queries.

I'm inclined to keep the change here because there is no better place for it, and without it the query might lead to overprovisioning the ingesters (2x).

Collaborator commented:

What is the `__tenant_id__` label? It doesn't look like something coming from Cortex itself, so I'm wondering if it's just deployment specific. If so, I'm not sure it should be part of the OSS playbook. Or at least you could add the alternative query below the existing one and explain when it should be used.

@dimitarvdimitrov (Author) commented:

Aah, fair point - this may not be the right place for this change. The GEM playbook imports this playbook, and while following the import path I forgot that this change isn't necessarily relevant here.

GEM's Cortex adds this label, set to the ID of the tenant that originally wrote the sample.

In any case, I think the use case of two clusters running in the same namespace is a very niche one, so this change is better off as an NB in the internal playbook.

I will close this.

@dimitarvdimitrov (Author) commented:

@pracucci Do you happen to know how the number in the docs was devised (80e3)? I can see our dev ingesters performing way under their k8s limits (and requests for that matter) while processing more series per second than that threshold.
