
Distributor autoscaling #3378

Merged
merged 11 commits into main on Nov 4, 2022

Conversation

jhesketh
Contributor

@jhesketh jhesketh commented Nov 3, 2022

What this PR does

Add support for autoscaling distributors. Lay some groundwork to make it easier to autoscale other components.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@jhesketh
Contributor Author

jhesketh commented Nov 3, 2022

This is still WIP, but I'm looking for advice on how to implement more flexible alerts for more autoscaled components.

Specifically, in operations/mimir-mixin/alerts/autoscaling.libsonnet I want to loop over $._config.autoscaling, but jsonnet is expecting an array and $._config.autoscaling is apparently an object. I'm not sure why, or what I could do about it.

@pracucci
Collaborator

pracucci commented Nov 3, 2022

This is still WIP, but I'm looking for advice on how to implement more flexible alerts for more autoscaled components.

Specifically, in operations/mimir-mixin/alerts/autoscaling.libsonnet I want to loop over $._config.autoscaling, but jsonnet is expecting an array and $._config.autoscaling is apparently an object. I'm not sure why, or what I could do about it.

Not a direct answer, but I think alerts shouldn't be based on the configured $._config.autoscaling.*enabled if possible. The reason is that if you deploy multiple Mimir clusters, alerts are global but config is per cluster (at least that's how it works at Grafana Labs), so if you have autoscaling enabled only in some clusters, the alert will not work as expected for all of them.

false
)
),
groups+: if !anyEnabled($._config.autoscaling) then [] else [
Collaborator

I would keep it simple and always configure the alert, regardless of whether autoscaling is enabled or not. If autoscaling is not enabled, then the metric kube_horizontalpodautoscaler_status_condition won't exist and the alerting rule will not evaluate. That's the approach we take for other alerts too, which are based on features you may or may not have enabled.
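For illustration, a minimal sketch of an alert keyed on that metric; it is not the exact rule added in this PR, and the threshold, 'for' duration, and annotation text are assumptions:

// Hedged sketch: if no HPA is deployed, kube_horizontalpodautoscaler_status_condition
// exposes no series, so this expression simply never fires.
{
  alert: 'MimirAutoscalerNotActive',
  expr: |||
    kube_horizontalpodautoscaler_status_condition{condition="ScalingActive", status="false"} > 0
  |||,
  'for': '15m',  // illustrative duration, not taken from this PR
  labels: { severity: 'critical' },
  annotations: {
    message: 'The HPA {{ $labels.horizontalpodautoscaler }} in namespace {{ $labels.namespace }} is not able to scale.',
  },
}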

@pracucci
Collaborator

pracucci commented Nov 3, 2022

Specifically in operations/mimir-mixin/alerts/autoscaling.libsonnet I want to loop over $._config.autoscaling but jsonnet is expecting an array and $._config.autoscaling is apparently an object.

To answer this specific question: you can use one of the functions from the stdlib to get an array from an object, for example std.objectFields(). See which ones best fit your use case here: https://jsonnet.org/ref/stdlib.html
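As a small sketch of looping over an object-shaped config with std.objectFields() (the component entries below are illustrative, not the real mixin config):

// Hypothetical config shaped as an object rather than an array.
local autoscaling = {
  querier: { hpa_name: 'keda-hpa-querier' },
  distributor: { hpa_name: 'keda-hpa-distributor' },
};

// std.objectFields() returns the field names as an array, which an array
// comprehension can then iterate over.
[
  { component: name, hpa: autoscaling[name].hpa_name }
  for name in std.objectFields(autoscaling)
]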

},
autoscaling: [
{
name: 'querier',
Collaborator

Two questions:

  1. Any specific reason why we're moving away from the previous config structure?
  2. Given you shouldn't configure it twice for the same component, have you considered having a config structure like this?
autoscaling: {
  querier: {
    enabled: true,
    hpa_name: 'keda-hpa-querier',
  }
}

Contributor Author

  1. Moving away so that I can loop over the components.
  2. I'm not sure what you mean by configuring it twice? But yes, your suggested structure would work well. It's really just moving the name field to a key, but it's much neater, so I'll try to implement that (see the sketch below).
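For illustration only (the component entries are hypothetical), promoting the name field to an object key can be done with an object comprehension:

// The array-based shape used so far.
local asArray = [
  { name: 'querier', hpa_name: 'keda-hpa-querier' },
  { name: 'distributor', hpa_name: 'keda-hpa-distributor' },
];

// The keyed shape suggested above, built by moving `name` into the key.
{
  [c.name]: { hpa_name: c.hpa_name }
  for c in asArray
}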

@jhesketh
Contributor Author

jhesketh commented Nov 3, 2022

Not a direct answer, but I think alerts shouldn't be based on the configured $._config.autoscaling.*enabled if possible. The reason is that if you deploy multiple Mimir clusters, alerts are global but config is per cluster (at least that's how it works at Grafana Labs), so if you have autoscaling enabled only in some clusters, the alert will not work as expected for all of them.

Okay. I was following the existing convention and expanding it for more components. In this case enabled means that the alert is enabled, not necessarily that autoscaling itself is actually configured.

However, from your next comment...

I would keep it simple and always configure the alert, regardless of whether autoscaling is enabled or not. If autoscaling is not enabled, then the metric kube_horizontalpodautoscaler_status_condition won't exist and the alerting rule will not evaluate. That's the approach we take for other alerts too, which are based on features you may or may not have enabled.

In that case I should drop the .enabled field. I still need $._config.autoscaling though to describe the other components.

I'll give this a go tomorrow with std.objectFields(). I did play around with some of the stdlib stuff, but thought there might be a better way to do it.

@jhesketh
Contributor Author

jhesketh commented Nov 4, 2022

FYI, the problem I was experiencing was that I missed that operations/mimir-mixin/mixin-compiled.libsonnet was overwriting the autoscaling value and thus modifying the object(s). Now that I see that path, I should be able to resolve this.
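As an aside, a hedged illustration (not the actual mixin code) of how a plain ':' override in jsonnet replaces a nested object, while '+:' merges into it; this is the kind of overwrite described above:

local base = { _config: { autoscaling: { querier: { enabled: true } } } };

{
  // Plain ':' replaces the whole autoscaling object, so 'querier' is lost.
  replaced: (base { _config: { autoscaling: { distributor: { enabled: true } } } })._config.autoscaling,

  // '+:' merges, so both 'querier' and 'distributor' survive.
  merged: (base { _config+: { autoscaling+: { distributor: { enabled: true } } } })._config.autoscaling,
}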

jhesketh and others added 3 commits November 4, 2022 18:01
This is WIP. Because the distributor scales off multiple metrics this
needs some adjusting.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Collaborator

@pracucci pracucci left a comment

Thanks for working on this! We pushed a couple of commits while doing pair code review, and left a few comments. Apart from the minor comments, everything else LGTM 👏

@jhesketh jhesketh marked this pull request as ready for review November 4, 2022 10:44
@jhesketh jhesketh requested review from osg-grafana and a team as code owners November 4, 2022 10:44
Collaborator

@pracucci pracucci left a comment

LGTM!

@pracucci pracucci enabled auto-merge (squash) November 4, 2022 10:51
@osg-grafana osg-grafana added the type/docs Improvements or additions to documentation label Nov 4, 2022
@@ -28,12 +28,15 @@

* [CHANGE] Alerts: Change `MimirSchedulerQueriesStuck` `for` time to 7 minutes to account for the time it takes for HPA to scale up. #3223
* [CHANGE] Dashboards: Removed the `Querier > Stages` panel from the `Mimir / Queries` dashboard. #3311
* [CHANGE] Configuration: The format of the `autoscaling` section of the configuration has changed to support more components. #3378
* Instead of specific config variables for each component, they are listed in a dictionary. For example, `autoscaling.querier_enabled` becomes `autoscaling.querier.enabled`.
Contributor

Suggested change
* Instead of specific config variables for each component, they are listed in a dictionary. For example, `autoscaling.querier_enabled` becomes `autoscaling.querier.enabled`.
* Instead of specific configuration variables for each component, they are listed in a dictionary. For example, `autoscaling.querier_enabled` becomes `autoscaling.querier.enabled`.

@@ -58,6 +61,7 @@
* Renaming the alertmanager's bucket name configuration from provider-specific to the new `alertmanager_storage_bucket_name` key.
* [ENHANCEMENT] Added `$._config.usageStatsConfig` to track the installation mode via the anonymous usage statistics. #3294
* [ENHANCEMENT] The query-tee node port (`$._config.query_tee_node_port`) is now optional. #3272
* [ENHANCEMENT] Add support for autoscaling distributors. #3378
Contributor

Suggested change
* [ENHANCEMENT] Add support for autoscaling distributors. #3378
* [ENHANCEMENT] Added support for autoscaling distributors. #3378

@@ -58,6 +61,7 @@
* Renaming the alertmanager's bucket name configuration from provider-specific to the new `alertmanager_storage_bucket_name` key.
* [ENHANCEMENT] Added `$._config.usageStatsConfig` to track the installation mode via the anonymous usage statistics. #3294
* [ENHANCEMENT] The query-tee node port (`$._config.query_tee_node_port`) is now optional. #3272
Contributor

Suggested change
* [ENHANCEMENT] The query-tee node port (`$._config.query_tee_node_port`) is now optional. #3272
* [ENHANCEMENT] Made the query-tee node port (`$._config.query_tee_node_port`) optional. #3272

Contributor

Out of scope, so feel free to disregard this feedback:

@@ -45,7 +47,7 @@ However, if KEDA is not running successfully, there are consequences for Mimir a
- `keda-operator` is down (not critical): changes to `ScaledObject` CRD will not be reflected to the HPA until the operator will get back online. HPA functionality is not affected.
- `keda-operator-metrics-apiserver` is down (critical): HPA is not able to fetch updated metrics and it will stop scaling the deployment until metrics will be back. The deployment (e.g. queriers) will keep working but, in case of any surge of traffic, HPA will not be able to detect it (because of a lack of metrics) and so will not scale up.

The [alert `MimirQuerierAutoscalerNotActive`]({{< relref "../../monitor-grafana-mimir/_index.md" >}}) fires if HPA is unable to scale the deployment for any reason (e.g. unable to scrape metrics from KEDA metrics API server).
The [alert `MimirAutoscalerNotActive`]({{< relref "../../monitor-grafana-mimir/_index.md" >}}) fires if HPA is unable to scale the deployment for any reason (e.g. unable to scrape metrics from KEDA metrics API server).
Contributor

Suggested change
The [alert `MimirAutoscalerNotActive`]({{< relref "../../monitor-grafana-mimir/_index.md" >}}) fires if HPA is unable to scale the deployment for any reason (e.g. unable to scrape metrics from KEDA metrics API server).
The [alert `MimirAutoscalerNotActive`]({{< relref "../../monitor-grafana-mimir/_index.md" >}}) fires if HPA is unable to scale the deployment for any reason, for example it is unable to scrape metrics from the KEDA metrics API server.

@pracucci pracucci merged commit 9090af5 into grafana:main Nov 4, 2022
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022
* Operations: Add support for autoscaling distributors

* Update documentation regarding autoscaling

* Update MimirAutoscalerNotActive alert to support more components

* Fix compiling MimirAutoscalerNotActive

* Add dashboard row for distributor autoscaling metrics

This is WIP. Because the distributor scales off multiple metrics this
needs some adjusting.

* Split the distributor autoscaling panels into two: CPU and memory

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Simplified the autoscaling alert

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* Update runbook documentation for MimirAutoscalerNotActive

* Update CHANGELOG

* Add distributor autoscaling to jsonnet tests

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
@mattmendick
Contributor

Hey @jhesketh - autoscaling distributors - that's great! I think I see that it's in a few environments, but not production quite yet (via this). If that's not true, LMK what the reality is, and if it is true, can you let me know what your thoughts are on the rollout to more environments? Thanks!

@jhesketh
Contributor Author

Hey @jhesketh - autoscaling distributors - that's great! I think I see that it's in a few environments, but not production quite yet (via this). If that's not true, LMK what the reality is, and if it is true, can you let me know what your thoughts are on the rollout to more environments? Thanks!

Hey Matt. That is correct. We have rolled it out in dev clusters, but not prod. (More detail on slack).

@mattmendick
Contributor

Ah, my apologies for the private Grafana Labs repo link!

Labels
type/docs Improvements or additions to documentation

4 participants