Distributor autoscaling #3378
This is still WIP, but I'm looking for advice on how to implement more flexible alerts for more autoscaled components. Specifically in |
Not a direct answer, but I think alerts shouldn't be based on the configured |
false
)
),
groups+: if !anyEnabled($._config.autoscaling) then [] else [
I would keep it simple and always configure the alert, regardless of whether autoscaling is enabled. If autoscaling is not enabled, the metric `kube_horizontalpodautoscaler_status_condition` won't exist and the alerting rule will never fire. That's the approach we take for other alerts too, when they are based on features you may or may not have enabled.
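A minimal sketch of what "always configure the alert" could look like in Jsonnet. The group name, expression, `for` duration, and severity below are assumptions for illustration and are not taken from this PR; only the alert name and the metric name appear in this thread.

```jsonnet
{
  // No conditional on the autoscaling config: the rule is always emitted.
  groups: [
    {
      name: 'mimir_autoscaling',  // hypothetical group name
      rules: [
        {
          alert: 'MimirAutoscalerNotActive',
          // Assumed, simplified expression. If no HPA exists, the metric has
          // no series, the expression returns nothing, and the alert stays silent.
          expr: 'kube_horizontalpodautoscaler_status_condition{condition="ScalingActive", status="false"} > 0',
          'for': '1h',  // assumed duration
          labels: { severity: 'critical' },  // assumed severity
        },
      ],
    },
  ],
}
```

The point of the design is that the rule carries no dependency on the configuration: whether autoscaling is in use is decided by whether the metric has any series.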
To answer this specific question, you can use one of the functions from the stdlib to get an array from an object, for example |
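The function the reviewer named is not preserved above. As a hedged illustration, `std.objectFields` is one standard library function that returns an object's keys as an array, which is enough to write a helper such as `anyEnabled`. The helper body and the distributor HPA name below are hypothetical.

```jsonnet
// Hypothetical config object; 'keda-hpa-distributor' is an assumed name.
local autoscaling = {
  querier: { enabled: true, hpa_name: 'keda-hpa-querier' },
  distributor: { enabled: false, hpa_name: 'keda-hpa-distributor' },
};

// Hypothetical helper: true if at least one component has autoscaling enabled.
// std.objectFields(cfg) yields the component names as an array of strings.
local anyEnabled(cfg) =
  std.length([name for name in std.objectFields(cfg) if cfg[name].enabled]) > 0;

{
  component_names: std.objectFields(autoscaling),
  any_enabled: anyEnabled(autoscaling),
}
```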
},
autoscaling: [
{
name: 'querier',
Two questions:
- Any specific reason why we're moving away from the previous config structure?
- Given you shouldn't configure it twice for the same component, have you considered having a config structure like this?
autoscaling: {
  querier: {
    enabled: true,
    hpa_name: 'keda-hpa-querier',
  }
}
- Moving away so that I can loop over the components.
- Not sure what you mean by configuring it twice? But yes, your suggested structure would work well. It's really just moving the name field to a key, but it's much neater so will try to implement that.
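A short sketch of why the object-keyed structure still supports looping: the component name moves from a `name` field to the key, and the keys can be iterated with `std.objectFields`. The output shape here is illustrative only and is not what the PR generates.

```jsonnet
// Object-keyed form suggested in the review. A component can appear only once,
// because duplicate keys are not allowed in a Jsonnet object.
local autoscaling = {
  querier: { enabled: true, hpa_name: 'keda-hpa-querier' },
};

{
  // One entry per enabled component; 'enabled_hpas' is a hypothetical field name.
  enabled_hpas: [
    { component: name, hpa_name: autoscaling[name].hpa_name }
    for name in std.objectFields(autoscaling)
    if autoscaling[name].enabled
  ],
}
```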
Okay. I was following the existing convention and expanding it for more components. In this case However from your next comment..
In that case I should drop the I'll give this a go tomorrow with |
FYI, the problem I was experiencing was that I missed that |
This is WIP. Because the distributor scales off multiple metrics this needs some adjusting.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Thanks for working on this! We pushed a couple of commits while doing a pair code review, and left a few comments. Apart from the minor comments, everything else LGTM 👏
docs/sources/operators-guide/deploy-grafana-mimir/jsonnet/configure-autoscaling.md
Co-authored-by: Marco Pracucci <marco@pracucci.com>
LGTM!
@@ -28,12 +28,15 @@
* [CHANGE] Alerts: Change `MimirSchedulerQueriesStuck` `for` time to 7 minutes to account for the time it takes for HPA to scale up. #3223
* [CHANGE] Dashboards: Removed the `Querier > Stages` panel from the `Mimir / Queries` dashboard. #3311
* [CHANGE] Configuration: The format of the `autoscaling` section of the configuration has changed to support more components. #3378
  * Instead of specific config variables for each component, they are listed in a dictionary. For example, `autoscaling.querier_enabled` becomes `autoscaling.querier.enabled`.
- * Instead of specific config variables for each component, they are listed in a dictionary. For example, `autoscaling.querier_enabled` becomes `autoscaling.querier.enabled`.
+ * Instead of specific configuration variables for each component, they are listed in a dictionary. For example, `autoscaling.querier_enabled` becomes `autoscaling.querier.enabled`.
@@ -58,6 +61,7 @@
* Renaming the alertmanager's bucket name configuration from provider-specific to the new `alertmanager_storage_bucket_name` key.
* [ENHANCEMENT] Added `$._config.usageStatsConfig` to track the installation mode via the anonymous usage statistics. #3294
* [ENHANCEMENT] The query-tee node port (`$._config.query_tee_node_port`) is now optional. #3272
* [ENHANCEMENT] Add support for autoscaling distributors. #3378
- * [ENHANCEMENT] Add support for autoscaling distributors. #3378
+ * [ENHANCEMENT] Added support for autoscaling distributors. #3378
@@ -58,6 +61,7 @@
* Renaming the alertmanager's bucket name configuration from provider-specific to the new `alertmanager_storage_bucket_name` key.
* [ENHANCEMENT] Added `$._config.usageStatsConfig` to track the installation mode via the anonymous usage statistics. #3294
* [ENHANCEMENT] The query-tee node port (`$._config.query_tee_node_port`) is now optional. #3272
- * [ENHANCEMENT] The query-tee node port (`$._config.query_tee_node_port`) is now optional. #3272
+ * [ENHANCEMENT] Made the query-tee node port (`$._config.query_tee_node_port`) optional. #3272
Out-of-scope, so feel free to throw out feedback
@@ -45,7 +47,7 @@ However, if KEDA is not running successfully, there are consequences for Mimir a
- `keda-operator` is down (not critical): changes to `ScaledObject` CRD will not be reflected to the HPA until the operator will get back online. HPA functionality is not affected.
- `keda-operator-metrics-apiserver` is down (critical): HPA is not able to fetch updated metrics and it will stop scaling the deployment until metrics will be back. The deployment (e.g. queriers) will keep working but, in case of any surge of traffic, HPA will not be able to detect it (because of a lack of metrics) and so will not scale up.

-The [alert `MimirQuerierAutoscalerNotActive`]({{< relref "../../monitor-grafana-mimir/_index.md" >}}) fires if HPA is unable to scale the deployment for any reason (e.g. unable to scrape metrics from KEDA metrics API server).
+The [alert `MimirAutoscalerNotActive`]({{< relref "../../monitor-grafana-mimir/_index.md" >}}) fires if HPA is unable to scale the deployment for any reason (e.g. unable to scrape metrics from KEDA metrics API server).
-The [alert `MimirAutoscalerNotActive`]({{< relref "../../monitor-grafana-mimir/_index.md" >}}) fires if HPA is unable to scale the deployment for any reason (e.g. unable to scrape metrics from KEDA metrics API server).
+The [alert `MimirAutoscalerNotActive`]({{< relref "../../monitor-grafana-mimir/_index.md" >}}) fires if HPA is unable to scale the deployment for any reason, for example it is unable to scrape metrics from the KEDA metrics API server.
* Operations: Add support for autoscaling distributors
* Update documentation regarding autoscaling
* Update MimirAutoscalerNotActive alert to support more components
* Fix compiling MimirAutoscalerNotActive
* Add dashboard row for distributor autoscaling metrics

This is WIP. Because the distributor scales off multiple metrics this needs some adjusting.

* Split the distributor autoscaling panels into two: CPU and memory

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Simplified the autoscaling alert

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* Update runbook documentation for MimirAutoscalerNotActive
* Update CHANGELOG
* Add distributor autoscaling to jsonnet tests

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Hey Matt. That is correct. We have rolled it out in dev clusters, but not prod. (More detail on Slack.)
Ah, my apologies for the private Grafana Labs repo link!
What this PR does
Add support for autoscaling distributors. Lay some groundwork to help make other components autoscale.
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`