
How can I autoscale Prometheus shards using HPA? #4946

Closed
SHEELE41 opened this issue Aug 1, 2022 · 16 comments · Fixed by #5962

Comments

@SHEELE41

SHEELE41 commented Aug 1, 2022

What did you do?
I have already read #3130 and #2590 and saw the discussion about autoscaling Prometheus shards via the HPA.

If you need a solution quickly, you can already use additional relabeling rules on your ServiceMonitor via the hashmod action, and create multiple ServiceMonitors per "shard". Your use case makes a lot of sense, I'd like to think it through a little bit further, and arrive at a solution, that would allow us to eventually autoscale sharding based on the metric ingestion (I'm thinking a general purpose way, where a Prometheus object would become a shard and maybe a ShardedPrometheus object that orchestrates these, and can be autoscaled via the HPA). What I'm saying is, maybe the sharding decision should be configured in the Prometheus object ultimately instead of the ServiceMonitor (where it's already possible albeit a little manual today).
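For reference, the manual approach mentioned in that quote (hashmod relabelings plus one ServiceMonitor per "shard") would look roughly like the sketch below; the names and the modulus are illustrative, and each additional ServiceMonitor would keep a different hash bucket:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-targets-shard-0   # illustrative; one ServiceMonitor per "shard"
spec:
  selector:
    matchLabels:
      app: example-targets        # illustrative target selector
  endpoints:
    - port: metrics
      relabelings:
        # hash each target address into one of 3 buckets...
        - action: hashmod
          sourceLabels: [__address__]
          modulus: 3
          targetLabel: __tmp_hash
        # ...and keep only bucket 0 for this ServiceMonitor
        - action: keep
          sourceLabels: [__tmp_hash]
          regex: "0"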

I want to increase only the 'shards' value, not the 'replicas' value, when the overall CPU usage of the Prometheus shards (pods) increases.
(Because I want each Prometheus instance to scrape metrics from targets mutually exclusively.)

kind: Prometheus
...
spec:
  # replicas is fixed
  replicas: 1
  # horizontal scaling by increasing shards
  shards: 3, 4, 5...

So I used kubernetes-sigs/metrics-server for CPU-usage-based scaling and wrote an HPA manifest with the 'Prometheus' CRD as the target.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: prometheus-autoscaler
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    name: prometheus
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

However, this HPA could not retrieve the CPU usage of its target, so the target CPU usage remained unknown.

$ kubectl get hpa prometheus-autoscaler
... TARGETS ...
... <unknown>/50% ...

I realized this makes sense, because the 'Prometheus' CRD does not implement the label selector in its /scale subresource.

$ kubectl describe hpa prometheus-autoscaler
Events:
  Type     Reason                        Age                     From                       Message
  ----     ------                        ----                    ----                       -------
  Warning  FailedComputeMetricsReplicas  5m2s (x12 over 7m52s)   horizontal-pod-autoscaler  selector is required
  Warning  SelectorRequired              2m43s (x21 over 7m52s)  horizontal-pod-autoscaler  selector is required

In Prometheus-operator, a StatefulSet is created per shard, and by running kubectl edit I confirmed that each shard's ID is pre-written in its StatefulSet manifest.

(In my case... shards=3 & replicas=1)

NAME                                          READY   STATUS    RESTARTS   AGE
pod/prometheus-prometheus-0                   2/2     Running   0          3h28m
pod/prometheus-prometheus-shard-1-0           2/2     Running   0          3h28m
pod/prometheus-prometheus-shard-2-0           2/2     Running   0          3h28m

NAME                                             READY   AGE
statefulset.apps/prometheus-prometheus           1/1     3d3h
statefulset.apps/prometheus-prometheus-shard-1   1/1     6h12m
statefulset.apps/prometheus-prometheus-shard-2   1/1     6h12m

NAME                                          VERSION   REPLICAS   AGE
prometheus.monitoring.coreos.com/prometheus   v2.33.5   1          3d3h

So even if I write the HPA manifest file to target each StatefulSet, it won't work as I expected.

In conclusion,

  1. How can I get the overall CPU usage of the Prometheus shards via kubernetes-sigs/metrics-server?
  2. How can I autoscale Prometheus shards using HPA by increasing/decreasing only the 'shards' value without touching the 'replicas' value? (i.e., by increasing the number of StatefulSets)

FYI, I'm using Prometheus to record external servers' metrics, not my Kubernetes cluster's metrics.

Environment

  • Prometheus Operator version:

    v0.58.0

  • Kubernetes version information:

    v1.24

@simonpasquier
Contributor

Thanks for the detailed report!

How can I get the overall CPU usage of the Prometheus shards via kubernetes-sigs/metrics-server?

Hmm I would need to look more into the details of how HPA works.

How can I autoscale Prometheus shards using HPA by increasing/decreasing only the 'shards' value without touching the 'replicas' value? (i.e., by increasing the number of StatefulSets)

This is a very good question. Even if #4735 implements the scale subresource, it's probably not doing what we want since, as you noted, it's going to add more replicas instead of more shards.

To be honest, we need to review your use case in more depth. And I think it will become even more pressing with the agent CRD...

@simonpasquier
Contributor

simonpasquier commented Aug 3, 2022

We've been discussing it offline with @slashpai and came to the conclusion that #4735 had to be reverted for now (see #4952).

The plan is rather to re-implement the scale subresource with the following changes:

  • scale the number of shards rather than the number of replicas.
  • implement a status.selector field that can be used by HPA.

@SHEELE41 does it make sense from your point of view?

@SHEELE41
Author

SHEELE41 commented Aug 3, 2022

Thank you for answering, @simonpasquier.

  • scale the number of shards rather than the number of replicas.
  • implement a status.selector field that can be used by HPA.

That's exactly what I wanted!

My opinion
I checked #4735 and #4952.

As you said in #4952, I also thought a status.selector field should be added to the Prometheus CRD and enabled via +kubebuilder:subresource:scale:selectorpath=.status.selector.

This may solve the problem that the HPA can't get metrics, but, as you know, it still won't be enough to make the HPA scale the number of shards rather than the number of replicas.

However, I don't think even scaling the number of replicas will work properly, because the Prometheus CR's spec.replicas and status.replicas have to differ (status.replicas = spec.replicas * spec.shards).

$ kubectl get --raw /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheuses/prometheus/scale
{"kind":"Scale","apiVersion":"autoscaling/v1",    ...    "spec":{"replicas":1},"status":{"replicas":3}}

The HPA will try to make spec.replicas and status.replicas identical, which will mess up the resources.

Additionally, I think the spec.shards scaling feature and the spec.replicas scaling feature should be clearly separated.

If not, users would have to create an HPA per StatefulSet to scale the number of replicas, but each such HPA would only scale the pods in its target StatefulSet (assuming only the spec.shards scaling feature is implemented by default).

My use case

  • There are many Pushgateway (scrape target) pods that are being autoscaled via HPA on my k8s cluster.
  • There are several Prometheus pods too, but they aren't being autoscaled.
  • I'm using a remote TSDB, so I can run Prometheus as stateless.
  • The Prometheus pods are generated via the 'shards' field of the Prometheus CR so that each Prometheus instance scrapes metrics from the Pushgateway pods without overlapping with other Prometheus instances.
  • In this situation, I want to autoscale the Prometheus pods via the HPA (see the sketch after this list).
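A minimal sketch of this setup (the remote-write URL, label selector, and counts below are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1   # fixed; only 'shards' should scale out
  shards: 3     # each shard scrapes a disjoint subset of the Pushgateway targets
  remoteWrite:
    - url: http://remote-tsdb.example.com/api/v1/write   # illustrative remote TSDB endpoint
  serviceMonitorSelector:
    matchLabels:
      app: pushgateway   # illustrative; selects the Pushgateway ServiceMonitor(s)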

[Screenshot: 2022-08-01 10:47 AM]

@simonpasquier
Contributor

Additionally, I think the spec.shards scaling feature and the spec.replicas scaling feature should be clearly separated.

My idea is that scaling should only work for shards. There's little incentive to scale the number of replicas IMHO: increasing this number is never going to spread the load.

I think it could work if we add a status.shards field to the Prometheus status?

// +kubebuilder:subresource:scale:specpath=.spec.shards,statuspath=.status.shards,selectorpath=.status.selector
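For reference, that marker would generate a scale subresource stanza in the Prometheus CRD roughly like this (a sketch of the generated manifest, not the current state):

subresources:
  scale:
    specReplicasPath: .spec.shards
    statusReplicasPath: .status.shards
    labelSelectorPath: .status.selector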

@SHEELE41
Author

SHEELE41 commented Aug 4, 2022

My idea is that scaling should only work for shards. There's little incentive to scale the number of replicas IMHO: increasing this number is never going to spread the load.

Oh, I see.

Then I think we can deal with this problem in the way you suggested.

By the way, do you have any plan to make the Prometheus pods (not the CR) carry the same labels specified in the Prometheus CR?

  • Case 1. Prometheus pods inherit labels from metadata.labels of Prometheus CR.
  • Case 2. Prometheus pods inherit labels from spec.template.metadata.labels of Prometheus CR. (as if Prometheus CR behaves like a deployment.apps)

I think both cases would require changes in the makeStatefulSetSpec func for the PodTemplateSpec.

In general, users would then be able to scale by just adding app.kubernetes.io/name: prometheus (or app.kubernetes.io/instance: $NAME_OF_PROMETHEUS_CR) to spec.selector of the Prometheus CR.

[/pkg/prometheus/statefulset.go:683]

	// In cases where an existing selector label is modified, or a new one is added, new sts cannot match existing pods.
	// We should try to avoid removing such immutable fields whenever possible since doing
	// so forces us to enter the 'recreate cycle' and can potentially lead to downtime.
	// The requirement to make a change here should be carefully evaluated.
	podSelectorLabels := map[string]string{
		"app.kubernetes.io/name":       "prometheus",
		"app.kubernetes.io/managed-by": "prometheus-operator",
		"app.kubernetes.io/instance":   p.Name,
		"prometheus":                   p.Name,
		shardLabelName:                 fmt.Sprintf("%d", shard),
		prometheusNameLabelName:        p.Name,
	}

However, most users will not know what value the Prometheus CR's spec.selector should have unless they run kubectl describe pod.

Therefore, I wonder if you have a plan to provide this feature.

I really appreciate your help. :)

@simonpasquier
Contributor

However, most users will not know what value the Prometheus CR's spec.selector should have unless they run kubectl describe pod.

Therefore, I wonder if you have a plan to provide this feature.

We don't want to expose a spec.selector field that users would have to specify. Instead we need a status.selector field which would be provided by the operator and would select all pods associated with the given Prometheus resource (as described here).

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  replicas: 1
  shards: 2
status:
  selector: |-
    app.kubernetes.io/name in(prometheus), operator.prometheus.io/name in(example)
  replicas: 1
  shards: 2
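With such a selector in place, a CPU-based HPA could drive the shard count through the scale subresource. A sketch in autoscaling/v2 form, assuming the subresource change above lands (names are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prometheus-shard-autoscaler   # illustrative
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    name: example
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50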

What do you think?

@SHEELE41
Author

SHEELE41 commented Aug 4, 2022

Oh, okay!
That sounds good to me!

Is there something I can help you with? :)

@slashpai
Contributor

slashpai commented Sep 4, 2022

I have been working on this and will try to submit a patch as soon as possible :)

@SHEELE41
Author

SHEELE41 commented Sep 6, 2022

Well noted :)
Thanks for your help, @slashpai

@diranged

diranged commented Sep 8, 2022

Glad you all are looking into this - just chiming in with our support. We were just looking into using Keda (https://keda.sh/docs/2.8/concepts/scaling-deployments/#scaling-of-custom-resources) to scale the Prometheus custom resource's shards as well...
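For anyone exploring the Keda route, the ScaledObject would point at the Prometheus resource itself, which also depends on the /scale subresource discussed above. A sketch under that assumption (names and threshold are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-shards   # illustrative
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    name: prometheus
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "50"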

@tlorreyte
Contributor

Any news on that?

@slashpai
Contributor

@tlorreyte Unfortunately I didn't get much time to look further into this. Would you like to contribute this change?

@Migueljfs

@slashpai did you manage to work on this further?

I don't mind helping; is there a branch you are working on?

@slashpai
Contributor

@Migueljfs Please feel free to work on the issue. I didn't get enough time to work on this.

@rafilkmp3

It would be amazing to be able to make a shard running in us-east-1a scrape targets only in its own AZ; this would greatly reduce data transfer between AZs.

@ArthurSens
Member

It would be amazing to be able to make a shard running in us-east-1a scrape targets only in its own AZ; this would greatly reduce data transfer between AZs.

Indeed :)

This is one of the ideas we have in #5495; it would be awesome to start discussions at some point!
