
How can I autoscale Prometheus shards using HPA? #4946

Closed
SHEELE41 opened this issue Aug 1, 2022 · 16 comments · Fixed by #5962

Comments

@SHEELE41

SHEELE41 commented Aug 1, 2022

What did you do?
I have already read #3130 and #2590 and saw the discussion about autoscaling Prometheus shards via the HPA.

If you need a solution quickly, you can already use additional relabeling rules on your ServiceMonitor via the hashmod action, and create multiple ServiceMonitors per "shard". Your use case makes a lot of sense, I'd like to think it through a little bit further, and arrive at a solution, that would allow us to eventually autoscale sharding based on the metric ingestion (I'm thinking a general purpose way, where a Prometheus object would become a shard and maybe a ShardedPrometheus object that orchestrates these, and can be autoscaled via the HPA). What I'm saying is, maybe the sharding decision should be configured in the Prometheus object ultimately instead of the ServiceMonitor (where it's already possible albeit a little manual today).
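For reference, the manual approach mentioned in that quote (hashmod relabelings plus one ServiceMonitor per "shard") would look roughly like the sketch below; the names and the modulus are illustrative, and each additional ServiceMonitor would keep a different hash bucket:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-targets-shard-0   # illustrative; one ServiceMonitor per "shard"
spec:
  selector:
    matchLabels:
      app: example-targets        # illustrative target selector
  endpoints:
    - port: metrics
      relabelings:
        # hash each target address into one of 3 buckets...
        - action: hashmod
          sourceLabels: [__address__]
          modulus: 3
          targetLabel: __tmp_hash
        # ...and keep only bucket 0 for this ServiceMonitor
        - action: keep
          sourceLabels: [__tmp_hash]
          regex: "0"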

I want to increase only the 'shards' value, not the 'replicas' value, when the overall CPU usage of the Prometheus shards (pods) increases.
(Because I want each Prometheus instance to scrape metrics from targets mutually exclusively.)

kind: Prometheus
...
spec:
  # replicas is fixed
  replicas: 1
  # horizontal scaling by increasing shards
  shards: 3, 4, 5...

So I used kubernetes-sigs/metrics-server for CPU-usage-based scaling and wrote an HPA manifest with the 'Prometheus' CRD as the target.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: prometheus-autoscaler
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    name: prometheus
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

However, this HPA could not retrieve the CPU usage of its target, so the target CPU usage remained unknown.

$ kubectl get hpa prometheus-autoscaler
... TARGETS ...
... <unknown>/50% ...

I realized this makes sense, because the 'Prometheus' CRD does not implement the label selector in its /scale subresource.

$ kubectl describe hpa prometheus-autoscaler
Events:
  Type     Reason                        Age                     From                       Message
  ----     ------                        ----                    ----                       -------
  Warning  FailedComputeMetricsReplicas  5m2s (x12 over 7m52s)   horizontal-pod-autoscaler  selector is required
  Warning  SelectorRequired              2m43s (x21 over 7m52s)  horizontal-pod-autoscaler  selector is required

In Prometheus-operator, a StatefulSet is created per shard, and by running kubectl edit I confirmed that each shard's ID is pre-written in its StatefulSet manifest.

(In my case... shards=3 & replicas=1)

NAME                                          READY   STATUS    RESTARTS   AGE
pod/prometheus-prometheus-0                   2/2     Running   0          3h28m
pod/prometheus-prometheus-shard-1-0           2/2     Running   0          3h28m
pod/prometheus-prometheus-shard-2-0           2/2     Running   0          3h28m

NAME                                             READY   AGE
statefulset.apps/prometheus-prometheus           1/1     3d3h
statefulset.apps/prometheus-prometheus-shard-1   1/1     6h12m
statefulset.apps/prometheus-prometheus-shard-2   1/1     6h12m

NAME                                          VERSION   REPLICAS   AGE
prometheus.monitoring.coreos.com/prometheus   v2.33.5   1          3d3h

So even if I write the HPA manifest file to target each StatefulSet, it won't work as I expected.

In conclusion,

  1. How can I get the overall CPU usage of the Prometheus shards via kubernetes-sigs/metrics-server?
  2. How can I autoscale Prometheus shards using HPA by increasing/decreasing only the 'shards' value without touching the 'replicas' value? (i.e., by increasing the number of StatefulSets)

FYI, I'm using Prometheus to record external servers' metrics, not my Kubernetes cluster's metrics.

Environment

  • Prometheus Operator version:

    v0.58.0

  • Kubernetes version information:

    v1.24

@simonpasquier
Contributor

Thanks for the detailed report!

How can I get the overall CPU usage of the Prometheus shards via kubernetes-sigs/metrics-server?

Hmm I would need to look more into the details of how HPA works.

How can I autoscale Prometheus shards using HPA by increasing/decreasing only the 'shards' value without touching the 'replicas' value? (i.e., by increasing the number of StatefulSets)

This is a very good question. Even if #4735 implements the scale subresource, it's probably not doing what we want since, as you noted, it's going to add more replicas instead of more shards.

To be honest, we need to review your use case in more depth. And I think it will become even more pressing with the agent CRD...

@simonpasquier
Contributor

simonpasquier commented Aug 3, 2022

We've been discussing it offline with @slashpai and came to the conclusion that #4735 had to be reverted for now (see #4952).

The plan is rather to re-implement the scale subresource with the following changes:

  • scale the number of shards rather than the number of replicas.
  • implement a status.selector field that can be used by HPA.

@SHEELE41 does it make sense from your point of view?

@SHEELE41
Author

SHEELE41 commented Aug 3, 2022

Thank you for answering, @simonpasquier.

  • scale the number of shards rather than the number of replicas.
  • implement a status.selector field that can be used by HPA.

That's exactly what I wanted!

My opinion
I checked #4735 and #4952.

As you said in #4952, I also thought a status.selector field should be added to the Prometheus CRD and enabled via +kubebuilder:subresource:scale:selectorpath=.status.selector.

This may solve the problem that the HPA can't get metrics, but, as you know, it still won't be enough to make the HPA scale the number of shards rather than the number of replicas.

However, I don't think even scaling the number of replicas will work properly, because the Prometheus CR's spec.replicas and status.replicas have to differ (status.replicas = spec.replicas * spec.shards).

$ kubectl get --raw /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheuses/prometheus/scale
{"kind":"Scale","apiVersion":"autoscaling/v1",    ...    "spec":{"replicas":1},"status":{"replicas":3}}

The HPA will try to make spec.replicas and status.replicas identical, which will mess up the resources.

Additionally, I think the spec.shards scaling feature and the spec.replicas scaling feature should be clearly separated.

If not, users would have to create an HPA per StatefulSet to scale the number of replicas, but each such HPA would only scale the pods in its target StatefulSet (assuming only the spec.shards scaling feature is implemented by default).

My use case

  • There are many Pushgateway (scrape target) pods that are being autoscaled via HPA on my k8s cluster.
  • There are several Prometheus pods too, but they aren't being autoscaled.
  • I'm using a remote TSDB, so I can run Prometheus as stateless.
  • The Prometheus pods are generated via the 'shards' field of the Prometheus CR so that each Prometheus instance scrapes metrics from the Pushgateway pods without overlapping with other Prometheus instances.
  • In this situation, I want to autoscale the Prometheus pods via the HPA (see the sketch after this list).
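A minimal sketch of this setup (the remote-write URL, label selector, and counts below are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1   # fixed; only 'shards' should scale out
  shards: 3     # each shard scrapes a disjoint subset of the Pushgateway targets
  remoteWrite:
    - url: http://remote-tsdb.example.com/api/v1/write   # illustrative remote TSDB endpoint
  serviceMonitorSelector:
    matchLabels:
      app: pushgateway   # illustrative; selects the Pushgateway ServiceMonitor(s)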

[Screenshot: 2022-08-01 10:47 AM]

@simonpasquier
Contributor

Additionally, I think the spec.shards scaling feature and the spec.replicas scaling feature should be clearly separated.

My idea is that scaling should only work for shards. There's little incentive to scale the number of replicas IMHO: increasing this number is never going to spread the load.

I think it could work if we add a status.shards field to the Prometheus status?

// +kubebuilder:subresource:scale:specpath=.spec.shards,statuspath=.status.shards,selectorpath=.status.selector
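For reference, that marker would generate a scale subresource stanza in the Prometheus CRD roughly like this (a sketch of the generated manifest, not the current state):

subresources:
  scale:
    specReplicasPath: .spec.shards
    statusReplicasPath: .status.shards
    labelSelectorPath: .status.selector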

@SHEELE41
Author

SHEELE41 commented Aug 4, 2022

My idea is that scaling should only work for shards. There's little incentive to scale the number of replicas IMHO: increasing this number is never going to spread the load.

Oh, I see.

Then I think we can deal with this problem in the way you suggested.

By the way, do you have any plan to make the Prometheus pods (not the CR) carry the same labels specified in the Prometheus CR?

  • Case 1. Prometheus pods inherit labels from metadata.labels of Prometheus CR.
  • Case 2. Prometheus pods inherit labels from spec.template.metadata.labels of Prometheus CR. (as if Prometheus CR behaves like a deployment.apps)

I think both cases would require changes in the makeStatefulSetSpec func for the PodTemplateSpec.

In general, users would then be able to scale by just adding app.kubernetes.io/name: prometheus (or app.kubernetes.io/instance: $NAME_OF_PROMETHEUS_CR) to spec.selector of the Prometheus CR.

[/pkg/prometheus/statefulset.go:683]

	// In cases where an existing selector label is modified, or a new one is added, new sts cannot match existing pods.
	// We should try to avoid removing such immutable fields whenever possible since doing
	// so forces us to enter the 'recreate cycle' and can potentially lead to downtime.
	// The requirement to make a change here should be carefully evaluated.
	podSelectorLabels := map[string]string{
		"app.kubernetes.io/name":       "prometheus",
		"app.kubernetes.io/managed-by": "prometheus-operator",
		"app.kubernetes.io/instance":   p.Name,
		"prometheus":                   p.Name,
		shardLabelName:                 fmt.Sprintf("%d", shard),
		prometheusNameLabelName:        p.Name,
	}

However, most users will not know what value the Prometheus CR's spec.selector should have unless they run kubectl describe pod.

Therefore, I wonder if you have a plan to provide this feature.

I really appreciate your help. :)

@simonpasquier
Contributor

However, most users will not know what value the Prometheus CR's spec.selector should have unless they run kubectl describe pod.

Therefore, I wonder if you have a plan to provide this feature.

We don't want to expose a spec.selector field that users would have to specify. Instead we need a status.selector field which would be provided by the operator and would select all pods associated with the given Prometheus resource (as described here).

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  replicas: 1
  shards: 2
status:
  selector: |-
    app.kubernetes.io/name in(prometheus), operator.prometheus.io/name in(example)
  replicas: 1
  shards: 2
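With such a selector in place, a CPU-based HPA could drive the shard count through the scale subresource. A sketch in autoscaling/v2 form, assuming the subresource change above lands (names are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prometheus-shard-autoscaler   # illustrative
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    name: example
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50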

What do you think?

@SHEELE41
Author

SHEELE41 commented Aug 4, 2022

Oh, okay!
That sounds good to me!

Is there something I can help you with? :)

@slashpai
Contributor

slashpai commented Sep 4, 2022

I have been working on this and will try to submit a patch as soon as possible :)

@SHEELE41
Author

SHEELE41 commented Sep 6, 2022

Well noted :)
Thanks for your help, @slashpai

@diranged

diranged commented Sep 8, 2022

Glad you all are looking into this - just chiming in with our support. We were just looking into using Keda (https://keda.sh/docs/2.8/concepts/scaling-deployments/#scaling-of-custom-resources) to scale the Prometheus custom resource's shards as well...
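For anyone exploring the Keda route, the ScaledObject would point at the Prometheus resource itself, which also depends on the /scale subresource discussed above. A sketch under that assumption (names and threshold are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-shards   # illustrative
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    name: prometheus
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "50"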

@tlorreyte
Contributor

Any news on that?

@slashpai
Contributor

@tlorreyte Unfortunately I didn't get much time to look further into this. Would you like to contribute this change?

@Migueljfs

@slashpai did you manage to work on this further?

I don't mind helping; is there a branch you are working on?

@slashpai
Contributor

@Migueljfs Please feel free to work on the issue. I didn't get enough time to work on this.

@rafilkmp3

It would be amazing to be able to make a shard running in us-east-1a scrape targets only in its own AZ; this would greatly reduce data transfer between AZs.

@ArthurSens
Member

It would be amazing to be able to make a shard running in us-east-1a scrape targets only in its own AZ; this would greatly reduce data transfer between AZs.

Indeed :)

This is one of the ideas we have in #5495; it would be awesome to start discussions at some point!
