
High CPU usage (32 vCPUs) - looks due to targets discovery in K8s #8014

Closed
lorenzo-biava opened this issue Oct 6, 2020 · 69 comments

@lorenzo-biava

lorenzo-biava commented Oct 6, 2020

What did you do?

We're monitoring a Kubernetes cluster of about 400 nodes and 4500 pods with a single Prometheus instance with 32 vCPUs (almost fully utilized, while memory hovers between 40 and 50 Gi). The setup uses the Prometheus Operator and most of the targets come from ServiceMonitor definitions (that shouldn't be too relevant for the issue though).
There are about 130 target pools, a few of which each resolve to a few hundred pods to scrape (a handful can have a couple of thousand pods).
Judging by the CPU profiling graph, it looks like most of the CPU is used to update those target pools.
pprof.prometheus.samples.cpu.005.pb.gz

prof-prod1c-20201001

EDIT: We're experiencing the same issue in another cluster with far fewer total Pods (~1500) but far more target pools (~450).

What did you expect to see?

Not exactly sure what overall CPU usage to expect for such a load, but definitely not >60% of 32 vCPUs for target discovery alone.

In case this usage is expected (and provided it is indeed coming from target discovery), I would expect to be able to set a custom interval for target updates to tune this behavior, or some other way to reduce the CPU footprint.

What did you see instead? Under which circumstances?

32 vCPUs (almost fully utilized), >60% of which seems to be related to targets discovery.

I see about 80 such pools taking more than 5 seconds to get synced (varying between 4 and 8 seconds).
If my understanding is correct, the sync is executed every 5 seconds (ticker := time.NewTicker(5 * time.Second)).

Environment

  • System information:

Linux 4.15.0-1093-azure x86_64

  • Prometheus version:

prometheus, version 2.20.1 (branch: HEAD, revision: 983ebb4)
build user: root@7cbd4d1c15e0
build date: 20200805-17:26:58
go version: go1.14.6

  • Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    cluster: prod1c
    prometheus: monitoring/prometheus-operator-prometheus
    prometheus_replica: prometheus-prometheus-operator-prometheus-0
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    scheme: http
    path_prefix: /
    timeout: 10s
    api_version: v1
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: prometheus-operator-alertmanager
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
rule_files:
- /etc/prometheus/rules/prometheus-prometheus-operator-prometheus-rulefiles-0/*.yaml
- /etc/prometheus/rules/prometheus-prometheus-operator-prometheus-rulefiles-1/*.yaml
scrape_configs:
- job_name: asraas-prod/ambassador-asraas-prod/0
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - asraas-prod
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_service]
    separator: ;
    regex: ambassador-admin
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: ambassador-admin
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: ambassador-admin
    action: replace
- job_name: asraas-prod/bofa-eng-usa-400-krypton/0
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - asraas-prod
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: krypton
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_release]
    separator: ;
    regex: bofa-eng-usa-400
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: kr-svc-http
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: kr-svc-http
    action: replace
- job_name: asraas-prod/bofa-eng-usa-400-krypton/1
  honor_timestamps: true
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - asraas-prod
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: krypton
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_release]
    separator: ;
    regex: bofa-eng-usa-400
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: kr-fluentd-metrics-port
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: kr-fluentd-metrics-port
    action: replace
[...]

PS: I can provide the full configuration if that's helpful, though it's quite a bit longer.

@lorenzo-biava
Author

Removing a few jobs that weren't scraping any targets (~20) did actually reduce the CPU load significantly (~3 vCPU)
image

@roidelapluie
Member

roidelapluie commented Oct 8, 2020

Can we have the startup logs (if possible loglevel debug)?

@lorenzo-biava
Author

Here's the log from the past few days, including startup (I'm seeing a lot of K8s API errors, which I believe are related to transient instability of that managed cluster).
prod1c-prom-logs.zip

This is a fresh start with debug level.
prod1c-prom-logs-debug.zip

@roidelapluie
Member

roidelapluie commented Oct 8, 2020

You are opening 28 different Kubernetes connections. You could try to reuse the exact same service discovery config and apply relabel_configs per job to select which targets to monitor.

cat prod1c-prom-logs-debug.txt |grep subs |grep kuber -c
28

@roidelapluie
Member

Thanks for the logs anyway, I will try to have a look at the code as well.

@lorenzo-biava
Author

@roidelapluie thanks for the suggestion; can you point me to an example of such a config? I then need to check whether that's possible via PromOp ServiceMonitors or something similar...
I know the overall config could be way more efficient, but we're offering the Prometheus environment to multiple teams, each of which should be able to deploy the configuration it needs.

@brian-brazil
Contributor

Basically if the bit under kubernetes_sd_configs: is identical then we'll only instantiate one k8s discovery.
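For illustration, a minimal sketch of what that could look like (job names, namespaces and label values here are hypothetical): both jobs share an identical kubernetes_sd_configs section with no namespace restriction, so only one Kubernetes discovery is instantiated, and each job then keeps only its own targets via relabel_configs.

scrape_configs:
- job_name: team-a/app-a
  kubernetes_sd_configs:
  - role: endpoints                 # identical SD section across jobs -> one shared discovery
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    regex: team-a-prod
    action: keep
  - source_labels: [__meta_kubernetes_service_label_app]
    regex: app-a
    action: keep
- job_name: team-b/app-b
  kubernetes_sd_configs:
  - role: endpoints                 # same section again
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    regex: team-b-prod
    action: keep
  - source_labels: [__meta_kubernetes_service_label_app]
    regex: app-b
    action: keep

The trade-off, as the rest of this thread discusses, is that every job now relabels the full cluster-wide target list instead of a per-namespace subset.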

@lorenzo-biava
Author

So basically I need to get rid of the namespace in the kubernetes_sd_configs and add a relabel_configs to filter the namespace.

I believe that's not (currently) possible with ServiceMonitors, not even with the more general matchExpressions (which is limited to the Service's labels:
https://github.com/prometheus-operator/prometheus-operator/blob/v0.42.1/pkg/prometheus/promcfg.go#L911).

I'll ask around and possibly reach out to PromOp to see if there are ways to do that. However, having the namespace filter in the kubernetes_sd_configs seemed more logical, and I was under the impression it would also be more efficient (not having to filter among all of the cluster's pods).
I still hope you can find some other optimizations in the discovery code or Prometheus-wide settings to leverage 😃

@brian-brazil
Contributor

What's more efficient would depend on how many namespaces there are and what proportion of them this Prometheus is interested in. I'd expect specifying a namespace to always be more efficient, though, so if this is the issue it smells like a performance problem in the k8s client code somewhere, as the namespace filtering isn't happening properly on the server side.

@djsly

djsly commented Oct 8, 2020

@brian-brazil we have

❯ k get ns --no-headers | wc -l
     112

and

❯ k get servicemonitor -A --no-headers  | awk '{ print $1}' | sort | uniq -c | sort -r
 118 asraas-stage
  87 asraas-qa
  54 gatekeeper-qa
  29 gatekeeper-dev
  24 mix-dev
  23 monitoring
  23 mix-stage
  23 mix-qa
  18 nluaas-stage
  16 gatekeeper2-dev
  16 asraas-dev
  15 nluaas-qa
   9 gatekeeper-stage
   8 fabric-stage
   8 fabric-qa
   8 fabric-dev
   8 fabric-ctp-dev
   7 fabric-ctp-qa
   6 ttsaas-dev
   6 nluaas-ctp-dev
   6 dlgaas-dev
   5 nluaas-dev
   4 ttsaas-stage
   4 svc-ps-dev
   4 nluaas-ctp-qa
   4 global-auth-stage
   4 global-auth-qa
   4 global-auth-dev
   4 global-auth-ctp-qa
   4 global-auth-ctp-dev
   4 dlgaas-qa
   3 ttsaas-qa
   3 mixidp-auth-dev
   3 media-manager-dev
   3 dlgaas-stage
   2 mixidp-auth-stage
   2 mixidp-auth-qa
   1 xarch-dev
   1 ingress-controllers
   1 cert-manager

@djsly

djsly commented Oct 24, 2020

Any insights here would be helpful. We are getting to 500 nodes with >2000 pods and right now we would need >32 CPUs for Prometheus.

@chancez

chancez commented Nov 7, 2020

What's more efficient would depend on how many namespaces there are and what proportion of them this Prometheus is interested in. I'd expect specifying a namespace to always be more efficient, though, so if this is the issue it smells like a performance problem in the k8s client code somewhere, as the namespace filtering isn't happening properly on the server side.

There is no namespace-level filtering on the server side that I'm aware of. There are two ways you can implement it: one 'watch all namespaces' watcher with client-side filtering, or one watch per namespace (with HTTP/2 that can still be a single connection if you reuse the client, I believe).
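In Prometheus configuration terms, those two options roughly correspond to the following shapes (namespace names are hypothetical):

# Option 1: one cluster-wide watch, with targets filtered later via relabel_configs
kubernetes_sd_configs:
- role: endpoints

# Option 2: one watch per listed namespace
kubernetes_sd_configs:
- role: endpoints
  namespaces:
    names:
    - team-a-prod
    - team-b-prod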

@lorenzo-biava
Author

lorenzo-biava commented Nov 12, 2020

So we managed to merge together a lot of ServiceMonitors. Even though they were all scraping the same namespace, we got an impressive reduction in CPU usage (to less than half; see below).

I think this indicates that the number of connections to Kubernetes (which should be one per namespace in this scenario, as suggested previously) might not be the primary contributor, whereas the sheer number of jobs that use K8s service discovery is.

Just to reiterate: is my assumption correct that the service discovery for each target pool runs every 5 seconds (see here)?

image

@brian-brazil
Contributor

Just to reiterate: is my assumption correct that the service discovery for each target pool runs every 5 seconds (see here)?

No, that's a throttle so we don't process updates from an SD more often than every 5s. Processing updates is considered to be cheap, but not that cheap.

@w4rgrum

w4rgrum commented Nov 24, 2020

We are currently facing the same issue on our Prometheus instances: we have ~30 kube SD jobs (pod role) that are not constrained by namespaces, there are a lot of pods running on the platform (~5K), and the CPU usage for those instances is abnormally high.
Having a look at the discovery page of the UI, we can see that most of the jobs keep fewer than 100 targets out of ~28K each.
To mitigate this we added selectors to some of the kube SD jobs, which helped reduce the CPU usage by half. However, this is not perfect, since when a selector matches nothing you see 0/28K in the discovery page, which is weird (I would have expected something like 0/0).
Our impression is that the fewer targets discovered as a result of the kube calls, the less CPU is consumed in the end, meaning the consumption might not be linked to the kube calls themselves but to what is done afterwards (relabeling and such?).
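For reference, adding selectors to a pod-role SD job looks roughly like this (the label and field values are hypothetical); such selectors are passed to the Kubernetes API as label/field selectors, so the matching happens before the targets reach Prometheus's relabelling:

kubernetes_sd_configs:
- role: pod
  selectors:
  - role: pod
    label: "app.kubernetes.io/part-of=my-platform"   # hypothetical label selector
    field: "status.phase=Running"                     # optional field selector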

NB: we also noticed that the rate on prometheus_target_sync_length_seconds_sum dropped after adding the selectors.
sum(rate(prometheus_target_sync_length_seconds_sum{<filters>}[2m]))
image

This is a sum but if you get the details for all the scrape jobs, the jobs with the new selector drop to ~0 and the other jobs see their rate dropping.
rate(prometheus_target_sync_length_seconds_sum{<filters>}[2m])
image

@brian-brazil
Contributor

Can you get a CPU profile?

@w4rgrum

w4rgrum commented Nov 24, 2020

CPU profile with the selectors
profile_with_selectors.zip
profile_with_selectors

CPU profile without selectors
profile.zip
profile

@brian-brazil
Contributor

Looks like it's that part of the code alright. It's not something we've ever really optimised, as it's not meant to run particularly often. k8s sending 28k targets for each of 30 scrape configs every 5 seconds wasn't exactly the use case we had in mind. There's probably some low-hanging fruit here.

@vitkovskii

vitkovskii commented Dec 2, 2020

@brian-brazil Hey there! What if we added a label filter to the k8s discovery plugin as a quick win? In our case discovery produces ~120 labels per target, but we use only 5 of them. Yes, there is a relabel config, but the main performance cost is allocating memory for all those labels (30,000 targets × 120 labels ≈ 3.6 million labels), sorting them, hashing them for deduplication, and then sending them to relabeling. This long and heavy work happens on each scrape pool sync. What do you think?

@brian-brazil
Contributor

That sounds like slowly reinventing relabelling, so would ultimately end up with the same performance costs. We have to allocate the labels one way or the other.

I'd suggest looking at ways to make what we have more efficient, rather than immediately jumping to adding yet more configuration for users. There's likely quite a bit of low-hanging fruit performance-wise, as the relevant code paths have never really been optimised.

@d-ulyanov

The option from @vitkovskii sounds really reasonable. Unfortunately, we're just wasting CPU and memory on useless work here, and I have no idea what to do next because the number of targets in our K8s cluster is growing really rapidly. It seems that Prometheus is becoming unscalable here. Let's find an engineering solution, colleagues.

@roidelapluie
Member

cc @simonpasquier @brancz

@vitkovskii

Actually, this hard work is done per service monitor, so if you have 30 of them you do 30x the useless work. Let's count: 120 labels is about 6 KB per target; with 30,000 targets and 30 service monitors, that's about 5.5 GB of RAM just to create the raw label sets from discovery.

@roidelapluie
Member

Do you have concrete examples of how we could avoid that? Are those e.g. underlying labels you could delete from your pods?

@brian-brazil
Contributor

Let's count: 120 labels is about 6 KB per target; with 30,000 targets and 30 service monitors, that's about 5.5 GB of RAM just to create the raw label sets from discovery.

That'd only be the case if none of them were dropped, which is unlikely. We should only be keeping dropped targets once, not 30x - if that's not the case it sounds like low-hanging fruit that could be tackled.

@lorenzo-biava
Author

Actually, this hard work is done per service monitor, so if you have 30 of them you do 30x the useless work.

It's definitely amplified by the number of ServiceMonitors. We replaced tens of SMs with a single one (for a particular application), while the overall pods/labels stayed the same (or even slightly increased), and CPU usage dropped to less than 50% of what it was before (see #8014 (comment)).

@brian-brazil
Contributor

and CPU usage dropped to less than 50% of what it was before

I'm confused here. #8014 (comment) is talking about RAM, and now you're talking about CPU. Which is the problem?

@lorenzo-biava
Author

For us it was definitely CPU (as per the title of the issue). Not sure about RAM, haven't seen any correlation with a change in memory usage yet.
PS: All the provided profiles are for CPU. Let me know if you also need a memory one.

@brian-brazil
Contributor

Let's just stick with CPU then. If it is also memory that's a separate thing which we can look at optimising.

My first thought on how to handle this generally, so it benefits all SDs/users, would be to see if we can avoid re-processing a target that hasn't changed when there's a refresh from SD, thus avoiding a large chunk of the relabelling processing.

@brian-brazil
Contributor

Yeah, looking at the code at https://github.com/prometheus/prometheus/blob/master/scrape/scrape.go#L417-L431, if we made droppedTargets a map we could check whether an identical target was dropped last time and fast-path that, bypassing all the relabelling.

@roidelapluie
Member

@brian-brazil Is it a strict rule that we should expose all original labels (to the UI/API) in a sorted way? Or can we break the ordering? My experiments show that the most expensive operation is sorting labels. But relabeling doesn't require labels to be sorted, so we could sort far fewer labels, only after relabeling. The drawback is that the original label set would be exposed unsorted.

We could sort them in the UI, if that is really making a big difference. But I think that relabeling might somehow have dependencies on order, alongside other sanity checks we do (like checking for duplicate labels).

@brian-brazil
Contributor

Is it a strict rule that we should expose all original labels (to the UI/API) in a sorted way?

Yes, that's part of our documented API that cannot be broken - though the sorting doesn't matter, it is a JSON map which has no ordering.

My experiments show that the most expensive operation is sorting labels.

For the data structure to work it needs to be sorted, and that includes for relabelling so it can find the labels. Are we sorting more than we need to? We should only need to do it once per SD target.

@vitkovskii

vitkovskii commented Dec 10, 2020

The first sort is here: https://github.com/prometheus/prometheus/blob/master/scrape/target.go#L368 - it's not actually necessary for relabeling. The second one happens after relabeling: https://github.com/prometheus/prometheus/blob/master/scrape/target.go#L425. The second one is OK because we then need to calculate hash() to deduplicate targets.

@vitkovskii

vitkovskii commented Dec 10, 2020

and that includes for relabelling so it can find the labels

Relabeling scans all labels and doesn't require them to be sorted.

@vitkovskii

Why does relabeling take a list of labels rather than a map? For every target, on every config iteration, it allocates a new list. This is the second bottleneck after sorting.

@brian-brazil
Contributor

The map is the old data structure; Prometheus 2.x introduced the list for performance reasons. The map should be considered legacy, and remaining uses removed where practical.

Why not do the sort up in SD? We only need it once, not once per scrape config, and it'd further prune where the old data structure is used.

Relabeling scans all labels and doesn't require them to be sorted.

It's using a library which expects them to be sorted, so that invariant should be maintained. Relabelling does require them to be sorted due to this.

@shaikatz

shaikatz commented Jan 24, 2021

@brian-brazil are there any plans to improve this area? It's still a pain for anybody who uses many service monitors in their cluster; the potential savings here are great.

I'll contribute my CPU profile here: 172 service monitors, 5 cores currently in use; we can see that at least half of them are being wasted in the same targetsFromGroup function:

image

@brian-brazil
Contributor

I think that's the first thing we should probably look at improving.

@brancz
Member

brancz commented Feb 15, 2021

@shaikatz do you mind uploading the whole profile either as a file to github or via https://share.polarsignals.com/ ? It would be great if we could explore the profile.

I also opened prometheus-operator/prometheus-operator#3840 as I think that is another angle from which this can be optimized on the prometheus-operator side.

@shaikatz

@brancz
Member

brancz commented Feb 15, 2021

Could you also share an allocs profile?

From that profile it looks to me like a lot of memory garbage is produced by discovery, which causes large sweep and GC CPU usage. It does all seem to add up to prometheus-operator/prometheus-operator#3840.

@shaikatz

shaikatz commented Feb 15, 2021

Is that the one you need? profile001.svg.zip
If not, can you provide the exact pprof command to get your required profile?

@brancz
Member

brancz commented Feb 17, 2021

Yeah it is, thank you! That seems to also point in the same direction.

@lorenzo-biava
Author

@brancz / @brian-brazil / @roidelapluie is there any update on this issue? Any options we can explore to patch this behavior?
The prom-operator PR seems to have been stuck since March...

@wulianhuo

Same problem here.
The easy solution for now is to merge ServiceMonitors: try using only one ServiceMonitor for all the services to be monitored, and the Prometheus reload operation will be more efficient.
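As a sketch of that approach (all names and labels here are hypothetical), a single ServiceMonitor can select many Services through a shared label instead of defining one ServiceMonitor per Service; the main constraint is that the selected Services need a common metrics port name:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: all-platform-services
  namespace: monitoring
spec:
  selector:
    matchLabels:
      monitoring: enabled          # shared label added to every Service that should be scraped
  namespaceSelector:
    matchNames:
    - asraas-prod                  # or "any: true" to cover all namespaces
  endpoints:
  - port: metrics                  # assumes the Services expose a common port name
    interval: 30s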

@d-ulyanov

As the MR was not accepted, we've decided to implement our own separate discovery service (we call it a "target balancer") and deliver targets to the Prometheus instances as files (file SD). Benefits: 1) reduced kube API load, 2) significantly reduced CPU usage on the Prometheus instances.
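For anyone considering the same route, a hedged sketch of the Prometheus side (the path is hypothetical): the external discovery daemon writes target files, and Prometheus consumes them via file_sd_configs, so all the Kubernetes watching and per-target label work happens once, outside Prometheus.

scrape_configs:
- job_name: external-k8s-targets
  file_sd_configs:
  - files:
    - /etc/prometheus/file_sd/*.json   # written and updated by the external discovery daemon
    refresh_interval: 1m               # fallback poll; file changes are normally picked up on write

Each file holds entries of the form {"targets": ["10.0.0.1:8080"], "labels": {"namespace": "asraas-prod"}}, so Prometheus only has to re-read a file when it changes.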

@m-yosefpor

m-yosefpor commented Sep 8, 2022

It seems we are also hitting this issue, but the weird thing is that we are hitting it with only 1 of our Prometheus servers!! (We have configured 2 replicas in the Prometheus Operator, no sharding.)

However, the CPU usage is mostly in scrape.run for us rather than scrape.reload, so I'm not sure it's the same problem.

prometheus-0: profile.pb.gz
image

prometheus-1: profile(1).pb.gz
image

You can see the difference in pprof between two instances. Also the difference in CPU usage of these instances:

image

More info:

$ oc get servicemonitor,podmonitor -A | wc -l
461
$ oc get po -A | wc -l
3646

@d-ulyanov

@m-yosefpor we finally moved all discovery logic out of Prometheus into a separate daemon and switched to simple file/HTTP discovery; it also allowed us to implement custom sharding logic.
Maybe it makes sense to open-source this tool.

@m-yosefpor

@m-yosefpor we finally moved all discovery logic out of Prometheus into a separate daemon and switched to simple file/HTTP discovery; it also allowed us to implement custom sharding logic. Maybe it makes sense to open-source this tool.

We would appreciate it if such a tool were open-sourced.

@iamyeka

iamyeka commented Feb 17, 2023

We still need a proper solution; for now we have set scrape.discovery-reload-interval to a larger duration as a temporary workaround.

@iamyeka

iamyeka commented Feb 19, 2023

It's using a library which expects them to be sorted, so that invariant should be maintained. Relabelling does require them to be sorted due to this.

I can't figure out why this could happen. What library is being used?

@bboreham
Member

bboreham commented Mar 9, 2023

The problems described at the top may be improved by #12048 and #12084.

@beorn7
Member

beorn7 commented May 21, 2024

Hello from the bug scrub.

We assume this problem has been improved by #12048 and #12084 indeed. If you still see excessive CPU usage that we might be able to fix, please follow up here (or file a new issue).

@beorn7 beorn7 closed this as completed May 21, 2024