(Relatively) high CPU usage for the cadvisor container #2523

Open
immortaly007 opened this issue Apr 29, 2020 · 38 comments

Comments

@immortaly007

immortaly007 commented Apr 29, 2020

I noticed that the CPU usage of cadvisor is the highest of all containers I'm running, which I feel is unexpected. It's still not crazy (an average of 7%), but I feel it should be less.

I'm using the following query to calculate CPU usage (in grafana):

sum(rate(container_cpu_usage_seconds_total{image!=""}[1m])) by (id,name)

Some info about my instance:

  • Prometheus is querying the metrics every 15 seconds. I was querying every 5 seconds before, which led to 15% CPU usage.
  • Cadvisor runs using the arguments: -docker_only --disable_metrics disk,tcp,udp
  • There are a total of 29 running docker containers
  • The host is using a dual core Intel(R) Celeron(R) CPU G1610 @ 2.60GHz, and has 8GB of RAM (which is approximately half filled by the other containers).
  • Running Debian 10

Is this the expected CPU usage? Is there something I can improve? Or am I monitoring CPU usage wrong?

Edit: I managed to bring it down to ~3% by setting the --housekeeping_interval to 10s. But it would be nice if it was even lower.

@dashpole
Collaborator

You should add a number of things to the disable_metrics list. See the help text for the flag: https://github.com/google/cadvisor/blob/master/cmd/cadvisor.go#L137
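
A minimal sketch of how these flags might be passed via a docker-compose command block (an illustration only, not an official recommendation; the exact metric names accepted by -disable_metrics depend on the cAdvisor version, so check the linked help text first):

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    command:
      - '-docker_only=true'
      # metric collectors to disable; adjust this list to what your version supports
      - '-disable_metrics=disk,diskIO,network,tcp,udp,percpu,sched,process'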

@kassyss

kassyss commented Apr 30, 2020

I am also experiencing this issue.
On a small server (with a Quad Core Intel Xeon E3-1265L V2 CPU), even though I have disabled a large number of metrics, cAdvisor is using an average of 8% CPU. Compared to the other containers, that is at least 8 times higher than the next most consuming container.

It's annoying to have the monitoring tool eat far more resources than the monitored containers.

Best regards

@gowrisankar22

@dashpole I have a docker-compose setup for cadvisor. Where do I add --housekeeping_interval=10s?

@dashpole
Collaborator

dashpole commented May 4, 2020

@gowrisankar22 to the container args. I'm not sure where those are specified...

@kassyss

kassyss commented May 4, 2020

@gowrisankar22 This issue is about high CPU usage, not about docker-compose content.
Anyway:

  cadvisor:
    image: gcr.io/google-containers/cadvisor:v0.36.0
    container_name: cadvisor
    command:
      - '--housekeeping_interval=55s'
      - '--docker_only'

@immortaly007
Author

immortaly007 commented May 6, 2020

I've disabled all statistics listed, specifically:
disk,diskIO,network,tcp,udp,percpu,sched,process

Note that the statistics cpu_topology, hugetlb and referenced_memory were mentioned, but I couldn't actually disable them: it gave me an error about an invalid argument.

Now the CPU usage is down to ~1.5%. This is still the highest average CPU usage of all 31 containers running on my machine (followed by Prometheus itself at ~1.3%).

But I wouldn't say the issue is quite resolved: I expect cadvisor to be able to report statistics at intervals of ~15-30 seconds without taking more than 1% of the CPU time, and I didn't expect to have to disable every statistic I could just to get near that number.

@dashpole
Collaborator

dashpole commented May 6, 2020

1% of how many cores? Also, the query interval isn't what really matters; it is the housekeeping interval. cAdvisor collects metrics in the background, and serves them from its cache.

@immortaly007
Author

immortaly007 commented May 7, 2020

The machine has two cores, so I believe it would be ~1.5% of a single core. It is calculated as:
sum(rate(container_cpu_usage_seconds_total{image!=""}[1m])) by (id,name)

But note that this usage is when disabling all metrics I could disable, and setting the housekeeping interval to 15s.

With default settings, cadvisor would take up ~15% CPU usage, which in my opinion is too much for the default settings.

@tehniemer

I'm having the same problem. I followed the suggestions in this issue and it was corrected for a while, but the high CPU use came back after restarting the container.
(screenshot of cadvisor CPU usage)

This is my docker-compose.

  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    restart: always
    command: 
      - '--docker=tcp://socket-proxy:2375'
      - '--housekeeping_interval=15s'
      - '--docker_only=true'
      - '--disable_metrics=disk,network,tcp,udp,percpu,sched,process'
    networks:
      - socket_proxy
      - database
    depends_on:
      - socket-proxy
      - prometheus
    security_opt:
      - no-new-privileges:true
    ports:
      - '$CADVISOR_PORT:8080'
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /dev/disk/:/dev/disk:ro

@idontusenumbers

It seems concerning that a tool with a fraction of the functionality of a task manager uses ~15% CPU tracking the bare essentials of CPU, memory, and network. I had to turn down the reporting frequency to something I think is pretty unreasonable (30s) to get it to a comfortable level.

The reduction in cadvisor's CPU usage has coincided with a reduction in TeamCity server CPU usage (also running as a container in Docker), which I believe also tracks CPU usage of the 'host' machine. Maybe the CPU tracking in each is playing off the other in a bad way?

@iwankgb
Collaborator

iwankgb commented Jan 5, 2021

@idontusenumbers, have you tried enabling profiling to see what makes cAdvisor use so many resources?

@idontusenumbers

@idontusenumbers, have you tried enabling profiling to see what makes cAdvisor use so many resources?

I have not. I have no experience doing go dev or native profiling; just JS, Java, and PHP. I'd be happy to help if the issue can't be reproduced, but I'm going to need some guidance.

@iwankgb
Collaborator

iwankgb commented Jan 10, 2021

@idontusenumbers you need to add the -profiling argument to your cadvisor invocation and navigate to IP:PORT/debug/pprof/profile?seconds=300. It will generate a profile for 300 seconds of cAdvisor execution. After 300 seconds a file will be sent to you; download it, navigate to the directory where the file is saved, and execute: go tool pprof profile (profile is the default name for the downloaded file). That will open the profiling console. Type web and hit enter. After some time a browser should open on your system and you should see the execution profile for your instance.
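
Concretely, assuming cAdvisor is reachable on localhost:8080 and the Go toolchain is installed locally, the steps look roughly like this:

  # grab a 300-second CPU profile from the running cAdvisor instance
  curl -o profile "http://localhost:8080/debug/pprof/profile?seconds=300"

  # open the profile in the interactive pprof console; type "web" at the prompt
  # to render a call graph (this view needs graphviz installed)
  go tool pprof profile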

@paaacman

Thanks @iwankgb, I made some tests with different settings, using 60s profiles.
I compare mainly the "Total samples" time, and I monitor CPU usage with htop and the Prometheus query rate(container_cpu_user_seconds_total{name="cadvisor"}[1m]) * 100.
For every test, I change only one parameter. Almost all tests are made with -housekeeping_interval=1s.

My observations so far are:

  • '-docker_only=true' is the most efficient change. The "Total samples" time goes from 8.89s to 2.60s when I add this configuration.
  • '--disable_metrics=disk,network,tcp,udp,percpu,sched,process' does not help, and can even be worse.
  • Modifying the Prometheus scrape_interval can help: going from 5s to 45s makes a big difference (from 2.46s to 1.53s), but if it's more than 30s, Prometheus and Grafana will only show dots in graphs with rate(container_cpu_user_seconds_total{name="cadvisor"}[1m]) * 100 (rate() needs at least two samples inside its 1m window, so a scrape interval above ~30s leaves gaps). I keep scrape_interval: 15s and don't understand exactly what scrape_interval does.
  • Finally, changing housekeeping_interval from 1s (Total samples = 2.0s) to 5s (0.7s) and 10s (0.6s) makes a big difference too.

My final configuration:

# docker-compose.yml
...
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /sys/fs/cgroup:/cgroup:ro
      - /dev/disk/:/dev/disk:ro
    command:
      - '-housekeeping_interval=10s'
      - '-docker_only=true'
    restart: unless-stopped
    devices:
      - /dev/kmsg:/dev/kmsg
    security_opt:
      - no-new-privileges:true
    expose:
      - 8080
    networks:
      - monitor-net
    labels:
      org.label-schema.group: "monitoring"
...
# prometheus.yml
...
scrape_configs:
  - job_name: 'cadvisor'
    scrape_interval: 15s
    static_configs:
      - targets: ['cadvisor:8080']
...

More details about this configuration here https://github.com/paaacman/dockprom.

@setop

setop commented Feb 24, 2021

...
My observations so far are:
...
More details about this configuration here https://github.com/paaacman/dockprom.

Thank you very much for this advice. When applied, I dropped from 8% CPU usage by cadvisor to 0.6%; much more sustainable.

opsxcq added a commit to strm-ansible-roles/ansible-role-linux-server that referenced this issue Mar 13, 2021
@Herz3h

Herz3h commented Apr 2, 2021

Is it -housekeeping_interval=1m or --housekeeping_interval=1m @paaacman (one dash or two)?

I don't see any difference when using housekeeping_interval and docker_only. Still high CPU usage, 100% (one core fully used), with around 100 containers to monitor (4-core CPU @ 3.5GHz), using version 0.38.8 :/

Edit: It seems that all parameters must have only one dash. Also, -allow_dynamic_housekeeping is set to true by default; I think it should be set to false for -housekeeping_interval to be taken into account.

@paaacman

paaacman commented Apr 2, 2021

As you concluded, it's with one dash.
You have a lot of containers; I have only 10-15, so maybe it's a lot more consuming?
I don't know whether it fetches stats per container or per host.

@JMLX42

JMLX42 commented Apr 18, 2021

Here is the configuration I'm using:

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.38.8
    command:
      - '-allow_dynamic_housekeeping=false'
      - '-housekeeping_interval=10s'
      - '-docker_only=true'
    restart: unless-stopped
    devices:
      - /dev/kmsg:/dev/kmsg
    security_opt:
      - no-new-privileges:true
    volumes:
      - /:/rootfs:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /sys/fs/cgroup:/cgroup:ro
      - /dev/disk/:/dev/disk:ro
    networks:
      - prometheus
    logging:
      driver: "fluentd"
      options:
        fluentd-address: "localhost:24224"
        fluentd-async: "true"

And cadvisor still consumes the exact same amount of CPU (~14% for 18 containers).

Update: I was looking at the wrong dashboard. cadvisor does indeed consume a lot less CPU: 1 to 7% instead of a continuous ~14%.

@zsimic

zsimic commented Apr 21, 2021

I'm observing the same thing: no matter what I do, cadvisor uses ~10-15% CPU, all the time. Confirmed with htop.
This is on a raspberry-pi (if it helps).

Tried all the suggestions here, including double -- and single - for options... the docs mention two dashes (https://github.com/google/cadvisor/blob/master/docs/runtime_options.md) but this thread seems to say it should be one...

The reporting seems accurate: cadvisor does seem to be the most CPU-hungry container of the bunch I have (grafana, prometheus, pihole, syncthing, ...), and its self-reported CPU usage seems accurate (as do all the others).

I would love to find a way to make it stop using so much CPU for no reason...
Prometheus is configured to gather data from cadvisor once per minute (and nothing else is using cadvisor... so it should be idle like 99.9% of the time)

Attached a screenshot of what it looks like on my system.
The top green line is cadvisor (the 2nd green line is unifi-controller, not showing up on the legend there... it's the only one not showing, viewport not big enough to show all 9 entries at once...)

(screenshot: per-container CPU usage graph)

@zsimic

zsimic commented Apr 21, 2021

Actually, a correction: it turns out everything I tried was a no-op...
I'm running cadvisor via docker-compose, and apparently one must take the container down before it will pick up any changes (I was just doing a restart...).

Anyhow, after actually applying the configs, CPU usage goes drastically down with this:

    command:
      - '-docker_only=true'
      - '-housekeeping_interval=30s'
      - '-disable_metrics=disk'

With that ^^, CPU usage goes to ~1.5%

The docs on this page (https://github.com/google/cadvisor/blob/master/docs/runtime_options.md) are out of date: not only do they show a double dash --, they also list invalid options for -disable_metrics.

Fun bonus: if one passes an invalid disable_metrics value, one gets a huge dump of the entire --help message, multiple times, and has to hunt for the actual "meaningful" error message, like:

... prev full --help message (because docker-compose is trying to restart the thing...) ...
...
cadvisor    | invalid value "disk,process" for flag -disable_metrics: unsupported metric "process"
cadvisor    | Usage of /usr/bin/cadvisor:
cadvisor    |   -allow_dynamic_housekeeping
cadvisor    |           Whether to allow ...
... 8< ...
... tons of non-helpful output

@paaacman

paaacman commented Apr 21, 2021

@zsimic about docker-compose (not related to cadvisor, but if it can help):
You can try docker-compose up -d --force-recreate or docker-compose up -d --force-recreate --build, it's usually what I do.

@zsimic

zsimic commented Apr 22, 2021

Awesome tip, thank you! I didn't know that; I'll make sure to use --force-recreate.

If anyone's looking into this, FWIW, I think it would make sense if by default cadvisor started in a configuration that uses very little CPU.
It seems like currently the defaults make it use ~15% CPU, which is a lot.
It would be cool if it defaulted to -docker_only=true (or whichever option makes it use less CPU here).

@iwankgb
Collaborator

iwankgb commented May 7, 2021

@zsimic indeed, the metrics disabled by default do not include IO.
I would blame housekeeping_interval: its default value is 1 second. The value has been set to 1 second for 7 years; on one hand I expect everybody to set it on the command line, but on the other, I would prefer not to break someone's metrics by changing the default value.

@jorgeancal

We had this problem at Crunch Accounting for a really long time and we found a solution for it.

We are using this Helm chart and this is the config that we are using at the moment:

image:
  repository: gcr.io/cadvisor/cadvisor
  tag: v0.37.5
  pullPolicy: IfNotPresent
container:
  port: 8080

  additionalArgs:
    - --housekeeping_interval=30s    # kubernetes default args
    - --max_housekeeping_interval=35s
    - --event_storage_event_limit=default=0
    - --event_storage_age_limit=default=0
    - --store_container_labels=false
    - --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace
    - --global_housekeeping_interval=30s
    - --event_storage_event_limit=default=0
    - --event_storage_age_limit=default=0
    - --disable_metrics=percpu,process,sched,tcp,udp,diskIO,disk,network      # enable only cpu, memory
    - --docker_only          # only show stats for docker containers
    - --allow_dynamic_housekeeping=true
    - --storage_duration=1m0s

Here are some graphs showing how CPU usage has improved:

(two screenshots: CPU usage graphs before and after the change)

@FStefanni

Hi,

I tried the suggested options on a server running 150 containers in plain docker (no kubernetes), but the server collapses.
I'm not sure whether it is the fault of cadvisor, prometheus or grafana... but at the moment,
just disabling cadvisor makes the server run fine, with node-exporter and process-exporter still enabled.

Monitoring should be lightweight imho: resources should be heavily used by the actual services, and not
by monitoring. This is quite disappointing for me :(

Does anyone have a solution?

Regards.

@Macleykun

After some testing on my end, I can say that I reduced the container's CPU usage on my RPi 4B from 16% to 0.9% with the following args:

-docker_only=true --housekeeping_interval=30s --disable_metrics=accelerator,cpu_topology,disk,memory_numa,tcp,udp,percpu,sched,process,hugetlb,referenced_memory,resctrl,cpuset,advtcp,memory_numa

However, I was wondering if there's more to add or change to lower the CPU usage even further (if that's still possible).
It is worth looking into this :-) even the disable_metrics change alone was worth it for me.

@wolkenschieber

I'm also seeing high CPU usage, but it's not a constant load; it comes in spikes:
(screenshot: graph of CPU usage spikes)

Tuning intervals doesn't seem to have an effect:

    command:
      - "--docker_only=true"
      - "--housekeeping_interval=30s"
      - "--max_housekeeping_interval=35s"
      - "--global_housekeeping_interval=30s"
      - "--storage_duration=1m0s"
      - "--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,accelerator,hugetlb,referenced_memory,cpu_topology,resctrl"

@igngi

igngi commented Jun 6, 2023

Just wanted to add a note for people running cAdvisor on Azure Container Instances.
Passing the flags as part of the command directive didn't work for me (the container could not start), so I had to call cAdvisor as the first command:

command:
  - "/usr/bin/cadvisor"
  - "-logtostderr"
  - "-housekeeping_interval=30s"
  - "-docker_only=true"

port19x added a commit to port19x/rapture that referenced this issue Aug 8, 2023
@michaelkebe

Storing the container labels had a pretty heavy impact on CPU usage in my case.

First I tried to disable some metrics with --disable_metrics, but that wasn't very helpful.

Then I looked at the metrics Prometheus gathered from cadvisor. By default, every container label is stored as a label on the Prometheus metrics as well. In my setup I use traefik, which is configured via container labels. All these labels are attached to every metric, so there is a (# of metrics) * (# of container labels) relationship.

In my case --store_container_labels=false dropped the cpu usage significantly.

(screenshot: Grafana "Node Exporter Full" dashboard panel showing the drop in CPU usage)

@immortaly007
Author

Very nice find! I can confirm similar results. This is cadvisor CPU usage before and after the change:
(screenshot: cadvisor CPU usage before and after the change)

@dimzeta

dimzeta commented Nov 5, 2023

Thanks, I've just gone from 15% to 0.8% by adding these options:

    command:
      - '-housekeeping_interval=10s'
      - '-docker_only=true'
      - '-store_container_labels=false'

@kroese

kroese commented Nov 5, 2023

As store_container_labels=false has such a tremendous effect on CPU usage (a factor of 10 or more), wouldn't the only logical conclusion be that there is some bug in the code relating to store_container_labels?

I cannot think of any valid reason why something as simple and low-effort as storing a label or not could impact performance in such a big way, unless there is some bug or extremely inefficient code causing it.

@henfiber

@kroese I don't think storing the labels is as trivial as it sounds. In my case, the total size of the metrics payload was about 16x smaller without the labels.

Test: visit http://your-cadvisor-ip:port/metrics and save the page as "metrics"

then run the following command:

cat metrics | wc

~25 MB
(for 52 containers, ~6000 unique metrics, 76 labels per metric)

Filtering out the labels to measure the size difference:

cat metrics | gawk '{gsub(/container_label_[^=]+="[^"]*"/, "", $0); print $0}' | tr -s ',' | wc

~1.5 MB

Prometheus also benefits from the 16x reduction in the payload it has to parse. Therefore, I suggest running with -store_container_labels=false. If you need a few of them, you may add them back with -whitelisted_container_labels.
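
In docker-compose terms that could look roughly like the sketch below; the label names passed to -whitelisted_container_labels are placeholders for whichever labels you actually want to keep:

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    command:
      - '-docker_only=true'
      - '-housekeeping_interval=30s'
      - '-store_container_labels=false'
      # placeholder label names; replace with the container labels you need
      - '-whitelisted_container_labels=com.example.project,com.example.environment'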

@truncsphere

truncsphere commented Nov 30, 2023

Using the suggestions here, I'm still not able to drop my CPU usage. The options below don't appear to be having any impact as the CPU usage per Grafana is virtually unchanged ~38%

command:
   - 'housekeeping_interval=10s'
   - 'docker_only=true'
   - 'store_container_labels=false'
   - 'disable_metrics=accelerator,hugetlb,cpu_topology,cpuset,oom_event,perf_event,tcp,udp,sched,resctrl,referenced_memory,process,percpu,memory_numa,app,advtcp'

@AnderssonPeter

Storing the container labels had a pretty heavy impact on CPU usage in my case.
...
In my case --store_container_labels=false dropped the cpu usage significantly.

Great find, my CPU went from 3% to almost 0%!! 🙇

@AntoineGlacet

AntoineGlacet commented Dec 15, 2023

Using the suggestions here, I'm still not able to drop my CPU usage. The options below don't appear to be having any impact as the CPU usage per Grafana is virtually unchanged ~38%

command:
   - 'housekeeping_interval=10s'
   - 'docker_only=true'
   - 'store_container_labels=false'
   - 'disable_metrics=accelerator,hugetlb,cpu_topology,cpuset,oom_event,perf_event,tcp,udp,sched,resctrl,referenced_memory,process,percpu,memory_numa,app,advtcp'

@truncsphere
Your format is wrong: you need to put a "-" before each command. Read attentively.

For me, using:

 command:
     - '-housekeeping_interval=30s'
     - '-docker_only=true'
     - '-store_container_labels=false'

has been good for CPU usage

@truncsphere

truncsphere commented Dec 15, 2023

@AntoineGlacet
I've tried this before and all I get in the logs is a list of flags and their descriptions. Data collection stops at this point. Snippet below:

-storage_driver_kafka_topic string
        kafka topic (default "stats")
-storage_driver_password string
        database password (default "root")
-storage_driver_secure
        use secure connection with database
-storage_driver_table string
        table name (default "stats")

Edit: I removed the disable_metrics flag and the CPU usage is now down and everything is working properly.

openstack-mirroring pushed a commit to openstack/kolla-ansible that referenced this issue Jan 6, 2024
The prometheus_cadvisor container has high CPU usage. On various
production systems I checked it sits around 13-16% on controllers,
averaged over the prometheus 1m scrape interval. When viewed with top we
can see it is a bit spikey and can jump over 100%.

There are various bugs about this, but I found
google/cadvisor#2523 which suggests reducing
the per-container housekeeping interval. This defaults to 1s, which
provides far greater granularity than we need with the default
prometheus scrape interval of 60s.

Reducing the housekeeping interval to 60s on a production controller
reduced the CPU usage from 13% to 3.5% average. This still seems high,
but is more reasonable.

Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
Closes-Bug: #2048223
@markgoddard

Thanks to this thread for the --housekeeping_interval suggestion. I used it to set a more sensible default in kolla-ansible.

I was wondering if 1m would be a better default in cAdvisor, to match the prometheus default scrape interval. I could put together a PR if people think it's a good suggestion.

ntk148v added a commit to ntk148v/ansitheus that referenced this issue Jun 13, 2024
The default value is 1s, it can cause high CPU usage

google/cadvisor#2523