(Relatively) high CPU usage for the cadvisor container #2523
Comments
you should add a number of things to the disable_metrics list. See the help text for the flag: https://github.com/google/cadvisor/blob/master/cmd/cadvisor.go#L137 |
I am also experiencing this issue. It's annoying that the monitoring tool eats far more resources than the containers it monitors. Best regards |
@dashpole I have a docker-compose set up for caadvisor. where to add --housekeeping_interval to 10s ? |
@gowrisankar22 to the container args. I'm not sure where those are specified... |
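For reference, a minimal sketch of where those args go in a docker-compose service (the flags are the ones discussed in this thread; the values are only examples):
# docker-compose.yml (sketch)
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  command:
    # per-container stats collection interval
    - '-housekeeping_interval=10s'
    # only report docker containers (plus root), not every raw cgroup
    - '-docker_only=true'
    # collectors to skip; see the disable_metrics help text for valid names
    - '-disable_metrics=percpu,sched,tcp,udp,disk,diskIO'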
@gowrisankar22 This issue is about high cpu usage, not about docker-compose content.
|
I've disabled all the statistics listed. Now the CPU usage is down to ~1.5%. This is still the highest average CPU usage of all 31 containers running on my machine (followed by Prometheus itself at ~1.3%). But I wouldn't say the issue is quite resolved: I expect cadvisor to be able to report statistics at intervals of ~15-30 seconds without taking more than 1% of the CPU time, and I didn't expect to have to disable every statistic I could just to get near that number. |
1% of how many cores? Also, the query interval isn't what really matters; it is the housekeeping interval. cAdvisor collects metrics in the background, and serves them from its cache. |
The machine has two cores, so I believe it would be ~1.5% of a single core. It is calculated from cAdvisor's Prometheus CPU metrics (a typical query is sketched below). But note that this usage is with every metric I could disable turned off, and with the housekeeping interval set to 15s. With default settings, cadvisor would take ~15% CPU, which in my opinion is too much for the default settings. |
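For reference, one way to express that kind of calculation, written here as a Prometheus recording rule since the thread already uses prometheus.yml (a sketch only; the metric and label names assume cAdvisor's default Prometheus output):
# prometheus rules file (sketch)
groups:
  - name: cadvisor-cpu
    rules:
      # per-container CPU usage as a percentage of one core, averaged over 5 minutes
      - record: container_cpu_usage_percent
        expr: 'rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100'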
It seems concerning that a tool with a fraction of the functionality of a task manager uses ~15% CPU tracking the bare essentials of CPU, memory, and network. I had to turn the reporting frequency down to something I think is pretty unreasonable (30s) to get it to a comfortable level. The reduction in cadvisor's CPU usage has coincided with a reduction in TeamCity server CPU usage (also running as a container in Docker), which I believe also tracks CPU usage of the 'host' machine. Maybe the CPU tracking of the two is playing off each other in a bad way? |
@idontusenumbers, have you tried enabling profiling to see what makes cAdvisor use that many resources? |
I have not. I have no experience with Go development or native profiling; just JS, Java, and PHP. I'd be happy to help if the issue can't be reproduced, but I'm going to need some guidance. |
@idontusenumbers you need to add the profiling argument. |
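A minimal sketch of enabling that, assuming cAdvisor's -profiling flag (check cadvisor --help for your version) and the standard Go pprof endpoints it exposes under /debug/pprof/:
# docker-compose.yml (sketch)
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  command:
    # expose the Go pprof endpoints on the cAdvisor web port
    - '-profiling'
    # keep whatever other flags you already use, e.g.
    - '-housekeeping_interval=10s'
# a 60s CPU profile can then be fetched from http://<host>:8080/debug/pprof/profile?seconds=60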
Thanks @iwankgb, I ran some tests with different settings, using 60s profiles. My observations so far:
My final configuration: # docker-compose.yml
...
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
volumes:
- /:/rootfs:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /sys/fs/cgroup:/cgroup:ro
- /dev/disk/:/dev/disk:ro
command:
- '-housekeeping_interval=10s'
- '-docker_only=true'
restart: unless-stopped
devices:
- /dev/kmsg:/dev/kmsg
security_opt:
- no-new-privileges:true
expose:
- 8080
networks:
- monitor-net
labels:
org.label-schema.group: "monitoring"
... # prometheus.yml
...
scrape_configs:
- job_name: 'cadvisor'
scrape_interval: 15s
static_configs:
- targets: ['cadvisor:8080']
... More details about this configuration here https://github.com/paaacman/dockprom. |
Thank you very much for this advice. When applied, cadvisor's CPU usage drops from 8% to 0.6%; much more sustainable. |
For more details, see google/cadvisor#2523
Is it one dash or two? I don't see any difference when using housekeeping_interval and docker_only. Still high CPU usage, 100% (1 core fully used), with around 100 containers to monitor (4-core CPU at 3.5GHz), using version 0.38.8 :/ Edit: it seems all parameters must use only one dash. |
As you concluded, it's with 1 dash. |
Here is the configuration I'm using: cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.38.8
command:
- '-allow_dynamic_housekeeping=false'
- '-housekeeping_interval=10s'
- '-docker_only=true'
restart: unless-stopped
devices:
- /dev/kmsg:/dev/kmsg
security_opt:
- no-new-privileges:true
volumes:
- /:/rootfs:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /sys/fs/cgroup:/cgroup:ro
- /dev/disk/:/dev/disk:ro
networks:
- prometheus
logging:
driver: "fluentd"
options:
fluentd-address: "localhost:24224"
fluentd-async: "true" And cadvisor still consumes the exact same amount of CPU (~14% for 18 containers). Update: I was looking at the wrong dashboard. cadvisor does indeed consume a lot less CPU: 1 to 7% instead of a continuous ~14%. |
I'm observing the same thing: no matter what I do, cadvisor uses ~10-15% CPU, all the time. Tried all suggestions here, including double dashes. The reporting seems accurate -> cadvisor does seem to be the most CPU-hungry container out of the bunch I have (grafana, prometheus, pihole, syncthing, ...), and its self-reported CPU usage seems accurate (as do all the others). I would love to find a way to make it stop using so much CPU for no reason... Attached a screenshot of what it looks like on my system. |
Actually, correction: it turns out everything I tried was a no-op... Anyhow, after actually applying the configs, CPU usage goes down drastically with the flags suggested above.
With that ^^, CPU usage goes to ~1.5%. The docs on this page: https://github.com/google/cadvisor/blob/master/docs/runtime_options.md are actually out of date; among other things they state the double-dash form of the flags. |
@zsimic about docker-compose (not related to cadvisor, but in case it helps): |
Awesome tip, thank you! Didn't know that, will make sure to use it. If anyone's looking into this, FWIW, I think it would make sense if the heavier metrics (such as disk I/O) were disabled by default. |
@zsimic metrics disabled by default do not include IO, indeed. |
We had this problem at Crunch Accounting for a really long time and we found a solution for it. We are using this Helm chart and this is the config that we are using at the moment: image:
repository: gcr.io/cadvisor/cadvisor
tag: v0.37.5
pullPolicy: IfNotPresent
container:
port: 8080
additionalArgs:
- --housekeeping_interval=30s # kubernetes default args
- --max_housekeeping_interval=35s
- --event_storage_event_limit=default=0
- --event_storage_age_limit=default=0
- --store_container_labels=false
- --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace
- --global_housekeeping_interval=30s
- --event_storage_event_limit=default=0
- --event_storage_age_limit=default=0
- --disable_metrics=percpu,process,sched,tcp,udp,diskIO,disk,network # enable only cpu, memory
- --docker_only # only show stats for docker containers
- --allow_dynamic_housekeeping=true
- --storage_duration=1m0s here are some graphs showing how CPU usage has improved |
Hi, I tried the suggested options on a server running 150 containers in plain Docker (no Kubernetes), but the server collapses. Monitoring should be lightweight imho: resources should be heavily used by the actual services, not by cadvisor. Does anyone have a solution? Regards. |
Storing the container labels had a pretty heavy impact on CPU usage in my case. First I tried to disable some metrics with disable_metrics, which didn't help much. Then I looked at the metrics Prometheus gathered from cadvisor. By default every container label is stored as a label on the Prometheus metrics as well. In my setup I use traefik and container labels for its configuration, and all of these labels are stored on every metric. In my case, setting store_container_labels=false made the biggest difference. |
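A minimal sketch of the label-related flags discussed here (the whitelisted label name is only an example, borrowed from the compose file earlier in the thread; list whichever labels you actually query on):
# docker-compose.yml (sketch)
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  command:
    # stop exporting every container label as a Prometheus label...
    - '-store_container_labels=false'
    # ...but keep the few you actually need
    - '-whitelisted_container_labels=org.label-schema.group'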
Thanks, I've just gone from 15% to 0.8% by adding these options: |
I cannot think of any valid reason why something as simple and low-effort as storing a label or not could impact performance in such a big way, unless there is some bug or extremely inefficient code causing it. |
@kroese I don't think storing the labels is as trivial as it sounds. In my case, the total metrics payload without the labels was about 16x smaller. Test: fetch the /metrics endpoint and measure its size with all labels included:
~25 MB. Then filter out the labels and measure again:
~1.5 MB. Prometheus also benefits from the 16x reduction in the payload it has to parse. Therefore, I suggest running with store_container_labels=false. |
Using the suggestions here, I'm still not able to drop my CPU usage. The options below don't appear to be having any impact: the CPU usage reported by Grafana is virtually unchanged at ~38%. |
Great find, my CPU went from 3% to almost 0%!! 🙇 |
@truncsphere for me, using the options above has been good for CPU usage. |
@AntoineGlacet
Edit: Removed |
The prometheus_cadvisor container has high CPU usage. On various production systems I checked it sits around 13-16% on controllers, averaged over the prometheus 1m scrape interval. When viewed with top we can see it is a bit spikey and can jump over 100%. There are various bugs about this, but I found google/cadvisor#2523 which suggests reducing the per-container housekeeping interval. This defaults to 1s, which provides far greater granularity than we need with the default prometheus scrape interval of 60s. Reducing the housekeeping interval to 60s on a production controller reduced the CPU usage from 13% to 3.5% average. This still seems high, but is more reasonable. Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7 Closes-Bug: #2048223
Thanks to this thread for the housekeeping_interval tip. I was wondering if 1m would be a better default in cAdvisor, to match the Prometheus default scrape interval. I could put together a PR if people think it's a good suggestion. |
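A sketch of what aligning the two intervals could look like (1m matches the Prometheus default scrape interval; pick whatever granularity you actually need):
# docker-compose.yml (sketch)
cadvisor:
  command:
    # collect stats no more often than Prometheus scrapes them
    - '-housekeeping_interval=1m'
# prometheus.yml (sketch)
scrape_configs:
  - job_name: 'cadvisor'
    scrape_interval: 1m
    static_configs:
      - targets: ['cadvisor:8080']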
The default value is 1s; it can cause high CPU usage: google/cadvisor#2523
Can confirm I'm still facing this issue; the default config shouldn't consume this many resources. I don't know why this hasn't been fixed after so many years. |
@schenklklopfer you can retain the labels you need with whitelisted_container_labels. |
@henfiber Thanks! If I understand this right, I need to name the labels I want to keep in whitelisted_container_labels, so the entire rest gets dropped? |
How is that relevant to the current issue? |
Since the main problem will apparently not be solved at all, this thread is about options to reduce cadvisor's high CPU usage. |
The documentation describes the store_container_labels and whitelisted_container_labels flags.
What you need to do is set store_container_labels=false and list the labels you want to keep in whitelisted_container_labels.
There is an example mentioned previously (2021-05) in this thread.
P.S. Besides the performance implications, exposing all labels by default may leak information in some cases: keys, hashes, etc. can end up exposed in the /metrics endpoint. |
Following paaacman's guidelines:
cAdvisor's CPU usage drops from 12% to less than 1.5%, and memory usage also drops (even though that wasn't the main objective). |
I noticed that the CPU usage of cadvisor is the highest of all containers I'm running, which I find unexpected. It's still not crazy (an average of 7%), but I feel it should be less.
I'm using the following query to calculate CPU usage (in Grafana):
Some info about my instance:
-docker_only --disable_metrics disk,tcp,udp
Is this the expected CPU usage? Is there something I can improve? Or am I monitoring CPU usage wrong?
Edit: I managed to bring it down to ~3% by setting the --housekeeping_interval to 10s. But it would be nice if it was even lower.