Recurring high iowait due to kubelet 'du' process #61999
Comments
/sig node
I've confirmed that the iowait issue starts when the node exhausts its IOPS burst balance for the root EBS volume. I've raised the IOPS limit for the time being by provisioning more expensive storage, but the question is why this node is generating so much I/O. Overall, the cluster is very lightly loaded (the web applications being hosted are low traffic). When I run …
It's likely that `du` burns all the Burst Balance for the gp2 EBS volume. You can check it in AWS CloudWatch. You can use a simple check to find out the IOPS consumption of `du`:

`\time -f 'iops=%w' du /var/lib/docker/overlay/f8fb07fef9498184c4b2dfd7db4270dfadbe15d3bf59727ac23d4e6363f4595f > /dev/null`

In my case the result is: …

Don't forget to drop the fs cache before the second run to get correct IOPS figures. Note that `sudo echo 3 > /proc/sys/vm/drop_caches` won't work, because the redirection happens in the unprivileged shell; use `echo 3 | sudo tee /proc/sys/vm/drop_caches` instead.
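To see why a `du`-style scan is so IOPS-hungry, note that it has to stat every entry in the tree, and every stat of an uncached inode is a disk read. Here is a minimal Go sketch of such a scan (this is not cadvisor's code, and the path is a placeholder to substitute with a real layer directory):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Placeholder: point this at a real overlay layer directory.
	root := "/var/lib/docker/overlay/<layer-id>"

	var files, bytes int64
	start := time.Now()
	// Walk stats every entry it visits; each stat of an uncached inode
	// costs a read IOP, which is what drains a gp2 volume's burst balance.
	filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip unreadable entries and keep going, like du does
		}
		files++
		bytes += info.Size()
		return nil
	})
	fmt.Printf("stat'ed %d entries (%d bytes) in %s\n", files, bytes, time.Since(start))
}
```

Run it twice, once cold and once with the dentry/inode cache warm, and the elapsed time difference shows how much of the cost is metadata reads.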
Agreed, that's likely the cause. But I'm not sure why the command it runs is … If it needs to know the size of the (read-only) container image, it could just calculate it once when the image is pulled.
We have the same issue (…). It only appears on nodes with high memory usage, so my current theory is this: when the kernel's disk cache can no longer hold all the filesystem metadata, each `du` run has to read it back from disk, which produces the high iowait. Excerpt from `/proc/meminfo`:
Full `/proc/meminfo`:

MemTotal:        8178056 kB
MemFree:          190844 kB
MemAvailable:     779988 kB
Buffers:           88892 kB
Cached:           226324 kB
SwapCached:            0 kB
Active:          7062952 kB
Inactive:         135024 kB
Active(anon):    6885820 kB
Inactive(anon):     1068 kB
Active(file):     177132 kB
Inactive(file):   133956 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              1232 kB
Writeback:             0 kB
AnonPages:       6882688 kB
Mapped:           158660 kB
Shmem:              4224 kB
Slab:             612732 kB
SReclaimable:     531460 kB
SUnreclaim:        81272 kB
KernelStack:       46272 kB
PageTables:        38492 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4089028 kB
Committed_AS:   18843660 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       81920 kB
DirectMap2M:     8306688 kB

We also had a Java app crash-looping (too low memory request), which didn't exactly improve things, but might explain the ~800 MB of available memory that is not used as disk cache. We'll try to increase the system memory reservation by 500 MiB (expecting it will be used as disk cache) and see whether the problem reappears.

Update: 400 MiB more RAM didn't help. There's now quite a bit of free (not only available) RAM and the issue still occurs.
Summary: In our case, a 400 MiB additional system memory reservation wasn't enough, but 800 MiB was.

Details: After a 400 MiB increase of the system memory reservation turned out not to be enough, I noticed that one machine was regularly having high I/O for a few minutes at a time (5 minutes high I/O, 5 minutes normal I/O). When looking into it, I noticed a Java app crash-looping. Given my previous idea (not all filesystem metadata read by `du` fits into the disk cache), it looks like the Java app starting results in the kernel giving up part of the disk cache, which in turn results in the next `du` runs having to read that metadata from disk again.

I also found that we have quite a few Docker images with a high number of files (10k-90k), which probably affects both the amount of memory we need as a filesystem cache and the amount of I/O happening when Kubernetes (apparently via `du`) measures disk usage.

In general, a node's performance depending so heavily on the amount of memory available for caching filesystem metadata, which in turn depends on the number of files in the Docker images running on the node, seems quite unpredictable. I'm not sure what a good solution could look like - I don't see a simple way to calculate disk usage without iterating over files. Running `du` … An additional measure (that would be useful anyway) could be to limit the number of concurrent `du` processes.
cc @dashpole David, the …

/sig node
I think the problem is that it seems to iterate over every file in the image filesystem, which could be an entire operating system's worth (e.g. if the image is based on Debian) for every container present. It should just iterate over the writable layer.
This shouldn't be the case; we only run `du` on the writable layer.
@meeee FYI, we already have a limit of 20 concurrent commands - https://github.com/google/cadvisor/blob/master/fs/fs.go#L51
@dashpole It doesn't seem to be cadvisor that's running `du`; in my case it's the kubelet.

If it were measuring the writable layer, I think the overlay path would end with `/upper`.
@glennschmidt The cadvisor code runs as part of the kubelet (it's started here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L670).
@glennschmidt Monitoring the writable layer for overlay was fixed in 1.10; overlay2 was fixed in 1.9.
@dashpole Kubernetes 1.10? Hmm, I'm using 1.8, so in that case I don't know what's going on. I'm fairly confident it's measuring the whole filesystem. My writable layers are tiny (<20 files and <50 KB), so I doubt I'd be seeing these problems if it was just looking at the writable layers.
@glennschmidt It is measuring the whole filesystem. I am saying that the issue is fixed in newer versions. It affects older versions going back to 1.4, when it was added.
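For readers following along, the difference in scan scope can be illustrated with a hedged Go sketch. It assumes the legacy `overlay` driver's per-layer layout (`merged` for the mounted image view, `upper` for the writable diff); the paths are placeholders and this is not the actual cadvisor fix, just the idea behind it:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// usage sums regular-file sizes under root by stat-ing every entry,
// roughly what a du scan does.
func usage(root string) int64 {
	var total int64
	filepath.Walk(root, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // tolerate files changing underneath us
		}
		if info.Mode().IsRegular() {
			total += info.Size()
		}
		return nil
	})
	return total
}

func main() {
	layer := "/var/lib/docker/overlay/<layer-id>" // placeholder layer dir

	// Pre-fix behavior: scanning the merged view touches every file of
	// the image filesystem, for every container on the node.
	fmt.Println("whole image fs:", usage(filepath.Join(layer, "merged")))

	// Post-fix behavior: only the writable upper dir needs scanning,
	// which is tiny for most containers.
	fmt.Println("writable layer:", usage(filepath.Join(layer, "upper")))
}
```

With <20 files in a typical writable layer versus an entire Debian userland in the merged view, the I/O difference between the two calls is several orders of magnitude.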
@dashpole Whoops, sorry, I misread 1.10 as 1.1. Thanks for the explanation. I'll upgrade once 1.10 becomes supported with kops, and hopefully that will address my problem.
I was able to resolve the issue by editing the kops cluster spec to switch the Docker storage driver to `overlay2`. It looks like by default kops configures Docker to use `overlay` as the storage driver, while Docker recommends `overlay2` as the newer, more stable, and faster option.
@markhilton I've done this too, and also upgraded to 1.9. I'll wait and make sure the problem doesn't recur for a while before closing the issue.
@dims @dashpole - I'm looking at this issue as well, and there does seem to be some indication that directories with a large number of files (still working to quantify "large") can cause this. I'm using 1.8 at the moment and looking to move to overlay2 and 1.9. Apart from that, just re-initializing the overlay and container directories seems to have helped as well with 1.8 and overlay.

One thing I noticed in the code at https://github.com/google/cadvisor/blob/master/fs/fs.go#L62 and https://github.com/google/cadvisor/blob/master/fs/fs.go#L66 is that although maxConcurrentOps is set to 20, the channel send is blocking. This can potentially cause a backlog of du/find requests. Any thoughts on making it non-blocking and returning an error when over the maxConcurrentOps limit? The kubelet won't be able to compute usage for that iteration, but it will avoid a large backlog of du and find requests.

I've created an issue in cadvisor if we want to move forward with this, and can submit a PR if needed - google/cadvisor#2022
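To make the suggestion concrete, here is a minimal standalone Go sketch of the non-blocking variant. The buffered-channel-as-semaphore pattern and the limit of 20 come from the linked cadvisor code; the function names and error are illustrative, not a patch:

```go
package main

import (
	"errors"
	"fmt"
)

const maxConcurrentOps = 20

// sem is a counting semaphore: a buffered channel whose capacity caps
// how many du/find operations may run at once.
var sem = make(chan struct{}, maxConcurrentOps)

var errTooManyOps = errors.New("too many concurrent filesystem usage operations")

// tryClaimToken acquires a slot without blocking. A plain `sem <- struct{}{}`
// would queue the caller up behind slow scans; select/default fails fast
// instead, so no backlog of pending du/find requests can build up.
func tryClaimToken() error {
	select {
	case sem <- struct{}{}:
		return nil
	default:
		return errTooManyOps
	}
}

func releaseToken() { <-sem }

func main() {
	if err := tryClaimToken(); err != nil {
		// Skip this measurement cycle; usage stays stale until the next one.
		fmt.Println("skipping usage calculation:", err)
		return
	}
	defer releaseToken()
	fmt.Println("running du/find, at most", maxConcurrentOps, "concurrently")
}
```

The trade-off is exactly the one described above: a skipped iteration means one stale usage sample, which is usually preferable to a queue of stat-heavy scans hammering an already saturated disk.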
I haven't experienced this problem again since switching to `overlay2`.
This issue still exists.

kubectl version: v1.11.7

/var/log/messages: …
@0312birdzhang How did you solve this problem?
@du86796922 Not yet, but you can scale down the pods running on one node, then …
@0312birdzhang Thank you for your advice. I will try it.
/kind bug
What happened:
I'm using Kubernetes 1.8.7, installed on AWS with kops. Every now and again one of my nodes will enter a state where it becomes unable to measure overlay FS usage without causing extremely high iowait. When this happens, the host enters a loop: `du` processes start to time out with messages as shown below, and then `du` runs again. Once this starts, it keeps repeating forever until I drain and replace the node.
The same symptoms are also described by another user in this comment: #23255 (comment). However, I can't find an open issue specifically for this, which is why I'm logging one.
Anything else we need to know?:
The kubelet log is full of messages like this: …
If I manually run `du` against one of the overlay directories that timed out in the log, it returns a result almost immediately.

Any suggestions about how to debug this are appreciated. I've tried to figure out which container filesystems are specifically causing the problem, but I'm not sure how to approach that, since I can't find any one particular `du` command line that hangs for a long time.

Environment:
- Kubernetes version (use `kubectl version`): 1.8.7
- Kernel (e.g. `uname -a`): 4.4.115-k8s #1 SMP Thu Feb 8 15:37:40 UTC 2018 x86_64 GNU/Linux