
Increase the period of getting container filesystem/network stats #898

Closed

yujuhong opened this issue Sep 29, 2015 · 12 comments

@yujuhong (Contributor)

Frequent checking could cause high CPU usage, as reported by a Kubernetes user in kubernetes/kubernetes#10451 (comment).

Edit: the current housekeeping period is 1s.

/cc @vishh

@yujuhong (Contributor, Author)

/cc @dchen1107

@jimmidyson (Collaborator)

From the referenced issue it seems running du is the culprit. Why does cadvisor run du at all? I don't think we care much about the size of any individual dirs, do we? Shouldn't df or equivalent give enough info?

@jimmidyson (Collaborator)

AFAICT this du check only runs on aufs-backed docker, basically docker running on Ubuntu. Firstly, I don't really like that inconsistency, but secondly, I'm not sure I see the value of this check anyway. What do you think about removing it? In the meantime I'll have a think about how we might implement it in a better fashion.

Alternatively we could reduce the polling frequency, but that just seems like delaying the problem until the number of containers rises.

@yujuhong (Contributor, Author)

AFAICT this du check only runs on aufs-backed docker, basically docker running on Ubuntu.

Do we not check disk usage in other cases?

I think we want more disk usage information exposed to the kubelet. @vishh is probably the one who added this, so he'd defend the choice better than I could.

I agree that reducing the polling frequency is only a short-term solution, but it'd help for the time being since we've got quite a few reports from users. A better implementation is more than welcome :)

@jimmidyson (Collaborator)

Yes, only aufs. See:

    if !self.usesAufsDriver {

If this check is required, the way to perform it depends on the backing storage used for docker. In its current form this check would probably also work for devicemapper loopback, but it isn't going to work for a more production-like deployment such as direct LVM.

However, I can't think of a better way to do it tbh, and I can't see this performing well enough at scale. I still vote to drop the check and open an issue to think about whether we can do this better.

@jimmidyson (Collaborator)

You probably already know this, but I thought it worth noting why du causes CPU spikes. The only way for du to know a dir's size is to traverse the file tree & sum file sizes from file metadata. Unless the files are cached, this is a relatively expensive operation. It also churns the disk cache, which can hurt the performance of other running processes, since files they access may have been evicted & need to be recached.
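For illustration, here's a minimal Go sketch of the kind of work a du-style scan has to do: walk every entry under the tree and sum sizes from per-file metadata, so the cost scales with the number of files, not the bytes used. The path is a hypothetical example and this is not cadvisor's actual code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dirSize walks the whole tree under root and sums file sizes from
// metadata. Every file costs at least one stat, which is why this is
// expensive for large trees and pulls inodes through the disk cache.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	// Hypothetical aufs layer directory, for illustration only.
	size, err := dirSize("/var/lib/docker/aufs/diff")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d bytes\n", size)
}
```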

@yujuhong (Contributor, Author)

yujuhong commented Oct 1, 2015

I am okay with disabling this for now since the kubelet hasn't started using it, but I think we'll still want the disk usage information in the near future. @vishh, WDYT?

@vishh (Contributor)

vishh commented Oct 2, 2015

Filesystem stats are useful mainly for figuring out which container is hogging disk space on a given node.
As you said @jimmidyson, there is no easy way to make this work across all storage drivers in docker.
I'm working on a quota-based approach, but even that won't work in all deployments, since it requires setting up quotas.
I personally think the fs usage feature will be useful.
If we can, we should probably add support for other storage backends like lvm and overlay.
For now, reducing the frequency will help. Another option is to place ulimits on du when exec'ing it.
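As a rough sketch of bounding an exec'd du: instead of a literal ulimit (which is awkward to apply to a child process from Go), this lowers the child's scheduling priority via nice(1) and enforces a wall-clock timeout. The timeout value and path are illustrative assumptions, not cadvisor's actual implementation.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// duWithBudget runs `nice -n 19 du -s <dir>` at the lowest CPU priority;
// the context kills the child if it runs past the timeout, so a huge tree
// can't pin the CPU or hang stats collection indefinitely.
func duWithBudget(dir string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	cmd := exec.CommandContext(ctx, "nice", "-n", "19", "du", "-s", dir)
	return cmd.Output()
}

func main() {
	// Hypothetical aufs layer directory and budget, for illustration only.
	out, err := duWithBudget("/var/lib/docker/aufs/diff", 30*time.Second)
	if err != nil {
		fmt.Println("du failed or timed out:", err)
		return
	}
	fmt.Print(string(out))
}
```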

@jimmidyson (Collaborator)

See #771 for another reason not to run du inside the containers' filesystem - it blocks container deletes.

@dchen1107 (Collaborator)

@jimmidyson du is a temporary workaround for disk usage tracking in cAdvisor without disk quota. We are working on a proposal / prototype for better disk usage tracking; one proposal is to use disk quota tracking. But before we get there, we need signals at least to detect an out-of-disk condition and propagate that information to upstream layers for management. Thus, increasing the interval for filesystem stats might be an ok workaround for the short term.

@jimmidyson (Collaborator)

@dchen1107 du isn't giving you out-of-disk notifications; that would be handled via df or similar, I'd suggest. du is giving you information on where your disk is being used up, which I agree is useful, but it currently has, to my mind, unacceptable impacts on both performance (high IO & CPU) & stability (blocking container GC). It is also only implemented for aufs, which isn't great.

If we could somehow swap to df or equivalent for the storage backends, that might be a better approach, but I have no idea if that is possible.

@vishh (Contributor)

vishh commented Oct 7, 2015

@jimmidyson: We do not use du for out-of-disk conditions; we use statfs. I intend to add support for devicemapper and overlayfs soon, which should address the "aufs only" concern.
As @dchen1107 mentioned, identifying and getting rid of a disk-hogging container is very useful in practice.
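For contrast with du, here's a minimal Linux-only Go sketch of a statfs-based check: statfs(2) reads whole-filesystem counters in constant time, which is enough to detect an out-of-disk condition, though it can't tell you which container is responsible. This is not cadvisor's actual code.

```go
package main

import (
	"fmt"
	"syscall"
)

// fsUsage returns total and available bytes for the filesystem containing
// path. Unlike du, this is one syscall regardless of how many files exist.
func fsUsage(path string) (totalBytes, freeBytes uint64, err error) {
	var s syscall.Statfs_t
	if err = syscall.Statfs(path, &s); err != nil {
		return 0, 0, err
	}
	bsize := uint64(s.Bsize)
	return s.Blocks * bsize, s.Bavail * bsize, nil
}

func main() {
	total, free, err := fsUsage("/var/lib/docker")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("total=%d free=%d used%%=%.1f\n",
		total, free, 100*float64(total-free)/float64(total))
}
```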
