
cAdvisor occasionally gets into a state where it has no container metadata #1932

Open
bboreham opened this issue Apr 18, 2018 · 17 comments

@bboreham
Contributor

From kubernetes/kubernetes#39812 (comment)

The high-level symptom is that curl /metrics returns blank for image, namespace, etc.

example line from broken node:

container_cpu_system_seconds_total{container_name="",id="/kubepods/burstable/podf8b4aff6914beba1622e8dac2c24ddf2/ca26d89b90d306290a36891aa612cb06ff60f54b89ac7016d982e8a2cfa1036d/\"\"",image="",name="",namespace="",pod_name=""} 126.33

example line from working node:

container_cpu_system_seconds_total{container_name="fluentd-loggly",id="/kubepods/burstable/pode29b14ff-380e-11e8-b4ff-0a97ed59c75e/86fbd797b1ad353b890e548bfc742871be4cbebad5713a32b45257c72ea1656a",image="quay.io/weaveworks/fluentd-loggly@sha256:1b4f74d1cb976175c114802279c2f3f7ac3011740ba141503ef55def1e9b95c3",name="k8s_fluentd-loggly_fluentd-loggly-fl8cp_monitoring_e29b14ff-380e-11e8-b4ff-0a97ed59c75e_2",namespace="monitoring",pod_name="fluentd-loggly-fl8cp"} 95.1

We have seen this on two clusters in recent weeks, running k8s v1.9.3.

It appears that kubelet's view of the universe has diverged significantly from Docker's, hence it does not have the metadata to tag container metrics. There are lots of these in the kubelet logs:

Apr 13 10:32:38 ip-172-20-2-209 kubelet[885]: E0413 10:32:38.395138     885 manager.go:1103] Failed to create existing container: /kubepods/besteffort/pod52908ce2-3c8a-11e8-9d5c-0a41257e78e8/f6276ace92468f578cddd61be4babd0ea3e03c3ad79735137a54ea4e72fdee08: failed to identify the read-write layer ID for container "f6276ace92468f578cddd61be4babd0ea3e03c3ad79735137a54ea4e72fdee08". - open /var/lib/docker/image/overlay2/layerdb/mounts/f6276ace92468f578cddd61be4babd0ea3e03c3ad79735137a54ea4e72fdee08/mount-id: no such file or directory
Apr 13 10:32:38 ip-172-20-2-209 kubelet[885]: E0413 10:32:38.395788     885 manager.go:1103] Failed to create existing container: /kubepods/burstable/pod97d16950-3d9f-11e8-9d5c-0a41257e78e8/2a21b4da59f8c7294e1551783637320cec4b51be4d3ed230274472565fa3d143: failed to identify the read-write layer ID for container "2a21b4da59f8c7294e1551783637320cec4b51be4d3ed230274472565fa3d143". - open /var/lib/docker/image/overlay2/layerdb/mounts/2a21b4da59f8c7294e1551783637320cec4b51be4d3ed230274472565fa3d143/mount-id: no such file or directory
Apr 13 10:32:38 ip-172-20-2-209 kubelet[885]: E0413 10:32:38.396443     885 manager.go:1103] Failed to create existing container: /kubepods/burstable/pod461ad3a8-3f01-11e8-9d5c-0a41257e78e8/9616bf49faab8e61fe9652796d0f37072addd8474b5da9171fdef0c933e02b07: failed to identify the read-write layer ID for container "9616bf49faab8e61fe9652796d0f37072addd8474b5da9171fdef0c933e02b07". - open /var/lib/docker/image/overlay2/layerdb/mounts/9616bf49faab8e61fe9652796d0f37072addd8474b5da9171fdef0c933e02b07/mount-id: no such file or directory

Restarting kubelet fixed the issue for the most recent incident. I'm told that on another occasion it was necessary to drain the node, delete all Docker files, and restart.

@dashpole
Collaborator

What version of docker are you using? What docker storage driver are you using? Does it happen consistently, or only occasionally?
Can you verify that the r/w lager id on both good/bad nodes is stored in /var/lib/docker/image/overlay2/layerdb/mounts/<CONTAINER_UID>/mount-id?

cAdvisor operates independently of kubelet. The kubelet only queries cAdvisor, and doesn't provide it any information, so it is purely a cAdvisor issue. It seems like the correct behavior here is to continue on, even if we do not have an r/w layer id for the docker container (only drop the "image" field, but not the others). There is also likely an underlying bug in cAdvisor's monitoring of the docker storage driver, or an incompatibility with a more recent docker version.
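
As a minimal sketch of that proposed handling (hypothetical function and variable names, not the actual cAdvisor code), assuming the r/w layer ID is read from the layerdb mount-id file shown in the errors above and that a lookup failure should only cost us the image data:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// rwLayerID reads <dockerRoot>/image/<driver>/layerdb/mounts/<containerID>/mount-id,
// the file the kubelet errors above are failing to open.
func rwLayerID(dockerRoot, storageDriver, containerID string) (string, error) {
	p := filepath.Join(dockerRoot, "image", storageDriver, "layerdb", "mounts", containerID, "mount-id")
	b, err := os.ReadFile(p)
	if err != nil {
		return "", fmt.Errorf("failed to identify the read-write layer ID for container %q: %w", containerID, err)
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	id, err := rwLayerID("/var/lib/docker", "overlay2",
		"ca26d89b90d306290a36891aa612cb06ff60f54b89ac7016d982e8a2cfa1036d")
	if err != nil {
		// Proposed behavior: warn and carry on, so name/namespace/pod labels are
		// still emitted; only the image-related data stays empty.
		fmt.Fprintf(os.Stderr, "warning: %v\n", err)
		id = ""
	}
	fmt.Println("rw layer id:", id)
}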

@bboreham
Contributor Author

I think it happens infrequently, although the only evidence is that nobody noticed it happening much. I have set a Prometheus alert on count by (instance)(container_memory_rss) unless count by (instance)(container_memory_rss{namespace!=""}), which should let us pick it up faster next time.

I don't have any bad nodes at the moment, since we have variously restarted and blown things away. I will check next time it happens.

The Docker root dir is nonstandard, but I do have a bunch of files at /mnt/containers/docker/image/overlay2/layerdb/mounts/<CONTAINER_UID>/mount-id. What am I looking for wrt "r/w lager id"?

# docker info
Containers: 116
 Running: 105
 Paused: 0
 Stopped: 11
Images: 2853
Server Version: 1.12.2
Storage Driver: overlay2
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-119-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.44 GiB
Name: ip-172-20-2-209
ID: 4KM4:IOI7:XXAZ:NZG7:XR3K:ETZG:JKWT:S5I2:V7VX:PHUX:CISI:MI2A
Docker Root Dir: /mnt/containers/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

@dashpole
Collaborator

Sorry, typo. Should be r/w layer id, not r/w lager id.
From your log, it looks like cAdvisor isn't respecting the nonstandard docker root dir, since it is failing to open the file /var/lib/docker/image/overlay2/layerdb/mounts/<CONTAINER_ID>/mount-id, when it should be looking in /mnt/containers/docker/image/overlay2/layerdb/mounts/<CONTAINER_ID>/mount-id. But if that is the case, I am surprised it is only happening occasionally. Can you check your "good" nodes, and see if you still see that error message? I am wondering if it is a red herring.

@bboreham
Contributor Author

It's happened again, on one node in each of two clusters, both immediately after a reboot.

We do not get the /var/lib/docker/image messages on working nodes.

I suspect a race in RootDir(), which falls back to the "wrong" directory if it encounters an error, without logging anything. Immediately after a reboot it may not be possible to talk to Docker.
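
Roughly the pattern I mean, as a simplified sketch (not a verbatim copy of the cAdvisor source; dockerStatusRootDir stands in for the real docker info query):

package main

import (
	"errors"
	"flag"
	"fmt"
	"sync"
)

var (
	dockerRootDirFlag = flag.String("docker_root", "/var/lib/docker",
		"DEPRECATED: docker root is read from docker info (this is a fallback, default: /var/lib/docker)")
	dockerRootDirOnce sync.Once
	dockerRootDir     string
)

// dockerStatusRootDir stands in for the docker info query; right after a
// reboot the daemon may not be reachable yet, so this returns an error.
func dockerStatusRootDir() (string, error) {
	return "", errors.New("cannot connect to the Docker daemon")
}

func RootDir() string {
	dockerRootDirOnce.Do(func() {
		rootDir, err := dockerStatusRootDir()
		if err == nil && rootDir != "" {
			dockerRootDir = rootDir
		} else {
			// Silent fallback: nothing is logged, and because of sync.Once the
			// result is cached for the life of the process, even though Docker
			// may become reachable moments later.
			dockerRootDir = *dockerRootDirFlag
		}
	})
	return dockerRootDir
}

func main() {
	flag.Parse()
	// Prints /var/lib/docker even when the real root dir is /mnt/containers/docker.
	fmt.Println(RootDir())
}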

I'll try setting the --docker-root flag.

@jejer

jejer commented Jul 10, 2018

It happened in our cluster.
The docker graph (root dir) was changed to /data0/docker.

Container/pod names are lost in :10255/metrics/cadvisor.

And there are lots of kubelet logs like:

Jul 10 08:51:54 j-daily-e2e-control-worker-edge-storage-01 kubelet[2129]: E0710 08:51:54.776667    2129 manager.go:1130] Failed to create existing container: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd66a596523b1edaba6565ad02fbf61d1.slice/docker-1f43e4af14b5bb32a708f5a49546ee69ceccf635db6a494c0b934380e563a11f.scope: failed to identify the read-write layer ID for container "1f43e4af14b5bb32a708f5a49546ee69ceccf635db6a494c0b934380e563a11f". - open /var/lib/docker/image/overlay2/layerdb/mounts/1f43e4af14b5bb32a708f5a49546ee69ceccf635db6a494c0b934380e563a11f/mount-id: no such file or directory

kubernetes 1.10.2
Redhat docker 1.13.1
Storage Driver: overlay2
Kernel Version: 3.10.0-693.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo)

@jejer

jejer commented Jul 10, 2018

It seems to be a race condition between kubelet start and docker start.
I can reproduce it by manually starting kubelet and then starting docker.

--docker-root is deprecated. If the docker daemon is not running, kubelet fails to read from docker info; will kubelet use the default docker graph to init cAdvisor?

@bboreham
Contributor Author

@jejer I think it just doesn't work at all if it can't initialize.

Where do you see --docker-root deprecated?

@bboreham
Contributor Author

BTW, after I set --docker-root in our system, we never saw the problem again.

To fix the underlying problem we need some alternative to sync.Once that will run again if the initialization went wrong.
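
Something along these lines, as a rough sketch (not an existing library type): a "once" that only latches on success, so a failed first attempt is retried on the next call:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// succeedOnce behaves like sync.Once, except it only latches after the wrapped
// function returns without error; on failure the next call runs it again.
type succeedOnce struct {
	mu   sync.Mutex
	done bool
}

func (o *succeedOnce) Do(f func() error) error {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.done {
		return nil
	}
	if err := f(); err != nil {
		return err // not latched, so a later call retries
	}
	o.done = true
	return nil
}

func main() {
	var once succeedOnce
	attempts := 0
	initDockerRoot := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("docker not ready yet")
		}
		return nil
	}
	for i := 0; i < 5; i++ {
		if err := once.Do(initDockerRoot); err != nil {
			fmt.Println("init failed, will retry:", err)
		}
	}
	fmt.Println("attempts:", attempts) // 3: retried until success, then latched
}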

@jejer

jejer commented Jul 11, 2018

@bboreham, kubelet --help prints --docker-root as deprecated, and in the code:

dockerRootDirFlag = flag.String("docker_root", "/var/lib/docker", "DEPRECATED: docker root is read from docker info (this is a fallback, default: /var/lib/docker)")

I will try setting --docker-root, thank you!

@bboreham
Contributor Author

bboreham commented Jul 11, 2018

"DEPRECATED: docker root is read from docker info (this is a fallback, default: /var/lib/docker)"

OK, that is exactly this bug: if Docker is not ready, the info is not available.

Note it was known at the time; see #1275 (comment):

We cannot deprecate it until the clients of this package can deal with the fact that docker might not be available when cAdvisor is initialized.

@jejer

jejer commented Aug 6, 2018

Added --docker-root for kubelet and never saw this issue again.

@mariusgrigoriu

Setting --docker-root solved the problem for us too. Too bad the flag is deprecated. Is there a plan to fix the underlying issue before that flag is removed?

@Shivkumar13

--docker-root is already set in my case, but I am still facing the issue on some nodes.

@lubronzhan

Hi @Shivkumar13, did you resolve the issue in the end? We saw this issue when using Prometheus.

@Shivkumar13

Shivkumar13 commented Oct 18, 2020

Hi @Shivkumar13, did you resolve the issue in the end? We saw this issue when using Prometheus.

That issue was not resolved properly at that time, but I rebooted the node as a workaround, and suddenly it started showing metrics.

@aoxb

aoxb commented Jun 23, 2022

k8s 1.13.3
Docker version 19.03.8
kernel 4.19.12-1.el7
Encountered the same error; setting --docker-root solved the problem.

@tbankar

tbankar commented Jul 6, 2022

I am facing the same issue on RHEL 8, but using the standalone cadvisor binary.
podman version 4.0.2

I cannot use "docker-root" as it is deprecated.

Command: cadvisor -disable_root_cgroup_stats=false -docker_only=true -housekeeping_interval=30s -store_container_labels=false

Jul 05 23:33:40 cadvisor[3549756]: W0705 23:33:40.810522 3549756 manager.go:1168] Failed to process watch event {EventType:0 Name:/system.slice/46fd7e454d3469b4ab4c96c720cd034774c2c6ef142bc468724f24a7373fa078.service WatchSource:0}: failed to identify the read-write layer ID for container "46fd7e454d3469b4ab4c96c720cd034774c2c6ef142bc468724f24a7373fa078". - open /var/lib/containers/storage/image/overlay/layerdb/mounts/46fd7e454d3469b4ab4c96c720cd034774c2c6ef142bc468724f24a7373fa078/mount-id: no such file or directory
Jul 05 23:33:40 cadvisor[3549756]: W0705 23:33:40.813298 3549756 manager.go:1168] Failed to process watch event {EventType:0 Name:/system.slice/367400503c1be728d62170e3ae67013dbf32f581ea883aedbd639c9c22c29877.service WatchSource:0}: failed to identify the read-write layer ID for container "367400503c1be728d62170e3ae67013dbf32f581ea883aedbd639c9c22c29877". - open /var/lib/containers/storage/image/overlay/layerdb/mounts/367400503c1be728d62170e3ae67013dbf32f581ea883aedbd639c9c22c29877/mount-id: no such file or directory
Jul 05 23:33:40 cadvisor[3549756]: W0705 23:33:40.822480 3549756 manager.go:1168] Failed to process watch event {EventType:0 Name:/system.slice/5e19754c5258145b3e6313e4ac65624a083c2607a0d38f5650c1bd50b7f46e61.service WatchSource:0}: failed to identify the read-write layer ID for container "5e19754c5258145b3e6313e4ac65624a083c2607a0d38f5650c1bd50b7f46e61". - open /var/lib/containers/storage/image/overlay/layerdb/mounts/5e19754c5258145b3e6313e4ac65624a083c2607a0d38f5650c1bd50b7f46e61/mount-id: no such file or directory
Jul 05 23:33:40 cadvisor[3549756]: W0705 23:33:40.826270 3549756 manager.go:1168] Failed to process watch event {EventType:0 Name:/system.slice/703dfb2ea8a9efb8cccd401ba89da2adea5c1f8e187bcf73a7b5d6cc92d7b7cb.service WatchSource:0}: failed to identify the read-write layer ID for container "703dfb2ea8a9efb8cccd401ba89da2adea5c1f8e187bcf73a7b5d6cc92d7b7cb". - open /var/lib/containers/storage/image/overlay/layerdb/mounts/703dfb2ea8a9efb8cccd401ba89da2adea5c1f8e187bcf73a7b5d6cc92d7b7cb/mount-id: no such file or directory
