
Missing container metrics in kubelet (cAdvisor) in v1.5.1 #39812

Closed
ichekrygin opened this issue Jan 12, 2017 · 44 comments
Labels
area/cadvisor, area/os/coreos, lifecycle/rotten, sig/node

Comments

@ichekrygin

ichekrygin commented Jan 12, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No, this looks like a regression in v1.5.1

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): kubelet, metrics, cAdvisor


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:57:05Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:52:01Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: aws
  • OS (e.g. from /etc/os-release): os_image="Container Linux by CoreOS 1235.5.0 (Ladybug)"
  • Kernel (e.g. uname -a): Linux ip-10-72-161-5 4.7.3-coreos-r2 #1 SMP Sun Jan 8 00:32:25 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz GenuineIntel GNU/Linux
  • Install tools:
  • Others:

What happened: after upgrading to v1.5.1, cAdvisor does not show any subcontainers, so ALL container system metrics are missing. For example:

core@ip-10-72-161-5 ~ $ curl localhost:10255/metrics | grep container_cpu_user_seconds_total
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{id="/"} 0

What you expected to happen: cAdvisor returns subcontainers and container metrics, like so:

core@ip-10-72-6-143 ~ $ curl localhost:10255/metrics | grep container_cpu_user_seconds_total | more
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{id="/"} 2.74665981e+06
container_cpu_user_seconds_total{id="/docker"} 2.33384234e+06
container_cpu_user_seconds_total{id="/init.scope"} 6.26

How to reproduce it (as minimally and precisely as possible):
upgrade the kubelet to v1.5.1 and check metrics:
via the metrics endpoint: curl localhost:10255/metrics
or via the cAdvisor UI: http://10.72.20.134:4194/containers/

Anything else we need to know:

docker version:

docker version
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:        
 OS/Arch:      linux/amd64
@pires
Contributor

pires commented Jan 14, 2017

Seems to be fixed in 1.5.2.

core@node-01 ~ $ curl 172.17.8.102:10255/metrics | grep container_cpu_user_seconds_total | more
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{id="/"} 281.72
container_cpu_user_seconds_total{id="/docker"} 241.4
container_cpu_user_seconds_total{id="/init.scope"} 0.45
container_cpu_user_seconds_total{id="/system.slice"} 38.68
container_cpu_user_seconds_total{id="/system.slice/audit-rules.service"} 0
container_cpu_user_seconds_total{id="/system.slice/containerd.service"} 0.68
container_cpu_user_seconds_total{id="/system.slice/coreos-setup-environment.service"} 0
container_cpu_user_seconds_total{id="/system.slice/dbus.service"} 0.2
container_cpu_user_seconds_total{id="/system.slice/docker.service"} 29.61
container_cpu_user_seconds_total{id="/system.slice/etcd2.service"} 3.2
container_cpu_user_seconds_total{id="/system.slice/flanneld.service"} 1.62

@jakexks
Contributor

jakexks commented Jan 26, 2017

It's not fixed in 1.5.2. In my experience it works after restarting the kubelet, but it eventually gets back into a state where it stops reporting container metrics.

Exactly the same setup as the original poster:

$ uname -a
Linux qa1-worker0 4.7.3-coreos-r2 #1 SMP Sun Jan 8 00:32:25 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz GenuineIntel GNU/Linux

Container Linux by CoreOS 1235.5.0 (Ladybug)

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:52:34Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:52:34Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
$ curl -s qa1-worker0:10255/metrics | grep container_cpu_user_seconds_total
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{id="/"} 0
$ docker version
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:        
 OS/Arch:      linux/amd64

@piosz
Member

piosz commented Jan 27, 2017

cc @dchen1107 @dashpole @timstclair

@piosz
Member

piosz commented Jan 27, 2017

@kubernetes/sig-node-bugs

@piosz
Member

piosz commented Jan 27, 2017

@ichekrygin @jakexks does the kubelet summary endpoint have this information? To verify, you can use:

curl -s http://localhost:10255/stats/summary
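For a quick pass/fail check, a hedged one-liner (assuming jq is installed and the read-only port 10255 is enabled):

# Count the pods reported by the kubelet summary API; 0 on a node that is
# actually running pods indicates the missing-stats condition.
curl -s http://localhost:10255/stats/summary | jq '.pods | length'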

@dashpole
Contributor

We have had some issues in the past with stats disappearing on certain CoreOS distributions:
#32304, #30939, cadvisor#1344
cc @euank (CoreOS), any ideas?

@dashpole
Contributor

@justinsb, is #30939 fixed for aws?

@jakexks
Contributor

jakexks commented Jan 30, 2017

@piosz There's nothing in the kubelet summary either:

$ curl -s http://localhost:10255/stats/summary
{
  "node": {
   "nodeName": "qa1-worker0",
   "startTime": null,
   "memory": {
    "time": "2017-01-30T11:21:23Z",
    "availableBytes": 16831172608,
    "usageBytes": 0,
    "workingSetBytes": 0,
    "rssBytes": 0,
    "pageFaults": 0,
    "majorPageFaults": 0
   },
   "fs": {
    "availableBytes": 4938715136,
    "capacityBytes": 6350921728,
    "usedBytes": 1064087552,
    "inodesFree": 1627641,
    "inodes": 1628800,
    "inodesUsed": 1159
   },
   "runtime": {
    "imageFs": {
     "availableBytes": 13845848064,
     "capacityBytes": 21003628544,
     "usedBytes": 1793178765,
     "inodesFree": 771906,
     "inodes": 1310720,
     "inodesUsed": 538814
    }
   }
  },
  "pods": []
 }

(There are many pods running on the node)

@dadux

dadux commented Feb 2, 2017

Same issue with latest CoreOS stable. Metrics seem to disappear after a few hours.

$ curl -sk  https://localhost:10250/metrics | head
# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="",cadvisorVersion="",dockerVersion="1.12.6",kernelVersion="4.7.3-coreos-r2",osVersion="Container Linux by CoreOS 1235.8.0 (Ladybug)"} 1
# HELP container_cpu_system_seconds_total Cumulative system cpu time consumed in seconds.
# TYPE container_cpu_system_seconds_total counter
container_cpu_system_seconds_total{id="/"} 0
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{id="/"} 0

I think #33192 might be related?

@ichekrygin
Author

@dadux yes, it is related to #33192.

@ichekrygin
Author

@piosz I think this issue impacts HPA, since with missing metrics HPA reports somewhat incorrect CPU usage. @jakexks, did you notice anything of this kind?

@piosz
Member

piosz commented Feb 3, 2017

@philk

philk commented Feb 3, 2017

It absolutely affects HPA (it stops HPAs with pods on broken nodes from scaling in either direction). That's how we initially discovered this issue on our systems.

@ichekrygin
Author

@philk just curious, how are you coping with it? I am at the point of setting up a cron job that restarts kube-kubelet.service every 12 hours.

@philk

philk commented Feb 4, 2017

@ichekrygin just a simple script that runs every 5 minutes: it checks curl -s localhost:10255/stats/summary | jq -Mr '.pods | any', and if that returns false it runs systemctl restart kubelet.
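A minimal sketch of that check-and-restart step, under the same assumptions (read-only port 10255 enabled, jq installed); scheduling via cron or a systemd timer is left out:

#!/bin/sh
# If the kubelet summary API reports no pods, assume stats collection is
# wedged and bounce the kubelet (the node should normally be running pods).
if [ "$(curl -s localhost:10255/stats/summary | jq -Mr '.pods | any')" != "true" ]; then
    systemctl restart kubelet
fi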

@justinsb
Member

justinsb commented Feb 5, 2017

@dashpole I believe it is fixed. I closed that issue (I'm sure someone will comment if not), but I have been unable to reproduce (so far) on our 1.5 AWS images.

@ichekrygin
Author

@justinsb Do you mean 1.5 AWS images that are after 1.5.2?
@jakexks has a repro on 1.5.2.

@ichekrygin
Author

This is what I see in kube-kubelet.service log when cAdvisor loses container metrics:

Feb 07 13:50:15 ip-10-72-161-195 kubelet[2401]: W0207 13:50:15.144290    2401 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/cd3ace64-d3de-4cb4-88ea-140752d3b570/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpu,cpuacct/system.slice/var-lib-docker-overlay-d7ec9fb3aaa33f4865b82c9682b4fd80461751043b4e8ec31a3561d08a72f1a4-merged.mount": 0x40000100 == IN_CREATE|IN_ISDIR): open /var/lib/rkt/pods/run/cd3ace64-d3de-4cb4-88ea-140752d3b570/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/cpu,cpuacct/system.slice/var-lib-docker-overlay-d7ec9fb3aaa33f4865b82c9682b4fd80461751043b4e8ec31a3561d08a72f1a4-merged.mount: no such file or directory
Feb 07 13:50:15 ip-10-72-161-195 kubelet[2401]: W0207 13:50:15.144765    2401 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/cd3ace64-d3de-4cb4-88ea-140752d3b570/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/blkio/system.slice/var-lib-docker-overlay-d7ec9fb3aaa33f4865b82c9682b4fd80461751043b4e8ec31a3561d08a72f1a4-merged.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/cd3ace64-d3de-4cb4-88ea-140752d3b570/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/blkio/system.slice/var-lib-docker-overlay-d7ec9fb3aaa33f4865b82c9682b4fd80461751043b4e8ec31a3561d08a72f1a4-merged.mount: no such file or directory
Feb 07 13:50:15 ip-10-72-161-195 kubelet[2401]: W0207 13:50:15.144816    2401 raw.go:87] Error while processing event ("/var/lib/rkt/pods/run/cd3ace64-d3de-4cb4-88ea-140752d3b570/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/memory/system.slice/var-lib-docker-overlay-d7ec9fb3aaa33f4865b82c9682b4fd80461751043b4e8ec31a3561d08a72f1a4-merged.mount": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /var/lib/rkt/pods/run/cd3ace64-d3de-4cb4-88ea-140752d3b570/stage1/rootfs/opt/stage2/flannel/rootfs/sys/fs/cgroup/memory/system.slice/var-lib-docker-overlay-d7ec9fb3aaa33f4865b82c9682b4fd80461751043b4e8ec31a3561d08a72f1a4-merged.mount: no such file or directory

@fiksn

fiksn commented Feb 23, 2017

I have the same problem on CoreOS 1235.9.0 with Kubernetes 1.5.3.

@philk's workaround works like a charm, but the problem is that I don't want to keep restarting the kubelet on a (temporarily) cordoned host, where the pods section is legitimately empty. So I tried something like this:

kubelet-periodic-check.service

[Unit]
Description=Kubelet health check
Documentation=https://github.com/kubernetes/kubernetes/issues/33192 https://github.com/kubernetes/kubernetes/issues/39812

[Service]
Type=oneshot
ExecStart=/bin/sh -c "curl --connect-timeout 5 --max-time 10 http://127.0.0.1:4194/containers/system.slice/etcd2.service 2>/dev/null | grep -q failed && systemctl restart kubelet"

kubelet-periodic-check.timer

[Unit]
Description=Kubelet health check cron
Documentation=https://github.com/kubernetes/kubernetes/issues/33192 https://github.com/kubernetes/kubernetes/issues/39812

[Timer]
OnBootSec=13min
OnUnitActiveSec=13m
Unit=kubelet-periodic-check.service

[Install]
WantedBy=timers.target
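A usage sketch for the two units above, assuming they are installed under /etc/systemd/system:

sudo systemctl daemon-reload
sudo systemctl enable --now kubelet-periodic-check.timer
# Confirm the timer is scheduled and when it last ran.
systemctl list-timers kubelet-periodic-check.timer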

I (ab)use the fact that when the problem happens http://127.0.0.1:4194/containers/system.slice/etcd2.service returns:

failed to get container "/system.slice/etcd2.service" with error: unknown container "/system.slice/etcd2.service"

and etcd2 should always be running on my nodes. Perhaps this will help somebody. Hopefully this issue gets resolved properly soon, though.

@ajaybhande

This issue seems to affect older versions as well. Refer to #33192.

@andrewsykim
Member

I'm seeing this issue as well, and it's blocking HPA from doing any scaling in my production clusters. I would like to avoid hacks such as periodically restarting the kubelet if possible. Is anyone currently working on a fix for this? If not, I don't mind digging into it a bit; if someone can point me in the right direction for where to start, that would be great. It seems like the cAdvisor code in the kubelet would be a good place to start?

@vishh
Contributor

vishh commented Mar 9, 2017 via email

@dashpole
Contributor

dashpole commented Mar 9, 2017

I'll check it out.

@dashpole
Contributor

dashpole commented Mar 9, 2017

This may already be fixed, although I am not sure whether the fix has made it into Kubernetes yet.
It looks like what cadvisor#1573 fixes.

@dchen1107 added the area/os/coreos, area/cadvisor, and sig/node labels Mar 9, 2017
@andrewsykim
Member

@dashpole any chance we can get a cAdvisor vendor update in for the next release?

@mindw

mindw commented Mar 10, 2017

What are the chances for fixing this for 1.4.x / 1.5.x?

Thanks!

@dashpole
Contributor

So this will definitely be in 1.6; it was actually added two months ago in #40095. However, adding it to 1.5 or 1.4 would require cherry-picking #40095, which may not happen, since that PR includes updating the AWS dependencies and is ~80k lines of code.

@timstclair

@dashpole we should cherry-pick the cAdvisor fix into the release-v0.24 branch so it can be cherry-picked into the k8s 1.4/1.5 branches.

@dashpole
Contributor

Cherry-pick to 1.5: #43113

@andrewsykim
Member

A fix for this went out in 1.5.6; I upgraded my worker nodes to 1.5.6 and I still see this bug.

@piosz
Member

piosz commented Apr 4, 2017

@dashpole @timstclair can you please take a look?

1 similar comment

@dashpole
Contributor

dashpole commented Apr 5, 2017

@andrewsykim are you running on a systemd-based system?

@andrewsykim
Member

@dashpole yes I am (CoreOS)

@dashpole
Contributor

dashpole commented Apr 7, 2017

@andrewsykim would you mind opening an issue against cAdvisor and giving us some of the error messages you see in your logs? I have no experience with systemd, but I can find someone who does to check out your specific case, since it wasn't fixed by #1573.

@nicklan

nicklan commented Aug 1, 2017

We ran into this issue when running kubelet in a rkt container. We were seeing error messages of the form:
Aug 01 00:50:30 [nodename] kubelet-wrapper[20169]: E0801 00:50:30.626256 20169 manager.go:1031] Failed to create existing container: /docker/1f3994fd716cc132015b6059d47e84425370ce39d0d05016ba5c7f321c3b4f18: failed to identify the read-write layer ID for container "1f3994fd716cc132015b6059d47e84425370ce39d0d05016ba5c7f321c3b4f18". - open /storage/docker/image/overlay/layerdb/mounts/1f3994fd716cc132015b6059d47e84425370ce39d0d05016ba5c7f321c3b4f18/mount-id: no such file or directory

It appears cAdvisor tries to read those files, but inside the rkt container /storage/docker was not mounted. The solution was to bind-mount that directory when starting the kubelet in rkt with:

--volume dockerstorage,kind=host,source=/storage/docker,readOnly=true --mount volume=dockerstorage,target=/storage/docker

Interestingly, we only saw errors on our master nodes (i.e. those running kube-apiserver). Worker nodes don't have the issue even though the Docker directory isn't mounted, and can still export pod metrics. No idea why that is the case ¯\_(ツ)_/¯
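For reference, a hedged sketch of one way to wire those flags in when the kubelet is launched via the CoreOS kubelet-wrapper; the drop-in name, the kubelet.service unit name, and the /storage/docker path are assumptions to adapt to your setup (older wrappers read RKT_OPTS rather than RKT_RUN_ARGS):

# Hypothetical drop-in that passes the extra rkt volume/mount to the
# kubelet-wrapper via RKT_RUN_ARGS; note it overrides any RKT_RUN_ARGS
# already set by kubelet.service, so merge flags as needed.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
sudo tee /etc/systemd/system/kubelet.service.d/20-docker-storage.conf <<'EOF'
[Service]
Environment="RKT_RUN_ARGS=--volume dockerstorage,kind=host,source=/storage/docker,readOnly=true --mount volume=dockerstorage,target=/storage/docker"
EOF
sudo systemctl daemon-reload && sudo systemctl restart kubelet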

@d-shi

d-shi commented Aug 18, 2017

Is there an update on this? I'm running k8s 1.7.2 and am also noticing that the kubelet's :10255/metrics endpoint does not show pod metrics. However, :10255/stats/summary does. Restarting the kubelet does not change anything.

@d-shi

d-shi commented Aug 19, 2017

Actually, never mind... apparently the behavior of cAdvisor changed in 1.7. I was able to get Prometheus to grab pod metrics by following the setup mentioned here: https://raw.githubusercontent.com/prometheus/prometheus/master/documentation/examples/prometheus-kubernetes.yml
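A hedged spot-check for 1.7+, assuming the read-only port is still enabled and that cAdvisor's Prometheus metrics are now served on a separate kubelet path (which is what the linked example config scrapes, via the API server proxy):

# Container metrics should appear here rather than on the top-level /metrics.
curl -s http://localhost:10255/metrics/cadvisor | grep container_cpu_user_seconds_total | head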

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 3, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 8, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@bboreham
Contributor

/reopen

We have seen this on two clusters in recent weeks, running k8s v1.9.3.

It appears that the kubelet's view of the universe has diverged significantly from Docker's, hence it does not have the metadata to tag container metrics. There are lots of these in the kubelet logs:

Apr 13 10:32:38 ip-172-20-2-209 kubelet[885]: E0413 10:32:38.395138     885 manager.go:1103] Failed to create existing container: /kubepods/besteffort/pod52908ce2-3c8a-11e8-9d5c-0a41257e78e8/f6276ace92468f578cddd61be4babd0ea3e03c3ad79735137a54ea4e72fdee08: failed to identify the read-write layer ID for container "f6276ace92468f578cddd61be4babd0ea3e03c3ad79735137a54ea4e72fdee08". - open /var/lib/docker/image/overlay2/layerdb/mounts/f6276ace92468f578cddd61be4babd0ea3e03c3ad79735137a54ea4e72fdee08/mount-id: no such file or directory
Apr 13 10:32:38 ip-172-20-2-209 kubelet[885]: E0413 10:32:38.395788     885 manager.go:1103] Failed to create existing container: /kubepods/burstable/pod97d16950-3d9f-11e8-9d5c-0a41257e78e8/2a21b4da59f8c7294e1551783637320cec4b51be4d3ed230274472565fa3d143: failed to identify the read-write layer ID for container "2a21b4da59f8c7294e1551783637320cec4b51be4d3ed230274472565fa3d143". - open /var/lib/docker/image/overlay2/layerdb/mounts/2a21b4da59f8c7294e1551783637320cec4b51be4d3ed230274472565fa3d143/mount-id: no such file or directory
Apr 13 10:32:38 ip-172-20-2-209 kubelet[885]: E0413 10:32:38.396443     885 manager.go:1103] Failed to create existing container: /kubepods/burstable/pod461ad3a8-3f01-11e8-9d5c-0a41257e78e8/9616bf49faab8e61fe9652796d0f37072addd8474b5da9171fdef0c933e02b07: failed to identify the read-write layer ID for container "9616bf49faab8e61fe9652796d0f37072addd8474b5da9171fdef0c933e02b07". - open /var/lib/docker/image/overlay2/layerdb/mounts/9616bf49faab8e61fe9652796d0f37072addd8474b5da9171fdef0c933e02b07/mount-id: no such file or directory

kubelet is running as a systemd service, not in a container.

Restarting the kubelet fixed the issue for today's instance. I'm told that on another occasion it was necessary to drain the node, delete all Docker files, and restart.

@k8s-ci-robot
Contributor

@bboreham: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dashpole
Contributor

@bboreham can you open an issue with cAdvisor? We can debug it there.
