
Node flapping between Ready/NotReady with PLEG issues #45419

Closed
deitch opened this issue May 5, 2017 · 187 comments

@deitch (Contributor) commented May 5, 2017

Is this a request for help? No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): PLEG NotReady kubelet


Is this a BUG REPORT or FEATURE REQUEST? Bug

Kubernetes version (use kubectl version): 1.6.2

Environment:

  • Cloud provider or hardware configuration: CoreOS on AWS
  • OS (e.g. from /etc/os-release): CoreOS 1353.7.0
  • Kernel (e.g. uname -a): 4.9.24-coreos
  • Install tools:
  • Others:

What happened:

I have a 3-worker cluster. Two and sometimes all three nodes keep dropping into NotReady with the following messages in journalctl -u kubelet:

May 05 13:59:56 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 13:59:56.872880    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 13:59:56 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 13:59:56.872908    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 13:59:56.872865742 +0000 UTC LastTransitionTime:2017-05-05 13:59:56.872865742 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.629592089s ago; threshold is 3m0s}
May 05 14:07:57 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:07:57.598132    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:07:57 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:07:57.598162    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:07:57.598117026 +0000 UTC LastTransitionTime:2017-05-05 14:07:57.598117026 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.346983738s ago; threshold is 3m0s}
May 05 14:17:58 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:17:58.536101    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:17:58 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:17:58.536134    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:17:58.536086605 +0000 UTC LastTransitionTime:2017-05-05 14:17:58.536086605 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.275467289s ago; threshold is 3m0s}
May 05 14:29:59 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:29:59.648922    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:29:59 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:29:59.648952    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:29:59.648910669 +0000 UTC LastTransitionTime:2017-05-05 14:29:59.648910669 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.377520804s ago; threshold is 3m0s}
May 05 14:44:00 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:44:00.938266    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:44:00 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:44:00.938297    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:44:00.938251338 +0000 UTC LastTransitionTime:2017-05-05 14:44:00.938251338 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.654775919s ago; threshold is 3m0s}

The docker daemon is fine (local docker ps, docker images, etc. all work and respond immediately).

Networking is Weave, installed via kubectl apply -f https://git.io/weave-kube-1.6.

What you expected to happen:

Nodes to be ready.

How to reproduce it (as minimally and precisely as possible):

Wish I knew how!

Anything else we need to know:

All of the nodes (workers and masters) are on the same private subnet, with a NAT gateway to the Internet. The workers are in a security group that allows unlimited access (all ports) from the masters' security group; the masters allow all ports from the same subnet. kube-proxy is running on the workers; the apiserver, controller-manager, and scheduler run on the masters.

kubectl logs and kubectl exec always hang, even when run from the master itself (or from outside).

@yujuhong yujuhong added the sig/node label May 5, 2017

@yujuhong (Member) commented May 5, 2017

@deitch, how many containers were running on the node? What's the overall cpu utilization of your nodes?

@deitch (Contributor, Author) commented May 5, 2017

Basically none: kube-dns, weave-net, weave-npc, and 3 template sample services. Actually only one, because two had no image and were going to be cleaned up. The nodes are AWS m4.2xlarge, so it is not a resource issue.

I ended up having to destroy the nodes and recreate them. No PLEG messages since the destroy/recreate, and they seem 50% OK: they stay Ready, although they still refuse to allow kubectl exec or kubectl logs.

I really struggled to find any documentation on what PLEG really is, but more importantly on how to check its logs and state and debug it.

@deitch (Contributor, Author) commented May 5, 2017

Hmm... to add to the mystery, no container can resolve any hostnames, and kube-dns gives:

E0505 17:30:49.412272       1 reflector.go:199] pkg/dns/config/sync.go:114: Failed to list *api.ConfigMap: Get https://10.200.0.1:443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dkube-dns&resourceVersion=0: dial tcp 10.200.0.1:443: getsockopt: no route to host
E0505 17:30:49.412285       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: Get https://10.200.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.200.0.1:443: getsockopt: no route to host
E0505 17:30:49.412272       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.200.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.200.0.1:443: getsockopt: no route to host
I0505 17:30:51.855370       1 logs.go:41] skydns: failure to forward request "read udp 10.100.0.3:60364->10.50.0.2:53: i/o timeout"

FWIW, 10.200.0.1 is the kube api service internally, 10.200.0.5 is DNS, 10.50.20.0/24 and 10.50.21.0/24 are the subnets (2 separate AZs) on which masters and workers run.

Something just really fubar in the networking?

@deitch (Contributor, Author) commented May 5, 2017

Something just really fubar in the networking?

@bboreham could this be related to weave and not kube (or at least misconfigured weave)? Standard weave with the IPALLOC_RANGE=10.100.0.0/16 added as discussed at weaveworks/weave#2736

@qiujian16 (Contributor) commented May 11, 2017

@deitch PLEG is how the kubelet periodically lists the pods on the node to check their health and update its cache. If you see PLEG timeout logs, it may not be related to DNS; more likely the kubelet's calls to Docker are timing out.

@deitch (Contributor, Author) commented May 11, 2017

Thanks @qiujian16. The issue appears to have gone away, but I have no idea how to check it. Docker itself appeared healthy. I was wondering if it could be the networking plugin, but that should not affect the kubelet itself.

Can you give me some pointers here on checking pleg healthiness and status? Then we can close this out until I see the issue recur.

@qiujian16 (Contributor) commented May 11, 2017

@deitch PLEG is short for "Pod Lifecycle Event Generator". It is an internal component of the kubelet, and I do not think you can check its status directly; see https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-lifecycle-event-generator.md

@deitch (Contributor, Author) commented May 11, 2017

Is it an internal module in the kubelet binary? Is it another standalone container (docker, runc, containerd)? Is it just a standalone binary?

Basically, if the kubelet reports PLEG errors, it is very helpful to be able to find out what those errors are, check its status, and try to replicate the problem.

@qiujian16 (Contributor) commented May 11, 2017

It is an internal module.

@yujuhong (Member) commented May 11, 2017

@deitch most likely docker was not as responsive at times, causing PLEG to miss its threshold.

@bjhaid (Contributor) commented May 11, 2017

I am having a similar issue on all nodes but one in a cluster I just created. Logs:

kube-worker03.foo.bar.com kubelet[3213]: E0511 19:00:59.139374    3213 remote_runtime.go:109] StopPodSandbox "12c6a5c6833a190f531797ee26abe06297678820385b402371e196c69b67a136" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
May 11 19:00:59 kube-worker03.foo.bar.com kubelet[3213]: E0511 19:00:59.139401    3213 kuberuntime_gc.go:138] Failed to stop sandbox "12c6a5c6833a190f531797ee26abe06297678820385b402371e196c69b67a136" before removing: rpc error: code = 4 desc = context deadline exceeded
May 11 19:01:04 kube-worker03.foo.bar.com kubelet[3213]: E0511 19:01:04.627954    3213 pod_workers.go:182] Error syncing pod 1c43d9b6-3672-11e7-a6da-00163e041106 ("kube-dns-4240821577-1wswn_kube-system(1c43d9b6-3672-11e7-a6da-00163e041106)"), skipping: rpc error: code = 4 desc = context deadline exceeded
May 11 19:01:18 kube-worker03.foo.bar.com kubelet[3213]: E0511 19:01:18.627819    3213 pod_workers.go:182] Error syncing pod 1c43d9b6-3672-11e7-a6da-00163e041106 ("kube-dns-4240821577-1wswn_kube-system(1c43d9b6-3672-11e7-a6da-00163e041106)"), skipping: rpc error: code = 4 desc = context deadline exceeded
May 11 19:01:21 kube-worker03.foo.bar.com kubelet[3213]: I0511 19:01:21.627670    3213 kubelet.go:1752] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.339074625s ago; threshold is 3m0s]

I have downgraded Docker and also restarted virtually everything, to no avail. The nodes are all managed via Puppet, so I expect them to be completely identical; I have no clue what is wrong. Docker logs in debug mode show that it is getting these requests.

@deitch (Contributor, Author) commented May 11, 2017

@bjhaid what are you using for networking? I was seeing some interesting networking issues at the time.

@bjhaid (Contributor) commented May 11, 2017

@deitch Weave, but I don't think this is a networking-related problem, since it seems to be a communication problem between the kubelet and Docker. I can confirm via Docker's debug logging that Docker is getting these requests from the kubelet.

@deitch (Contributor, Author) commented May 11, 2017

My PLEG issues appear to be gone, although I won't feel confident until the next time I set up these clusters afresh (all via Terraform modules I built).

Weave issues appear to exist, or possibly k8s/Docker ones.

@bjhaid (Contributor) commented May 11, 2017

@deitch did you do anything to make the PLEG issues go away, or did it just magically happen?

@bjhaid (Contributor) commented May 11, 2017

Actually, it was hostname resolution: the controllers could not resolve the hostnames of the newly created nodes. Sorry for the noise.

@bjhaid (Contributor) commented May 11, 2017

I was too quick to report things being fine; the problem still exists. I'll keep looking and report back if I find anything.

@gbergere commented May 19, 2017

I guess this issue is related to weave-kube. I had the same issue, and this time, in order to solve it without recreating the cluster, I had to remove Weave and re-apply it (with a reboot of the node in order to propagate the removal)... and it's back.

So I have no clue why or how, but I'm pretty sure it's due to weave-kube-1.6.

@bjhaid (Contributor) commented May 19, 2017

Forgot to return here: my problem was due to the Weave interface not coming up, so the containers didn't have networking. That in turn was due to our firewall blocking the Weave data and VXLAN ports; once I opened these ports, things were fine.

@deitch (Contributor, Author) commented May 19, 2017

There were two sets of issues I had, possibly related.

  1. PLEG. I believe they have gone away, but I have not recreated enough clusters to be completely confident. I do not believe I changed very much (i.e. anything) directly to make that happen.
  2. Weave issues wherein containers were unable to connect to anything.

Suspiciously, all of the issues with PLEG happened at exactly the same time as the Weave network issues.

Bryan at Weaveworks pointed me to the CoreOS issues. CoreOS has a rather aggressive tendency to try to manage bridges, veths, basically everything. Once I stopped CoreOS from doing that on everything except lo and the actual physical interfaces on the host, all of my problems went away.

Are the people who still have problems running CoreOS?

@hollowimage commented May 26, 2017

We've been plagued by these issues for the last month or so (I want to say since upgrading the clusters from 1.5.x to 1.6.x), and it's just as mysterious.

We're running Weave on Debian Jessie AMIs in AWS, and every once in a while a cluster will decide that PLEG is not healthy.

Weave seems okay in this case, because pods come up fine up until a point.
One thing we noted is that if we scale ALL our replicas down, the issue seems to go away, but as we start scaling deployments and statefulsets back up, around a certain number of containers this happens (at least this time).

docker ps; docker info seem fine on the node.
Resource utilization is nominal: 5% CPU, 1.5/8 GB of RAM used (according to htop as root); total node resource provisioning sits around 30% with everything that's supposed to be scheduled on it, scheduled.

We cannot get our heads around this at all.

I really wish the PLEG check was a little more verbose and we had some actual detailed documentation about what the beep it's doing, because there seem to be a HUGE number of issues open about it, no one really knows what it is, and for such a critical module I would love to be able to reproduce the checks that it sees as failing.

@deitch (Contributor, Author) commented May 26, 2017

I second the thoughts on PLEG mysteriousness. On my end though, after much work for my client, stabilizing CoreOS and its misbehaviour with networks helped a lot.

@yujuhong (Member) commented May 27, 2017

The PLEG health check does very little. In every iteration, it calls docker ps to detect container state changes, and calls docker ps and docker inspect to get the details of those containers.
After finishing each iteration, it updates a timestamp. If the timestamp hasn't been updated for a while (i.e., 3 minutes), the health check fails.

Unless your node is loaded with such a huge number of pods that PLEG can't finish all of this within 3 minutes (which should not happen), the most probable cause is that docker is slow. You may not observe that in your occasional docker ps checks, but that doesn't mean the slowness isn't there.
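
To make the mechanics concrete, here is a minimal sketch of the relist-then-stamp loop and the health check built on it (illustrative only, not the actual kubelet source; the listContainers callback stands in for the docker ps/inspect calls):

package main

import (
	"fmt"
	"sync"
	"time"
)

// pleg tracks when the last relist completed and the health threshold.
type pleg struct {
	mu         sync.Mutex
	lastRelist time.Time
	threshold  time.Duration
}

// relist is one iteration: query the container runtime, then refresh the
// timestamp. If the runtime call hangs or is slow, the timestamp goes stale.
func (p *pleg) relist(listContainers func() error) {
	if err := listContainers(); err != nil {
		return
	}
	p.mu.Lock()
	p.lastRelist = time.Now()
	p.mu.Unlock()
}

// healthy is what the node-status check consults; past the threshold the node
// reports "PLEG is not healthy: pleg was last seen active ... ago".
func (p *pleg) healthy() (bool, string) {
	p.mu.Lock()
	elapsed := time.Since(p.lastRelist)
	p.mu.Unlock()
	if elapsed > p.threshold {
		return false, fmt.Sprintf("pleg was last seen active %v ago; threshold is %v", elapsed, p.threshold)
	}
	return true, ""
}

func main() {
	p := &pleg{lastRelist: time.Now(), threshold: 3 * time.Minute}
	p.relist(func() error { return nil }) // a runtime that responds promptly
	ok, msg := p.healthy()
	fmt.Println(ok, msg)
}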

If we didn't expose the "unhealthy" status, it would hide many problems from users and potentially cause more issues. For example, the kubelet would silently fail to react to changes in a timely manner and cause even more confusion.

Suggestions on how to make this more debuggable are welcome...

@anurag commented May 27, 2017

Running into PLEG unhealthy warnings and flapping node health status: k8s 1.6.4 with Weave. It only appears on a subset of (otherwise identical) nodes.

@agabert commented Jun 1, 2017

Just a quick heads-up: in our case, the flapping workers and the pods stuck in ContainerCreating were caused by the security groups of our EC2 instances not allowing Weave traffic between the master and workers and among the workers. Therefore the nodes could not properly come up and got stuck in NotReady.

Kubernetes 1.6.4.

With the proper security groups it works now.

@wirehead commented Jun 1, 2017

I am experiencing something like this issue with this config...

Kubernetes version (use kubectl version): 1.6.4

Environment:
Cloud provider or hardware configuration: single System76 server
OS (e.g. from /etc/os-release): Ubuntu 16.04.2 LTS
Kernel (e.g. uname -a): Linux system76-server 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Install tools: kubeadm + weave.works

Since this is a single-node cluster, I don't think my version of this issue is related to security groups or firewalls.

@hollowimage commented Jun 2, 2017

The issue with security groups would make sense if you're just starting up the cluster. But these issues we're seeing are on clusters that have been running for months, with security groups in place.

@zoltrain commented Jun 2, 2017

I had something similar just happen to me running kubelet version 1.6.2 on GKE.

One of our nodes got shifted into a NotReady state. The kubelet logs on that node had two complaints: one, that the PLEG status check failed, and two, interestingly, that the image listing operations failed.

Some examples of the image function calls that failed:
image_gc_manager.go:176
kuberuntime_image.go:106
remote_image.go:61

These, I'm assuming, are calls to the docker daemon.

As this was happening I saw disk I/O spike a lot, especially the read operations: from the ~50 KB/s mark to the ~8 MB/s mark.

It corrected itself after about 30-45 minutes, but maybe it was an image GC sweep causing the increased I/O?

As has been said, PLEG monitors the pods through the docker daemon; if that's doing a lot of operations at once, could the PLEG checks be queued up?

@bergman commented Jun 26, 2017

I'm seeing this problem in 1.6.4 and 1.6.6 (on GKE) with flapping NotReady as the result. Since this is the latest version available on GKE I'd love to have any fixes backported to the next 1.6 release.

One interesting thing is that the time that PLEG was last seen active doesn't change and is always a huge number (perhaps it's at some limit of whatever type it's stored in).

[container runtime is down PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]
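
For what it's worth, 2562047h47m16.854775807s is exactly the maximum value of Go's time.Duration (an int64 count of nanoseconds), which supports the guess that the value has hit the limit of its type rather than being a real elapsed time:

package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// Prints 2562047h47m16.854775807s, the largest representable time.Duration.
	fmt.Println(time.Duration(math.MaxInt64))
}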

@k8s-ci-robot (Contributor) commented Apr 19, 2019

@calder: Reiterating the mentions to trigger a notification:
@kubernetes/sig-network-bugs

In response to this:

@kubernetes/sig-network-bugs @thockin @spiffxp: Friendly ping. This seems to have stalled again.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Kanshiroron commented Apr 29, 2019

Hello,
We also found this issue on one of our platforms. The only difference from our other clusters is that it had a single master node. And indeed, we recreated the cluster with 3 masters and we haven't noticed the issue so far (a few days now).
So I guess my question is: has anybody noticed this issue on a multi-master (>=3) cluster?

@mshade commented May 1, 2019

@Kanshiroron yes, we have a 3-master cluster and ran into this issue on one worker node yesterday. We drained the node and restarted it, and it came back cleanly. The platform is Docker EE with k8s v1.11.8 and Docker Enterprise 18.09.2-ee.

@standaloneSA commented May 8, 2019

We have a 3-master cluster (with a 3-node etcd cluster). There are 18 worker nodes, and each of the nodes is running between 50-100 docker containers on average (not pods, but total containers).

I am seeing a definite positive correlation between sizeable pod spinup events and needing to reboot a node because of PLEG errors. On occasion, some of our spin-ups cause the creation of >100 containers across the infrastructure, and when this happens, we almost always see resulting PLEG errors.

Do we have an understanding of what is causing this, on a node or a cluster level?

@thockin (Member) commented May 9, 2019

I am a little disconnected from this -- do we know what is going on? Is there a fix @bboreham (since you seemed to know what was up)? Is there a PR?

@bboreham (Contributor) commented May 9, 2019

I suspect this symptom can be caused in many ways, and we don't have much to go on for most of the "I have the same problem" comments here.

One of those ways was detailed at #45419 (comment), and a similar one at #45419 (comment): calls into CNI could hang forever, breaking the kubelet. Issue #65743 says we should add a timeout.

To address this we decided to plumb a Context through into libcni so that cancellation could be implemented by exec.CommandContext(). PR #71653 adds a timeout on the CRI side of that API.

(for clarity: no change to CNI plugins is involved; this is a change to the code that executes the plugin)
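
For illustration only (this is not the actual libcni change, and the plugin path and 2-minute timeout below are placeholders): the mechanism described above amounts to running the plugin binary under a cancellable context, so a hung plugin process is killed instead of blocking the kubelet indefinitely.

package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"time"
)

// execPlugin runs a CNI plugin binary under a context with a deadline.
func execPlugin(ctx context.Context, pluginPath string, stdinData []byte) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute) // illustrative timeout
	defer cancel()

	cmd := exec.CommandContext(ctx, pluginPath) // the process is killed if ctx expires
	cmd.Stdin = bytes.NewReader(stdinData)
	out, err := cmd.Output()
	if ctx.Err() == context.DeadlineExceeded {
		return nil, fmt.Errorf("CNI plugin %s timed out", pluginPath)
	}
	return out, err
}

func main() {
	// Hypothetical plugin path, purely for demonstration.
	if _, err := execPlugin(context.Background(), "/opt/cni/bin/example-plugin", nil); err != nil {
		fmt.Println("error:", err)
	}
}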

@standaloneSA commented May 11, 2019

Okay, got a chance to do some debugging on a PLEG swarm (what we've been calling this lately), and I found some correlations between PLEG errors reported by K8s and entries in the Docker.service log:

On two servers, I found this:

From the script that was watching for errors:

Sat May 11 03:27:19 PDT 2019 - SERVER-A
Found:   Ready            False     Sat, 11 May 2019 03:27:10 -0700   Sat, 11 May 2019 03:13:16 -0700   KubeletNotReady              PLEG is not healthy: pleg was last seen active 16m53.660513472s ago; threshold is 3m0s

The matching entries in SERVER-A's output from 'journalctl -u docker.service':

May 11 03:10:20 SERVER-A dockerd[1133]: time="2019-05-11T03:10:20.641064617-07:00" level=error msg="stream copy error: reading from a closed fifo"
May 11 03:10:20 SERVER-A dockerd[1133]: time="2019-05-11T03:10:20.641083454-07:00" level=error msg="stream copy error: reading from a closed fifo"
May 11 03:10:20 SERVER-A dockerd[1133]: time="2019-05-11T03:10:20.740845910-07:00" level=error msg="Error running exec a9fe257c0fca6ff3bb05a7582015406e2f7f6a7db534b76ef1b87d297fb3dcb9 in container: OCI runtime exec failed: exec failed: container_linux.go:344: starting container process caused \"process_linux.go:113: writing config to pipe caused \\\"write init-p: broken pipe\\\"\": unknown"
May 11 03:10:20 SERVER-A dockerd[1133]: time="2019-05-11T03:10:20.767528843-07:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
27 lines of this^^ repeated

Then, on a different server, from my script:


Sat May 11 03:38:25 PDT 2019 - SERVER-B
Found:   Ready            False     Sat, 11 May 2019 03:38:16 -0700   Sat, 11 May 2019 03:38:16 -0700   KubeletNotReady              PLEG is not healthy: pleg was last seen active 3m6.168050703s ago; threshold is 3m0s

and from the docker journal:

May 11 03:35:25 SERVER-B dockerd[1102]: time="2019-05-11T03:35:25.745124988-07:00" level=error msg="stream copy error: reading from a closed fifo"
May 11 03:35:25 SERVER-B dockerd[1102]: time="2019-05-11T03:35:25.745139806-07:00" level=error msg="stream copy error: reading from a closed fifo"
May 11 03:35:25 SERVER-B dockerd[1102]: time="2019-05-11T03:35:25.803182460-07:00" level=error msg="1a5dbb24b27cd516373473d34717edccc095e712238717ef051ce65022e10258 cleanup: failed to delete container from containerd: no such container"
May 11 03:35:25 SERVER-B dockerd[1102]: time="2019-05-11T03:35:25.803267414-07:00" level=error msg="Handler for POST /v1.38/containers/1a5dbb24b27cd516373473d34717edccc095e712238717ef051ce65022e10258/start returned error: OCI runtime create failed: container_linux.go:344: starting container process caused \"process_linux.go:297: getting the final child's pid from pipe caused \\\"EOF\\\"\": unknown"
May 11 03:35:25 SERVER-B dockerd[1102]: time="2019-05-11T03:35:25.876522066-07:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
May 11 03:35:25 SERVER-B dockerd[1102]: time="2019-05-11T03:35:25.964447832-07:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

Unfortunately, when validating this across my "healthy" nodes, I can find instances of these occurring together as well.
I will work to correlate this with other variables, but searching for these error messages leads to some interesting discussions:

Docker-ce 18.06.1-ce-rc2: can't run a container, "stream copy error: reading from a closed fifo"

Moby: Concurrently starting many containers from same checkpoint leads to error: "context deadline exceeded" #29369

Kubernetes: Increase maximum pods per node #23349

That last link has an especially interesting comment by @dElogics (referencing this ticket):

Just to add some valuable information, running many many pods per node results in #45419. As a fix, delete the docker directory and restart docker and kubelet together.

@thaonguyenct commented May 12, 2019

In my case, I'm using K8s v1.10.2 and docker-ce v18.03.1. I found some kubelet logs on a node flapping between Ready/NotReady, like:

E0512 09:17:56.721343 4065 pod_workers.go:186] Error syncing pod e5b8f48a-72c2-11e9-b8bf-005056871a33 ("uac-ddfb6d878-f6ph2_default(e5b8f48a-72c2-11e9-b8bf-005056871a33)"), skipping: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0512 09:17:17.154676 4065 kuberuntime_manager.go:859] PodSandboxStatus of sandbox "a34943dabe556924a2793f1be2f7181aede3621e2c61daef0838cf3fc61b7d1b" for pod "uac-ddfb6d878-f6ph2_default(e5b8f48a-72c2-11e9-b8bf-005056871a33)" error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

And I found that the pod uac-ddfb6d878-f6ph2_default was stuck terminating, so my workaround was to force delete the pod and remove all of that pod's containers on the node; the node worked fine after that:

$ kubectl delete pod uac-ddfb6d878-f6ph2 --force --grace-period=0

$ docker ps -a | grep uac-ddfb6d878-f6ph2_default

@soggiest (Contributor) commented May 29, 2019

Hello! We have started Bug Freeze for 1.15. Is a fix for this issue still planned for 1.15?

@rblaine95 commented May 29, 2019

Hi there

We were experiencing this same problem on our OKD cluster.
We investigated the nodes that were flapping and, after some digging, we found what we believed to be our problem.

When we investigated the node flapping, we found that the flapping nodes had absurdly high load averages; one of the nodes (16 cores, 32 threads, 96 GB memory) had a load average that peaked at 850.
Three of our nodes have Rook Ceph running on them.

We discovered that Prometheus was using block storage from Rook Ceph and was flooding the block devices with Read/Writes.
At the same time, ElasticSearch was also using block storage from Rook Ceph. We found that ElasticSearch processes would attempt to perform a disk I/O operation, while Prometheus was flooding the block devices, and would enter an uninterruptible state while waiting for the I/O operation to finish.
Then another ES process would attempt the same.
Then another.
And another.

We'd enter into a state where our node's entire CPU had reserved threads for ES processes that were in uninterruptible states waiting for the Ceph block device to free up from the Prometheus flooding.

Even though the CPU wasn't at 100% load, the threads were being reserved.

This caused all other processes to wait for CPU time, which caused Docker operations to fail, which caused PLEG to time out, and the node began flapping.

Our solution was to reboot the offending Prometheus pod.

OKD/K8s version:

$ oc version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://okd.example.net:8443
openshift v3.11.0+d0f1080-153
kubernetes v1.11.0+d4cacc0

Docker version on node:

$ docker version
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-88.git07f3374.el7.centos.x86_64
 Go version:      go1.9.4
 Git commit:      07f3374/1.13.1
 Built:           Fri Dec  7 16:13:51 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-88.git07f3374.el7.centos.x86_64
 Go version:      go1.9.4
 Git commit:      07f3374/1.13.1
 Built:           Fri Dec  7 16:13:51 2018
 OS/Arch:         linux/amd64
 Experimental:    false

Edit:
To summarize: I don't think this is a K8s/OKD problem, I think this is a "some resource on your node is being locked by something which is causing processes to pile up waiting for CPU time and that's breaking everything" problem.

@soggiest (Contributor) commented Jun 7, 2019

/milestone v1.16

@makoscafee (Member) commented Aug 23, 2019

@bboreham @soggiest Hello! I'm the bug triage shadow for the 1.16 release cycle, and since this issue is tagged for 1.16 but hasn't been updated in a long time, I'd like to check its status. Code freeze starts on August 29th (about 1.5 weeks from now), which means there should be a PR ready (and merged) by then.

Do we still target this issue to be fixed for 1.16?

@rikatz (Contributor) commented Aug 26, 2019

@makoscafee I can confirm this doesn't happen anymore in 1.13.6 (and later versions) with Docker 18.06.3-ce.

For us, this seems to be somehow related to a timeout while calling CNI or some external integration.

We've faced this recently but in a different scenario: when some of the NFS servers used in our cluster failed (and the whole I/O from our nodes froze), the kubelet started printing PLEG issues related to the inability to start new containers because of I/O timeouts.

So this might be a sign that, with CNI and CRI, this is PROBABLY solved, as we haven't seen it again in our cluster in relation to network issues.

@bboreham (Contributor) commented Aug 26, 2019

@makoscafee as I commented before, there are many people reporting many things in this issue; I shall answer here on the specific point of timing-out a CNI request.

From looking at the code, I do not think the kubelet has been updated to use the new behaviour in CNI where the context can be cancelled.

E.g. calling CNI here:

res, err := cniNet.AddNetworkList(context.TODO(), netConf, rt)

This PR would add a timeout: #71653, but it is still outstanding.
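
As a rough sketch (not the actual change in #71653; the value and wiring there may differ), bounding that call would mean replacing context.TODO() with a context that carries a deadline:

ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute) // illustrative value
defer cancel()
res, err := cniNet.AddNetworkList(ctx, netConf, rt)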

I can't speak for what would have changed in 1.13.6 to cause @rikatz's experience.

@rikatz (Contributor) commented Aug 26, 2019

Indeed, I've had a lot of Calico upgrades since then, so maybe something there (and not in Kubernetes code) has changed. Docker (which might also have been a problem at that time) was also upgraded numerous times, so there's no clear path to follow here.

I'm a bit ashamed that I didn't take notes when the issue happened (sorry about that), so I can't tell you exactly what changed between then and our situation today.

@dariuscernea commented Aug 27, 2019

Hi all,

Just wanted to share our experience with this error.
We encountered this error on a freshly deployed cluster running docker EE 19.03.1 and k8s v1.14.3.

For us it seems that the issue was triggered by the logging driver. Our Docker engine is set up to use the fluentd logging driver. After the fresh deployment of the cluster, fluentd was not yet deployed. If at this point we tried to schedule any pods on the workers, we encountered the same behavior you guys described above (PLEG errors from the kubelet on the worker nodes, and worker nodes randomly being reported unhealthy).
However, once we deployed fluentd and Docker was able to make the connection to it, all the issues went away. So it seems that not being able to communicate with fluentd was the root cause.

Hope this helps. Cheers

@nikopen (Member) commented Aug 28, 2019

This has been a long standing issue (k8s 1.6!) and has troubled quite a few people working with k8s.

Apart from overloaded nodes (maxed CPU, I/O, interrupts), PLEG issues are sometimes caused by nuanced issues between the kubelet, Docker, logging, networking, and so on, with remediation sometimes being brutal (restarting all nodes, etc., depending on the case).

As far as the original post goes, with #71653 finally being merged, the kubelet has been updated and is now able to time out a CNI request, cancelling the context before the deadline is exceeded.

Kubernetes 1.16 will contain the fix.
I will also open PRs to cherry-pick this back to 1.14 and 1.15, as they have a CNI version that contains the new timeout feature (>= 0.7.0). 1.13 has an older CNI version without the feature.

Thus, this can finally be closed.

/close

@k8s-ci-robot (Contributor) commented Aug 28, 2019

@nikopen: Closing this issue.

In response to this:

This has been a long standing issue (k8s 1.6!) and has troubled quite a few people working with k8s.

There are various things that cause PLEG issues and have generally been complicated between kubelet, docker, logging, networking and so on, with remediation of the issue sometimes being brutal (restarting all nodes etc, depending on the case).

As far as the original post goes, with #71653 finally being merged, kubelet is updated and now is able to timeout a CNI request, cancelling the context before the deadline is exceeded.

Kubernetes 1.16 will contain the fix.
I will also open PRs to cherry-pick this back to 1.14 and 1.15 as they do have a CNI version that contains the new timeout feature ( >= 0.7.0 ). 1.13 has an older CNI v without the feature.

Thus, this can be finally closed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Misteur-Z commented Aug 28, 2019

From personal experience since 1.6 in production, PLEG issues usually show up when a node is drowning.

  • CPU load is sky high
  • Disk I/O is maxed (logging?)
  • Global overload (CPU+disk+network) => CPU is interrupted all the time

Result => Docker daemon is not responsive

@Aisuko commented Sep 6, 2019

From personal experience since 1.6 in production, PLEG issues usually show up when a node is drowning.

  • CPU load is sky high
  • Disk I/O is maxed (logging?)
  • Global overload (CPU+disk+network) => CPU is interrupted all the time

Result => Docker daemon is not responsive

I agree with this. I'm using Kubernetes version 1.14.5 and have the same issue.

@zx1986 commented Sep 19, 2019

Same issue with v1.13.10 running with the Calico network.
