
Kubelet evicts all Pods when config changes to meet eviction threshold #83612

Open
alitoufighi opened this issue Oct 8, 2019 · 12 comments

@alitoufighi

commented Oct 8, 2019

What happened:
My worker node had a nodefs.inodesFree threshold of 5%, and I changed it to 15%. Since only about 10% of inodes were free, the threshold was met as soon as the kubelet restarted. I expected it to evict Pods one by one until enough inodes were reclaimed, but instead it evicted all Pods on that node and started cleaning everything up. As a result, my free inodes went up to 90% (see the df output below), far more than I needed, and it cost me one of my worker nodes for about 30 minutes.

$ df -ih /
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sda2         13M  1.3M   12M   10% /

What you expected to happen:
Pod resources should be cleaned up one by one until the node meets the healthy threshold.

How to reproduce it (as minimally and precisely as possible):
While the nodefs.inodesFree threshold is not yet met, change it to a value greater than the current free-inode percentage, then restart the kubelet service. This triggered the following events:

Events:
  Type     Reason                   Age                  From               Message
  ----     ------                   ----                 ----               -------
  Normal   Starting                 36m                  kubelet, worker-1  Starting kubelet.
  Normal   NodeHasSufficientMemory  36m                  kubelet, worker-1  Node worker-1 status is now: NodeHasSufficientMemory
  Normal   NodeHasSufficientPID     36m                  kubelet, worker-1  Node worker-1 status is now: NodeHasSufficientPID
  Normal   NodeNotReady             36m                  kubelet, worker-1  Node worker-1 status is now: NodeNotReady
  Normal   NodeAllocatableEnforced  35m                  kubelet, worker-1  Updated Node Allocatable limit across pods
  Normal   NodeReady                35m                  kubelet, worker-1  Node worker-1 status is now: NodeReady
  Warning  EvictionThresholdMet     35m                  kubelet, worker-1  Attempting to reclaim inodes
  Normal   NodeHasDiskPressure      35m                  kubelet, worker-1  Node worker-1 status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure    4m29s (x2 over 36m)  kubelet, worker-1  Node worker-1 status is now: NodeHasNoDiskPressure

Environment:

  • Kubernetes version (use kubectl version): I experienced this on two machines, with client versions 1.15.3 and 1.16.1. The server version is 1.16.1.
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04.3 LTS
  • Kernel (e.g. uname -a): 4.15.0-64-generic

/sig node

@alitoufighi alitoufighi added the kind/bug label Oct 8, 2019
@k8s-ci-robot

Contributor

commented Oct 8, 2019

@alitoufighi: The label(s) sig/@node cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other


@zouyee

Member

commented Oct 8, 2019

/assign

@zouyee

Member

commented Oct 9, 2019

Please paste the log if it contains the following message:

eviction manager: pods ranked for eviction
@alitoufighi

Author

commented Oct 9, 2019

@zouyee There were no eviction manager journal logs in that period.
After the regular initialization logs, there are these:

Oct 08 13:56:22 worker-1 kubelet[25214]: I1008 13:56:22.822440   25214 kubelet_node_status.go:72] Attempting to register node worker-1
Oct 08 13:56:22 worker-1 kubelet[25214]: I1008 13:56:22.873677   25214 kubelet_node_status.go:114] Node worker-1 was previously registered
Oct 08 13:56:22 worker-1 kubelet[25214]: I1008 13:56:22.877155   25214 kubelet_node_status.go:75] Successfully registered node worker-1
Oct 08 13:56:23 worker-1 kubelet[25214]: I1008 13:56:23.128059   25214 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-10-08 13:56:23.127977829 +0330 +0330 m=+22.921647551 LastTransitionTime:2019-10-08 13:56:23.127977829 +0330 +0330 m=+22.921647551 Reason:KubeletNotReady Message:container runtime status check may not have completed yet}
Oct 08 13:56:23 worker-1 kubelet[25214]: I1008 13:56:23.191070   25214 kubelet.go:1822] skipping pod synchronization - container runtime status check may not have completed yet
Oct 08 13:56:23 worker-1 kubelet[25214]: I1008 13:56:23.992373   25214 kubelet.go:1822] skipping pod synchronization - container runtime status check may not have completed yet
Oct 08 13:56:25 worker-1 kubelet[25214]: I1008 13:56:25.592988   25214 kubelet.go:1822] skipping pod synchronization - container runtime status check may not have completed yet
Oct 08 13:56:27 worker-1 kubelet[25214]: I1008 13:56:27.769456   25214 cpu_manager.go:155] [cpumanager] starting with none policy
Oct 08 13:56:27 worker-1 kubelet[25214]: I1008 13:56:27.769496   25214 cpu_manager.go:156] [cpumanager] reconciling every 10s
Oct 08 13:56:27 worker-1 kubelet[25214]: I1008 13:56:27.769513   25214 policy_none.go:42] [cpumanager] none policy: Start
Oct 08 13:56:27 worker-1 kubelet[25214]: I1008 13:56:27.794759   25214 plugin_manager.go:116] Starting Kubelet Plugin Manager
Oct 08 13:56:28 worker-1 kubelet[25214]: I1008 13:56:28.949288   25214 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "varlibdockercontainers" (UniqueName: "kubernetes.io/host-path/31ca3aca-b40c-4be8-8397-e1f87f3b19e3-varlibdockercontainers") pod "fluent-bit-tmh2s" (UID: "31ca3aca-b40c-4be8-8397-e1f87f3b19e3")
Oct 08 13:56:28 worker-1 kubelet[25214]: I1008 13:56:28.949534   25214 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-proxy" (UniqueName: "kubernetes.io/configmap/f7a213d2-4a06-4980-8c83-71f813475641-kube-proxy") pod "kube-proxy-6qdwr" (UID: "f7a213d2-4a06-4980-8c83-71f813475641")
... Similar log for all other volumes on this node
Oct 08 13:56:29 worker-1 kubelet[25214]: I1008 13:56:29.679267   25214 reconciler.go:150] Reconciler: start to sync state
Oct 08 14:25:12 worker-1 kubelet[25214]: W1008 14:25:12.564625   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/libcontainer_14454_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/libcontainer_14454_systemd_test_default.slice: no such file or directory
Oct 08 14:25:12 worker-1 kubelet[25214]: W1008 14:25:12.564767   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/blkio/libcontainer_14454_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/libcontainer_14454_systemd_test_default.slice: no such file or directory
Oct 08 14:25:12 worker-1 kubelet[25214]: W1008 14:25:12.564814   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/memory/libcontainer_14454_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/libcontainer_14454_systemd_test_default.slice: no such file or directory
Oct 08 14:25:12 worker-1 kubelet[25214]: W1008 14:25:12.564844   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/devices/libcontainer_14454_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/libcontainer_14454_systemd_test_default.slice: no such file or directory
Oct 08 14:25:12 worker-1 kubelet[25214]: W1008 14:25:12.572573   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/memory/libcontainer_14454_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/libcontainer_14454_systemd_test_default.slice: no such file or directory
Oct 08 14:25:12 worker-1 kubelet[25214]: W1008 14:25:12.573466   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/devices/libcontainer_14454_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/libcontainer_14454_systemd_test_default.slice: no such file or directory
Oct 08 14:25:12 worker-1 kubelet[25214]: W1008 14:25:12.731236   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/devices/libcontainer_14470_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/libcontainer_14470_systemd_test_default.slice: no such file or directory
Oct 08 14:25:13 worker-1 kubelet[25214]: W1008 14:25:13.680919   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/devices/libcontainer_14549_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/libcontainer_14549_systemd_test_default.slice: no such file or directory
Oct 08 14:25:13 worker-1 kubelet[25214]: W1008 14:25:13.774640   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/libcontainer_14555_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/libcontainer_14555_systemd_test_default.slice: no such file or directory
Oct 08 14:25:13 worker-1 kubelet[25214]: W1008 14:25:13.774728   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/blkio/libcontainer_14555_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/libcontainer_14555_systemd_test_default.slice: no such file or directory
Oct 08 14:25:13 worker-1 kubelet[25214]: W1008 14:25:13.774777   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/memory/libcontainer_14555_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/libcontainer_14555_systemd_test_default.slice: no such file or directory
Oct 08 14:25:13 worker-1 kubelet[25214]: W1008 14:25:13.774809   25214 watcher.go:87] Error while processing event ("/sys/fs/cgroup/devices/libcontainer_14555_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/libcontainer_14555_systemd_test_default.slice: no such file or directory
Oct 08 14:25:15 worker-1 kubelet[25214]: I1008 14:25:15.461340   25214 reconciler.go:177] operationExecutor.UnmountVolume started for volume "default-token-wnrfq" (UniqueName: "kubernetes.io/secret/567d28ea-f19f-4bbd-bdf2-48288f25f015-default-token-wnrfq") pod "567d28ea-f19f-4bbd-bdf2-48288f25f015" (UID: "567d28ea-f19f-4bbd-bdf2-48288f25f015")
Oct 08 14:25:15 worker-1 kubelet[25214]: I1008 14:25:15.500167   25214 operation_generator.go:860] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/567d28ea-f19f-4bbd-bdf2-48288f25f015-default-token-wnrfq" (OuterVolumeSpecName: "default-token-wnrfq") pod "567d28ea-f19f-4bbd-bdf2-48288f25f015" (UID: "567d28ea-f19f-4bbd-bdf2-48288f25f015"). InnerVolumeSpecName "default-token-wnrfq". PluginName "kubernetes.io/secret", VolumeGidValue ""
Oct 08 14:25:15 worker-1 kubelet[25214]: I1008 14:25:15.562210   25214 reconciler.go:297] Volume detached for volume "default-token-wnrfq" (UniqueName: "kubernetes.io/secret/567d28ea-f19f-4bbd-bdf2-48288f25f015-default-token-wnrfq") on node "worker-1" DevicePath ""

After the Reconciler: start to sync state log, it took 30 minutes before anything else was logged. During that time, the node was under DiskPressure.

@zouyee

Member

commented Oct 9, 2019

According to the implementation, this log should be output when an eviction happens:

// start the eviction manager monitoring
	go func() {
		for {
			if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
				klog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else {
				time.Sleep(monitoringInterval)
			}
		}
	}()

BTW, if this is a production environment, it is recommended to persist the related logs to disk.

@alitoufighi

Author

commented Oct 9, 2019

Ah, I actually made another change at that time that I didn't think was the cause of the garbage collection, but it seems it was.
I had also changed imageGCHighThresholdPercent, and the same scenario happened for it. It was 85% at first and I changed it to 80%. Apparently the disk usage at that time was 82% (oops), so it triggered garbage collection.
But the rest of the story is the same: all Pods and containers were cleared instead of stopping once imageGCLowThresholdPercent (which was 70%) was reached.
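
As I understand it, the image GC is only supposed to free enough space to get back under the low threshold. Here is a minimal sketch of that arithmetic with made-up numbers matching my case; this reflects my understanding, not the actual imageGCManager code:

package main

import "fmt"

// My understanding of the imageGCHighThresholdPercent / imageGCLowThresholdPercent
// arithmetic, with made-up numbers; not the actual imageGCManager code.
func bytesToFree(capacity, used int64, highPct, lowPct int64) int64 {
	usagePct := used * 100 / capacity
	if usagePct < highPct {
		return 0 // below the high threshold: no GC needed
	}
	// free only enough to bring usage back down to the low threshold
	target := capacity * lowPct / 100
	return used - target
}

func main() {
	var capacity int64 = 100 * 1024 * 1024 * 1024 // hypothetical 100 GiB disk
	used := capacity * 82 / 100                   // ~82% used, as in my case
	fmt.Printf("need to free %d bytes\n", bytesToFree(capacity, used, 80, 70))
}

With 82% used, a high threshold of 80% and a low threshold of 70%, that should only be about 12% of the disk, which is why removing everything surprised me.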

I also found some more related logs from my other node:

Oct 08 11:45:29 K8-How kubelet[27068]: W1008 11:45:29.808797   27068 eviction_manager.go:333] eviction manager: attempting to reclaim inodes
Oct 08 11:45:29 K8-How kubelet[27068]: I1008 11:45:29.808931   27068 container_gc.go:85] attempting to delete unused containers
Oct 08 11:45:30 K8-How kubelet[27068]: I1008 11:45:30.077926   27068 image_gc_manager.go:317] attempting to delete unused images

After that, I see lots of logs like these:

Oct 08 11:45:30 K8-How kubelet[27068]: I1008 11:45:30.884797   27068 image_gc_manager.go:371] [imageGCManager]: Removing image "sha256:7c52d46cf32d1f159d2d97374f54ad1a0a93db703ae4e99cc79e30c93b10780b" to free 3168959442 bytes
Oct 08 11:45:31 K8-How kubelet[27068]: W1008 11:45:31.455798   27068 eviction_manager.go:160] Failed to admit pod kube-flannel-ds-amd64-qgsnt_kube-system(ddc8c75a-e4f8-438b-91e1-ac731598e6b4) - node has conditions: [DiskPressure]
Oct 08 11:45:34 K8-How kubelet[27068]: I1008 11:45:34.320290   27068 reconciler.go:177] operationExecutor.UnmountVolume started for volume "default-token-mdkkw" (UniqueName: "kubernetes.io/secret/9f4b5c6c-ab81-4127-a9aa-98af6fbd1487-default-token-mdkkw") pod "9f4b5c6c-ab81-4127-a9aa-98af6fbd1487" (UID: "9f4b5c6c-ab81-4127-a9aa-98af6fbd1487")
Oct 08 11:45:34 K8-How kubelet[27068]: I1008 11:45:34.320417   27068 operation_generator.go:860] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/88cdee78-be75-4cc8-bd79-3342cb108cc1-default-token-mdkkw" (OuterVolumeSpecName: "default-token-mdkkw") pod "88cdee78-be75-4cc8-bd79-3342cb108cc1" (UID: "88cdee78-be75-4cc8-bd79-3342cb108cc1"). InnerVolumeSpecName "default-token-mdkkw". PluginName "kubernetes.io/secret", VolumeGidValue ""
Oct 08 11:45:34 K8-How kubelet[27068]: I1008 11:45:34.321367   27068 reconciler.go:297] Volume detached for volume "default-token-mdkkw" (UniqueName: "kubernetes.io/secret/ac61f8d2-76d9-46d1-99c0-5a07faf0784c-default-token-mdkkw") on node "k8-how" DevicePath ""

It seems that the cause of the garbage collection was the imageGCManager, not the eviction manager.

Another interesting thing: although I saw all of my Pods get evicted, nothing about that shows up in the logs. The kubelet simply considered all of the containers unused and started removing them.

Finally, this is the last related log line I see:

Oct 08 12:16:42 K8-How kubelet[27068]: I1008 12:16:42.282859   27068 eviction_manager.go:340] eviction manager: able to reduce inodes pressure without evicting pods.

P.S. Thanks for your recommendation. I'll consider saving the logs to disk separately from the journal.

@zouyee

Member

commented Oct 9, 2019

The kubelet simply considered all of the containers unused and started removing them.

The processing logic is as follows:

synchronize
  -> if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim)
       -> nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}

It is a bit strange that all containers were treated as unused and deleted.
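
Note that the node-level reclaim step runs each reclaim function to completion and only re-checks the signal afterwards. Here is a simplified paraphrase of that flow, with illustrative names; it is not the actual eviction-manager source:

package main

import "fmt"

// Illustrative stand-ins for the real kubelet types; a paraphrase of the flow
// sketched above, not the actual eviction-manager source.
type nodeReclaimFunc func() error

// reclaimNodeLevelResources runs every reclaim function to completion and only
// afterwards re-checks whether the threshold is still met.
func reclaimNodeLevelResources(funcs []nodeReclaimFunc, thresholdStillMet func() bool) bool {
	for _, reclaim := range funcs {
		if err := reclaim(); err != nil {
			fmt.Println("reclaim error (ignored, next function still runs):", err)
		}
	}
	return !thresholdStillMet()
}

func main() {
	deleteAllUnusedContainers := func() error { fmt.Println("deleting ALL unused containers"); return nil }
	deleteUnusedImages := func() error { fmt.Println("deleting ALL unused images"); return nil }

	relieved := reclaimNodeLevelResources(
		[]nodeReclaimFunc{deleteAllUnusedContainers, deleteUnusedImages},
		func() bool { return false }, // pretend the signal cleared afterwards
	)
	fmt.Println("able to reduce pressure without evicting pods:", relieved)
}

That would presumably also explain the final "able to reduce inodes pressure without evicting pods" line above: once everything considered unused was removed, the threshold check passed without the eviction manager itself evicting any pods.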

@alitoufighi

Author

commented Oct 13, 2019

It was strange to me too!

My logs show that DeleteUnusedImages was triggered for all of my Pods' container images (Pods that were Running and suddenly got Evicted). I don't actually know how these signals flow or how they are raised in this scenario, but these lines were executed on my worker:

// Remove image. Continue despite errors.
klog.Infof("[imageGCManager]: Removing image %q to free %d bytes", image.id, image.size)
err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})

Don't you think what I experienced was caused by restarting the kubelet service? Maybe the kubelet had problems getting the status of the running containers at startup and evicted them?

@alitoufighi

Author

commented Oct 15, 2019

This happened again today. The left side is my worker node, on which I changed the nodefs.inodesFree value in the evictionHard section from 5% to 15%. It evicted all of the Pods after the kubelet restarted.

[screenshot: list of evicted Pods on the worker node]

In the logs, I see this for all of my Pods:

eviction_manager.go:159] Failed to admit pod <POD NAME> - node has conditions: [DiskPressure]

Then it starts detaching volumes and removing images (as above).

@tedyu

Contributor

commented Oct 15, 2019

		if result := podAdmitHandler.Admit(attrs); !result.Admit {
			return false, result.Reason, result.Message

I think this is because the admission decision is made for all Pods while there is pressure.
It seems we could introduce staged eviction, where a subset of the Pods is evicted and resource pressure is rechecked.
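
Roughly something like the following; a sketch of the idea with made-up types, not a proposed patch:

package main

import "fmt"

// Made-up types just to illustrate the staged-eviction idea.
type pod struct{ name string }

// evictStaged evicts the lowest-ranked pods one at a time and re-checks the
// pressure signal after each eviction, instead of acting on all pods at once.
func evictStaged(ranked []pod, evict func(pod), underPressure func() bool) int {
	evicted := 0
	for _, p := range ranked {
		if !underPressure() {
			break // stop as soon as the threshold is no longer met
		}
		evict(p)
		evicted++
	}
	return evicted
}

func main() {
	pods := []pod{{"low-priority-a"}, {"low-priority-b"}, {"important-c"}}
	checks := 0
	n := evictStaged(
		pods,
		func(p pod) { fmt.Println("evicting", p.name) },
		func() bool { checks++; return checks <= 2 }, // pretend pressure clears after two evictions
	)
	fmt.Println("evicted", n, "pods")
}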

@zouyee

Member

commented Oct 17, 2019

if result := podAdmitHandler.Admit(attrs); !result.Admit {
			return false, result.Reason, result.Message

When the kubelet is restarted, the admit operation is triggered and pods will be evicted.

synchronize
  -> if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim)
       -> nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}

and when it is running normally, the garbage-collection operation is triggered. Is there any problem with the two processing paths above?

// After restarting, kubelet will get all existing pods through
// ADD as if they are new pods. These pods will then go through the
// admission process and *may* be rejected. This can be resolved
// once we have checkpointing.
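
In other words, after a restart every existing pod comes back through ADD and has to pass admission again; while the node reports DiskPressure, each non-critical pod is rejected. A simplified illustration of that path, with hypothetical names and types (not the kubelet source):

package main

import "fmt"

// Hypothetical, simplified stand-ins to illustrate the restart/admission path.
type admitResult struct {
	Admit   bool
	Message string
}

// diskPressureAdmit mimics the eviction-manager admit handler: while the node
// has DiskPressure, non-critical pods are not admitted.
func diskPressureAdmit(nodeHasDiskPressure, critical bool) admitResult {
	if nodeHasDiskPressure && !critical {
		return admitResult{Admit: false, Message: "node has conditions: [DiskPressure]"}
	}
	return admitResult{Admit: true}
}

func main() {
	// After a kubelet restart, every existing pod is re-added and re-admitted,
	// so all of them hit the same rejection (pod names are placeholders).
	for _, name := range []string{"pod-a", "pod-b", "pod-c"} {
		if r := diskPressureAdmit(true, false); !r.Admit {
			fmt.Printf("Failed to admit pod %s - %s\n", name, r.Message)
		}
	}
}
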
@alitoufighi

Author

commented Oct 19, 2019

@zouyee
As @tedyu said, do you think a staged scan of the existing Pods when there is pressure on the node would solve this?
Can the kubelet find out how much disk space, how many inodes, etc. each Pod uses, so that it could evict some low-priority Pods until the node meets the threshold?
