
Some pods are falsely evicted from the stopped node #9703

Closed
alita1991 opened this issue Mar 8, 2024 · 1 comment
@alita1991

Environmental Info:
K3s Version: v1.27.5+k3s1

Node(s) CPU architecture, OS, and Version: Linux ip-10-190-34-107 6.5.0-1014-aws #14~22.04.1-Ubuntu SMP Thu Feb 15 15:27:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 servers

Describe the bug:

When a node is stopped using the Hypervisor interface, the otel-collector pod persists in the running state indefinitely. To initiate rescheduling, I must delete the pod. This action transitions the pod to the Terminating state, allowing it to be rescheduled eventually.

Steps To Reproduce:

  • Installed K3s
  • Installed several services that create Kubernetes resources as Deployments
  • Powered off the node from the hypervisor
  • Waited for the Deployment-managed pods to be evicted from the stopped node
  • Verified the status of the Deployment-managed pods to confirm they were all terminated on the stopped node
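
The verification steps above can be sketched with a few kubectl commands (a minimal sketch; the namespace and pod name are taken from the logs below, and the output comments describe what this issue reports, not guaranteed results):

```shell
# Confirm the powered-off node is detected as down
kubectl get nodes                      # stopped node eventually shows NotReady

# Check whether pods on the dead node were actually rescheduled
kubectl get pods -n k3s-loki -o wide   # pod still shows Running on the dead node

# Look for the eviction attempt recorded by the taint manager
kubectl get events -n k3s-loki | grep TaintManagerEviction
```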

Expected behavior:
The pod is expected to transition to the Terminating state, while a new pod should be scheduled on a healthy node.

Actual behavior:
The pod remains in the running state on a stopped node.

Additional context / logs:
From kubectl get events:
14m Normal TaintManagerEviction pod/central-metrics-collector-5f5b6c599f-8gwpn Marking for deletion Pod k3s-loki/central-metrics-collector-5f5b6c599f-8gwpn, but is not happening

From kubectl get pods:
k3s-loki central-metrics-collector-5f5b6c599f-8gwpn 1/1 Running

From kubectl get deployments:
k3s-loki central-metrics-collector 0/1

@brandond (Contributor) commented Mar 8, 2024

Kubernetes cannot reason about pods on nodes that do not have a running kubelet. You may have deleted the pod, but Kubernetes does not actually know whether it has terminated, because no kubelet is running to report status. The pod may still be running on a node suffering a network outage; it may be running while the kubelet is stopped; or it may not be running at all. Kubernetes has no way of knowing.

There will be no updates to the pod status until the node either comes back online, or the node is deleted and the pod is force-deleted as an orphan.
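
The manual recovery path described above can be sketched as follows (a hedged sketch: the node name comes from the environment info in this issue and the pod name from its logs; only do this once you are certain the node is really gone):

```shell
# 1. Delete the dead node object; its pods become orphans
kubectl delete node ip-10-190-34-107

# 2. Force-delete the stuck pod without waiting for kubelet confirmation
kubectl delete pod central-metrics-collector-5f5b6c599f-8gwpn \
  -n k3s-loki --force --grace-period=0
```

Note that --force --grace-period=0 removes the pod object from the API server immediately; if the node was merely partitioned rather than dead, the container may still be running there.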

This is not a K3s issue; this is just how Kubernetes works. There are some discussions about tuning the apiserver and controller-manager to reduce internal node monitor intervals, which may help evict pods from offline nodes faster, at #1264 (comment)
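
As a rough illustration of that kind of tuning, K3s can pass flags through to the embedded apiserver and controller-manager; a minimal sketch, with illustrative values rather than recommendations (these control how quickly a node is marked NotReady and how long pods tolerate an unreachable node before eviction is attempted):

```shell
k3s server \
  --kube-controller-manager-arg=node-monitor-period=5s \
  --kube-controller-manager-arg=node-monitor-grace-period=20s \
  --kube-apiserver-arg=default-not-ready-toleration-seconds=30 \
  --kube-apiserver-arg=default-unreachable-toleration-seconds=30
```

Even with shorter intervals, a pod on a dead node can still hang in Terminating, for the reason given above: no kubelet is alive to confirm the termination.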

@brandond brandond closed this as completed Mar 8, 2024