
Pods are not moved when Node in NotReady state #55713

Closed
marczahn opened this issue Nov 14, 2017 · 91 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@marczahn

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:

To simulate a crashed worker node I stopped the kubelet service on that node (Debian Jessie).
The node went into Unknown state (i.e. NotReady) as expected:

NAME                STATUS     ROLES         AGE       VERSION
lls-lon-db-master   Ready      master,node   7d        v1.8.0+coreos.0
lls-lon-testing01   NotReady   node          6d        v1.8.0+coreos.0

The pods running on lls-lon-testing01 stay declared as Running:

test-core-services   infrastructure-service-deployment-5cb868f49-94gh4                 1/1       Running   0          4h        10.233.96.204   lls-lon-testing01

But the pod is declared as Ready: False on describe:

Name:           infrastructure-service-deployment-5cb868f49-94gh4
Namespace:      test-core-services
Node:           lls-lon-testing01/10.100.0.5
Start Time:     Tue, 14 Nov 2017 10:31:41 +0000
Labels:         app=infrastructure-service
                pod-template-hash=176424905
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"test-core-services","name":"infrastructure-service-deployment-5cb868f49","uid":"d...
Status:         Running
IP:             10.233.96.204
Created By:     ReplicaSet/infrastructure-service-deployment-5cb868f49
Controlled By:  ReplicaSet/infrastructure-service-deployment-5cb868f49
Containers:
  infrastructure-service:
    Container ID:  docker://3b750d7cad0c24386cade1e4fedac24ab2621f4991d3302d15c30d9e68749b7b
    Image:         index.docker.io/looplinesystems/infrastructure-service:latest
    Image ID:      docker-pullable://looplinesystems/infrastructure-service@sha256:632591a86ca67f3e19718727e717b07da3b5c79251ce9deede969588b6958272
    Ports:         7110/TCP, 7111/TCP, 7112/TCP
    Command:
      /infrastructurectl
      daemon
    State:          Running
      Started:      Tue, 14 Nov 2017 10:31:48 +0000
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      infrastructure-service-config  Secret  Optional: false
    Environment:                     <none>
    Mounts:
      /var/log/services from logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-hm2hs (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  logs:
    Type:  HostPath (bare host directory volume)
    Path:  /var/log/services
  default-token-hm2hs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-hm2hs
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:          <none>

What you expected to happen:
I expected the pods on the "crashed" node to be moved to the remaining node.

How to reproduce it (as minimally and precisely as possible):
In my situation: have a node (A) and a master+node (B) installed with Kubespray, run at least one pod on each node, stop the kubelet on A and wait.
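A minimal sketch of these reproduction steps (node name and grep pattern are placeholders, not from the original report):

# On node A: simulate a crashed worker by stopping the kubelet
sudo systemctl stop kubelet

# On the master (B): watch the node go NotReady and check whether the
# pods that were scheduled on A ever get rescheduled elsewhere
kubectl get nodes -w
kubectl get pods --all-namespaces -o wide | grep <node-A-name>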

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0+coreos.0", GitCommit:"a65654ef5b593ac19fbfaf33b1a1873c0320353b", GitTreeState:"clean", BuildDate:"2017-09-29T21:51:03Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0+coreos.0", GitCommit:"a65654ef5b593ac19fbfaf33b1a1873c0320353b", GitTreeState:"clean", BuildDate:"2017-09-29T21:51:03Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
Intel(R) Xeon(R) CPU E5-2670
4 Cores
4 GB RAM
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 8 (jessie)
  • Kernel (e.g. uname -a): Linux lls-lon-db-master 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
  • Install tools: Kubespray
  • Others: -
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 14, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 14, 2017
@marczahn
Author

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Nov 14, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 14, 2017
@jhorwit2
Contributor

jhorwit2 commented Dec 11, 2017

@marczahn how long did you wait after turning off the kubelet? By default pods won't be moved for 5 minutes, which is configurable via the following flag on the controller manager.

--pod-eviction-timeout duration                                     The grace period for deleting pods on failed nodes. (default 5m0s)

This allows cases like a node reboot to avoid rescheduling pods unnecessarily.
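For reference, a sketch of checking and changing that flag, assuming a kubeadm-style static pod manifest (the path may differ on other installs, e.g. Kubespray):

# See whether a custom eviction timeout is configured
grep pod-eviction-timeout /etc/kubernetes/manifests/kube-controller-manager.yaml

# The flag as it would appear on the kube-controller-manager command line:
#   --pod-eviction-timeout=5m0s
# The kubelet restarts the static pod automatically after the manifest is edited.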

@marczahn
Author

I know this parameter and I was waiting way longer than the eviction timeout. Nothing happened at all.

@jamesgetx

We encountered the same problem. Our k8s version is 1.8.4, Docker version is 1.12.4.

@marczahn
Author

I wrote a script that can be run as a cron job:

#!/bin/sh

KUBECTL="/usr/local/bin/kubectl"

# NotReady nodes that have not been drained yet (no SchedulingDisabled)
NOT_READY_NODES=$($KUBECTL get nodes | grep -P 'NotReady(?!,SchedulingDisabled)' | awk '{print $1}' | xargs echo)
# Ready nodes that are still cordoned/drained (SchedulingDisabled)
READY_NODES=$($KUBECTL get nodes | grep '\sReady,SchedulingDisabled' | awk '{print $1}' | xargs echo)

echo "Unready nodes that are not drained yet: $NOT_READY_NODES"
echo "Ready nodes that are still drained: $READY_NODES"


for node in $NOT_READY_NODES; do
  echo "Node $node not drained yet, draining..."
  $KUBECTL drain --ignore-daemonsets --force $node
  echo "Done"
done;

for node in $READY_NODES; do
  echo "Node $node still drained, uncordoning..."
  $KUBECTL uncordon $node
  echo "Done"
done;

It checks whether a node is down but not drained yet, and vice versa; a cron entry for it is sketched below. Hope it helps.
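A sketch of such a cron entry, assuming the script is saved as /usr/local/bin/drain-notready.sh (path, schedule, and log file are placeholders):

# /etc/cron.d/drain-notready: run the check every minute as root
* * * * * root /usr/local/bin/drain-notready.sh >> /var/log/drain-notready.log 2>&1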

@erkules

erkules commented Mar 6, 2018

Got the same issue on 1.9.3. No eviction after 30 minutes.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

@albertvaka

+1. Is this the intended behavior? If it is, then load balancers should keep serving traffic to those pods (now they do not).

@mypine

mypine commented May 17, 2018

We encountered the same problem on 1.6.3.

@trajakovic

trajakovic commented May 30, 2018

Got the same problem as @erkules:

kubectl version                                                                                                                                       
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:12Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

I tried to drain the master node (in a setup of 3 nodes), but the command keeps hanging.

If it helps, the cluster is installed on AWS using kops 1.9.x, with 3 masters in separate AZs (m4.large instances).

kubectl logs -n kube-system -f kube-controller-manager-ip-<redacted>


I0529 21:59:35.448967       1 flags.go:52] FLAG: --address="0.0.0.0"
I0529 21:59:35.449022       1 flags.go:52] FLAG: --allocate-node-cidrs="true"
I0529 21:59:35.449096       1 flags.go:52] FLAG: --allow-untagged-cloud="false"
I0529 21:59:35.449106       1 flags.go:52] FLAG: --allow-verification-with-non-compliant-keys="false"
I0529 21:59:35.449116       1 flags.go:52] FLAG: --alsologtostderr="false"
I0529 21:59:35.449122       1 flags.go:52] FLAG: --attach-detach-reconcile-sync-period="1m0s"
I0529 21:59:35.449157       1 flags.go:52] FLAG: --cidr-allocator-type="RangeAllocator"
I0529 21:59:35.449191       1 flags.go:52] FLAG: --cloud-config=""
I0529 21:59:35.449216       1 flags.go:52] FLAG: --cloud-provider="aws"
I0529 21:59:35.449229       1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,35.191.0.0/16,209.85.152.0/22,209.85.204.0/22"
I0529 21:59:35.449276       1 flags.go:52] FLAG: --cluster-cidr="100.96.0.0/11"
I0529 21:59:35.449283       1 flags.go:52] FLAG: --cluster-name="staging.kubernetes.sa.dev.superbet.k8s.local"
I0529 21:59:35.449290       1 flags.go:52] FLAG: --cluster-signing-cert-file="/srv/kubernetes/ca.crt"
I0529 21:59:35.449296       1 flags.go:52] FLAG: --cluster-signing-key-file="/srv/kubernetes/ca.key"
I0529 21:59:35.449302       1 flags.go:52] FLAG: --concurrent-deployment-syncs="5"
I0529 21:59:35.449313       1 flags.go:52] FLAG: --concurrent-endpoint-syncs="5"
I0529 21:59:35.449319       1 flags.go:52] FLAG: --concurrent-gc-syncs="20"
I0529 21:59:35.449325       1 flags.go:52] FLAG: --concurrent-namespace-syncs="10"
I0529 21:59:35.449331       1 flags.go:52] FLAG: --concurrent-rc-syncs="5"
I0529 21:59:35.449337       1 flags.go:52] FLAG: --concurrent-replicaset-syncs="5"
I0529 21:59:35.449343       1 flags.go:52] FLAG: --concurrent-resource-quota-syncs="5"
I0529 21:59:35.449349       1 flags.go:52] FLAG: --concurrent-service-syncs="1"
I0529 21:59:35.449355       1 flags.go:52] FLAG: --concurrent-serviceaccount-token-syncs="5"
I0529 21:59:35.449361       1 flags.go:52] FLAG: --configure-cloud-routes="true"
I0529 21:59:35.449367       1 flags.go:52] FLAG: --contention-profiling="false"
I0529 21:59:35.449373       1 flags.go:52] FLAG: --controller-start-interval="0s"
I0529 21:59:35.449379       1 flags.go:52] FLAG: --controllers="[*]"
I0529 21:59:35.449449       1 flags.go:52] FLAG: --deleting-pods-burst="0"
I0529 21:59:35.449456       1 flags.go:52] FLAG: --deleting-pods-qps="0.1"
I0529 21:59:35.449466       1 flags.go:52] FLAG: --deployment-controller-sync-period="30s"
I0529 21:59:35.449473       1 flags.go:52] FLAG: --disable-attach-detach-reconcile-sync="false"
I0529 21:59:35.449479       1 flags.go:52] FLAG: --enable-dynamic-provisioning="true"
I0529 21:59:35.449485       1 flags.go:52] FLAG: --enable-garbage-collector="true"
I0529 21:59:35.449491       1 flags.go:52] FLAG: --enable-hostpath-provisioner="false"
I0529 21:59:35.449529       1 flags.go:52] FLAG: --enable-taint-manager="true"
I0529 21:59:35.449536       1 flags.go:52] FLAG: --experimental-cluster-signing-duration="8760h0m0s"
I0529 21:59:35.449568       1 flags.go:52] FLAG: --feature-gates=""
I0529 21:59:35.449598       1 flags.go:52] FLAG: --flex-volume-plugin-dir="/usr/libexec/kubernetes/kubelet-plugins/volume/exec/"
I0529 21:59:35.449606       1 flags.go:52] FLAG: --horizontal-pod-autoscaler-downscale-delay="5m0s"
I0529 21:59:35.449612       1 flags.go:52] FLAG: --horizontal-pod-autoscaler-sync-period="30s"
I0529 21:59:35.449618       1 flags.go:52] FLAG: --horizontal-pod-autoscaler-tolerance="0.1"
I0529 21:59:35.449628       1 flags.go:52] FLAG: --horizontal-pod-autoscaler-upscale-delay="3m0s"
I0529 21:59:35.449634       1 flags.go:52] FLAG: --horizontal-pod-autoscaler-use-rest-clients="true"
I0529 21:59:35.449640       1 flags.go:52] FLAG: --insecure-experimental-approve-all-kubelet-csrs-for-group=""
I0529 21:59:35.449646       1 flags.go:52] FLAG: --kube-api-burst="30"
I0529 21:59:35.449652       1 flags.go:52] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0529 21:59:35.449660       1 flags.go:52] FLAG: --kube-api-qps="20"
I0529 21:59:35.449667       1 flags.go:52] FLAG: --kubeconfig="/var/lib/kube-controller-manager/kubeconfig"
I0529 21:59:35.449674       1 flags.go:52] FLAG: --large-cluster-size-threshold="50"
I0529 21:59:35.449680       1 flags.go:52] FLAG: --leader-elect="true"
I0529 21:59:35.449686       1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0529 21:59:35.449692       1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0529 21:59:35.449698       1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I0529 21:59:35.449705       1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0529 21:59:35.449711       1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0529 21:59:35.449721       1 flags.go:52] FLAG: --log-dir=""
I0529 21:59:35.449728       1 flags.go:52] FLAG: --log-flush-frequency="5s"
I0529 21:59:35.449735       1 flags.go:52] FLAG: --loglevel="1"
I0529 21:59:35.449741       1 flags.go:52] FLAG: --logtostderr="true"
I0529 21:59:35.449747       1 flags.go:52] FLAG: --master=""
I0529 21:59:35.449753       1 flags.go:52] FLAG: --min-resync-period="12h0m0s"
I0529 21:59:35.449759       1 flags.go:52] FLAG: --namespace-sync-period="5m0s"
I0529 21:59:35.449766       1 flags.go:52] FLAG: --node-cidr-mask-size="24"
I0529 21:59:35.449772       1 flags.go:52] FLAG: --node-eviction-rate="0.1"
I0529 21:59:35.449779       1 flags.go:52] FLAG: --node-monitor-grace-period="40s"
I0529 21:59:35.449786       1 flags.go:52] FLAG: --node-monitor-period="5s"
I0529 21:59:35.449792       1 flags.go:52] FLAG: --node-startup-grace-period="1m0s"
I0529 21:59:35.449826       1 flags.go:52] FLAG: --node-sync-period="0s"
I0529 21:59:35.449834       1 flags.go:52] FLAG: --pod-eviction-timeout="5m0s"
I0529 21:59:35.449867       1 flags.go:52] FLAG: --port="10252"
I0529 21:59:35.449877       1 flags.go:52] FLAG: --profiling="true"
I0529 21:59:35.449884       1 flags.go:52] FLAG: --pv-recycler-increment-timeout-nfs="30"
I0529 21:59:35.449917       1 flags.go:52] FLAG: --pv-recycler-minimum-timeout-hostpath="60"
I0529 21:59:35.449924       1 flags.go:52] FLAG: --pv-recycler-minimum-timeout-nfs="300"
I0529 21:59:35.449955       1 flags.go:52] FLAG: --pv-recycler-pod-template-filepath-hostpath=""
I0529 21:59:35.449962       1 flags.go:52] FLAG: --pv-recycler-pod-template-filepath-nfs=""
I0529 21:59:35.449985       1 flags.go:52] FLAG: --pv-recycler-timeout-increment-hostpath="30"
I0529 21:59:35.449992       1 flags.go:52] FLAG: --pvclaimbinder-sync-period="15s"
I0529 21:59:35.449998       1 flags.go:52] FLAG: --register-retry-count="10"
I0529 21:59:35.450004       1 flags.go:52] FLAG: --resource-quota-sync-period="5m0s"
I0529 21:59:35.450011       1 flags.go:52] FLAG: --root-ca-file="/srv/kubernetes/ca.crt"
I0529 21:59:35.450018       1 flags.go:52] FLAG: --route-reconciliation-period="10s"
I0529 21:59:35.450024       1 flags.go:52] FLAG: --secondary-node-eviction-rate="0.01"
I0529 21:59:35.450031       1 flags.go:52] FLAG: --service-account-private-key-file="/srv/kubernetes/server.key"
I0529 21:59:35.450039       1 flags.go:52] FLAG: --service-cluster-ip-range=""
I0529 21:59:35.450045       1 flags.go:52] FLAG: --service-sync-period="5m0s"
I0529 21:59:35.450051       1 flags.go:52] FLAG: --stderrthreshold="2"
I0529 21:59:35.450057       1 flags.go:52] FLAG: --terminated-pod-gc-threshold="12500"
I0529 21:59:35.450064       1 flags.go:52] FLAG: --unhealthy-zone-threshold="0.55"
I0529 21:59:35.450071       1 flags.go:52] FLAG: --use-service-account-credentials="true"
I0529 21:59:35.450077       1 flags.go:52] FLAG: --v="2"
I0529 21:59:35.450084       1 flags.go:52] FLAG: --version="false"
I0529 21:59:35.450095       1 flags.go:52] FLAG: --vmodule=""
I0529 21:59:35.450113       1 controllermanager.go:108] Version: v1.9.3

@rchicoli

I am not quite sure if this is related to this issue. Let me know if I should create a new one.
After restarting the cluster, the Kubernetes API is reporting the wrong Pod status.
As you can see, all nodes are offline (kubelet and Docker are not running), so I expected at least an Unknown pod status. Somehow it shows Running, even a few minutes later.

root@kube-controller-1:~# kubectl get all
NAME                            READY     STATUS    RESTARTS   AGE
pod/webapper-856ff74c66-59b2t   1/1       Running   0          9h
pod/webapper-856ff74c66-qhlmb   1/1       Running   0          9h

NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
service/kubernetes   ClusterIP   10.32.0.1     <none>        443/TCP    2d
service/webapper     ClusterIP   10.32.0.100   <none>        8080/TCP   6h

NAME                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/webapper   2         2         2            0           9h

NAME                                  DESIRED   CURRENT   READY     AGE
replicaset.apps/webapper-856ff74c66   2         2         0         9h

root@kube-controller-1:~# kubectl get nodes
NAME            STATUS     ROLES     AGE       VERSION
kube-worker-1   NotReady   <none>    2d        v1.11.0
kube-worker-2   NotReady   <none>    2d        v1.11.0

root@kube-controller-1:~# kubectl exec -ti webapper-856ff74c66-qhlmb sh
Error from server: error dialing backend: dial tcp 10.0.0.17:10250: connect: connection refused

I am using the latest Kubernetes version:

root@kube-controller-1:~# kube-apiserver --version
Kubernetes v1.11.0
root@kube-controller-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}

@adampl

adampl commented Jul 24, 2018

I have just run into this issue (v1.10.1). I suspect it has something to do with volumes not being detached/unmounted.

@zeelichsheng

zeelichsheng commented Aug 13, 2018

I encountered a similar issue. I was experimenting with the Kubernetes autoscaler. When I manually stop a node VM, the node itself goes into NotReady state, and after a while the pod scheduled on the removed node goes into Unknown state.

At this point, Kubernetes behaves correctly by creating a new pod, and autoscaler creates a new node to schedule the new pod.

However, the removed pod gets stuck in Unknown state. The original node cannot be removed by autoscaler from Kubernetes because autoscaler still thinks there is load (i.e. the stuck pod) on the node.

NAME                       READY   STATUS    RESTARTS   AGE
busybox-6b76d7d9c8-7xb48   0/1     Pending   0          2m
busybox-6b76d7d9c8-xlgpz   1/1     Unknown   0          11m

NAME                                          STATUS     ROLES    AGE   VERSION
master-5f517752-9b64-11e8-8caa-0612df8b7178   Ready      master   4d    v1.10.2
worker-a20f2c3e-9f19-11e8-a52a-0612df8b7178   NotReady   worker   10m   v1.10.2
worker-f2f997c8-9f1a-11e8-a52a-0612df8b7178   Ready      worker   56s   v1.10.2

This is part of the output of describing the NotReady worker node, which shows that Kubernetes still thinks the stuck pod is scheduled on this node:

Non-terminated Pods:  (2 in total)
  Namespace   Name                       CPU Requests   CPU Limits   Memory Requests   Memory Limits
  default     busybox-6b76d7d9c8-xlgpz   1 (50%)        1 (50%)      0 (0%)            0 (0%)

If we have to manually and forcefully remove the pod using "kubectl delete pod --force --grace-period=0", it means the autoscaler is affected and cannot correctly manage cluster resources without user intervention.
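For reference, a sketch of that manual cleanup, using the pod name from the output above (namespace assumed to be default):

# Force-remove the API object for a pod stuck in Unknown on a dead node.
# Nothing is stopped on the node itself, since it is unreachable.
kubectl delete pod busybox-6b76d7d9c8-xlgpz -n default --force --grace-period=0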

@adampl

adampl commented Aug 13, 2018

@zeelichsheng check kubernetes/enhancements#551 and #58635.

@huyqut

huyqut commented Sep 21, 2018

Hi, I have the same problem. No pods are evicted when a node is "NotReady", even after the --pod-eviction-timeout set on kube-controller-manager has elapsed. Are there any workarounds?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2018
@hzxuzhonghu
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2018
@jitendra1987

jitendra1987 commented Jan 29, 2019

I am also seeing this issue.
Cluster: 1 master, 1 worker
kubernetes=1.12.5, docker-ce=18.09.1
Steps:

  1. Launched a few test pods scheduled to the worker.
  2. Then shut down the worker node.
  3. Observed that Kubernetes marked the worker as NotReady, but pod eviction did not start even after 5 minutes.

@jtackaberry

jtackaberry commented Feb 2, 2019

Surely this is one of the first failure modes everyone tests? It's the first worker-related failure I tested while evaluating Kubernetes. I even gracefully shut down the worker node and let all kube processes exit cleanly. IMO it very much violates the Principle of Least Astonishment that pods assigned to NotReady nodes remain in the Running state.

(1.13.3 with a single node test cluster.)

@adampl

adampl commented Feb 5, 2019

@jtackaberry It's not so simple. The cluster nodes need an external monitor or hypervisor to reliably determine whether the NotReady node is actually shut down, in order to take into account a possible split-brain scenario. In other words, you cannot assume that pods are not running just because the node is not responding. See: kubernetes/enhancements#719

@huyqut

huyqut commented Feb 11, 2019

@jitendra1987 can you test a cluster with 1 master and 2 workers? I also tested with 1 master and 1 worker and pod eviction didn't happen. However, when there are 3 machines in the cluster, it happens normally.

@ironreality

With a 1 master + 2 nodes configuration the problem isn't reproducible.
With 1 master + 1 node it occurs.
Tested with a kubeadm-installed cluster.
The testing platform: VirtualBox + Ubuntu 16.04 + K8s 1.13.3 + Docker 18.09.2.

@Craftoncu

Any fix yet? Annoying issue

@inboxamitraj

Any fix yet? Annoying issue

Fixed in 1.21.0: pods moved to the healthy worker01 node from worker02, which went down.
vagrant@master01:~$ k get pods -o wide
NAME                          READY   STATUS        RESTARTS   AGE     IP          NODE       NOMINATED NODE   READINESS GATES
test-deploy-686f764bd-4vsqh   1/1     Running       0          5m53s   10.32.0.3   worker01
test-deploy-686f764bd-5nvz6   1/1     Running       0          5m53s   10.32.0.4   worker01
test-deploy-686f764bd-c76v8   1/1     Terminating   1          21h     10.32.0.2   worker02
test-deploy-686f764bd-wqk6j   1/1     Terminating   1          21h     10.32.0.3   worker02

@adampl

adampl commented May 14, 2021

Apparently there is still some issue: #101674

@victor-sudakov

I seem to have this issue with v1.21.0 on Debian 10/amd64, a test cluster of 1 master and 3 worker nodes:

I create a pod without a nodeSelector, find the node it is running on, and shut down or power off its host, emulating hardware failure or maintenance. I expect my pod to be recreated on another healthy node, but this never happens. I have to reapply the pod definition to make it run again, and then it says "pod XXX created".

Expected behavior: the pod should have been moved/recreated on some healthy node when the timeout expires.

@hconnan

hconnan commented May 19, 2021

Hi guys, I got the same issue on a Kubernetes cluster v1.19.7, i.e. some nodes did not get the NoExecute taint as expected.
However, this has been fixed recently in:

Hope it's helpful!

@adampl

adampl commented May 21, 2021

@elChipardo Did you actually read the preceding comment of @victor-sudakov? He mentions v1.21.0.

@antoinetran

I confirm the fix and validated with Kubernetes 1.20.6 (contained in Rke 1.2.8 / Rancher 2.5.8).

@victor-sudakov

I confirm the fix and validated with Kubernetes 1.20.6 (contained in Rke 1.2.8 / Rancher 2.5.8).

What do you mean by "confirm the fix"? I've just checked, on Kubernetes v1.21.1/Debian10, when a Node is powered off or dies, its Pods are in Terminating status forever, and never get moved elsewhere. When the Node is back alive, its Pods are gone for good and have to be redeployed again.

@antoinetran

antoinetran commented Jun 9, 2021

I confirm the fix and validated with Kubernetes 1.20.6 (contained in Rke 1.2.8 / Rancher 2.5.8).

What do you mean by "confirm the fix"? I've just checked, on Kubernetes v1.21.1/Debian10, when a Node is powered off or dies, its Pods are in Terminating status forever, and never get moved elsewhere. When the Node is back alive, its Pods are gone for good and have to be redeployed again.

In Kubernetes 1.20.4: the shutdown of a node results in the node being NotReady, but the pods hosted by the node run like nothing happened. However, doing logs or exec does not work (which is normal).
In Kubernetes 1.20.6: the shutdown of a node results, after the eviction timeout, in the pods being in Terminating status and being rescheduled on other nodes. The never-ending Terminating seems normal to me.

However, we noticed that pods from StatefulSets are not moved to another node and stay in Terminating, while pods from Deployments and Jobs are also in Terminating but are rescheduled elsewhere. Maybe that is your case.

@victor-sudakov

with pods being rescheduled in other nodes

I have never seen this happen unless the pods are part of a deployment. If you have created a pod ("kind: Pod", not "kind: Deployment") it never gets rescheduled. Maybe it's by design?

@adampl

adampl commented Jun 10, 2021

@victor-sudakov Yes, a Pod is by definition bound to a certain Node. Rescheduling is nothing more than deleting the old Pod and creating a new one, which is usually controlled by a ReplicaSet, which is usually owned by a Deployment.
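A short illustration of that difference, with placeholder names and image (not from this thread):

# A bare Pod is bound to its node; if the node dies, nothing recreates it.
kubectl run standalone-pod --image=nginx --restart=Never

# A Deployment owns a ReplicaSet, which creates replacement Pods on healthy
# nodes once the originals are evicted or deleted.
kubectl create deployment web-demo --image=nginx --replicas=2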

@ricosega

I have the same situation in version 1.20.0: when shutting down a node, it stays tainted with node.kubernetes.io/unreachable:NoSchedule forever and all the pods stay in status Running, just like here:

Apparently there is still some issue: #101674

But if I taint the node myself with node.kubernetes.io/unreachable:NoExecute (a sketch of this manual taint appears at the end of this comment), then after the eviction time I've noticed the same things @antoinetran said:

  • The pods from Deployments are rescheduled to another node, but the old one is kept in Terminating status forever.
  • The pods from StatefulSets always remain in Terminating status and are never rescheduled.

This seems to be the same issue as well: #98851

Can anyone confirm in which version this is solved?
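For reference, a sketch of the manual taint described above (node name is a placeholder):

# Manually add the NoExecute taint that the node lifecycle controller would
# normally apply to an unreachable node; pods without a matching toleration
# (or whose tolerationSeconds have expired) are then evicted.
kubectl taint nodes <node-name> node.kubernetes.io/unreachable:NoExecute

# Remove the taint again once the node is healthy:
kubectl taint nodes <node-name> node.kubernetes.io/unreachable:NoExecute-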

@haircommander
Contributor

It sounds like this has been fixed in all supported versions of k8s: #55713 (comment)

as such, I'm closing this
/close

@k8s-ci-robot
Contributor

@haircommander: Closing this issue.

In response to this:

It sounds like this has been fixed in all supported versions of k8s: #55713 (comment)

as such, I'm closing this
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@amitkatyal

I am facing a similar issue with DaemonSets. The DaemonSet pods remain in the Running state when a node is in a NotReady state. Even though the node is NotReady, since the pod is still in the Running state, the headless service exposing the DaemonSet as endpoints returns the IP address of the DaemonSet pod on the NotReady node.
Because the headless service returns the IP address of a DaemonSet pod that is not actually running, this causes the problem.

I understand that the DaemonSet pod remaining in the Running state is expected behavior, as the DaemonSet controller is not able to reach the API server, but is there an option to ensure that the headless service doesn't return the IP address of the pod on the down node?
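A diagnostic sketch for this situation, assuming a headless Service named my-daemon-svc (a placeholder); the Endpoints object should drop pods whose Ready condition is False unless publishNotReadyAddresses is set:

# Which pod IPs is the headless Service currently exposing?
kubectl get endpoints my-daemon-svc -o wide

# Does the Service intentionally publish not-ready addresses? If this prints
# "true", pods that are NotReady stay in the endpoints/DNS records.
kubectl get service my-daemon-svc -o jsonpath='{.spec.publishNotReadyAddresses}'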

@zhangguanzhang

zhangguanzhang commented Sep 18, 2021

You may need to set this for kube-apiserver:

   --enable-admission-plugins=....,DefaultTolerationSeconds \
  --default-not-ready-toleration-seconds=60 \
  --default-unreachable-toleration-seconds=60 \

The default for kube-controller-manager is --node-monitor-grace-period=50s, so if you set --default-unreachable-toleration-seconds=60, a pod will become Terminating 50s+60s after a node shutdown.
The state does not affect how the Service directs traffic, and the pod should go through to the end of its lifecycle, so the better configuration is:

   --enable-admission-plugins=....,DefaultTolerationSeconds \
  --default-not-ready-toleration-seconds=300 \
  --default-unreachable-toleration-seconds=10 \
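To see these defaults take effect, one can inspect the tolerations that the DefaultTolerationSeconds admission plugin injects into newly created pods (pod name is a placeholder):

# New pods get node.kubernetes.io/not-ready and node.kubernetes.io/unreachable
# tolerations with the configured tolerationSeconds (300s by default).
kubectl get pod <pod-name> -o jsonpath='{.spec.tolerations}'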

@knkarthik

I'm still having this issue on EKS v1.20.7-eks-d88609. The behaviour is the same as observed by @ricosega and others.

Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-eks-d88609", GitCommit:"d886092805d5cc3a47ed5cf0c43de38ce442dfcb", GitTreeState:"clean", BuildDate:"2021-07-31T00:29:12Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

@zhangguanzhang

I'm still having this issue on EKS v1.20.7-eks-d88609. The behaviour is the same as observed by @ricosega and others.

Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-eks-d88609", GitCommit:"d886092805d5cc3a47ed5cf0c43de38ce442dfcb", GitTreeState:"clean", BuildDate:"2021-07-31T00:29:12Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

Did you solve this problem?

@riupie

riupie commented Oct 13, 2021

You may need to set this for kube-apiserver:

   --enable-admission-plugins=....,DefaultTolerationSeconds \
  --default-not-ready-toleration-seconds=60 \
  --default-unreachable-toleration-seconds=60 \

The default for kube-controller-manager is --node-monitor-grace-period=50s, so if you set --default-unreachable-toleration-seconds=60, a pod will become Terminating 50s+60s after a node shutdown. The state does not affect how the Service directs traffic, and the pod should go through to the end of its lifecycle, so the better configuration is:

   --enable-admission-plugins=....,DefaultTolerationSeconds \
  --default-not-ready-toleration-seconds=300 \
  --default-unreachable-toleration-seconds=10 \

I think this is the best workaround for now. I use k8s 1.20.7 and pod-eviction-timeout still does not work.

@m0sh1x2

m0sh1x2 commented Oct 13, 2021

You may need to set this for kube-apiserver:

   --enable-admission-plugins=....,DefaultTolerationSeconds \
  --default-not-ready-toleration-seconds=60 \
  --default-unreachable-toleration-seconds=60 \

The default for kube-controller-manager is --node-monitor-grace-period=50s, so if you set --default-unreachable-toleration-seconds=60, a pod will become Terminating 50s+60s after a node shutdown. The state does not affect how the Service directs traffic, and the pod should go through to the end of its lifecycle, so the better configuration is:

   --enable-admission-plugins=....,DefaultTolerationSeconds \
  --default-not-ready-toleration-seconds=300 \
  --default-unreachable-toleration-seconds=10 \

I think this is the best workaround for now. I use k8s 1.20.7 and pod-eviction-timeout still does not work.

This doesn't seem to apply to StatefulSets that have PVCs on a terminated/powered-off node (using rook-ceph with RBD). The cron job script that @marczahn posted should work, as I have tested it manually #55713 (comment), but shouldn't this functionality be covered by the scheduler?
