
ignored pod-eviction-timeout settings #74651

Closed
danielloczi opened this issue Feb 27, 2019 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@danielloczi

danielloczi commented Feb 27, 2019

What happened: I modified the pod-eviction-timeout setting of kube-controller-manager on the master node (in order to decrease the amount of time before k8s re-creates a pod in case of node failure). The default value is 5 minutes; I configured 30 seconds. Using the sudo docker ps --no-trunc | grep "kube-controller-manager" command I checked that the modification was successfully applied:

kubeadmin@nodetest21:~$ sudo docker ps --no-trunc | grep "kube-controller-manager"
387261c61ee9cebce50de2540e90b89e2bc710b4126a0c066ef41f0a1fb7cf38   sha256:0482f640093306a4de7073fde478cf3ca877b6fcc2c4957624dddb2d304daef5                         "kube-controller-manager --address=127.0.0.1 --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf --client-ca-file=/etc/kubernetes/pki/ca.crt --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt --cluster-signing-key-file=/etc/kubernetes/pki/ca.key --controllers=*,bootstrapsigner,tokencleaner --kubeconfig=/etc/kubernetes/controller-manager.conf --leader-elect=true --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --root-ca-file=/etc/kubernetes/pki/ca.crt --service-account-private-key-file=/etc/kubernetes/pki/sa.key --use-service-account-credentials=true --pod-eviction-timeout=30s" 

I applied a basic deployment with two replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - image: busybox
        command:
        - sleep
        - "3600"
        imagePullPolicy: IfNotPresent
        name: busybox
      restartPolicy: Always

The first pod was created on the first worker node, the second pod on the second worker node:

NAME         STATUS   ROLES    AGE   VERSION
nodetest21   Ready    master   34m   v1.13.3
nodetest22   Ready    <none>   31m   v1.13.3
nodetest23   Ready    <none>   30m   v1.13.3

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE   IP          NODE         NOMINATED NODE   READINESS GATES
default       busybox-74b487c57b-5s6g7             1/1     Running   0          13s   10.44.0.2   nodetest22   <none>           <none>
default       busybox-74b487c57b-6zdvv             1/1     Running   0          13s   10.36.0.1   nodetest23   <none>           <none>
kube-system   coredns-86c58d9df4-gmcjd             1/1     Running   0          34m   10.32.0.2   nodetest21   <none>           <none>
kube-system   coredns-86c58d9df4-wpffr             1/1     Running   0          34m   10.32.0.3   nodetest21   <none>           <none>
kube-system   etcd-nodetest21                      1/1     Running   0          33m   10.0.1.4    nodetest21   <none>           <none>
kube-system   kube-apiserver-nodetest21            1/1     Running   0          33m   10.0.1.4    nodetest21   <none>           <none>
kube-system   kube-controller-manager-nodetest21   1/1     Running   0          20m   10.0.1.4    nodetest21   <none>           <none>
kube-system   kube-proxy-6mcn8                     1/1     Running   1          31m   10.0.1.5    nodetest22   <none>           <none>
kube-system   kube-proxy-dhdqj                     1/1     Running   0          30m   10.0.1.6    nodetest23   <none>           <none>
kube-system   kube-proxy-vqjg8                     1/1     Running   0          34m   10.0.1.4    nodetest21   <none>           <none>
kube-system   kube-scheduler-nodetest21            1/1     Running   1          33m   10.0.1.4    nodetest21   <none>           <none>
kube-system   weave-net-9qls7                      2/2     Running   3          31m   10.0.1.5    nodetest22   <none>           <none>
kube-system   weave-net-h2cb6                      2/2     Running   0          33m   10.0.1.4    nodetest21   <none>           <none>
kube-system   weave-net-vkb62                      2/2     Running   0          30m   10.0.1.6    nodetest23   <none>           <none>

To test pod eviction I shut down the first worker node. After ~1 minute the status of the first worker node changed to "NotReady", but then
I had to wait another ~5 minutes (the default pod eviction timeout) for the pod on the powered-off node to be re-created on the other node.

What you expected to happen:
After the node status reports "NotReady", the pod should be re-created on the other node after 30 seconds, not the default 5 minutes!

How to reproduce it (as minimally and precisely as possible):
Create three nodes. Initialize Kubernetes on the first node (sudo kubeadm init), apply the network plugin (kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"), then join the other two nodes (e.g.: kubeadm join 10.0.1.4:6443 --token xdx9y1.z7jc0j7c8g8lpjog --discovery-token-ca-cert-hash sha256:04ae8388f607755c14eed702a23fd47802d5512e092b08add57040a2ae0736ac).
Add the pod-eviction-timeout parameter to the kube-controller-manager manifest on the master node: sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=127.0.0.1
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --use-service-account-credentials=true
    - --pod-eviction-timeout=30s

(the yaml is truncated; only the relevant first part is shown here).

Check that the setting is applied:
sudo docker ps --no-trunc | grep "kube-controller-manager"

Apply a deployment with two replicas and check that one pod is created on the first worker node and the other on the second worker node.
Shut down one of the nodes and measure the time elapsed between the node reporting "NotReady" and the pod being re-created.

Anything else we need to know?:
I experience the same issue in a multi-master environment as well.

Environment:

  • Kubernetes version (use kubectl version): v1.13.3
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:08:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Azure VM
  • OS (e.g: cat /etc/os-release): NAME="Ubuntu" VERSION="16.04.5 LTS (Xenial Xerus)"
  • Kernel (e.g. uname -a): Linux nodetest21 4.15.0-1037-azure #39~16.04.1-Ubuntu SMP Tue Jan 15 17:20:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others: Docker v18.06.1-ce
@danielloczi danielloczi added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 27, 2019
@danielloczi
Author

danielloczi commented Feb 27, 2019

@kubernetes/sig-node-bugs
@kubernetes/sig-apps-bugs

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 27, 2019
@k8s-ci-robot
Contributor

k8s-ci-robot commented Feb 27, 2019

@danielloczi: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs, @kubernetes/sig-apps-bugs

In response to this:

@kubernetes/sig-node-bugs
@kubernetes/sig-apps-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ChiefAlexander
Member

ChiefAlexander commented Feb 27, 2019

I also ran into this issue while testing with a lower eviction timeout. After poking around at this for some time I figured out that the cause is the new TaintBasedEvictions feature.

In version 1.13, the TaintBasedEvictions feature is promoted to beta and enabled by default, hence the taints are automatically added by the NodeController (or kubelet) and the normal logic for evicting pods from nodes based on the Ready NodeCondition is disabled.

Setting this feature flag to false causes pods to be evicted as expected. I have not taken the time to search through the taint-based eviction code, but I would guess that the eviction-timeout flag is not used within it.
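For anyone who wants to restore the old behavior, a minimal sketch of what that would look like in the kubeadm static-pod manifest (assuming v1.13, where the gate is beta and on by default; flag placement is illustrative):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-controller-manager
    # ... other flags unchanged ...
    - --pod-eviction-timeout=30s
    # Turn off taint-based evictions so the legacy NodeCondition-based
    # eviction logic (which honors --pod-eviction-timeout) is used again.
    - --feature-gates=TaintBasedEvictions=false
```

Since this is a static pod, the kubelet picks up the edited manifest and restarts kube-controller-manager with the new flags.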

@ChiefAlexander
Member

ChiefAlexander commented Feb 27, 2019

Looking into this more: with TaintBasedEvictions set to true, you can set your pod's eviction time within its spec, under tolerations:
https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions
The default values are set by an admission controller: https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/defaulttolerationseconds/admission.go#L34
Those two defaults can be set via kube-apiserver flags and should achieve the same effect.
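Concretely, that means adding the two toleration-seconds flags (named later in this thread) to the kube-apiserver manifest. A sketch, with illustrative 30-second values replacing the 300-second defaults:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    # ... other flags unchanged ...
    # Cluster-wide defaults applied by the DefaultTolerationSeconds
    # admission plugin to pods that set no such tolerations themselves:
    - --default-not-ready-toleration-seconds=30
    - --default-unreachable-toleration-seconds=30
```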

@liucimin
Contributor

liucimin commented Feb 28, 2019

// Controller will not proactively sync node health, but will monitor node
// health signal updated from kubelet. There are 2 kinds of node healthiness
// signals: NodeStatus and NodeLease. NodeLease signal is generated only when
// NodeLease feature is enabled. If it doesn't receive update for this amount
// of time, it will start posting "NodeReady==ConditionUnknown". The amount of
// time before which Controller start evicting pods is controlled via flag
// 'pod-eviction-timeout'.
// Note: be cautious when changing the constant, it must work with
// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
// controller. The node health signal update frequency is the minimal of the
// two.
// There are several constraints:
// 1. nodeMonitorGracePeriod must be N times more than  the node health signal
//    update frequency, where N means number of retries allowed for kubelet to
//    post node status/lease. It is pointless to make nodeMonitorGracePeriod
//    be less than the node health signal update frequency, since there will
//    only be fresh values from Kubelet at an interval of node health signal
//    update frequency. The constant must be less than podEvictionTimeout.
// 2. nodeMonitorGracePeriod can't be too large for user experience - larger
//    value takes longer for user to see up-to-date node health.

@danielloczi
Author

danielloczi commented Feb 28, 2019

Thanks for your feedback ChiefAlexander!
That is exactly the situation you described. I checked the pods, and indeed the default toleration values are assigned to them:

kubectl describe pod busybox-74b487c57b-95b6n | grep -i toleration -A 2
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

So I simply added my own values to the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      containers:
      - image: busybox
        command:
        - sleep
        - "3600"
        imagePullPolicy: IfNotPresent
        name: busybox
      restartPolicy: Always

After applying this deployment, in case of node failure the node status changes to "NotReady" and the pods are re-created after 2 seconds.

So we don't have to deal with pod-eviction-timeout anymore; the timeout can be set on a per-pod basis. Cool!

Thanks again for your help!

@nick0323

nick0323 commented Jun 10, 2019

@danielloczi Hi danielloczi, how did you fix this issue? I am hitting it as well.

@zdyxry

zdyxry commented Jun 25, 2019

@323929 I think @danielloczi doesn't care about the pod-eviction-timeout parameter in kube-controller-manager but solves it by using taint-based evictions. I tested taint-based evictions, and it worked for me.

@danielloczi
Author

danielloczi commented Jun 26, 2019

That is right: I simply started to use Taint based Eviction.

@kamilgregorczyk

kamilgregorczyk commented Jan 2, 2020

Is it possible to make it global? I don't want to set that in each pod config, especially since I use a lot of pre-packaged charts from Helm.

@morgwai

morgwai commented Feb 8, 2020

+1 for having the possibility to configure it for the whole cluster. Tuning per pod or per deployment is rarely useful: in most cases a sane global value is waaay more convenient, and the current default of 5m is waaay too long for many cases.

Please reopen this issue.

@richardqa

richardqa commented Mar 6, 2020

I am facing this same problem. Is there a way to disable taint-based evictions so that pod-eviction-timeout works globally?

@hrbasic

hrbasic commented Mar 30, 2020

I am facing this same problem. Is there a way to disable taint-based evictions so that pod-eviction-timeout works globally?

I think you can configure global pod eviction via the apiserver: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
I haven't tried this, but as far as I can see there are the options --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds.

@enricovittorini

enricovittorini commented Apr 13, 2020

Why has this bug been marked as closed? It looks like the original issue is not solved, only worked around.
It is not clear to me why the pod-eviction-timeout flag is not working.

@pythonzm

pythonzm commented Apr 15, 2020

same issue

@voarsh2

voarsh2 commented Apr 18, 2021

I use those toleration lines in my deployments, but as others say, a global/cluster-wide setting would be better.
How am I supposed to hit SLAs when it's 5 minutes?

@zhangguanzhang

zhangguanzhang commented Sep 18, 2021

I use those toleration lines in my deployments, but as others say, a global/cluster-wide setting would be better.
How am I supposed to hit SLAs when it's 5 minutes?

You may need to set this for kube-apiserver:

   --enable-admission-plugins=....,DefaultTolerationSeconds \
   --default-not-ready-toleration-seconds=60 \
   --default-unreachable-toleration-seconds=60 \
