
kubelet refuses to admit critical static pod #80203

Closed
yuchengwu opened this issue Jul 16, 2019 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.


@yuchengwu
Contributor

yuchengwu commented Jul 16, 2019

What happened:
In our cluster, we have configured our system pods as critical static pods by setting the system-node-critical priorityClassName. But these pods cannot get admitted by the kubelet when the node is under memory pressure.

What you expected to happen:
A critical static pod should always be scheduled and admitted, even when the node is under resource pressure.

How to reproduce it (as minimally and precisely as possible):

  1. Do something (e.g. use dd to exhaust the node's disk space, or create a pod with a large resource request to exhaust the node's memory/CPU) so the node comes under resource pressure
  2. Create a critical static pod and place its manifest into the kubelet's manifest path, which is specified by the --pod-manifest-path argument
  3. Check the kubelet log; we can then see events like
Jul 16 12:34:12 tdcdev04 kubelet.3[24520]: W0716 12:34:12.491413   24520 eviction_manager.go:146] Failed to admit pod k8s-proxy-tdcdev04_kube-system(fb03ac506cd98a40b7e21150cad8a7ce) - node has conditions: [MemoryPressure]
Jul 16 12:34:12 tdcdev04 kubelet.3[24520]: I0716 12:34:12.491525   24520 server.go:238] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"k8s-proxy-tdcdev04", UID:"fb03ac506cd98a40b7e21150cad8a7ce", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Evicted' The node was low on resource: [MemoryPressure].
Jul 16 12:34:12 tdcdev04 kubelet.3[24520]: I0716 12:34:12.494431   24520 round_trippers.go:436] GET https://127.0.0.1:16443/api/v1/namespaces/kube-system/pods/k8s-proxy-tdcdev04 404 Not Found in 2 milliseconds
Jul 16 12:34:12 tdcdev04 kubelet.3[24520]: I0716 12:34:12.494541   24520 status_manager.go:453] Pod "k8s-proxy-tdcdev04" (fb03ac506cd98a40b7e21150cad8a7ce) does not exist on the server

Anything else we need to know?:
A critical pod should be admitted by design even when the node is under pressure: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/eviction/eviction_manager.go#L138-L140

To understand why the kubelet rejected the pod, I looked into the IsCriticalPod func:

func IsCriticalPod(pod *v1.Pod) bool {
	if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) {
		return true
	}
	return false
}

Apparently the kubelet decides whether a pod is critical based on pod.Spec.Priority. I remember that pod.Spec.Priority is populated automatically when pod.Spec.PriorityClassName is specified, but when is it populated: before or after the kubelet admit phase? If it is the former, then everything makes sense; if it is the latter, then the issue may be caused by something else, e.g. an incorrect configuration.
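For reference, the priority-based helper that IsCriticalPod delegates to is just a threshold comparison, which means a pod whose Spec.Priority is still nil can never qualify. A paraphrased sketch (assuming the SystemCriticalPriority constant from pkg/apis/scheduling; not copied verbatim from the kubelet source):

package types

import "k8s.io/kubernetes/pkg/apis/scheduling"

// IsCriticalPodBasedOnPriority reports whether a resolved numeric priority meets
// the system-critical threshold (2000000000, i.e. system-cluster-critical and above).
func IsCriticalPodBasedOnPriority(priority int32) bool {
	return priority >= scheduling.SystemCriticalPriority
}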

To confirm that, I added a few lines to print the pod's Priority field, and found that Priority was in fact not filled in before admit. Surprising! Since our production K8s differs from master, I used minikube to start up a fresh cluster to see whether this had been fixed. The steps were:

  1. minikube start
wyc@smile:~$ minikube start
😄  minikube v1.2.0 on linux (amd64)
💿  Downloading Minikube ISO ...
 129.33 MB / 129.33 MB [============================================] 100.00% 0s
🔥  Creating virtualbox VM (CPUs=2, Memory=2048MB, Disk=20000MB) ...
🐳  Configuring environment for Kubernetes v1.15.0 on Docker 18.09.6
💾  Downloading kubeadm v1.15.0
💾  Downloading kubelet v1.15.0
🚜  Pulling images ...
🚀  Launching Kubernetes ...
⌛  Verifying: apiserver proxy etcd scheduler controller dns
🏄  Done! kubectl is now configured to use "minikube"
  2. Added the log lines 133-135 below to see whether Pod.Spec.Priority is populated before kubelet admit, and replaced the minikube-provided kubelet with the modified one.
133     klog.Infof(">>%s: %#v", format.Pod(attrs.Pod), attrs.Pod.Spec)
134     klog.Infof(">>%s: %#v", format.Pod(attrs.Pod), attrs.Pod.Spec.Priority)
135     klog.Infof(">>%s: %#v", format.Pod(attrs.Pod), attrs.Pod.Spec.PriorityClassName)
136     if len(m.nodeConditions) == 0 {
137         return lifecycle.PodAdmitResult{Admit: true}
138     }
139     // Admit Critical pods even under resource pressure since they are required for system stability.
140     // https://github.com/kubernetes/kubernetes/issues/40573 has more details.
141     if kubelettypes.IsCriticalPod(attrs.Pod) {
142         return lifecycle.PodAdmitResult{Admit: true}
143     }
  3. Check the kubelet log:
# list all static pod
$ ls /etc/kubernetes/manifests/
addon-manager.yaml.tmpl  etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml
# check events
$ journalctl -u kubelet | grep ">>" | grep -v 133
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.251735   16861 eviction_manager.go:134] >>kube-addon-manager-minikube_kube-system(ade72382abf2c550d837ad50ea27f284): (*int32)(nil)
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.251794   16861 eviction_manager.go:135] >>kube-addon-manager-minikube_kube-system(ade72382abf2c550d837ad50ea27f284): ""
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.255912   16861 eviction_manager.go:134] >>etcd-minikube_kube-system(15dbee5e6e7e06f663de0d61f4b3a740): (*int32)(nil)
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.255942   16861 eviction_manager.go:135] >>etcd-minikube_kube-system(15dbee5e6e7e06f663de0d61f4b3a740): "system-cluster-critical"
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.265247   16861 eviction_manager.go:134] >>kube-apiserver-minikube_kube-system(ce3a102d72e6ec1fe667393995fa92fa): (*int32)(nil)
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.265278   16861 eviction_manager.go:135] >>kube-apiserver-minikube_kube-system(ce3a102d72e6ec1fe667393995fa92fa): "system-cluster-critical"
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.268770   16861 eviction_manager.go:134] >>kube-controller-manager-minikube_kube-system(fc0455b20d15579e82a0434971644deb): (*int32)(nil)
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.268800   16861 eviction_manager.go:135] >>kube-controller-manager-minikube_kube-system(fc0455b20d15579e82a0434971644deb): "system-cluster-critical"
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.272653   16861 eviction_manager.go:134] >>kube-scheduler-minikube_kube-system(07583a6ba6655453f57d6f4a6babeb2b): (*int32)(nil)
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.272680   16861 eviction_manager.go:135] >>kube-scheduler-minikube_kube-system(07583a6ba6655453f57d6f4a6babeb2b): "system-cluster-critical"
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.276849   16861 eviction_manager.go:134] >>kube-proxy-vtzm6_kube-system(a48da953-aabf-476b-b5fa-9bb824687bf0): (*int32)(0xc0007e4198)
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.276876   16861 eviction_manager.go:135] >>kube-proxy-vtzm6_kube-system(a48da953-aabf-476b-b5fa-9bb824687bf0): "system-node-critical"
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.284813   16861 eviction_manager.go:134] >>coredns-5c98db65d4-5l7qn_kube-system(91f87fa2-c75b-4953-b474-fed587120dd0): (*int32)(0xc0007e41d8)
Jul 16 06:03:36 minikube kubelet[16861]: I0716 06:03:36.284842   16861 eviction_manager.go:135] >>coredns-5c98db65d4-5l7qn_kube-system(91f87fa2-c75b-4953-b474-fed587120dd0): "system-cluster-critical"

Obviously, the kubelet failed to admit the critical pod because the static pod's priority value is not set at the right point, i.e. before the kubelet admit phase.

Environment:

  • Minikube version:
wyc@smile:~$ minikube version
minikube version: v1.2.0
  • Kubernetes version (use kubectl version):
wyc@smile:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:39:04Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
$ cat /etc/os-release 
NAME=Buildroot
VERSION=2018.05.3
ID=buildroot
VERSION_ID=2018.05.3
PRETTY_NAME="Buildroot 2018.05.3"
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@yuchengwu yuchengwu added the kind/bug Categorizes issue or PR as related to a bug. label Jul 16, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 16, 2019
@yuchengwu
Contributor Author

yuchengwu commented Jul 16, 2019

Workaround:
For K8s versions <= 1.15, this can be worked around by setting the scheduler.alpha.kubernetes.io/critical-pod: "" annotation and enabling the ExperimentalCriticalPodAnnotation=true feature gate on the kubelet.
However, for versions > 1.15 this feature has been removed by #79554.
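For context, a paraphrased sketch of the pre-1.16 annotation path this workaround relies on (simplified: the real check also consults the ExperimentalCriticalPodAnnotation feature gate, and the helper name is only indicative):

package types

// CriticalPodAnnotationKey is the legacy annotation the workaround sets.
const CriticalPodAnnotationKey = "scheduler.alpha.kubernetes.io/critical-pod"

// IsCritical treats a kube-system pod carrying the critical-pod annotation as
// critical even when Spec.Priority has not been resolved yet.
func IsCritical(ns string, annotations map[string]string) bool {
	if ns != "kube-system" {
		return false
	}
	_, ok := annotations[CriticalPodAnnotationKey]
	return ok
}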

@yuchengwu
Contributor Author

/sig node
/assign @bsalamat

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 16, 2019
@bsalamat
Member

PodSpec.Priority is populated at admission time. It is populated by Priority admission plugin. If your cluster bring-up scripts do not enable this plugin, pod priority will not be populated.
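To illustrate the mechanism, a simplified sketch of what the Priority admission plugin does when a pod object is created on the API server (not the actual plugin code: the map stands in for its PriorityClass lister, and default-class handling is omitted):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
)

// resolvePriority copies the numeric value of the named PriorityClass into
// pod.Spec.Priority, the field the kubelet's IsCriticalPod later inspects.
func resolvePriority(pod *v1.Pod, classes map[string]*schedulingv1.PriorityClass) error {
	pc, ok := classes[pod.Spec.PriorityClassName]
	if !ok {
		return fmt.Errorf("no PriorityClass named %q", pod.Spec.PriorityClassName)
	}
	pod.Spec.Priority = &pc.Value
	return nil
}

func main() {
	classes := map[string]*schedulingv1.PriorityClass{
		"system-node-critical": {Value: 2000001000},
	}
	pod := &v1.Pod{Spec: v1.PodSpec{PriorityClassName: "system-node-critical"}}
	if err := resolvePriority(pod, classes); err != nil {
		panic(err)
	}
	fmt.Println(*pod.Spec.Priority) // 2000001000
}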

@yuchengwu
Contributor Author

PodSpec.Priority is populated at admission time. It is populated by Priority admission plugin. If your cluster bring-up scripts do not enable this plugin, pod priority will not be populated.

Thanks for the reminder. I checked that PodPriority=true is enabled for the kubelet, controller manager, apiserver, and scheduler:

[root@tdcdev01 _output]# ps aux| grep PodPriority=true
root      5289 10.6  0.1 2436864 189800 ?      Ssl  Jul16 123:01 /opt/kubernetes/bin/kubelet --logtostderr=true --v=6 --address=0.0.0.0 --cluster_dns=10.10.0.10 --cluster_domain=transwarp.local --port=10250 --hostname_override=tdcdev01 --log_dir=/var/log/kubernetes --pod-manifest-path=/opt/kubernetes/manifests-multi --pod-infra-container-image=google_containers/pause:3.0 --network-plugin=cni --resolv-conf=/dev/null --max-pods=200 --volume-stats-agg-period=0 --node-labels=master=true,worker=true --fail-swap-on=false --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice --eviction-hard=memory.available<10000Mi --system-reserved=memory=20000Mi --node-ip=172.16.3.231 --kubeconfig=/srv/kubernetes/kubeconfig --feature-gates=BlockVolume=true,MountPropagation=true,PersistentLocalVolumes=true,VolumeScheduling=true,LocalStorageCapacityIsolation=true,PodPriority=true,DevicePlugins=true,ExperimentalCriticalPodAnnotation=true --kube-api-qps=100 --client-ca-file=/srv/kubernetes/ca.pem --authorization-mode=Webhook --authentication-token-webhook=true --allow_privileged=true
root      9529 33.8  4.7 9908604 6276652 ?     Ssl  Jul15 879:21 controller-manager --feature-gates=RotateKubeletServerCertificate=true,SupportIPVSProxyMode=true,PodPriority=true,ExpandPersistentVolumes=true,VolumeScheduling=true,BlockVolume=true --v=4 --kubeconfig=/srv/kubernetes/kubeconfig --leader-elect --root-ca-file=/srv/kubernetes/ca.pem --service-account-private-key-file=/srv/kubernetes/service-account-key.pem --cluster-signing-cert-file=/srv/kubernetes/ca.pem --cluster-signing-key-file=/srv/kubernetes/ca-key.pem --controllers=*,tokencleaner,bootstrapsigner --logtostderr=true --log_dir=/var/log/kubernetes/ --allocate-node-cidrs=true --cluster-cidr=10.11.0.0/16 --terminated-pod-gc-threshold=100
root     11414  0.0  0.0 112712  1000 pts/2    S+   10:26   0:00 grep --color=auto PodPriority=true
root     22368 23.1  1.5 6174132 2044812 ?     Ssl  Jun24 7547:57 apiserver --v=6 --etcd_servers=https://tdcdev02:2379,https://tdcdev01:2379,https://tdcdev03:2379 --etcd-cafile=/etc/kubernetes/ssl/etcd-ca.pem --etcd-certfile=/etc/kubernetes/ssl/etcd.pem --etcd-keyfile=/etc/kubernetes/ssl/etcd-key.pem --apiserver-count=3 --insecure-bind-address=127.0.0.1 --insecure-port=8080 --bind-address=172.16.3.231 --secure-port=15443 --runtime_config=api/all --runtime_config=admissionregistration.k8s.io/v1alpha1 --runtime_config=settings.k8s.io/v1alpha1 --runtime-config=scheduling.k8s.io/v1alpha1=true --kubelet-https=true --admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,ResourceQuota,ServiceAccount,Initializers,PodPreset,Priority,PersistentVolumeClaimResize --authorization-mode=Node,RBAC --token-auth-file=/etc/kubernetes/ssl/token.csv --service-account-key-file=/etc/kubernetes/ssl/service-account.pem --service-cluster-ip-range=10.10.0.0/16 --client_ca_file=/etc/kubernetes/ssl/ca.pem --tls_cert_file=/etc/kubernetes/ssl/kubernetes.pem --tls_private_key_file=/etc/kubernetes/ssl/kubernetes-key.pem --requestheader-client-ca-file=/etc/kubernetes/ssl/ca.pem --requestheader-allowed-names=aggregator,kubernetes --requestheader-extra-headers-prefix="X-Remote-Extra-" --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --proxy-client-cert-file=/etc/kubernetes/ssl/metrics-server.pem --proxy-client-key-file=/etc/kubernetes/ssl/metrics-server-key.pem --enable-swagger-ui=true --allow_privileged=true --feature-gates=AdvancedAuditing=true,MountPropagation=true,PersistentLocalVolumes=true,SupportIPVSProxyMode=true,BlockVolume=true,PodPriority=true,ExpandPersistentVolumes=true,VolumeScheduling=true --audit-policy-file=/etc/kubernetes/ssl/audit-policy.yaml --audit-log-maxage=30 --audit-log-maxbackup=3 --audit-log-maxsize=100 --audit-log-path=/var/log/kubernetes/audit.log --kubelet-client-certificate=/etc/kubernetes/ssl/admin.pem --kubelet-client-key=/etc/kubernetes/ssl/admin-key.pem --event-ttl=12h --licence-cafile=/etc/kubernetes/ssl/ca.pem --licence-certfile=/etc/kubernetes/ssl/admin.pem --licence-keyfile=/etc/kubernetes/ssl/admin-key.pem --licence-server=127.0.0.1:60907
root     24679 76.5  0.1 2315504 190648 ?      Ssl  Jun24 24910:21 scheduler --feature-gates=PodPriority=true,VolumeScheduling=true --v=4 --kubeconfig=/srv/kubernetes/kubeconfig --leader-elect --profiling --logtostderr=true --log_dir=/var/log/kubernetes/

We can also see that PodSpec.Priority is populated after the pod has been successfully brought up. But as I mentioned above:

But when it was populated? before or after kubelet admit phase, ...

It seems the creation steps for a static pod are:

  1. The kubelet notices that a static pod manifest has been placed in the manifest path, then
  2. starts to handle the pod addition and checks whether it can be admitted by calling the registered Admit handlers; if the answer is yes it enters step 3, otherwise it rejects the pod.
  3. The kubelet syncs the pod and creates a mirror pod (assuming the pod does not already have one), sending the pod config to the apiserver; only then is its Priority populated by the Priority admission plugin.

So if the node is under resource pressure, a static pod fails at step 2 and never gets a chance to have its Priority value populated.

But for non-static pods everything works fine, because step 3 effectively happens before step 2 (a rough sketch of this ordering follows).
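A rough, hypothetical sketch of that ordering (the function and parameter names are illustrative, not the kubelet's actual code):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// admitFunc stands in for the kubelet's registered admit handlers
// (the eviction manager's Admit among them).
type admitFunc func(pod *v1.Pod) (ok bool, reason string)

// handleStaticPodAddition mirrors the ordering described above: the admit
// handlers run at step 2, before the mirror pod ever reaches the apiserver at
// step 3, so pod.Spec.Priority is still nil when the decision is made.
func handleStaticPodAddition(pod *v1.Pod, admit admitFunc, syncPod, createMirrorPod func(*v1.Pod)) {
	if ok, reason := admit(pod); !ok {
		fmt.Printf("rejecting %s: %s\n", pod.Name, reason) // Priority was never populated
		return
	}
	// Only an admitted static pod is synced and mirrored to the apiserver, which is
	// where the Priority admission plugin would finally fill in Spec.Priority.
	syncPod(pod)
	createMirrorPod(pod)
}

func main() {
	pod := &v1.Pod{}
	pod.Name = "k8s-proxy-tdcdev04"
	underPressure := func(p *v1.Pod) (bool, string) {
		if p.Spec.Priority == nil { // looks non-critical to the kubelet at this point
			return false, "node has conditions: [MemoryPressure]"
		}
		return true, ""
	}
	handleStaticPodAddition(pod, underPressure, func(*v1.Pod) {}, func(*v1.Pod) {})
}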

@bsalamat
Member

Priority is populated when a pod object is created, but as you noted, the problem is that a pod object (mirror pod) for a static pod is created after the actual pod is created on the kubelet.

One option to solve this problem is to add logic to Kubelet to resolve PriorityClass for static pods.

@dchen1107 Who would be able to help with this issue?

@dchen1107
Member

The issue is a regression caused by #79554. In this case, we should revert the PR until kube-proxy has properly moved to a DaemonSet.

@bsalamat
Member

I chatted with @dchen1107 offline. She and SIG node will discuss the solution to add code to Kubelet to resolve PriorityClassName to the numeric value of priority for static pods.

In the meantime, we are reverting #79554 until this issue is resolved.

@tedyu
Contributor

tedyu commented Jul 18, 2019

I was thinking about the following change to CreateMirrorPod of pkg/kubelet/pod/mirror_client.go

@@ -71,6 +72,9 @@ func (mc *basicMirrorClient) CreateMirrorPod(pod *v1.Pod) error {
                        return nil
                }
        }
+       if err == nil && IsStaticPod(pod) {
+               apiPod.Spec.PriorityClassName = scheduling.SystemClusterCritical
+       }
        return err
 }

@yuchengwu
Contributor Author

yuchengwu commented Jul 18, 2019

I was thinking about the following change to CreateMirrorPod of pkg/kubelet/pod/mirror_client.go

@@ -71,6 +72,9 @@ func (mc *basicMirrorClient) CreateMirrorPod(pod *v1.Pod) error {
                        return nil
                }
        }
+       if err == nil && IsStaticPod(pod) {
+               apiPod.Spec.PriorityClassName = string(*pod.Spec.Priority)
+       }
        return err
 }

Hi @tedyu, thanks for hacking on that, but I don't think this change can resolve the issue. As stated in my previous comment #80203 (comment), the priority should be properly assigned before the kubelet admit phase, i.e. before calling the Admit handlers at step 2, whereas your change happens at step 3. Besides, the critical-pod judgement is based on priority, not the priority class name.

One more question: why do you assign the Priority value to PriorityClassName? They are two different things: the priority value is a number that indicates the pod's priority, while PriorityClassName is the name of the priority class we choose in order to populate that priority value. To avoid confusion and keep the pod spec API consistent, we'd better not mix them.

Furthermore, @bsalamat, I personally believe a better solution to this issue is not only to resolve PriorityClassName to the numeric priority value, but also to take care of the other fields that are populated by admission plugins or filled in by defaulting funcs when a pod is created (which may allow us to avoid potential future regressions). That is not easy to achieve, though, since mirror pods need to be treated differently in multiple places in the kubelet, which makes them brittle.

@tedyu
Contributor

tedyu commented Jul 18, 2019

Thanks for the feedback.
I modified the priority assignment after posting the previous comment.

I restored it to the first form.

I will keep what you said in mind.

@dchen1107
Member

We are going to discuss this issue at today's SIG Node meeting. @yuchengwu can you join us today?

@dchen1107
Member

Static pods & mirror pods are error-prone, and SIG Node has long wanted to deprecate the feature or limit its usage to self-hosted Kubernetes. Meanwhile, we understand that many K8s production deployments already depended on static pods for some critical pods before DaemonSets existed, and for other practical reasons.

Assuming all static pods are critical and should be admitted and scheduled by the kubelet even when the node is under resource pressure, I proposed to grant static pods the critical priority class and the highest priority number.

Again, we discourage random users from using static pods; if you have to use static pods, please be careful with the resource capacity planning on the node.

We discussed this at today's SIG Node meeting; @bsalamat attended too. I proposed the above, and the community agreed.

@bsalamat
Member

Thanks, @dchen1107! As Dawn wrote above, the plan is to consider all static pods critical, regardless of their priority classes. After this change is made to the kubelet, we can merge #80342 again to remove the critical pod annotation.
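For concreteness, a minimal sketch of what "consider all static pods critical" could look like in the kubelet's check, keyed off the kubelet's config-source annotation; this is an illustration of the agreed direction, not necessarily the exact change that eventually landed:

package types

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/apis/scheduling"
)

const (
	ConfigSourceAnnotationKey = "kubernetes.io/config.source"
	ApiserverSource           = "api"
)

// IsStaticPod: static pods come from a file or HTTP source, never the apiserver.
func IsStaticPod(pod *v1.Pod) bool {
	source, ok := pod.Annotations[ConfigSourceAnnotationKey]
	return ok && source != ApiserverSource
}

// IsCriticalPod treats every static pod as critical, regardless of whether
// Spec.Priority has been resolved yet.
func IsCriticalPod(pod *v1.Pod) bool {
	if IsStaticPod(pod) {
		return true
	}
	if pod.Spec.Priority != nil && *pod.Spec.Priority >= scheduling.SystemCriticalPriority {
		return true
	}
	return false
}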

@yuchengwu
Contributor Author

Thanks @dchen1107. I'd have been happy to join the meeting, though it was already over.

Meanwhile, we understand that many K8s production deployments already depended on static pods for some critical pods before DaemonSets existed, and for other practical reasons.

Well, this is our situation.

Assuming all static pods are critical and should be admitted and scheduled by the kubelet even when the node is under resource pressure, I proposed to grant static pods the critical priority class and the highest priority number.

I am fine with this change. One question: does that mean a static pod's .spec.priorityClassName will effectively always be system-node-critical? One concern is that, given a static pod is then always treated as a critical pod and a critical static pod will not be evicted by the kubelet, how do we handle a memory leak or other bug in a static pod that would eventually cause node failure?

@dixudx
Member

dixudx commented Aug 9, 2019

dup of #80489
and this has been fixed in #80491

close it now. please reopen if needed.

/close

@k8s-ci-robot
Contributor

@dixudx: Closing this issue.

In response to this:

dup of #80489
and this has been fixed in #80491

close it now. please reopen if needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
